2023 October 15
Given the complexity and interactions among all of the moving pieces of this site, this appears to be the best plan to complete the migration.

 1. Place the production site in read-only mode.
 2. Make a complete Discourse backup of the production site. Copy it off the production machine.
 3. Make an AWS AMI backup of the production machine, with reboot.
 3a. Make a juno:~/Scanalyst_Backup of the production machine.
 4. Shut down the production machine.
 5. Disassociate the Elastic IP address for scanalyst.fourmilab.ch from the production site.
 6. Attach the Elastic IP address to the new site. Reboot and verify it is accessible through scanalyst.fourmilab.ch.
 7. Do a full installation of Discourse on the new site. This should install Let's Encrypt and either verify or re-issue the certificate for the site. This is launched by:
        super
        cd ~/discourse/image
        ./discourse-setup
    This should use the containers/app.yml file currently in use by the production site.
 8. Enable restoration of backups on the new site.
 9. Restore the backup made before shutting down the old site.
 10. Disable restoration of backups on the new site.
 11. Verify access to content restored from the backup.
 12. Verify Let's Encrypt certificate status. You can see the log with:
        cd ~/discourse/image
        ./launcher logs app | grep -i letsencrypt
    Keys and certificates are placed in:
        /var/discourse/shared/standalone/ssl
 13. Verify the ability to send E-mail via Amazon SES.

If this fails, we may be able to fall back on restoring the /var/discourse directory tree from the backup and rebuilding the Discourse app. Doing this alone will probably not set up everything needed by Let's Encrypt and the other items installed along with the Docker container.

To prepare for this cut-over procedure, I performed the following (hopefully!) non-destructive experiments on the production site to see if and how they work.

Tested putting the site into read-only mode with Admin/Backups and then pressing "Enable read-only".
After confirming, a blue banner appears at the top of all pages saying:

    This site is in read only mode. Please continue to browse, but replying, likes, and other actions are disabled for now.

When in this mode, the user can open a Reply or New Topic, but when trying to post it, a pop-up appears saying the site is in read-only mode and rejecting the operation. There's a bug here: after this happens, trying to Delete the failed comment or post does nothing; you can only delete it after read-only mode is turned off. Read-only mode is disabled by pressing the "Disable read-only" button that appears on the Backups page.

Tested performing a manual backup with the Backup button and selecting "Include all uploads". This skips to the Logs page, which shows the progress of the backup. The backup takes about four minutes and sends a message to the administrator when it is complete. When the backup is done, pressing the Download button sends an E-mail to the administrator with a download link like:

    https://scanalyst.fourmilab.ch/admin/backups/scanalyst-2023-10-15-132118-v20230926165821.tar.gz?token=REDACTED

Clicking the link starts a browser download of the backup file. This 2.4 Gb Gzipped file took around five minutes to download. The contents of the file look like:

    -rw-r--r-- discourse/www-data 226750451 2023-10-15 15:21 dump.sql.gz
    drwxr-xr-x discourse/www-data         0 2021-10-03 01:35 uploads/default/
    drwxr-xr-x discourse/www-data         0 2022-01-31 23:44 uploads/default/optimized/
    drwxr-xr-x discourse/www-data         0 2023-09-25 14:18 uploads/default/optimized/1X/
    -rw-r--r-- discourse/www-data     34305 2021-10-03 01:31 uploads/default/optimized/1X/_129430568242d1b7f853bb13ebea28b3f6af4e7_2_512x512.png
    -rw-r--r-- discourse/www-data      1810 2021-10-03 01:31 uploads/default/optimized/1X/_129430568242d1b7f853bb13ebea28b3f6af4e7_2_32x32.png

The dump.sql.gz file is a sequence of PostgreSQL commands to recreate the database structure and all of its contents.
    --
    -- PostgreSQL database dump
    --

    -- Dumped from database version 13.12 (Debian 13.12-1.pgdg110+1)
    -- Dumped by pg_dump version 13.12 (Debian 13.12-1.pgdg110+1)

    -- Started on 2023-10-15 13:21:19 UTC

    SET statement_timeout = 0;
    SET lock_timeout = 0;
    SET idle_in_transaction_session_timeout = 0;
    SET client_encoding = 'UTF8';
    SET standard_conforming_strings = on;
    SELECT pg_catalog.set_config('search_path', '', false);
    SET check_function_bodies = false;
    SET xmloption = content;
    SET client_min_messages = warning;
    SET row_security = off;

    --
    -- TOC entry 8 (class 2615 OID 2200)
    -- Name: public; Type: SCHEMA; Schema: -; Owner: -
    --

    CREATE SCHEMA public;

    --
    -- TOC entry 6902 (class 0 OID 0)
    -- Dependencies: 8
    -- Name: SCHEMA public; Type: COMMENT; Schema: -; Owner: -
    --

    COMMENT ON SCHEMA public IS 'standard public schema';

It is a total of 2,262,052 lines of SQL.

Tested enabling restoration of backups with:

    Admin/Settings/Backups: allow restore

This enables the Restore buttons in the Admin/Settings/Backups page. Unchecking it disables the Restore buttons again.

At 19:00 UTC, I set the site to read-only mode and started a manual backup of the database and uploads. The backup finished at 19:05. At 19:13 I started the download of backup file scanalyst-2023-10-15-190116-v20230926165821.tar.gz to Hayek. The download completed at 19:18; the file size is 2,628,856,601 bytes (2.5 Gb).

Made a snapshot backup to Juno. Started at 19:20, completed at 19:45 UTC.

Made a backup AMI of L2023:

    Scanalyst Backup 2023-10-15
    ami-0538ccd8c4c7b2f06 / snap-06510cb44818ae525
    /server snap-0adfafe587d78673d

Made a backup AMI of production Scanalyst:

    Scanalyst Production Backup 2023-10-15
    ami-067358c923bee6eb2 / snap-0ca666501de3468ee
    /server snap-0a9f753aec8bcefdf

Backup AMI creation was complete at 20:11.

Shut down the production machine with:

    ssh sc
    super
    cd ~/discourse/image
    ./launcher stop app
    cd
    shutdown -h now

At 20:19 the AWS Instances panel showed the Scanalyst instance stopped.
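The backup AMIs above were made by clicking through the EC2 console, but the same step could be scripted with the AWS CLI. A minimal sketch, assuming configured CLI credentials; the helper only builds the command string so it can be inspected (and tested) without touching AWS, and the instance ID is the production one recorded in this log.

```shell
# Sketch: script the console "create AMI with reboot" backup step.
# build_ami_cmd only constructs the command line; pipe its output to
# sh to actually run it (requires AWS CLI credentials).
build_ami_cmd() {
    # $1 = instance ID, $2 = AMI name
    echo "aws ec2 create-image --instance-id $1 --name '$2' --reboot"
}

build_ami_cmd i-0756b0658cac51c9d 'Scanalyst Production Backup 2023-10-15'
# To execute for real:  build_ami_cmd ... | sh
```

Building the string instead of running it directly makes the step reviewable, which seems prudent for an operation that reboots the production machine.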
In the AWS EC2 console, selected the following Elastic IP address:

    Name: Scanalyst
    Allocated IPv4: 18.195.73.61
    Allocation ID: eipalloc-80070b8d
    Associated Instance: i-0756b0658cac51c9d
    Private IP address: 172.31.22.248

and performed Actions/Disassociate Elastic IP address.

Selected Elastic IP address Scanalyst and Actions/Associate Elastic IP address:

    Name: Scanalyst
    Resource type: Instance
    Instance: i-092217f3135c36549 (Scanalyst L2023)
    Private IP address: 172.31.19.123
    Reassociation: No
    Association ID: eipassoc-077900921ae1436c7

The Instances item for Scanalyst L2023 now shows the Elastic IP address as its public IPv4 address. I can SSH to it after clearing the changed host key warning.

At 20:28 began the process of Discourse installation:

    super
    cd ~/discourse/image
    ./discourse-setup

It said it was using my app.yml, which I accepted. Downloaded components of the base system. Accepted site configuration parameters from app.yml. At 22:37 the installation failed:

    FAILED
    --------------------
    Pups::ExecError: cd /var/www/discourse && su postgres -c 'psql discourse -c "create extension if not exists embedding;"' failed with return #
    Location of failure: /usr/local/lib/ruby/gems/3.2.0/gems/pups-1.1.1/lib/pups/exec_command.rb:117:in `spawn'
    exec failed with the params {"cd"=>"$home", "cmd"=>["su postgres -c 'psql discourse -c \"create extension if not exists embedding;\"'"]}
    bootstrap failed with exit code 1
    ** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
    ./discourse-doctor may help diagnose the problem.
    b53141feb640d2203450b96a453a472cab0455b529c9675d45dce0f971447786

Well, this error message gives me no clue at all what or where the problem may be. Let's try this ./discourse-doctor thing, which has never helped before, but you never know.
Failed again with:

    Pups::ExecError: cd /var/www/discourse && su postgres -c 'psql discourse -c "create extension if not exists embedding;"' failed with return #
    Location of failure: /usr/local/lib/ruby/gems/3.2.0/gems/pups-1.1.1/lib/pups/exec_command.rb:117:in `spawn'
    exec failed with the params {"cd"=>"$home", "cmd"=>["su postgres -c 'psql discourse -c \"create extension if not exists embedding;\"'"]}
    bootstrap failed with exit code 1
    ** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
    ./discourse-doctor may help diagnose the problem.

All right, let's try commenting out all of the local plug-ins and running discourse-setup with a completely stock system. Edited ~/discourse/image/containers/app.yml to comment out all local plug-in code and the plug-ins themselves.

    cd ~/discourse/image
    ./discourse-setup

Oops! I accidentally disabled the built-in (and presumably required) plug-in docker_manager. Killed the build, added it back to app.yml, and started a new build.

    cd ~/discourse/image
    ./discourse-setup

This time the build ran to completion and executed the /usr/bin/docker run command at 21:22. Ruby runs CPU-bound in several processes for a while, then settles down to around 1% of CPU and 9% of memory.

CAZART!!! We got the "Congratulations, you installed Discourse!" page.

Registered the administration account:

    Email: REDACTED@fourmilab.ch (won't let me change this)
    Username: Fourmilab
    Password: REDACTED

It said it sent an activation mail, which arrived OK at the alias address for REDACTED@fourmilab.ch. Copied the link into the Chrome browser. It said the account was activated and prompted for:

    About your site: I filled in placeholder information
    Member experience: Require approval, Enable chat, Disable sidebar

It then allows me to "Configure more" or "Jump in!". Since we're going to replace the entire database with a restore of the production database, I jumped in.
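For reference, the part of app.yml being edited above is the hooks section where plug-ins are cloned into the container. The fragment below is a sketch of what that section looks like with everything except the required docker_manager commented out, not a copy of the site's actual file; the chatbot repository URL in particular is a placeholder.

```yaml
## Sketch of the app.yml plug-in hooks (not the actual production file):
## all local plug-ins commented out except the required docker_manager.
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          - git clone https://github.com/discourse/docker_manager.git
          #- git clone https://github.com/discourse/discourse-spoiler-alert.git
          #- git clone https://github.com/discourse/discourse-math.git
          #- git clone <chatbot plug-in repository URL>
```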
Now we're at the home page of a generic empty site, logged in with an administrator account. Let's Encrypt apparently found the certificate it issued earlier from the entry in the DNS for the site address and re-issued it when discourse-setup queried it.

Uploaded the ~/scanalyst-2023-10-15-190116-v20230926165821.tar.gz backup file from Hayek to the new install with Admin/Backups/Upload. The upload fails consistently at the 20% mark with:

    Sorry, there was an error uploading scanalyst-2023-10-15-190116-v20230926165821.tar.gz. Please try again.

When this happens, Discourse recommends restoring the backup from the command line:

    https://meta.discourse.org/t/restore-a-backup-from-the-command-line/108034

On Hayek, after re-introducing root to the new host key, I copied the file over:

    scp -p ~/scanalyst-2023-10-15-190116-v20230926165821.tar.gz sc:/var/discourse/shared/standalone/backups/default

After the copy is complete, the backup shows up in the Admin/Backups tab. I opted to complete the restore using the Docker command line procedure.

    super
    cd ~/discourse/image
    ./launcher enter app
    discourse enable_restore
        Restores are now permitted. Disable them with `disable_restore`
    discourse restore scanalyst-2023-10-15-190116-v20230926165821.tar.gz

    Starting restore: scanalyst-2023-10-15-190116-v20230926165821.tar.gz
    [STARTED]
    'system' has started the restore!
    Marking restore as running...
    Making sure /var/www/discourse/tmp/restores/default/2023-10-15-220557 exists...
    Copying archive to tmp directory...
    Unzipping archive, this may take a while...
    Extracting dump file...
    Validating metadata...
      Current version: 20231004020328
      Restored version: 20230926165821
    Enabling readonly mode...
    Pausing sidekiq...
    Waiting up to 60 seconds for Sidekiq to finish running jobs...
    Creating missing functions in the discourse_functions schema...
    Restoring dump file...
    (this may take a while)
    SET
    SET
    SET
    SET
    SET
     set_config
    ------------

    (1 row)

    [Many more messages]

    CREATE INDEX
    ERROR: unrecognized parameter "dims"
    EXCEPTION: psql failed: ERROR: unrecognized parameter "dims"
    /var/www/discourse/lib/backup_restore/database_restorer.rb:92:in `restore_dump'
    /var/www/discourse/lib/backup_restore/database_restorer.rb:26:in `restore'
    /var/www/discourse/lib/backup_restore/restorer.rb:51:in `run'
    script/discourse:149:in `restore'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor/command.rb:27:in `run'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor/invocation.rb:127:in `invoke_command'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor.rb:392:in `dispatch'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor/base.rb:485:in `start'
    script/discourse:290:in `'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli/exec.rb:58:in `load'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli/exec.rb:58:in `kernel_load'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli/exec.rb:23:in `run'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli.rb:492:in `exec'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli.rb:34:in `dispatch'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli.rb:28:in `start'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/exe/bundle:45:in `block in '
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/exe/bundle:33:in `'
    /usr/local/bin/bundle:25:in `load'
    /usr/local/bin/bundle:25:in `'
    Trying to rollback...
    Rolling back...
    Cleaning stuff up...
    Dropping functions from the discourse_functions schema...
    Removing tmp '/var/www/discourse/tmp/restores/default/2023-10-15-220557' directory...
    Unpausing sidekiq...
    Marking restore as finished...
    Notifying 'system' of the end of the restore...
    Finished!
    [FAILED]
    Restore done.

Blooie. My guess is that the restore is trying to create plug-in information for plug-ins we have disabled. Let's try turning them back on. First, start with the easy, standard ones, discourse-spoiler-alert and discourse-math.

    ./launcher rebuild app

All right, that rebuilt with no errors. The Admin/Plugins page shows both new plugins installed. Tested both plug-ins and they work.

Now let's try bringing Shalmaneser back. Enabled all of the commented-out preliminary code in app.yml and the plug-in.

    ./launcher rebuild app

Failed on the command:

    su postgres -c 'psql discourse -c "create extension if not exists embedding;"'

which is the last command in the additions required to support chatbot. I'm pretty sure now that what's happening is that the SQL restore file contains references to tables created by chatbot which haven't been created. The question is whether we can get past this without having to go back to the old site, patch the database to remove them, ditch chatbot, export the database, and start all over again.

Rebuilt the site once again without chatbot. Now we have a platform on which to load a backup without the chatbot entries which cause it to fail. Spent (wasted?) an hour and a half manually deleting items in the PostgreSQL dump which looked like they belonged to chatbot.
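Rather than hand-editing a 2.2-million-line dump, candidate chatbot statements could at least be located mechanically first. A sketch on a toy dump; "chatbot_embeddings" is a made-up table name, not necessarily what the plug-in really creates, and since COPY data blocks span many lines without repeating the table name on each row, the hits still need manual review.

```shell
# Sketch: locate lines mentioning the chatbot in a PostgreSQL dump
# before deleting anything. Toy three-line dump; the table name
# "chatbot_embeddings" is a hypothetical example.
dump=$(mktemp)
cat > "$dump" <<'EOF'
CREATE TABLE public.posts (id integer);
CREATE TABLE public.chatbot_embeddings (id integer, embedding text);
CREATE INDEX idx_posts ON public.posts (id);
EOF

grep -ni 'chatbot' "$dump"                     # list suspect lines for review
grep -vi 'chatbot' "$dump" > "$dump.patched"   # drop them (review first!)
```

On the real dump the same grep would flag CREATE TABLE, CREATE INDEX, and COPY header lines, but the multi-line COPY data bodies would still have to be removed by hand, which is exactly where the fat fingers below crept in.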
Tarred up the compressed patched dump with the uploads from the original backup and attempted to restore it from the command line. Verdict: wasted. Failed with:

    ERROR: extra data after last expected column
    CONTEXT: COPY category_tag_stats, line 406: "5033 16 4076 1 4345 7 3559 1"
    EXCEPTION: psql failed: CONTEXT: COPY category_tag_stats, line 406: "5033 16 4076 1 4345 7 3559 1"

And now we've run past the 2023-10-16 00:00 drop-dead time I gave as the end of the maintenance window. But the last collapse looks like a fat finger whilst editing the SQL dump, so let's fix it and have another go before giving up and reversing everything back to the status quo ante.

Blooie:

    ERROR: extra data after last expected column
    CONTEXT: COPY category_tag_stats, line 1406: "4950 7 4020 1 3587 6 2958 1"
    EXCEPTION: psql failed: CONTEXT: COPY category_tag_stats, line 1406: "4950 7 4020 1 3587 6 2958 1"

Another fat finger. Let's try fixing this one.

This time we got to "Migrating the database" and "Reconnecting to the database". Then:

    Migrating the database...
    Compiled theme-transpiler: tmp/theme-transpiler.js
    == 20230807033021 AddGroupToWebHookEventType: migrating =======================
    -- add_column(:web_hook_event_types, :group, :integer)
       -> 0.0008s
    == 20230807033021 AddGroupToWebHookEventType: migrated (0.0018s) ==============
    == 20230807040058 MoveWebHooksToNewEventIds: migrating ========================
    -- execute("INSERT INTO web_hook_event_types_hooks(web_hook_event_type_id, web_hook_id)\nSELECT 101,
    == 20230807040058 MoveWebHooksToNewEventIds: migrated (0.0026s) ===============
    == 20231004020328 MigrateLegacyNavigationMenuSiteSetting: migrating ===========
    -- execute("UPDATE site_settings SET value = 'header dropdown' WHERE name = 'navigation_menu' AND value = 'legacy'")
       -> 0.0005s
    == 20231004020328 MigrateLegacyNavigationMenuSiteSetting: migrated (0.0009s) ==
    Reconnecting to the database...
    Reloading site settings...
    Disabling outgoing emails for non-staff users...
    Disabling readonly mode...
    Clearing category cache...
    Reloading translations...
    Remapping uploads...
    Restoring uploads, this may take a while...
    Posts will be rebaked by a background job in sidekiq. You will see missing images until that has completed.
    You can expedite the process by manually running "rake posts:rebake_uncooked_posts"
    Clearing emoji cache...
    Clear theme cache
    Executing the after_restore_hook...
    Cleaning stuff up...
    Dropping functions from the discourse_functions schema...
    Removing tmp '/var/www/discourse/tmp/restores/default/2023-10-16-001735' directory...
    Unpausing sidekiq...
    Marking restore as finished...
    Notifying 'system' of the end of the restore...
    Finished!
    [SUCCESS]
    Restore done.

I was logged out. At this point, I could log back in to my user account and, at first glance, everything looked OK. I was able to log in to the administrator account. All logins (at least as administrator) show a banner at the top:

    Outgoing email has been disabled for non-staff users.

Not sure what that's about. Sent an E-mail to myself from the Fourmilab account. It was sent and received correctly. The Plugins page shows docker_manager, discourse-spoiler-alert, and discourse-math installed and enabled. I tested the latter two and they are working.

As of 00:19 UTC on 2023-10-16, the site is back up and accessible through the usual scanalyst.fourmilab.ch URL. That's 20 minutes after my "no later than" time, but I judged that better than starting all over back at the old site.

The "Outgoing email has been disabled" banner after migration to a new server is discussed in:

    https://meta.discourse.org/t/migration-to-new-server-outgoing-email-has-been-disabled-for-non-staff-users/242123

To fix this, go to Admin/Settings and search for "disable emails". Then change the setting from "non-staff" to "no". The banner goes away.

Memo to file: there is a tremendous amount of cruft in:

    /var/discourse/shared/standalone/backups/default

resulting from today's misadventures.
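A quick way to survey that cruft by age and size before deleting anything, sketched here on a scratch directory; for real use, point dir at the backups directory named above.

```shell
# Sketch: survey backup cruft before cleaning it up. Demonstrated on
# a scratch directory with two dummy archives; for the real thing set
#   dir=/var/discourse/shared/standalone/backups/default
dir=$(mktemp -d)
touch "$dir/scanalyst-old.tar.gz" "$dir/scanalyst-new.tar.gz"

ls -lt "$dir"                             # newest first, keepers are obvious
find "$dir" -name '*.tar.gz' | wc -l      # how many candidate archives
du -sh "$dir"                             # total space they occupy
```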
This should be cleaned up once we're confident it won't be needed again. Confirmed that "allow restore" of backups is disabled in Admin/Settings. This was apparently done automatically at the completion of the restore. The bottom line is that the site now appears to be running normally, at least after cursory examination and functional testing. The casualty in the migration is Shalmaneser, the chatbot, which has now caused major build disasters in two of the last three system updates. My usual policy with third-party add-ons to this site is "One strike and you're out", but this was so cool I decided to cut it a little slack. Now we've breached the "Two strikes and you're banned from the league for life" threshold, so farewell, Shal.
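Post-mortem note: both "extra data after last expected column" failures above were stray edits inside COPY data blocks, and that class of fat finger can be caught before attempting a restore. A sketch, run here on toy data: every data row between a "COPY ... FROM stdin;" line and its terminating backslash-dot must have the same number of tab-separated columns.

```shell
# Sketch: detect fat-fingered rows inside the COPY blocks of a
# PostgreSQL dump before trying a restore. Toy two-row example below;
# run check_copy_cols against the patched dump.sql for real.
check_copy_cols() {
    awk -F'\t' '
        /^COPY /  { incopy = 1; ncols = 0; next }   # start of a COPY block
        /^\\\.$/  { incopy = 0; next }              # "\." ends the block
        incopy {
            if (ncols == 0) ncols = NF              # first row fixes the width
            else if (NF != ncols)
                printf "line %d: %d columns, expected %d\n", NR, NF, ncols
        }
    ' "$1"
}

dump=$(mktemp)
printf 'COPY public.category_tag_stats (a, b, c) FROM stdin;\n1\t2\t3\n4\t5\t6\t7\n\\.\n' > "$dump"
check_copy_cols "$dump"
# Reports: line 3: 4 columns, expected 3
```

Thirty seconds of this after each hand edit would have saved two failed restore attempts inside the maintenance window.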