Development Log 2023-10-15

2023 October 15

Given the complexity and interactions among all of the moving pieces of
this site, this is what appears to be the best plan to complete the
cut-over to the new site:

    1.  Place the production site in read-only mode.
    2.  Make a complete Discourse backup of the production site.
        Copy off the production machine.
    3.  Make an AWS AMI backup of the production machine, with reboot.
    3a. Make a juno:~/Scanalyst_Backup of the production machine.
    4.  Shut down the production machine.

    5.  Disassociate the Elastic IP address from the production site.
    6.  Attach the Elastic IP address to the new site.  Reboot and
        verify it is accessible through that address.

    7.  Do a full installation of Discourse on the new site.  This
        should install Let's Encrypt and either verify or re-issue the
        certificate for the site.  This is launched by:
            cd ~/discourse/image
            ./discourse-setup
        This should use the containers/app.yml file currently in use by
        the production site.
    8.  Enable restoration of backups on the new site.
    9.  Restore the backup made before shutting down the old site.
   10.  Disable restoration of backups on the new site.
   11.  Verify access to content restored from the backup.
   12.  Verify Let's Encrypt certificate status.  You can see the log with:
            cd ~/discourse/image
            ./launcher logs app | grep -i letsencrypt
        Keys and certificates are placed in:
   13.  Verify ability to send E-mail via Amazon SES.
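Steps 12 and 13 can also be cross-checked from outside the container.  A
hedged sketch (the host name is a placeholder, and a throwaway self-signed
certificate stands in for the real one so the commands can run anywhere):

```shell
# Generate a throwaway self-signed certificate as a stand-in for the
# Let's Encrypt certificate (placeholder CN).
openssl req -x509 -newkey rsa:2048 -nodes -days 7 \
    -subj "/CN=scanalyst.example" \
    -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# Read its validity window.  Against the live site, feed the served
# certificate into the same x509 command instead:
#   echo | openssl s_client -connect scanalyst.example:443 \
#       -servername scanalyst.example 2>/dev/null \
#       | openssl x509 -noout -dates
openssl x509 -noout -dates -in /tmp/demo.crt
```

The notAfter date should be roughly 90 days out for a freshly issued
Let's Encrypt certificate.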

If this fails, then we may be able to fall back on restoring the
/var/discourse directory tree from the backup and rebuilding the
Discourse app.  Doing this alone will probably not set up everything
needed by Let's Encrypt and the other items installed along with the
Docker container.

To prepare for this cut-over procedure, I performed the following
(hopefully!) non-destructive experiments on the production site to see
if and how they work.

Tested putting the site into read-only mode with Admin/Backups and then
pressing "Enable read-only".  After confirming, a blue banner appears
at the top of all pages saying:
    This site is in read only mode. Please continue to browse, but
    replying, likes, and other actions are disabled for now.
When in this mode, the user can open a Reply or New Topic, but when
trying to post it, a pop-up appears saying the site is in read-only
mode and rejecting the operation.  There's a bug in that after this
happens, trying to Delete the failed comment or post does nothing:
you can only delete it after read-only mode is turned off.  Read-only
mode is disabled by pressing the "Disable read-only" button that
appears on the Backups page.

Tested performing a manual backup with the Backup button and selecting
"Include all uploads".  This skips to the Logs page which shows the
progress of the backup.  The backup takes about four minutes and sends
a message to the administrator when it is complete.

When the backup is done, pressing the Download button sends an E-mail
to the administrator with a download link like:
Clicking the link starts a browser download of the backup file.  This
2.4 GB gzipped file took around five minutes to download.  The contents
of this file look like:
    -rw-r--r-- discourse/www-data 226750451 2023-10-15 15:21 dump.sql.gz
    drwxr-xr-x discourse/www-data         0 2021-10-03 01:35 uploads/default/
    drwxr-xr-x discourse/www-data         0 2022-01-31 23:44 uploads/default/optimized/
    drwxr-xr-x discourse/www-data         0 2023-09-25 14:18 uploads/default/optimized/1X/
    -rw-r--r-- discourse/www-data     34305 2021-10-03 01:31 uploads/default/optimized/1X/_129430568242d1b7f853bb13ebea28b3f6af4e7_2_512x512.png
    -rw-r--r-- discourse/www-data      1810 2021-10-03 01:31 uploads/default/optimized/1X/_129430568242d1b7f853bb13ebea28b3f6af4e7_2_32x32.png
The dump.sql.gz file is a sequence of PostgreSQL commands to recreate
the database structure and all of its contents.
    -- PostgreSQL database dump

    -- Dumped from database version 13.12 (Debian 13.12-1.pgdg110+1)
    -- Dumped by pg_dump version 13.12 (Debian 13.12-1.pgdg110+1)

    -- Started on 2023-10-15 13:21:19 UTC

    SET statement_timeout = 0;
    SET lock_timeout = 0;
    SET idle_in_transaction_session_timeout = 0;
    SET client_encoding = 'UTF8';
    SET standard_conforming_strings = on;
    SELECT pg_catalog.set_config('search_path', '', false);
    SET check_function_bodies = false;
    SET xmloption = content;
    SET client_min_messages = warning;
    SET row_security = off;

    -- TOC entry 8 (class 2615 OID 2200)
    -- Name: public; Type: SCHEMA; Schema: -; Owner: -

    CREATE SCHEMA public;

    -- TOC entry 6902 (class 0 OID 0)
    -- Dependencies: 8
    -- Name: SCHEMA public; Type: COMMENT; Schema: -; Owner: -

    COMMENT ON SCHEMA public IS 'standard public schema';
It is a total of 2,262,052 lines of SQL.
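The archive can be inspected without unpacking the whole thing.  A
sketch (run here against a miniature stand-in archive so it is
self-contained; point the same commands at the real scanalyst-*.tar.gz):

```shell
# Build a tiny stand-in backup archive (a one-line SQL dump plus an
# uploads directory) so the commands below can run anywhere.
mkdir -p /tmp/bk/uploads/default
printf 'CREATE SCHEMA public;\n' | gzip > /tmp/bk/dump.sql.gz
tar czf /tmp/bk/backup.tar.gz -C /tmp/bk dump.sql.gz uploads

# List the contents, as in the listing above.
tar tzvf /tmp/bk/backup.tar.gz

# Count the lines of SQL in the dump.
zcat /tmp/bk/dump.sql.gz | wc -l
```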

Tested enabling restoration of backups with Admin/Settings/Backups:
    allow restore
This enables the Restore buttons in the Admin/Settings/Backups page.
Unchecking it disables the Restore buttons again.

At 19:00 UTC, I set the site to read-only mode and started a manual
backup of the database and uploads.  Backup finished at 19:05.

At 19:13 started download of the backup file to Hayek.  Download
completed at 19:18; the file size is 2,628,856,601 bytes (2.5 GB).
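Back-of-envelope throughput for that download (five minutes, about 300
seconds):

```shell
# 2,628,856,601 bytes in ~300 seconds, in whole MB/s.
echo "$(( 2628856601 / 300 / 1000000 )) MB/s"   # prints "8 MB/s"
```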

Made a snapshot backup to Juno.  Started at 19:20, completed at 19:45.

Made a backup AMI of L2023:
    Scanalyst Backup 2023-10-15 ami-0538ccd8c4c7b2f06
        /           snap-06510cb44818ae525
        /server     snap-0adfafe587d78673d

Made a backup AMI of production Scanalyst:
    Scanalyst Production Backup 2023-10-15 ami-067358c923bee6eb2
        /           snap-0ca666501de3468ee
        /server     snap-0a9f753aec8bcefdf

Backup AMI creation complete at 20:11.

Shut down the production machine with:
    ssh sc
    cd ~/discourse/image
    ./launcher stop app
    shutdown -h now
At 20:19 the AWS Instances panel showed the Scanalyst instance stopped.

In the AWS EC2 console, selected the following Elastic IP address:
    Name:               Scanalyst
    Allocated IPv4:
    Allocation ID:      eipalloc-80070b8d
    Associated Instance: i-0756b0658cac51c9d
    Private IP address:
and performed Actions/Disassociate Elastic IP address.

Selected Elastic IP address Scanalyst and Actions/Associate Elastic IP
    Name:               Scanalyst
    Resource type:      Instance
    Instance:           i-092217f3135c36549 (Scanalyst L2023)
    Private IP address:
    Reassociation:      No

    Association ID:     eipassoc-077900921ae1436c7

The Instances item for Scanalyst L2023 now shows the Elastic IP address
as its public IPv4 address.  I can SSH to it after clearing the changed
host key warning.
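The stale-key warning can be cleared non-interactively with
ssh-keygen -R.  A sketch against a scratch known_hosts file (the host
name is a placeholder; omit the -f option to operate on the default
~/.ssh/known_hosts):

```shell
# Build a scratch known_hosts containing an entry for the old host key.
rm -f /tmp/demo_key /tmp/demo_key.pub
ssh-keygen -q -t ed25519 -N '' -f /tmp/demo_key
printf 'scanalyst.example %s\n' "$(cat /tmp/demo_key.pub)" \
    > /tmp/known_hosts_demo

# Remove the stale entry for the host whose key changed; the original
# file is retained with a .old suffix.
ssh-keygen -f /tmp/known_hosts_demo -R scanalyst.example
```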

At 20:28 began the process of Discourse installation:
    cd ~/discourse/image
    ./discourse-setup
It reported that it was using my existing app.yml, which I accepted.
Downloaded components of base system.
Accepted site configuration parameters from app.yml.

At 22:37 installation failed:
    Pups::ExecError: cd /var/www/discourse && su postgres -c 'psql discourse -c "create extension if not exists embedding;"' failed with return #
    Location of failure: /usr/local/lib/ruby/gems/3.2.0/gems/pups-1.1.1/lib/pups/exec_command.rb:117:in `spawn'
    exec failed with the params {"cd"=>"$home", "cmd"=>["su postgres -c 'psql discourse -c \"create extension if not exists embedding;\"'"]}
    bootstrap failed with exit code 1
    ** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
    ./discourse-doctor may help diagnose the problem.

Well, this error message gives me no clue at all what or where the
problem may be.  Let's try this ./discourse-doctor thing, which never
helped before, but you never know.

Failed again with:
    Pups::ExecError: cd /var/www/discourse && su postgres -c 'psql discourse -c "create extension if not exists embedding;"' failed with return #
    Location of failure: /usr/local/lib/ruby/gems/3.2.0/gems/pups-1.1.1/lib/pups/exec_command.rb:117:in `spawn'
    exec failed with the params {"cd"=>"$home", "cmd"=>["su postgres -c 'psql discourse -c \"create extension if not exists embedding;\"'"]}
    bootstrap failed with exit code 1
    ** FAILED TO BOOTSTRAP ** please scroll up and look for earlier error messages, there may be more than one.
    ./discourse-doctor may help diagnose the problem.

All right, let's try commenting out all of the local plug-ins and try
the discourse-setup with a completely stock system.  Edit
~/discourse/image/containers/app.yml and comment out all of the local
plug-in set-up code and the plug-ins themselves.
    cd ~/discourse/image
    ./discourse-setup
Oops!  I accidentally disabled the built-in (and presumably required)
plug-in docker_manager.  Killed build, added it back to app.yml and
started new build.
    cd ~/discourse/image
    ./discourse-setup
This time the build ran to completion and executed the /usr/bin/docker run
command at 21:22.  Ruby runs CPU bound in several processes for a while,
then settles down to around 1% of CPU and 9% of memory.
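For the record, the plug-in portion of app.yml after this editing pass
looked something like the following.  This is reconstructed from the
plug-in names mentioned in this log; the repository URLs are the usual
upstream locations, and the whole fragment should be treated as
illustrative rather than a copy of the real file:

```yaml
hooks:
  after_code:
    - exec:
        cd: $home/plugins
        cmd:
          # docker_manager must stay: the standard install requires it.
          - git clone https://github.com/discourse/docker_manager.git
          # Everything else commented out for the stock build:
          # - git clone https://github.com/discourse/discourse-spoiler-alert.git
          # - git clone https://github.com/discourse/discourse-math.git
          # Chatbot (Shalmaneser) plug-in clone commented out, along
          # with its embedding-extension set-up commands:
          # - git clone (chatbot plug-in repository URL)
```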

CAZART!!!  We got the "Congratulations, you installed Discourse!" page.

Registered administration account:
    Email:  (won't let me change this)
    Username:   Fourmilab
    Password:   REDACTED
Says it sent activation mail.

Activation mail arrived OK at the alias address for
Copied the link into the Chrome browser.  It said the account was
activated and prompted for:
    About your site
        I filled in placeholder information
    Member experience
        Require approval, Enable chat, Disable sidebar
It then allows me to Configure more or Jump in!  Since we're going to
replace the entire database with a restore of the production database,
I jumped in.

Now we're at the home page of a generic empty site, logged in with an
administrator account.

Let's Encrypt apparently found the certificate it issued earlier from
the entry in the DNS for the site address and re-issued it when
discourse-setup queried it.

Uploaded the ~/scanalyst-2023-10-15-190116-v20230926165821.tar.gz
backup file from Hayek to the new install with: Admin/Backups/Upload.
The backup upload fails consistently at the 20% mark with:
    Sorry, there was an error uploading
        scanalyst-2023-10-15-190116-v20230926165821.tar.gz. Please try again.
When this happens, Discourse recommends restoring the backup from the
command line.  On Hayek, after re-introducing root to the new host key,
I copied:
    scp -p ~/scanalyst-2023-10-15-190116-v20230926165821.tar.gz sc:/var/discourse/shared/standalone/backups/default
After the copy is complete, the backup shows up in the Admin/Backups
tab.  I opted to complete the restore using the Docker command line:
    cd ~/discourse/image
    ./launcher enter app
    discourse enable_restore
        Restores are now permitted. Disable them with `disable_restore`
    discourse restore scanalyst-2023-10-15-190116-v20230926165821.tar.gz
    Starting restore: scanalyst-2023-10-15-190116-v20230926165821.tar.gz
    'system' has started the restore!
    Marking restore as running...
    Making sure /var/www/discourse/tmp/restores/default/2023-10-15-220557 exists...
    Copying archive to tmp directory...
    Unzipping archive, this may take a while...
    Extracting dump file...
    Validating metadata...
    Current version: 20231004020328
    Restored version: 20230926165821
    Enabling readonly mode...
    Pausing sidekiq...
    Waiting up to 60 seconds for Sidekiq to finish running jobs...
    Creating missing functions in the discourse_functions schema...
    Restoring dump file... (this may take a while)
    (1 row)
    [Many more messages]
    ERROR:  unrecognized parameter "dims"
    EXCEPTION: psql failed: ERROR:  unrecognized parameter "dims"
    /var/www/discourse/lib/backup_restore/database_restorer.rb:92:in `restore_dump'
    /var/www/discourse/lib/backup_restore/database_restorer.rb:26:in `restore'
    /var/www/discourse/lib/backup_restore/restorer.rb:51:in `run'
    script/discourse:149:in `restore'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor/command.rb:27:in `run'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor/invocation.rb:127:in `invoke_command'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor.rb:392:in `dispatch'
    /var/www/discourse/vendor/bundle/ruby/3.2.0/gems/thor-1.2.2/lib/thor/base.rb:485:in `start'
    script/discourse:290:in `<main>'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli/exec.rb:58:in `load'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli/exec.rb:58:in `kernel_load'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli/exec.rb:23:in `run'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli.rb:492:in `exec'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli.rb:34:in `dispatch'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/cli.rb:28:in `start'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/exe/bundle:45:in `block in <main>'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/lib/bundler/friendly_errors.rb:117:in `with_friendly_errors'
    /usr/local/lib/ruby/gems/3.2.0/gems/bundler-2.4.13/exe/bundle:33:in `<main>'
    /usr/local/bin/bundle:25:in `load'
    /usr/local/bin/bundle:25:in `<main>'
    Trying to rollback...
    Rolling back...
    Cleaning stuff up...
    Dropping functions from the discourse_functions schema...
    Removing tmp '/var/www/discourse/tmp/restores/default/2023-10-15-220557' directory...
    Unpausing sidekiq...
    Marking restore as finished...
    Notifying 'system' of the end of the restore...
    Restore done.
Blooie.  My guess is that the restore is trying to create plug-in
information for plug-ins we have disabled.  Let's try turning them
back on.  First, start with the easy, standard ones, discourse-spoiler-alert
and discourse-math.
    ./launcher rebuild app
All right, that rebuilt with no errors.  The Admin/Plugins page shows
both new plugins installed.  Tested both plug-ins and they work.

Now let's try bringing Shalmaneser back.  Enable all of the commented
out preliminary code in app.yml and the plug-in.
    ./launcher rebuild app
Failed on the command:
    su postgres -c 'psql discourse -c "create extension if not exists embedding;"'
which is the last command in the additions required to support chatbot.

I'm pretty sure now what's happening is that the SQL restore file
contains references to tables created by chatbot which haven't been
created.  The question is whether we can get by this without having to
go back to the old site and patching the database to remove them and
then ditching chatbot before we export the database and start all
over again.
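One way to gauge how much chatbot material is in the dump before
deciding.  The table name chatbot_embeddings is hypothetical, and a
tiny stand-in dump is built here so the commands are self-contained;
run the grep against the real dump.sql.gz:

```shell
# Build a miniature stand-in dump containing one chatbot-ish table.
printf '%s\n' \
    'CREATE TABLE public.posts (id integer);' \
    'CREATE TABLE public.chatbot_embeddings (id integer);' \
    'COPY public.chatbot_embeddings (id) FROM stdin;' \
    | gzip > /tmp/dump.sql.gz

# Number and list every statement that mentions the plug-in.
zcat /tmp/dump.sql.gz | grep -n -i 'chatbot'
```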

Rebuilt the site once again without chatbot.  Now we have a platform to
load a backup without the chatbot entries which cause it to fail.

Spent (wasted?) an hour and a half manually deleting items in the
PostgreSQL dump which looked like they belonged to chatbot.  Tarred
up the compressed patched dump with the uploads from the original
backup and attempted to restore it from the command line.  Verdict:
wasted.  Failed with:
    ERROR:  extra data after last expected column
    CONTEXT:  COPY category_tag_stats, line 406: "5033	16	4076	1 4345	7	3559	1"
    EXCEPTION: psql failed: CONTEXT:  COPY category_tag_stats, line 406: "5033	16	4076	1 4345	7	3559	1"
And now we've run past the 2023-10-16 00:00 drop-dead time I gave as
the end of the maintenance window.  But the last collapse looks like a
fat finger whilst editing the SQL dump, so let's fix it and have
another go before giving up and reversing everything back to the
status quo.  The next attempt failed with:

    ERROR:  extra data after last expected column
    CONTEXT:  COPY category_tag_stats, line 1406: "4950	7	4020	1 3587	6	2958	1"
    EXCEPTION: psql failed: CONTEXT:  COPY category_tag_stats, line 1406: "4950	7	4020	1 3587	6	2958	1"
Another fat finger.  Let's try fixing this one.
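A quicker way to catch this class of fat finger than eyeballing the
dump: each category_tag_stats COPY row should have exactly four
tab-separated fields, so any line with a different count is a joined
(or split) row.  A sketch, run here against inline sample data rather
than the real dump:

```shell
# Three sample COPY rows; the third is two rows fused into one line,
# as happens when a newline is accidentally deleted while editing.
printf '5033\t16\t4076\t1\n4345\t7\t3559\t1\n4950\t7\t4020\t1\t3587\t6\t2958\t1\n' \
    > /tmp/copy_sample.txt

# Flag every row whose field count is not 4.
awk -F'\t' 'NF != 4 { printf "line %d has %d fields\n", NR, NF }' \
    /tmp/copy_sample.txt
```

Here it prints `line 3 has 8 fields`, pinpointing the fused row.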

This time we got to "Migrating the database" and "Reconnecting to the
database".  Then:
    Migrating the database...
    Compiled theme-transpiler: tmp/theme-transpiler.js
    == 20230807033021 AddGroupToWebHookEventType: migrating =======================
    -- add_column(:web_hook_event_types, :group, :integer)
       -> 0.0008s
    == 20230807033021 AddGroupToWebHookEventType: migrated (0.0018s) ==============

    == 20230807040058 MoveWebHooksToNewEventIds: migrating ========================
    -- execute("INSERT INTO web_hook_event_types_hooks(web_hook_event_type_id, web_hook_id)\nSELECT 101,
    == 20230807040058 MoveWebHooksToNewEventIds: migrated (0.0026s) ===============

    == 20231004020328 MigrateLegacyNavigationMenuSiteSetting: migrating ===========
    -- execute("UPDATE site_settings SET value = 'header dropdown' WHERE name = 'navigation_menu' AND value = 'legacy'")
       -> 0.0005s
    == 20231004020328 MigrateLegacyNavigationMenuSiteSetting: migrated (0.0009s) ==

    Reconnecting to the database...
    Reloading site settings...
    Disabling outgoing emails for non-staff users...
    Disabling readonly mode...
    Clearing category cache...
    Reloading translations...
    Remapping uploads...
    Restoring uploads, this may take a while...
    Posts will be rebaked by a background job in sidekiq. You will see missing images until that has completed.
    You can expedite the process by manually running "rake posts:rebake_uncooked_posts"
    Clearing emoji cache...
    Clear theme cache
    Executing the after_restore_hook...
    Cleaning stuff up...
    Dropping functions from the discourse_functions schema...
    Removing tmp '/var/www/discourse/tmp/restores/default/2023-10-16-001735' directory...
    Unpausing sidekiq...
    Marking restore as finished...
    Notifying 'system' of the end of the restore...
    Restore done.

I was logged out.  At this point, I could log back in to my user
account and, at a first glance, everything looked OK.

I was able to log in to the administrator account.  All logins (at
least as administrator) show a banner at the top:
    Outgoing email has been disabled for non-staff users.
Not sure what that's about.

Sent an E-mail to myself from the Fourmilab account.  It was sent and
received correctly.

The Plugins page shows docker_manager, discourse-spoiler-alert, and
discourse-math installed and enabled.  I tested the latter two and they
are working.

As of 00:19 UTC on 2023-10-16, the site is back up and accessible
through the usual URL.  That's 20 minutes after
my "no later than" time, but I judged that better than starting all over
back at the old site.

The "Outgoing email has been disabled" after migration to a new server
is discussed in:
To fix this, go to Admin/Settings and search for "disable emails".  Then
change the setting from "non-staff" to "no".  The banner goes away.

Memo to file: there is a tremendous amount of cruft in:
resulting from today's misadventures.  This should be cleaned up once
we're confident it won't be needed again.

Confirmed that "allow restore" of backups is disabled in Admin/Settings.
This was apparently done automatically at the completion of the restore.

The bottom line is that the site now appears to be running normally, at 
least after cursory examination and functional testing.  The casualty 
in the migration is Shalmaneser, the chatbot, which has now caused 
major build disasters in two of the last three system updates.  My 
usual policy with third-party add-ons to this site is "One strike and 
you're out", but this was so cool I decided to cut it a little slack.  
Now we've breached the "Two strikes and you're banned from the league 
for life" threshold, so farewell, Shal.