frickler | lp changed their favicon, my inner monk is upset | 11:50 |
---|---|---|
Clark[m] | Gerrit 3.5 upgrade begins at 20:00 utc today. I'll pop back in around then to help out if necessary | 14:45 |
fungi | i'll try to be around at that time as well | 18:44 |
fungi | ~1.25 hours from now | 18:44 |
ianw | o/ | 19:59 |
clarkb | good morning | 19:59 |
fungi | ohai! | 20:00 |
ianw | #status notice "Gerrit will be unavailable for a short time as it is upgraded to the 3.5 release" | 20:00 |
opendevstatus | ianw: sending notice | 20:00 |
-opendevstatus- | NOTICE: "Gerrit will be unavailable for a short time as it is upgraded to the 3.5 release" | 20:00 |
ianw | https://etherpad.opendev.org/p/gerrit-upgrade-3.5 is the checklist | 20:01 |
clarkb | ianw: just rereviewing the checklist again we don't seem to have an explicit reindex step. Are we relying on online reindexing then? | 20:02 |
ianw | i think so, must be what we did last time too? | 20:03 |
opendevstatus | ianw: finished sending notice | 20:03 |
clarkb | oh yes I think that is correct. I'm thinking of the init step not the reindex though | 20:03 |
clarkb | rereading the upgrade notes there is no schema change and an offline reindex is only necessary if upgrading from 3.3 or older | 20:04 |
clarkb | I've added a note to step 13 to ensure online reindexing is completed | 20:05 |
clarkb | that should cover all my concern here | 20:05 |
ianw | oh, running the system backup with the mariadb container down doesn't work, doh | 20:08 |
ianw | that step should be before shutting down containers | 20:08 |
clarkb | ah right because it uses mysqldump | 20:09 |
clarkb | and that needs a running mysql server | 20:09 |
ianw | i just restarted mariadb and did another run, so we have the full backup now | 20:10 |
clarkb | ianw: did you stop the db again? | 20:11 |
ianw | yep | 20:11 |
clarkb | ack | 20:11 |
ianw | the 645dc2 image is still the latest, and everything lines up there | 20:14 |
ianw | i agree on waiting for the reindex. i think we should still do the mariadb update in a separate step, just to make sure gerrit 3.5 is happy first | 20:15 |
clarkb | wfm the mariadb upgrade is also much lower priority we can skip it if necessary | 20:16 |
clarkb | any idea what those key exchange errors are about? | 20:17 |
ianw | hrm, nothing too much in the error log, but there are two hosts that seem to be looping around failing to authenticate | 20:18 |
ianw | it's weird that the id is null@<ip> | 20:18 |
clarkb | ianw: it may be the user isn't passed until after kex happens? | 20:19 |
clarkb | I am able to ssh and gerrit ls-projects so ssh seems to work generally | 20:19 |
clarkb | one host appears to be an opensuse host and another an IBM host? I think we can likely proceed and try to follow up with them later | 20:20 |
clarkb | and possibly block them via iptables if the logging becomes too much | 20:21 |
ianw | yeah, it's also been happening well before this, at least since 05-31 | 20:21 |
clarkb | ah ok | 20:22 |
ianw | maybe we should iptables block them; perhaps whatever is doing it doesn't handle kex errors well, but might raise an error to its owners if it's cut off? clearly nobody is looking at whatever it's doing too closely | 20:22 |
clarkb | browsing random changes I see that a little "login is required to perform this action" popup occurs in the bottom left when opening file diffs | 20:22 |
clarkb | I suspect this is a regression in the UI not handling anonymous users properly. I don't think that is fatal enough to rollback either | 20:23 |
clarkb | that's the sort of thing I can dig into later this week and probably push a fix for if no one else is interested | 20:23 |
ianw | ++ i agree i see that on an anonymous browse too | 20:24 |
clarkb | if I had to guess it is trying to mark the files as reviewed | 20:24 |
clarkb | but it can't do that unless logged in | 20:24 |
clarkb | 1717 tasks remaining down from 1882 a few minutes ago. Reindexing seems to be progressing | 20:25 |
ianw | an initial watch of the network requests when it pops up doesn't show anything incredibly obvious. so yeah agree we can work on it after | 20:26 |
fungi | lgtm so far | 20:30 |
fungi | (sorry, had to step away for a few minutes) | 20:30 |
clarkb | ianw: I'll let you drive step 13, but let me know if I can help with any of those sub tasks | 20:31 |
clarkb | if someone pushes a change that will check zuul and gitea replication transitively | 20:32 |
clarkb | web response I think looks good | 20:32 |
opendevreview | Ian Wienand proposed opendev/system-config master: [dnm] trigger bunch of jobs to test gerrit 3.5 https://review.opendev.org/c/opendev/system-config/+/846510 | 20:34 |
clarkb | that also checks that gerritbot is happy :) | 20:34 |
clarkb | zuul has queued up jobs for that change. | 20:34 |
fungi | yay! | 20:34 |
clarkb | the replication logs for that look good too, now to check I can fetch the ref | 20:35 |
clarkb | I can fetch refs/changes/10/846510/1 from at least one of the giteas using the load balanced frontend | 20:36 |
clarkb | have we rechecked any changes? | 20:37 |
clarkb | down to 1085 tasks | 20:37 |
ianw | https://review.opendev.org/844912 is in the queue from a recheck | 20:37 |
clarkb | agreed that lgtm too | 20:37 |
fungi | yep, seems to be working | 20:43 |
clarkb | under 500 tasks now | 20:44 |
clarkb | seems to be moving very quickly | 20:44 |
ianw | https://twitter.com/opendevinfra/status/1538613440511713281 also put a pin in the right place. i think that's the first time it's seen a notice level | 20:44 |
clarkb | 50 now | 20:45 |
clarkb | error log reports it is done reindexing | 20:46 |
ianw | i see it all done | 20:46 |
clarkb | Reindex changes to version 71 complete then Using changes schema version 71 | 20:46 |
clarkb | maybe let it steady state for a few minutes then proceed with the db work? though it should be fine to proceed at this point | 20:46 |
ianw | ++ | 20:48 |
fungi | agreed | 20:52 |
fungi | we're still well within the hour estimate | 20:52 |
fungi | not that it's a big deal if we go longer | 20:53 |
fungi | that was an estimate for the outage anyway, which was over in a few minutes | 20:53 |
clarkb | yup though looks like ianw is proceeding | 20:53 |
clarkb | And gerrit is up again. Time for me to login and review the changes that update our configs | 20:55 |
clarkb | in the process of doing ^ I checked that I could mark files unreviewed and then review them and have them get marked reviewed again. That bit all looked fine to me | 20:57 |
ianw | mark reviewed wfm | 20:57 |
clarkb | and sudo docker ps -a confirms we're running a 10.6 mariadb image (at least it is tagged that way) | 20:58 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/844362/1 and child are the two changes we need to land to reflect the new upgraded state if you are happy with the results | 20:58 |
ianw | Code-Review 0 (vote reset) -- that feels new, that it points out this is a vote reset | 21:00 |
fungi | yep, both of those lgtm | 21:00 |
clarkb | yes I think that is new | 21:00 |
clarkb | fungi has approved both changes. I think that's it for now? we wait for changes to land and remove the host from the emergency file? | 21:01 |
ianw | ++ | 21:01 |
clarkb | ianw: I think you can remove the hosts from the emergency file nowish? since we don't run hourly service-review.yaml? But I'm happy to let you coordinate that as I'm likely to pay less attention to it than you are | 21:01 |
ianw | probably should have squished those actually, to avoid small window of us rewriting it back to mariadb 10.4 | 21:02 |
clarkb | ianw: since we don't auto restart things I think it will be ok. But agreed | 21:02 |
ianw | i think for ^ above leave it in emergency until both merge, then we'll just write the latest config | 21:02 |
clarkb | wfm | 21:02 |
clarkb | the gerrit image build jobs against 846510 are being retried | 21:03 |
clarkb | there don't appear to be logs for the first build. Not super concerned about that but something to follow up on if we've got less reliable image builds for some reason | 21:03 |
clarkb | the gate jobs for the changes that matter will use the already built images | 21:04 |
ianw | #status ok Gerrit 3.5 upgrade is complete. Please reach us in #opendev if you see any issues | 21:07 |
opendevstatus | ianw: sending ok | 21:07 |
-opendevstatus- | NOTICE: Gerrit 3.5 upgrade is complete. Please reach us in #opendev if you see any issues | 21:07 |
ianw | i'm not sure if that works if not in alert | 21:07 |
clarkb | the 3.6 upgrade process is a bit more involved. Will have to look at the upgrade job to see how to incorporate the extra bits for that | 21:07 |
ianw | i guess it does :) | 21:07 |
clarkb | the second attempt at those image builds for 846510 succeeded | 21:08 |
clarkb | also I think this cadence where we end up doing a major release update after that release has had a couple of bug fix releases is working out for us. Lots of people had problems with 3.6 initially after the upgrade | 21:10 |
clarkb | But at the same time if we get close enough to master then our CI can maybe help catch those problems before anyone upgrades and that would be a big win too | 21:10 |
opendevstatus | ianw: finished sending ok | 21:12 |
clarkb | ianw: the green check mark has some trailing text on twitter for that ok message | 21:13 |
clarkb | ✅\efe0fGerrit 3.5 upgrade is complete | 21:13 |
ianw | yeah, i'll look into that :) | 21:14 |
fungi | quoting of the ok notice on twitter looks odd | 21:14 |
ianw | yeah we don't need the "" on the alert | 21:14 |
ianw | oh, yeah also the OK is a bit borked. i think it's the first time we used it | 21:15 |
fungi | yeah, the "\efe0f" looks like some encoding hork-up | 21:15 |
clarkb | https://review.opendev.org/c/openstack/tripleo-heat-templates/+/841207 just merged | 21:31 |
clarkb | a good sign that zuul is happy | 21:31 |
fungi | excellent | 21:31 |
clarkb | ianw: re the emergency file I think the infra-prod-service-review job will only run when triggered by the changes merging. Or we wait for the periodic run later today or we trigger it manually. | 21:36 |
clarkb | Maybe the idea is to update the emergency file just before the second change starts running its job but after the first one is complete? | 21:37 |
clarkb | In any case I'm not too concerned about it since it is a simple template update and we can follow up later if necessary | 21:37 |
opendevreview | Merged opendev/system-config master: gerrit: Update mariadb to 10.6 https://review.opendev.org/c/opendev/system-config/+/844362 | 21:39 |
opendevreview | Merged opendev/system-config master: gerrit: Update to 3.5 for production https://review.opendev.org/c/opendev/system-config/+/844363 | 21:40 |
clarkb | The deploy job for that first one is running already. | 21:40 |
clarkb | ianw: fungi: should I go ahead and remove review02 from the emergency file as soon as that first job is done running ansible? | 21:41 |
clarkb | I think it just finished | 21:42 |
clarkb | I'm going to go ahead and remove it from the emergency file | 21:42 |
clarkb | that's done | 21:43 |
clarkb | ok first job is done. Second one should start running shortly and it should noop apply the update | 21:44 |
clarkb | this one will run manage-projects too fwiw | 21:44 |
clarkb | (because it updated the vars for gerrit not just the docker compose template) | 21:44 |
fungi | yeah, that seems safe | 21:45 |
clarkb | it said changed false on the put docker compose file in place task | 21:47 |
clarkb | same for the various gerrit config files | 21:47 |
clarkb | ok service-review is done and it lgtm | 21:49 |
clarkb | I checked the docker compose file and it appears as I expect and the containers weren't restarted (also as expected) | 21:49 |
clarkb | infra-prod-manage-projects will run shortly | 21:50 |
clarkb | it is starting to run its playbook now | 21:52 |
clarkb | I think manage projects nooped as much as it normally does. Lots of reports that it is skipping management of projects because ACLs match | 21:58 |
clarkb | I've marked off the last two items from the etherpad, but feel free to review the logs for the deployment for idempotency | 21:59 |
ianw | thanks for updating that | 22:10 |
opendevreview | Ian Wienand proposed opendev/statusbot master: Fix typo on OK tick https://review.opendev.org/c/opendev/statusbot/+/846533 | 23:42 |
opendevreview | Ian Wienand proposed opendev/statusbot master: twitter: Fix typo on OK tick https://review.opendev.org/c/opendev/statusbot/+/846533 | 23:42 |
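
Note on the replication check at 20:36: the ref fetched there, refs/changes/10/846510/1, follows Gerrit's change ref layout, where the middle component is the last two digits of the change number (zero-padded) and the final component is the patchset number. A minimal Python sketch of that mapping follows; the helper name `change_ref` is hypothetical and only for illustration, it is not part of Gerrit or of any tooling used in this log.

```python
def change_ref(change: int, patchset: int) -> str:
    """Build the Gerrit ref name for a given change number and patchset.

    Hypothetical helper for illustration: Gerrit shards change refs by
    the last two digits of the change number, zero-padded to width 2.
    """
    shard = f"{change % 100:02d}"
    return f"refs/changes/{shard}/{change}/{patchset}"


# The change pushed during the upgrade check, patchset 1.
print(change_ref(846510, 1))
```

A ref built this way is what gets passed to `git fetch <remote> <ref>` when checking that a replica has received the change.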