fungi | which jdk version do we need for gerrit 3.9? | 00:56 |
---|---|---|
fungi | 11 seems to be too old, with 21 it looks like maybe gerrit isn't starting (tests are timing out) | 00:57 |
*** dmitriis is now known as Guest8110 | 01:29 |
Clark[m] | 17 | 01:44 |
fungi | thanks, adjusting | 01:59 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Update the upper bound for Python and Gerrit tests https://review.opendev.org/c/opendev/git-review/+/920845 | 02:02 |
opendevreview | Lukas Kranz proposed openstack/diskimage-builder master: Fix setting apt mirror for noble https://review.opendev.org/c/openstack/diskimage-builder/+/920466 | 06:18 |
clarkb | looking at https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&pipeline=periodic&skip=0 the only daily infra-prod job that failed was the cloud launcher which we know about and doesn't directly impact the gerrit upgrade so I think we're good from that perspective | 13:51 |
clarkb | I'm going to sort out some breakfast and tea but I'm currently operating under the assumption that we will proceed as planned. At around 1500 UTC I'll put hosts in the emergency file so that is done well in advance of the 1600 outage window. We can send the service notice at 1600 and take gerrit down as soon as that is done reporting | 13:58 |
clarkb | I might swap the target backup host for the db backup as the one in the pad just reported it needs pruning. | 13:59 |
clarkb | though really this db is relatively small and shouldn't have a big impact on that | 13:59 |
tonyb | sounds good. I need to set up for the day and once that has happened I can prune the backups so we have options | 14:10 |
tonyb | it needs to happen anyway so we may as well do it now | 14:10 |
tonyb | clarkb: pruning is running on both backup servers | 14:27 |
clarkb | ack (I think it was only the one that needed it but shouldn't be an issue to do both) | 14:27 |
tonyb | Yeah I did backup02 recently but it isn't any harder to do both | 14:28 |
clarkb | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/X2Z3BY4ON2KY3CC7E7QEPXCJGGREPK5M/ This is the upgrade downtime announcement | 14:44 |
clarkb | https://etherpad.opendev.org/p/gerrit-upgrade-3.9 is the documented plan and testing and all that. Importantly it has the steps we'll be running through to do the upgrade | 14:44 |
clarkb | tonyb: is the pruning still in process? I guess I don't remember how long that took the last time I did it | 14:54 |
clarkb | ok hosts are in the emergency file | 15:10 |
clarkb | and root screen on review02 has been started | 15:11 |
clarkb | I think we're ready to proceed at 1600 assuming the backup servers are still happy | 15:12 |
tonyb | backup02 is done, backup01 is still running. | 15:13 |
clarkb | ack thanks | 15:14 |
clarkb | 02 is the one I've got in the plan so that should be good to go | 15:14 |
tonyb | Okay. I'm ready to roll when the clock strikes 9/11/16 | 15:26 |
clarkb | I find myself being impatient. Should've scheduled for 15:30 :) | 15:55 |
tonyb | LOL | 15:55 |
tonyb | 5 more mins | 15:56 |
clarkb | I can't really do anything else because I know in the not distant future I'm going to be focused on one particular thing | 15:56 |
corvus | while you're waiting; what's the latest on the git cache clone perms stuff? | 15:56 |
clarkb | so I end up sitting here thinking "should've started sooner" | 15:56 |
clarkb | corvus: the only current problem I am aware of with that is in the cloud launcher job. All of the other daily infra-prod jobs we ran last night passed | 15:57 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&pipeline=periodic&skip=0 | 15:57 |
corvus | cool, will the same fix work for cloud launcher? | 15:57 |
clarkb | corvus: I think rebuilding of jammy and noble with the chown got those fixed and I suspect nothing else was really broken on the CI side | 15:57 |
clarkb | corvus: no, there the failure is due to root on bridge trying to install an ansible module (the cloud integration stuff I think) from a zuul owned repo | 15:58 |
clarkb | corvus: for that one we probably will just trust the path since it is a single path and we don't have to worry about thousands | 15:58 |
tonyb | So same issue just in reverse | 15:58 |
clarkb | and not in our ci images | 15:58 |
tonyb | Yeah | 15:58 |
tonyb | t-45s | 15:59 |
clarkb | yup I'll send the notice momentarily | 15:59 |
tonyb | ++ | 15:59 |
clarkb | #status notice Gerrit on review.opendev.org is being upgraded to version 3.9 and will be offline. We have allocated an hour for the outage window lasting until 1700 UTC | 16:00 |
opendevstatus | clarkb: sending notice | 16:00 |
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.9 and will be offline. We have allocated an hour for the outage window lasting until 1700 UTC | 16:00 |
clarkb | I will run the docker compose down once that is done sending then continue to proceed through my list of steps on the etherpad | 16:00 |
tonyb | Okay | 16:01 |
clarkb | corvus: tonyb or maybe we can change the clone from a local clone to a remote one. We don't really need to ensure the latest version of that at all times since it rarely changes | 16:02 |
tonyb | clarkb: I'm looking at it now. | 16:02 |
opendevstatus | clarkb: finished sending notice | 16:02 |
clarkb | alright proceeding with the shutdown | 16:03 |
clarkb | both halves of the backup look good to me (I tailed the log in another terminal) | 16:05 |
clarkb | the image that was pulled lgtm. I'm going to proceed with the actual upgrade | 16:09 |
clarkb | and starting gerrit nowish | 16:10 |
clarkb | [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.9.5-4-g782ac1b464-dirty ready | 16:11 |
clarkb | reindexing has started too as expected | 16:11 |
clarkb | show-queue -w -q is quite impressive (lots of stuff to reindex) | 16:12 |
clarkb | my dashboard loads and the reported version looks correct | 16:13 |
clarkb | there are warnings like "Did not find a single change associated with parent revision sha1here" for some changes in the error log. I suspect this is an artifact of reindexing though as the changes seem to load just fine for me and have working diffs | 16:14 |
clarkb | oh nope those are merge commits so there is no parent change in gerrit | 16:15 |
clarkb | probably a bit verbose to have warnings about that as it is my understanding that is a perfectly normal operating state for gerrit | 16:15 |
clarkb | tonyb: maybe you can recheck a change and see if it is happy? I've got a change queued up I'll push too | 16:16 |
opendevreview | Clark Boylan proposed openstack/project-config master: Update jeepyb to run Gerrit 3.9 image builds https://review.opendev.org/c/openstack/project-config/+/920922 | 16:16 |
clarkb | change push ^ | 16:16 |
clarkb | hrm i don't see 920922 in zuul status yet | 16:17 |
tonyb | I have rechecked https://review.opendev.org/c/opendev/system-config/+/920760 | 16:17 |
clarkb | nevermind 920922 is there I am just impatient (as noted previously) | 16:17 |
clarkb | 920760 has enqueued now too so that is looking good | 16:18 |
tonyb | Yup | 16:18 |
clarkb | the last thing on the check functionality list is to check replication is working | 16:18 |
clarkb | I'll work on that | 16:19 |
clarkb | then once reindexing is confirmed to be done and successful we can probably remove hosts from the emergency file and approve the change to update our docker-compose config to match what I did by hand | 16:19 |
tonyb | Sounds good, as I have no idea how to verify replication is working | 16:20 |
tonyb | Okay. | 16:20 |
clarkb | we can fetch the refs/changes/something/something path out of gitea to check replication of premerge items | 16:20 |
clarkb | I'm figuring those paths out for 920922 now as I know that wasn't created until after the upgrade | 16:20 |
tonyb | Ah okay. I see | 16:21 |
clarkb | tonyb: `git fetch origin refs/changes/22/920922/1` | 16:21 |
clarkb | that did work for me | 16:21 |
clarkb | for origin https://opendev.org/openstack/project-config (fetch) | 16:21 |
tonyb | Okay I know for next time | 16:22 |
clarkb | and git show FETCH_HEAD looks how I expect | 16:22 |
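Gerrit shards per-patchset refs by the last two digits of the change number, which is why change 920922 ends up under `refs/changes/22/`. A small helper (hypothetical, just to illustrate the layout) can build the ref to fetch from the gitea mirror:

```python
def change_ref(change_number: int, patchset: int) -> str:
    # Gerrit shards change refs by the last two digits of the
    # change number, zero-padded: refs/changes/XX/NNNNNN/P
    return f"refs/changes/{change_number % 100:02d}/{change_number}/{patchset}"

# The change pushed above, patchset 1:
print(change_ref(920922, 1))  # refs/changes/22/920922/1
```

Passing the result to `git fetch origin <ref>` against the gitea remote, as done above, confirms pre-merge replication is working.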
clarkb | oh need to compare the config diffs too. Doing that now | 16:22 |
clarkb | the only files changed are the soy templates for email. We don't manage those and let gerrit decide what goes in them (this was also expected) | 16:23 |
tonyb | Okay | 16:24 |
clarkb | there was an exception caused by someone fetching a tgz archive but looking at it the tracebacks imply the reason was the peer (so client) closed the connection early and we couldn't write the complete archive to them (again I think this is fine) | 16:24 |
clarkb | down to 1300 tasks. Seems to mostly be change reindexing | 16:25 |
tonyb | +1 | 16:25 |
corvus | i think gertty may need some adjustment after this upgrade.... it's running into duplicate key errors, so i wonder if something about the whole change id rigamarole has changed enough to trip something up. | 16:26 |
clarkb | corvus: ack, let us know if you think that exposes a critical problem in general | 16:27 |
clarkb | all reindexing but changes is complete and the new versions lgtm | 16:27 |
corvus | oh interesting, it looks like the "new style" change id updated from `zuul%2Fzuul~master~Icd22461961f991cb0f50a19427c2182c19902d27` to `zuul%2Fzuul~920693` | 16:28 |
corvus | that's what's tripping up gertty... i don't think we use that in zuul (i'm not even sure if gertty uses it) | 16:29 |
clarkb | corvus: fwiw I don't think they called that out in the release notes either | 16:29 |
tonyb | That does seem like something that should have been called out :/ | 16:30 |
corvus | looks like {"id":"zuul%2Fzuul~920693","triplet_id":"zuul%2Fzuul~master~Icd22461961f991cb0f50a19427c2182c19902d27" is what the json says now | 16:31 |
clarkb | oh good you can easily refer to triplet id in that case I guess | 16:32 |
corvus | yeah, so that's probably the fix for gertty (or maybe just drop that column). i'm 99% sure this shouldn't affect zuul (searching for "id" is hard) | 16:33 |
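For client code like gertty that keys off the REST `id` field, the difference between the two formats is easy to sketch (illustrative helpers, not gertty's actual code; note the project name arrives URL-encoded in the JSON):

```python
from urllib.parse import unquote

def parse_change_id(new_id: str):
    """Split the new-style id "<project>~<number>" into its parts."""
    project, number = unquote(new_id).rsplit("~", 1)
    return project, int(number)

def parse_triplet_id(triplet: str):
    """Split the old triplet id "<project>~<branch>~<Change-Id>"."""
    project, branch, change_id = unquote(triplet).split("~")
    return project, branch, change_id

print(parse_change_id("zuul%2Fzuul~920693"))          # ('zuul/zuul', 920693)
print(parse_triplet_id(
    "zuul%2Fzuul~master~Icd22461961f991cb0f50a19427c2182c19902d27"))
```

Since the server still returns `triplet_id` alongside the new `id`, a client can key off either form.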
corvus | i think we're good, and i'll just try to fix gertty over the weekend | 16:34 |
clarkb | corvus: cool thank you for checking | 16:34 |
clarkb | reindexing of changes has completed so all online reindexing is done | 16:34 |
corvus | np | 16:34 |
clarkb | I get the sense that we can proceed with landing https://review.opendev.org/c/opendev/system-config/+/920412 ? | 16:35 |
clarkb | I've removed my WIP on that; if y'all approve it I'll take that as the signal to also remove hosts from the emergency file | 16:35 |
tonyb | Looks like it came from: https://gerrit.googlesource.com/gerrit/+/5073525af2791e778fa70fd977c986652de230f1 | 16:35 |
tonyb | clarkb: I'm good with that. | 16:36 |
clarkb | corvus: do you want to be the second ack on 920412? | 16:36 |
corvus | we do actually use the triplet format in zuul for posting reviews, but we generate it ourselves rather than using a value sent to us. i assume that will still work since the quickstart still works :) | 16:37 |
corvus | but that is a thing to pay attention to | 16:37 |
corvus | +3 on 920412 | 16:37 |
clarkb | corvus: ya and 920922 has successfully had things posted to it from zuul | 16:38 |
clarkb | I've removed the hosts from the emergency file | 16:39 |
corvus | groovy. the other thing to keep an eye out for is that behavior could be different between changes zuul has seen in the last 2 hours vs ones it has never seen | 16:39 |
corvus | (since there is a change cache and it expires 2 hours after a change hasn't been seen in any pipeline or event) | 16:39 |
corvus | 920922 being a new change should cover one of those cases | 16:40 |
clarkb | https://review.opendev.org/920430 is one that is in gate and should merge soon and has been in the gate since before the upgrade | 16:40 |
clarkb | to cover the other case | 16:40 |
clarkb | and it just merged so I think we're good for both cases? | 16:40 |
tonyb | \o/ | 16:41 |
corvus | yep! | 16:41 |
clarkb | 920412 is another example but will take longer :) | 16:41 |
corvus | i'm going to afk now; thanks clarkb and tonyb ! | 16:42 |
clarkb | corvus: and thank you! | 16:42 |
tonyb | The next infra-prod run will be in ~18 mins? | 16:43 |
clarkb | tonyb: the next hourly runs happen then but do not run service-review hourly. 920412 should merge and trigger the infra-prod-service-review job that we need to check | 16:43 |
clarkb | I'm going to quit the screen now and save the logfile | 16:43 |
tonyb | Okay x 2 | 16:43 |
clarkb | then once that is done and we've confirmed that docker-compose.yaml looks good we can probably consider this done for now. tonyb afterwards maybe we check that the new key name is in the clouds and then swap nodepool over? | 16:47 |
clarkb | tonyb: https://opendev.org/openstack/openstack-ansible/commits/branch/master is the easier way to check replication works but requires waiting for things to merge (that shows the commit for 920430 above that merged is in the repo on gitea) | 16:50 |
clarkb | tonyb: the change for the id triplet change shows release notes skip so ya they didn't put them in there | 16:50 |
tonyb | Sounds good. I did notice that it was 'skipped'. I disagree with that call but what's done is done | 16:52 |
clarkb | indeed | 16:52 |
tonyb | Seems like the desire is for clients to use the new form | 16:52 |
clarkb | fwiw I am not able to reproduce the tgz download connection errors from my browser | 16:56 |
clarkb | so I can only assume those are legit client side networking problems | 16:56 |
tonyb | Thanks for checking. | 16:57 |
clarkb | looks like we've got about an hour before the config update to 3.9 lands. I'm going to take a walk outside to let my eyes focus on things more than a meter in front of my face. I won't be far and I can keep an eye on IRC if anything important pops up | 17:00 |
tonyb | Sounds good to me | 17:00 |
tonyb | Thanks for doing all the planning and ensuring that was super smooth | 17:01 |
clarkb | and thank you for volunteering to be a second set of eyes/hands. ianw would offer to do gerrit upgrades solo. Not sure I've got that in me :) | 17:03 |
tonyb | hehe | 17:03 |
tonyb | clarkb: https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0&limit=6 Looks good | 17:32 |
clarkb | agreed | 17:34 |
clarkb | I think the merge of this change will race the next hour's hourly runs :/ so may be a little extra time before the job runs | 17:35 |
tonyb | Yeah :/ Oh well | 17:35 |
tonyb | For when you're home: I think I verified that the new key is installed on all clouds | 17:35 |
clarkb | ya I just got back | 17:36 |
tonyb | Okay | 17:36 |
clarkb | cool then the next step is to push a change to project-config that changes the key name in nodepool/nl*.yaml files | 17:36 |
clarkb | to match the new key name. Then new instances that are booted should use the new keys | 17:36 |
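In nodepool's OpenStack driver the keypair used for booted nodes is selected via a `key-name` attribute, so the project-config change amounts to swapping that value in the `nodepool/nl*.yaml` files. The fragment below is only illustrative — the provider, pool, label, and key names are made up; the real files carry the actual values:

```yaml
providers:
  - name: example-provider
    driver: openstack
    pools:
      - name: main
        labels:
          - name: ubuntu-noble
            key-name: infra-root-keys-2024  # new keypair name (illustrative)
```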
tonyb | On bridge there is a script kp-list.sh that I wrote and ran; the log is in kp_list.log. If they look good I can do the project-config change | 17:37 |
clarkb | looking | 17:37 |
tonyb | backup01 is still pruning BTW | 17:38 |
clarkb | tonyb: ovh SBG1 failed but we don't use that one so it's fine | 17:39 |
clarkb | wow | 17:39 |
clarkb | tonyb: does the log look like it is making progress at least? | 17:39 |
clarkb | tonyb: also the jenkins account name for rax was wrong and those failed | 17:40 |
tonyb | Yup | 17:40 |
clarkb | Still no reason to assume they didn't get updated in the rax cloud regions but we should double check | 17:40 |
tonyb | Oh Okay | 17:41 |
tonyb | #oops | 17:41 |
clarkb | ya I think that all looks good except needing to confirm the rax account regions got it too | 17:41 |
clarkb | I suspect they applied since they applied everywhere else. We just lack that info | 17:41 |
tonyb | Yup I fixed the script and it looks good now | 17:46 |
tonyb | kp_list2.log if you're interested | 17:47 |
clarkb | the review02 backups failed to backup01 due to a failure to grab locks. I assume this is directly related to the pruning process having the locks held | 17:47 |
clarkb | so that all looks "ok" | 17:47 |
tonyb | Yeah. | 17:47 |
clarkb | maybe this is a reason to not prune both servers at the same time. Avoids periods where backups can fail | 17:47 |
clarkb | yup second listing lgtm. I think we can update nodepool | 17:48 |
tonyb | Yeah makes sense, but backup02 did finish quickly | 17:48 |
tonyb | Next time I'll do them serially | 17:48 |
clarkb | tonyb: backup01 has larger disks iirc and possibly they are slower too? | 17:48 |
clarkb | I guess that is a tradeoff that isn't ideal but one we can live with | 17:48 |
tonyb | Yeah I think so. | 17:49 |
opendevreview | Merged opendev/system-config master: Update Gerrit image tag to 3.9 (from 3.8) https://review.opendev.org/c/opendev/system-config/+/920412 | 17:55 |
clarkb | sweet it should beat the hourly jobs | 17:55 |
tonyb | Nice | 17:55 |
opendevreview | Tony Breeds proposed openstack/project-config master: Switch nodepool over to the latest infra-root keyfile https://review.opendev.org/c/openstack/project-config/+/920927 | 17:58 |
clarkb | infra-prod-service-review was successful and docker-compose.yaml didn't change | 17:58 |
clarkb | now manage-projects is running | 17:58 |
tonyb | Great I was just about to check the log and file. | 17:59 |
clarkb | nodepool change lgtm. I didn't approve it as I want to make sure we're considering gerrit done before moving to the next thing in production | 18:00 |
clarkb | manage projects was also successful | 18:01 |
clarkb | from where I am sitting this all lgtm and I think we can consider this done for the day. There are a few followup items like updating the jeepyb image build triggers (change pushed for that), removing 3.8 image builds and adding 3.10 image builds and upgrade testing, etc that I think we can pick up next week | 18:02 |
clarkb | that way if anything comes up that does make us consider a revert we haven't gone super far in the other direction | 18:03 |
tonyb | Yeah sounds good to me. | 18:03 |
clarkb | with that in mind I guess we can proceed with updating keys on nodepool nodes. Do you want to +A or should I? | 18:04 |
tonyb | Go for it. | 18:04 |
clarkb | done | 18:05 |
tonyb | I'm going to need to step away soon for some late breakfast/lunch | 18:05 |
clarkb | ack I too will need food soon but can keep an eye on the nodepool stuff | 18:05 |
tonyb | I'm not leaving the house so I can also keep a weather eye on nodepool | 18:07 |
opendevreview | Merged openstack/project-config master: Switch nodepool over to the latest infra-root keyfile https://review.opendev.org/c/openstack/project-config/+/920927 | 18:17 |
clarkb | the configs appear to have updated on launchers. I'm trying to find a node that has booted recently enough to use the new key | 18:28 |
clarkb | so far everything I've ssh'd into is using the old key. I wonder if we need to restart launchers to pick up the change | 18:32 |
clarkb | I'll check one more host and if it hasn't updated we can restart I guess | 18:32 |
clarkb | tonyb: node 0037638133 at 104.130.140.223 | 18:34 |
clarkb | seems that restarting isn't going to be necessary as that is using the new key | 18:34 |
clarkb | that looks good to me. I think we just keep an eye out for any boot failures, but we checked the key exists so that shouldn't happen | 18:35 |
clarkb | and with that I'm going to take a break and find food etc. Feels like we got a lot done this morning. Thanks for the help | 18:36 |
tonyb | Thanks for everything. It's been a good day so far. | 18:39 |
tonyb | I can get into that node so that's good. | 18:40 |
tonyb | I'm not seeing any errors in grafana but I'll check again during the day | 18:41 |
Clark[m] | Argh I think my desktop may have crashed due to that AMD GPU bug again. No vtys though, so that idea for debugging the next time it happened is out | 19:03 |
Clark[m] | Confirmed. But no disk issues this time thankfully | 19:09 |
clarkb | and back. The really annoying thing is I'm on a kernel several versions newer than the first time this happened so no improvements/bugfixes yet. Oh well | 19:12 |
clarkb | I've switched to a proper screensaver now instead of blanking then turning off the display. Hopefully that prevents tripping over this bug in the first place | 19:23 |
opendevreview | Clark Boylan proposed opendev/system-config master: infra-prod-service-review depends on Gerrit 3.9 https://review.opendev.org/c/opendev/system-config/+/920937 | 21:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove Gerrit 3.8 images and related jobs https://review.opendev.org/c/opendev/system-config/+/920938 | 21:42 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add Gerrit 3.10 image builds and testing https://review.opendev.org/c/opendev/system-config/+/920939 | 21:42 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/920937/ and https://review.opendev.org/c/openstack/project-config/+/920922 are going to be good early followups to the gerrit upgrade. That just ensures our gerrit jobs are happily testing 3.9 properly and using it as a dependency etc | 21:43 |
clarkb | the other two changes I pushed above are less urgent and in fact I marked 920938 as WIP until we're confident a revert is unlikely | 21:44 |
clarkb | oh and 920922 is preventing 920938 from even testing at the moment | 21:44 |
clarkb | side note my monitor is still going into sleep mode. Going to have to figure that out | 21:49 |
corvus | clarkb: fungi prometheanfire i think instead of "fixing" gertty to deal with the gerrit id format change, i'm just going to delete my local db and let it repopulate. that should be fine. if anyone else feels like doing something to deal with it, i'd be happy to review the change, but at least for me, it's easy enough to delete and resubscribe. | 21:53 |
clarkb | corvus: I guess it will transparently use the new id values then? that seems like a reasonable solution | 21:55 |
clarkb | transparently use them if given a fresh db state | 21:56 |
corvus | yep, the issue is basically just a duplicate key error cause it thinks the old changes are different | 21:58 |
corvus | get rid of old data, no more problem. | 21:58 |
clarkb | I'm already starting to appreciate some of the UI tweaks in 3.9 | 22:01 |
clarkb | things are just delineated better in lists and comments and so on | 22:01 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!