Friday, 2024-05-31

fungiwhich jdk version do we need for gerrit 3.9?00:56
fungi11 seems to be too old, with 21 it looks like maybe gerrit isn't starting (tests are timing out)00:57
*** dmitriis is now known as Guest811001:29
Clark[m]1701:44
fungithanks, adjusting01:59
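
(For reference, the answer above is Java 17: 11 is too old for Gerrit 3.9 and 21 looks like it prevents startup in tests. A quick, hedged way to confirm what a deployment actually runs; the container name "gerrit" and image tag below are assumptions about the local setup:)

```shell
# Check the JDK inside the running Gerrit container (name assumed).
docker exec gerrit java -version

# Or inspect the image directly while the service is down; the
# image name/tag is an assumption, adjust to the deployment.
docker run --rm --entrypoint java opendevorg/gerrit:3.9 -version
```
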
opendevreviewJeremy Stanley proposed opendev/git-review master: Update the upper bound for Python and Gerrit tests  https://review.opendev.org/c/opendev/git-review/+/92084502:02
opendevreviewLukas Kranz proposed openstack/diskimage-builder master: Fix setting apt mirror for noble  https://review.opendev.org/c/openstack/diskimage-builder/+/92046606:18
clarkblooking at https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&pipeline=periodic&skip=0 the only daily infra-prod job that failed was the cloud launcher, which we know about and which doesn't directly impact the gerrit upgrade, so I think we're good from that perspective13:51
clarkbI'm going to sort out some breakfast and tea but I'm currently operating under the assumption that we will proceed as planned. At around 1500 UTC I'll put hosts in the emergency file so that is done well in advance of the 1600 outage window. We can send the service notice at 1600 and take gerrit down as soon as that is done reporting13:58
clarkbI might swap the target backup host for the db backup as the one in the pad just reported it needs pruning.13:59
clarkbthough really this db is relatively small and shouldn't have a big impact on that13:59
tonybsounds good.  I need to set up for the day and once that has happened I can prune the backups so we have options14:10
tonybit needs to happen anyway so we may as well do it now14:10
tonybclarkb: pruning is running on both backup servers14:27
clarkback (I think it was only the one that needed it but shouldn't be an issue to do both)14:27
tonybYeah I did backup02 recently but it isn't any harder to do both14:28
clarkbhttps://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/X2Z3BY4ON2KY3CC7E7QEPXCJGGREPK5M/ This is the upgrade downtime announcement14:44
clarkbhttps://etherpad.opendev.org/p/gerrit-upgrade-3.9 is the documented plan and testing and all that. Importantly it has the steps we'll be running through to do the upgrade14:44
clarkbtonyb: is the pruning still in progress? I guess I don't remember how long that took the last time I did it14:54
clarkbok hosts are in the emergency file15:10
clarkband root screen on review02 has been started15:11
clarkbI think we're ready to proceed at 1600 assuming the backup servers are still happy15:12
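
(Aside: the "emergency file" is the inventory on bridge that opts hosts out of automated Ansible runs, freezing their configuration during the outage window. A minimal sketch of the idea; the exact path is an assumption, check the system-config docs:)

```shell
# On bridge: hosts listed here are skipped by the periodic/deploy
# Ansible runs until removed again. The path is an assumption.
echo "review02.opendev.org" | sudo tee -a /etc/ansible/hosts/emergency
```
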
tonybbackup02 is done, backup01 is still running.15:13
clarkback thanks15:14
clarkb02 is the one I've got in the plan so that should be good to go15:14
tonybOkay.  I'm ready to roll when the clock strikes 9/11/1615:26
clarkbI find myself being impatient. Should've scheduled for 15:30 :)15:55
tonybLOL15:55
tonyb5 more mins15:56
clarkbI can't really do anything else because I know in the not distant future I'm going to be focused on one particular thing15:56
corvuswhile you're waiting: what's the latest on the git cache clone perms stuff?15:56
clarkbso I end up sitting here thinking "should've started sooner"15:56
clarkbcorvus: the only current problem I am aware of with that is in the cloud launcher job. All of the other daily infra-prod jobs we ran last night passed15:57
clarkbcorvus: https://zuul.opendev.org/t/openstack/builds?project=opendev%2Fsystem-config&pipeline=periodic&skip=015:57
corvuscool, will the same fix work for cloud launcher?15:57
clarkbcorvus: I think rebuilding of jammy and noble with the chown got those fixed and I suspect nothing else was really broken on the CI side15:57
clarkbcorvus: no, there the failure is due to root on bridge trying to install an ansible module (the cloud integration stuff I think) from a zuul owned repo15:58
clarkbcorvus: for that one we probably will just trust the path since it is a single path and we don't have to worry about thousands15:58
tonybSo same issue just in reverse15:58
clarkband not in our ci images15:58
tonybYeah15:58
tonybt-45s15:59
clarkbyup I'll send the notice momentarily15:59
tonyb++15:59
clarkb#status notice Gerrit on review.opendev.org is being upgraded to version 3.9 and will be offline. We have allocated an hour for the outage window lasting until 1700 UTC16:00
opendevstatusclarkb: sending notice16:00
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.9 and will be offline. We have allocated an hour for the outage window lasting until 1700 UTC16:00
clarkbI will run the docker compose down once that is done sending then continue to proceed through my list of steps on the etherpad16:00
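
(The etherpad linked above holds the authoritative steps; purely as a sketch of the shape of the operation, with the compose directory and service layout assumed:)

```shell
cd /etc/gerrit-compose   # assumed compose directory on review02

# Stop Gerrit cleanly once the status notice finishes sending.
docker-compose down

# ... backups and image verification happen here, per the etherpad ...

# Pull the 3.9 image and bring Gerrit back up; online reindexing
# starts automatically under the new version.
docker-compose pull
docker-compose up -d
```
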
tonybOkay16:01
clarkbcorvus: tonyb  or maybe we can change the clone from a local clone to a remote one. We don't really need to ensure the latest version of that at all times since it rarely changes16:02
tonybclarkb: I'm looking at it now.16:02
opendevstatusclarkb: finished sending notice16:02
clarkbalright proceeding with the shutdown16:03
clarkbboth halves of the backup look good to me (I tailed the log in another terminal)16:05
clarkbthe image that was pulled lgtm. I'm going to proceed with the actual upgrade16:09
clarkband starting gerrit nowish16:10
clarkb[main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.9.5-4-g782ac1b464-dirty ready16:11
clarkbreindexing has started too as expected16:11
clarkbshow-queue -w -q is quite impressive (lots of stuff to reindex)16:12
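
(show-queue is Gerrit's SSH admin command; run by an administrator it looks like this, where -w avoids truncating task names and -q groups tasks by queue:)

```shell
ssh -p 29418 admin@review.opendev.org gerrit show-queue -w -q
```
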
clarkbmy dashboard loads and the reported version looks correct16:13
clarkbthere are warnings like "Did not find a single change associated with parent revision sha1here" for some changes in the error log. I suspect this is an artifact of reindexing though, as the changes seem to load just fine for me and have working diffs16:14
clarkboh nope those are merge commits so there is no parent change in gerrit16:15
clarkbprobably a bit verbose to have warnings about that, as it is my understanding that is a perfectly normal operating state for gerrit16:15
clarkbtonyb: maybe you can recheck a change and see if it is happy? I've got a change queued up I'll push too16:16
opendevreviewClark Boylan proposed openstack/project-config master: Update jeepyb to run Gerrit 3.9 image builds  https://review.opendev.org/c/openstack/project-config/+/92092216:16
clarkbchange pushed ^16:16
clarkbhrm i don't see 920922 in zuul status yet16:17
tonybI have rechecked https://review.opendev.org/c/opendev/system-config/+/92076016:17
clarkbnevermind, 920922 is there, I am just impatient (as noted previously)16:17
clarkb920760 has enqueued now too so that is looking good16:18
tonybYup16:18
clarkbthe last thing on the check functionality list is to check replication is working16:18
clarkbI'll work on that16:19
clarkbthen once reindexing is confirmed to be done and successful we can probably remove hosts from the emergency file and approve the change to update our docker-compose config to match what I did by hand16:19
tonybSounds good, as I have no idea how to verify replication is working16:20
tonybOkay.16:20
clarkbwe can fetch the refs/changes/something/something path out of gitea to check replication of premerge items16:20
clarkbI'm figuring those paths out for 920922 now as I know that wasn't created until after the upgrade16:20
tonybAh okay.  I see16:21
clarkbtonyb: `git fetch origin refs/changes/22/920922/1`16:21
clarkbthat did work for me16:21
clarkbfor origin https://opendev.org/openstack/project-config (fetch)16:21
tonybOkay I know for next time16:22
clarkband git show FETCH_HEAD looks how I expect16:22
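
(Spelled out, the replication check is: clone from gitea, the replication target, and fetch the pre-merge ref for the new change. The refs/changes path is the change number's last two digits, the change number, and the patchset:)

```shell
git clone https://opendev.org/openstack/project-config
cd project-config

# Fetch patchset 1 of change 920922 from the gitea mirror.
git fetch origin refs/changes/22/920922/1

# If replication worked, this shows the pushed commit.
git show FETCH_HEAD
```
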
clarkboh need to compare the config diffs too. Doing that now16:22
clarkbthe only files changed are the soy templates for email. We don't manage those and let gerrit decide what goes in them (this was also expected)16:23
tonybOkay16:24
clarkbthere was an exception caused by someone fetching a tgz archive, but looking at it the tracebacks imply the peer (so the client) closed the connection early and we couldn't write the complete archive to them (again, I think this is fine)16:24
clarkbdown to 1300 tasks; they seem to mostly be change reindexing16:25
tonyb+116:25
corvusi think gertty may need some adjustment after this upgrade.... it's running into duplicate key errors, so i wonder if something about the whole change id rigamarole has changed enough to trip something up.16:26
clarkbcorvus: ack, let us know if you think that exposes a critical problem in general16:27
clarkball reindexing but changes is complete and the new versions lgtm16:27
corvusoh interesting, it looks like the "new style" change id updated from `zuul%2Fzuul~master~Icd22461961f991cb0f50a19427c2182c19902d27` to `zuul%2Fzuul~920693`16:28
corvusthat's what's tripping up gertty... i don't think we use that in zuul (i'm not even sure if gertty uses it)16:29
clarkbcorvus: fwiw I don't think they called that out in the release notes either16:29
tonybThat does seem like something that should have been called out :/16:30
corvuslooks like {"id":"zuul%2Fzuul~920693","triplet_id":"zuul%2Fzuul~master~Icd22461961f991cb0f50a19427c2182c19902d27" is what the json says now16:31
clarkboh good you can easily refer to triplet id in that case I guess16:32
corvusyeah, so that's probably the fix for gertty (or maybe just drop that column).  i'm 99% sure this shouldn't affect zuul (searching for "id" is hard)16:33
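
(The id change is easy to see in the REST API; a small sketch using the public endpoint. Gerrit prefixes JSON responses with the )]}' XSSI guard, hence stripping the first line:)

```shell
# Show the new numeric-style id alongside the old triplet id for
# change 920693 (the one quoted above).
curl -s "https://review.opendev.org/changes/920693" \
  | tail -n +2 \
  | python3 -c 'import json,sys; c = json.load(sys.stdin); print(c["id"], c.get("triplet_id"))'
# Expected shape per the JSON quoted above:
# zuul%2Fzuul~920693 zuul%2Fzuul~master~Icd22461961f991cb0f50a19427c2182c19902d27
```
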
corvusi think we're good, and i'll just try to fix gertty over the weekend16:34
clarkbcorvus: cool thank you for checking16:34
clarkbreindexing of changes has completed so all online reindexing is done16:34
corvusnp16:34
clarkbI get the sense that we can proceed with landing https://review.opendev.org/c/opendev/system-config/+/920412 ?16:35
clarkbI've removed my WIP on that; if y'all approve it I'll take that as the signal to also remove hosts from the emergency file16:35
tonybLooks like it came from: https://gerrit.googlesource.com/gerrit/+/5073525af2791e778fa70fd977c986652de230f116:35
tonybclarkb: I'm good with that.16:36
clarkbcorvus: do you want to be the second ack on 920412?16:36
corvuswe do actually use the triplet format in zuul for posting reviews, but we generate it ourselves rather than using a value sent to us.  i assume that will still work since the quickstart still works :)16:37
corvusbut that is a thing to pay attention to16:37
corvus+3 on 92041216:37
clarkbcorvus: ya and 920922 has successfully had things posted to it from zuul16:38
clarkbI've removed the hosts from the emergency file16:39
corvusgroovy.  the other thing to keep an eye out for is that behavior could be different between changes zuul has seen in the last 2 hours vs ones it has never seen 16:39
corvus(since there is a change cache and it expires 2 hours after a change hasn't been seen in any pipeline or event)16:39
corvus920922 being a new change should cover one of those cases16:40
clarkbhttps://review.opendev.org/920430 is one that is in gate and should merge soon and has been in the gate since before the upgrade16:40
clarkbto cover the other case16:40
clarkband it just merged so I think we're good for both cases?16:40
tonyb\o/16:41
corvusyep!16:41
clarkb920412 is another example but will take longer :)16:41
corvusi'm going to afk now; thanks clarkb and tonyb !16:42
clarkbcorvus: and thank you!16:42
tonybThe next infra-prod run will be in ~18 mins?16:43
clarkbtonyb: the next hourly runs happen then, but they do not include service-review. 920412 should merge and trigger the infra-prod-service-review job that we need to check16:43
clarkbI'm going to quit the screen now and save the logfile16:43
tonybOkay x 216:43
clarkbthen once that is done and we've confirmed that docker-compose.yaml looks good we can probably consider this done for now. tonyb: afterwards maybe we check that the new key name is in the clouds and then swap nodepool over?16:47
clarkbtonyb: https://opendev.org/openstack/openstack-ansible/commits/branch/master is the easier way to check that replication works, but it requires waiting for things to merge (it shows that the commit for 920430, which merged above, is in the repo on gitea)16:50
clarkbtonyb: the change for the id triplet change is marked 'Release-Notes: skip', so ya they didn't put them in there16:50
tonybSounds good.  I did notice that it was 'skipped'. I disagree with that call, but what's done is done16:52
clarkbindeed16:52
tonybSeems like the desire is for clients to use the new form16:52
clarkbfwiw I am not able to reproduce the tgz download connection errors from my browser16:56
clarkbso I can only assume those are legit client side networking problems16:56
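
(For the record, the download being exercised is Gerrit's standard archive endpoint; reproducing it from the command line looks roughly like this, with the change/revision values illustrative:)

```shell
# A client hanging up mid-transfer here is what produces the
# (harmless) exceptions seen in Gerrit's error log.
curl -o archive.tgz \
  "https://review.opendev.org/changes/920922/revisions/1/archive?format=tgz"
```
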
tonybThanks for checking.16:57
clarkblooks like we've got about an hour before the config update to 3.9 lands. I'm going to take a walk outside to let my eyes focus on things more than a meter in front of my face. I won't be far and I can keep an eye on IRC if anything important pops up17:00
tonybSounds good to me17:00
tonybThanks for doing all the planning and ensuring that was super smooth17:01
clarkband thank you for volunteering to be a second set of eyes/hands. ianw would offer to do gerrit upgrades solo. Not sure I've got that in me :)17:03
tonybhehe17:03
tonybclarkb: https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0&limit=6 Looks good17:32
clarkbagreed17:34
clarkbI think the merge of this change will race the next hour's hourly runs :/ so there may be a little extra time before the job runs17:35
tonybYeah :/  Oh well17:35
tonybWhen you're home: I think I've verified that the new key is installed on all clouds17:35
clarkbya I just got back17:36
tonybOkay17:36
clarkbcool then the next step is to push a change to project-config that changes the key name in nodepool/nl*.yaml files17:36
clarkbto match the new key name. Then new instances that are booted should use the new keys17:36
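
(A hedged sketch of what that project-config edit looks like in a launcher config; the provider, label, and key names here are placeholders, not the real values:)

```yaml
# nodepool/nl01.opendev.org.yaml (illustrative excerpt)
providers:
  - name: example-provider
    pools:
      - name: main
        labels:
          - name: ubuntu-noble
            # Point new boots at the freshly distributed keypair;
            # the key name below is a placeholder.
            key-name: infra-root-keys-2024-05
```
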
tonybOn bridge there is a script kp-list.sh that I wrote and ran; the log is in kp_list.log. If they look good I can do the project-config change17:37
clarkblooking17:37
tonybbackup01 is still pruning BTW17:38
clarkbtonyb: ovh SBG1 failed but we don't use that one so its fine17:39
clarkbwow17:39
clarkbtonyb: does the log look like it is making progress at least?17:39
clarkbtonyb: also the jenkins account name for rax was wrong and those failed17:40
tonybYup17:40
clarkbStill no reason to assume they didn't get updated in the rax cloud regions but we should double check17:40
tonybOh  Okay17:41
tonyb#oops17:41
clarkbya I think that all looks good except needing to confirm the rax account regions got it too17:41
clarkbI suspect they applied since they applied everywhere else. We just lack that info17:41
tonybYup I fixed the script and it looks good now17:46
tonybkp_list2.log if you're interested17:47
clarkbthe review02 backups failed to backup01 due to a failure to grab locks. I assume this is directly related to the pruning process having the locks held17:47
clarkbso that all looks "ok"17:47
tonybYeah.17:47
clarkbmaybe this is a reason to not prune both servers at the same time. Avoids periods where backups can fail17:47
clarkbyup second listing lgtm. I think we can update nodepool17:48
tonybYeah makes sense, but backup02 did finish quickly17:48
tonybNext time I'll do them serially17:48
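
(These are borg repositories under the hood, so the pruning and the lock contention above look roughly like this; the repository paths and retention values are assumptions, and opendev wraps this in its own scripts:)

```shell
# Prune one backup server at a time: prune holds the repository
# lock, so a concurrent backup to the same repo fails to lock.
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
    /opt/backups/borg-review02

# If an interrupted prune leaves a stale lock behind:
borg break-lock /opt/backups/borg-review02
```
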
clarkbtonyb: backup01 has larger disks iirc and possibly they are slower too?17:48
clarkbI guess that is a tradeoff that isn't ideal but one we can live with17:48
tonybYeah I think so.17:49
opendevreviewMerged opendev/system-config master: Update Gerrit image tag to 3.9 (from 3.8)  https://review.opendev.org/c/opendev/system-config/+/92041217:55
clarkbsweet it should beat the hourly jobs17:55
tonybNice17:55
opendevreviewTony Breeds proposed openstack/project-config master: Switch nodepool over to the latest infra-root keyfile  https://review.opendev.org/c/openstack/project-config/+/92092717:58
clarkbinfra-prod-service-review was successful and docker-compose.yaml didn't change17:58
clarkbnow manage-projects is running17:58
tonybGreat I was just about to check the log and file.17:59
clarkbnodepool change lgtm. I didn't approve it as I want to make sure we're considering gerrit done before moving to the next thing in production18:00
clarkbmanage projects was also successful18:01
clarkbfrom where I am sitting this all lgtm and I think we can consider this done for the day. There are a few followup items like updating the jeepyb image build triggers (change pushed for that), removing 3.8 image builds and adding 3.10 image builds and upgrade testing, etc that I think we can pick up next week18:02
clarkbthat way if anything comes up that does make us consider a revert we haven't gone super far in the other direction18:03
tonybYeah sounds good to me.18:03
clarkbwith that in mind I guess we can proceed with updating keys on nodepool nodes. Do you want to +A or should I?18:04
tonybGo for it.18:04
clarkbdone18:05
tonybI'm going to need to step away soon for some late breakfast/lunch18:05
clarkback I too will need food soon but can keep an eye on the nodepool stuff18:05
tonybI'm not leaving the house so I can also keep a weather eye on nodepool18:07
opendevreviewMerged openstack/project-config master: Switch nodepool over to the latest infra-root keyfile  https://review.opendev.org/c/openstack/project-config/+/92092718:17
clarkbthe configs appear to have updated on launchers. I'm trying to find a node that has booted recently enough to use the new key18:28
clarkbso far everything I've ssh'd into is using the old key. I wonder if we need to restart launchers to pick up the change18:32
clarkbI'll check one more host and if it hasn't updated we can restart I guess18:32
clarkbtonyb: node 0037638133 at 104.130.140.22318:34
clarkbseems that restarting isn't going to be necessary as that is using the new key18:34
clarkbthat looks good to me. I think we just keep an eye out for any boot failures, but we checked the key exists so that shouldn't happen18:35
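
(Verifying that a node booted with the new key is just an ssh check with the key pinned; the key path below is an assumption:)

```shell
# Offer only the new infra-root key; a successful login means the
# launcher provisioned the node with it.
ssh -o IdentitiesOnly=yes -i ~/.ssh/infra-root-2024 root@104.130.140.223 hostname
```
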
clarkband with that I'm going to take a break and find food etc. Feels like we got a lot done this morning. Thanks for the help18:36
tonybThanks for everything.  It's been a good day so far.18:39
tonybI can get into that node so that's good.18:40
tonybI'm not seeing any errors in grafana but I'll check again during the day18:41
Clark[m]Argh, I think my desktop may have crashed due to that AMD GPU bug again. No vtys though, so that idea for debugging the next time it happened is out19:03
Clark[m]Confirmed. But no disk issues this time thankfully19:09
clarkband back. The really annoying thing is I'm on a kernel several versions newer than the first time this happened, so no improvements/bugfixes yet. Oh well19:12
clarkbI've switched to a proper screensaver now instead of blanking then turning off the display. Hopefully that prevents tripping over this bug in the first place19:23
opendevreviewClark Boylan proposed opendev/system-config master: infra-prod-service-review depends on Gerrit 3.9  https://review.opendev.org/c/opendev/system-config/+/92093721:42
opendevreviewClark Boylan proposed opendev/system-config master: Remove Gerrit 3.8 images and related jobs  https://review.opendev.org/c/opendev/system-config/+/92093821:42
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.10 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/92093921:42
clarkbhttps://review.opendev.org/c/opendev/system-config/+/920937/ and https://review.opendev.org/c/openstack/project-config/+/920922 are going to be good early followups to the gerrit upgrade. That just ensures our gerrit jobs are happily testing 3.9 properly and using it as a dependency etc21:43
clarkbthe other two changes I pushed above are less urgent, and in fact I marked 920938 as WIP until we're confident a revert is unlikely21:44
clarkboh and 920922 is preventing 920938 from even testing at the moment21:44
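
(The dependency added in 920937 is expressed in Zuul job configuration; a hedged sketch of the pattern, with job names illustrative rather than verified against system-config:)

```yaml
- job:
    name: infra-prod-service-review
    dependencies:
      # Deploy only after the Gerrit 3.9 image job in the same
      # buildset succeeds; "soft" skips the edge when that image
      # job doesn't run.
      - name: system-config-upload-image-gerrit-3.9
        soft: true
```
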
clarkbside note my monitor is still going into sleep mode. Going to have to figure that out21:49
corvusclarkb: fungi prometheanfire i think instead of "fixing" gertty to deal with the gerrit id format change, i'm just going to delete my local db and let it repopulate.  that should be fine.  if anyone else feels like doing something to deal with it, i'd be happy to review the change, but at least for me, it's easy enough to delete and resubscribe.21:53
clarkbcorvus: I guess it will transparently use the new id values then? that seems like a reasonable solution21:55
clarkbtransparently use them if given a fresh db state21:56
corvusyep, the issue is basically just a duplicate key error because it thinks the old changes are different21:58
corvusget rid of old data, no more problem.21:58
clarkbI'm already starting to appreciate some of the UI tweaks in 3.922:01
clarkbthings are just delineated better in lists and comments and so on22:01
