Sunday, 2025-12-07

opendevreviewMerged opendev/zuul-providers master: Remove build_diskimage_image_name variable  https://review.opendev.org/c/opendev/zuul-providers/+/95637302:15
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: wip  https://review.opendev.org/c/opendev/zuul-providers/+/97004911:38
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: wip  https://review.opendev.org/c/opendev/zuul-providers/+/97004911:41
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Remove build_diskimage_image_name variable  https://review.opendev.org/c/opendev/zuul-providers/+/97004911:44
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: wip  https://review.opendev.org/c/opendev/zuul-providers/+/97004211:48
opendevreviewDmitriy Chubinidze proposed opendev/zuul-providers master: Use standard x86-64 for AlmaLinux 10 image build  https://review.opendev.org/c/opendev/zuul-providers/+/97004212:51
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Use standard x86-64 for AlmaLinux 10 image build  https://review.opendev.org/c/opendev/zuul-providers/+/97004213:14
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Use standard x86-64 for AlmaLinux 10 image build  https://review.opendev.org/c/opendev/zuul-providers/+/97004213:15
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Use standard x86-64 for AlmaLinux 10 image build  https://review.opendev.org/c/opendev/zuul-providers/+/97004213:17
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Limit arm64 image builds to producing raw images  https://review.opendev.org/c/opendev/zuul-providers/+/96802913:20
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Limit arm64 image builds to producing raw images  https://review.opendev.org/c/opendev/zuul-providers/+/96802913:20
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Limit arm64 image builds to producing raw images  https://review.opendev.org/c/opendev/zuul-providers/+/96802913:25
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Remove build_diskimage_image_name variable  https://review.opendev.org/c/opendev/zuul-providers/+/97004914:44
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Remove build_diskimage_image_name variable  https://review.opendev.org/c/opendev/zuul-providers/+/97004914:44
opendevreviewIvan Anfimov proposed opendev/zuul-providers master: Remove build_diskimage_image_name variable  https://review.opendev.org/c/opendev/zuul-providers/+/97004914:47
fungi5.5 hours until the gerrit upgrade maintenance window starts (21:00 utc)15:30
fungiin 4.5 hours we can send the status notice15:30
clarkbI resemble some sort of operational human being. I've got a change staged locally that adds system-config-core to bindep's gerrit acls for code review +/-2. I figure I can push that up post upgrade which will help us check functionality of gerrit itself and replication. Then if we merge that it will check if project acl updates work16:35
clarkbthen we can revert it whenever is convenient. I think it is functionally a noop for us but gerrit won't see it that way16:35
clarkbotherwise I plan to do normal weekend things for a bit but should be back around 2000 UTC16:36
clarkb#status notice Gerrit on review.opendev.org is being upgraded to version 3.11 and will be offline starting at 2100 UTC. We have allocated an hour for the outage window lasting until 2200 UTC19:59
opendevstatusclarkb: sending notice19:59
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.11 and will be offline starting at 2100 UTC. We have allocated an hour for the outage window lasting until 2200 UTC19:59
clarkbI've updated the emergency.yaml file on bridge as well if anyone wants to double check that20:01
clarkbhttps://etherpad.opendev.org/p/gerrit-upgrade-3.11 is the plan we'll be following20:01
opendevstatusclarkb: finished sending notice20:02
fungii see the gerrit, gitea, storyboard and zuul scheduler servers in the emergency disable list20:07
clarkbfungi: and they all look typed correctly?20:07
fungiyes20:07
clarkbexcellent20:07
clarkbany preferences on who drives? I think the main consideration is whoever drives should start the root screen so that they can size the terminal for their needs (I'm happy to drive as I've gone through the process on the test node a few times, but I'll make a larger terminal window than fungi will for example)20:08
clarkbgoing to eat a quick lunch now while we wait for the 2100 timestamp20:11
fungii'm happy to cut and paste commands from the etherpad20:13
fungii've started a root screen session on review03 in any case20:14
clarkbthanks I'll join soon. Has it been ^a H'd for logging purposes?20:18
clarkbI did look and zuul nodes seem to have upgraded as expected yesterday. So we should do the manual restart of web and schedulers and that won't interfere with the automated background process20:19
fungii did not turn on hardcopy, but can20:20
fungiit's logging now20:20
clarkbthanks. It is one of the items noted in the task list20:21
clarkband I have attached to the screen session now20:21
clarkbnote item 7 isn't done yet. That's another reminder at 2100. But not a big deal I won't forget to send that one20:22
fungiah, good point, we could have moved #8 before #720:25
clarkbya the order in the etherpad is probably backwards20:25
tonybdid I do the math wrong?20:25
clarkbtonyb: no we are 35 minutes away20:26
tonybphew!20:26
fungiyou're early20:26
clarkbwe're just sending notices early (we got feedback once that letting people know ahead of time like this allows them to save comments and fetch changes locally if necessary)20:26
clarkband then doing the other steps that can be done before we turn anything off20:26
tonybah okay 20:27
tonybI'll make another coffee and then I'll be ready 20:27
fungiyeah, trying to do as much as we can ahead of the service outage in order to keep it as brief as possible20:27
tonybsounds good20:30
clarkbI added a few more notes about restarting the zuul services to the etherpad. They weren't there before because my testing didn't have a zuul. But I did this once semi recently to pick up some changes and you basically stop web on one scheduler and scheduler on the other so they can spin their wheels on each server without any service interruption and minimizing impact to each server20:30
tonybwhere is the etherpad?20:39
clarkbtonyb: https://etherpad.opendev.org/p/gerrit-upgrade-3.1120:40
tonybThanks I looked at gerrit-3.11-upgrade ;P20:40
tonybI don't see the zuul pause step corvus mentioned last time20:43
tonybwhich I expect we should do before we down the containers20:44
fungi15 minutes until go time20:45
clarkboh ya I guess we could do that20:45
clarkbthat said we announced this more than a week in advance so I'm not super concerned about it20:46
tonybdoesn't it stop in flight testing from failing if gerrit is down?20:46
clarkbtonyb: yes, it pauses reporting to gerrit so you avoid failed reports (and merges if things are gating)20:47
clarkbthe feature didn't exist last time we upgraded gerrit which is why it isn't on the list already20:47
clarkbcurrently there is one check and one experimental change enqueued in all of zuul from what I can see20:48
clarkbso impact is non zero but also minimal20:48
tonybYeah.   I added what I think is the correct command to the etherpad20:49
tonybmostly for "correctness" today20:49
clarkbthat looks correct20:49
clarkbtonyb: I added the unpause command on list item 2420:50
tonybThanks.  I was going to figure that out next :)20:51
clarkbwhen I send the status notice at 2100 tonyb do you want to pause zuul then let fungi know when it is safe to stop gerrit and start the process there?20:53
tonyback20:53
fungi4 minutes20:56
fungi60 seconds20:59
clarkbI'll send the status notice as soon as I see my clock tick over to 210020:59
tonybAs it has minimal impact today I've paused zuul20:59
clarkb#status notice Gerrit on review.opendev.org is being upgraded to version 3.11 and will be offline momentarily. We have allocated an hour for the outage window lasting until 2200 UTC21:00
opendevstatusclarkb: sending notice21:00
fungithanks tonyb21:00
-opendevstatus- NOTICE: Gerrit on review.opendev.org is being upgraded to version 3.11 and will be offline momentarily. We have allocated an hour for the outage window lasting until 2200 UTC21:00
clarkbI see the banner on the zuul status page too21:00
fungii can in theory down gerrit now, while the notifications are going out21:00
clarkbfungi: yes i think we can start. We already did the one hour warning too21:00
clarkbwe expect this step to take upwards of 5 minutes too21:01
fungitoo bad we didn't get the matrix config in place yet, would have been a great functional test21:02
fungibut at least it doesn't seem like we broke it21:02
clarkbya I'm sure we'll find something new to test the cross platform notices against21:03
opendevstatusclarkb: finished sending notice21:03
fungimaybe we'll have it installed by the time i do the project rename maintenance on friday21:03
fungigerrit is finally down, bringing mariadb back up now and backing it up21:04
clarkbit stopped quicker than the timeout too which is cool21:04
clarkbfungi: give mariadb a few seconds before starting the backup21:04
fungiyeah, 205.7s21:04
clarkbjust to be sure it is up and running before we back it up21:04
tonybLooks like it stopped within the timeout value21:04
clarkbthis is probably long enough.21:05
fungiit doesn't seem to have errored, so maybe the second or so i gave it was enough21:05
clarkbfungi: ya I think it backs up the fs first too before the db21:05
clarkbso there is a built in delay anyway as well21:05
fungithe backup exited 0, want to check anything else?21:05
clarkbI pulled up the log file in another terminal and it looks ok to me21:06
clarkbI think we can proceed21:06
fungi3.9M    /var/log/borg-backup-backup02.ca-ymq-1.vexxhost.opendev.org.log21:06
tonyb++21:06
fungioh, i guess that doesn't tell us much without knowing what was in the log before21:06
clarkbthe timestamps look correct21:06
fungiterminating with success status, rc 021:06
clarkband db backup returned rc 0. fs backup returned rc 1 which means there were warnings21:07
fungiyeah, log looks like it recorded a successful backup21:07
clarkbusually some file updates as it goes, maybe our screenlog for example21:07
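The return codes discussed above can be read with a small helper. This is only a sketch, assuming borg's classic exit status scheme (0 for success, 1 for warning, 2 for error, 128+N when killed by signal N); the function name `classify_borg_rc` is hypothetical and is not part of the backup tooling used here.

```python
def classify_borg_rc(rc: int) -> str:
    """Map a borg exit status to a coarse label.

    Assumes the classic borg scheme: 0 success, 1 warning,
    2 error, 128+N killed by signal N.
    """
    if rc == 0:
        return "success"
    if rc == 1:
        return "warning"
    if rc >= 128:
        return "killed by signal %d" % (rc - 128)
    return "error"


# The two results reported in the channel: db backup rc 0, fs backup rc 1.
for name, rc in [("db", 0), ("fs", 1)]:
    print(name, classify_borg_rc(rc))
```

A warning status typically still leaves a usable archive, which matches the reading that a file changing during the run (such as the screen log) is the likely cause.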
fungimariadb is down again, backing up configs next21:07
fungiand indices21:07
clarkbthis step isn't strictly necessary since indexes don't update but I left it in place because it seems like good belts and suspenders21:08
fungitakes a moment, doesn't it21:10
clarkbya :/ I think it may keep older indexes around which we might be able to clean up or exclude from the copy potentially but I think keeping steps like this simple is probably worth the tradeoff21:10
clarkbit appears to be copying the last changes index should be done soon I hope21:11
fungimoving replication tasks aside now21:12
fungiand cleaning up h2 caches21:12
fungimv: target '/home/gerrit2/review_site/cache/modified_files.h2.db': Not a directory21:12
clarkbfungi: I think you cut off the end of that command21:12
fungiforget a target directory?21:13
clarkbyes it's in there on the etherpad21:13
fungioh!21:13
fungithere we go21:13
fungivery long line, wrapped in the browser and i didn't spot it21:13
clarkbthat error should've prevented any copies we don't want going into the proper cache dir right?21:13
fungicorrect21:13
clarkbbasically it nooped rather than doing anything21:13
clarkbcool just want to make sure we didn't accidentally load up the cache with bad data 21:14
fungiit was a command parsing error, so nothing was executed21:14
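The failure mode above (several sources passed to `mv` with the final directory argument cut off) can be guarded against by checking the target before touching any file. The following is a minimal sketch, not the command from the etherpad; `move_into` is a hypothetical helper and the paths are supplied by the caller.

```python
import shutil
from pathlib import Path


def move_into(sources, target_dir):
    """Move every source into target_dir, refusing to start if the
    target is not an existing directory (so nothing is half-moved)."""
    target = Path(target_dir)
    if not target.is_dir():
        raise NotADirectoryError("target %r is not a directory" % str(target))
    moved = []
    for src in sources:
        # shutil.move places src inside an existing directory target
        # and returns the resulting path.
        moved.append(shutil.move(str(src), str(target)))
    return moved
```

Like the `mv` error seen in the session, the check fails before any file is moved, so a truncated command leaves the cache directory untouched.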
fungicompose file updates now21:14
clarkbya timestamps in the existing cache dir contents look fine21:14
fungilooks correct to me21:14
tonyband me21:15
fungiready to pull images next21:15
clarkbready21:15
fungiunderway21:15
clarkbfungi: that inspect command requires some manual var updates21:15
fungioh, yep21:16
clarkb127c7 is the image id we want to inspect I think21:16
clarkbboth of those hashes lgtm21:17
tonyb++21:18
clarkbI think we can start mariadb give it 20 seconds then do the site init21:18
fungihttps://quay.io/repository/opendevorg/gerrit/manifest/sha256:065f7b03859065a2ba2305cab7c18bac77778c3f452e5fd0cc9c92ac11d24fa5 has an unknown entry, what's that?21:18
clarkbfungi: it's some artifact of how we build images I think because we're using the multi arch builds universally now21:18
fungiokay21:19
clarkbfungi: the client side ignores that because they don't have a matching arch21:19
fungialso worth noting, that tls deprecation banner still links to a "subscriber only" kb article, there's a jira ticket filed about that courtesy of TheJulia but i haven't seen any activity on it for the month it's been open yet21:20
fungiokay, gerrit init time21:20
clarkbyup lets start mariadb then wait ~20 seconds then do the gerrit init21:20
fungiwaiting21:21
fungiit's been about 20 seconds21:21
fungiproceeding21:21
clarkbthat looks great just as expected21:21
TheJuliafungi: It got assigned to someone, but also have not seen any activity otherwise.21:22
clarkbI'm ready to start gerrit if you are21:22
fungilgtm too, time to bring gerrit up?21:22
fungistarting21:22
fungithanks again TheJulia!21:22
fungii started tailing the gerrit error_log in a second screen window just to avoid polluting the primary21:22
clarkb[2025-12-07T21:22:31.336Z] [main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.11.7-2-g46f2be98bb-dirty ready21:22
clarkbthat traceback is a new one for me21:23
clarkbbut it also says it is trying to delete a trash file that wasn't there so it's probably ok?21:23
fungii'm surprised that "i wanted to delete a nonexistent file" is an error, but whatevs21:24
fungiare we all set for tonyb to un-pause zuul then?21:24
clarkbthe web ui is up for me and I appear to still be logged in21:24
clarkbdiffs load too so yes I think we can unpause zuul21:24
clarkbI made a local note of that deletion traceback and can ask upstream about it later21:25
clarkband no config diff is good means the testing was accurate21:25
fungithe config diff at step #26.2 returns an empty result, yes21:25
tonybdone21:26
clarkbI'm going to propose that bindep acl update now which should test several things for us21:26
opendevreviewClark Boylan proposed openstack/project-config master: Add system-config-core to bindep ACLs  https://review.opendev.org/c/openstack/project-config/+/97009121:26
fungithanks21:26
fungi`gerrit show-queue -w -q` lists only 17 tasks21:27
clarkbya that is expected since there is no index update21:27
clarkbit should largely be business as usual on startup here21:27
fungik, makes sense21:27
clarkbthat delete trash files exception occurred against content in All-Users21:27
fungiready for me to exit the screen session and back up the log?21:27
clarkbdoes someone want to log out and log back in just to sanity check that works?21:27
fungii can21:28
tonyb970091 looks good to me, enqueued in zuul, replicated to at least one gitea21:28
clarkbfungi: lets hold off on closing screen until we've gone through our short list of checks21:28
tonybI have logged in since the update and that's fine21:28
fungi i was able to log out and into the webui21:28
clarkbcool thanks for checking21:29
clarkbrecheck is the last item on the things to check list that I haven't seen checked yet21:29
* fungi looks for a candidate21:29
clarkbhttps://review.opendev.org/c/openstack/project-config/+/96984621:30
clarkbI just rechecked this one since it only runs one job it is cheap21:30
clarkband I see it in the zuul status page now21:30
fungicool, thanks21:30
fungiyeah, i agree it enqueued21:31
clarkbI detached from screen. I think we can probably shut it down and move the log file into its more permanent home21:31
clarkbthen we have a few tasks listed in an order that don't necessarily need to be in that specific order21:31
clarkbspecifically the zuul web and scheduler restarts should be able to happen while we do other things21:31
fungiokay, did `mv /root/screenlog.0 /home/gerrit2/tmp/upgrade-3.11` after shutting down the session21:32
clarkbtonyb: do you want to review and approve https://review.opendev.org/c/opendev/system-config/+/968349 ? I can clean up the emergency file list21:32
clarkbonce ^ is in and we're happy then we can merge the bindep acl update if we think that is safe enough to test things21:33
tonybDone.21:34
clarkboh shoot that is going to take at least an hour to gate21:34
clarkbmaybe we don't need to trigger the gitea jobs on these gerrit changes...21:34
clarkbthat's fine we have other things to do while we wait like restarting zuul services. Does someone else want to do that step? I wrote down the directions on the etherpad basically we do web on one node and scheduler on the other. Wait for both to come back into the cluster then flip the two around and do the services the other way around21:35
clarkbI can do it too if we prefer21:35
clarkbbut I think things are looking happy other than that one unexpected exception21:37
clarkbthe expected exception showed up too and since those two there haven't been any others21:37
tonybI can do the zuul restarts21:37
tonybThe directions make sense, other than not rolling restarting the executors21:38
clarkbtonyb: we only need to restart scheduler and web because they are the only ones that interact with code review systems21:38
clarkbtonyb: and the restart is because zuul asks gerrit for its version number on startup21:38
clarkb(executors don't do that so we can leave them be)21:38
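The restart order described above (web on one node and scheduler on the other, then swapped once both have rejoined the cluster) can be written out as a two-phase plan. This is an illustrative sketch only: `restart_phases` is a hypothetical helper, it runs no commands, and the host name zuul01 is an assumption (only zuul02 is named in this log).

```python
def restart_phases(host_a, host_b, services=("web", "scheduler")):
    """Return two phases of (host, service) restarts.

    In each phase the two hosts restart different services, so one
    running web and one running scheduler remain available throughout.
    """
    first, second = services
    return [
        [(host_a, first), (host_b, second)],   # phase 1
        [(host_a, second), (host_b, first)],   # phase 2: swapped
    ]


for number, phase in enumerate(restart_phases("zuul01", "zuul02"), start=1):
    for host, service in phase:
        print("phase %d: restart %s on %s" % (number, service, host))
```

Each phase should only begin after the previous phase's components show up again in the components list.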
tonybOk21:39
clarkbalso we don't have to test acl updates today I don't think. I suspect that if anything goes wrong there we won't be downgrading but will instead be updating acls which we can do with services running. So if we want to end the process at gerrit 3.11 is the image version in config management that is probably good enough? Though seeing manage-projects run regardless is always a good21:39
clarkbthing21:39
fungiseems sufficient to m21:41
fungiw21:41
fungime21:41
clarkbbut also the downgrade here is relatively cheap as they go since there is no reindexing21:41
fungii'll be around fairly early tomorrow in case something has gone terribly wrong, have a morning meeting to be up for anyway21:42
clarkbso if we do find need to downgrade later it is slightly less painful than usual21:42
fungiyeah21:42
clarkbok 968349 will trigger manage-projects but as a noop21:44
clarkbI think if that comes back happy then we can worry about testing the positive case tomorrow21:44
fungiwfm21:44
clarkb(a noop because that change doesn't include acl updates and jeepyb should do very little as a result)21:44
fungiwe don't make acl changes that often21:44
clarkbtonyb: looks like the first half of the zuul restarting has completed based on the components list21:46
clarkboh line 300 has a post upgrade task of checking if Blocked Users were added to All-Projects acls21:48
clarkbagain that's mostly so we can update our documentation if it did and not anything urgent so I'm happy to defer that but want to call it out as a task that could be done now21:49
clarkbthere is a certain part of me that is trying to optimize the work done today for the fact that it is the weekend :)21:50
tonybYup, sorry it took longer than expected and I got distracted21:50
clarkbtonyb: it is a lot faster today than in the past. It took about 20 minutes before but takes about 5 now21:50
tonybNoted21:52
clarkbI was leaving things in the emergency file for now since the hourlies are about to start and the change we want to run jobs is some time away from merging21:54
clarkbbut thinking about it further we don't run jobs in hourly that should affect any of these nodes really (zuul is the only ones but zuul should be ok)21:54
clarkbany concern with me removing the nodes from the file now?21:54
fungiyeah, starting 5 minutes out21:54
fungishould be safe21:54
fungiwould be good to see them run before i call it done for me21:55
clarkbyup same here21:55
clarkbemergency file is cleaned up21:55
fungithanks!21:55
tonybzuul-web on zuul02 is that last thing I'm waiting on21:56
clarkbnew exception but I think this is a common one: Connection reset by peer21:56
tonyband done21:57
clarkbbasically crawlers grab git repo archive files then time out before they can be put together and close the connection then gerrit complains21:57
clarkbI've seen it before on older gerrit so not concerned about it being related to the upgrade21:57
clarkbcool I think we're just waiting on that change to merge and trigger manage-projects and infra-prod-service-review. Both should noop for us21:58
clarkbheh jetty has a routine called eat what you kill22:02
opendevreviewMerged opendev/system-config master: Bump Gerrit container image to 3.11  https://review.opendev.org/c/opendev/system-config/+/96834922:07
clarkbthat was much quicker than expected jobs probably ran on raxflex22:07
clarkband the two jobs we want to see have queued up22:07
clarkbthey are waiting for hourlies to finish22:08
clarkbDec  7 21:14 docker-compose.yaml <- in theory that modtime won't change22:08
clarkbok service-review finished successfully and that mod time did not change so that looks good22:12
clarkbmanage-projects is running now22:12
clarkbof course as that is happening I realize we never updated gerritlib's integration testing to gerrit 3.11...22:12
clarkbI'll make a note of that22:12
fungioh, yep22:12
clarkbI think we use the upstream images in that job so we could theoretically test 3.11 3.12 and 3.1322:14
clarkbwhich is a nice way of getting ahead of it and not even needing to forget for the next couple of upgrades22:14
fungisgtm22:14
fungii guess we want to try to power through 3.12 to 3.13 early in 202622:15
clarkbyes, though we have to get on java 21 with 3.11 first22:15
fungiaha, yeah that's the next step22:15
clarkb3.11 supports java 17 and 21 so is the transition release. We're running with 17 right now22:16
clarkb3.12 only supports java 2122:16
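The Java constraint above can be captured as a small lookup. This is a sketch that encodes only what is stated in this conversation (3.11 runs on Java 17 or 21, 3.12 on Java 21 only); the names `SUPPORTED_JAVA` and `can_run` are hypothetical.

```python
# Java versions each Gerrit release supports, as stated in the discussion.
SUPPORTED_JAVA = {
    "3.11": {17, 21},
    "3.12": {21},
}


def can_run(gerrit_version: str, java_version: int) -> bool:
    """True if the given Gerrit release supports the given Java version."""
    return java_version in SUPPORTED_JAVA[gerrit_version]


# Currently on 3.11 with Java 17: moving to 3.12 needs Java 21 first.
print(can_run("3.11", 17), can_run("3.12", 17), can_run("3.12", 21))
```

Because 3.11 accepts both versions, the Java switch can happen on 3.11 as its own step before the 3.12 upgrade.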
clarkbmanage-projects job reports success and in the log file I see a whole lot of skipping22:18
clarkbif it looks good to everyone else I think we can call this done for now then pick up the post upgrade tasks tomorrow including testing an actual acl update22:18
clarkbbut also feel free to disagree with me on that point and point out anything else you feel is necessary to check or test first22:19
fungiand looks like the gitea task completed a couple of minutes ago22:19
clarkbfungi: gitea task?22:19
fungier, i guess it's gerrit and gitea, because it runs the plan on review03.opendev.org as well22:20
fungis/plan/play./22:20
fungiPLAY RECAP includes review03.opendev.org       : ok=4    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=022:20
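A recap line like the one quoted above can be checked mechanically. The sketch below assumes the `host : key=value ...` layout shown in that line; `parse_recap_line` and `host_ok` are hypothetical helpers, not part of Ansible.

```python
import re


def parse_recap_line(line: str):
    """Split 'host : ok=4 changed=2 ...' into (host, {counter: int})."""
    host, _, counters = line.partition(":")
    stats = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", counters)}
    return host.strip(), stats


def host_ok(stats) -> bool:
    """Treat a host as healthy when nothing failed and it was reachable."""
    return stats.get("failed", 0) == 0 and stats.get("unreachable", 0) == 0


line = ("review03.opendev.org       : ok=4    changed=2    unreachable=0    "
        "failed=0    skipped=0    rescued=0    ignored=0")
host, stats = parse_recap_line(line)
print(host, stats, host_ok(stats))
```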
clarkboh yes manage-projects runs against gitea first to update the repos there so that when we then update gerrit and potentially replicate from gerrit to gitea the gitea info is up to date and ready to receive updates22:20
fungionce upon a time, the manage-projects process wrote to a log on the gerrit server, but i'm not seeing where that is these days22:22
clarkbfungi: it writes it on bridge now in /var/log/ansible/manage-projects.yaml.log which is then also included in the infra-prod-manage-projects job logs (its the only job where we include the log because we're confident we don't leak things)22:23
clarkbfungi: basically we stopped having it write to local disk and instead emit to stdout so ansible captures it22:23
clarkbI don't remember why we decided that was desirable. Maybe it happened as part of switching it to an ansible job run on demand22:23
fungiah okay, so everything that the manage-projects script used to log on the gerrit server is just captured by ansible now22:23
clarkbyes should be22:23
fungianyway, i don't see anything amiss there22:24
clarkbgreat. tonyb any concerns with considering this done for now?22:25
clarkbI did update the post upgrade tasks list which yall might want to check to see if we feel any of them need to be done now rather than tomorrow/later22:25
clarkb(that list is on the etherpad)22:25
fungiat line 29722:26
clarkbI'm hearing no objections. I'm probably going to go and not look at a computer screen for a bit but can check in before dinner22:30
fungishould be fine, i'm still looking through it but i don't expect any concerns22:30
clarkbfungi: ack22:31
clarkbthank you for the help!22:31
fungimy pleasure!22:33
fungiyeah, i don't see anything urgent in the post upgrade tasks list, and i'm +2 on all the changes linked there so fart22:34
fungiso far22:34