Tuesday, 2021-11-23

opendevreviewOpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/81884602:31
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] testing grubby machine-id interaction  https://review.opendev.org/c/openstack/diskimage-builder/+/81885105:31
opendevreviewMerged openstack/project-config master: Normalize projects.yaml  https://review.opendev.org/c/openstack/project-config/+/81884605:55
*** pojadhav is now known as pojadhav|afk05:58
*** pojadhav|afk is now known as pojadhav06:36
opendevreviewwangxiyuan proposed openstack/project-config master: Add openEuler 20.03 LTS SP2 node  https://review.opendev.org/c/openstack/project-config/+/81872307:08
ianwafter a day of investigation i think i'm closing in on this bootloader stuff07:18
ianwfirstly, it boots in the gate because there it builds on a DIB image already, which has booted with root=LABEL=cloudimg-rootfs07:19
ianwthe install takes a guess at this from the underlying /proc/cmdline -- and in the gate, it's right.  on the production builders, it's not.07:19
ianwbut then, the issue is that grub2-mkconfig isn't updating the entries07:20
ianwthis appears to come down to BLS entries, which are prefixed with a machine-id.  it seems we don't have /etc/machine-id, so the install doesn't match any of the /boot/loader/entries/<machine-id>*.conf files07:20
ianwit never re-writes the kernel command line07:21
ianwwe have grubby hacks to "fix" this -- but actually they are wrong07:22
ianwthis all came to light because the "fix" got broken for fedora 3507:22
ianwanyway, i'm out of time, but I think we can get this working by ensuring we have a machine-id 07:23
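The machine-id matching ianw describes above can be sketched roughly as follows. This is only an illustration, assuming BLS entry files live under `/boot/loader/entries` and are named with the machine-id as a prefix, as stated in the discussion. The helper name `matching_bls_entries` and its parameters are hypothetical, and this is not how grubby or grub2-mkconfig are actually implemented.

```python
from pathlib import Path


def matching_bls_entries(root, machine_id_file="etc/machine-id",
                         entries_dir="boot/loader/entries"):
    """Return BLS entry files whose name starts with the image's machine-id.

    `root` is the top of the image filesystem being inspected (hypothetical
    parameter). An empty list means nothing matched, for example because
    etc/machine-id is missing or empty.
    """
    root = Path(root)
    id_path = root / machine_id_file
    if not id_path.is_file():
        return []
    machine_id = id_path.read_text().strip()
    if not machine_id:
        return []
    # Entries are matched purely on the <machine-id> filename prefix.
    return sorted((root / entries_dir).glob(f"{machine_id}*.conf"))
```

With no machine-id file the function returns an empty list even when entry files exist, which mirrors the situation described: the install has nothing to match against, so the kernel command line is never rewritten.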
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] testing grubby machine-id interaction  https://review.opendev.org/c/openstack/diskimage-builder/+/81885107:32
*** ykarel is now known as ykarel|lunch09:21
hashargood morning.   I had a report that the latest version of git-review is broken with the new git 2.34.0. Luckily that already got fixed https://review.opendev.org/c/opendev/git-review/+/818219   09:27
hasharit might be worth tagging a new release so people can install from PyPI rather than having to install from the git sources :]09:27
fricklerhashar: iiuc we wanted to give it a bit of testing first, since we don't have the new git readily available in our CI, but your argument is valid as well10:03
*** jpena|off is now known as jpena10:54
*** rlandy|out is now known as rlandy|ruck11:19
*** pojadhav is now known as pojadhav|afk11:32
*** ykarel|lunch is now known as ykarel12:05
*** pojadhav|afk is now known as pojadhav12:17
*** ykarel is now known as ykarel|afk12:41
fungithough the versions of git we do have are sufficient to cover both the old and new options12:47
fungi(just not new enough to throw an error when trying to use the old option)12:47
hasharfrickler: fungi: at least the patch has unit tests :)12:54
hasharI am super happy to see it is already patched up and I guess it only affects people running the very latest version of git, which is probably not that many people12:55
*** sshnaidm|afk is now known as sshnaidm13:24
fungihashar: yep, but thanks for the reminder, we should probably go ahead and tag a release. i can work on that in a bit14:01
*** ykarel|afk is now known as ykarel14:02
fungithe core.hooksPath support addition means we should probably tag a 2.2.0 release14:03
opendevreviewMerged opendev/lodgeit master: Drop Python 2.7 support  https://review.opendev.org/c/opendev/lodgeit/+/79899014:37
opendevreviewMerged opendev/lodgeit master: Update docker image to bullseye and python 3.8  https://review.opendev.org/c/opendev/lodgeit/+/81859714:42
*** ysandeep is now known as ysandeep|out15:01
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: Trivial: fix whitespace in ubuntu element rst  https://review.opendev.org/c/openstack/diskimage-builder/+/81893615:02
opendevreviewMerged opendev/irc-meetings master: Remove Technical committee office hours  https://review.opendev.org/c/opendev/irc-meetings/+/81861315:50
*** ykarel is now known as ykarel|away15:55
*** ysandeep|out is now known as ysandeep16:05
*** ysandeep is now known as ysandeep|out16:27
opendevreviewMerged openstack/diskimage-builder master: Allowing ubuntu element use local image  https://review.opendev.org/c/openstack/diskimage-builder/+/81748116:59
*** marios is now known as marios|out17:06
*** jpena is now known as jpena|off17:33
fungilooking at the current state of git-review, we missed adding release notes for git core.hooksPath support (feature), ignoring unstaged/uncommitted submodules (bugfix), and switching to git --rebase-merges (bugfix i guess)18:22
fungii'll try to write some up quickly and get those merged so we can reuse them for the release announcement18:22
fungier, no ignore me, we're only missing a release note for --rebase-merges18:24
opendevreviewMerged openstack/diskimage-builder master: Trivial: fix whitespace in ubuntu element rst  https://review.opendev.org/c/openstack/diskimage-builder/+/81893618:37
opendevreviewJeremy Stanley proposed opendev/git-review master: Add release note for the rebase merge handling fix  https://review.opendev.org/c/opendev/git-review/+/81901819:00
fungiClark[m]: hashar: ^ i'll tag a release once we get that release note merged19:01
fungipriteau: ^19:36
opendevreviewJeremy Stanley proposed opendev/git-review master: Add release note for the rebase merge handling fix  https://review.opendev.org/c/opendev/git-review/+/81901819:50
*** artom_ is now known as artom20:20
priteauThank you fungi20:51
fungiianw: ^ if you maybe have a moment to review that release note as an extra pair of eyes, i'm happy to expedite a git-review release as multiple users are clearly running into the problem20:52
ianwlgtm, thanks21:17
corvusfungi, Clark, ianw: we merged a change to zuul that warrants a rolling scheduler restart at our convenience; does anyone want to drive that with me around to assist?  or should i take care of it myself?21:22
ianwcorvus: if it can wait ~40 minutes i can give it my full attention, but if you have other things to do that's fine, i can learn another time :)21:23
Clark[m]I'm technically not here today so probably best if I don't volunteer. I assume the process is pull images, then restart 01 and wait for some indication of happy, then repeat on 02?21:23
corvusianw: 40m is not a problem; Clark yep21:24
fungii could likely help later, about to switch focus temporarily to dinner prep21:24
fungibut happy to catch it on the next upgrade too21:24
corvuslet's regroup in 40m then21:25
ianwok i am back22:01
corvusall right22:02
corvusfirst thing i'd do is run the zuul_pull playbook, just to make sure they all have the images.  they probably should, but it's easier to run the playbook than it is to actually look up the image ids, etc.22:02
corvus(it takes like 10 seconds if it's a noop)22:03
ianwok starting a screen22:04
ianwok, looks like everything pulled a new image22:05
corvusdid you start screen on bridge or elsewhere?22:06
corvusoh i think i just accidentally exited it, sorry!22:06
ianwoh that's what it was :)22:06
corvusokay, i think we're in sync now :)22:06
ianwzuul-scheduler image id is 7248236db1e1 on zuul0122:07
ianwwhich matches https://hub.docker.com/layers/zuul/zuul-scheduler/latest/images/sha256-640871154c1dfe2acd72140a117cfaa4e5d56b730dd49d9c7a82741dd1039dbe?context=explore22:09
corvuswe haven't written a rolling restart playbook yet, so the next steps i have just manually done the individual servers.  on zuul01, i would issue "zuul-scheduler stop" using a docker-compose exec or similar22:09
corvus* manually done on the individual22:10
ianwi guess we don't have to dump the queues any more22:11
corvusnope :)22:11
ianwok, so it's *not* stop the container, it's run zuul-scheduler stop in the container, right?22:11
corvusianw: correct -- but only because i haven't really tested the other yet.  that should be fine too, i'm just not certain where we are on graceful shutdowns right now.  that should eventually be fine, and even if we run that now (intentionally or accidentally), it should still be fine too; it's designed to recover from crashing.  it's just mostly untested at this point.22:13
corvus(i would like to try some chaos monkey style crashes over some weekend...)22:13
ianwok ... /usr/bin/docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec -T scheduler zuul-scheduler stop22:14
corvusokay, according to the logs, everything is idle, however i think this may be a version where the shutdown sequence wasn't quite right, so i think now that it has mostly stopped, we should "docker-compose down" to finish the job22:15
corvusnote at this point that https://zuul.opendev.org/components shows zuul01 stopped but still extant22:16
ianwok, container gone22:16
corvus(that page doesn't have a reload button (yet) so we need to reload the page to see changes)22:16
corvusand now it's disappeared from the components page22:17
corvusi think you can docker-compose up on zuul01 now22:17
ianwyep it's gone now22:17
ianwok, bringing back up22:17
corvusokay, one of the recent changes is the new "initializing" state.  that means that it's up but not processing pipelines yet as it's still loading the config.22:19
ianwthe ? state22:19
ianw(on the components page)22:20
ianwlooks like it's madly loading configs now22:20
corvusyep.  we should wait until that's complete before stopping the other scheduler.  mostly because of gerrit event handling -- i'm not sure that the events will be distributed to the tenants until it's finished loading.22:23
corvusother than potentially missing events if this scheduler is elected to handle them -- there's no downside to it running with an incomplete config -- it won't process any pipelines it doesn't know about yet.22:23
corvuslooks like it's done now; state is 'running'22:25
ianwyep cool22:26
ianwa theoretical playbook could just loop on https://zuul.opendev.org/api/components i guess waiting for "running"22:26
corvusso i think you can do the same procedure on zuul02.  that will take the web server offline for a bit too, but that should be fine.22:27
corvusyep, we could do that.  i might spend some time thinking about whether we can make it safe for a scheduler to win the event election while initializing; that would be nice since it would speed up the rolling restart.22:27
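The polling loop ianw suggests could look something like the sketch below. It is a rough illustration under stated assumptions, not an existing playbook: it assumes the components endpoint returns JSON shaped as a mapping from component kind to a list of objects with `hostname` and `state` keys, and the function names, timeout and interval are all hypothetical.

```python
import json
import time
import urllib.request


def scheduler_states(components):
    """Map scheduler hostname -> state from a parsed components document.

    Assumes the shape {"scheduler": [{"hostname": ..., "state": ...}, ...]}.
    """
    return {c["hostname"]: c["state"] for c in components.get("scheduler", [])}


def fetch_components(url="https://zuul.opendev.org/api/components"):
    """Fetch and parse the components document from the API."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def wait_for_running(hostname, fetch=fetch_components, timeout=1800,
                     interval=10, sleep=time.sleep):
    """Poll until the named scheduler reports 'running'.

    Returns True once the state is 'running', or False if roughly
    `timeout` seconds of sleeping elapse first.
    """
    waited = 0
    while True:
        state = scheduler_states(fetch()).get(hostname)
        if state == "running":
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

A rolling restart would call something like `wait_for_running(...)` for the first scheduler before stopping the second, matching the manual wait done here. The `fetch` and `sleep` arguments are injectable so the loop can be exercised without network access.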
ianwok i've done the stop22:28
ianwstatus shows stopped22:28
* fungi has been sort of following along, now that dinner is done22:28
corvusi think you can proceed to the hard stop now22:28
ianwyeah, it doesn't seem that stopped from tailing the logs22:28
corvusoh, way back a bit it said that the main loop stopped22:29
corvusit's just that there are some threads still running that get updates from zk22:29
ianwok, container down and it's out of the components list22:30
ianwback up and initializing22:31
corvusoh, i forgot that scheduler and web are actually different compose files, so that didn't take down zuul-web.  i think that's worth restarting too for this, so maybe go ahead and do a down/up cycle for that whenever you feel like it22:31
ianwok, done for web22:32
ianwhrm, getting a 503 from https://zuul.openstack.org/22:33
corvusthat will probably happen until it loads its config22:35
corvusit's effectively a read-only scheduler now, so it'll take just as long to start up as the real schedulers22:35
ianwyep it is22:35
ianwahh, that's new (to me :)22:35
corvuswe'll definitely want to have 2 of these and load balance them22:35
corvusit was one of the last things we did22:36
corvusso a very recent change22:36
fungiwonder if we should put them on entirely separate servers too22:36
corvusfungi: not a bad idea, but if we have extra cpu or ram on the schedulers we could continue to double up.22:37
corvusi should say extra cpu and ram.  but really ram.  of course we're going to have extra cpu (thanks GIL!)22:37
fungiyeah, good point, we could also scale down the scheduler servers at this point, i expect22:38
fungithe old one anyway (the new one is smaller now, right?)22:38
fungiyeah, zuul01 has half the ram capacity of zuul0222:39
corvusyeah, i think we're at the point where the code we're running is representative of what we'll be doing in the future.22:40
fungimemory utilization on zuul02 suggests it would be fine to put scheduler, web and fingergw on a vm the size of zuul0122:40
ianwoohhh, i see new lines between changes on the status page22:42
corvusi'm seeing some errors in the logs and now i realize there was a backwards-incompat change, so we actually do need a full cluster restart22:42
corvusi believe at this point we can just roll forward and restart all the executors and mergers22:42
ianwminor thing, but https://zuul.openstack.org/components (note the openstack.org, which just was because my browser autocompleted it) gives an error22:43
corvusbut -- for purposes of learning about rolling scheduler restarts, the process would normally conclude here :)22:43
ianwbut zuul.opendev.org/components is ok22:43
ianwheh, for merger/executor we can just use the playbook?22:44
opendevreviewJeremy Stanley proposed opendev/system-config master: Add zuul01 to cacti  https://review.opendev.org/c/opendev/system-config/+/81903322:44
corvusyes, but we no longer need to restart the schedulers/web... so maybe run a locally modified copy?22:44
ianwok, i've edited my playbook copies and can run that22:46
fungione of my changes which was in flight just got smacked with RETRY_LIMIT results for all of its builds22:47
fungiprobably the executor restart?22:47
ianwok, should all be restarted now22:47
corvusfungi: that's the problem the executor restarts are fixing22:48
corvusfungi: try a recheck now?22:48
fungiaha, sounds good22:48
fungiyeah, looks like it's being tested this time22:49
fungithe git-review change in the gate pipeline at https://zuul.opendev.org/t/opendev/status22:49
corvusi promoted the change at the top of the open gate pipeline to force a reset and clear out all the retries pending there22:50
fungioh, good call. thanks!22:51
corvuss/open gate/openstack integrated gate/22:51
fungiand yeah, my git-review change in the opendev tenant did get nodes assigned and start builds, so looks good22:51
corvusthere will have been some fallout, but hopefully not too much22:52
corvusi think we can call this done?22:53
ianw++ thanks!  good to have a bit more confidence with it if something happens while you're all eating turkey :)22:54
fungiyeah, this seems to have stabilized. thanks!22:56
ianwi also think i've finally figured out all the magic behind installing kernel options for rh platforms now23:00
ianwthis is not helped by grub2 not being so much a package, but a series of 200 overlayed patches : https://src.fedoraproject.org/rpms/grub2/tree/rawhide23:00
ianwbut hopefully we can get f35 out of the failure loop it's in now23:03
opendevreviewMerged opendev/git-review master: Add release note for the rebase merge handling fix  https://review.opendev.org/c/opendev/git-review/+/81901823:14
opendevreviewMerged opendev/system-config master: Add zuul01 to cacti  https://review.opendev.org/c/opendev/system-config/+/81903323:18