opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/818846 | 02:31 |
---|---|---|
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] testing grubby machine-id interaction https://review.opendev.org/c/openstack/diskimage-builder/+/818851 | 05:31 |
opendevreview | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/818846 | 05:55 |
*** pojadhav is now known as pojadhav|afk | 05:58 | |
*** pojadhav|afk is now known as pojadhav | 06:36 | |
opendevreview | wangxiyuan proposed openstack/project-config master: Add openEuler 20.03 LTS SP2 node https://review.opendev.org/c/openstack/project-config/+/818723 | 07:08 |
ianw | after a day of investigation i think i'm closing in on this bootloader stuff | 07:18 |
ianw | firstly, it boots in the gate because there it builds on a DIB image already, which has booted with root=LABEL=cloudimg-rootfs | 07:19 |
ianw | the install takes a guess at this from the underlying /proc/cmdline -- and in the gate, it's right. on the production builders, it's not. | 07:19 |
ianw | but then, the issue is that grub2-mkconfig isn't updating the entries | 07:20 |
ianw | this appears to come down to BLS entries, which are prefixed with a machine-id. it seems we don't have /etc/machine-id, so the install doesn't match any of the /boot/loader/entries/<machine-id>*.conf files | 07:20 |
ianw | it never re-writes the kernel command line | 07:21 |
ianw | we have grubby hacks to "fix" this -- but actually they are wrong | 07:22 |
ianw | this all came to light because the "fix" got broken for fedora 35 | 07:22 |
ianw | anyway, i'm out of time, but I think we can get this working by ensuring we have a machine-id | 07:23 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] testing grubby machine-id interaction https://review.opendev.org/c/openstack/diskimage-builder/+/818851 | 07:32 |
*** ykarel is now known as ykarel|lunch | 09:21 | |
hashar | good morning. I had a report that the latest version of git-review is broken with the new git 2.34.0. Luckily that already got fixed https://review.opendev.org/c/opendev/git-review/+/818219 | 09:27 |
hashar | it might be worth tagging a new release so people can install from PyPI rather than having to install from the git sources :] | 09:27 |
frickler | hashar: iiuc we wanted to give it a bit of testing first, since we don't have the new git readily available in our CI, but your argument is valid as well | 10:03 |
*** jpena|off is now known as jpena | 10:54 | |
*** rlandy|out is now known as rlandy|ruck | 11:19 | |
*** pojadhav is now known as pojadhav|afk | 11:32 | |
*** ykarel|lunch is now known as ykarel | 12:05 | |
*** pojadhav|afk is now known as pojadhav | 12:17 | |
*** ykarel is now known as ykarel|afk | 12:41 | |
fungi | though the versions of git we do have are sufficient to cover both the old and new options | 12:47 |
fungi | (just not new enough to throw an error when trying to use the old option) | 12:47 |
hashar | frickler: fungi: at least the patch has unit tests :) | 12:54 |
hashar | I am super happy to see it is already patched up and I guess it only affects people running the very last version of git which is probably not that many people | 12:55 |
*** sshnaidm|afk is now known as sshnaidm | 13:24 | |
fungi | hashar: yep, but thanks for the reminder, we should probably go ahead and tag a release. i can work on that in a bit | 14:01 |
*** ykarel|afk is now known as ykarel | 14:02 | |
fungi | the core.hooksPath support addition means we should probably tag a 2.2.0 release | 14:03 |
opendevreview | Merged opendev/lodgeit master: Drop Python 2.7 support https://review.opendev.org/c/opendev/lodgeit/+/798990 | 14:37 |
opendevreview | Merged opendev/lodgeit master: Update docker image to bullseye and python 3.8 https://review.opendev.org/c/opendev/lodgeit/+/818597 | 14:42 |
*** ysandeep is now known as ysandeep|out | 15:01 | |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: Trivial: fix whitespace in ubuntu element rst https://review.opendev.org/c/openstack/diskimage-builder/+/818936 | 15:02 |
opendevreview | Merged opendev/irc-meetings master: Remove Technical committee office hours https://review.opendev.org/c/opendev/irc-meetings/+/818613 | 15:50 |
*** ykarel is now known as ykarel|away | 15:55 | |
*** ysandeep|out is now known as ysandeep | 16:05 | |
*** ysandeep is now known as ysandeep|out | 16:27 | |
opendevreview | Merged openstack/diskimage-builder master: Allowing ubuntu element use local image https://review.opendev.org/c/openstack/diskimage-builder/+/817481 | 16:59 |
*** marios is now known as marios|out | 17:06 | |
*** jpena is now known as jpena|off | 17:33 | |
fungi | looking at the current state of git-review, we missed adding release notes for git core.hooksPath support (feature), ignoring unstaged/uncommitted submodules (bugfix), and switching to git --rebase-merges (bugfix i guess) | 18:22 |
fungi | i'll try to write some up quickly and get those merged so we can reuse them for the release announcement | 18:22 |
fungi | er, no ignore me, we're only missing a release note for --rebase-merges | 18:24 |
opendevreview | Merged openstack/diskimage-builder master: Trivial: fix whitespace in ubuntu element rst https://review.opendev.org/c/openstack/diskimage-builder/+/818936 | 18:37 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Add release note for the rebase merge handling fix https://review.opendev.org/c/opendev/git-review/+/819018 | 19:00 |
fungi | Clark[m]: hashar: ^ i'll tag a release once we get that release note merged | 19:01 |
fungi | priteau: ^ | 19:36 |
opendevreview | Jeremy Stanley proposed opendev/git-review master: Add release note for the rebase merge handling fix https://review.opendev.org/c/opendev/git-review/+/819018 | 19:50 |
*** artom_ is now known as artom | 20:20 | |
priteau | Thank you fungi | 20:51 |
fungi | ianw: ^ if you maybe have a moment to review that release note as an extra pair of eyes, i'm happy to expedite a git-review release as multiple users are clearly running into the problem | 20:52 |
ianw | lgtm, thanks | 21:17 |
corvus | fungi, Clark, ianw: we merged a change to zuul that warrants a rolling scheduler restart at our convenience; does anyone want to drive that with me around to assist? or should i take care of it myself? | 21:22 |
ianw | corvus: if it can wait ~40 minutes i can give it my full attention, but if you have other things to do that's fine, i can learn another time :) | 21:23 |
Clark[m] | I'm technically not here today so probably best if I don't volunteer. I assume the process is pull images then restart 01 and wait for some indication of happy, then repeat on 02? | 21:23 |
corvus | ianw: 40m is not a problem; Clark yep | 21:24 |
fungi | i could likely help later, about to switch focus temporarily to dinner prep | 21:24 |
fungi | but happy to catch it on the next upgrade too | 21:24 |
corvus | let's regroup in 40m then | 21:25 |
ianw | ++ | 21:26 |
ianw | ok i am back | 22:01 |
corvus | all right | 22:02 |
corvus | first thing i'd do is run the zuul_pull playbook, just to make sure they all have the images. they probably should, but it's easier to run the playbook than it is to actually look up the image ids, etc. | 22:02 |
corvus | (it takes like 10 seconds if it's a noop) | 22:03 |
ianw | ok starting a screen | 22:04 |
ianw | ok, looks like everything pulled a new image | 22:05 |
corvus | did you start screen on bridge or elsewhere? | 22:06 |
ianw | bridge | 22:06 |
corvus | oh i think i just accidentally exited it, sorry! | 22:06 |
ianw | oh that's what it was :) | 22:06 |
corvus | okay, i think we're in sync now :) | 22:06 |
ianw | zuul-scheduler image id is 7248236db1e1 on zuul01 | 22:07 |
ianw | zuul/zuul-scheduler@sha256:640871154c1dfe2acd72140a117cfaa4e5d56b730dd49d9c7a82741dd1039dbe | 22:09 |
ianw | which matches https://hub.docker.com/layers/zuul/zuul-scheduler/latest/images/sha256-640871154c1dfe2acd72140a117cfaa4e5d56b730dd49d9c7a82741dd1039dbe?context=explore | 22:09 |
corvus | we haven't written a rolling restart playbook yet, so the next steps i have just manually done the individual servers. on zuul01, i would issue "zuul-scheduler stop" using a docker-compose exec or similar | 22:09 |
corvus | * manually done on the individual | 22:10 |
ianw | i guess we don't have to dump the queues any more | 22:11 |
corvus | nope :) | 22:11 |
ianw | ok, so it's *not* stop the container, it's run zuul-scheduler stop in the container, right? | 22:11 |
corvus | ianw: correct -- but only because i haven't really tested the other yet. that should be fine too, i'm just not certain where we are on graceful shutdowns right now. that should eventually be fine, and even if we run that now (intentionally or accidentally), it should still be fine too; it's designed to recover from crashing. it's just mostly untested at this point. | 22:13 |
corvus | (i would like to try some chaos monkey style crashes over some weekend...) | 22:13 |
ianw | ok ... /usr/bin/docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec -T scheduler zuul-scheduler stop | 22:14 |
corvus | okay, according to the logs, everything is idle, however i think this may be a version where the shutdown sequence wasn't quite right, so i think now that it has mostly stopped, we should "docker-compose down" to finish the job | 22:15 |
corvus | note at this point that https://zuul.opendev.org/components shows zuul01 stopped but still extant | 22:16 |
ianw | ok, container gone | 22:16 |
corvus | (that page doesn't have a reload button (yet) so we need to reload the page to see changes) | 22:16 |
corvus | and now it's disappeared from the components page | 22:17 |
corvus | i think you can docker-compose up on zuul01 now | 22:17 |
ianw | yep it's gone now | 22:17 |
ianw | ok, bringing back up | 22:17 |
corvus | okay, one of the recent changes is the new "initializing" state. that means that it's up but not processing pipelines yet as it's still loading the config. | 22:19 |
ianw | the ? state | 22:19 |
ianw | (on the components page) | 22:20 |
ianw | looks like it's madly loading configs now | 22:20 |
corvus | yep. we should wait until that's complete before stopping the other scheduler. mostly because of gerrit event handling -- i'm not sure that the events will be distributed to the tenants until it's finished loading. | 22:23 |
corvus | other than potentially missing events if this scheduler is elected to handle them -- there's no downside to it running with an incomplete config -- it won't process any pipelines it doesn't know about yet. | 22:23 |
corvus | looks like it's done now; state is 'running' | 22:25 |
ianw | yep cool | 22:26 |
ianw | a theoretical playbook could just loop on https://zuul.opendev.org/api/components i guess waiting for "running" | 22:26 |
corvus | so i think you can do the same procedure on zuul02. that will take the web server offline for a bit too, but that should be fine. | 22:27 |
corvus | yep, we could do that. i might spend some time thinking about whether we can make it safe for a scheduler to win the event election while initializing; that would be nice since it would speed up the rolling restart. | 22:27 |
ianw | ok i've done the stop | 22:28 |
ianw | status shows stopped | 22:28 |
* fungi has been sort of following along, now that dinner is done | 22:28 | |
corvus | i think you can proceed to the hard stop now | 22:28 |
ianw | yeah, it doesn't seem that stopped from tailing the logs | 22:28 |
corvus | oh, way back a bit it said that the main loop stopped | 22:29 |
corvus | it's just that there are some threads still running that get updates from zk | 22:29 |
ianw | ok, container down and it's out of the components list | 22:30 |
ianw | back up and initializing | 22:31 |
corvus | oh, i forgot that scheduler and web are actually different compose files, so that didn't take down zuul-web. i think that's worth restarting too for this, so maybe go ahead and do a down/up cycle for that whenever you feel like it | 22:31 |
ianw | ok, done for web | 22:32 |
ianw | hrm, getting a 503 from https://zuul.openstack.org/ | 22:33 |
corvus | that will probably happen until it loads its config | 22:35 |
corvus | it's effectively a read-only scheduler now, so it'll take just as long to start up as the real schedulers | 22:35 |
ianw | yep it is | 22:35 |
ianw | ahh, that's new (to me :) | 22:35 |
corvus | we'll definitely want to have 2 of these and load balance them | 22:35 |
corvus | it was one of the last things we did | 22:36 |
corvus | so a very recent change | 22:36 |
fungi | wonder if we should put them on entirely separate servers too | 22:36 |
corvus | fungi: not a bad idea, but if we have extra cpu or ram on the schedulers we could continue to double up. | 22:37 |
corvus | i should say extra cpu and ram. but really ram. of course we're going to have extra cpu (thanks GIL!) | 22:37 |
fungi | yeah, good point, we could also scale down the scheduler servers at this point, i expect | 22:38 |
fungi | the old one anyway (the new one is smaller now, right?) | 22:38 |
fungi | yeah, zuul01 has half the ram capacity of zuul02 | 22:39 |
corvus | yeah, i think we're at the point where the code we're running is representative of what we'll be doing in the future. | 22:40 |
fungi | memory utilization on zuul02 suggests it would be fine to put scheduler, web and fingergw on a vm the size of zuul01 | 22:40 |
ianw | oohhh, i see new lines between changes on the status page | 22:42 |
corvus | i'm seeing some errors in the logs and now i realize there was a backwards-incompat change, so we actually do need a full cluster restart | 22:42 |
corvus | i believe at this point we can just roll forward and restart all the executors and mergers | 22:42 |
ianw | minor thing, but https://zuul.openstack.org/components (note the openstack.org, which just was because my browser autocompleted it) gives an error | 22:43 |
corvus | but -- for purposes of learning about rolling scheduler restarts, the process would normally conclude here :) | 22:43 |
ianw | but zuul.opendev.org/components is ok | 22:43 |
ianw | heh, for merger/executor we can just use the playbook? | 22:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add zuul01 to cacti https://review.opendev.org/c/opendev/system-config/+/819033 | 22:44 |
corvus | yes, but we no longer need to restart the schedulers/web... so maybe run a locally modified copy? | 22:44 |
ianw | ok, i've edited my playbook copies and can run that | 22:46 |
corvus | ++ | 22:46 |
fungi | one of my changes which was in flight just got smacked with RETRY_LIMIT results for all of its builds | 22:47 |
fungi | probably the executor restart? | 22:47 |
ianw | ok, should all be restarted now | 22:47 |
corvus | fungi: that's the problem the executor restarts are fixing | 22:48 |
corvus | fungi: try a recheck now? | 22:48 |
fungi | aha, sounds good | 22:48 |
fungi | yeah, looks like it's being tested this time | 22:49 |
fungi | the git-review change in the gate pipeline at https://zuul.opendev.org/t/opendev/status | 22:49 |
corvus | i promoted the change at the top of the open gate pipeline to force a reset and clear out all the retries pending there | 22:50 |
fungi | oh, good call. thanks! | 22:51 |
corvus | s/open gate/openstack integrated gate/ | 22:51 |
fungi | and yeah, my git-review change in the opendev tenant did get nodes assigned and start builds, so looks good | 22:51 |
corvus | there will have been some fallout, but hopefully not too much | 22:52 |
corvus | i think we can call this done? | 22:53 |
ianw | ++ thanks! good to have a bit more confidence with it if something happens while you're all eating turkey :) | 22:54 |
fungi | yeah, this seems to have stabilized. thanks! | 22:56 |
corvus | \o/ | 22:57 |
ianw | i also think i've finally figured out all the magic behind installing kernel options for rh platforms now | 23:00 |
ianw | this is not helped by grub2 not being so much a package, but a series of 200 overlaid patches: https://src.fedoraproject.org/rpms/grub2/tree/rawhide | 23:00 |
ianw | but hopefully we can get f35 out of the failure loop it's in now | 23:03 |
fungi | yeeowch | 23:05 |
opendevreview | Merged opendev/git-review master: Add release note for the rebase merge handling fix https://review.opendev.org/c/opendev/git-review/+/819018 | 23:14 |
opendevreview | Merged opendev/system-config master: Add zuul01 to cacti https://review.opendev.org/c/opendev/system-config/+/819033 | 23:18 |
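
Editor's sketch for the 07:20 machine-id discussion: ianw's diagnosis is that BLS entries are prefixed with a machine-id, and that without /etc/machine-id nothing matches the /boot/loader/entries/<machine-id>*.conf files, so the kernel command line is never rewritten. The Python below is a minimal illustration of that matching rule only. It is not taken from diskimage-builder or grubby; the function name `bls_entries_for_machine_id` and the `root` parameter (the mounted image root) are hypothetical.

```python
from pathlib import Path
from typing import List


def bls_entries_for_machine_id(root: Path) -> List[Path]:
    """Return BLS entry files whose name starts with the image's machine-id.

    An empty list means /etc/machine-id is missing or empty, or no
    /boot/loader/entries/<machine-id>*.conf file matches it. That is the
    situation described in the log, where entries never get updated.
    """
    machine_id_file = root / "etc" / "machine-id"
    if not machine_id_file.is_file():
        return []
    machine_id = machine_id_file.read_text().strip()
    if not machine_id:
        return []
    entries_dir = root / "boot" / "loader" / "entries"
    if not entries_dir.is_dir():
        return []
    # Hypothetical helper: compare file-name prefixes only, nothing more.
    return sorted(
        p for p in entries_dir.glob("*.conf") if p.name.startswith(machine_id)
    )
```

Pointing it at an image's mounted root, for example `bls_entries_for_machine_id(Path("/mnt/image"))` (path is illustrative), would show whether any entry can be matched before a bootloader update is attempted.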
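
Editor's sketch for the 22:26 idea that a playbook "could just loop on https://zuul.opendev.org/api/components ... waiting for 'running'". The Python below shows one way such a wait could look. It rests on assumptions not confirmed in the log: that the endpoint returns JSON with a "scheduler" key holding a list of objects with "hostname" and "state" fields, and that the hostname string passed in matches what the API reports. The function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List, Optional

COMPONENTS_URL = "https://zuul.opendev.org/api/components"


def fetch_components(url: str = COMPONENTS_URL) -> Dict[str, List[dict]]:
    """Fetch and decode the components listing (response shape is assumed)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def scheduler_state(components: Dict[str, List[dict]], hostname: str) -> Optional[str]:
    """Return the reported state of one scheduler, or None if it is not listed."""
    for scheduler in components.get("scheduler", []):
        if scheduler.get("hostname") == hostname:
            return scheduler.get("state")
    return None


def wait_for_running(
    hostname: str,
    fetch: Callable[[], Dict[str, List[dict]]] = fetch_components,
    interval: float = 10.0,
    attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the scheduler reports 'running'; False if attempts run out."""
    for attempt in range(attempts):
        if scheduler_state(fetch(), hostname) == "running":
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

In a rolling restart this would sit between bringing one scheduler back up and stopping the next, since corvus advises waiting until config loading completes before touching the other scheduler. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access.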