Tuesday, 2021-11-02

clarkbThe meeting will begin shortly18:59
* fungi listens to the muzak19:00
*** ianw_pto is now known as ianw19:00
fungi[please stand by]19:00
ianwo/19:00
fricklero/19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Nov  2 19:01:08 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-November/000294.html Our Agenda19:01
clarkbWelcome, you'll find the agenda for this meeting ^ there.19:01
clarkb#topic Announcements19:01
clarkbThe Gerrit User Summit will be happening sometime early next month and details should be coming out soon. I expect that it will be remote but don't know that for sure.19:02
clarkbI bring it up because I was discussing that we did the 3.2 -> 3.3 upgrade and automated much of our testing for that and he thought other Gerrit users would be interested in hearing how we manage our gerrit19:02
fungii guess we're in a much better position to talk about the things we're doing with gerrit, now that we're running a relatively recent release19:02
clarkbI think our installation is a bit different than many others because while we run a fairly large instance we don't currently do HA or have very strict uptime requirements. But at the same time we automate much of our testing and development around gerrit now.19:03
clarkbAnyway the event might be interesting to those who attend this meeting so calling it out here. I'm hoping I can submit something to talk about how we run gerrit stuff too19:03
clarkb#topic Actions from last meeting19:04
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-26-19.01.txt minutes from last meeting19:04
clarkbianw you had an action to start on the gerrit 3.4 stuff19:04
fungii hope he didn't do that on his vacation19:05
clarkbwrite a checklist, hold a node, and test the downgrade.19:05
ianwyes i started on that19:05
ianw#link https://etherpad.opendev.org/p/gerrit-upgrade-3.419:05
clarkbthank you!19:05
ianwi got as far as noticing that the plugin updates do seem to break our theme19:06
clarkbthat is good news since it was one of the questions we had19:06
ianwso i'll dig into that first19:06
clarkber19:06
clarkbI read it as does not. But I guess knowing is good either way just more work in this case19:06
clarkbyay for testing :)19:06
ianwindeed :)19:07
fungithose used a "lightweight" polygerrit plugin method, so i guess we need some java to go along with it now19:07
clarkbI think you may still be able to do pure javascript plugins but you have to hook in some specific way?19:07
fungis/those/the theme/19:07
clarkbianw: the original file came from the android theme iirc. We might be able to see how they updated their theme?19:08
ianwyeah, good idea.  honestly i haven't even had an initial look at it yet19:08
fungipaladox also might have suggestions as to how to update it since he effectively supplied the original for us19:08
clarkbya no rush. Just wanted to check in on this as it was recorded as an action. Sounds like good progress. Thanks19:09
clarkbThe other recorded action was for infra root to review the mailman 3 spec19:09
clarkblet's just dive into that topic now19:09
clarkb#topic Specs19:09
clarkb#link https://review.opendev.org/810990 Mailman 3 spec19:09
clarkbianw and I have reviewed it and appear happy with the spec. I'd like to approve this soon if we can as the holiday period is a good time for this type of work19:10
clarkbfrickler: corvus: do you think you have time to review it this week? Any objections to landing it this week if not?19:10
fricklerI'll put it on my list but won't object to anything19:11
clarkbok in that case I'll aim to approve it end of day Friday if no review objections come up19:11
clarkbthanks!19:11
clarkb#topic Topics19:12
fungialso i'm happy to make adjustments during setup if people come up with new concerns19:12
clarkbfungi: thanks19:12
clarkb#topic Improving OpenDev's CD throughput19:12
clarkbI'll admit I haven't really had a chance to look at this yet. I feel like this sort of change requires me to not be juggling a few things so I can focus on understanding the end result and haven't had that opportunity yet19:12
clarkbIt is on my todo list if I ever find that block of time :/19:13
ianwyeah i need to cycle back on some failures in jobs too19:13
clarkb#link https://review.opendev.org/c/opendev/system-config/+/80767219:13
clarkbspecifically that change and its child if others have time too19:13
clarkbat this point I think it just needs reviewers and someone to look at failures. Then we can improve it as required by review and start landing changes19:14
clarkb#topic Gerrit account cleanups19:15
clarkbJust a note that I haven't heard back from the user I most recently did the fixup for19:15
clarkbI suppose no news is good news in this case19:15
fungithey seemed relatively uncommunicative anyway19:15
clarkb#topic Fedora 34 boot problems19:16
clarkbI've not managed to keep up with the status of this other than reviewing a change here and there19:17
clarkbIs this still an issue? Anything we need to do to help fix it?19:17
ianwthe dracut fix made it into f34, but i was not clear if that actually would fix the default initramfs19:17
ianwso i have a change out still that regenerates it with dracut19:18
ianwhowever, i also just updated for fedora 3519:18
fungiwhat's the anticipated release date for 35?19:19
ianwi wasn't sure what to do with the mirror, but i think it just released today19:19
ianwso that solves having to figure out "/devel" paths19:19
clarkbfungi: it was yesterday I think19:19
fungioh, then yeah that may just be a better place to focus regardless19:19
fungiin the meantime we're not all that blocked on 34 since we've got three providers where it can boot now19:20
clarkbI think only 2 have labels configured for it though19:20
ianw(sorry just logging into gerrit)19:20
clarkbbut that is probably good enough while we get f35 up19:20
fungioh, i guess we never added it to vexxhost19:20
clarkbthe other related item was I had a -1 comment on the f33 mirror cleanup. I don't think we can remove the fedora atomic image yet because magnum has older branches still using it19:21
ianwi guess we still think we have fedora 29 users19:21
fungigranted it's also not terribly efficient that we've got poolworkers accepting node requests they'll ultimately be unable to fulfil after waiting 15-20 minutes for the node to never become reachable19:21
clarkbbut we should definitely clean out f3319:21
clarkbianw: ya I think we should also send a note to openstack-discuss that that image needs to go away. It isn't something anyone should be using and they need to make a plan for using something else?19:21
fungiif we do decide to abandon 34 and focus on 35, then we should probably remove the 34 label from providers where we know it's broken19:22
ianwok, i think it's more a "this is going away" message at this point ...19:22
clarkbianw: yup. Basically we know it is used but no one should use it and we need to clean it up. Let's give them sufficient warning then proceed19:22
fungimaybe a one-sentence reminder that we tend to not keep eol distro versions around19:23
ianwi can split those up, drop f33, add f35, then starts builds for f35, update zuul-jobs and any users and then we can drop f3419:23
clarkbsounds like a plan19:23
fungibut in the meantime, drop f34 everywhere besides inmotion and citycloud (and maybe add it to vexxhost?)19:24
fungiwe're just wasting resources trying to boot it everywhere else19:24
clarkbfungi: ya otherwise that19:24
ianwok, i'll do that too, although i hope the removal will proceed in a timely fashion :)19:25
fungiit's more just that i'm watching nodepool try to boot a f34 node in rackspace right now19:25
clarkbthanks. Let me know if I can help. Happy to do reviews on that as the slow f34 boots affected random things when it was a bigger issue19:26
clarkb#topic Zuul multi scheduler setup19:26
clarkbOver the weekend zuul ran with an active active scheduler for the first time19:27
clarkbI saw a report that at least one job was started by one scheduler and finished by another19:27
clarkbUnfortunately there have been some bumps along the way (corvus is currently doing a restart to fall back to a single scheduler after debugging the main issue)19:27
clarkbbasically keep this in mind if you are doing any zuul work. And if you notice any weird zuul behavior reporting that back to the zuul matrix room is a good idea19:28
corvusit went a lot better than i expected actually :)19:28
fungiwe ran with one again just a few minutes ago!19:28
clarkbI'm happy because it is nice to see all that code review done last week show results. Super exciting to see zuulv5 when it is ready19:29
clarkbBut ya if you notice weirdness please report it. That information and feedback is useful.19:29
fungialso the zuul restart docs have been updated, and include information on dumping some diagnostic data19:30
clarkb#topic FIPS testing in our CI system19:31
clarkbWe're seeing more and more interest in testing software on FIPS enabled systems, particularly for openstack.19:31
clarkbThe way we've been approaching this is having the jobs install whatever they need for that then enable configs and reboot into the new kernel state19:32
clarkbThe reason for this is managing another set of centos-8 and fedora-* images just for FIPS doesn't really scale well and zuul supports this reboot case just fine19:32
clarkbThis does present an issue where a lot of jobs set ephemeral state that doesn't survive reboots19:33
clarkbfor example multinode networking creates ovs networks and those don't come back after a reboot. swift unittesting creates an xfs filesystem that is mounted and not added to fstab19:33
fungii had a random crazy thought about that... what if glean grew the ability to run userdata scripts like cloud-init can, and nodepool supplied those to do things like enable fips and reboot, or tweak kernel parameters to limit available memory and reboot... we wouldn't need separate images then, just separate labels19:34
clarkbIf people come to you with problems around FIPS testing it is probably a good idea to check for any lost ephemeral state as an early debugging step19:34
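One way to look for the kind of lost state described above is to compare what is currently mounted against what is recorded in fstab. The following is only a minimal sketch: the function names, the ignore list, and the sample mount path are made up for illustration and are not part of zuul-jobs or any existing tool. It takes /proc/mounts-style and fstab-style text and reports mount points that have no fstab entry, and so would not come back after the reboot.

```python
# Illustrative sketch only: names and the ignore list are assumptions.

# Pseudo-filesystems that the kernel or init system sets up again on
# every boot (a non-exhaustive list chosen for this example).
IGNORED_FSTYPES = {"proc", "sysfs", "devtmpfs", "devpts", "tmpfs", "cgroup2"}


def mount_points(table_text):
    """Map mount point -> filesystem type from fstab- or /proc/mounts-style text."""
    result = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3:
            result[fields[1]] = fields[2]
    return result


def ephemeral_mounts(mounts_text, fstab_text):
    """Return mounted paths that have no fstab entry and so will not survive a reboot."""
    mounted = mount_points(mounts_text)
    persistent = mount_points(fstab_text)
    return sorted(
        path
        for path, fstype in mounted.items()
        if path not in persistent and fstype not in IGNORED_FSTYPES
    )


# Hypothetical sample data: an xfs filesystem on a loop device that a
# test suite mounted by hand without adding it to fstab.
sample_mounts = (
    "/dev/vda1 / ext4 rw 0 0\n"
    "proc /proc proc rw 0 0\n"
    "/dev/loop0 /srv/node/test xfs rw 0 0\n"
)
sample_fstab = "# static file system information\nUUID=abc / ext4 defaults 0 1\n"

print(ephemeral_mounts(sample_mounts, sample_fstab))  # ['/srv/node/test']
```

On a real node the two inputs would be the contents of /proc/mounts and /etc/fstab. This only covers filesystems; other ephemeral state such as ovs networks would need its own check.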
clarkbfungi: I think we intentionally avoid that stuff because it is really difficult to debug19:34
clarkbfungi: the reason that we prefer putting as much logic into zuul as possible is that the info can then be exposed to users easily19:34
fungiyeah, fair, only the console log can really provide that info19:35
clarkbin fact debugging the swift issue was done with a held node but I did not log into that machine at all and only used job info19:35
ianwit would only be a label though to pass fips=1 on the command line?19:35
ianwdoes that not happen via glance options, similar to some of the stuff we set for arm64 images?19:36
clarkbianw: it would be a label with a user script set that installed necessary packages and updated config then did a reboot19:36
ianwglance metadata19:36
clarkbno it happens via nova metadata and is per instance19:36
clarkbdebugging that stuff is really difficult19:36
clarkbpersonally I don't think we should go that route for this reason19:37
fungii don't think all hypervisors can alter the kernel command line anyway19:37
ianwi did think it was just a switch, that we could have the images ready and boot them in either mode.  but i haven't looked at details, obviously19:38
clarkbSomething else to be aware of is that to address the loss of state people are wanting to modify all the jobs with FIPS enabled flags. I've been -1'ing those and asking for different base jobs instead. The reason for this is zuul-jobs is meant to be a generally reconsumable library of jobs for all zuul users and adding a bunch of fips flags to generic jobs seems to pollute that goal19:38
clarkbbasically have a multinode-fips job instead of a multinode job with a fips flag19:38
clarkbThe last thing I wanted to call out on this topic is I think we should also try to encourage projects to avoid having two copies of all the jobs to cover fips=1 and fips=019:39
fungii can't remember, does multiple inheritance (mix-ins) work? did that ever get added or just talked about?19:39
clarkbfungi: there is a way to do it but it is undocumented iirc19:40
fungiahh, so probably discouraged19:40
corvusi don't discourage it :)19:40
clarkbWe should probably encourage fips by default for projects where that is important, with the assumption that if it works under fips it will work without fips19:40
clarkband/or targeted fips testing and not attempt to test everything under fips19:40
ianwi'm just trying to read ... is the local config required regenerating initramfs?19:42
clarkbianw: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/enable-fips/tasks/main.yaml is the current implementation19:42
clarkbI do not know what all `fips-mode-setup --enable` does19:43
clarkbbut I assume it is non trivial if it comes with its own command19:43
clarkbI don't think you can change the operating mode without a reboot either19:44
clarkbsince it changes kernel stuff that can't be modified without rebooting19:44
ianwit looks like it's mostly regenerating initramfs, disabling prelink and some sshd_config tweaks19:44
fungiapparently you can do it on ubuntu lts too, but need to have a ua subscription19:44
ianwi thought prelink was dead anyway, have to investigate that19:45
clarkbAnyway the most important thing I wanted to call out was the hint for helping debug FIPS related problems potentially being related to losing state in the reboot19:46
clarkbsince that was overlooked in the swift case for far too long19:46
clarkb#topic Open Discussion19:47
clarkbI ended up rebooting nb03 to address its weird high load average19:47
fungithe implementation detail of swift's functional tests creating and mounting an xfs filesystem on a loop device is easily overlooked anyway19:47
clarkbthe system itself is happy afterwards but nodepool-builder won't start there due to openshift==0.0.1 being installed19:47
clarkbhttps://review.opendev.org/c/zuul/nodepool/+/816389 should fix that but we wanted to confirm it doesn't make the nodepool image builds very slow before approving it19:47
ianwoh, indeed.  i'm not sure if we implemented something to add extra wheels to the opendev requirements build, or just talked about it19:48
fricklerthat sounds like we lack some test for that image?19:49
clarkbfrickler: yup largely because we'd need to figure out arm64 specific jobs for it19:49
clarkbit is doable but not as easy as getting it covered with say the dib jobs19:49
clarkbianw: I think your spec for zuul is likely to be the plan for properly solving that problem?19:49
ianwit doesn't work on the dib side because devstack doesn't work on arm6419:49
clarkbah19:50
ianwwe could do a "does it start" test19:51
clarkbya we would fail that currently and that would be an improvement19:51
clarkbwon't cover everything but should ensure basic functionality19:51
fungithat might be "good enough" for a secondary architecture exercise anyway19:52
fungisince if it starts, the rest of its operation is unlikely to differ substantially19:52
ianwyeah, not exactly sure what that would look like -- perhaps just start a ZK container and make sure it gets into a listening state?19:53
ianwalthough probably just tox tests would pick this up too?19:53
clarkbif we can have it build a simple image but not upload it (do we have a way to force a build without upload?) then we can nodepool dib-image-list it19:53
ianwmaybe we should just run that in check-arm64?19:54
clarkbianw: you would need to ensure deps are installed the same way when running unittests but that should work too19:54
clarkbpart of this is an artifact of how we make the qemu arm64 emulated docker image build run in a reasonable time frame19:55
clarkbif we just ran tox on arm64 it would probably work19:55
clarkbbecause it would find the sdist for openshift 0.11.2 and install from that.19:55
fungi(or fail, more importantly)19:55
fungioh, i see what you mean19:55
clarkbwell in this case it wouldn't fail19:55
fungiyeah19:55
clarkbif we reproduced the same install method for unittests then it would fail19:55
fungiso tox is not good enough19:55
ianwyeah, it's more making sure we're pulling the same wheels etc. in tox19:56
fungiin this case, not good enough without setting nonstandard options for pip's dep solver to skip source-only versions anyway19:56
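The difference between the two install methods discussed above can be shown with a toy model. This is a sketch under stated assumptions, not pip's actual resolver: the `Release` class, `pick_release`, and the wheel availability recorded for each version are all hypothetical and chosen only to mirror the openshift 0.0.1 versus 0.11.2 outcome from the conversation. The `only_binary` switch is similar in spirit to pip's `--only-binary` option, although the discussion does not name the specific option used in the image build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str
    has_wheel: bool  # a prebuilt wheel is available for this platform
    has_sdist: bool  # a source distribution is available


def version_key(version):
    """Turn '0.11.2' into (0, 11, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_release(releases, only_binary=False):
    """Pick the newest installable release, or None if nothing qualifies.

    With only_binary=True, releases that exist only as source are skipped.
    """
    candidates = [
        r for r in releases
        if r.has_wheel or (r.has_sdist and not only_binary)
    ]
    return max(candidates, key=lambda r: version_key(r.version), default=None)


# Assumed availability, for illustration only.
releases = [
    Release("0.0.1", has_wheel=True, has_sdist=True),
    Release("0.11.2", has_wheel=False, has_sdist=True),
]

print(pick_release(releases).version)                    # 0.11.2
print(pick_release(releases, only_binary=True).version)  # 0.0.1
```

Under these assumptions a plain tox run picks the newer source-only release and passes, while a wheel-only install silently falls back to the old version, which is why the unit tests would need to reproduce the same install method to catch the problem.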
ianwthis does more-or-less cycle back to the rough spec from our discussions on arm64 wheels + zuul19:57
clarkbyup I think if we focus on that we'll largely solve this specific problem19:57
ianw#link https://review.opendev.org/c/zuul/zuul/+/81540619:57
clarkbany arm64 testing would be to sanity check that result19:57
clarkband we can possibly rely on unittests then if we have the better arm64 stuff for zuul19:58
ianwyeah, so i think calling this out there and making sure we address it is probably the way forward19:58
clarkb++19:58
ianwi can update that for the testing case and we can loop back on it19:58
clarkbsounds good. And we are at time20:00
clarkbthank you everyone!20:00
fungithanks clarkb!20:00
clarkbWe'll see you here next week and the week after. but then many of us have a big holiday in three weeks20:00
clarkbI expect that I won't be around much during the week of US thanksgiving20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  2 20:00:45 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-11-02-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-11-02-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-11-02-19.01.log.html20:00
clarkbJust a heads up. Feel free to have the meeting without me or skip. I suspect we'll just do a skip20:01