19:00:29 <fungi> #startmeeting infra
19:00:29 <opendevmeet> Meeting started Tue Sep 20 19:00:29 2022 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:29 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:30 <opendevmeet> The meeting name has been set to 'infra'
19:01:02 <fungi> #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000361.html Our agenda
19:01:18 <fungi> #topic Announcements
19:01:34 <fungi> anybody have anything that needs announcing?
19:02:20 <fungi> i'll take your silence as a resounding negative
19:02:35 <fungi> #topic Actions from last meeting
19:02:50 <ianw> o/
19:03:46 <fungi> #link https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html
19:03:56 <fungi> looks like that says "none" so i think we can move on
19:04:07 <fungi> #topic Specs Review
19:04:50 <fungi> i don't see any recently proposed nor recently merged
19:05:00 <fungi> anyone have a spec on the way?
19:05:24 <fungi> the in progress specs are covered in upcoming topics anyway
19:05:36 <fungi> so i guess let's jump straight to those
19:05:58 <fungi> #topic Bastion host
19:06:05 <fungi> ianw: looks like you had some questions there?
19:06:32 <ianw> i have found some time to move this forward a bit
19:06:37 <fungi> excellent!
19:06:54 <ianw> #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:07:27 <fungi> that is quite the series of changes
19:07:46 <ianw> that's the stack.  it's doing a few things -- moving our ansible into a venv, updating all our testing environments to a jammy-based bridge, and abstracting things so we can replace bridge.openstack.org with another fresh jammy host
19:08:00 <ianw> probably called bridge.opendev.org
19:08:04 <fungi> looks like clarkb has already reviewed the first several
19:08:31 <ianw> yeah, the final change isn't 100% ready yet but hopefully soon
19:08:52 <fungi> 852477 is getting a post_failure on run-base. is that expected? or unrelated?
19:09:27 <ianw> i'm on PTO after today, until 10/3, so i don't expect much progress, but after that i think we'll be pretty much ready to swap out bridge
19:09:43 <fungi> sounds great, i'll try to set aside some time to go through those
19:10:07 <ianw> 852477 is not expected to fail :)  that is probably unrelated
19:10:21 <fungi> enjoy your time afk!
19:10:50 <fungi> i'm hoping to do something like that myself at the end of next month
19:11:04 <ianw> it should all be a big no-op; *should* being the operative term :)
19:12:17 <fungi> awesome, anything else specific you want to call out there?
19:12:35 <fungi> the agenda mentions venv for openstacksdk, upgrading the server distro version...
19:13:23 <clarkb> I think the change stack does a lot of that
19:13:32 <clarkb> so now that we have changes we can probably shift to reviewing things there
19:14:57 <fungi> sounds good. i guess we can proceed to the next topic
19:15:19 <fungi> thanks for putting that together ianw!
19:15:23 <fungi> #topic Upgrading Bionic servers to Focal/Jammy
19:15:39 <clarkb> I think the only new thing related to this topic is what ianw just covered above
19:15:43 <fungi> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:16:15 <fungi> yep
19:17:24 <fungi> #topic Mailman 3
19:17:47 <fungi> this is coming along nicely
19:18:11 <fungi> i was able to hold a node from the latest patchset of the implementing change and import production copies of all our data into it
19:18:46 <fungi> most full site migrations will go quickly. i think i clocked lists.opendev.org in at around 8 minutes (i'd have to double-check my notes)
19:19:02 <fungi> the slowest of course will be lists.openstack.org, which took about 2.5 hours when i tested it
19:19:21 <fungi> the stable branch failures list accounts for about half that time on its own
19:20:11 <fungi> the only outstanding errors were around a few of the older openstack lists having some rejection/moderation message templates which were too large for the default column width in the db
19:20:33 <fungi> for those, i think i'll just fix the current production ones to be smaller (it was a total of 3, so not a big deal)
19:21:07 <fungi> i still need to find some time to test that the legacy pipermail redirects work, and test that mysqldump works on the imported data set
19:21:54 <fungi> the held server with the full imported data set is 104.239.143.143 in case anyone wants to poke at it
19:21:59 <clarkb> sounds like great progress. Thank you for continuing to push this along
19:22:43 <fungi> i think the plan, as it stands, is to shoot for doing lists.opendev.org and lists.zuul-ci.org around early november, if things continue to go well and stay on track?
19:23:47 <clarkb> that wfm. I think we can switch opendev whenever we're comfortable with the deployment (general functionality, migration path, etc)
19:24:09 <fungi> and then we can consider doing other lists after the start of the year if we don't run into any problems with the already migrated sites
19:25:13 <ianw> ++ having done nothing to help with this, it sounds good :)
19:25:32 <ianw> thank you, and just from poking at the site it looks great
19:25:37 <fungi> thanks!
19:25:46 <fungi> #topic Jaeger tracing server (for Zuul)
19:26:06 <fungi> corvus: i saw the change to add the server and initial deployment bits merged?
19:29:26 <fungi> #link https://review.opendev.org/c/opendev/system-config/+/855983 Add Jaeger tracing server
19:30:18 <corvus> yep, have not launched yet, will do so when i have a spare moment
19:31:00 <fungi> thanks, i'll keep an eye out for the inventory and dns changes
19:31:24 <fungi> anything else to cover on that at the moment?
19:32:34 <fungi> sounds like no
19:32:46 <fungi> #topic Fedora 36 Rollout
19:33:24 <fungi> we found out today that it needs an extra package to work with glean's networkmanager implementation for non-dhcp providers
19:33:59 <ianw> yes, thanks for digging on that!
19:34:00 <fungi> #link https://review.opendev.org/858547 Install Fedora ifcfg NM compat package
19:34:01 <ianw> #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547
19:34:07 <ianw> jinx :)
19:34:17 <fungi> #undo
19:34:17 <opendevmeet> Removing item from minutes: #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547
19:34:58 <fungi> it was conflated with a separate problem where we crashed nl04 with a bad config when the weekend restart happened
19:35:05 <ianw> i have déjà vu on that ... I either vaguely knew about it or we've fixed something similar before, but i can't pinpoint it
19:35:18 <fungi> so we've not been able to boot anything in ovh and some other providers since early utc saturday
19:35:22 <clarkb> ya the OVH thing made it worse as ovh has a lot of dhcp resources
19:35:33 <ianw> (the missing package)
19:35:41 <fungi> or, rather, we weren't able to until it got spotted and fixed earlier today
19:36:04 <ianw> was the nl04 issue something that ultimately I did, missing bits of f36?
19:36:18 <fungi> and with the dib change approved and on its way to merge, i guess we'll need the usual dib release tagged, nodepool requirements bumped, new container images built and deployed
19:36:27 <clarkb> ianw: it was missing bits of rockylinux
19:36:37 <clarkb> ianw: and I think a few days ago nodepool became more strict about checking that stuff
19:36:39 <fungi> ianw: it was a missed rockylinux-9 label from the nested-virt addition
19:37:17 <fungi> whatever strictness was added would have been between saturday and whenever the previous restart of nl04's container was
19:37:56 <clarkb> https://review.opendev.org/c/zuul/nodepool/+/858577 is something I've just pushed that should have our CI for project-config catch it
19:38:00 <fungi> the bad config merged at the beginning of the month, but nodepool just spams exceptions about failing to load its config until you try to restart it
19:38:15 <fungi> then it fails to start up after being stopped
19:38:19 <clarkb> fungi: oh I see it crashed because it had to start over but it was probably angry before then too
19:38:30 <ianw> clarkb: excellent; thanks, that was my next question as to why the linter passed it :)
19:38:40 <fungi> yes, i saw the tracebacks in older logs, but i didn't look to see how far back it started complaining
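Side note on catching this earlier: nodepool also ships a built-in config checker that can be pointed at a generated config; if memory serves the invocation is roughly the following (the config path here is assumed):

  nodepool -c /etc/nodepool/nodepool.yaml config-validate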
19:40:04 <fungi> anyway, if we decide we want the networking fix for f36 sooner we can add that package to infra-package-needs, but i'm fine waiting. it's no longer urgent with nl04 back in a working state
19:40:36 <fungi> #topic Improving Ansible task runtime
19:40:57 <fungi> clarkb: you had a couple changes around this, looks like
19:41:16 <clarkb> ya this is still on the agenda mostly for https://review.opendev.org/c/opendev/system-config/+/857239
19:41:38 <fungi> #link https://review.opendev.org/857232 Use synchronize to copy test host/group vars
19:41:45 <clarkb> The idea behind that change is to more forcefully enable pipelining for infra ansible things since we don't care about other connection types
19:42:06 <clarkb> zuul won't make that change because it has many connection types to support and that isn't necessarily safe. But in our corner of the world it should be
19:42:11 <fungi> #link https://review.opendev.org/857239 More aggressively enable ansible pipelining
19:42:24 <clarkb> and then ya comments on whether or not people want faster file copying at the expense of more complicated job setup would be good
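For context, what "enabling pipelining" typically amounts to in Ansible is sketched below; the invocation and paths are assumptions for illustration, not taken from 857239 itself:

  # either per-invocation via the environment...
  export ANSIBLE_PIPELINING=True
  ansible-playbook -i /path/to/inventory some-playbook.yaml   # hypothetical paths
  # ...or persistently in ansible.cfg:
  #   [ssh_connection]
  #   pipelining = True
  # pipelining feeds modules to the remote interpreter over the existing ssh
  # connection's stdin instead of copying them to a temp file first, cutting
  # ssh round trips per task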
19:43:52 <corvus> we should do that asap, that change will bitrot quick
19:44:00 <corvus> (do that == decide whether to merge it)
19:44:23 <clarkb> ++ the mm3 change actually conflicts with it too iirc
19:44:33 <clarkb> as an example of how quickly it will run into trouble
19:44:34 <corvus> (that change == 857232)
19:45:40 <ianw> heh yes, and the venv stack previously mentioned will conflict too
19:46:13 <clarkb> I'm happy to do the legwork to coordinate and rebase things. Mostly just looking for feedback on whether or not we want to make the change in the first place
19:47:30 <clarkb> EOT for me
19:47:46 <fungi> i relate to corvus's review comment there. if it's an average savings of 3 minutes and just affects our deployment tests, that's not a huge savings (it's not exactly trivial either though, i'll grant)
19:48:31 <clarkb> Re time saving, the idea I had was that if we can trim a few places that are slow like that, then in aggregate maybe we end up saving 10 minutes per job
19:48:36 <clarkb> that is yet to be proven though
19:48:58 <clarkb> (other places for improvement are the multi node ssh key management stuff which our jobs use too)
19:50:05 <fungi> yeah, on the other hand, i don't think 857232 significantly adds to the complexity there. mainly it means we need to be mindful of what files need templating and what don't
19:50:20 <ianw> yeah multiple minutes, to me, is worth going after
19:50:32 <ianw> multiple seconds maybe not :)
19:50:40 <fungi> not to downplay the complexity, it's complex for sure and that means more we can trip over
19:51:11 <fungi> on balance i'm like +0.5 in favor
19:52:10 <clarkb> ack can follow up on the change
19:52:29 <fungi> okay, well we can hash it out in review but let's not take too long as corvus indicated
19:52:46 <clarkb> ++
19:52:47 <fungi> #topic Nodepool Builder Disk utilization
19:53:06 <fungi> the perennial discussion
19:53:20 <fungi> can we drop some images soon? or should we add more/bigger builders
19:53:21 <clarkb> I added this to the agenda after frickler discovered that the disk had filled on nb01 and nb02
19:53:48 <ianw> i think we're close on f35, at least
19:53:52 <clarkb> we've added two flavors of rocky since our last cleanup and our base disk utilization is higher. This makes it easier for the disk to fill if something goes wrong
19:53:55 <ianw> (current issues not withstanding)
19:54:01 <frickler> so one raw image is 20-22G
19:54:13 <frickler> the vmdk is about the same, plus ca. 14G for the qcow2
19:54:32 <frickler> makes 50-60GB per image, times 2 because we keep the last two
19:54:41 <clarkb> note the raw disk utilization size is closer to the qcow2 size due to fs/block magic
19:54:51 <frickler> we have 1T per builder
19:54:54 <clarkb> du vs ls show the difference
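As a quick illustration of that du vs ls difference on a sparse raw image (the path and filename here are made up for the example):

  ls -lh /opt/nodepool_dib/ubuntu-jammy-0000000001.raw   # apparent size, e.g. ~20G
  du -h  /opt/nodepool_dib/ubuntu-jammy-0000000001.raw   # blocks actually allocated, closer to the qcow2 size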
19:55:28 <frickler> one of the problems I think is that the copies aren't necessarily distributed
19:55:32 <fungi> i'll note that gentoo hasn't successfully built for almost 2 months (nodes still seem to be booting but the image is very stale now)
19:55:45 <frickler> gentoo builds are disabled
19:55:59 <frickler> they still need some fix, I don't remember the details
19:56:03 <clarkb> frickler: yes, when the builders are happy they pretty evenly distribute things. But as soon as one has problems all the load shifts to the other and it then falls over
19:56:19 <fungi> opensuse images seem to have successfully built today after two weeks of not being able to
19:56:22 <clarkb> Adding another builder or more disk would alleviate that aspect of the issue
19:56:31 <fungi> but that was likely due to the builders filling up
19:56:55 <ianw> I have https://review.opendev.org/c/zuul/nodepool/+/764280 that i've never got back to that would stop a builder trying to start a build if it knew there wasn't enough space
19:57:06 <frickler> the disks were full for two weeks, so no images got built during that time
19:57:14 <fungi> right, makes sense
19:57:28 <frickler> but all builds look green now in grafana
19:58:10 <fungi> are there concerns with adding another builder?
19:58:28 <frickler> the other option would be to just add another disk on the existing ones?
19:58:42 <fungi> yeah, though that doesn't get us faster build throughput
19:58:44 <clarkb> one upside to adding a new builder would be we could deploy it on jammy and make sure everything is working
19:58:53 <clarkb> then replace the older two
19:58:54 <frickler> because we don't seem to have an issue with buildtime yet
19:59:13 <clarkb> Adding more disk is probably the easiest thing to do though
19:59:15 <frickler> ah, upgrading is a good reason for new one(s)
19:59:26 <ianw> this is true.  i think the 1tb limit is just a self-imposed thing, perhaps related to the maximum cinder volume size in rax, but i don't think it's a special number
19:59:34 <fungi> but any idea what % of the day is spent building images on each builder? we won't know that we're out of build bandwidth until we are
19:59:49 <clarkb> fungi: it takes about an hour to build each image
20:00:04 <clarkb> total images / 2 ~= active hours of the day per builder
20:00:07 <fungi> and we're at ~16 images now
20:00:25 <fungi> so yeah, we're theoretically at 33% capacity for build bandwidth
20:00:55 <fungi> i agree more space would be fine for now in that case, unless we just want to be able to get images online a bit faster
20:01:25 <frickler> oh, right, we should check upload times
20:01:35 <frickler> I don't think that is included in the 1h
20:01:46 <clarkb> it isn't
20:01:58 <clarkb> but we do parallelize those, whereas builds are serial
20:02:13 <clarkb> there is still an upper limit though. I don't think it is the one we are likely to hit first. But checking is a good idea
20:02:20 <fungi> and we still need to replace the arm builder right? does it need more space too?
20:02:54 <clarkb> yes it needs to be replaced. Still waiting on the new cloud to put it in unless we move it to osuosl
20:02:55 <ianw> that will need replacement as linaro goes, yep.  that seems to be a wip
20:03:04 <fungi> also note that we're roughly three minutes over the end of the meeting
20:03:17 <clarkb> and its disk does much better since we don't need the vhd and qcow images
20:03:21 <clarkb> we use raw only for arm64 iirc
20:03:38 <fungi> so sounds like some consensus on adding a second volume for nb01 and nb02. i can try to find time for that later this week
20:04:14 <fungi> it's lvm so easy to just grow in-place
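For reference, the grow would look roughly like this; the volume size, device name and VG/LV names are placeholders rather than the real ones on nb01/nb02:

  openstack volume create --size 1024 nb01-extra             # new cinder volume
  openstack server add volume nb01.opendev.org nb01-extra    # attach it to the builder
  pvcreate /dev/xvdc                                          # whichever device the attachment appears as
  vgextend main /dev/xvdc                                     # add it to the existing volume group
  lvextend -r -l +100%FREE /dev/main/nodepool                 # grow the LV and resize the fs in one step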
20:04:46 <fungi> #topic Open Discussion
20:05:00 <fungi> if nobody has anything else i'll close the meeting
20:05:32 <clarkb> I'm good. Thank you for stepping in today
20:05:44 <ianw> #link https://review.opendev.org/c/zuul/zuul/+/858243
20:06:08 <ianw> is a zuul one; thanks for reviews on the recent console web-stack -- that one is a fix for narrow screens on the console
20:06:21 <ianw> there's others on top, but they're opinions, not fixes :)
20:07:02 <ianw> and thanks to fungi for meeting too :)
20:07:17 <fungi> thanks everyone! enjoy the rest of your week
20:07:19 <fungi> #endmeeting