19:00:29 <fungi> #startmeeting infra
19:00:29 <opendevmeet> Meeting started Tue Sep 20 19:00:29 2022 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:29 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:30 <opendevmeet> The meeting name has been set to 'infra'
19:01:02 <fungi> #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000361.html Our agenda
19:01:18 <fungi> #topic Announcements
19:01:34 <fungi> anybody have anything that needs announcing?
19:02:20 <fungi> i'll take your silence as a resounding negative
19:02:35 <fungi> #topic Actions from last meeting
19:02:50 <ianw> o/
19:03:46 <fungi> #link https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html
19:03:56 <fungi> looks like that says "none" so i think we can move on
19:04:07 <fungi> #topic Specs Review
19:04:50 <fungi> i don't see any recently proposed nor recently merged
19:05:00 <fungi> anyone have a spec on the way?
19:05:24 <fungi> the in progress specs are covered in upcoming topics anyway
19:05:36 <fungi> so i guess let's jump straight to those
19:05:58 <fungi> #topic Bastion host
19:06:05 <fungi> ianw: looks like you had some questions there?
19:06:32 <ianw> i have found some time to move this forward a bit
19:06:37 <fungi> excellent!
19:06:54 <ianw> #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:07:27 <fungi> that is quite the series of changes
19:07:46 <ianw> is the stack. it's doing a few things -- moving our ansible into a venv, updating all our testing environments to a jammy-based bridge, and abstracting things so we can replace bridge.openstack.org with another fresh jammy host
19:08:00 <ianw> probably called bridge.opendev.org
19:08:04 <fungi> look lie clarkb has already reviewed the first several
19:08:12 <fungi> er, like
19:08:31 <ianw> yeah, the final change isn't 100% ready yet but hopefully soon
19:08:52 <fungi> 852477 is getting a post_failure on run-base. is that expected? or unrelated?
19:09:27 <ianw> i'm on PTO after today, until 10/3, so i don't expect much progress, but after that i think we'll be pretty much ready to swap out bridge
19:09:43 <fungi> sounds great, i'll try to set aside some time to go through those
19:10:07 <ianw> 852477 is not expected to fail :) that is probably unrelated
19:10:21 <fungi> enjoy your time afk!
19:10:50 <fungi> i'm hoping to do something like that myself at the end of next month
19:11:04 <ianw> it should all be a big no-op; *should* being the operative term :)
19:12:17 <fungi> awesome, anything else specific you want to call out there?
19:12:35 <fungi> the agenda mentions venv for openstacksdk, upgrading the server distro version...
19:13:23 <clarkb> I think the change stack does a lot of that
19:13:32 <clarkb> so now that we have changes we can probably shift to reviewing things there
19:14:57 <fungi> sounds good. i guess we can proceed to the next topic
19:15:19 <fungi> thanks for putting that together ianw!
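[Editor's note: for readers unfamiliar with the venv approach discussed in the bastion topic above, here is a minimal hedged sketch of installing Ansible into a dedicated virtualenv on a bridge-style host. The target path and the unpinned install are illustrative assumptions, not taken from the actual bridge-ansible-venv change series.]

```python
# Hypothetical sketch only: create a standalone venv and install Ansible
# into it, roughly the idea behind running bridge's Ansible from a venv.
# The path and the lack of version pinning are assumptions for illustration.
import subprocess
import venv

VENV_PATH = "/usr/local/ansible-venv"  # hypothetical location

venv.EnvBuilder(with_pip=True).create(VENV_PATH)
subprocess.run(
    [f"{VENV_PATH}/bin/pip", "install", "--upgrade", "pip", "ansible"],
    check=True,
)
# Playbooks would then be run with the venv's own entry point, e.g.
# /usr/local/ansible-venv/bin/ansible-playbook ...
```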
19:15:23 <fungi> #topic Upgrading Bionic servers to Focal/Jammy
19:15:39 <clarkb> I think the only new thing related to this topic is what ianw just covered above
19:15:43 <fungi> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:16:15 <fungi> yep
19:17:24 <fungi> #topic Mailman 3
19:17:47 <fungi> this is coming along nicely
19:18:11 <fungi> i was able to hold a node from the latest patchset of the implementing change and import production copies of all our data into it
19:18:46 <fungi> most full site migrations will go quickly. i think i clocked lists.opendev.org in at around 8 minutes (i'd have to double-check my notes)
19:19:02 <fungi> the slowest of course will be lists.openstack.org, which took about 2.5 hours when i tested it
19:19:21 <fungi> the stable branch failures list accounts for about half that time on its own
19:20:11 <fungi> the only outstanding errors were around a few of the older openstack lists having some rejection/moderation message templates which were too large for the default column width in the db
19:20:33 <fungi> for those, i think i'll just fix the current production ones to be smaller (it was a total of 3, so not a big deal)
19:21:07 <fungi> i still need to find some time to test that the legacy pipermail redirects work, and test that mysqldump works on the imported data set
19:21:54 <fungi> the held server with the full imported data set is 104.239.143.143 in case anyone wants to poke at it
19:21:59 <clarkb> sounds like great progress. Thank you for continuing to push this along
19:22:43 <fungi> i think the plan, as it stands, is to shoot for doing lists.opendev.org and lists.zuul-ci.org around early november, if things continue to go well and stay on track?
19:23:47 <clarkb> that wfm. I think we can switch opendev whenever we're comfortable with the deployment (general functionality, migration path, etc)
19:24:09 <fungi> and then we can consider doing other lists after the start of the year if we don't run into any problems with the already migrated sites
19:25:13 <ianw> ++ having done nothing to help with this, it sounds good :)
19:25:32 <ianw> thank you, and just from poking at the site it looks great
19:25:37 <fungi> thanks!
19:25:46 <fungi> #topic Jaeger tracing server (for Zuul)
19:26:06 <fungi> corvus: i saw the change to add the server and initial deployment bits merged?
19:29:26 <fungi> #link https://review.opendev.org/c/opendev/system-config/+/855983 Add Jaeger tracing server
19:30:18 <corvus> yep, have not launched yet, will do so when i have a spare moment
19:31:00 <fungi> thanks, i'll keep an eye out for the inventory and dns changes
19:31:24 <fungi> anything else to cover on that at the moment?
19:32:34 <fungi> sounds like no
19:32:46 <fungi> #topic Fedora 36 Rollout
19:33:24 <fungi> we found out today that it needs an extra package to work with glean's networkmanager implementation for non-dhcp providers
19:33:59 <ianw> yes, thanks for digging on that!
19:34:00 <fungi> #link https://review.opendev.org/858547 Install Fedora ifcfg NM compat package
19:34:01 <ianw> #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547
19:34:07 <ianw> jinx :)
19:34:17 <fungi> #undo
19:34:17 <opendevmeet> Removing item from minutes: #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547
19:34:58 <fungi> it was conflated with a separate problem where we crashed nl04 with a bad config when the weekend restart happened
19:35:05 <ianw> i have déjà vu on that ... I either vaguely knew about it or we've fixed something similar before, but i can't pinpoint it
19:35:18 <fungi> so we've not been able to boot anything in ovh and some other providers since early utc saturday
19:35:22 <clarkb> ya the OVH thing made it worse as ovh has a lot of dhcp resources
19:35:33 <ianw> (the missing package)
19:35:41 <fungi> or, rather, we weren't able to until it got spotted and fixed earlier today
19:36:04 <ianw> was the nl04 something that ultimately I did, missing bits of f36?
19:36:18 <fungi> and with the dib change approved and on its way to merge, i guess we'll need the usual dib release tagged, nodepool requirements bumped, new container images built and deployed
19:36:27 <clarkb> ianw: it was missing bits of rockylinux
19:36:37 <clarkb> ianw: and I think a few days ago nodepool became more strict about checking that stuff
19:36:39 <fungi> ianw: it was a missed rockylinux-9 label from the nested-virt addition
19:37:17 <fungi> whatever strictness was added would have been between saturday and whenever the previous restart of nl04's container was
19:37:56 <clarkb> https://review.opendev.org/c/zuul/nodepool/+/858577 is something I've just pushed that should have our CI for project-config catch it
19:38:00 <fungi> the bad config merged at the beginning of the month, but nodepool only spams exceptions about failing to load its config until you try to restart it
19:38:15 <fungi> then it fails to start up after being stopped
19:38:19 <clarkb> fungi: oh I see it crashed because it had to start over but it was probably angry before then too
19:38:30 <ianw> clarkb: excellent; thanks, that was my next question as to why the linter passed it :)
19:38:40 <fungi> yes, i saw the tracebacks in older logs, but i didn't look to see how far back it started complaining
19:40:04 <fungi> anyway, if we decide we want the networking fix for f36 sooner we can add that package to infra-package-needs, but i'm fine waiting. it's no longer urgent with nl04 back in a working state
19:40:36 <fungi> #topic Improving Ansible task runtime
19:40:57 <fungi> clarkb: you had a couple changes around this, looks like
19:41:16 <clarkb> ya this is still on the agenda mostly for https://review.opendev.org/c/opendev/system-config/+/857239
19:41:38 <fungi> #link https://review.opendev.org/857232 Use synchronize to copy test host/group vars
19:41:45 <clarkb> The idea behind that change is to more forcefully enable pipelining for infra ansible things since we don't care about other connection types
19:42:06 <clarkb> zuul won't make that change because it has many connection types to support and that isn't necessarily safe. But in our corner of the world it should be
19:42:11 <fungi> #link https://review.opendev.org/857239 More aggressively enable ansible pipelining
19:42:24 <clarkb> and then ya comments on whether or not people want faster file copying at the expense of more complicated job setup would be good
19:43:52 <corvus> we should do that asap, that change will bitrot quick
19:44:00 <corvus> (do that == decide whether to merge it)
19:44:23 <clarkb> ++ the mm3 change actually conflicts with it too iirc
19:44:33 <clarkb> as an example of how quickly it will run into trouble
19:44:34 <corvus> (that change == 857232)
19:45:40 <ianw> heh yes and the stack previously mentioned about venv stuff will
19:46:13 <clarkb> I'm happy to do the legwork to coordinate and rebase things. Mostly just looking for feedback on whether or not we want to make the change in the first place
19:47:30 <clarkb> EOT for me
19:47:46 <fungi> i relate to corvus's review comment there. if it's an average savings of 3 minutes and just affects our deployment tests, that's not a huge savings (it's not exactly trivial either though, i'll grant)
19:48:31 <clarkb> Re time saving the idea I had was that if we can trim a few places that are slow like that then in aggregate maybe we end up saving 10 minutes per job
19:48:36 <clarkb> that is yet to be proven though
19:48:58 <clarkb> (other places for improvement are the multi node ssh key management stuff which our jobs use too)
19:50:05 <fungi> yeah, on the other hand, i don't think 857232 significantly adds to the complexity there. mainly it means we need to be mindful of what files need templating and what don't
19:50:20 <ianw> yeah multiple minutes, to me, is worth going after
19:50:32 <ianw> multiple seconds maybe not :)
19:50:40 <fungi> not to downplay the complexity, it's complex for sure and that means more we can trip over
19:51:11 <fungi> on balance i'm like +0.5 in favor
19:52:10 <clarkb> ack can follow up on the change
19:52:29 <fungi> okay, well we can hash it out in review but let's not take too long as corvus indicated
19:52:46 <clarkb> ++
19:52:47 <fungi> #topic Nodepool Builder Disk utilization
19:53:06 <fungi> the perennial discussion
19:53:20 <fungi> can we drop some images soon? or should we add more/bigger builders
19:53:21 <clarkb> I added this to the agenda after frickler discovered that the disk had filled on nb01 and nb02
19:53:48 <ianw> i think we're close on f35, at least
19:53:52 <clarkb> we've added two flavors of rocky since our last cleanup and our base disk utilization is higher. This makes it easier for the disk to fill if something goes wrong
19:53:55 <ianw> (current issues notwithstanding)
19:54:01 <frickler> so one raw image is 20-22G
19:54:13 <frickler> vmdk the same plus ca. 14G qcow2
19:54:32 <frickler> makes 50-60GB per image, times 2 because we keep the last two
19:54:41 <clarkb> note the raw disk utilization size is closer to the qcow2 size due to fs/block magic
19:54:51 <frickler> we have 1T per builder
19:54:54 <clarkb> du vs ls show the difference
19:55:28 <frickler> one of the problems I think is that the copies aren't necessarily distributed
19:55:32 <fungi> i'll note that gentoo hasn't successfully built for almost 2 months (nodes still seem to be booting but the image is very stale now)
19:55:45 <frickler> gentoo builds are disabled
19:55:59 <frickler> they still need some fix, I don't remember the details
19:56:03 <clarkb> frickler: yes, when the builders are happy they pretty evenly distribute things. But as soon as one has problems all the load shifts to the other and it then falls over
19:56:19 <fungi> opensuse images seem to have successfully built today after two weeks of not being able to
19:56:22 <clarkb> Adding another builder or more disk would alleviate that aspect of the issue
19:56:31 <fungi> but that was likely due to the builders filling up
19:56:55 <ianw> I have https://review.opendev.org/c/zuul/nodepool/+/764280 that i've never got back to that would stop a builder trying to start a build if it knew there wasn't enough space
19:57:06 <frickler> the disks were full for two weeks, so no images got built during that time
19:57:14 <fungi> right, makes sense
19:57:28 <frickler> but all builds looked green now in grafana
19:58:10 <fungi> are there concerns with adding another builder?
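[Editor's note: a rough worked example of the disk figures frickler quotes above. The per-builder image count is an assumption for illustration, based on roughly 16 image names spread across two builders.]

```python
# Back-of-the-envelope math for the builder disk discussion; the split of
# images across builders is an assumption, not a measured value.
RAW_GB = 21      # raw image, roughly 20-22G
VMDK_GB = 21     # vmdk is about the same size as the raw image
QCOW2_GB = 14    # qcow2 adds roughly 14G

per_build_gb = RAW_GB + VMDK_GB + QCOW2_GB   # ~50-60GB per image build
per_image_gb = per_build_gb * 2              # the last two builds are kept
images_per_builder = 8                       # assumed: ~16 images over 2 builders

print(f"~{per_build_gb} GB per build, ~{per_image_gb} GB per image name")
print(f"~{per_image_gb * images_per_builder} GB per builder vs a 1 TB volume")
```

Actual on-disk usage is somewhat lower because the raw files are sparse (clarkb's du vs ls point), but it shows why a 1 TB volume fills quickly once one builder takes over the other's share of the images.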
19:58:28 <frickler> the other option would be to just add another disk on the existing ones?
19:58:42 <fungi> yeah, though that doesn't get us faster build throughput
19:58:44 <clarkb> one upside to adding a new builder would be we could deploy it on jammy and make sure everything is working
19:58:53 <clarkb> then replace the older two
19:58:54 <frickler> because we don't seem to have an issue with buildtime yet
19:59:13 <clarkb> Adding more disk is probably the easiest thing to do though
19:59:15 <frickler> ah, upgrading is a good reason for new one(s)
19:59:26 <ianw> this is true. i think the 1tb limit is just a self-imposed thing, perhaps related to the maximum cinder volume size in rax, but i don't think it's a special number
19:59:34 <fungi> but any idea what % of the day is spent building images on each builder? we won't know that we're out of build bandwidth until we are
19:59:49 <clarkb> fungi: it takes about an hour to build each image
20:00:04 <clarkb> total images / 2 ~= active hours of the day per builder
20:00:07 <fungi> and we're at ~16 images now
20:00:25 <fungi> so yeah, we're theoretically at 33% capacity for build bandwidth
20:00:55 <fungi> i agree more space would be fine for now in that case, unless we just want to be able to get images online a bit faster
20:01:25 <frickler> oh, right, we should check upload times
20:01:35 <frickler> I don't think that is included in the 1h
20:01:46 <clarkb> it isn't
20:01:58 <clarkb> we do parallelize those, but builds are serial
20:02:13 <clarkb> there is still an upper limit though. I don't think it is the one we are likely to hit first. But checking is a good idea
20:02:20 <fungi> and we still need to replace the arm builder right? does it need more space too?
20:02:54 <clarkb> yes it needs to be replaced. Still waiting on the new cloud to put it in unless we move it to osuosl
20:02:55 <ianw> that will need replacement as linaro goes, yep. that seems to be a wip
20:03:04 <fungi> also note that we're roughly three minutes over the end of the meeting
20:03:17 <clarkb> and its disk does much better since we don't need the vhd and qcow images
20:03:21 <clarkb> we use raw only for arm64 iirc
20:03:38 <fungi> so sounds like some consensus on adding a second volume for nb01 and nb02. i can try to find time for that later this week
20:04:14 <fungi> it's lvm so easy to just grow in-place
20:04:46 <fungi> #topic Open Discussion
20:05:00 <fungi> if nobody has anything else i'll close the meeting
20:05:32 <clarkb> I'm good. Thank you for stepping in today
20:05:44 <ianw> #link https://review.opendev.org/c/zuul/zuul/+/858243
20:06:08 <ianw> is a zuul one; thanks for reviews on the recent console web-stack -- that one is a fix for narrow screens on the console
20:06:21 <ianw> there's others on top, but they're opinions, not fixes :)
20:07:02 <ianw> and thanks to fungi for meeting too :)
20:07:17 <fungi> thanks everyone! enjoy the rest of your week
20:07:19 <fungi> #endmeeting