19:00:29 #startmeeting infra
19:00:29 Meeting started Tue Sep 20 19:00:29 2022 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:29 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:30 The meeting name has been set to 'infra'
19:01:02 #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000361.html Our agenda
19:01:18 #topic Announcements
19:01:34 anybody have anything that needs announcing?
19:02:20 i'll take your silence as a resounding negative
19:02:35 #topic Actions from last meeting
19:02:50 o/
19:03:46 #link https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html
19:03:56 looks like that says "none" so i think we can move on
19:04:07 #topic Specs Review
19:04:50 i don't see any recently proposed nor recently merged
19:05:00 anyone have a spec on the way?
19:05:24 the in progress specs are covered in upcoming topics anyway
19:05:36 so i guess let's jump straight to those
19:05:58 #topic Bastion host
19:06:05 ianw: looks like you had some questions there?
19:06:32 i have found some time to move this forward a bit
19:06:37 excellent!
19:06:54 #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:07:27 that is quite the series of changes
19:07:46 is the stack. it's doing a few things -- moving our ansible into a venv, updating all our testing environments to a jammy-based bridge, and abstracting things so we can replace bridge.openstack.org with another fresh jammy host
19:08:00 probably called bridge.opendev.org
19:08:04 look lie clarkb has already reviewed the first several
19:08:12 er, like
19:08:31 yeah, the final change isn't 100% ready yet but hopefully soon
19:08:52 852477 is getting a post_failure on run-base. is that expected? or unrelated?
19:09:27 i'm on PTO after today, until 10/3, so i don't expect much progress, but after that i think we'll be pretty much ready to swap out bridge
19:09:43 sounds great, i'll try to set aside some time to go through those
19:10:07 852477 is not expected to fail :) that is probably unrelated
19:10:21 enjoy your time afk!
19:10:50 i'm hoping to do something like that myself at the end of next month
19:11:04 it should all be a big no-op; *should* being the operative term :)
19:12:17 awesome, anything else specific you want to call out there?
19:12:35 the agenda mentions venv for openstacksdk, upgrading the server distro version...
19:13:23 I think the change stack does a lot of that
19:13:32 so now that we have changes we can probably shift to reviewing things there
19:14:57 sounds good. i guess we can proceed to the next topic
19:15:19 thanks for putting that together ianw!
19:15:23 #topic Upgrading Bionic servers to Focal/Jammy
19:15:39 I think the only new thing related to this topic is what ianw just covered above
19:15:43 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades
19:16:15 yep
19:17:24 #topic Mailman 3
19:17:47 this is coming along nicely
19:18:11 i was able to hold a node from the latest patchset of the implementing change and import production copies of all our data into it
19:18:46 most full site migrations will go quickly. i think i clocked lists.opendev.org in at around 8 minutes (i'd have to double-check my notes)
19:19:02 the slowest of course will be lists.openstack.org, which took about 2.5 hours when i tested it
19:19:21 the stable branch failures list accounts for about half that time on its own
19:20:11 the only outstanding errors were around a few of the older openstack lists having some rejection/moderation message templates which were too large for the default column width in the db
19:20:33 for those, i think i'll just fix the current production ones to be smaller (it was a total of 3, so not a big deal)
19:21:07 i still need to find some time to test that the legacy pipermail redirects work, and test that mysqldump works on the imported data set
19:21:54 the held server with the full imported data set is 104.239.143.143 in case anyone wants to poke at it
19:21:59 sounds like great progress. Thank you for continuing to push this along
19:22:43 i think the plan, as it stands, is to shoot for doing lists.opendev.org and lists.zuul-ci.org around early november, if things continue to go well and stay on track?
19:23:47 that wfm. I think we can switch opendev whenever we're comfortable with the deployment (general functionality, migration path, etc)
19:24:09 and then we can consider doing other lists after the start of the year if we don't run into any problems with the already migrated sites
19:25:13 ++ having done nothing to help with this, it sounds good :)
19:25:32 thank you, and just from poking at the site it looks great
19:25:37 thanks!
19:25:46 #topic Jaeger tracing server (for Zuul)
19:26:06 corvus: i saw the change to add the server and initial deployment bits merged?
19:29:26 #link https://review.opendev.org/c/opendev/system-config/+/855983 Add Jaeger tracing server
19:30:18 yep, have not launched yet, will do so when i have a spare moment
19:31:00 thanks, i'll keep an eye out for the inventory and dns changes
19:31:24 anything else to cover on that at the moment?
19:32:34 sounds like no
19:32:46 #topic Fedora 36 Rollout
19:33:24 we found out today that it needs an extra package to work with glean's networkmanager implementation for non-dhcp providers
19:33:59 yes, thanks for digging on that!
19:34:00 #link https://review.opendev.org/858547 Install Fedora ifcfg NM compat package
19:34:01 #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547
19:34:07 jinx :)
19:34:17 #undo
19:34:17 Removing item from minutes: #link https://review.opendev.org/c/openstack/diskimage-builder/+/858547
19:34:58 it was conflated with a separate problem where we crashed nl04 with a bad config when the weekend restart happened
19:35:05 i have déjà vu on that ... I either vaguely knew about it or we've fixed something similar before, but i can't pinpoint it
19:35:18 so we've not been able to boot anything in ovh and some other providers since early utc saturday
19:35:22 ya the OVH thing made it worse as ovh has a lot of dhcp resources
19:35:33 (the missing package)
19:35:41 or, rather, we weren't able to until it got spotted and fixed earlier today
19:36:04 was the nl04 something that ultimately I did, missing bits of f36?
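[Note: the nl04 outage mentioned just above came from a provider configuration that nodepool only rejects when the launcher restarts, as explained in the exchange that follows. A minimal sketch of the kind of label-consistency check that project-config CI could run against such a config -- purely illustrative, assuming PyYAML and the usual nodepool config layout, and not the actual lint change discussed below -- might look like this:]

    import sys

    import yaml  # assumes PyYAML is available

    def undefined_pool_labels(path):
        """Report labels referenced by provider pools but never defined in
        the top-level labels section of a nodepool config file."""
        with open(path) as f:
            conf = yaml.safe_load(f) or {}
        defined = {label["name"] for label in conf.get("labels", [])}
        referenced = {
            pool_label["name"]
            for provider in conf.get("providers", [])
            for pool in provider.get("pools", [])
            for pool_label in pool.get("labels", [])
        }
        return sorted(referenced - defined)

    if __name__ == "__main__":
        missing = undefined_pool_labels(sys.argv[1])
        if missing:
            print("pool labels with no top-level definition:", ", ".join(missing))
            sys.exit(1)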
19:36:18 and with the dib change approved and on its way to merge, i guess we'll need the usual dib release tagged, nodepool requirements bumped, new container images built and deployed
19:36:27 ianw: it was missing bits of rockylinux
19:36:37 ianw: and I think a few days ago nodepool became more strict about checking that stuff
19:36:39 ianw: it was a missed rockylinux-9 label from the nested-virt addition
19:37:17 whatever strictness was added would have been between saturday and whenever the previous restart of nl04's container was
19:37:56 https://review.opendev.org/c/zuul/nodepool/+/858577 is something I've just pushed that should have our CI for project-config catch it
19:38:00 the bad config merged at the beginning of the month, but nodepool only spams exceptions about failing to load its config until you try to restart it
19:38:15 then it fails to start up after being stopped
19:38:19 fungi: oh I see it crashed because it had to start over but it was probably angry before then too
19:38:30 clarkb: excellent; thanks, that was my next question as to why the linter passed it :)
19:38:40 yes, i saw the tracebacks in older logs, but i didn't look to see how far back it started complaining
19:40:04 anyway, if we decide we want the networking fix for f36 sooner we can add that package to infra-package-needs, but i'm fine waiting. it's no longer urgent with nl04 back in a working state
19:40:36 #topic Improving Ansible task runtime
19:40:57 clarkb: you had a couple changes around this, looks like
19:41:16 ya this is still on the agenda mostly for https://review.opendev.org/c/opendev/system-config/+/857239
19:41:38 #link https://review.opendev.org/857232 Use synchronize to copy test host/group vars
19:41:45 The idea behind that change is to more forcefully enable pipelining for infra ansible things since we don't care about other connection types
19:42:06 zuul won't make that change because it has many connection types to support and that isn't necessarily safe. But in our corner of the world it should be
19:42:11 #link https://review.opendev.org/857239 More aggressively enable ansible pipelining
19:42:24 and then ya comments on whether or not people want faster file copying at the expense of more complicated job setup would be good
19:43:52 we should do that asap, that change will bitrot quick
19:44:00 (do that == decide whether to merge it)
19:44:23 ++ the mm3 change actually conflicts with it too iirc
19:44:33 as an example of how quickly it will run into trouble
19:44:34 (that change == 857232)
19:45:40 heh yes and the stack previously mentioned about venv stuff will
19:46:13 I'm happy to do the legwork to coordinate and rebase things. Mostly just looking for feedback on whether or not we want to make the change in the first place
19:47:30 EOT for me
19:47:46 i relate to corvus's review comment there. if it's an average savings of 3 minutes and just affects our deployment tests, that's not a huge savings (it's not exactly trivial either though, i'll grant)
19:48:31 Re time saving, the idea I had was that if we can trim a few places that are slow like that then in aggregate maybe we end up saving 10 minutes per job
19:48:36 that is yet to be proven though
19:48:58 (other places for improvement are the multi node ssh key management stuff which our jobs use too)
19:50:05 yeah, on the other hand, i don't think 857232 significantly adds to the complexity there. mainly it means we need to be mindful of what files need templating and what don't
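[Note: the aggregate-savings reasoning a few lines up can be made concrete with a quick estimate. Only the 3-minute figure comes from the review discussion; the other entries and the job count are assumptions for illustration:]

    # Rough model of how individually modest per-job savings add up.
    assumed_savings_minutes = {
        "synchronize for test host/group vars (857232)": 3.0,  # from the review comment
        "forcing ssh pipelining (857239)": 2.0,                # assumption
        "multi-node ssh key management": 2.0,                  # assumption
    }
    jobs_per_day = 40  # assumption, purely illustrative

    per_job = sum(assumed_savings_minutes.values())
    print(f"~{per_job:.0f} minutes saved per job run")
    print(f"~{per_job * jobs_per_day / 60:.1f} node-hours saved per day "
          f"at {jobs_per_day} jobs/day")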
19:50:20 yeah multiple minutes, to me, is worth going after
19:50:32 multiple seconds maybe not :)
19:50:40 not to downplay the complexity, it's complex for sure and that means more we can trip over
19:51:11 on balance i'm like +0.5 in favor
19:52:10 ack can follow up on the change
19:52:29 okay, well we can hash it out in review but let's not take too long as corvus indicated
19:52:46 ++
19:52:47 #topic Nodepool Builder Disk utilization
19:53:06 the perennial discussion
19:53:20 can we drop some images soon? or should we add more/bigger builders
19:53:21 I added this to the agenda after frickler discovered that the disk had filled on nb01 and nb02
19:53:48 i think we're close on f35, at least
19:53:52 we've added two flavors of rocky since our last cleanup and our base disk utilization is higher. This makes it easier for the disk to fill if something goes wrong
19:53:55 (current issues notwithstanding)
19:54:01 so one raw image is 20-22G
19:54:13 vmdk the same plus ca. 14G qcow2
19:54:32 makes 50-60GB per image, times 2 because we keep the last two
19:54:41 note the raw disk utilization size is closer to the qcow2 size due to fs/block magic
19:54:51 we have 1T per builder
19:54:54 du vs ls show the difference
19:55:28 one of the problems I think is that the copies aren't necessarily distributed
19:55:32 i'll note that gentoo hasn't successfully built for almost 2 months (nodes still seem to be booting but the image is very stale now)
19:55:45 gentoo builds are disabled
19:55:59 they still need some fix, I don't remember the details
19:56:03 frickler: yes, when the builders are happy they pretty evenly distribute things. But as soon as one has problems all the load shifts to the other and it then falls over
19:56:19 opensuse images seem to have successfully built today after two weeks of not being able to
19:56:22 Adding another builder or more disk would alleviate that aspect of the issue
19:56:31 but that was likely due to the builders filling up
19:56:55 I have https://review.opendev.org/c/zuul/nodepool/+/764280 that i've never gotten back to, which would stop a builder trying to start a build if it knew there wasn't enough space
19:57:06 the disks were full for two weeks, so no images got built during that time
19:57:14 right, makes sense
19:57:28 but all builds look green now in grafana
19:58:10 are there concerns with adding another builder?
19:58:28 the other option would be to just add another disk on the existing ones?
19:58:42 yeah, though that doesn't get us faster build throughput
19:58:44 one upside to adding a new builder would be we could deploy it on jammy and make sure everything is working
19:58:53 then replace the older two
19:58:54 because we don't seem to have an issue with buildtime yet
19:59:13 Adding more disk is probably the easiest thing to do though
19:59:15 ah, upgrading is a good reason for new one(s)
19:59:26 this is true. i think the 1tb limit is just a self-imposed thing, perhaps related to the maximum cinder volume size in rax, but i don't think it's a special number
19:59:34 but any idea what % of the day is spent building images on each builder? we won't know that we're out of build bandwidth until we are
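[Note: a quick back-of-the-envelope on the image sizes quoted earlier in this topic. These are the nominal ls sizes; as noted above, actual du usage for the raw images comes out closer to the qcow2 size:]

    raw_gb = 21    # "one raw image is 20-22G"
    vmdk_gb = 21   # "vmdk the same"
    qcow2_gb = 14  # "plus ca. 14G qcow2"
    kept = 2       # the last two builds of each image are kept
    builder_disk_gb = 1000  # "we have 1T per builder"

    per_image_gb = (raw_gb + vmdk_gb + qcow2_gb) * kept
    print(f"~{per_image_gb} GB per image at nominal sizes")
    print(f"roughly {builder_disk_gb // per_image_gb} images fit per builder, "
          "so an uneven split between nb01 and nb02 fills a disk quickly")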
19:59:49 fungi: it takes about an hour to build each image
20:00:04 total images / 2 ~= active hours of the day per builder
20:00:07 and we're at ~16 images now
20:00:25 so yeah, we're theoretically at 33% capacity for build bandwidth
20:00:55 i agree more space would be fine for now in that case, unless we just want to be able to get images online a bit faster
20:01:25 oh, right, we should check upload times
20:01:35 I don't think that is included in the 1h
20:01:46 it isn't
20:01:58 but we do parallelize those, while builds are serial
20:02:13 there is still an upper limit though. I don't think it is the one we are likely to hit first. But checking is a good idea
20:02:20 and we still need to replace the arm builder right? does it need more space too?
20:02:54 yes it needs to be replaced. Still waiting on the new cloud to put it in, unless we move it to osuosl
20:02:55 that will need replacement as linaro goes, yep. that seems to be a wip
20:03:04 also note that we're roughly three minutes over the end of the meeting
20:03:17 and its disk does much better since we don't need the vhd and qcow images
20:03:21 we use raw only for arm64 iirc
20:03:38 so sounds like some consensus on adding a second volume for nb01 and nb02. i can try to find time for that later this week
20:04:14 it's lvm so easy to just grow in-place
20:04:46 #topic Open Discussion
20:05:00 if nobody has anything else i'll close the meeting
20:05:32 I'm good. Thank you for stepping in today
20:05:44 #link https://review.opendev.org/c/zuul/zuul/+/858243
20:06:08 is a zuul one; thanks for reviews on the recent console web-stack -- that one is a fix for narrow screens on the console
20:06:21 there's others on top, but they're opinions, not fixes :)
20:07:02 and thanks to fungi for meeting too :)
20:07:17 thanks everyone! enjoy the rest of your week
20:07:19 #endmeeting
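[Note: for reference, the 33% build-bandwidth figure from the last topic works out from the numbers stated in the meeting roughly like this:]

    images = 16           # "we're at ~16 images now"
    builders = 2          # nb01 and nb02
    build_hours_each = 1  # "it takes about an hour to build each image"

    hours_per_builder_per_day = images / builders * build_hours_each
    print(f"{hours_per_builder_per_day:.0f} build-hours per builder per day, "
          f"~{hours_per_builder_per_day / 24:.0%} of each builder's day")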