Tuesday, 2022-09-20

fungiopendev "infra" meeting is starting shortly18:59
clarkbI'm going to let fungi be substitute meeting chair today as I think I've got something the kids brought home from school (you may have noticed I was awake far too early today)18:59
clarkbBut I'll follow along19:00
fungisounds like you could use some debugging19:00
fungi#startmeeting infra19:00
opendevmeetMeeting started Tue Sep 20 19:00:29 2022 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
fungi#link https://lists.opendev.org/pipermail/service-discuss/2022-September/000361.html Our agenda19:01
fungi#topic Announcements19:01
fungianybody have anything that needs announcing?19:01
fungii'll take your silence as a resounding negative19:02
fungi#topic Actions from last meeting19:02
ianwo/19:02
fungi#link https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html19:03
fungilooks like that says "none" so i think we can move on19:03
fungi#topic Specs Review19:04
fungii don't see any recently proposed nor recently merged19:04
fungianyone have a spec on the way?19:05
fungithe in progress specs are covered in upcoming topics anyway19:05
fungiso i guess let's jump straight to those19:05
fungi#topic Bastion host19:05
fungiianw: looks like you had some questions there?19:06
ianwi have found some time to move this forward a bit19:06
fungiexcellent!19:06
ianw#link https://review.opendev.org/q/topic:bridge-ansible-venv19:06
fungithat is quite the series of changes19:07
ianwis the stack.  it's doing a few things -- moving our ansible into a venv, updating all our testing environments to a jammy-based bridge, and abstracting things so we can replace bridge.openstack.org with another fresh jammy host19:07
ianwprobably called bridge.opendev.org19:08
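A minimal sketch of the venv piece, assuming a hypothetical install path; the linked stack drives this through system-config changes rather than by hand:

```python
# Illustrative only: put Ansible in its own virtualenv on the bridge host.
# The path is a made-up example, not necessarily what the stack uses.
import subprocess
import venv

ANSIBLE_VENV = "/usr/ansible-venv"  # hypothetical location

venv.create(ANSIBLE_VENV, with_pip=True)
subprocess.run([f"{ANSIBLE_VENV}/bin/pip", "install", "ansible"], check=True)
```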
fungilooks like clarkb has already reviewed the first several19:08
ianwyeah, the final change isn't 100% ready yet but hopefully soon19:08
fungi852477 is getting a post_failure on run-base. is that expected? or unrelated?19:08
ianwi'm on PTO after today, until 10/3, so i don't expect much progress, but after that i think we'll be pretty much ready to swap out bridge19:09
fungisounds great, i'll try to set aside some time to go through those19:09
ianw852477 is not expected to fail :)  that is probably unrelated19:10
fungienjoy your time afk!19:10
fungii'm hoping to do something like that myself at the end of next month19:10
ianwit should all be a big no-op; *should* being the operative term :)19:11
fungiawesome, anything else specific you want to call out there?19:12
fungithe agenda mentions venv for openstacksdk, upgrading the server distro version...19:12
clarkbI think the change stack does a lot of that19:13
clarkbso now that we have changes we can probably shift to reviewing things there19:13
fungisounds good. i guess we can proceed to the next topic19:14
fungithanks for putting that together ianw!19:15
fungi#topic Upgrading Bionic servers to Focal/Jammy19:15
clarkbI think the only new thing related to this topic is what ianw just covered above19:15
fungi#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades19:15
fungiyep19:16
fungi#topic Mailman 319:17
fungithis is coming along nicely19:17
fungii was able to hold a node from the latest patchset of the implementing change and import production copies of all our data into it19:18
fungimost full site migrations will go quickly. i think i clocked lists.opendev.org in at around 8 minutes (i'd have to double-check my notes)19:18
fungithe slowest of course will be lists.openstack.org, which took about 2.5 hours when i tested it19:19
fungithe stable branch failures list accounts for about half that time on its own19:19
fungithe only outstanding errors were around a few of the older openstack lists having some rejection/moderation message templates which were too large for the default column width in the db19:20
fungifor those, i think i'll just fix the current production ones to be smaller (it was a total of 3, so not a big deal)19:20
fungii still need to find some time to test that the legacy pipermail redirects work, and test that mysqldump works on the imported data set19:21
fungithe held server with the full imported data set is 104.239.143.143 in case anyone wants to poke at it19:21
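A minimal sketch of how one of those redirect checks could be spot-tested against the held node; the legacy archive path below is a hypothetical example, not a confirmed mapping:

```python
# Probe a legacy pipermail URL on the held node (104.239.143.143) and
# print the status plus any redirect target. Path is hypothetical.
import http.client

conn = http.client.HTTPConnection("104.239.143.143", 80, timeout=10)
conn.request(
    "GET",
    "/pipermail/service-discuss/2022-September/000361.html",
    headers={"Host": "lists.opendev.org"},
)
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))
# A 301/302 pointing at the new HyperKitty archive URL would indicate the
# redirect rules are working.
```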
clarkbsounds like great progress. Thank you for continuing to push this along19:21
fungii think the plan, as it stands, is to shoot for doing lists.opendev.org and lists.zuul-ci.org around early november, if things continue to go well and stay on track?19:22
clarkbthat wfm. I think we can switch opendev whenever we're comfortable with the deployment (general functionality, migration path, etc)19:23
fungiand then we can consider doing other lists after the start of the year if we don't run into any problems with the already migrated sites19:24
ianw++ having done nothing to help with this, it sounds good :)  19:25
ianwthank you, and just from poking at the site it looks great19:25
fungithanks!19:25
fungi#topic Jaeger tracing server (for Zuul)19:25
fungicorvus: i saw the change to add the server and initial deployment bits merged?19:26
fungi#link https://review.opendev.org/c/opendev/system-config/+/855983 Add Jaeger tracing server19:29
corvusyep, have not launched yet, will do so when i have a spare moment19:30
fungithanks, i'll keep an eye out for the inventory and dns changes19:31
fungianything else to cover on that at the moment?19:31
fungisounds like no19:32
fungi#topic Fedora 36 Rollout19:32
fungiwe found out today that it needs an extra package to work with glean's networkmanager implementation for non-dhcp providers19:33
ianwyes, thanks for digging on that!19:33
fungi#link https://review.opendev.org/858547 Install Fedora ifcfg NM compat package19:34
ianw#link https://review.opendev.org/c/openstack/diskimage-builder/+/85854719:34
ianwjinx :)19:34
fungi#undo19:34
opendevmeetRemoving item from minutes: #link https://review.opendev.org/c/openstack/diskimage-builder/+/85854719:34
fungiit was conflated with a separate problem where we crashed nl04 with a bad config when the weekend restart happened19:34
ianwi have déjà vu on that ... I either vaguely knew about it or we've fixed something similar before, but i can't pinpoint it19:35
fungiso we've not been able to boot anything in ovh and some other providers since early utc saturday19:35
clarkbya the OVH thing made it worse as ovh has a lot of dhcp resources19:35
ianw(the missing package)19:35
fungior, rather, we weren't able to until it got spotted and fixed earlier today19:35
ianwwas the nl04 something that ultimately I did, missing bits of f36?19:36
fungiand with the dib change approved and on its way to merge, i guess we'll need the usual dib release tagged, nodepool requirements bumped, new container images built and deployed19:36
clarkbianw: it was missing bits of rockylinux19:36
clarkbianw: and I think a few days ago nodepool became more strict about checking that stuff19:36
fungiianw: it was a missed rockylinux-9 label from the nested-virt addition19:36
fungiwhatever strictness was added would have been between saturday and whenever the previous restart of nl04's container was19:37
clarkbhttps://review.opendev.org/c/zuul/nodepool/+/858577 is something I've just pushed that should have our CI for project-config catch it19:37
fungithe bad config merged at the beginning of the month, but nodepool only spams exceptions about failing to load its config until you try to restart it19:38
fungithen it fails to start up after being stopped19:38
clarkbfungi: oh I see it crashed because it had to start over but it was probably angry before then too19:38
ianwclarkb: excellent; thanks, that was my next question as to why the linter passed it :)19:38
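For illustration, a rough sketch of the kind of cross-check stricter validation can do: flagging labels a provider pool references that the config never defines. This is not the actual nodepool code, just the general idea:

```python
# Flag labels referenced by provider pools but missing from the top-level
# labels section of a launcher config (illustrative sketch only).
import yaml

def undefined_pool_labels(path):
    with open(path) as f:
        config = yaml.safe_load(f)
    defined = {label["name"] for label in config.get("labels", [])}
    missing = set()
    for provider in config.get("providers", []):
        for pool in provider.get("pools", []):
            for label in pool.get("labels", []):
                if label["name"] not in defined:
                    missing.add((provider.get("name"), label["name"]))
    return missing

# e.g. undefined_pool_labels("nodepool/nl04.opendev.org.yaml")
```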
fungiyes, i saw the tracebacks in older logs, but i didn't look to see how far back it started complaining19:38
fungianyway, if we decide we want the networking fix for f36 sooner we can add that package to infra-package-needs, but i'm fine waiting. it's no longer urgent with nl04 back in a working state19:40
fungi#topic Improving Ansible task runtime19:40
fungiclarkb: you had a couple changes around this, looks like19:40
clarkbya this is still on the agenda mostly for https://review.opendev.org/c/opendev/system-config/+/85723919:41
fungi#link https://review.opendev.org/857232 Use synchronize to copy test host/group vars19:41
clarkbThe idea behind that change is to more forcefully enable pipelining for infra ansible things since we don't care about other connection types19:41
clarkbzuul won't make that change because it has many connection types to support and that isn't necessarily safe. But in our corner of the world it should be19:42
fungi#link https://review.opendev.org/857239 More aggressively enable ansible pipelining19:42
clarkband then ya comments on whether or not people want faster file copying at the expense of more complicated job setup would be good19:42
corvuswe should do that asap, that change will bitrot quick19:43
corvus(do that == decide whether to merge it)19:44
clarkb++ the mm3 change actually conflicts with it too iirc19:44
clarkbas an example of how quickly it will run into trouble19:44
corvus(that change == 857232)19:44
ianwheh yes, and the previously mentioned venv stack will conflict too19:45
clarkbI'm happy to do the legwork to coordinate and rebase things. Mostly just looking for feedback on whether or not we want to make the change in the first place19:46
clarkbEOT for me19:47
fungii relate to corvus's review comment there. if it's an average savings of 3 minutes and just affects our deployment tests, that's not a huge savings (it's not exactly trivial either though, i'll grant)19:47
clarkbRe time saving the idea I had was that if we can trim a few places that are slow like that then in aggregate maybe we end up saving 10 minutes per job19:48
clarkbthat is yet to be proven though19:48
clarkb(other places for improvement are the multi node ssh key management stuff which our jobs use too)19:48
fungiyeah, on the other hand, i don't think 857232 significantly adds to the complexity there. mainly it means we need to be mindful of what files need templating and what don't19:50
ianwyeah multiple minutes, to me, is worth going after19:50
ianwmultiple seconds maybe not :)19:50
funginot to downplay the complexity, it's complex for sure and that means more things we can trip over19:50
fungion balance i'm like +0.5 in favor19:51
clarkback, can follow up on the change19:52
fungiokay, well we can hash it out in review but let's not take too long as corvus indicated19:52
clarkb++19:52
fungi#topic Nodepool Builder Disk utilization19:52
fungithe perennial discussion19:53
fungican we drop some images soon? or should we add more/bigger builders19:53
clarkbI added this to the agenda after frickler discovered that the disk had filled on nb01 and nb0219:53
ianwi think we're close on f35, at least19:53
clarkbwe've added two flavors of rocky since our last cleanup and our base disk utilization is higher. This makes it easier for the disk to fill if something goes wrong19:53
ianw(current issues notwithstanding)19:53
fricklerso one raw image is 20-22G19:54
fricklervhd the same plus ca. 14G qcow219:54
fricklermakes 50-60GB per image, times 2 because we keep the last two19:54
clarkbnote the raw disk utilization size is closer to the qcow2 size due to fs/block magic19:54
fricklerwe have 1T per builder19:54
clarkbdu vs ls show the difference19:54
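Working the figures just quoted through roughly (and ignoring the sparse-file savings clarkb mentions for raw images):

```python
# Back-of-the-envelope footprint per image name (all sizes approximate)
raw_gb, vhd_gb, qcow2_gb = 21, 21, 14
per_build_gb = raw_gb + vhd_gb + qcow2_gb   # ~56 GB for one build
per_image_gb = per_build_gb * 2             # two builds kept: ~112 GB
print(1000 // per_image_gb)                 # ~8 image names fit per 1 TB builder volume
```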
fricklerone of the problems I think is that the copies aren't necessarily distributed19:55
fungii'll note that gentoo hasn't successfully built for almost 2 months (nodes still seem to be booting but the image is very stale now)19:55
fricklergentoo builds are disabled19:55
fricklerthey still need some fix, I don't remember the details19:55
clarkbfrickler: yes, when the builders are happy they pretty evenly distribute things. But as soon as one has problems all the load shifts to the other and it then falls over19:56
fungiopensuse images seem to have successfully built today after two weeks of not being able to19:56
clarkbAdding another builder or more disk would alleviate that aspect of the issue19:56
fungibut that was likely due to the builders filling up19:56
ianwI have https://review.opendev.org/c/zuul/nodepool/+/764280 that i've never got back to that would stop a builder trying to start a build if it knew there wasn't enough space19:56
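The general idea, as a minimal sketch (the path and threshold here are illustrative, and this is not the code in that change):

```python
# Skip starting an image build when the build area is low on free space.
import shutil

def enough_space(build_dir="/opt/dib_tmp", needed_gb=80):  # illustrative values
    return shutil.disk_usage(build_dir).free >= needed_gb * 1024**3

if not enough_space():
    print("skipping image build: insufficient free disk space")
```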
fricklerthe disks were full for two weeks, so no images got built during that time19:57
fungiright, makes sense19:57
fricklerbut all builds look green now in grafana19:57
fungiare there concerns with adding another builder?19:58
fricklerthe other option would be just add another disk on the existing ones?19:58
fungiyeah, though that doesn't get us faster build throughput19:58
clarkbone upside to adding a new builder would be we could deploy it on jammy and make sure everything is working19:58
clarkbthen replace the older two19:58
fricklerbecause we don't seem to have an issue with buildtime yet19:58
clarkbAdding more disk is probably the easiest thing to do though19:59
fricklerah, upgrading is a good reason for new one(s)19:59
ianwthis is true.  i think the 1tb limit is just a self-imposed thing, perhaps related to the maximum cinder volume size in rax, but i don't think it's a special number19:59
fungibut any idea what % of the day is spent building images on each builder? we won't know that we're out of build bandwidth until we are19:59
clarkbfungi: it takes about an hour to build each image19:59
clarkbtotal images / 2 ~= active hours of the day per builder20:00
fungiand we're at ~16 images now20:00
fungiso yeah, we're theoretically at 33% capacity for build bandwidth20:00
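Spelling that estimate out (figures approximate, upload time not included):

```python
# Rough build-bandwidth estimate from the numbers above
images, builders, hours_per_image = 16, 2, 1
busy_hours = images * hours_per_image / builders  # 8 hours of building per builder
print(f"{busy_hours / 24:.0%} of each day")       # ~33% of available build time
```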
fungii agree more space would be fine for now in that case, unless we just want to be able to get images online a bit faster20:00
frickleroh, right, we should check upload times20:01
fricklerI don't think that is included in the 1h20:01
clarkbit isn't20:01
clarkbbut we do parallelize those but builds are serial20:01
clarkbthere is still an upper limit though. I don't think it is the one we are likely to hit first. But checking is a good idea20:02
fungiand we still need to replace the arm builder right? does it need more space too?20:02
clarkbyes it needs to be replaced. Still waiting on the new cloud to put it in unless we move it to osuosl20:02
ianwthat will need replacement as linaro goes, yep.  that seems to be a wip20:02
fungialso note that we're roughly three minutes over the end of the meeting20:03
clarkband its disk does much better since we don't need the vhd and qcow images20:03
clarkbwe use raw only for arm64 iirc20:03
fungiso sounds like some consensus on adding a second volume for nb01 and nb02. i can try to find time for that later this week20:03
fungiit's lvm so easy to just grow in-place20:04
fungi#topic Open Discussion20:04
fungiif nobody has anything else i'll close the meeting20:05
clarkbI'm good. Thank you for stepping in today20:05
ianw#link https://review.opendev.org/c/zuul/zuul/+/85824320:05
ianwis a zuul one; thanks for reviews on the recent console web-stack -- that one is a fix for narrow screens on the console20:06
ianwthere's others on top, but they're opinions, not fixes :)20:06
ianwand thanks to fungi for meeting too :)20:07
fungithanks everyone! enjoy the rest of your week20:07
fungi#endmeeting20:07
opendevmeetMeeting ended Tue Sep 20 20:07:19 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:07
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-20-19.00.html20:07
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-20-19.00.txt20:07
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-20-19.00.log.html20:07
