19:01:24 <clarkb> #startmeeting infra
19:01:24 <opendevmeet> Meeting started Tue Sep 27 19:01:24 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 <opendevmeet> The meeting name has been set to 'infra'
19:01:35 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000362.html Our Agenda
19:02:17 <clarkb> #topic Announcements
19:03:11 <clarkb> No announcements. I guess keep in mind the openstack release is happening soon
19:03:22 <clarkb> then in a couple? maybe three? weeks the PTG is happening virtually
19:03:36 <clarkb> #topic Mailman 3
19:04:02 <fungi> yet still moar progress
19:04:34 <fungi> migration script is now included in the implementation change
19:05:05 <fungi> and it's been updated to also cover the bits to move the backward-compat copies of the old archives to where we'll serve them from, and relink them properly
19:05:47 <fungi> we have a handful of mailing lists with some fields which are too large for their db columns, but i'll fix those in production and then we can do another (perhaps final) full migration test
19:06:10 <fungi> i think we're close to scheduling migration cut-over for lists.opendev.org and lists.zuul-ci.org
19:06:17 <clarkb> is that the last thing we need to do bfeore holding a node and doing a migration test again?
19:06:23 <fungi> afaik yes
19:06:58 <clarkb> sounds great
19:07:20 <fungi> i noted the specific lists and fields at the end of th eetherpad in the todo list
19:07:39 <fungi> #link https://etherpad.opendev.org/p/mm3migration
19:08:30 <fungi> maybe next meeting we'll be in a position to talk about maintenance scheduling
19:08:32 <clarkb> and the disk filling was due to unbound and not something we expect to be a problem in production right?
19:08:42 <clarkb> because unbound on our test nodes has more verbose logging settings
19:08:55 <fungi> correct, we apparently set unbound up for very verbose logging on our test nodes
19:09:15 <fungi> the risks of holding a test node and letting it sit there for weeks
19:10:18 <clarkb> exciting and let me know if there is anyhting else I can help with on this. I'e been trying to make sure I help keep it moving forward
19:10:20 <clarkb> anything else?
19:10:52 <fungi> you said the ci failures on the lynx addition pr are bitrot, right?
19:10:56 <fungi> #link https://github.com/maxking/docker-mailman/pull/552
19:11:19 <clarkb> yes I'm pretty sure that was a upstream change to mailman3 iirc that broke it
19:11:25 <clarkb> something to do with translations
19:11:27 <fungi> k
19:11:46 <clarkb> but that is a good reminder we should probably decide if we need lynx installed and do our own images or try to reach out to upstream more
19:12:07 <fungi> looks like that repo has merged no new commits to its main branch since june
19:13:04 <clarkb> I wonder if anyone knows the maintainer
19:15:04 <fungi> anyway, i don't have anything else on this topic
19:15:13 <clarkb> #topic jaeger tracing server
19:15:22 <clarkb> The server is up and running
19:15:35 <corvus> it's up, lgtm
19:15:48 <fungi> certainly a thing we like to see in our servers ;)
19:15:52 <clarkb> we've also updated the zuul config to send tracing info to the jaeger server
19:15:54 <corvus> we're not sending data to it yet, but we should be configured to do so once we restart zuul with the changes that start exporting traces
19:16:28 <fungi> is that waiting for the weekend, or should we restart sooner?
19:17:12 <clarkb> restarting sooner is easy to do with the playbook if the changes have landed to zuul
19:17:22 <corvus> i'm inclined to just let it happen over the weekend.... but i dunno, maybe a restart of the schedulers before then might be interesting?
19:17:48 <corvus> that might get us some traces without restarting the schedulers
19:18:13 <fungi> without restarting the executors?
19:18:28 <corvus> yep that, sorry :)
19:18:31 <fungi> cool, just making sure i understood!
19:18:57 <fungi> that sounds fine to me. i can probably do that after dinner if we're ready
19:19:01 <corvus> would be a quick sanity check, maybe if there's an issue we can fix it before the weekend.
19:19:31 <corvus> well, i don't think the zuul stack has merged up to where i'd like it to yet
19:20:08 <fungi> which changes are still outstanding?
19:20:10 <corvus> there's a good chunk approved, but they hit some failures
19:20:15 <corvus> i just rechecked them
19:20:15 <fungi> ahh
19:20:34 <corvus> #link https://review.opendev.org/858372 initial retracing change tip
19:20:45 <corvus> i think a partial (or full) restart after that makes sense
19:20:57 <corvus> so that's the one to watch for.  i can ping when it lands
19:21:01 <fungi> okay, so we're clear to restart schedulers once there are images built from 858372?
19:21:13 <corvus> ya.  sounds like a plan
19:21:21 <fungi> (and tagged latest in promote of course)
19:21:59 <fungi> cool, i've starred that so i should get notifications for it
19:22:05 <clarkb> and once that restart happens we'd expect to see traces in jaeger. Any issue with potentially orphaned traces since all components won't be restarted yet?
19:22:35 <clarkb> (thinking about this from a rolling upgrade standpoint). I guess the jaeger server might hold onto partial info that we can't view but otherwise its fine?
19:22:40 <corvus> i don't think so - in my testing jaeger just shows a little error message as a sort of footnote to the trace
19:22:54 <corvus> so it knows there's missing info, but it's no big deal
19:23:00 <corvus> and eventually things start looking more complete
19:23:05 <fungi> i could do a rolling scheduler restart, and then follow it up with a full restart, if that's preferable
19:23:19 <fungi> full rolling restart of all components i mean
19:23:53 <corvus> fungi: not worth it for this -- no matter what, there will be partial/orphaned traces.  that's sort of intentional in order to keep the complexity of the changes under control
19:24:01 <fungi> okay, wfm
19:24:34 <corvus> also (schedulers are generally emitting the highest level traces, so we'll just be missing children, not parents, once all the schedulers are restarted)
19:25:12 <corvus> then over the next days/weeks, we'll be landing more changes to zuul to add more traces, so data will become richer over time
19:25:33 <fungi> awesome
19:25:39 <corvus> i think that's probably it from my pov
19:25:48 <clarkb> sounds good, thanks
19:26:31 <clarkb> #topic Nodepool Disk Utilization
19:26:56 <fungi> hopefully we're in the clear here now
19:27:01 <fungi> and least for a while
19:27:21 <clarkb> yup you just expanded the disks on both builders to 2TB which is plenty
19:27:31 <fungi> yeah, i added a 1tb volume to each of nb01 and nb02 and expanded /opt to twice its earlier size
19:27:36 <clarkb> I wasn't sure where this was when I put the agenda togethe rso wanted to make sure it was brought up
19:28:07 <fungi> probably safe to take off the agenda now, unless we want to monitor things for a while
19:28:21 <clarkb> agreed. I think this was the solution. Thank you for taking care of it
19:28:26 <fungi> np
19:29:02 <clarkb> #topic Open Discussion
19:30:01 <clarkb> Looks like https://review.opendev.org/c/zuul/zuul-jobs/+/858961 has the +2's it needs. Are ya'll good with me approving when I can keep an eye on it?
19:30:25 <clarkb> If so I can probably do that tomorrow (today I've got to get the kids from schoo land stuff so my afternoon won't be attached to a keyboard the whole time)
19:30:27 <fungi> yeah, i say go for it
19:30:47 <fungi> volume for reprepro mirror of ceph quincy deb packages has been created, i'm just waiting to see if the jobs run clean on 858961 now
19:31:03 <clarkb> oh I was also going to land the rocky 9 arm64 image if we are happy wit hit
19:31:11 <fungi> yeah, sounds good, thanks
19:31:18 <clarkb> that needs less attention if it goes wrong
19:31:47 <clarkb> unless it kills a launcher I suppose but our testing should catch that now
19:31:50 <fungi> well, a similar change recently did take down nl03
19:31:56 <fungi> or was it nl04?
19:32:11 <clarkb> nl04. But ya I can approv ethat one when I've got more consistent time at a keyboard to check things too
19:32:30 <fungi> anyway, there's certainly some associated risk but yes if we're on top of double-checking after it deploys i think that's fine
19:33:55 <fungi> does anyone know whether we expect system-config-run-base-ansible-devel to ever start passing again?
19:34:41 <clarkb> fungi: yes I think the issue is that bridge is too old for new ansible
19:34:44 <fungi> currently seems to be breaking because we test with python 3.8 and ansible-core needs >=3.9
19:34:51 <fungi> so that makes sense
19:34:56 <clarkb> so once we get the venv stuff done and switched over to a newer bridge in theory it will work
19:35:03 <fungi> awesome
19:36:06 <clarkb> last call for anything else
19:36:16 * fungi gots nuthin'
19:36:41 <fungi> other than a gnawing hunger for takeout
19:36:52 <clarkb> #endmeeting