19:01:24 #startmeeting infra
19:01:24 Meeting started Tue Sep 27 19:01:24 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 The meeting name has been set to 'infra'
19:01:35 #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000362.html Our Agenda
19:02:17 #topic Announcements
19:03:11 No announcements. I guess keep in mind the openstack release is happening soon
19:03:22 then in a couple? maybe three? weeks the PTG is happening virtually
19:03:36 #topic Mailman 3
19:04:02 yet still moar progress
19:04:34 migration script is now included in the implementation change
19:05:05 and it's been updated to also cover the bits to move the backward-compat copies of the old archives to where we'll serve them from, and relink them properly
19:05:47 we have a handful of mailing lists with some fields which are too large for their db columns, but i'll fix those in production and then we can do another (perhaps final) full migration test
19:06:10 i think we're close to scheduling migration cut-over for lists.opendev.org and lists.zuul-ci.org
19:06:17 is that the last thing we need to do before holding a node and doing a migration test again?
19:06:23 afaik yes
19:06:58 sounds great
19:07:20 i noted the specific lists and fields at the end of the etherpad in the todo list
19:07:39 #link https://etherpad.opendev.org/p/mm3migration
19:08:30 maybe next meeting we'll be in a position to talk about maintenance scheduling
19:08:32 and the disk filling was due to unbound and not something we expect to be a problem in production right?
19:08:42 because unbound on our test nodes has more verbose logging settings
19:08:55 correct, we apparently set unbound up for very verbose logging on our test nodes
19:09:15 the risks of holding a test node and letting it sit there for weeks
19:10:18 exciting, and let me know if there is anything else I can help with on this. I've been trying to make sure I help keep it moving forward
19:10:20 anything else?
19:10:52 you said the ci failures on the lynx addition pr are bitrot, right?
19:10:56 #link https://github.com/maxking/docker-mailman/pull/552
19:11:19 yes I'm pretty sure that was an upstream change to mailman3 iirc that broke it
19:11:25 something to do with translations
19:11:27 k
19:11:46 but that is a good reminder we should probably decide if we need lynx installed and do our own images or try to reach out to upstream more
19:12:07 looks like that repo has merged no new commits to its main branch since june
19:13:04 I wonder if anyone knows the maintainer
19:15:04 anyway, i don't have anything else on this topic
19:15:13 #topic jaeger tracing server
19:15:22 The server is up and running
19:15:35 it's up, lgtm
19:15:48 certainly a thing we like to see in our servers ;)
19:15:52 we've also updated the zuul config to send tracing info to the jaeger server
19:15:54 we're not sending data to it yet, but we should be configured to do so once we restart zuul with the changes that start exporting traces
19:16:28 is that waiting for the weekend, or should we restart sooner?
19:17:12 restarting sooner is easy to do with the playbook if the changes have landed to zuul
19:17:22 i'm inclined to just let it happen over the weekend.... but i dunno, maybe a restart of the schedulers before then might be interesting?
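For context on the Mailman 3 migration discussed above: the usual Mailman 2.1-to-3 path recreates each list in Mailman core, imports its old config.pck, and then loads the pipermail mbox archive into HyperKitty. The sketch below is not the migration script referenced in the log; the list address and filesystem paths are illustrative assumptions, and django-admin is assumed to have DJANGO_SETTINGS_MODULE pointing at the HyperKitty site settings.

```python
# Rough sketch of the usual MM 2.1 -> MM 3 import steps; the list address and
# paths are placeholders, not OpenDev's real layout.
import subprocess

lists = ["service-discuss@lists.opendev.org"]  # example list only

for addr in lists:
    name = addr.split("@", 1)[0]
    # Create the list in Mailman 3 core, then pull in the 2.1 settings.
    subprocess.run(["mailman", "create", addr], check=True)
    subprocess.run(
        ["mailman", "import21", addr,
         f"/var/lib/mailman2/lists/{name}/config.pck"],
        check=True,
    )
    # Load the old pipermail mbox into HyperKitty's archive database
    # (hyperkitty_import is a Django management command shipped by HyperKitty).
    subprocess.run(
        ["django-admin", "hyperkitty_import", "-l", addr,
         f"/var/lib/mailman2/archives/private/{name}.mbox/{name}.mbox"],
        check=True,
    )
```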
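On the tracing side, Zuul's span export is built on OpenTelemetry, with Jaeger acting as the collector and UI. As background only, here is a minimal OpenTelemetry sketch of emitting a parent span with a child span to an OTLP endpoint; this is not Zuul's actual code, and the endpoint, service name, and span names are assumptions.

```python
# Minimal OpenTelemetry sketch: emit one parent span and one child span to an
# OTLP-capable collector (recent Jaeger releases accept OTLP/gRPC on 4317).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "zuul-scheduler"})  # assumed name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# If the child span's process restarted before exporting, Jaeger would still
# show the parent trace and note the missing child -- the "footnote" behaviour
# mentioned later in the log.
with tracer.start_as_current_span("buildset"):
    with tracer.start_as_current_span("build"):
        pass
```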
19:17:48 that might get us some traces without restarting the schedulers
19:18:13 without restarting the executors?
19:18:28 yep that, sorry :)
19:18:31 cool, just making sure i understood!
19:18:57 that sounds fine to me. i can probably do that after dinner if we're ready
19:19:01 would be a quick sanity check, maybe if there's an issue we can fix it before the weekend.
19:19:31 well, i don't think the zuul stack has merged up to where i'd like it to yet
19:20:08 which changes are still outstanding?
19:20:10 there's a good chunk approved, but they hit some failures
19:20:15 i just rechecked them
19:20:15 ahh
19:20:34 #link https://review.opendev.org/858372 initial tracing change tip
19:20:45 i think a partial (or full) restart after that makes sense
19:20:57 so that's the one to watch for. i can ping when it lands
19:21:01 okay, so we're clear to restart schedulers once there are images built from 858372?
19:21:13 ya. sounds like a plan
19:21:21 (and tagged latest in promote of course)
19:21:59 cool, i've starred that so i should get notifications for it
19:22:05 and once that restart happens we'd expect to see traces in jaeger. Any issue with potentially orphaned traces since all components won't be restarted yet?
19:22:35 (thinking about this from a rolling upgrade standpoint). I guess the jaeger server might hold onto partial info that we can't view but otherwise it's fine?
19:22:40 i don't think so - in my testing jaeger just shows a little error message as a sort of footnote to the trace
19:22:54 so it knows there's missing info, but it's no big deal
19:23:00 and eventually things start looking more complete
19:23:05 i could do a rolling scheduler restart, and then follow it up with a full restart, if that's preferable
19:23:19 full rolling restart of all components i mean
19:23:53 fungi: not worth it for this -- no matter what, there will be partial/orphaned traces. that's sort of intentional in order to keep the complexity of the changes under control
19:24:01 okay, wfm
19:24:34 also (schedulers are generally emitting the highest level traces, so we'll just be missing children, not parents, once all the schedulers are restarted)
19:25:12 then over the next days/weeks, we'll be landing more changes to zuul to add more traces, so data will become richer over time
19:25:33 awesome
19:25:39 i think that's probably it from my pov
19:25:48 sounds good, thanks
19:26:31 #topic Nodepool Disk Utilization
19:26:56 hopefully we're in the clear here now
19:27:01 at least for a while
19:27:21 yup you just expanded the disks on both builders to 2TB which is plenty
19:27:31 yeah, i added a 1tb volume to each of nb01 and nb02 and expanded /opt to twice its earlier size
19:27:36 I wasn't sure where this was when I put the agenda together so wanted to make sure it was brought up
19:28:07 probably safe to take off the agenda now, unless we want to monitor things for a while
19:28:21 agreed. I think this was the solution. Thank you for taking care of it
19:28:26 np
19:29:02 #topic Open Discussion
19:30:01 Looks like https://review.opendev.org/c/zuul/zuul-jobs/+/858961 has the +2's it needs. Are y'all good with me approving when I can keep an eye on it?
19:30:25 If so I can probably do that tomorrow (today I've got to get the kids from school and stuff so my afternoon won't be attached to a keyboard the whole time)
19:30:27 yeah, i say go for it
19:30:47 volume for reprepro mirror of ceph quincy deb packages has been created, i'm just waiting to see if the jobs run clean on 858961 now
19:31:03 oh I was also going to land the rocky 9 arm64 image if we are happy with it
19:31:11 yeah, sounds good, thanks
19:31:18 that needs less attention if it goes wrong
19:31:47 unless it kills a launcher I suppose but our testing should catch that now
19:31:50 well, a similar change recently did take down nl03
19:31:56 or was it nl04?
19:32:11 nl04. But ya I can approve that one when I've got more consistent time at a keyboard to check things too
19:32:30 anyway, there's certainly some associated risk but yes if we're on top of double-checking after it deploys i think that's fine
19:33:55 does anyone know whether we expect system-config-run-base-ansible-devel to ever start passing again?
19:34:41 fungi: yes I think the issue is that bridge is too old for new ansible
19:34:44 currently seems to be breaking because we test with python 3.8 and ansible-core needs >=3.9
19:34:51 so that makes sense
19:34:56 so once we get the venv stuff done and switched over to a newer bridge in theory it will work
19:35:03 awesome
19:36:06 last call for anything else
19:36:16 * fungi gots nuthin'
19:36:41 other than a gnawing hunger for takeout
19:36:52 #endmeeting