19:01:10 <clarkb> #startmeeting infra
19:01:10 <opendevmeet> Meeting started Tue Aug 29 19:01:10 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 <opendevmeet> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/2LK5PHWBDIBZDHVLIEFKFZJKB3AEJZ45/ Our Agenda
19:01:22 <clarkb> #topic Announcements
19:01:33 <clarkb> Monday is a holiday in some parts of the world.
19:02:32 <clarkb> #topic Service Coordinator Election
19:02:47 <fungi> congratudolences
19:02:58 <clarkb> heh I was the only nominee so I'm it by default
19:03:20 <clarkb> feedback/help/interest in taking over in the future all welcome
19:03:23 <clarkb> just let me know
19:03:33 <clarkb> #topic Infra Root Google Account
19:03:56 <clarkb> This is me noting I still haven't tried to dig into that. I feel like I need to be in a forensic frame of mind for that and I just haven't had that lately
19:04:03 <clarkb> #topic Mailman 3
19:04:16 <clarkb> Cruising along to a topic with good news!
19:04:27 <fungi> si
19:04:34 <clarkb> all of fungi's outstanding changes have landed and been applied to the server. This includes upgrading to the latest mailman3
19:04:40 <clarkb> thank you fungi for continuing to push this along
19:04:41 <fungi> i think we've merged everything we expected to merge
19:04:54 <fungi> so far no new issues observed and known issues are addressed
19:05:18 <fungi> next up is scheduling migrations for the 5 remaining mm2 domains we're hosting
19:05:26 <clarkb> we have successfully sent and received email through it since the changes
19:05:50 <fungi> migrating lists.katacontainers.io first might be worthwhile, since that will allow us to decommission the separate server it's occupying
19:06:17 <fungi> we also have lists.airshipit.org which is mostly dead so nobody's likely to notice it moving anyway
19:06:49 <fungi> as well as lists.starlingx.io and lists.openinfra.dev
19:07:04 <clarkb> ya starting with airshipit and kata seems like a good idea
19:07:15 <fungi> then lastly, lists.openstack.org (which we should save for last, it will be the longest outage and should definitely have a dedicated window to itself)
19:07:35 <clarkb> do you think we should do them sequentially or try to do blocks of a few at a time for the smaller domains
19:07:42 <fungi> i expect the openstack lists migration to require a minimum of 3 hours downtime
19:08:25 <fungi> i think maybe batches of two? so we could do airship/kata in one maintenance, openinfra/starlingx in another
19:08:43 <clarkb> sounds like a plan. We can also likely go ahead with those two blocks whenever we are ready
19:08:53 <clarkb> I don't think any of those projects are currently in the middle of release activity or similar
19:09:15 <fungi> i'll identify the most relevant mailing lists on each of those to send a heads-up to
19:10:13 <clarkb> I'm happy to be an extra set of hands/eyeballs during those migrations. I expect you'll be happy for any of us to participate
19:10:14 <fungi> mainly it's the list moderators who will need to be aware of interface changes
19:10:24 <fungi> and yes, all assistance is welcome
19:10:48 <fungi> the migration is mostly scripted now; the script i've been testing with is in system-config
19:11:24 <clarkb> great, I guess let us know when you've got times picked and list moderators notified and we can take it from there
19:12:00 <fungi> will do. we can coordinate scheduling those outside the meeting
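For context, a very rough sketch of what a per-list Mailman 2 to 3 import can look like using Mailman 3's own tooling (mailman create, mailman import21, and HyperKitty's hyperkitty_import management command). This is only an illustration, not the system-config script fungi refers to above; the list address, filesystem paths, and Django settings arguments are hypothetical placeholders.

    #!/usr/bin/env python3
    # Illustrative sketch only: list address, paths, and Django settings
    # below are hypothetical placeholders, not OpenDev's actual layout.
    import subprocess


    def migrate_list(addr, pck_path, mbox_path):
        """Create the MM3 list, import MM2 settings/members, import archives."""
        # Create the list in Mailman 3 core.
        subprocess.run(["mailman", "create", addr], check=True)
        # Import the old Mailman 2.1 list configuration and membership
        # from its config.pck.
        subprocess.run(["mailman", "import21", addr, pck_path], check=True)
        # Import the old pipermail archive mbox into HyperKitty.
        subprocess.run(
            ["django-admin", "hyperkitty_import",
             "--pythonpath", "/etc/mailman3", "--settings", "settings",
             "-l", addr, mbox_path],
            check=True)


    if __name__ == "__main__":
        migrate_list(
            "kata-dev@lists.katacontainers.io",
            "/var/lib/mailman2/lists/kata-dev/config.pck",
            "/var/lib/mailman2/archives/private/kata-dev.mbox/kata-dev.mbox")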
19:12:24 <clarkb> #topic Server Upgrades
19:12:35 <clarkb> Another topic where I've had some todos but haven't made progress yet
19:12:52 <clarkb> I do plan to clean up the old insecure ci registry server today and then I need to look at replacing some old servers
19:13:03 <clarkb> #topic Rax IAD image upload struggles
19:13:15 <clarkb> fungi: frickler: anything new to add here? What is the current state of image uploads for that region?
19:13:34 <fungi> i cleaned up all the leaked images in all regions
19:14:33 <fungi> there were about 400 each in dfw/ord and around 80 new in iad. now that things are mostly clean we should look for newly leaked nodes to see if we can spot why they're not getting cleaned up (if there are any, i haven't looked)
19:14:41 <fungi> also i'm not aware of a ticket for rackspace yet
19:15:20 <clarkb> would be great if we can put one of those together. I feel like I don't have enough of the full debug history to do it justice myself
19:16:27 <fungi> yeah, i'll try to put something together for that tomorrow
19:16:27 <frickler> I think if we could limit nodepool to upload no more than one image at a time, we would have no issue
19:16:53 <clarkb> I think we can do that but it's nodepool builder instance wide. So we might need to run a special instance just for that region
19:17:03 <clarkb> (there is a flag for number of upload threads)
19:17:11 <clarkb> it would be clunky to do with current nodepool but possible
19:17:41 <frickler> so that would also build images another time just for that region?
19:18:12 <clarkb> yes
19:18:16 <clarkb> definitely not ideal
19:18:52 <frickler> the other option might be to delete other images and just run jammy jobs there? not sure how that would affect mixed nodesets
19:19:14 <clarkb> I think it would prevent mixed nodesets from running there but nodepool would properly avoid using that region for those nodesets
19:19:18 <clarkb> so ya that would work
19:19:38 <frickler> so I could delete the other images manually
19:19:52 <frickler> and then we can wait for the rackspace ticket to be worked on
19:20:50 <clarkb> if things are okayish right now maybe see if we get a response on the ticket quickly otherwise we can refactor something like ^ or even look at nodepool changes to make it more easily "load balanced"
19:21:22 <frickler> well the issue is that the other images get older each day, not sure when that will start to cause issues in jobs
19:21:45 <clarkb> got it. The main risk is probably that we're ignoring possible bugfixes upstream of us.
19:21:57 <fungi> they are almost certainly already causing jobs to take at least a little longer since more git commits and packages have to be pulled over the network
19:21:57 <clarkb> definitely not ideal
19:22:27 <fungi> jobs which were hovering close to timeouts could be pushed over the cliff by that, i suppose
19:22:54 <fungi> or the increase in network activity could raise their chances that a stray network issue causes the job to be retried
19:23:20 <clarkb> ya maybe we should just focus on our default label (jammy) since most jobs run on that and let the others lie dormant/disabled/removed for now
19:24:18 <clarkb> ok anything else on this topic?
19:24:30 <frickler> ok, so I'll delete other image, we can still reupload manually if needed
19:24:36 <frickler> *images
19:24:54 <corvus> what if...
19:25:15 <corvus> what if we set the upload threads to 1 globally; so don't make any other changes than that
19:25:36 <clarkb> corvus: we'll end up with more stale images everywhere. But maybe within a few days so that's ok?
19:25:40 <corvus> it would slow everything down, but would it be too much? or would that be okay?
19:26:02 <clarkb> I think the upper bound of image uploads on things that are "happy" is ~1 hour
19:26:16 <frickler> I think it will be too much, 10 or so images times ~8 regions times ~30mins per image
19:26:17 <clarkb> so we'll end up about 5-ish days behind doing some quick math in my head on fuzzy numbers
19:26:26 <fungi> and we have fewer than 24 images presently
19:26:35 <corvus> yeah, like, what's our wall-clock time for uploading to everywhere? if that is < 24 hours then it's not a big deal?
19:26:55 <fungi> oh, upload to only one provider at a time too
19:26:55 <clarkb> 10 * 8 * .5 / 2 = 20 hours?
19:27:00 <corvus> (but also keeping in mind that we still have multiple builders, so it's not completely serialized)
19:27:09 <clarkb> .5 for half an hour per upload and /2 because we have two builders
19:27:35 <frickler> oh, that is per builder then, not global?
19:27:49 <frickler> so then we could still have two parallel uploads to IAD
19:27:50 <clarkb> frickler: yes, it's an option on the nodepool-builder process
19:27:54 <clarkb> frickler: yes
19:28:12 <corvus> (but of different images)
19:28:16 <corvus> (not that it matters, just clarifying)
19:28:30 <corvus> so it'd go from 8 possible to 2 possible in parallel
19:28:45 <frickler> but that would likely still push those over the 1h limit according to what we tested
19:28:58 <clarkb> maybe it is worth trying since it is a fairly low effort change?
19:29:10 <clarkb> and reverting it is quick since we don't do anything "destructive" to cloud image content
19:29:50 <corvus> that's my feeling -- like i'm not strongly advocating for it since it's not a complete solution, but maybe it's easy and maybe close enough to good enough to buy some time
19:30:15 <frickler> yeah, ok
19:30:21 <clarkb> I'm up for trying it and if we find by the end of the week we are super behind we can revert
19:30:47 <corvus> yeah, if it doesn't work out, oh well
19:31:38 <clarkb> cool let's try that and take it from there (including a ticket to rax if we can manage a constructive write up)
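Writing out the back-of-the-envelope math from the discussion above (the image count, region count, per-upload time, and builder count are the rough figures quoted in the meeting, not measured values):

    # Rough wall-clock estimate for refreshing every image in every region
    # if each nodepool-builder performs only one upload at a time.
    images = 10             # ~10 image types being built
    regions = 8             # ~8 provider regions to upload to
    hours_per_upload = 0.5  # ~30 minutes per upload when a provider is "happy"
    builders = 2            # the two builders still upload in parallel

    total_uploads = images * regions
    wall_clock_hours = total_uploads * hours_per_upload / builders
    print(f"{total_uploads} uploads, roughly {wall_clock_hours:.0f} hours")
    # 80 uploads, roughly 20 hours -- under the < 24 hour threshold corvus
    # mentions, which is why serializing uploads per builder seems worth trying.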
19:32:15 <clarkb> #topic Fedora cleanup
19:32:18 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove the fedora-latest nodeset
19:32:46 <clarkb> I think we're readyish for this change? The nodes themselves are largely nonfunctional so if this breaks anything it won't be more broken than before?
19:33:11 <clarkb> then we can continue towards removing the labels and images from nodepool (which will make the above situation better too)
19:33:51 <clarkb> I'm happy to continue helping nudge this along as long as we're in rough agreement about impact and process
19:34:44 <corvus> i think zuul-jobs is ready for that. wfm.
19:35:03 <fungi> yeah, we dropped the last use of the nodeset we're aware of (was in bindep)
19:35:11 <frickler> we are still building f35 images, too, btw
19:35:51 <clarkb> frickler: ah ok so we'll clean up multiple images
19:36:05 <clarkb> alright I'll approve that change later today if I don't hear any objections
19:36:23 <frickler> just remember to drop them in the right order (which I don't remember), so nodepool can clean them up on all providers
19:36:50 <clarkb> ya I'll have to think about the nodepool ordering after the zuul side is cleaner
19:37:15 <corvus> hopefully https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-from-the-builder helps
19:37:21 <clarkb> ++
19:37:30 <corvus> (but don't actually remove the provider at the end)
19:38:14 <clarkb> #topic Zuul Ansible 8 Default
19:38:31 <clarkb> We are ansible 8 by default in opendev zuul now everywhere but openstack
19:38:45 <clarkb> I brought up the plan to switch openstack to ansible 8 by default on Monday to the TC in their meeting today and no one screamed
19:38:54 <clarkb> It's also a holiday for some of us which should help a bit
19:39:10 <fungi> i'll be around in case it goes sideways
19:39:14 <clarkb> I plan to be around long enough in the morning (and probably longer) monday to land that change and monitor it a bit
19:39:18 <fungi> well, weather permitting anyway
19:39:37 <clarkb> ya I don't have any plans yet, but it is the day before my parents leave so might end up doing some family stuff but nothing crazy enough I can't jump on for debugging or a revert
19:39:44 <fungi> (things here might literally go sideways if the current storm track changes)
19:39:45 <clarkb> fungi: is that when the hurricane(s) might pass by?
19:40:04 <fungi> no, but if things get bad i'll likely be unavailable next week for cleanup
19:40:31 <frickler> if you prepare and review a patch, I can also approve that earlier on monday and watch a bit
19:40:32 <corvus> i should also be around
19:40:36 <clarkb> frickler: can do
19:41:12 <clarkb> looks like it is just one hurricane at least now
19:41:21 <clarkb> franklin is predicted to go further north and east
19:42:13 <clarkb> #topic Python container updates
19:42:16 <fungi> yeah, idalia is the one we have to watch for now
19:42:23 <clarkb> #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm.
19:42:51 <clarkb> thank you corvus for pushing up another set of these. Other than the gerrit one I think we can probably land these whenever. For Gerrit we should plan to land it when we are able to restart the container just in case
19:43:03 <clarkb> particularly since the gerrit change bumps java up to java 17
19:43:11 <corvus> o7
19:43:24 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/893073 Gitea bookworm migration. Does not use base python image.
19:43:51 <clarkb> I pushed a change for gitea earlier today that does not use the same base python images but those images will get a similar bullseye to bookworm bump
19:44:07 <clarkb> similar to gerrit, gitea probably deserves a bit of attention in this case to ensure that gerrit replication isn't affected.
19:44:20 <clarkb> I'm also happy to do more testing with gerrit and/or gitea if we feel that is prudent
19:44:29 <clarkb> reviews and feedback very much welcome
19:45:19 <clarkb> #topic Open Discussion
19:45:32 <clarkb> Other things of note: we upgraded gitea to 1.20.3 and etherpad to 1.9.1 recently
19:45:42 <clarkb> It has been long enough that I don't expect trouble but something to be aware of
19:46:08 <fungi> yay upgrades. bigger yay for our test infrastructure which makes them almost entirely worry-free
19:46:25 <clarkb> I mentioned meetpad to someone recently and was told some group had tried it and ran into problems again. It may be worth doing a sanity check that it works as expected
19:47:00 <fungi> i'm free to do a test on it soon
19:47:18 <clarkb> I can do it after I eat some lunch. Say about 20:45 UTC
19:48:06 <fungi> i may be in the middle of food at that time but can play it by ear
19:48:10 <clarkb> tox 4.10.0 + pyproject-api 1.6.0/1.6.1 appear to have blown up projects using tox. Tox 4.11.0 fixes it apparently so rechecks will correct it
19:48:27 <clarkb> debugging of this was happening during this meeting so it is very new :)
19:49:20 <corvus> in other news, nox did not break today
19:49:33 <clarkb> Oh I meant to mention to tonyb to feel free to jump into any of the above stuff or new things if still able/interested. I think you are busy with openstack election stuff right now though
19:50:27 <clarkb> sounds like that is everything. Thank you everyone!
19:50:32 <clarkb> #endmeeting