Tuesday, 2023-08-29

*** NeilHanlon_ is now known as NeilHanlon  [13:11]
<clarkb> almost meeting time  [18:59]
<clarkb> #startmeeting infra  [19:01]
<opendevmeet> Meeting started Tue Aug 29 19:01:10 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.  [19:01]
<opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.  [19:01]
<opendevmeet> The meeting name has been set to 'infra'  [19:01]
<clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/2LK5PHWBDIBZDHVLIEFKFZJKB3AEJZ45/ Our Agenda  [19:01]
<clarkb> #topic Announcements  [19:01]
<clarkb> Monday is a holiday in some parts of the world.  [19:01]
<clarkb> #topic Service Coordinator Election  [19:02]
<fungi> congratudolences  [19:02]
<clarkb> heh I was the only nominee so I'm it by default  [19:02]
<clarkb> feedback/help/interest in taking over in the future all welcome  [19:03]
<clarkb> just let me know  [19:03]
<clarkb> #topic Infra Root Google Account  [19:03]
<clarkb> This is me noting I still haven't tried to dig into that. I feel like I need to be in a forensic frame of mind for that and I just haven't had that lately  [19:03]
<clarkb> #topic Mailman 3  [19:04]
<clarkb> Cruising along to a topic with good news!  [19:04]
<fungi> si  [19:04]
<clarkb> all of fungi's outstanding changes have landed and been applied to the server. This includes upgrading to the latest mailman3  [19:04]
<clarkb> thank you fungi for continuing to push this along  [19:04]
<fungi> i think we've merged everything we expected to merge  [19:04]
<fungi> so far no new issues observed and known issues are addressed  [19:04]
<fungi> next up is scheduling migrations for the 5 remaining mm2 domains we're hosting  [19:05]
<clarkb> we have successfully sent and received email through it since the changes  [19:05]
<fungi> migrating lists.katacontainers.io first might be worthwhile, since that will allow us to decommission the separate server it's occupying  [19:05]
<fungi> we also have lists.airshipit.org which is mostly dead so nobody's likely to notice it moving anyway  [19:06]
<fungi> as well as lists.starlingx.io and lists.openinfra.dev  [19:06]
<clarkb> ya starting with airshipit and kata seems like a good idea  [19:07]
<fungi> and lastly lists.openstack.org (which we should save for last; it will be the longest outage and should definitely have a dedicated window to itself)  [19:07]
<clarkb> do you think we should do them sequentially or try to do blocks of a few at a time for the smaller domains?  [19:07]
<fungi> i expect the openstack lists migration to require a minimum of 3 hours downtime  [19:07]
<fungi> i think maybe batches of two? so we could do airship/kata in one maintenance, openinfra/starlingx in another  [19:08]
<clarkb> sounds like a plan. We can also likely go ahead with those two blocks whenever we are ready  [19:08]
<clarkb> I don't think any of those projects are currently in the middle of release activity or similar  [19:08]
<fungi> i'll identify the most relevant mailing lists on each of those to send a heads-up to  [19:09]
<clarkb> I'm happy to be an extra set of hands/eyeballs during those migrations. I expect you'll be happy for any of us to participate  [19:10]
<fungi> mainly it's the list moderators who will need to be aware of interface changes  [19:10]
<fungi> and yes, all assistance is welcome  [19:10]
<fungi> the migration is mostly scripted now, the script i've been testing with is in system-config  [19:10]
<clarkb> great I guess let us know when you've got times picked and list moderators notified and we can take it from there  [19:11]
<fungi> will do. we can coordinate scheduling those outside the meeting  [19:12]
<clarkb> #topic Server Upgrades  [19:12]
<clarkb> Another topic where I've had some todos but haven't made progress yet  [19:12]
<clarkb> I do plan to clean up the old insecure ci registry server today and then I need to look at replacing some old servers  [19:12]
<clarkb> #topic Rax IAD image upload struggles  [19:13]
<clarkb> fungi: frickler: anything new to add here? What is the current state of image uploads for that region?  [19:13]
<fungi> i cleaned up all the leaked images in all regions  [19:13]
<fungi> there were about 400 each in dfw/ord and around 80 new in iad. now that things are mostly clean we should look for newly leaked images to see if we can spot why they're not getting cleaned up (if there are any, i haven't looked)  [19:14]
<fungi> also i'm not aware of a ticket for rackspace yet  [19:14]
<clarkb> would be great if we can put one of those together. I feel like I don't have enough of the full debug history to do it justice myself  [19:15]
<fungi> yeah, i'll try to put something together for that tomorrow  [19:16]
<frickler> I think if we could limit nodepool to upload no more than one image at a time, we would have no issue  [19:16]
<clarkb> I think we can do that but it's nodepool-builder instance wide, so we might need to run a special instance just for that region  [19:16]
<clarkb> (there is a flag for number of upload threads)  [19:17]
<clarkb> it would be clunky to do with current nodepool but possible  [19:17]
<frickler> so that would also build images another time just for that region?  [19:17]
<clarkb> yes  [19:18]
<clarkb> definitely not ideal  [19:18]
<frickler> the other option might be to delete other images and just run jammy jobs there? not sure how that would affect mixed nodesets  [19:18]
<clarkb> I think it would prevent mixed nodesets from running there but nodepool would properly avoid using that region for those nodesets  [19:19]
<clarkb> so ya that would work  [19:19]
<frickler> so I could delete the other images manually  [19:19]
<frickler> and then we can wait for the rackspace ticket to work  [19:19]
<clarkb> if things are okayish right now, maybe see if we get a response on the ticket quickly; otherwise we can refactor something like ^ or even look at nodepool changes to make it more easily "load balanced"  [19:20]
<frickler> well the issue is that the other images get older each day, not sure when that will start to cause issues in jobs  [19:21]
<clarkb> got it. The main risk is probably that we're ignoring possible bugfixes upstream of us.  [19:21]
<fungi> they are almost certainly already causing jobs to take at least a little longer since more git commits and packages have to be pulled over the network  [19:21]
<clarkb> definitely not ideal  [19:21]
<fungi> jobs which were hovering close to timeouts could be pushed over the cliff by that, i suppose  [19:22]
<fungi> or the increase in network activity could raise their chances that a stray network issue causes the job to be retried  [19:22]
<clarkb> ya maybe we should just focus on our default label (jammy) since most jobs run on that and let the others lie dormant/disabled/removed for now  [19:23]
<clarkb> ok anything else on this topic?  [19:24]
<frickler> ok, so I'll delete the other images, we can still reupload manually if needed  [19:24]
<corvus> what if...  [19:24]
<corvus> what if we set the upload threads to 1 globally; so don't make any other changes than that  [19:25]
<clarkb> corvus: we'll end up with more stale images everywhere. But maybe within a few days so that's ok?  [19:25]
<corvus> it would slow everything down, but would it be too much?  or would that be okay?  [19:25]
<clarkb> I think the upper bound of image uploads on things that are "happy" is ~1 hour  [19:26]
<frickler> I think it will be too much, 10 or so images times ~8 regions times ~30mins per image  [19:26]
<clarkb> so we'll end up about 5-ish days behind, doing some quick math in my head on fuzzy numbers  [19:26]
<fungi> and we have fewer than 24 images presently  [19:26]
<corvus> yeah, like, what's our wall-clock time for uploading to everywhere?  if that is < 24 hours then it's not a big deal?  [19:26]
<fungi> oh, upload to only one provider at a time too  [19:26]
<clarkb> 10 * 8 * .5 / 2 = 20 hours?  [19:26]
<corvus> (but also keeping in mind that we still have multiple builders, so it's not completely serialized)  [19:27]
<clarkb> .5 for half an hour per upload and /2 because we have two builders  [19:27]
<frickler> oh, that is per builder then, not global?  [19:27]
<frickler> so then we could still have two parallel uploads to IAD  [19:27]
<clarkb> frickler: yes it's an option on the nodepool-builder process  [19:27]
<clarkb> frickler: yes  [19:27]
<corvus> (but of different images)  [19:28]
<corvus> (not that it matters, just clarifying)  [19:28]
<corvus> so it'd go from 8 possible to 2 possible in parallel  [19:28]
<frickler> but that would likely still push those over the 1h limit according to what we tested  [19:28]
<clarkb> maybe it is worth trying since it is a fairly low effort change?  [19:28]
<clarkb> and reverting it is quick since we don't do anything "destructive" to cloud image content  [19:29]
<corvus> that's my feeling -- like i'm not strongly advocating for it since it's not a complete solution, but maybe it's easy and maybe close enough to good enough to buy some time  [19:29]
<frickler> yeah, ok  [19:30]
<clarkb> I'm up for trying it and if we find by the end of the week we are super behind we can revert  [19:30]
<corvus> yeah, if it doesn't work out, oh well  [19:30]
<clarkb> cool, let's try that and take it from there (including a ticket to rax if we can manage a constructive write up)  [19:31]
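A rough back-of-the-envelope sketch of the wall-clock arithmetic above, using the approximate figures from the discussion (about 10 images, ~8 regions, ~30 minutes per upload, 2 builders) and assuming one upload thread per builder; the variable names are illustrative and the real timing depends on how nodepool actually schedules uploads.

```python
# Back-of-the-envelope estimate of upload wall-clock time, using the rough
# figures from the discussion above rather than measured values.
images = 10               # distinct images being built
regions = 8               # provider regions each image is uploaded to
minutes_per_upload = 30   # rough per-upload time mentioned in the meeting
builders = 2              # nodepool-builder instances sharing the work
workers_per_builder = 1   # the proposed "one upload at a time" setting

total_uploads = images * regions
parallel_uploads = builders * workers_per_builder
wall_clock_hours = total_uploads * minutes_per_upload / 60 / parallel_uploads
print(f"{total_uploads} uploads, {parallel_uploads} in parallel: ~{wall_clock_hours:.0f}h")
# With these inputs this prints "80 uploads, 2 in parallel: ~20h",
# matching the ~20 hour estimate in the discussion.
```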
<clarkb> #topic Fedora cleanup  [19:32]
<clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove the fedora-latest nodeset  [19:32]
<clarkb> I think we're readyish for this change? The nodes themselves are largely nonfunctional so if this breaks anything it won't be more broken than before?  [19:32]
<clarkb> then we can continue towards removing the labels and images from nodepool (which will make the above situation better too)  [19:33]
<clarkb> I'm happy to continue helping nudge this along as long as we're in rough agreement about impact and process  [19:33]
<corvus> i think zuul-jobs is ready for that.  wfm.  [19:34]
<fungi> yeah, we dropped the last use of the nodeset we're aware of (was in bindep)  [19:35]
<frickler> we are still building f35 images, too, btw  [19:35]
<clarkb> frickler: ah ok so we'll clean up multiple images  [19:35]
<clarkb> alright I'll approve that change later today if I don't hear any objections  [19:36]
<frickler> just remember to drop them in the right order (which I don't remember), so nodepool can clean them up on all providers  [19:36]
<clarkb> ya I'll have to think about the nodepool ordering after the zuul side is cleaner  [19:36]
<corvus> hopefully https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-from-the-builder helps  [19:37]
<clarkb> ++  [19:37]
<corvus> (but don't actually remove the provider at the end)  [19:37]
<clarkb> #topic Zuul Ansible 8 Default  [19:38]
<clarkb> We are ansible 8 by default in opendev zuul now everywhere but openstack  [19:38]
<clarkb> I brought up the plan to switch openstack to ansible 8 by default on Monday to the TC in their meeting today and no one screamed  [19:38]
<clarkb> It's also a holiday for some of us which should help a bit  [19:38]
<fungi> i'll be around in case it goes sideways  [19:39]
<clarkb> I plan to be around long enough in the morning (and probably longer) monday to land that change and monitor it a bit  [19:39]
<fungi> well, weather permitting anyway  [19:39]
<clarkb> ya I don't have any plans yet, but it is the day before my parents leave so I might end up doing some family stuff, but nothing crazy enough that I can't jump on for debugging or a revert  [19:39]
<fungi> (things here might literally go sideways if the current storm track changes)  [19:39]
<clarkb> fungi: is that when the hurricane(s) might pass by?  [19:39]
<fungi> no, but if things get bad i'll likely be unavailable next week for cleanup  [19:40]
<frickler> if you prepare and review a patch, I can also approve that earlier on monday and watch a bit  [19:40]
<corvus> i should also be around  [19:40]
<clarkb> frickler: can do  [19:40]
<clarkb> looks like it is just one hurricane at least now  [19:41]
<clarkb> franklin is predicted to go further north and east  [19:41]
<clarkb> #topic Python container updates  [19:42]
<fungi> yeah, idalia is the one we have to watch for now  [19:42]
<clarkb> #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm.  [19:42]
<clarkb> thank you corvus for pushing up another set of these. Other than the gerrit one I think we can probably land these whenever. For Gerrit we should plan to land it when we are able to restart the container just in case  [19:42]
<clarkb> particularly since the gerrit change bumps java up to java 17  [19:43]
<corvus> o7  [19:43]
<clarkb> #link https://review.opendev.org/c/opendev/system-config/+/893073 Gitea bookworm migration. Does not use base python image.  [19:43]
<clarkb> I pushed a change for gitea earlier today that does not use the same base python images, but those images will do a similar bullseye to bookworm bump  [19:43]
<clarkb> similar to gerrit, gitea probably deserves a bit of attention in this case to ensure that gerrit replication isn't affected.  [19:44]
<clarkb> I'm also happy to do more testing with gerrit and/or gitea if we feel that is prudent  [19:44]
<clarkb> reviews and feedback very much welcome  [19:44]
<clarkb> #topic Open Discussion  [19:45]
<clarkb> Other things of note: we upgraded gitea to 1.20.3 and etherpad to 1.9.1 recently  [19:45]
<clarkb> It has been long enough that I don't expect trouble but something to be aware of  [19:45]
<fungi> yay upgrades. bigger yay for our test infrastructure which makes them almost entirely worry-free  [19:46]
<clarkb> I mentioned meetpad to someone recently and was told some group had tried it and ran into problems again. It may be worth doing a sanity check that it works as expected  [19:46]
<fungi> i'm free to do a test on it soon  [19:47]
<clarkb> I can do it after I eat some lunch. Say about 20:45 UTC  [19:47]
<fungi> i may be in the middle of food at that time but can play it by ear  [19:48]
<clarkb> tox 4.10.0 + pyproject-api 1.6.0/1.6.1 appear to have blown up projects using tox. Tox 4.11.0 fixes it apparently so rechecks will correct it  [19:48]
<clarkb> debugging of this was happening during this meeting so it is very new :)  [19:48]
<corvus> in other news, nox did not break today  [19:49]
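The affected and fixed version numbers above come straight from the discussion (tox 4.10.0 alongside pyproject-api 1.6.x reported broken, tox 4.11.0 as the fix). As an illustration only, a quick local check along these lines could flag an install in that reported-bad range; the function name is made up for the example and it assumes the third-party packaging library is available.

```python
# Illustrative check for the tox range reported as broken in the discussion
# above (4.10.0 alongside pyproject-api 1.6.x; reportedly fixed by 4.11.0).
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version  # assumed to be installed


def tox_in_reported_bad_range() -> bool:
    try:
        installed = Version(version("tox"))
    except PackageNotFoundError:
        return False  # tox is not installed in this environment
    return Version("4.10.0") <= installed < Version("4.11.0")


if __name__ == "__main__":
    if tox_in_reported_bad_range():
        print("tox is in the reported-bad range; upgrading to >=4.11.0 should help")
    else:
        print("installed tox is outside the reported-bad range")
```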
<clarkb> Oh I meant to mention to tonyb to feel free to jump into any of the above stuff or new things if still able/interested. I think you are busy with openstack election stuff right now though  [19:49]
<clarkb> sounds like that is everything. Thank you everyone!  [19:50]
<clarkb> #endmeeting  [19:50]
<opendevmeet> Meeting ended Tue Aug 29 19:50:32 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)  [19:50]
<opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-29-19.01.html  [19:50]
<opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-29-19.01.txt  [19:50]
<opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-29-19.01.log.html  [19:50]
<tonyb> election and some internal stuff but noted  [19:50]
<clarkb> tonyb: mostly didn't want you to feel like we're pushing you out of any of this. We're like a river that keeps flowing and more than welcome to have people jump in when able :)  [19:51]
<tonyb> I totally get it.  [19:51]
<fungi> ever tried to drink a whole river? no? well now's your chance!  [19:52]
