*** NeilHanlon_ is now known as NeilHanlon | 13:11 |
clarkb | almost meeting time | 18:59 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Aug 29 19:01:10 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/2LK5PHWBDIBZDHVLIEFKFZJKB3AEJZ45/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | Monday is a holiday in some parts of the world. | 19:01 |
clarkb | #topic Service Coordinator Election | 19:02 |
fungi | congratudolences | 19:02 |
clarkb | heh I was the only nominee so I'm it by default | 19:02 |
clarkb | feedback/help/interest in taking over in the future all welcome | 19:03 |
clarkb | just let me know | 19:03 |
clarkb | #topic Infra Root Google Account | 19:03 |
clarkb | This is me noting I still haven't tried to dig into that. I feel like I need to be in a forensic frame of mind for that and I just haven't had that lately | 19:03 |
clarkb | #topic Mailman 3 | 19:04 |
clarkb | Cruising along to a topic with good news! | 19:04 |
fungi | si | 19:04 |
clarkb | all of fungi's outstanding changes have landed and been applied to the server. This includes upgrading to the latest mailman3 | 19:04 |
clarkb | thank you fungi for continuing to push this along | 19:04 |
fungi | i think we've merged everything we expected to merge | 19:04 |
fungi | so far no new issues observed and known issues are addressed | 19:04 |
fungi | next up is scheduling migrations for the 5 remaining mm2 domains we're hosting | 19:05 |
clarkb | we have successfully sent and received email through it since the changes | 19:05 |
fungi | migrating lists.katacontainers.io first might be worthwhile, since that will allow us to decommission the separate server it's occupying | 19:05 |
fungi | we also have lists.airshipit.org which is mostly dead so nobody's likely to notice it moving anyway | 19:06 |
fungi | as well as lists.starlingx.io and lists.openinfra.dev | 19:06 |
clarkb | ya starting with airshipit and kata seems like a good idea | 19:07 |
fungi | then lastly, lists.openstack.org (which we should also save for last, it will be the longest outage and should definitely have a dedicated window to itself) | 19:07 |
clarkb | do you think we should do them sequentially or try to do blocks of a few at a time for the smaller domains | 19:07 |
fungi | i expect the openstack lists migration to require a minimum of 3 hours downtime | 19:07 |
fungi | i think maybe batches of two? so we could do airship/kata in one maintenance, openinfra/starlingx in another | 19:08 |
clarkb | sounds like a plan. We can also likely go ahead with those two blocks whenever we are ready | 19:08 |
clarkb | I don't think any of those projects are currently in the middle of release activity or similar | 19:08 |
fungi | i'll identify the most relevant mailing lists on each of those to send a heads-up to | 19:09 |
clarkb | I'm happy to be an extra set of hands/eyeballs during those migrations. I expect you'll be happy for any of us to participate | 19:10 |
fungi | mainly it's the list moderators who will need to be aware of interface changes | 19:10 |
fungi | and yes, all assistance is welcome | 19:10 |
fungi | the migration is mostly scripted now, the script i've been testing with is in system-config | 19:10 |
clarkb | great I guess let us know when you've got times picked and list moderators notified and we can take it from there | 19:11 |
fungi | will do. we can coordinate scheduling those outside the meeting | 19:12 |
clarkb | #topic Server Upgrades | 19:12 |
clarkb | Another topic where I've had some todos but haven't made progress yet | 19:12 |
clarkb | I do plan to clean up the old insecure ci registry server today and then I need to look at replacing some old servers | 19:12 |
clarkb | #topic Rax IAD image upload struggles | 19:13 |
clarkb | fungi: frickler: anything new to add here? What is the current state of image uploads for that region? | 19:13 |
fungi | i cleaned up all the leaked images in all regions | 19:13 |
fungi | there were about 400 each in dfw/ord and around 80 new in iad. now that things are mostly clean we should look for newly leaked images to see if we can spot why they're not getting cleaned up (if there are any, i haven't looked) | 19:14 |
fungi | also i'm not aware of a ticket for rackspace yet | 19:14 |
clarkb | would be great if we can put one of those together. I feel like I don't have enough of the full debug history to do it justice myself | 19:15 |
fungi | yeah, i'll try to put something together for that tomorrow | 19:16 |
frickler | I think if we could limit nodepool to upload no more than one image at a time, we would have no issue | 19:16 |
clarkb | I think we can do that but it's nodepool-builder instance wide. So we might need to run a special instance just for that region | 19:16 |
clarkb | (there is a flag for number of upload threads) | 19:17 |
clarkb | it would be clunky to do with current nodepool but possible | 19:17 |
frickler | so that would also build images another time just for that region? | 19:17 |
clarkb | yes | 19:18 |
clarkb | definitely not ideal | 19:18 |
frickler | the other option might be to delete other images and just run jammy jobs there? not sure how that would affect mixed nodesets | 19:18 |
clarkb | I think it would prevent mixed nodesets from running there but nodepool would properly avoid using that region for those nodesets | 19:19 |
clarkb | so ya that would work | 19:19 |
frickler | so I could delete the other images manually | 19:19 |
frickler | and then we can wait for the rackspace ticket to work | 19:19 |
clarkb | if things are okayish right now maybe see if we get a response on the ticket quickly otherwise we can refactor something like ^ or even look at nodepool changes to make it more easily "load balanced" | 19:20 |
frickler | well the issue is that the other images get older each day, not sure when that will start to cause issues in jobs | 19:21 |
clarkb | got it. The main risk is probably that we're ignoring possible bugfixes upstream of us. | 19:21 |
fungi | they are almost certainly already causing jobs to take at least a little longer since more git commits and packages have to be pulled over the network | 19:21 |
clarkb | definitely not ideal | 19:21 |
fungi | jobs which were hovering close to timeouts could be pushed over the cliff by that, i suppose | 19:22 |
fungi | or the increase in network activity could raise their chances that a stray network issue causes the job to be retried | 19:22 |
clarkb | ya maybe we should just focus on our default label (jammy) since most jobs run on that and let the others lie dormant/disabled/removed for now | 19:23 |
clarkb | ok anything else on this topic? | 19:24 |
frickler | ok, so I'll delete the other images, we can still reupload manually if needed | 19:24 |
corvus | what if... | 19:24 |
corvus | what if we set the upload threads to 1 globally; so don't make any other changes than that | 19:25 |
clarkb | corvus: we'll end up with more stale images everywhere. But maybe only by a few days, so that's ok? | 19:25 |
corvus | it would slow everything down, but would it be too much? or would that be okay? | 19:25 |
clarkb | I think the upper bound of image uploads on things that are "happy" is ~1hour | 19:26 |
frickler | I think it will be too much, 10 or so images times ~8 regions times ~30mins per image | 19:26 |
clarkb | so we'll end up about 5 ish days behind doing some quick math in my head on fuzzy numbers | 19:26 |
fungi | and we have fewer than 24 images presently | 19:26 |
corvus | yeah, like, what's our wall-clock time for uploading to everywhere? if that is < 24 hours then it's not a big deal? | 19:26 |
fungi | oh, upload to only one provider at a time too | 19:26 |
clarkb | 10 * 8 * .5 / 2 = 20 hours? | 19:26 |
corvus | (but also keeping in mind that we still have multiple builders, so it's not completely serialized) | 19:27 |
clarkb | .5 for half an hour per upload and /2 because we have two builders | 19:27 |
frickler | oh, that is per builder then, not global? | 19:27 |
frickler | so then we could still have two parallel uploads to IAD | 19:27 |
clarkb | frickler: yes its an option on the nodepool-builder process | 19:27 |
clarkb | frickler: yes | 19:27 |
corvus | (but of different images) | 19:28 |
corvus | (not that matters, just clarifying) | 19:28 |
corvus | so it'd go from 8 possible to 2 possible in parallel | 19:28 |
frickler | but that would likely still push those over the 1h limit according to what we tested | 19:28 |
clarkb | maybe it is worth trying since it is a fairly low effort change? | 19:28 |
clarkb | and reverting it is quick since we don't do anything "destructive" to cloud image content | 19:29 |
corvus | that's my feeling -- like i'm not strongly advocating for it since it's not a complete solution, but maybe it's easy and maybe close enough to good enough to buy some time | 19:29 |
frickler | yeah, ok | 19:30 |
clarkb | I'm up for trying it and if we find by the end of the week we are super behind we can revert | 19:30 |
corvus | yeah, if it doesn't work out, oh well | 19:30 |
clarkb | cool lets try that and take it from there (including a ticket to rax if we can manage a constructive write up) | 19:31 |
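As a point of reference, the single-upload-worker experiment discussed above comes down to one flag on the builder process (the "flag for number of upload threads" mentioned earlier), traded against the rough 10 images × 8 regions × ~0.5 h ÷ 2 builders ≈ 20 h estimate. Below is a minimal sketch of how that could look in a docker-compose style deployment; the image name, volume paths, and service layout are illustrative assumptions, not the actual system-config deployment.

```yaml
# Hypothetical builder service definition; image and paths are placeholders,
# not copied from system-config.
services:
  nodepool-builder:
    image: quay.io/zuul-ci/nodepool-builder:latest
    volumes:
      - /etc/nodepool:/etc/nodepool:ro
      - /opt/dib:/opt/dib
    # --upload-workers defaults to 8; setting it to 1 serializes uploads
    # per builder process, so with two builders at most two uploads run
    # in parallel across all clouds.
    command: >-
      nodepool-builder
      -c /etc/nodepool/nodepool.yaml
      --upload-workers 1
```

Reverting is just restoring the default value, which matches the "low effort, quick to revert" framing above.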
clarkb | #topic Fedora cleanup | 19:32 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove the fedora-latest nodeset | 19:32 |
clarkb | I think we're readyish for this change? The nodes themselves are largely nonfunctional so if this breaks anything it won't be more broken than before? | 19:32 |
clarkb | then we can continue towards removing the labels and images from nodepool (which will make the above situation better too) | 19:33 |
clarkb | I'm happy to continue helping nudge this along as long as we're in rough agreement about impact and process | 19:33 |
corvus | i think zuul-jobs is ready for that. wfm. | 19:34 |
fungi | yeah, we dropped the last use of the nodeset we're aware of (was in bindep) | 19:35 |
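For context, the nodeset being removed is a short stanza in opendev/base-jobs roughly along these lines; this is a sketch from memory rather than the literal content of change 892380, and the node and label names may differ.

```yaml
# Illustrative only; the real definition in opendev/base-jobs may point
# at a different fedora label.
- nodeset:
    name: fedora-latest
    nodes:
      - name: fedora-latest
        label: fedora-36
```

Any job still referencing the nodeset after removal should surface as a Zuul configuration error rather than silently running on broken nodes.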
frickler | we are still building f35 images, too, btw | 19:35 |
clarkb | frickler: ah ok so we'll clean up multiple images | 19:35 |
clarkb | alright I'll approve that change later today if I don't hear any objections | 19:36 |
frickler | just remember to drop them in the right order (which I don't remember), so nodepool can clean them up on all providers | 19:36 |
clarkb | ya I'll have to think about the nodepool ordering after the zuul side is cleaner | 19:36 |
corvus | hopefully https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-from-the-builder helps | 19:37 |
clarkb | ++ | 19:37 |
corvus | (but don't actually remove the provider at the end) | 19:37 |
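Roughly, the ordering described in the linked documentation (worth double-checking there) amounts to a two-pass edit of the nodepool config. The sketch below uses fedora-36 and rax-iad purely as placeholders, and per the note above the provider stanza itself is left in place.

```yaml
# Pass 1: stop using the image. Remove the label and each provider's
# diskimage entry so nodepool deletes the uploads in every cloud.
labels:
  # - name: fedora-36          # delete in pass 1
providers:
  - name: rax-iad              # provider block itself stays
    diskimages:
      # - name: fedora-36      # delete in pass 1
# Pass 2: once the uploads are gone everywhere, drop the build definition.
diskimages:
  # - name: fedora-36          # delete in pass 2
```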
clarkb | #topic Zuul Ansible 8 Default | 19:38 |
clarkb | We are ansible 8 by default in opendev zuul now everywhere but openstack | 19:38 |
clarkb | I brought up the plan to switch openstack to ansible 8 by default on Monday to the TC in their meeting today and no one screamed | 19:38 |
clarkb | It's also a holiday for some of us which should help a bit | 19:38 |
fungi | i'll be around in case it goes sideways | 19:39 |
clarkb | I plan to be around long enough in the morning (and probably longer) monday to land that change and monitor it a bit | 19:39 |
fungi | well, weather permitting anyway | 19:39 |
clarkb | ya I don't have any plans yet, but it is the day before my parents leave so might end up doing some family stuff but nothing crazy enough I can't jump on for debugging or a revert | 19:39 |
fungi | (things here might literally go sideways if the current storm track changes) | 19:39 |
clarkb | fungi: is that when the hurricane(s) might pass by? | 19:39 |
fungi | no, but if things get bad i'll likely be unavailable next week for cleanup | 19:40 |
frickler | if you prepare a patch and get it reviewed, I can also approve that earlier on monday and watch a bit | 19:40 |
corvus | i should also be around | 19:40 |
clarkb | frickler: can do | 19:40 |
clarkb | looks like it is just one hurricane at least now | 19:41 |
clarkb | franklin is predicted to go further north and east | 19:41 |
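For the record, the Monday change itself should be a small tenant-config edit using Zuul's default-ansible-version tenant attribute. A minimal sketch follows, assuming the openstack tenant entry looks roughly like this; the real entry carries far more settings.

```yaml
# Sketch of the relevant fragment only; the actual openstack tenant
# definition is much larger.
- tenant:
    name: openstack
    # Jobs that do not pin ansible-version themselves will switch to
    # Ansible 8 once this lands.
    default-ansible-version: "8"
```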
clarkb | #topic Python container updates | 19:42 |
fungi | yeah, idalia is the one we have to watch for now | 19:42 |
clarkb | #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm. | 19:42 |
clarkb | thank you corvus for pushing up another set of these. Other than the gerrit one I think we can probably land these whenever. For Gerrit we should plan to land it when we are able to restart the container just in case | 19:42 |
clarkb | particularly since the gerrit change bumps java up to java 17 | 19:43 |
corvus | o7 | 19:43 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/893073 Gitea bookworm migration. Does not use base python image. | 19:43 |
clarkb | I pushed a change for gitea earlier today that does not use the same base python images but those images will do a similar bullseye to bookworm bump | 19:43 |
clarkb | similar to gerrit, gitea probably deserves a bit of attention in this case to ensure that gerrit replication isn't affected. | 19:44 |
clarkb | I'm also happy to do more testing with gerrit and or gitea if we feel that is prudent | 19:44 |
clarkb | reviews and feedback very much welcome | 19:44 |
clarkb | #topic Open Discussion | 19:45 |
clarkb | Other things of note: we upgraded gitea to 1.20.3 and etherpad to 1.9.1 recently | 19:45 |
clarkb | It has been long enough that I don't expect trouble but something to be aware of | 19:45 |
fungi | yay upgrades. bigger yay for our test infrastructure which makes them almost entirely worry-free | 19:46 |
clarkb | I mentioned meetpad to someone recently and was told some group had tried it and ran into problems again. It may be worth doing a sanity check that it works as expected | 19:46 |
fungi | i'm free to do a test on it soon | 19:47 |
clarkb | I can do it after I eat some lunch. Say about 20:45UTC | 19:47 |
fungi | i may be in the middle of food at that time but can play it by ear | 19:48 |
clarkb | tox 4.10.0 + pyproject-api 1.6.0/1.6.1 appear to have blown up projects using tox. Tox 4.11.0 fixes it apparently so rechecks will correct it | 19:48 |
clarkb | debugging of this was happening during this meeting so it is very new :) | 19:48 |
corvus | in other news, nox did not break today | 19:49 |
clarkb | Oh I meant to mention to tonyb to feel free to jump into any of the above stuff or new things if still able/interested. I think you are busy with openstack election stuff right now though | 19:49 |
clarkb | sounds like that is everything. Thank you everyone! | 19:50 |
clarkb | #endmeeting | 19:50 |
opendevmeet | Meeting ended Tue Aug 29 19:50:32 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:50 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-29-19.01.html | 19:50 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-29-19.01.txt | 19:50 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-29-19.01.log.html | 19:50 |
tonyb | election and some internal stuff but noted | 19:50 |
clarkb | tonyb: mostly didn't want you to feel like we're pushing you out of any of this. We're like a river that keeps flowing and more than welcome to have people jump in when able :) | 19:51 |
tonyb | I totally get it. | 19:51 |
fungi | ever tried to drink a whole river? no? well now's your chance! | 19:52 |