19:00:28 <clarkb> #startmeeting infra
19:00:28 <opendevmeet> Meeting started Tue Sep 17 19:00:28 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:28 <opendevmeet> The meeting name has been set to 'infra'
19:00:33 <NeilHanlon> o/ heya
19:00:33 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/OLEKXKOL5LLSYPUH6KMC5KSPZKYR24R6/ Our Agenda
19:00:41 <clarkb> #topic Announcements
19:01:00 <clarkb> I didn't have this in the email but a reminder that if you are eligible to vote in the openstack TC election you have ~1 day to do so
19:01:44 <NeilHanlon> ty for the reminder
19:03:26 <clarkb> #topic Upgrading Old Servers
19:04:30 <clarkb> tonyb: anything new with the wiki changes? I haven't seen any updates since I last reviewed them. Also I suspect you have been busy with election stuff?
19:05:55 <tonyb> not doing election stuff this time. just moving slower than I'd like
19:06:19 <tonyb> I've been addressing your review comments and testing locally
19:06:28 <tonyb> I should have updates later today
19:06:39 <clarkb> cool, looks like there were also some comments from frickler
19:06:57 <clarkb> also would it help to reorganize the meeting so this topic went at the end given the timezone delta?
19:07:06 <tonyb> yup, I'm looking at those as well
19:07:42 <tonyb> it might, I do tend to miss the very beginning of the meeting
19:08:14 <tonyb> good news is that Australia will do its DST transition within a month
19:09:11 <clarkb> still easy enough to change the order up for next time. I'll try to remember to do so
19:09:18 <clarkb> anything else related to new servers?
19:09:45 <tonyb> not from me
19:10:59 <clarkb> #topic AFS Mirror Cleanups
19:11:42 <clarkb> Nothing really new on this topic from me, other than that I keep finding distractions when it comes to pushing on xenial cleanups. I do think the next step there is removing dead/idle projects from the zuul tenant config so that we can reduce the number of things with xenial references, then follow up with xenial removal in what remains
19:11:55 <clarkb> I may take this off the agenda until I'm able to pick that up again
19:12:54 <clarkb> #topic Rackspace Flex Cloud
19:13:13 <clarkb> Wanted to give an update on where we are with Rackspace's new Flex Cloud region, but I may drop this off next week's agenda too as I think we're overall in a good spot
19:13:25 <clarkb> We're using the entirety of our quota and most things seem to be working
19:13:52 <clarkb> The small issues we have seen include: this is a floating ip cloud so some jobs have had to adjust to using private ips in their configs instead of public ips (since nodes don't know their public ips)
19:14:09 <clarkb> the mtu on the network interfaces is only 1442 instead of the common 1500
19:14:28 <clarkb> And we sometimes have slowness scanning ssh keys from nodepool which was causing boot timeouts until we increased the timeout
19:14:51 <clarkb> I do wonder if possibly the mtu thing could cause the slowness ^ there. But it seems like fragmentation should negotiate more quickly than that
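
For reference, both of those quirks are easy to confirm from a node in that region. A minimal sketch, assuming the primary interface is named ens3 and using a placeholder remote host (the real interface name varies by image and flavor):

    # interface MTU; expected to report 1442 in this region instead of 1500
    ip -o link show dev ens3
    # the fixed (private) IPv4 address, which is all the instance itself knows about
    ip -4 -o addr show dev ens3
    # demonstrate the smaller MTU: with don't-fragment set, payloads over 1414
    # bytes (1442 minus 28 bytes of IP/ICMP headers) should fail to send
    ping -c 1 -M do -s 1414 <some-remote-host>

The floating IP only exists on the neutron port and is NATed at the router, which is why jobs that templated in the "public" address had to switch to the private one.
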
19:15:14 <frickler> did we start using swift storage yet?
19:15:20 <clarkb> Continue to be on the lookout for any unexpected behaviors; they have been receptive to our feedback so far and we can continue to feed that back as well
19:15:23 <clarkb> frickler: not yet
19:15:45 <clarkb> we did add this cloud region (and openmetal) to the nested virt labels and johnsom reports they both seem to be working for that
19:15:57 <clarkb> uploading job logs to swift in that region is likely going to be the next good step we take
19:16:16 <fungi> related to my swift cleanup work though, it might be worthwhile long term to migrate from classic rackspace swift to flex swift and then ask them to just delete all the containers in our account once the current log data has expired
19:16:49 <clarkb> as far as setting swift up goes I think the first step is figuring out how auth is supposed to work for that and if our existing auth setup is functional
19:17:20 <clarkb> if it is then I think we can just add this as a new region in the list that we randomly select from. However I half expect we'll need some new settings and setup will be more involved
19:17:43 <frickler> so that would be worth keeping on the agenda then, I'd think
19:17:59 <clarkb> sure we can do that if we want to track the swift effort that way
19:18:06 <frickler> also maybe tracking when they're ready to ramp up quota?
19:18:11 <clarkb> fungi: I don't think you tried swift auth with our swift accounts in the spin up earlier right?
19:18:20 <fungi> i did not, no
19:18:27 <clarkb> ok so we don't have any idea yet on how that works
19:18:37 <clarkb> I'll see if I have time later this week to experiment
19:19:17 <clarkb> frickler: ya though I half expect that to happen in an email response to the feedback thread I started so not sure we need to check in weekly on the quota situation
19:20:06 <clarkb> any other questions or concerns or ideas related to the new cloud region?
19:21:10 <clarkb> sounds like no
19:21:15 <clarkb> #topic Etherpad 2.2.4 Upgrade
19:21:25 <clarkb> So we upgraded and everything seemed happy except for the meetpad integration
19:21:50 <clarkb> it turns out in version 2.2.2 or similar they updated etherpad to assume it is always in the root window for jquery (I may get some of these details wrong because js)
19:22:00 <clarkb> and since meetpad embeds etherpad this broke etherpad
19:22:55 <clarkb> other people using etherpad embedded (including jitsi meet users) noticed and reported the issue which got fixed in the first commit after the 2.2.4 release. Unfortunately there is no 2.2.5 release yet so we went ahead and deployed a new image that checks out the latest commit (by sha) as of the time of writing that change and this has fixed things
19:23:17 <clarkb> Ideally we won't run a random dev commit for very long so I'm still hopeful that 2.2.5 shows up soon. But things seem to work again
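
In rough terms the workaround amounts to building the image from an explicit upstream commit instead of a release tag. A minimal sketch of that kind of pin; the sha, clone path, and image tag here are placeholders, not the real values from the opendev image build:

    # hypothetical sketch: build etherpad from a pinned commit rather than a tag
    git clone https://github.com/ether/etherpad-lite etherpad-lite
    cd etherpad-lite
    git checkout <sha-of-first-commit-after-2.2.4>   # placeholder for the pinned sha
    docker build -t etherpad:pinned .

Moving back to a release later then just means replacing the pinned sha with the 2.2.5 tag once it exists.
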
19:23:50 <tonyb> makes sense given the ptg is coming up
19:24:16 <fungi> yeah, we didn't want to leave it like that any longer than necessary
19:24:23 <fungi> just glad we remembered to test it once the update was deployed
19:24:33 <clarkb> if you notice any problems with etherpad or meetpad or the integration between the two please say something
19:24:46 <clarkb> but with my admittedly limited in scope and duration testing it seems to be working again
19:25:24 <clarkb> #topic Updating ansible+ansible-lint versions in our repos
19:25:38 <clarkb> I'm selfishly keeping this item on the agenda because I'm having a tough time getting reviews :)
19:25:42 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/926848
19:25:47 <clarkb> #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970
19:26:00 <clarkb> I'd like to get these landed just as part of the ubuntu noble default nodeset maneuver
19:26:18 <clarkb> I'm happy to address feedback if we feel strongly about any of those ansible rules (eg I can disable them and undo the updates)
19:26:25 <clarkb> s/ansible rules/ansible-lint rules/
19:26:43 <clarkb> but I think getting that updated will help future proof us for a bit
19:27:26 <frickler> I was looking at those but still undecided whether to just accept it or complain like corvus did
19:27:33 <clarkb> basically don't take this as me advocating for anything in particular other than "run more up to date tools so we can keep up with python releases"
19:28:19 <corvus> i offer my moral support for skipping those rules :)
19:28:30 <clarkb> one upside to using a linter is we can avoid complaining about formatting ourselves. That said I would say that as a group we're pretty good about avoiding review nit picks like that and ansible-lint is extremely opinionated so we're kind of in a weird situation there
19:29:07 <clarkb> I suspect that other projects (nova maybe based on recent mailing list emails) get bigger benefits from just going with what the tool says to do
19:30:48 <frickler> all those "name"/"hosts" reorderings are the top ones I would likely want to not do
19:31:36 <frickler> but I can also see benefit in just following those, similar to python projects just using black and putting an end to all formatting discussions
19:32:15 <clarkb> ya that's the main thing, the easiest thing is probably to just accept someone else had an opinion then fix it once
19:32:33 <clarkb> anyway if no one feels strongly enough to -1 maybe we should proceed?
19:33:11 <clarkb> we can discuss further in review
19:33:18 <clarkb> #topic Zuul-launcher image builds
19:33:33 <clarkb> The opendev/zuul-jobs project has been created and is hosting these image build configs now
19:33:37 <clarkb> #link https://review.opendev.org/c/opendev/zuul-jobs/+/929141 Build a debian bullseye image with dib in a zuul job
19:33:50 <clarkb> this change successfully builds a debian bullseye image and I think it just merged
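
For context, what that job drives is an ordinary diskimage-builder run. A rough sketch of its shape; the exact element list, environment variables, and output name live in the job definition and are assumptions here:

    # build a minimal Debian bullseye qcow2 with diskimage-builder
    export DIB_RELEASE=bullseye
    disk-image-create -x -t qcow2 -o debian-bullseye debian-minimal vm simple-init
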
19:34:54 <clarkb> I think the next step is to upload it to an intermediate location then configure zuul to fetch and upload that to clouds?
19:35:03 <corvus> that is one next step
19:35:13 <frickler> so one question I have about this: we use the cache built into our image to prime the new cache, do I understand this correctly?
19:35:15 <clarkb> corvus: ^ do we need to disable that image in the nodepool builders to prevent conflicts or will they coordinate via zk and it should work out?
19:35:24 <corvus> the other next step, which can also start right now, is for someone to run with making more jobs for more images
19:35:26 <clarkb> frickler: correct. It's like we are doing mathematical induction on git caches
19:35:34 <corvus> frickler: yes
19:35:48 <frickler> but can we start the induction in case we lose our existing images?
19:36:00 <corvus> clarkb: it's safe to build duplicate images
19:36:29 <clarkb> frickler: I think the bootstrap process is to use an existing cloud image to run the job, then the build will just take much longer to prime the cache essentially
19:36:33 <corvus> frickler: the build should be able to run on an empty cloud node (slowly)
19:36:38 <corvus> yep
19:36:44 <tonyb> I'm keen to look at the "building more images" thing
19:36:44 <clarkb> frickler: if we find that time is too long we could manually snapshot an instance with the git repos pre-cloned and use that image
19:36:48 <corvus> we could test that case with a 3 hour job if we want
19:36:53 <corvus> tonyb: ++
19:37:24 <NeilHanlon> (so I don't get distracted looking at opensearch, I have an update on rocky CI failures w.r.t. "should we mirror rocky")
19:37:33 <NeilHanlon> just tag me when you're ready :D
19:37:37 <clarkb> NeilHanlon: will do
19:37:51 <corvus> after we get uploads to object storage working ...
19:38:08 <clarkb> corvus: are the uploads to intermediate storage then eventually the clouds something you'll be working on?
19:38:47 <corvus> ... the code to have the zuul-launcher actually create cloud images is nearly ready to merge; once that's done, we should have all the pieces in place to watch a zuul-launcher manage a full image build and upload process
19:39:35 <corvus> we will need to add the openstack driver though :)
19:39:37 <clarkb> also are we running a zuul-launcher node? or do we need to do that too
19:39:55 <corvus> https://review.opendev.org/924188
19:39:59 <corvus> safe to merge any time
19:40:11 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/924188/ Run a zuul-launcher
19:40:13 <clarkb> thanks!
19:40:31 <corvus> clarkb: if anyone else wants to do the upload to intermediate storage, i welcome it; otherwise i should be able to get to it in a bit.
19:40:57 <corvus> one open question about that: what intermediate storage do we want? existing log storage? new rax flex container?
19:41:29 <fungi> also because these are huge, we should think carefully about expirations
19:41:34 <clarkb> corvus: due to the size of these images and not needing them to live for 30 days I wonder if we should use a dedicated container
19:41:52 <clarkb> it will just make it easier for humans to grok pruning of the content should we need to
19:41:56 <corvus> (incidentally, one thing we might want to consider if we don't end up liking the process with cloud storage is that we could use a simple opendev fileserver for our intermediate storage; but i like the idea of starting with object storage)
19:42:03 <clarkb> but then we can probably upload to any/all/one of the existing swift locations
19:42:43 <corvus> dedicated container sounds good. and i was thinking an expiration of a couple of days should be okay to start with. maybe we make it longer later, but that should keep the fallout small from any early errors in programming
19:42:52 <clarkb> ++
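
For reference, once credentials for the region work, creating the dedicated container and giving uploads a short lifetime could look roughly like the following. The cloud name, container name, and two-day lifetime are placeholders/assumptions, and the swift CLI is shown authenticating via the usual OS_* environment variables:

    # confirm the credentials can reach swift at all
    openstack --os-cloud rax-flex container list
    # dedicated container for intermediate image uploads
    openstack --os-cloud rax-flex container create zuul-intermediate-images
    # per-object expiry (two days, in seconds) so early mistakes clean themselves up
    swift upload zuul-intermediate-images debian-bullseye.qcow2 \
        --header "X-Delete-After: 172800"

Swift expirations are per object rather than per container, so the lifetime would need to be set on each upload (or by whatever does the uploading).
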
19:43:51 <corvus> so if the rax-flex auth question is answered by then, maybe do it there? otherwise... vexxhost? rax-dfw?
19:44:07 <clarkb> probably not vexxhost since we don't use swift there (we made ceph sad when we tried)
19:44:12 <clarkb> but rax-dfw or ovh-bhs1 seem fine
19:44:35 <corvus> dfw will use fewer intertubes
19:44:50 <clarkb> that seems like a good reason to choose it
19:44:59 <corvus> ok, so dedicated container in rax-flex or rax-dfw. sgtm!
19:45:16 <corvus> if someone gets rax flex working, maybe please just go ahead and create an extra container? :)
19:45:20 <fungi> yeah, i noticed that rax classic dfw to rax flex sjc3 communication goes through the internet (but at least they share a common backbone provider)
19:45:35 <clarkb> corvus: will do if I manage that
19:46:04 <corvus> thx!
19:46:24 <clarkb> #topic Mirroring Rocky Linux Packages
19:46:29 <clarkb> NeilHanlon: hello!
19:46:40 <NeilHanlon> hi :)
19:46:49 <NeilHanlon> so.. i can't get opensearch to do what I want, but
19:47:01 <NeilHanlon> https://drop1.neilhanlon.me/irc/uploads/44fb256b36a4f97b/image.png
19:47:44 <clarkb> looks like something keyed off of depsolving?
19:47:49 <NeilHanlon> green is "successful", red is a job which had a "Depsolve Failed" message
19:47:50 <NeilHanlon> yeah
19:48:06 <NeilHanlon> https://drop1.neilhanlon.me/irc/uploads/17b1fdc1dad12d0b/image.png
19:48:17 <NeilHanlon> i can't seem to generate a short URL otherwise I'd link to the viz
19:48:23 <fungi> so indicates builds which hit some sort of package access problem i guess
19:48:56 <NeilHanlon> yeah these I looked into and are almost all because the host got some mirror A for Appstream and mirror B for BaseOS which were not in sync
19:49:18 <NeilHanlon> i'm sure there's others which aren't matching this depsolve message, but the signal was clear for these ones at least
19:49:37 <NeilHanlon> https://paste.opendev.org/show/bHtL7sBLms4vpOIOkxBN/ here.. the opensearch url :D
19:49:39 <clarkb> cool. I think that does suggest using our own mirrors would have a benefit
19:49:54 <fungi> which raises a related question then... when we mirror, how can we be sure we keep both of those in sync with each other?
19:49:57 <clarkb> (side note I wonder if the proxies for the upstream mirrors should do some ip stickiness)
19:50:06 <fungi> or are they mirrored as a unit?
19:50:10 <clarkb> fungi: we would be rsyncing from a single source so in theory that source will be in sync with itself
19:50:19 <clarkb> fungi: rather than rsyncing from multiple locations which may be out of sync
19:50:20 <NeilHanlon> right, yeah. using --delay-updates or so
19:50:38 <fungi> and yeah, we do delay deletions
19:51:38 <NeilHanlon> alternatively, I've sometimes set it up so that everything except the metadata is synced first, then the metadata can be fetched -- but if you're using --delete that wouldn't work
19:52:00 <clarkb> so ya as mentioned before the next steps would be to ensure we've got enough disk (using centos 9 stream as a stand in I think we decided we do), then write a mirroring script (should look very similar to centos 9 stream and other rsync scripts), then an admin can create the afs volume and merge things and get stuff published
19:52:27 <NeilHanlon> alright, I can work on a mirroring script and open a change for that
19:52:33 <tonyb> Similar to CentOS I'm working on a tool that will ensure that all packages in the repomd are available in a mirror, which we can run after rsync before the vos release
19:52:46 <clarkb> NeilHanlon: that would be great. Then whoever ends up reviewing that can ensure the afs side is ready for it to land too
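
The existing rsync-based mirror scripts give a template for what that change might look like. A rough sketch of the shape; the upstream rsync source, AFS path, and volume name below are assumptions, not the real values:

    #!/bin/bash -xe
    # hypothetical rocky mirror sync; pull BaseOS and AppStream from a single
    # upstream so the two repos stay consistent with each other
    MIRROR_VOLUME=mirror.rocky
    BASE=/afs/.openstack.org/mirror/rocky
    RSYNC_SOURCE=rsync://some-upstream-mirror/rocky   # placeholder
    rsync -rlptDvz --delay-updates --delete-after \
        "$RSYNC_SOURCE/9/" "$BASE/9/"
    # only publish the read-only replicas once the sync (and any repomd
    # consistency check) has succeeded
    vos release -v "$MIRROR_VOLUME"

The real scripts also handle locking and AFS authentication, which is omitted here.
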
Then whoever ends up reviewing that can ensure the afs side is ready for it to land too 19:52:59 <tonyb> I don't think that will help with issues where BaseOS and Appstream are out of sync though :( 19:53:10 <clarkb> we can also set the quota on the afs volume such taht we don't accidentally sync down too much content 19:53:18 <fungi> yeah, if there's a semi-quick way we can double-check consistency, afs lets us just avoid publishing that state when it's wrong 19:53:19 <clarkb> better to hit a quota limit than completely run out of disk 19:53:25 <NeilHanlon> hear hear 19:54:14 <tonyb> fungi: I ran the tool on an afs node and it was < 1min per repo, which is quick enough for me 19:54:41 <clarkb> tonyb: that is plenty fast compared to how long rsync takes even not syncing any real data 19:54:48 <fungi> yeah, that's quick, especially where afs is concerned 19:54:59 <clarkb> NeilHanlon: and don't hesitate to ask if any questions come up in preparing that script 19:55:06 <clarkb> #topic Open Discussion 19:55:14 <clarkb> we have 5 minutes for anything else before our hour is up 19:55:22 <tonyb> I was thing so, also very quick if we can avoid a bunch of job failures 19:55:53 <fungi> just a heads up that i won't be around much thursday/friday this week, or over the weekend 19:56:25 * frickler will also be offline starting thursday, hopefully just a couple of days 19:56:50 * tonyb will be more around again ... albeit in AU :/ 19:57:25 <clarkb> thanks for the heads up 19:57:38 <clarkb> sounds like that may be just abuot everything. Thank you for your time today 19:57:46 <clarkb> #endmeeting