19:00:28 <clarkb> #startmeeting infra
19:00:28 <opendevmeet> Meeting started Tue Sep 17 19:00:28 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:28 <opendevmeet> The meeting name has been set to 'infra'
19:00:33 <NeilHanlon> o/ heya
19:00:33 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/OLEKXKOL5LLSYPUH6KMC5KSPZKYR24R6/ Our Agenda
19:00:41 <clarkb> #topic Announcements
19:01:00 <clarkb> I didn't have this in the email but a reminder that if you are eligible to vote in the openstack TC election you have ~1 day to do so
19:01:44 <NeilHanlon> ty for reminder
19:03:26 <clarkb> #topic Upgrading Old Servers
19:04:30 <clarkb> tonyb: anything new with the wiki changes? I haven't seen any updates since I last reviewed them. Also I suspect you have been busy with election stuff?
19:05:55 <tonyb> not doing election stuff this time.  just moving slower than I'd like
19:06:19 <tonyb> I've been addressing your review comments and testing locally
19:06:28 <tonyb> I should have updates later today
19:06:39 <clarkb> cool, looks like there were also some comments from frickler
19:06:57 <clarkb> also would it help to reorganize the meeting so this topic went at the end given the timezone delta?
19:07:06 <tonyb> yup, I'm looking at those as well
19:07:42 <tonyb> it might, I do tend to miss the very beginning of the meeting
19:08:14 <tonyb> good news is that Australia will do its DST transition within a month
19:09:11 <clarkb> still easy enough to change the order up for next time. I'll try to remember to do so
19:09:18 <clarkb> anything else related to new servers?
19:09:45 <tonyb> not from me
19:10:59 <clarkb> #topic AFS Mirror Cleanups
19:11:42 <clarkb> Nothing really new on this topic from me, other than that I keep finding distractions when it comes to pushing on xenial cleanups. I do think the next step there is removing dead/idle projects from the zuul tenant config so that we can reduce the number of things with xenial references, then follow up with xenial removal in what remains
19:11:55 <clarkb> I may take this off the agenda until I'm able to pick that up again
19:12:54 <clarkb> #topic Rackspace Flex Cloud
19:13:13 <clarkb> Wanted to give an update on where we are with Rackspace's new Flex Cloud region, but I may drop this from the agenda next week too as I think we're overall in a good spot
19:13:25 <clarkb> We're using the entirety of our quota and most things seem to be working
19:13:52 <clarkb> The small issues we have seen include: this is a floating ip cloud so some jobs have had to adjust to using private ips in their configs instead of public ips (since nodes don't know their public ips)
19:14:09 <clarkb> the mtu on the network interfaces is only 1442 instead of the common 1500.
19:14:28 <clarkb> And we sometimes have slowness scanning ssh keys from nodepool which was causing boot timeouts until we increased the timeout
19:14:51 <clarkb> I do wonder if the mtu thing could possibly cause that slowness ^ but it seems like fragmentation should negotiate more quickly than that
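(For illustration only, not something we run today: a minimal Python sketch of how a job could print interface MTUs to confirm it landed on one of the 1442-byte networks. Reading interfaces out of /sys is the only assumption here.)

    import pathlib

    # Walk every network interface the kernel knows about and report its MTU,
    # flagging anything below the usual 1500 bytes (flex interfaces show 1442).
    for iface in sorted(pathlib.Path("/sys/class/net").iterdir()):
        mtu = int((iface / "mtu").read_text())
        note = "" if mtu >= 1500 else "  <-- below the usual 1500"
        print(f"{iface.name}: {mtu}{note}")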
19:15:14 <frickler> did we start using swift storage yet?
19:15:20 <clarkb> Continue to be on the lookout for any unexpected behaviors; they have been receptive to our feedback so far and we can continue to feed that back as well
19:15:23 <clarkb> frickler: not yet
19:15:45 <clarkb> we did add this cloud region (and openmetal) to the nested virt labels and johnsom reports they both seem to be working for that
19:15:57 <clarkb> uploading job logs to swift in that region is likely going to be the next good step we take
19:16:16 <fungi> related to my swift cleanup work though, it might be worthwhile long term to migrate from classic rackspace swift to flex swift and then ask them to just delete all the containers in our account once the current log data has expired
19:16:49 <clarkb> as far as setting swift up goes I think the first step is figuring out how auth is supposed to work for that and if our existing auth setup is functional
19:17:20 <clarkb> if it is then I think we can just add this as a new region in the list that we randomly select from. However I half expect we'll need some new settings and setup will be more involved
19:17:43 <frickler> so that would be worth keeping on the agenda then, I'd think
19:17:59 <clarkb> sure we can do that if we want to track the swift effort that way
19:18:06 <frickler> also maybe tracking when they're ready to ramp up quota?
19:18:11 <clarkb> fungi: I don't think you tried swift auth with our swift accounts in the spin up earlier right?
19:18:20 <fungi> i did not, no
19:18:27 <clarkb> ok so we don't have any idea yet on how that works
19:18:37 <clarkb> I'll see if I have time later this week to experiment
19:19:17 <clarkb> frickler: ya though I half expect that to happen in an email response to the feedback thread I started so not sure we need to check in weekly on the quota situation
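(A minimal sketch of the auth experiment being discussed, using openstacksdk; the cloud name "raxflex" and the container name are assumptions for illustration, the real entries live in our clouds.yaml.)

    import openstack

    # Authenticate with whatever credentials clouds.yaml has for the new region
    # and prove the object-store endpoint is usable by listing containers.
    conn = openstack.connect(cloud="raxflex")
    for container in conn.object_store.containers():
        print(container.name)

    # If auth works, creating a dedicated container (e.g. for image uploads)
    # is a one-liner.
    conn.object_store.create_container(name="intermediate-images")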
19:20:06 <clarkb> any other questions or concerns or ideas related to the new cloud region?
19:21:10 <clarkb> sounds like no
19:21:15 <clarkb> #topic Etherpad 2.2.4 Upgrade
19:21:25 <clarkb> So we upgraded and everything seemed happy except for the meetpad integration
19:21:50 <clarkb> it turns out in version 2.2.2 or similar they updated etherpad to assume it is always in the root window for jquery (I may get some of these details wrong because js)
19:22:00 <clarkb> and since meetpad embeds etherpad this broke etherpad
19:22:55 <clarkb> other people using etherpad embedded (including jitsi meet users) noticed and reported the issue, which got fixed in the first commit after the 2.2.4 release. Unfortunately there is no 2.2.5 release yet, so we went ahead and deployed a new image that checks out the latest commit (by sha) as of the time of writing that change, and this has fixed things
19:23:17 <clarkb> Ideally we won't run a random dev commit for very long so I'm still hopeful that 2.2.5 shows up soon. But things seem to work again
19:23:50 <tonyb> makes sense given the ptg is coming up
19:24:16 <fungi> yeah, we didn't want to leave it like that any longer than necessary
19:24:23 <fungi> just glad we remembered to test it once the update was deployed
19:24:33 <clarkb> if you notice any problems with etherpad or meetpad or the integration between the two please say something
19:24:46 <clarkb> but with my admittedly limited in scope and duration testing it seems to be working again
19:25:24 <clarkb> #topic Updating ansible+ansible-lint versions in our repos
19:25:38 <clarkb> I'm selfishly keeping this item on the agenda because I'm having a tough time getting reviews :)
19:25:42 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/926848
19:25:47 <clarkb> #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970
19:26:00 <clarkb> I'd like to get these landed just as part of the ubuntu noble default nodeset maneuver
19:26:18 <clarkb> I'm happy to address feedback if we feel strongly about any of those ansible rules (eg I can disable them and undo the updates)
19:26:25 <clarkb> s/ansible rules/ansible-lint rules/
19:26:43 <clarkb> but I think getting that updated will help future proof us for a bit
19:27:26 <frickler> I was looking at those but still undecided whether to just accept it or complain like corvus did
19:27:33 <clarkb> basically don't take this as me advocating for anything in particular other than "run more up to date tools so we can keep up with python releases"
19:28:19 <corvus> i offer my moral support for skipping those rules :)
19:28:30 <clarkb> one upside to using a linter is we can avoid complaining about formatting ourselves. That said, as a group we're pretty good about avoiding review nitpicks like that, and ansible-lint is extremely opinionated, so we're kind of in a weird situation there
19:29:07 <clarkb> I suspect that other projects (nova maybe based on recent mailing list emails) get bigger benefits from just going with what the tool says to do
19:30:48 <frickler> all those "name"/"hosts" reorderings are the top ones I would likely want to not do
19:31:36 <frickler> but I can also see benefit in just following those, similar to python projects just using black and putting an end to all formatting discussions
19:32:15 <clarkb> ya that's the main thing; the easiest thing is probably to just accept that someone else had an opinion and then fix it once
19:32:33 <clarkb> anyway if no one feels strongly enough to -1 maybe we should proceed?
19:33:11 <clarkb> we can discuss further in review
19:33:18 <clarkb> #topic Zuul-launcher image builds
19:33:33 <clarkb> The opendev/zuul-jobs project has been created and is hosting these image build configs now
19:33:37 <clarkb> #link https://review.opendev.org/c/opendev/zuul-jobs/+/929141 Build a debian bullseye image with dib in a zuul job
19:33:50 <clarkb> this change successfully builds a debian bullseye image and I think it just merged
19:34:54 <clarkb> I think the next step is to upload it to an intermediate location then configure zuul to fetch and upload that to clouds?
19:35:03 <corvus> that is one next step
19:35:13 <frickler> so one question I have about this: we use the cache built into our image to prime the new cache, do I understand this correctly?
19:35:15 <clarkb> corvus: ^ do we need to disable that image in the nodepool builders to prevent conflicts or will they coordinate via zk and it should work out?
19:35:24 <corvus> the other next step, which can also start right now is for someone to run with making more jobs for more images
19:35:26 <clarkb> frickler: correct. It's like we are doing mathematical induction on git caches
19:35:34 <corvus> frickler: yes
19:35:48 <frickler> but can we start the induction in case we lose our existing images?
19:36:00 <corvus> clarkb: it's safe to build duplicate images
19:36:29 <clarkb> frickler: I think the bootstrap process is to use an existing cloud image to run the job then the build will just take much longer to prime the cache essentially
19:36:33 <corvus> frickler: the build should be able to run on an empty cloud node (slowly)
19:36:38 <corvus> yep
19:36:44 <tonyb> I'm keen to look at the "building more images" thing
19:36:44 <clarkb> frickler: if we find that time is too long we could manually snapshot an instance with the git repos pre cloned and use that image
19:36:48 <corvus> we could test that case with a 3 hour job if we want
19:36:53 <corvus> tonyb: ++
19:37:24 <NeilHanlon> (so I don't get distracted looking at opensearch, I have an update on rocky CI failures w.r.t. "should we mirror rocky")
19:37:33 <NeilHanlon> just tag me when you're ready :D
19:37:37 <clarkb> NeilHanlon: will do
19:37:51 <corvus> after we get uploads to object storage working ...
19:38:08 <clarkb> corvus: are the uploads to intermediate storage then eventually the clouds something you'll be working on?
19:38:47 <corvus> ... the code to have the zuul-launcher actually create cloud images is nearly ready to merge; once that's done, we should have all the pieces in place to watch a zuul-launcher manage a full image build and upload process
19:39:35 <corvus> we will need to add the openstack driver though :)
19:39:37 <clarkb> also are we running a zuul-launcher node? or do we need to do that too
19:39:55 <corvus> https://review.opendev.org/924188
19:39:59 <corvus> safe to merge any time
19:40:11 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/924188/ Run a zuul-launcher
19:40:13 <clarkb> thanks!
19:40:31 <corvus> clarkb: if anyone else wants to do the upload to intermediate storage, i welcome it; otherwise i should be able to get to it in a bit.
19:40:57 <corvus> one open question about that: what intermediate storage do we want?  existing log storage?  new rax flex container?
19:41:29 <fungi> also because these are huge, we should think carefully about expirations
19:41:34 <clarkb> corvus: due to the size of these images and not needing them to live for 30 days I wonder if we should use a dedicated container
19:41:52 <clarkb> it will just make it easier for humans to grok pruning of the content should we need to
19:41:56 <corvus> (incidentally, one thing we might want to consider if we don't end up liking the process with cloud storage is that we could use a simple opendev fileserver for our intermediate storage; but i like the idea of starting with object storage)
19:42:03 <clarkb> but then we can probably upload to any/all/one of the existing swift locations
19:42:43 <corvus> dedicated container sounds good.  and i was thinking an expiration of a couple of days should be okay to start with.  maybe we make it longer later, but that should keep the fallout small from any early errors in programming
19:42:52 <clarkb> ++
19:43:51 <corvus> so if the rax-flex auth question is answered by then, maybe do it there?  otherwise... vexxhost?  rax-dfw?
19:44:07 <clarkb> probably not vexxhost since we don't use swift there (we made ceph sad when we tried)
19:44:12 <clarkb> but rax-dfw or ovh-bhs1 seem fine
19:44:35 <corvus> dfw will use fewer intertubes
19:44:50 <clarkb> that seems like a good reason to choose it
19:44:59 <corvus> ok, so dedicated container in rax-flex or rax-dfw.  sgtm!
19:45:16 <corvus> if someone gets rax flex working, maybe please just go ahead and create an extra container? :)
19:45:20 <fungi> yeah, i noticed that rax classic dfw to rax flex sjc3 communication goes through the internet (but at least they share a common backbone provider)
19:45:35 <clarkb> corvus: will do if I manage that
19:46:04 <corvus> thx!
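(A hedged sketch of what the eventual upload step might look like via openstacksdk's cloud layer; the cloud name, container name, file path, and segment size are placeholders, and the couple-of-days expiry discussed above would be handled with swift's X-Delete-After header or a container policy rather than anything shown here.)

    import openstack

    # Upload a freshly built image into the dedicated container, segmenting it
    # because swift caps the size of a single object.
    conn = openstack.connect(cloud="rax-dfw")
    conn.create_container("zuul-images")
    conn.create_object(
        "zuul-images",
        "debian-bullseye.qcow2",
        filename="/opt/dib/debian-bullseye.qcow2",
        segment_size=1024 * 1024 * 1024,  # 1GiB segments
    )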
19:46:24 <clarkb> #topic Mirroring Rocky Linux Packages
19:46:29 <clarkb> NeilHanlon: hello!
19:46:40 <NeilHanlon> hi :)
19:46:49 <NeilHanlon> so.. i can't get opensearch to do what I want, but
19:47:01 <NeilHanlon> https://drop1.neilhanlon.me/irc/uploads/44fb256b36a4f97b/image.png
19:47:44 <clarkb> looks like something keyed off of depsolving?
19:47:49 <NeilHanlon> green is "successful", red is a job which had a "Depsolve Failed" message
19:47:50 <NeilHanlon> yeah
19:48:06 <NeilHanlon> https://drop1.neilhanlon.me/irc/uploads/17b1fdc1dad12d0b/image.png
19:48:17 <NeilHanlon> i can't seem to generate a short URL otherwise I'd link to the viz
19:48:23 <fungi> so that indicates builds which hit some sort of package access problem i guess
19:48:56 <NeilHanlon> yeah, I looked into these and they are almost all because the host got some mirror A for AppStream and mirror B for BaseOS which were not in sync
19:49:18 <NeilHanlon> i'm sure there's others which aren't matching this depsolve message, but the signal was clear for these ones at least
19:49:37 <NeilHanlon> https://paste.opendev.org/show/bHtL7sBLms4vpOIOkxBN/ here.. the opensearch url :D
19:49:39 <clarkb> cool. I think that does point to using our own mirrors having a benefit
19:49:54 <fungi> which raises a related question then... when we mirror, how can we be sure we keep both of those in sync with each other?
19:49:57 <clarkb> (side note: I wonder if the proxies for the upstream mirrors should do some ip stickiness)
19:50:06 <fungi> or are they mirrored as a unit?
19:50:10 <clarkb> fungi: we would be rsyncing from a single source so in theory that source will be in sync with itself
19:50:19 <clarkb> fungi: rather than rsyncing from multiple locations which may be out of sync
19:50:20 <NeilHanlon> right, yeah. using --delay-updates or so
19:50:38 <fungi> and yeah, we do delay deletions
19:51:38 <NeilHanlon> alternatively, I've sometimes set it up so that everything except the metadata is synced first, then the metadata can be fetched -- but if you're using --delete that wouldn't work
19:52:00 <clarkb> so ya as mentioned before the next steps would be to ensure we've got enough disk (using centos 9 stream as a stand-in I think we decided we do), then write a mirroring script (it should look very similar to centos 9 stream and other rsync scripts), then an admin can create the afs volume and merge things and get stuff published
19:52:27 <NeilHanlon> alright, I can work on a mirroring script and open a change for that
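(As a rough sketch only, not the real opendev mirror scripts, which are shell driven from the mirror-update host: pulling both repos from a single upstream with --delay-updates so partial transfers are swapped in at the end might look roughly like the following. The upstream URL and AFS paths are placeholders.)

    import subprocess

    UPSTREAM = "rsync://mirror.example.org/rocky"   # hypothetical single source
    DEST = "/afs/.openstack.org/mirror/rocky"

    # Sync BaseOS and AppStream from the same upstream so they stay consistent
    # with each other; --delay-updates holds new files aside until the end and
    # --delete-delay defers removals until after the transfer finishes.
    for repo in ("9/BaseOS", "9/AppStream"):
        subprocess.run(
            [
                "rsync", "-rltvz",
                "--delay-updates",
                "--delete", "--delete-delay",
                f"{UPSTREAM}/{repo}/",
                f"{DEST}/{repo}/",
            ],
            check=True,
        )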
19:52:33 <tonyb> Similar to CentOS, I'm working on a tool that will ensure that all packages in the repomd are available in the mirror, which we can run after rsync and before the vos release
19:52:46 <clarkb> NeilHanlon: that would be great. Then whoever ends up reviewing that can ensure the afs side is ready for it to land too
19:52:59 <tonyb> I don't think that will help with issues where BaseOS and Appstream are out of sync though :(
19:53:10 <clarkb> we can also set the quota on the afs volume such that we don't accidentally sync down too much content
19:53:18 <fungi> yeah, if there's a semi-quick way we can double-check consistency, afs lets us just avoid publishing that state when it's wrong
19:53:19 <clarkb> better to hit a quota limit than completely run out of disk
19:53:25 <NeilHanlon> hear hear
19:54:14 <tonyb> fungi: I ran the tool on an afs node and it was < 1min per repo, which is quick enough for me
19:54:41 <clarkb> tonyb: that is plenty fast compared to how long rsync takes even not syncing any real data
19:54:48 <fungi> yeah, that's quick, especially where afs is concerned
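(Not tonyb's actual tool, but a sketch of that kind of check: parse the repo metadata and confirm every listed package is present before the vos release. The repo path is a placeholder.)

    import gzip
    import pathlib
    import xml.etree.ElementTree as ET

    REPO = pathlib.Path("/afs/.openstack.org/mirror/rocky/9/BaseOS/x86_64/os")
    NS = {
        "repo": "http://linux.duke.edu/metadata/repo",
        "common": "http://linux.duke.edu/metadata/common",
    }

    # repomd.xml points at the gzipped primary metadata, which lists every rpm.
    repomd = ET.parse(REPO / "repodata" / "repomd.xml")
    primary_href = repomd.find(".//repo:data[@type='primary']/repo:location", NS).get("href")

    with gzip.open(REPO / primary_href) as f:
        primary = ET.parse(f)

    # Any package referenced by the metadata but missing from disk would break
    # depsolving for jobs, so refuse to publish if this list is non-empty.
    missing = [
        loc.get("href")
        for loc in primary.findall(".//common:location", NS)
        if not (REPO / loc.get("href")).exists()
    ]
    print(f"{len(missing)} packages listed in metadata but missing from the mirror")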
19:54:59 <clarkb> NeilHanlon: and don't hesitate to ask if any questions come up in preparing that script
19:55:06 <clarkb> #topic Open Discussion
19:55:14 <clarkb> we have 5 minutes for anything else before our hour is up
19:55:22 <tonyb> I was thinking so; also very quick if we can avoid a bunch of job failures
19:55:53 <fungi> just a heads up that i won't be around much thursday/friday this week, or over the weekend
19:56:25 * frickler will also be offline starting thursday, hopefully just a couple of days
19:56:50 * tonyb will be more around again ... albeit in AU :/
19:57:25 <clarkb> thanks for the heads up
19:57:38 <clarkb> sounds like that may be just about everything. Thank you for your time today
19:57:46 <clarkb> #endmeeting