19:00:31 #startmeeting infra
19:00:31 Meeting started Tue Dec 10 19:00:31 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:31 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:31 The meeting name has been set to 'infra'
19:00:38 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YTAWQCPLRMRJARKVH4XZR3LXF5DJ4IHQ/ Our Agenda
19:00:44 #topic Announcements
19:00:51 tonyb mentioned not being here today for the meeting due to travel
19:01:00 Anything else to announce?
19:02:25 sounds like no
19:02:35 #topic Zuul-launcher image builds
19:03:03 corvus: anything new here? Curious if api improvements have made it into zuul yet
19:04:34 Not sure if corvus is around right now. We can get back to this later if that changes
19:04:36 #topic Backup Server Pruning
19:05:24 We retired backups via the new automation and fungi reran the prune script, both to check that it works post retirement and because the server was complaining about being near full again. Good news is that got us down to 68% utilization (previous low water mark was ~77% iirc)
19:05:27 a definite improvement
19:06:01 most thanks for that go to ianw and you, i just pushed some buttons
19:06:02 The next step is to purge these retired backups
19:06:11 #link https://review.opendev.org/c/opendev/system-config/+/937040 Purge retired backups on vexxhost backup server
19:06:25 i should be able to take care of that after the meeting, once i grab some food
19:06:37 and once that has landed and purged things we should rerun the manual prune again, just to be sure that the prune script is still happy and also to get a new low water mark
19:06:40 fungi: thanks!
19:07:13 it's worth noting that so far we have only done the retirements and purges on one of two backup servers. Once we're happy with this entire process on one server we can consider how we may want to apply it to the other server
19:07:35 it's just far less urgent to do on the other as it has more disk space, and doing it one at a time is a nice safety mechanism to avoid over-deleting things
19:07:46 Any questions or concerns on backup pruning?
19:09:16 none on my end
19:09:25 #topic Upgrading old servers
19:09:36 tonyb isn't here but I did have some updates on my end
19:09:59 Now that the gerrit upgrade is mostly behind us (we still have some changes to land and test, more on that later) the next big gerrit related item on my todo list is upgrading the server
19:10:39 I suspect the easiest way to do this will be to boot a new server with a new cinder volume and then sync everything over to it. Then schedule a downtime where we can do a final sync and update DNS
19:10:50 any concerns with that approach or preferences for another method?
19:12:29 in the past we've given advance notice of the new ip address due to companies baking 29418/tcp into firewall policies
19:12:51 good point. Though I'm not sure if we did that for the move to vexxhost
19:12:54 no idea if that's a tradition worth continuing to observe
19:13:33 o/
19:13:35 we can probably do some notice since we're taking a downtime anyway. Not sure it will be the month of notice we've had in the past
19:13:36 do we want to stay on vexxhost? or maybe move to raxflex due to IPv6 issues?
19:13:36 i'd be okay with not worrying about it
19:13:50 not worrying about the advance notice i mean
19:14:03 frickler: considering raxflex once lost all of the data on our cinder volumes I am extremely wary of doing that
19:14:08 although raxflex doesn't have v6 yet either, I remember now
19:14:09 frickler: rackspace flex currently has worse ipv6 issues, in that it has no ipv6
19:14:15 I think if we were upgrading in a year the consideration would be different
19:14:17 yeah, that
19:14:42 and I'm not sure what flavor sizes are available there
19:14:53 I don't think we can get away with a much smaller gerrit node (mostly for memory)
19:14:55 also flex folks say it's still under some churn before reaching general availability, so i've been hesitant to suggest moving any control plane systems there so soon
19:15:00 i'm happy with gerrit and friends staying in rax classic for now.
19:15:10 corvus: it's in vexxhost fwiw
19:16:00 I can take a survey of ipv6 availability, flavor sizes, and operational concerns in each cloud before booting anything though
19:16:01 vexxhost's v6 routes in bgp have been getting filtered/dropped by some isps, hence frickler's concern
19:16:06 oh i do remember that now... :)
19:16:18 what's the rax classic ipv6 situation?
19:16:37 quite good
19:16:40 rax classic ipv6 generally works except for sometimes not between rax classic nodes (this is why ansible is set to use ipv4)
19:16:53 I don't think we've seen any connectivity problems to the outside world, only within the cloud
19:17:16 maybe a move back to rax classic would be a good idea then?
19:17:17 yeah, they've had some qos filtering issues in the past with v6 between regions as well, though not presently to my knowledge
19:18:10 possibly, though the issues with xen not reliably working with noble probably need better understanding before committing to that
19:18:18 also not sure how big the big flavors are, would have to check that
19:18:27 (from memory they got pretty big when we were doing elasticsearch though)
19:18:56 which is the other related topic. I'd like to target noble for this new node, so I may take a detour first and get noble with docker compose and podman working on a simpler service
19:19:05 and then we can make a decision on the exact location :)
19:19:21 sounds like our choices are likely: a) aging hardware; b) unreliable v6; c) cloud could disappear at any time
19:20:07 oh -- are the different vexxhost regions any different in regards to v6?
19:20:17 d) semi-regular issues with vouchers expiring ;)
19:20:29 maybe gerrit is in ymq and maybe sjc could be better?
19:20:30 corvus: that is a good question. frickler: do you have problems with ipv6 reachability for opendev.org / gitea?
19:20:40 corvus: ya, ymq for gerrit. gitea is in sjc1
19:21:03 wrt v6 in sjc1 vs ca-ymq-1, probably the same since they're part of one larger allocation, but we'd have to do some long-term testing to know
19:21:10 I remember more issues with gitea, but I need to check
19:21:26 fungi: yeah, though those regions are so different otherwise i wouldn't take it as a given
19:21:52 2604:e100:1::/48 vs. 2604:e100:3::/48
19:22:02 I think they were mostly affected similarly
19:22:16 ack. oh well.
:|
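For illustration, a minimal Ansible sketch of the boot-new-server, sync, then final-sync-and-cutover approach described at the start of this topic; the host names and /home/gerrit2 paths are hypothetical placeholders rather than the actual OpenDev configuration, and the real migration would follow a written plan:

```yaml
# Hypothetical sketch only: host names and paths are placeholders.
- hosts: review-new.example.org
  tasks:
    # Initial bulk copy while the old server keeps serving traffic; this can be
    # re-run as often as needed to keep the new volume warm before the cutover.
    - name: Sync the Gerrit site from the old server
      ansible.builtin.command: >
        rsync -a --delete
        review-old.example.org:/home/gerrit2/review_site/
        /home/gerrit2/review_site/
    # During the scheduled downtime: stop Gerrit on the old server, run one
    # final rsync pass, then update DNS to point at the new host.
```

Because only the delta since the last warm sync has to be copied during the outage, the final sync is what keeps the downtime window short.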
19:22:33 so anyway I think the roundabout process here is sorting out docker/compose/podman on noble, updating a service like paste, concurrently surveying hosting options, then we decide where to host gerrit and proceed with it on noble
19:22:46 clarkb: ++
19:23:32 all of 2604:e100 seems to be assigned to vexxhost, fwiw
19:23:38 I'm probably going to start with the container stuff as I expect that to require more time and effort
19:23:52 if others want to do a more proper hosting survey I'd welcome that, otherwise I'll poke at it as time allows
19:23:57 Any other questions/concerns/etc?
19:24:05 2604:e100::/32 is a singular allocation according to arin's whois
19:24:57 #topic Docker compose plugin with podman service for servers
19:25:16 No new updates on this other than that I'm likely going to start poking at it and may have updates in the future
19:25:23 #topic Docker Hub Rate Limits
19:25:51 Sort of an update on this. I noticed yesterday that kolla ran a lot of jobs (like enough to have every single one of our nodes dedicated to kolla I think)
19:26:10 for about 6-8 hours after that point docker hub rate limits were super common
19:26:30 this has me wondering if at least some of the change we've observed in rate limit behavior is related to usage changes on our end
19:26:38 though it could just be coincidental
19:26:55 some of our providers have limited address pools that are dedicated to us and get recycled frequently
19:27:14 if anyone knows of potentially suspicious changes though, digging into those might help us better understand the rate limits and make things more reliable while we sort out alternatives
19:27:30 fungi: yup, though I was seeing this in rax-ord/rax-iad too, which I don't think have the same super limited ip pools
19:27:39 but maybe they are more limited than I expect
19:28:07 kolla tried to get rid of using dockerhub, though I'm not sure whether that succeeded 100%
19:28:19 well, they may not allocate a lot of unused addresses, so deleting and replacing servers is still likely to cycle through the smaller number of available addresses
19:28:22 as for alternatives, https://review.opendev.org/c/zuul/zuul-jobs/+/935574 has the review approvals it needs but has struggled to merge
19:28:28 fungi: good point
19:28:52 corvus: to land 935574 I suspect approving it during our evening time is going to be least contended for rate limit quota
19:29:04 clarkb: ack
19:29:07 corvus: maybe we should just try rechecking it now and if it fails again approve it this evening and try to get it in?
19:29:14 i just reapproved it now, but let's try to do that at the end of day too
19:29:20 ++
19:29:47 then for opendev (and probably zuul), mirroring the python images (both ours and the parents) and the mariadb images is likely to have the biggest impact
19:30:02 everything else in opendev is like a single container image that we're often pulling from the intermediate registry anyway
19:31:26 * clarkb scribbled a note to follow that change and get it landed
19:31:37 we can follow up once that happens. Anything else on this topic?
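A rough sketch of the mirroring idea mentioned above for the python and mariadb base images; the destination registry namespace and the tag list here are made-up examples, not an existing OpenDev mirror:

```yaml
# Hypothetical sketch: destination namespace and tags are examples only.
- name: Mirror frequently pulled base images off of Docker Hub
  ansible.builtin.command: >
    skopeo copy --all
    docker://docker.io/library/{{ item }}
    docker://quay.io/example-mirror/{{ item }}
  loop:
    - python:3.11-bookworm
    - python:3.12-bookworm
    - mariadb:10.11
```

Jobs would then pull from the mirror location instead of docker.io, which is where the rate limit savings would actually come from.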
19:31:39 i can keep an eye on it and recheck too if it fails
19:32:27 #topic Gerrit 3.10 Upgrade
19:32:32 The upgrade is complete \o/
19:32:38 and I think it went pretty smoothly
19:32:43 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:33:02 this document still has a few todos on it, though most of them have been captured into changes under topic:upgrade-gerrit-3.10
19:33:06 thanks!
19:33:36 sorry my isp decided to experience a fiber cut 10 minutes into the maintenance
19:33:37 essentially it's: update some test jobs to test the right version of gerrit, drop gerrit 3.9 image builds once we feel confident we won't need to revert, then add 3.11 image builds and 3.10 -> 3.11 upgrade testing
19:33:54 fungi: heck of a time to start gardening
19:34:21 reviews on those are super appreciated. There is one gotcha in that the change to add 3.11 image builds will also rebuild our 3.10 image because I had to modify files to update acls in testing for 3.11. The side effect of this is we'll pull in my openid redirect fix from upstream when that change lands
19:34:43 I want to test the behavior of that change against 3.10 (prior testing was only against 3.9) before we merge it so that we can confidently restart gerrit on the rebuilt 3.10 image post merge
19:34:55 all that to say, if you want to defer to me on approving things that is fine. But reviews are still helpful
19:35:20 my goal is to test the openid redirect stuff this afternoon, then tomorrow morning land changes and restart gerrit quickly to pick up the update
19:35:47 sounds great
19:36:07 i also need to look into what's needed so we can update git-review testing
19:36:11 thank you to everyone who helped with the upgrade. I'm really happy with how it went
19:36:27 any gerrit 3.10 concerns before we move on?
19:37:49 #topic Diskimage Builder Testing and Release
19:38:04 we have a request to make a dib release to pick up gentoo element fixes
19:38:29 unfortunately the openeuler tests are failing consistently after our mirror updates and the distro doesn't seem to have a fallback package location. Fungi pushed a change to disable the tests which I have approved
19:38:51 getting those tests out of the way should allow us to land a couple of other bug fixes (one for svc-map and another for centos 9 installing grub when using efi)
19:39:00 and then I think we can push a tag for a new release
19:39:19 I also noticed that our openeuler image builds in nodepool are failing, likely due to a similar issue, and we need to pause them
19:39:54 do we know who to reach out to in openeuler land? I feel like this is a good "we're resetting things and you should be involved in that" point
19:40:05 as I don't think we'll get fixes into dib or our nodepool install otherwise
19:40:53 i'm assuming the proposer of the change for the openeuler mirror update is someone who could speak for their community
19:41:40 good point. I wonder if they read email :)
19:41:58 I've made some notes to try and track them down and point them at what needs updating
19:42:28 we could cc them on a change pausing the image builds, they did cc some of us directly for reviews on the mirror update change
19:42:49 I like that idea
19:42:56 fungi: did you want to push that up?
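For reference, pausing the failing image builds in nodepool is a small configuration change along these lines; the image name below is a placeholder for whatever the actual openEuler entry is called:

```yaml
# Hypothetical nodepool builder config fragment; the image name is a placeholder.
diskimages:
  - name: openeuler-example
    pause: true  # stop building new images until the mirror situation is sorted out
```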
19:43:00 i can, sure
19:43:14 i'll mention the dib change from earlier today too
19:43:14 also your dib fix got a -1 testing noble and I'm guessing that's related to the broken noble package update that JayF just discovered
19:43:22 yep
19:43:23 they pushed a kernel update but didn't push an updated kernel modules package
19:43:27 (or maybe the other way around)
19:43:35 fungi: thanks!
19:43:38 that's exactly it clarkb
19:43:41 well, sure. why would they? (distro'ing is hard work)
19:43:50 #topic Rax-ord Noble Nodes with 1 VCPU
19:43:55 and then I tried to report the bug and got told if I needed commercial support to go buy it
19:44:03 bwahahaha
19:44:13 you can imagine how long I stayed in that channel
19:44:23 oh I thought you said they confirmed it
19:44:32 that is an unfortunate response
19:44:48 they confirmed it but were like "huh, weird, whatever"
19:44:57 twice in as many months we've had jobs with weird failures due to having only 1 vcpu. The jobs all ran noble on rax-ord iirc
19:44:58 and I pushed them to acknowledge it was broken for everyone
19:45:05 JayF: ack
19:45:14 that's what got that response, when I really just wanted someone to point me to the LP project to complain to :D
19:45:30 on closer inspection of the zuul-collected debugging info, the ansible facts show a clear difference in xen version between noble nodes with 8 vcpus and those with 1 vcpu
19:45:32 people with actual support contracts will be pushing them to fix it soon enough anyway
19:45:36 I assumed they, like I would if it were reported for Ironic, would want to unbreak their users ASAP
19:46:31 my best hunch at this point is that this is not a bug in nodepool requesting the wrong flavor or a bug in openstack applying the wrong flavor (part of this assumption is that other aspects of the flavor like disk and memory totals check out) and is instead a bug in xen or noble or both that prevents the VMs from seeing all of the cpus
19:47:05 we pinged a couple of raxflex folks about it but they both indicated that the xen stuff is outside of their purview. I suspect this means the best path forward here is to file a support ticket
19:48:08 open to other ideas for getting this sorted out. I worry that a ticket might get dismissed as not having enough info or something
19:48:52 we do have some code already that rejects nodes very early in jobs so they get retried
19:49:03 likely watching and gathering more data might be the only option
19:49:05 in base-jobs i think
19:49:45 oh, also https://review.opendev.org/c/zuul/zuul-jobs/+/937376?usp=dashboard
19:50:12 to at least have a bit more evidence possibly
19:50:20 oh ya, we could have some check along the lines of: if rax and vcpu == 1 then fail
19:50:56 maybe we start with the data collection change and then add ^, and if this problem happens more than once a month we escalate
19:51:50 pretty sure there are similar checks in the validate-host role, i just need to remember where it's defined
19:52:25 fungi: my patch is also within that role
19:52:39 aha, yep
19:53:38 I just found the logs showing that frickler's update works. I'll approve it now
19:53:58 #topic Open Discussion
19:54:02 Anything else before our hour is up?
19:54:56 no updates on the niz image stuff (as expected); i'm hoping to get back to that soonish
19:55:03 A reminder that we'll have our last weekly meeting of the year next week and then we'll be back January 7
19:55:34 same bat time, same bat channel
19:56:50 sounds like that is everything. Thank you for your time and all your help running OpenDev
19:56:56 #endmeeting
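For reference, the kind of early node sanity check discussed under the Rax-ord 1 VCPU topic could look roughly like the following Ansible task; the threshold and wording are assumptions, not the actual zuul-jobs change:

```yaml
# Hypothetical sketch; the minimum vCPU count is an assumption.
- name: Fail early if the node has fewer vCPUs than its flavor should provide
  ansible.builtin.assert:
    that:
      - ansible_processor_vcpus | int >= 2
    fail_msg: >-
      Node reports only {{ ansible_processor_vcpus }} vCPU(s); failing now so
      the job is retried on a different node.
```

Failing during early pre-run validation means Zuul treats it as a node problem and retries the job rather than reporting a failure to the change.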