19:00:31 #startmeeting infra
19:00:31 Meeting started Tue Dec 10 19:00:31 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:31 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:31 The meeting name has been set to 'infra'
19:00:38 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YTAWQCPLRMRJARKVH4XZR3LXF5DJ4IHQ/ Our Agenda
19:00:44 #topic Announcements
19:00:51 tonyb mentioned not being here today for the meeting due to travel
19:01:00 Anything else to announce?
19:02:25 sounds like no
19:02:35 #topic Zuul-launcher image builds
19:03:03 corvus: anything new here? Curious if api improvements have made it into zuul yet
19:04:34 Not sure if corvus is around right now. We can get back to this later if that changes
19:04:36 #topic Backup Server Pruning
19:05:24 We retired backups via the new automation and fungi reran the prune script, both to check that it works post retirement and because the server was complaining about being near full again. Good news is that got us down to 68% utilization (previous low water mark was ~77% iirc)
19:05:27 a definite improvement
19:06:01 most thanks for that go to ianw and you, i just pushed some buttons
19:06:02 The next step is to purge these retired backups
19:06:11 #link https://review.opendev.org/c/opendev/system-config/+/937040 Purge retired backups on vexxhost backup server
19:06:25 i should be able to take care of that after the meeting, once i grab some food
19:06:37 and once that has landed and purged things we should rerun the manual prune again, just to be sure that the prune script is still happy and also to get a new low water mark
19:06:40 fungi: thanks!
19:07:13 it's worth noting that so far we have only done the retirements and purges on one of two backup servers. Once we're happy with this entire process on one server we can consider how we may want to apply it to the other server
19:07:35 it's just far less urgent to do on the other as it has more disk space, and doing it one at a time is a nice safety mechanism to avoid over-deleting things
19:07:46 Any questions or concerns on backup pruning?
19:09:16 none on my end
19:09:25 #topic Upgrading old servers
19:09:36 tonyb isn't here but I did have some updates on my end
19:09:59 Now that the gerrit upgrade is mostly behind us (we still have some changes to land and test, more on that later) the next big gerrit related item on my todo list is upgrading the server
19:10:39 I suspect the easiest way to do this will be to boot a new server with a new cinder volume and then sync everything over to it. Then schedule a downtime where we can do a final sync and update DNS
19:10:50 any concerns with that approach or preferences for another method?
19:12:29 in the past we've given advance notice of the new ip address due to companies baking 29418/tcp into firewall policies
19:12:51 good point. Though I'm not sure if we did that for the move to vexxhost
19:12:54 no idea if that's a tradition worth continuing to observe
19:13:33 o/
19:13:35 we can probably do some notice since we're taking a downtime anyway. Not sure it will be the month of notice we've had in the past
19:13:36 do we want to stay on vexxhost? or maybe move to raxflex due to IPv6 issues?
19:13:36 i'd be okay with not worrying about it
19:13:50 not worrying about the advance notice i mean
19:14:03 frickler: considering raxflex once lost all of the data on our cinder volumes I am extremely wary of doing that
19:14:08 although raxflex doesn't have v6 yet either, I remember now
19:14:09 frickler: rackspace flex currently has worse ipv6 issues, in that it has no ipv6
19:14:15 I think if we were upgrading in a year the consideration would be different
19:14:17 yeah, that
19:14:42 and I'm not sure what flavor sizes are available there
19:14:53 I don't think we can get away with a much smaller gerrit node (mostly for memory)
19:14:55 also flex folks say it's still under some churn before reaching general availability, so i've been hesitant to suggest moving any control plane systems there so soon
19:15:00 i'm happy with gerrit and friends staying in rax classic for now.
19:15:10 corvus: it's in vexxhost fwiw
19:16:00 I can take a survey of ipv6 availability, flavor sizes, and operational concerns in each cloud before booting anything though
19:16:01 vexxhost's v6 routes in bgp have been getting filtered/dropped by some isps, hence frickler's concern
19:16:06 oh i do remember that now... :)
19:16:18 what's the rax classic ipv6 situation?
19:16:37 quite good
19:16:40 rax classic ipv6 generally works except for sometimes not between rax classic nodes (this is why ansible is set to use ipv4)
19:16:53 I don't think we've seen any connectivity problems to the outside world, only within the cloud
19:17:16 maybe a move back to rax classic would be a good idea then?
19:17:17 yeah, they've had some qos filtering issues in the past with v6 between regions as well, though not presently to my knowledge
19:18:10 possibly, though the issues with xen not reliably working with noble probably need better understanding before committing to that
19:18:18 also not sure how big the big flavors are, would have to check that
19:18:27 (from memory they got pretty big when we were doing elasticsearch though)
19:18:56 which is the other related topic. I'd like to target noble for this new node, so I may take a detour first and get noble with docker compose and podman working on a simpler service
19:19:05 and then we can make a decision on the exact location :)
19:19:21 sounds like our choices are likely: a) aging hardware; b) unreliable v6; c) cloud could disappear at any time
19:20:07 oh -- are the different vexxhost regions any different in regards to v6?
19:20:17 d) semi-regular issues with vouchers expiring ;)
19:20:29 maybe gerrit is in ymq and maybe sjc could be better?
19:20:30 corvus: that is a good question. frickler: do you have problems with ipv6 reachability for opendev.org / gitea?
19:20:40 corvus: ya, ymq for gerrit. gitea is in sjc1
19:21:03 wrt v6 in sjc1 vs ca-ymq-1, probably the same since they're part of one larger allocation, but we'd have to do some long-term testing to know
19:21:10 I remember more issues with gitea, but I need to check
19:21:26 fungi: yeah, though those regions are so different otherwise i wouldn't take it as a given
19:21:52 2604:e100:1::/48 vs. 2604:e100:3::/48
19:22:02 I think they were mostly affected similarly
19:22:16 ack. oh well.
:|
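For illustration, a minimal Ansible sketch of the boot-new-server, sync, then final-sync-and-cutover approach described at the start of this topic; the host names and /home/gerrit2 paths are hypothetical placeholders rather than the actual OpenDev configuration, and the real migration would follow a written plan:

```yaml
# Hypothetical sketch only: host names and paths are placeholders.
- hosts: review-new.example.org
  tasks:
    # Initial bulk copy while the old server keeps serving traffic; this can be
    # re-run as often as needed to keep the new volume warm before the cutover.
    - name: Sync the Gerrit site from the old server
      ansible.builtin.command: >
        rsync -a --delete
        review-old.example.org:/home/gerrit2/review_site/
        /home/gerrit2/review_site/
    # During the scheduled downtime: stop Gerrit on the old server, run one
    # final rsync pass, then update DNS to point at the new host.
```

Because only the delta since the last warm sync has to be copied during the outage, the final sync is what keeps the downtime window short.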
19:22:33 so anyway I think the roundabout process here is sorting out docker/compose/podman on noble, updating a service like paste, concurrently surveying hosting options, then we decide where to host gerrit and proceed with it on noble
19:22:46 clarkb: ++
19:23:32 all of 2604:e100 seems to be assigned to vexxhost, fwiw
19:23:38 I'm probably going to start with the container stuff as I expect that to require more time and effort
19:23:52 if others want to do a more proper hosting survey I'd welcome that, otherwise I'll poke at it as time allows
19:23:57 Any other questions/concerns/etc?
19:24:05 2604:e100::/32 is a singular allocation according to arin's whois
19:24:57 #topic Docker compose plugin with podman service for servers
19:25:16 No new updates on this other than that I'm likely going to start poking at it and may have updates in the future
19:25:23 #topic Docker Hub Rate Limits
19:25:51 Sort of an update on this. I noticed yesterday that kolla ran a lot of jobs (like enough to have every single one of our nodes dedicated to kolla I think)
19:26:10 for about 6-8 hours after that point docker hub rate limits were super common
19:26:30 this has me wondering if at least some of the change we've observed in rate limit behavior is related to usage changes on our end
19:26:38 though it could just be coincidental
19:26:55 some of our providers have limited address pools that are dedicated to us and get recycled frequently
19:27:14 if anyone knows of potentially suspicious changes though, digging into those might help us better understand the rate limits and make things more reliable while we sort out alternatives
19:27:30 fungi: yup, though I was seeing this in rax-ord/rax-iad too, which I don't think have the same super limited ip pools
19:27:39 but maybe they are more limited than I expect
19:28:07 kolla tried to get rid of using dockerhub, though I'm not sure whether that succeeded 100%
19:28:19 well, they may not allocate a lot of unused addresses, so deleting and replacing servers is still likely to cycle through the smaller number of available addresses
19:28:22 as for alternatives, https://review.opendev.org/c/zuul/zuul-jobs/+/935574 has the review approvals it needs but has struggled to merge
19:28:28 fungi: good point
19:28:52 corvus: to land 935574 I suspect approving it during our evening time is going to be least contended for rate limit quota
19:29:04 clarkb: ack
19:29:07 corvus: maybe we should just try rechecking it now and if it fails again approve it this evening and try to get it in?
19:29:14 i just reapproved it now, but let's try to do that at the end of day too
19:29:20 ++
19:29:47 then for opendev (and probably zuul), mirroring the python images (both ours and the parents) and the mariadb images is likely to have the biggest impact
19:30:02 everything else in opendev is like a single container image that we're often pulling from the intermediate registry anyway
19:31:26 * clarkb scribbled a note to follow that change and get it landed
19:31:37 we can follow up once that happens. Anything else on this topic?
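A rough sketch of the mirroring idea mentioned above for the python and mariadb base images; the destination registry namespace and the tag list here are made-up examples, not an existing OpenDev mirror:

```yaml
# Hypothetical sketch: destination namespace and tags are examples only.
- name: Mirror frequently pulled base images off of Docker Hub
  ansible.builtin.command: >
    skopeo copy --all
    docker://docker.io/library/{{ item }}
    docker://quay.io/example-mirror/{{ item }}
  loop:
    - python:3.11-bookworm
    - python:3.12-bookworm
    - mariadb:10.11
```

Jobs would then pull from the mirror location instead of docker.io, which is where the rate limit savings would actually come from.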
19:31:39 i can keep an eye on it and recheck too if it fails
19:32:27 #topic Gerrit 3.10 Upgrade
19:32:32 The upgrade is complete \o/
19:32:38 and I think it went pretty smoothly
19:32:43 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:33:02 this document still has a few todos on it, though most of them have been captured into changes under topic:upgrade-gerrit-3.10
19:33:06 thanks!
19:33:36 sorry my isp decided to experience a fiber cut 10 minutes into the maintenance
19:33:37 essentially it's: update some test jobs to test the right version of gerrit, drop gerrit 3.9 image builds once we feel confident we won't need to revert, then add 3.11 image builds and 3.10 -> 3.11 upgrade testing
19:33:54 fungi: heck of a time to start gardening
19:34:21 reviews on those are super appreciated. There is one gotcha in that the change to add 3.11 image builds will also rebuild our 3.10 image because I had to modify files to update acls in testing for 3.11. The side effect of this is we'll pull in my openid redirect fix from upstream when that change lands
19:34:43 I want to test the behavior of that change against 3.10 (prior testing was only against 3.9) before we merge it so that we can confidently restart gerrit on the rebuilt 3.10 image post merge
19:34:55 all that to say, if you want to defer to me on approving things that is fine. But reviews are still helpful
19:35:20 my goal is to test the openid redirect stuff this afternoon, then tomorrow morning land changes and restart gerrit quickly to pick up the update
19:35:47 sounds great
19:36:07 i also need to look into what's needed so we can update git-review testing
19:36:11 thank you to everyone who helped with the upgrade. I'm really happy with how it went
19:36:27 any gerrit 3.10 concerns before we move on?
19:37:49 #topic Diskimage Builder Testing and Release
19:38:04 we have a request to make a dib release to pick up gentoo element fixes
19:38:29 unfortunately the openeuler tests are failing consistently after our mirror updates and the distro doesn't seem to have a fallback package location. Fungi pushed a change to disable the tests which I have approved
19:38:51 getting those tests out of the way should allow us to land a couple of other bug fixes (one for svc-map and another for centos 9 installing grub when using efi)
19:39:00 and then I think we can push a tag for a new release
19:39:19 I also noticed that our openeuler image builds in nodepool are failing, likely due to a similar issue, and we need to pause them
19:39:54 do we know who to reach out to in openeuler land? I feel like this is a good "we're resetting things and you should be involved in that" point
19:40:05 as I don't think we'll get fixes into dib or our nodepool install otherwise
19:40:53 i'm assuming the proposer of the change for the openeuler mirror update is someone who could speak for their community
19:41:40 good point. I wonder if they read email :)
19:41:58 I've made some notes to try and track them down and point them at what needs updating
19:42:28 we could cc them on a change pausing the image builds, they did cc some of us directly for reviews on the mirror update change
19:42:49 I like that idea
19:42:56 fungi: did you want to push that up?
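For reference, pausing the failing image builds in nodepool is a small configuration change along these lines; the image name below is a placeholder for whatever the actual openEuler entry is called:

```yaml
# Hypothetical nodepool builder config fragment; the image name is a placeholder.
diskimages:
  - name: openeuler-example
    pause: true  # stop building new images until the mirror situation is sorted out
```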
19:43:00 i can, sure
19:43:14 i'll mention the dib change from earlier today too
19:43:14 also your dib fix got a -1 testing noble and I'm guessing that's related to the broken noble package update that JayF just discovered
19:43:22 yep
19:43:23 they pushed a kernel update but didn't push an updated kernel modules package
19:43:27 (or maybe the other way around)
19:43:35 fungi: thanks!
19:43:38 that's exactly it clarkb
19:43:41 well, sure. why would they? (distro'ing is hard work)
19:43:50 #topic Rax-ord Noble Nodes with 1 VCPU
19:43:55 and then I tried to report the bug and got told if I needed commercial support to go buy it
19:44:03 bwahahaha
19:44:13 you can imagine how long I stayed in that channel
19:44:23 oh I thought you said they confirmed it
19:44:32 that is an unfortunate response
19:44:48 they confirmed it but were like "huh, weird, whatever"
19:44:57 twice in as many months we've had jobs with weird failures due to having only 1 vcpu. The jobs all ran noble on rax-ord iirc
19:44:58 and I pushed them to acknowledge it was broken for everyone
19:45:05 JayF: ack
19:45:14 that's what got that response, when I really just wanted someone to point me to the LP project to complain to :D
19:45:30 on closer inspection of the zuul-collected debugging info, the ansible facts show a clear difference in xen version between noble nodes with 8 vcpus and those with 1 vcpu
19:45:32 people with actual support contracts will be pushing them to fix it soon enough anyway
19:45:36 I assumed they, like I would if it were reported for Ironic, would want to unbreak their users ASAP
19:46:31 my best hunch at this point is that this is not a bug in nodepool requesting the wrong flavor or a bug in openstack applying the wrong flavor (part of this assumption is that other aspects of the flavor like disk and memory totals check out) and is instead a bug in xen or noble or both that prevents the VMs from seeing all of the cpus
19:47:05 we pinged a couple of raxflex folks about it but they both indicated that the xen stuff is outside of their purview. I suspect this means the best path forward here is to file a support ticket
19:48:08 open to other ideas for getting this sorted out. I worry that a ticket might get dismissed as not having enough info or something
19:48:52 we do have some code already that rejects nodes very early in jobs so they get retried
19:49:03 likely watching and gathering more data might be the only option
19:49:05 in base-jobs i think
19:49:45 oh, also https://review.opendev.org/c/zuul/zuul-jobs/+/937376?usp=dashboard
19:50:12 to at least have a bit more evidence possibly
19:50:20 oh ya, we could have some check along the lines of: if rax and vcpu == 1 then fail
19:50:56 maybe we start with the data collection change and then add ^, and if this problem happens more than once a month we escalate
19:51:50 pretty sure there are similar checks in the validate-host role, i just need to remember where it's defined
19:52:25 fungi: my patch is also within that role
19:52:39 aha, yep
19:53:38 I just found the logs showing that frickler's update works. I'll approve it now
19:53:58 #topic Open Discussion
19:54:02 Anything else before our hour is up?
19:54:56 no updates on the niz image stuff (as expected); i'm hoping to get back to that soonish
19:55:03 A reminder that we'll have our last weekly meeting of the year next week and then we'll be back January 7
19:55:34 same bat time, same bat channel
19:56:50 sounds like that is everything. Thank you for your time and all your help running OpenDev
19:56:56 #endmeeting
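For reference, the kind of early node sanity check discussed under the Rax-ord 1 VCPU topic could look roughly like the following Ansible task; the threshold and wording are assumptions, not the actual zuul-jobs change:

```yaml
# Hypothetical sketch; the minimum vCPU count is an assumption.
- name: Fail early if the node has fewer vCPUs than its flavor should provide
  ansible.builtin.assert:
    that:
      - ansible_processor_vcpus | int >= 2
    fail_msg: >-
      Node reports only {{ ansible_processor_vcpus }} vCPU(s); failing now so
      the job is retried on a different node.
```

Failing during early pre-run validation means Zuul treats it as a node problem and retries the job rather than reporting a failure to the change.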