19:00:28 <clarkb> #startmeeting infra
19:00:28 <opendevmeet> Meeting started Tue May 14 19:00:28 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:28 <opendevmeet> The meeting name has been set to 'infra'
19:01:07 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/Q3ZNOUTLIF3HWCUX7CQI7ITFBYC3KCCF/ Our Agenda
19:01:11 <clarkb> #topic Announcements
19:01:23 <clarkb> I'm planning to take Friday off and probably Monday as well and do a long weekend
19:01:57 <frickler> monday is a public holiday here
19:02:34 <fungi> i am back from my travels and trying to catch up again
19:02:44 <clarkb> oh nice. The monday after is a holiday here but I'm likely going to work that monday and instead do the long weekend this weekend
19:03:34 <clarkb> #topic Upgrading Old Servers
19:03:47 <clarkb> not sure if tonyb is around but I think we can dive right in
19:03:53 <clarkb> the meetpad cluster has been upgraded to jammy \o/
19:04:00 <tonyb> Yup I'm here
19:04:06 <clarkb> the old servers are still around but in the emergency file and shutdown
19:04:17 <tonyb> so next is wiki and/or cacti
19:04:21 <clarkb> I think we can probably delete them in the near future as I've just managed to use meetpad successfully without issue
19:04:45 <tonyb> Great. I'll remove them today/tomorrow
19:05:15 <tonyb> WRT cacti there is an approved spec to implement prometheus
19:05:17 <clarkb> tonyb: sounds good. In addition to what is next I'm thinking we should try and collapse our various todo lists into a single one for server upgrades. Do you want to do that or should I take that on?
19:05:39 <tonyb> I can do it.
19:06:08 <clarkb> tonyb: yes there is. The main bit of that that needs effort is likely the agent for reporting from servers to prometheus. I think running prometheus itself should be straightforward with a volume backing the metric data and containers for the service (that's still non-zero effort but is fairly straightforward)
19:06:48 <tonyb> So how long would cacti be around once we have prometheus?
19:07:22 <clarkb> tonyb: that is a good question. I suspect we can keep it around for some time in a locked down state for historical access, but we don't need to worry about that being public which does reduce a number of concerns
19:07:27 <tonyb> I looked at cacti and it seems like it should be doable on jammy or noble which I think would avoid the need for extended OS support from Canonical
19:07:38 <clarkb> I think currently cacti will give us a year of graphing by default so that's probably a good goal to aim for if possible
19:07:58 <clarkb> I'm not sure if older data is accessible at all
19:08:10 <tonyb> Okay. I'll work on that.
19:08:26 <corvus> yep 1 year, no older data
19:09:05 <tonyb> got it. I have the start of a plan
19:09:59 <clarkb> thanks, feel free to reach out with any other questions or concerns
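A minimal sketch of how the Prometheus side of that spec could be sanity-checked once the service and the per-server agents are running: it queries the standard Prometheus HTTP API for scrape targets and reports any whose last scrape failed. The hostname is a placeholder (no name is decided in this meeting), and the labels depend on how the reporting agent ends up being deployed.

#!/usr/bin/env python3
"""Sketch: report unhealthy scrape targets from a Prometheus server."""
import json
import urllib.request

# Placeholder URL; the real deployment's hostname is not decided here.
PROMETHEUS_URL = "https://prometheus.example.opendev.org"

def unhealthy_targets(base_url):
    """Return the active scrape targets whose most recent scrape failed."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        data = json.load(resp)
    return [t for t in data["data"]["activeTargets"] if t.get("health") != "up"]

if __name__ == "__main__":
    for target in unhealthy_targets(PROMETHEUS_URL):
        labels = target.get("labels", {})
        print(f"{labels.get('job')} {labels.get('instance')}: {target.get('lastError')}")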
19:10:15 <tonyb> I was also thinking .... we have a number of "simple" services that just got put onto jammy. Is it a good/bad or indifferent idea to pull them onto noble?
19:10:36 <clarkb> tonyb: I don't think it is a bad idea, but I also think they are less urgent
19:10:46 <tonyb> I worry that doing so will mean more pressure when getting away from noble, but it also simplifies the infra
19:10:53 <tonyb> Okay. Noted
19:10:55 <clarkb> historically we've been happy to skip the intermediate LTS and just go to the next one
19:11:11 <clarkb> though I think we've got a ton of stuff on focal so there may be some imbalance in the distribution
19:11:22 <tonyb> if there are ones that are easy or make sense I'll see about including them
19:11:46 <clarkb> we also don't have noble test nodes yet so testing noble system-config nodes isn't yet doable
19:11:56 <clarkb> which is related to the next couple of topics
19:12:09 <clarkb> let's jump into those
19:12:12 <clarkb> #topic Adding Noble Test Nodes
19:12:39 <clarkb> I think we're still at the step of needing to mirror noble. I may have time to poke at that later this week depending on how some of this gerrit upgrade prep goes
19:13:17 <clarkb> Once we've managed to write all that data into afs we can then add noble nodes to nodepool. I don't think we need a dib release for that because only glean and our own elements changed for noble
19:13:34 <tonyb> Sounds good.
19:13:36 <clarkb> but if my understanding of that is wrong please let me know and we can make a new dib release too
19:14:20 <clarkb> #topic AFS Volume Cleanup
19:14:47 <clarkb> after our meeting last week all the mergeable topic:drop-ubuntu-xenial changes merged
19:15:12 <clarkb> thank you for the reviews. At this point we are no longer building wheels for xenial and a number of unused jobs have been cleared out. That said there is still much to do
19:15:19 <clarkb> next on my list for this item is retiring devstack-gate
19:15:39 <clarkb> it is/was a big xenial consumer and openstack doesn't use it anymore so I'm going to retire it and pull it out of zuul entirely
19:16:12 <clarkb> then when that is done the next step is pruning the rest of the xenial usage for python jobs and nodejs and so on. I suspect this is where momentum will slow down and/or we'll just pull the plug on some stuff
19:16:27 <fungi> hopefully the blast radius for that is vanishingly small, probably non-openstack projects which haven't been touched in years
19:16:37 <fungi> for devstack-gate i mean
19:16:41 <clarkb> ya a lot of x/* stuff has xenial jobs too
19:16:54 <clarkb> I expect many of those will end up removed from zuul (without full retirement)
19:17:01 <fungi> also means we should be able to drop a lot of the "legacy" jobs converted from zuul v2
19:17:02 <clarkb> similar to what we did with the fedora cleanup and windmill
19:17:23 <clarkb> oh good point
19:17:50 <fungi> tripleo was one of the last stragglers depending on legacy jobs, i think
19:17:53 <tonyb> fun. never a dull moment ;P
19:19:19 <clarkb> so ya keep an eye out for changes related to this and I'm sure I'll bug people when I've got a good set that need eyeballs
19:19:25 <clarkb> slow progress is being made otherwise
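As a rough aid to the pruning described above, a script along these lines could list Zuul jobs whose names suggest xenial or legacy (Zuul v2 converted) definitions via the public Zuul API. Matching on the job name is only a heuristic: a job can pin an ubuntu-xenial nodeset without advertising it in its name, so treat this as a starting point rather than an inventory. The openstack tenant is used as an example.

#!/usr/bin/env python3
"""Sketch: list Zuul job names that look like xenial or legacy cleanup candidates."""
import json
import urllib.request

# The openstack tenant is just an example; other tenants can be substituted.
ZUUL_JOBS_API = "https://zuul.opendev.org/api/tenant/openstack/jobs"

def candidate_jobs(api_url):
    """Return job names containing 'xenial' or carrying the 'legacy-' prefix."""
    with urllib.request.urlopen(api_url) as resp:
        jobs = json.load(resp)
    return sorted(j["name"] for j in jobs
                  if "xenial" in j["name"] or j["name"].startswith("legacy-"))

if __name__ == "__main__":
    for name in candidate_jobs(ZUUL_JOBS_API):
        print(name)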
19:19:49 <clarkb> #topic Gerrit 3.9 Upgrade Planning
19:20:23 <clarkb> Last week we landed a change to rebuild our gerrit images so that we would actually have an up to date 3.9 image after the last attempt failed (my fault as the jobs I needed to run were not triggered)
19:20:48 <clarkb> we restarted Gerrit to pick up the corresponding 3.8 image update just to be sure everything there was happy and also upgraded mariadb to 10.11 at the same time
19:21:12 <clarkb> I have since held some test nodes to go through the upgrade and downgrade process in order to capture notes for our planning etherpad
19:21:18 <fungi> i can't remember, did i see a change go by last week so that similar changes will trigger image rebuilds in the future?
19:21:25 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.9 Upgrade prep and process notes
19:21:48 <tonyb> fungi: I think you are thinking wishfully
19:21:52 <clarkb> fungi: no change that I am aware of. I'm on the fence over whether or not we want that
19:22:12 <clarkb> it's ambiguous to me if the 3.9 image should be promoted when we aren't explicitly updating the 3.9 image and only modifying the job. Maybe it should
19:22:37 <fungi> tonyb: i'm a repo-half-full kinda guy
19:23:03 <clarkb> other items of note here: testing seems to confirm the topic change limit should be a non issue for the upgrade and our known areas of concern (since we never have that many open changes)
19:23:16 <clarkb> the limit is also forgiving enough to allow you to restore abandoned changes that push it over the limit
19:23:39 <clarkb> creating new changes with that topic or applying the topic to existing changes that would push it over the limit does error though
19:23:40 <fungi> and only rejects attempts to push new ones or set the topic
19:23:50 <fungi> yeah, that. perfect
19:23:52 <clarkb> yup that appears to be the case through testing
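For reference, the kind of check discussed above can be repeated against the production server with the Gerrit REST API: count the open changes sharing a topic and compare that against the new per-topic limit. This is only a sketch; the topic name is just an example and the anonymous query limit may truncate very large topics.

#!/usr/bin/env python3
"""Sketch: count open changes sharing a topic on review.opendev.org."""
import json
import urllib.request

GERRIT_URL = "https://review.opendev.org"

def open_changes_with_topic(topic):
    """Return the number of open changes carrying the given topic."""
    url = f"{GERRIT_URL}/changes/?q=topic:%22{topic}%22+status:open&n=500"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; drop that line.
    return len(json.loads(body.split("\n", 1)[1]))

if __name__ == "__main__":
    # Example topic from the xenial cleanup discussed earlier in the meeting.
    print(open_changes_with_topic("drop-ubuntu-xenial"))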
19:24:23 <clarkb> I haven't seen any feedback on my change to add a release war build target to drop web fonts
19:24:54 <clarkb> would probably be good for other infra-root to form an opinion on whether or not we want to patch the build stuff locally for our prod builds or just keep using docs with web fonts
19:25:18 <fungi> what was the concern there? were we leaking visitor details by linking to remotely-hosted webfont files?
19:25:35 <clarkb> fungi: we are but only if you open docs I think
19:26:06 <fungi> oh, so not for the general interface
19:26:18 <fungi> just the documentation part
19:26:28 <clarkb> and gerrit added the ability to disable that very recently when building docs. However they didn't expose that option when building release wars
19:26:33 <clarkb> just when building the documentation directly
19:26:40 <fungi> neat
19:26:57 <clarkb> ya just did a quick check. Gerrit proper serves the fonts it needs
19:27:01 <clarkb> the docs fetch them from google
19:27:26 <fungi> probably added the disable feature to aid in faster local gerrit doc development
19:28:00 <fungi> and didn't consider that people might not want remote webfont links in production
19:28:05 <clarkb> so anyway I pushed a change upstream to have a release war build target that depends on the documentation build without web fonts
19:28:14 <clarkb> we can apply that patch locally to our gerrit builds if we want
19:28:30 <clarkb> or we can stick with the status quo until the change merges upstream
19:29:33 <clarkb> feedback on the approach taken there is very much welcome
19:30:03 <clarkb> #link https://gerrit-review.googlesource.com/c/gerrit/+/424404 Upstream release war build without web fonts change
19:30:16 <clarkb> #topic openstack.org DNS record cleanup
19:30:24 <clarkb> #link https://paste.opendev.org/show/bVHJchKcKBnlgTNRHLVK/ Proposed DNS cleanup
19:30:49 <fungi> just a quick request for folks to look through that list and let me know if there are any entries which i shouldn't delete
19:30:50 <clarkb> fungi put together this list of openstack.org records that can be cleaned up. I skimmed the list and had a couple of records to call out, but otherwise it seems fine
19:31:06 <clarkb> api.openstack.org redirects to developer.openstack.org so something is backing that name and record
19:31:44 <clarkb> and then I don't think we managed whatever sat behind the rss.cdn record so unsure if that is in use or expected to function
19:31:55 <fungi> oh, yep, i'm not sure why api.openstack.org was in the list
19:32:12 <fungi> i'll say it was a test to see if anyone read closely ;)
19:32:40 <fungi> we did manage the rss.cdn.o.o service, back around 2013 or so
19:32:51 <fungi> it was used to provide an rss feed of changes for review
19:33:13 <clarkb> oh I remember that now.
19:33:21 <fungi> i set up the swift container for that, and anita did the frontend work i think
19:33:35 <fungi> openstackwatch, i think it was called?
19:33:40 <clarkb> that sounds right
19:34:03 <fungi> cronjob that turned gerrit queries into rss and uploaded the blob there
19:35:14 <clarkb> those were the only two I had questions about. Everything else looked safe to me
19:35:49 <clarkb> and now api.o.o is the only one I would remove from the list
19:36:21 <tonyb> Nothing stands out to me, but I can do some naive checking to confirm
19:36:47 <fungi> oddly, i can't tell what's supplying the api.openstack.org redirect
19:37:19 <fungi> something we don't manage inside an ip block allocated to Liquid Web, L.L.C
19:38:02 <fungi> i guess we should host that redirect on static.o.o instead
19:39:15 <fungi> i'll push up a change for that and make the dns adjustment once it deploys
19:40:15 <clarkb> sounds like a plan
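The "naive checking" mentioned above could be as simple as resolving each name on the list and seeing whether anything still answers, for example with a short script like this. The two names shown are only the ones called out in the discussion; the full candidate list lives in the paste linked above.

#!/usr/bin/env python3
"""Sketch: resolve DNS names proposed for cleanup and show what they point at."""
import socket

# Illustrative subset; the full list is in the linked paste.
CANDIDATES = [
    "api.openstack.org",      # still redirects to developer.openstack.org
    "rss.cdn.openstack.org",  # old openstackwatch feed, likely defunct
]

def resolve(name):
    """Return the addresses a name resolves to, or an empty list if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    for name in CANDIDATES:
        addrs = resolve(name)
        print(f"{name}: {', '.join(addrs) if addrs else 'does not resolve'}")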
19:40:22 <clarkb> #topic Open Discussion
19:40:26 <clarkb> Anything else?
19:40:35 <tonyb> Vexxhost and nodepool
19:41:02 <clarkb> I meant to mention this in the main channel earlier today but didn't: this is the classic reason for why we have redundancy
19:41:04 <tonyb> frickler reported API issues and they seem to be ongoing
19:41:07 <clarkb> I don't think this is an emergency
19:41:25 <clarkb> however, it would be good to understand in case it's a systemic problem that would affect say gitea servers or review
19:41:59 <frickler> seeing the node failures I'm also wondering whether disabling that region would help
19:42:44 <tonyb> No, not an emergency, just don't want it to linger and wondering when, or indeed if, we should act/escalate
19:43:53 <clarkb> yes i think the next step may be to set max-servers to 0 in those region(s) then try manually booting instances
19:43:56 <clarkb> and debug from there
19:44:32 <clarkb> then hopefully that gives us information that can be used to escalate effectively
19:45:18 <tonyb> Okay. That sounds reasonable.
19:48:27 <clarkb> sounds like that may be everything. Thank you everyone
19:48:37 <clarkb> We'll be back here same time and location next week
19:48:42 <corvus> thanks!
19:48:51 <clarkb> and feel free to bring things up in irc or on the mailing list in the interim
19:48:53 <clarkb> #endmeeting