19:00:28 <clarkb> #startmeeting infra
19:00:28 <opendevmeet> Meeting started Tue May 14 19:00:28 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:28 <opendevmeet> The meeting name has been set to 'infra'
19:01:07 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/Q3ZNOUTLIF3HWCUX7CQI7ITFBYC3KCCF/ Our Agenda
19:01:11 <clarkb> #topic Announcements
19:01:23 <clarkb> I'm planning to take Friday off and probably Monday as well and do a long weekend
19:01:57 <frickler> monday is a public holiday here
19:02:34 <fungi> i am back from my travels and trying to catch up again
19:02:44 <clarkb> oh nice. The monday after is a holiday here but I'm likely going to work that monday and instead do the long weekend this weekend
19:03:34 <clarkb> #topic Upgrading Old Servers
19:03:47 <clarkb> not sure if tonyb is around but I think we can dive right in
19:03:53 <clarkb> the meetpad cluster has been upgraded to jammy \o/
19:04:00 <tonyb> Yup I'm here
19:04:06 <clarkb> the old servers are still around but in the emergency file and shutdown
19:04:17 <tonyb> so next is wiki and/or cacti
19:04:21 <clarkb> I think we can probably delete them in the near future as I've just managed to use meetpad successfully without issue
19:04:45 <tonyb> Great. I'll remove them today/tomorrow
19:05:15 <tonyb> WRT cacti there is an approved spec to implement prometheus
19:05:17 <clarkb> tonyb: sounds good. In addition to what is next I'm thinking we should try and collapse our various todo lists into a single one for server upgrades. Do you want to do that or should I take that on?
19:05:39 <tonyb> I can do it.
19:06:08 <clarkb> tonyb: yes there is. The main bit of that that needs effort is likely the agent for reporting from servers to prometheus. I think running prometheus itself should be straightforward with a volume backing the metric data and containers for the service (that's still non-zero effort but is fairly straightforward)
19:06:48 <tonyb> So how long would cacti be around once we have prometheus?
19:07:22 <clarkb> tonyb: that is a good question. I suspect we can keep it around for some time in a locked down state for historical access, but we don't need to worry about that being public which does reduce a number of concerns
19:07:27 <tonyb> I looked at cacti and it seems like it should be doable on jammy or noble which I think would avoid the need for extended OS support from Canonical
19:07:38 <clarkb> I think currently cacti will give us a year of graphing by default so that's probably a good goal to aim for if possible
19:07:58 <clarkb> I'm not sure if older data is accessible at all
19:08:10 <tonyb> Okay. I'll work on that.
19:08:26 <corvus> yep 1 year, no older data
19:09:05 <tonyb> got it. I have the start of a plan
19:09:59 <clarkb> thanks, feel free to reach out with any other questions or concerns
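A minimal sketch of how the Prometheus side of that spec could be sanity-checked once the service and the per-server agents are running: it queries the standard Prometheus HTTP API for scrape targets and reports any whose last scrape failed. The hostname is a placeholder (no name is decided in this meeting), and the labels depend on how the reporting agent ends up being deployed.

#!/usr/bin/env python3
"""Sketch: report unhealthy scrape targets from a Prometheus server."""
import json
import urllib.request

# Placeholder URL; the real deployment's hostname is not decided here.
PROMETHEUS_URL = "https://prometheus.example.opendev.org"

def unhealthy_targets(base_url):
    """Return the active scrape targets whose most recent scrape failed."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        data = json.load(resp)
    return [t for t in data["data"]["activeTargets"] if t.get("health") != "up"]

if __name__ == "__main__":
    for target in unhealthy_targets(PROMETHEUS_URL):
        labels = target.get("labels", {})
        print(f"{labels.get('job')} {labels.get('instance')}: {target.get('lastError')}")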
19:10:15 <tonyb> I was also thinking .... we have a number of "simple" services that just got put onto jammy. Is it a good/bad or indifferent idea to pull them onto noble?
19:10:36 <clarkb> tonyb: I don't think it is a bad idea, but I also think they are less urgent
19:10:46 <tonyb> I worry that doing so will mean more pressure when getting away from noble, but it also simplifies the infra
19:10:53 <tonyb> Okay. Noted
19:10:55 <clarkb> historically we've been happy to skip the intermediate LTS and just go to the next one
19:11:11 <clarkb> though I think we've got a ton of stuff on focal so there may be some imbalance in the distribution
19:11:22 <tonyb> if there are ones that are easy or make sense I'll see about including them
19:11:46 <clarkb> we also don't have noble test nodes yet so testing noble system-config nodes isn't yet doable
19:11:56 <clarkb> which is related to the next couple of topics
19:12:09 <clarkb> let's jump into those
19:12:12 <clarkb> #topic Adding Noble Test Nodes
19:12:39 <clarkb> I think we're still at the step of needing to mirror noble. I may have time to poke at that later this week depending on how some of this gerrit upgrade prep goes
19:13:17 <clarkb> Once we've managed to write all that data into afs we can then add noble nodes to nodepool. I don't think we need a dib release for that because only glean and our own elements changed for noble
19:13:34 <tonyb> Sounds good.
19:13:36 <clarkb> but if my understanding of that is wrong please let me know and we can make a new dib release too
19:14:20 <clarkb> #topic AFS Volume Cleanup
19:14:47 <clarkb> after our meeting last week all the mergeable topic:drop-ubuntu-xenial changes merged
19:15:12 <clarkb> thank you for the reviews. At this point we are no longer building wheels for xenial and a number of unused jobs have been cleared out. That said there is still much to do
19:15:19 <clarkb> next on my list for this item is retiring devstack-gate
19:15:39 <clarkb> it is/was a big xenial consumer and openstack doesn't use it anymore so I'm going to retire it and pull it out of zuul entirely
19:16:12 <clarkb> then when that is done the next step is pruning the rest of the xenial usage for python jobs and nodejs and so on. I suspect this is where momentum will slow down and/or we'll just pull the plug on some stuff
19:16:27 <fungi> hopefully the blast radius for that is vanishingly small, probably non-openstack projects which haven't been touched in years
19:16:37 <fungi> for devstack-gate i mean
19:16:41 <clarkb> ya a lot of x/* stuff has xenial jobs too
19:16:54 <clarkb> I expect many of those will end up removed from zuul (without full retirement)
19:17:01 <fungi> also means we should be able to drop a lot of the "legacy" jobs converted from zuul v2
19:17:02 <clarkb> similar to what we did with the fedora cleanup and windmill
19:17:23 <clarkb> oh good point
19:17:50 <fungi> tripleo was one of the last stragglers depending on legacy jobs, i think
19:17:53 <tonyb> fun. never a dull moment ;P
19:19:19 <clarkb> so ya keep an eye out for changes related to this and I'm sure I'll bug people when I've got a good set that need eyeballs
19:19:25 <clarkb> slow progress is being made otherwise
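As a rough aid to the pruning described above, a script along these lines could list Zuul jobs whose names suggest xenial or legacy (Zuul v2 converted) definitions via the public Zuul API. Matching on the job name is only a heuristic: a job can pin an ubuntu-xenial nodeset without advertising it in its name, so treat this as a starting point rather than an inventory. The openstack tenant is used as an example.

#!/usr/bin/env python3
"""Sketch: list Zuul job names that look like xenial or legacy cleanup candidates."""
import json
import urllib.request

# The openstack tenant is just an example; other tenants can be substituted.
ZUUL_JOBS_API = "https://zuul.opendev.org/api/tenant/openstack/jobs"

def candidate_jobs(api_url):
    """Return job names containing 'xenial' or carrying the 'legacy-' prefix."""
    with urllib.request.urlopen(api_url) as resp:
        jobs = json.load(resp)
    return sorted(j["name"] for j in jobs
                  if "xenial" in j["name"] or j["name"].startswith("legacy-"))

if __name__ == "__main__":
    for name in candidate_jobs(ZUUL_JOBS_API):
        print(name)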
19:19:49 <clarkb> #topic Gerrit 3.9 Upgrade Planning
19:20:23 <clarkb> Last week we landed a change to rebuild our gerrit images so that we would actually have an up to date 3.9 image after the last attempt failed (my fault as the jobs I needed to run were not triggered)
19:20:48 <clarkb> we restarted Gerrit to pick up the corresponding 3.8 image update just to be sure everything there was happy and also upgraded mariadb to 10.11 at the same time
19:21:12 <clarkb> I have since held some test nodes to go through the upgrade and downgrade process in order to capture notes for our planning etherpad
19:21:18 <fungi> i can't remember, did i see a change go by last week so that similar changes will trigger image rebuilds in the future?
19:21:25 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.9 Upgrade prep and process notes
19:21:48 <tonyb> fungi: I think you are thinking wishfully
19:21:52 <clarkb> fungi: no change that I am aware of. I'm on the fence over whether or not we want that
19:22:12 <clarkb> it's ambiguous to me if the 3.9 image should be promoted when we aren't explicitly updating the 3.9 image and only modifying the job. Maybe it should
19:22:37 <fungi> tonyb: i'm a repo-half-full kinda guy
19:23:03 <clarkb> other items of note here: testing seems to confirm the topic change limit should be a non issue for the upgrade and our known areas of concern (since we never have that many open changes)
19:23:16 <clarkb> the limit is also forgiving enough to allow you to restore abandoned changes that push it over the limit
19:23:39 <clarkb> creating new changes with that topic or applying the topic to existing changes that would push it over the limit does error though
19:23:40 <fungi> and only rejects attempts to push new ones or set the topic
19:23:50 <fungi> yeah, that. perfect
19:23:52 <clarkb> yup that appears to be the case through testing
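For reference, the kind of check discussed above can be repeated against the production server with the Gerrit REST API: count the open changes sharing a topic and compare that against the new per-topic limit. This is only a sketch; the topic name is just an example and the anonymous query limit may truncate very large topics.

#!/usr/bin/env python3
"""Sketch: count open changes sharing a topic on review.opendev.org."""
import json
import urllib.request

GERRIT_URL = "https://review.opendev.org"

def open_changes_with_topic(topic):
    """Return the number of open changes carrying the given topic."""
    url = f"{GERRIT_URL}/changes/?q=topic:%22{topic}%22+status:open&n=500"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    # Gerrit prefixes JSON responses with ")]}'" to defeat XSSI; drop that line.
    return len(json.loads(body.split("\n", 1)[1]))

if __name__ == "__main__":
    # Example topic from the xenial cleanup discussed earlier in the meeting.
    print(open_changes_with_topic("drop-ubuntu-xenial"))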
19:24:23 <clarkb> I haven't seen any feedback on my change to add a release war build target to drop web fonts
19:24:54 <clarkb> would probably be good for other infra-root to form an opinion on whether or not we want to patch the build stuff locally for our prod builds or just keep using docs with web fonts
19:25:18 <fungi> what was the concern there? were we leaking visitor details by linking to remotely-hosted webfont files?
19:25:35 <clarkb> fungi: we are but only if you open docs I think
19:26:06 <fungi> oh, so not for the general interface
19:26:18 <fungi> just the documentation part
19:26:28 <clarkb> and gerrit added the ability to disable that very recently when building docs. However they didn't expose that option when building release wars
19:26:33 <clarkb> just when building the documentation directly
19:26:40 <fungi> neat
19:26:57 <clarkb> ya just did a quick check. Gerrit proper serves the fonts it needs
19:27:01 <clarkb> the docs fetch them from google
19:27:26 <fungi> probably added the disable feature to aid in faster local gerrit doc development
19:28:00 <fungi> and didn't consider that people might not want remote webfont links in production
19:28:05 <clarkb> so anyway I pushed a change upstream to have a release war build target that depends on the documentation build without web fonts
19:28:14 <clarkb> we can apply that patch locally to our gerrit builds if we want
19:28:30 <clarkb> or we can stick with the status quo until the change merges upstream
19:29:33 <clarkb> feedback on the approach taken there is very much welcome
19:30:03 <clarkb> #link https://gerrit-review.googlesource.com/c/gerrit/+/424404 Upstream release war build without web fonts change
19:30:16 <clarkb> #topic openstack.org DNS record cleanup
19:30:24 <clarkb> #link https://paste.opendev.org/show/bVHJchKcKBnlgTNRHLVK/ Proposed DNS cleanup
19:30:49 <fungi> just a quick request for folks to look through that list and let me know if there are any entries which i shouldn't delete
19:30:50 <clarkb> fungi put together this list of openstack.org records that can be cleaned up. I skimmed the list and had a couple of records to call out, but otherwise it seems fine
19:31:06 <clarkb> api.openstack.org redirects to developer.openstack.org so something is backing that name and record
19:31:44 <clarkb> and then I don't think we managed whatever sat behind the rss.cdn record so unsure if that is in use or expected to function
19:31:55 <fungi> oh, yep, i'm not sure why api.openstack.org was in the list
19:32:12 <fungi> i'll say it was a test to see if anyone read closely ;)
19:32:40 <fungi> we did manage the rss.cdn.o.o service, back around 2013 or so
19:32:51 <fungi> it was used to provide an rss feed of changes for review
19:33:13 <clarkb> oh I remember that now.
19:33:21 <fungi> i set up the swift container for that, and anita did the frontend work i think
19:33:35 <fungi> openstackwatch, i think it was called?
19:33:40 <clarkb> that sounds right
19:34:03 <fungi> cronjob that turned gerrit queries into rss and uploaded the blob there
19:35:14 <clarkb> those were the only two I had questions about. Everything else looked safe to me
19:35:49 <clarkb> and now api.o.o is the only one I would remove from the list
19:36:21 <tonyb> Nothing stands out to me, but I can do some naive checking to confirm
19:36:47 <fungi> oddly, i can't tell what's supplying the api.openstack.org redirect
19:37:19 <fungi> something we don't manage inside an ip block allocated to Liquid Web, L.L.C
19:38:02 <fungi> i guess we should host that redirect on static.o.o instead
19:39:15 <fungi> i'll push up a change for that and make the dns adjustment once it deploys
19:40:15 <clarkb> sounds like a plan
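The "naive checking" mentioned above could be as simple as resolving each name on the list and seeing whether anything still answers, for example with a short script like this. The two names shown are only the ones called out in the discussion; the full candidate list lives in the paste linked above.

#!/usr/bin/env python3
"""Sketch: resolve DNS names proposed for cleanup and show what they point at."""
import socket

# Illustrative subset; the full list is in the linked paste.
CANDIDATES = [
    "api.openstack.org",      # still redirects to developer.openstack.org
    "rss.cdn.openstack.org",  # old openstackwatch feed, likely defunct
]

def resolve(name):
    """Return the addresses a name resolves to, or an empty list if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    for name in CANDIDATES:
        addrs = resolve(name)
        print(f"{name}: {', '.join(addrs) if addrs else 'does not resolve'}")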
19:40:22 <clarkb> #topic Open Discussion
19:40:26 <clarkb> Anything else?
19:40:35 <tonyb> Vexxhost and nodepool
19:41:02 <clarkb> I meant to mention this in the main channel earlier today but didn't: this is the classic reason for why we have redundancy
19:41:04 <tonyb> frickler reported API issues and they seem to be ongoing
19:41:07 <clarkb> I don't think this is an emergency
19:41:25 <clarkb> however, it would be good to understand in case it's a systemic problem that would affect say gitea servers or review
19:41:59 <frickler> seeing the node failures I'm also wondering whether disabling that region would help
19:42:44 <tonyb> No, not an emergency, just don't want it to linger and wondering when, or indeed if, we should act/escalate
19:43:53 <clarkb> yes i think the next step may be to set max-servers to 0 in those region(s) then try manually booting instances
19:43:56 <clarkb> and debug from there
19:44:32 <clarkb> then hopefully that gives us information that can be used to escalate effectively
19:45:18 <tonyb> Okay. That sounds reasonable.
19:48:27 <clarkb> sounds like that may be everything. Thank you everyone
19:48:37 <clarkb> We'll be back here same time and location next week
19:48:42 <corvus> thanks!
19:48:51 <clarkb> and feel free to bring things up in irc or on the mailing list in the interim
19:48:53 <clarkb> #endmeeting