19:00:28 #startmeeting infra
19:00:28 Meeting started Tue May 14 19:00:28 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:28 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:28 The meeting name has been set to 'infra'
19:01:07 #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/Q3ZNOUTLIF3HWCUX7CQI7ITFBYC3KCCF/ Our Agenda
19:01:11 #topic Announcements
19:01:23 I'm planning to take Friday off and probably Monday as well and do a long weekend
19:01:57 monday is a public holiday here
19:02:34 i am back from my travels and trying to catch up again
19:02:44 oh nice. The monday after is a holiday here but I'm likely going to work that monday and instead do the long weekend this weekend
19:03:34 #topic Upgrading Old Servers
19:03:47 not sure if tonyb is around but I think we can dive right in
19:03:53 the meetpad cluster has been upgraded to jammy \o/
19:04:00 Yup I'm here
19:04:06 the old servers are still around but in the emergency file and shutdown
19:04:17 so next is wiki and/or cacti
19:04:21 I think we can probably delete them in the near future as I've just managed to use meetpad successfully without issue
19:04:45 Great. I'll remove them today/tomorrow
19:05:15 WRT cacti there is an approved spec to implement prometheus
19:05:17 tonyb: sounds good. In addition to what is next I'm thinking we should try and collapse our various todo lists into a single one for server upgrades. Do you want to do that or should I take that on?
19:05:39 I an do it.
19:05:48 *can
19:06:08 tonyb: yes there is. The main bit of that that needs effort is likely the agent for reporting from servers to prometheus. I think running prometheus should be straightforward with a volume for metric data backing and containers for the service (that's still non-zero effort but is fairly straightforward)
19:06:48 So how long would cacti be around once we have prometheus?
19:07:22 tonyb: that is a good question. I suspect we can keep it around for some time in a locked down state for historical access, but don't need to worry about that being public which does reduce a number of concerns
19:07:27 I looked at cacti and it seems like it should be doable on jammy or noble which I think would avoid the need for extended OS support from Canonical
19:07:38 I think currently cacti will give us a year of graphing by default so that's probably a good goal to aim for if possible
19:07:58 I'm not sure if older data is accessible at all
19:08:10 Okay. I'll work on that.
19:08:26 yep 1 year no older data
19:09:05 got it. I have the start of a plan
19:09:59 thanks, feel free to reach out with any other questions or concerns
19:10:15 I was also thinking .... we have a number of "simple" services that just got put onto jammy. Is it a good/bad or indifferent idea to pull them onto noble
19:10:36 tonyb: I don't think it is a bad idea, but I also think they are less urgent
19:10:46 I worry that doing so will mean more pressure when getting away from noble, but it also simplifies the infra
19:10:53 Okay. Noted
19:10:55 historically we've been happy to skip the intermediate LTS and just go to the next one
19:11:11 though I think we've got a ton of stuff on focal so there may be some imbalance in the distribution
19:11:22 if there are ones that are easy or make sense I'll see about including them
19:11:46 we also don't have noble test nodes yet so testing noble system-config nodes isn't yet doable
19:11:56 which is related to the next couple of topics
19:12:09 lets jump into those
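A note on the Prometheus piece of the upgrade discussion above: below is a minimal Python sketch of one way to sanity-check that a server exposes something Prometheus could scrape before it is added as a target. It assumes, purely for illustration, a node_exporter-style agent on its usual default port 9100; the actual reporting agent is exactly the open item mentioned above, and the host names are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Rough check that candidate servers expose a Prometheus-style /metrics
endpoint before pointing a scrape config at them.

Illustrative sketch only: the agent choice is still open, so the port and
host names here are assumptions, not the real inventory."""
import sys
import urllib.request

HOSTS = ["mirror01.example.opendev.org", "cacti02.example.opendev.org"]  # placeholders
PORT = 9100  # assumed node_exporter default; adjust for whatever agent is chosen


def has_metrics(host: str, port: int = PORT, timeout: float = 5.0) -> bool:
    """Return True if the host answers with non-empty Prometheus text metrics."""
    url = f"http://{host}:{port}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(1024).decode("utf-8", errors="replace")
            return resp.status == 200 and body.strip() != ""
    except OSError as exc:
        print(f"{host}: no metrics endpoint ({exc})", file=sys.stderr)
        return False


if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {'ok' if has_metrics(host) else 'MISSING'}")
```

Running Prometheus itself would then, as described above, mostly be a matter of a container for the service with a persistent volume mounted at its metric data directory.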
19:12:12 #topic Adding Noble Test Nodes
19:12:39 I think we're still at the step of needing to mirror noble. I may have time to poke at that later this week depending on how some of this gerrit upgrade prep goes
19:13:17 Once we've managed to write all that data into afs we can then add noble nodes to nodepool. I don't think we need a dib release for that because only glean and our own elements changed for noble
19:13:34 Sounds good.
19:13:36 but if my understanding of that is wrong please let me know and we can make a new dib release too
19:14:20 #topic AFS Volume Cleanup
19:14:47 after our meeting last week all the mergeable topic:drop-ubuntu-xenial changes merged
19:15:12 thank you for the reviews. At this point we are no longer building wheels for xenial and a number of unused jobs have been cleared out. That said there is still much to do
19:15:19 next on my list for this item is retiring devstack-gate
19:15:39 it is/was a big xenial consumer and openstack doesn't use it anymore so I'm going to retire it and pull it out of zuul entirely
19:16:12 then when that is done the next step is pruning the rest of the xenial usage for python jobs and nodejs and so on. I suspect this is where momentum will slow down and/or we'll just pull the plug on some stuff
19:16:27 hopefully the blast radius for that is vanishingly small, probably non-openstack projects which haven't been touched in years
19:16:37 for devstack-gate i mean
19:16:41 ya a lot of x/* stuff have xenial jobs too
19:16:54 I expect many of those will end up removed from zuul (without full retirement)
19:17:01 also means we should be able to drop a lot of the "legacy" jobs converted from zuul v2
19:17:02 similar to what we did with fedora cleanup and windmill
19:17:23 oh good point
19:17:50 tripleo was one of the last stragglers depending on legacy jobs, i think
19:17:53 fun. never a dull moment ;P
19:19:19 so ya keep an eye out for changes related to this and I'm sure I'll bug people when I've got a good set that need eyeballs
19:19:25 slow progress is being made otherwise
19:19:49 #topic Gerrit 3.9 Upgrade Planning
19:20:23 Last week we landed a change to rebuild our gerrit images so that we would actually have an up to date 3.9 image after the last attempt failed (my fault as the jobs I needed to run were not triggered)
19:20:48 we restarted Gerrit to pick up the corresponding 3.8 image update just to be sure everything there was happy and also upgraded mariadb to 10.11 at the same time
19:21:12 I have since held some test nodes to go through the upgrade and downgrade process in order to capture notes for our planning either
19:21:17 s/either/etherpad/
19:21:18 i can't remember, did i see a change go by last week so that similar changes will trigger image rebuilds in the future?
19:21:25 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.9 Upgrade prep and process notes
19:21:48 fungi: I think you are thinking wishfully
19:21:52 fungi: no change that I am aware of. I'm on the fence over whether or not we want that
19:22:12 it's ambiguous to me if the 3.9 image should be promoted when we aren't explicitly updating the 3.9 image and only modifying the job. Maybe it should
19:22:37 tonyb: i'm a repo-half-full kinda guy
19:23:03 other items of note here: testing seems to confirm the topic change limit should be a non issue for the upgrade and our known areas of concern (since we never have that many open changes)
19:23:16 the limit is also forgiving enough to allow you to restore abandoned changes that push it over the limit
19:23:39 creating new changes with that topic or applying the topic to existing changes that would push it over the limit does error though
19:23:40 and only rejects attempts to push new ones or set the topic
19:23:50 yeah, that. perfect
19:23:52 yup that appears to be the case through testing
19:24:23 I haven't seen any feedback on my change to add a release war build target to drop web fonts
19:24:54 would probably be good for other infra-root to form an opinion on whether or not we want to patch the build stuff locally for our prod builds or just keep using docs with web fonts
19:25:18 what was the concern there? were we leaking visitor details by linking to remotely-hosted webfont files?
19:25:35 fungi: we are but only if you open docs I think
19:26:06 oh, so not for the general interface
19:26:18 just the documentation part
19:26:28 and gerrit added the ability to disable that very recently when building docs. However they didn't expose that option when building release wars
19:26:33 just when building the documentation directly
19:26:40 neat
19:26:57 ya just did a quick check. Gerrit proper serves the fonts it needs
19:27:01 the docs fetch them from google
19:27:26 probably added the disable feature to aid in faster local gerrit doc development
19:28:00 and didn't consider that people might not want remote webfont links in production
19:28:05 so anyway I pushed a change upstream to have a release war build target that depends on the documentation build without web fonts
19:28:14 we can apply that patch locally to our gerrit builds if we want
19:28:30 or we can stick with the status quo until the change merges upstream
19:29:33 feedback on approach taken there very much welcome
19:30:03 #link https://gerrit-review.googlesource.com/c/gerrit/+/424404 Upstream release war build without web fonts change
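For the webfont question in the Gerrit discussion above, here is a rough Python sketch of the kind of "quick check" described: fetch a Gerrit server's landing page and its bundled documentation and look for references to Google-hosted webfonts. It only inspects the returned HTML (not any linked stylesheets), and the URLs assume the standard Gerrit layout with documentation served under /Documentation/, so treat them as illustrative.

```python
#!/usr/bin/env python3
"""Look for remote webfont references in a Gerrit server's pages.

Sketch only: a rough check of the served HTML, assuming the standard
Gerrit documentation path; linked CSS is not inspected."""
import urllib.error
import urllib.request

PAGES = {
    "web UI": "https://review.opendev.org/",
    "documentation": "https://review.opendev.org/Documentation/index.html",
}
REMOTE_FONT_HOSTS = ("fonts.googleapis.com", "fonts.gstatic.com")

for label, url in PAGES.items():
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as exc:
        print(f"{label}: could not fetch {url} ({exc})")
        continue
    hits = [host for host in REMOTE_FONT_HOSTS if host in body]
    if hits:
        print(f"{label}: references remote fonts: {', '.join(hits)}")
    else:
        print(f"{label}: no remote webfont references found in the HTML")
```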
19:30:16 #topic openstack.org DNS record cleanup
19:30:24 #link https://paste.opendev.org/show/bVHJchKcKBnlgTNRHLVK/ Proposed DNS cleanup
19:30:49 just a quick request for folks to look through that list and let me know if there are any entries which i shouldn't delete
19:30:50 fungi put together this list of openstack.org records that can be cleaned up. I skimmed the list and had a couple of records to call out, but otherwise seems fine
19:31:06 api.openstack.org redirects to developer.openstack.org so something is backing that name and record
19:31:44 and then I don't think we managed whatever sat behind the rss.cdn record so unsure if that is in use or expected to function
19:31:55 oh, yep, i'm not sure why api.openstack.org was in the list
19:32:12 i'll say it was a test to see if anyone read closely ;)
19:32:40 we did manage the rss.cdn.o.o service, back around 2013 or so
19:32:51 it was used to provide an rss feed of changes for review
19:33:13 oh I remember that now.
19:33:21 i set up the swift container for that, and anita did the frontend work i think
19:33:35 openstackwatch, i think it was called?
19:33:40 that sounds right
19:34:03 cronjob that turned gerrit queries into rss and uploaded the blob there
19:35:14 those were the only two I had questions about. Everything else looked safe to me
19:35:49 and now api.o.o is the only one I would remove from the list
19:36:21 Nothing stands out to me, but I can do some naive checking to confirm
19:36:47 oddly, i can't tell what's supplying the api.openstack.org redirect
19:37:19 something we don't manage inside an ip block allocated to Liquid Web, L.L.C
19:38:02 i guess we should host that redirect on static.o.o instead
19:39:15 i'll push up a change for that and make the dns adjustment once it deploys
19:40:15 sounds like a plan
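A sketch of the "naive checking" mentioned above for the proposed openstack.org record cleanup: resolve each candidate name and, for anything that still resolves, see whether an HTTP(S) service answers and where it redirects. The NAMES list below is a placeholder containing only the two records called out in the discussion, not the full list from the linked paste.

```python
#!/usr/bin/env python3
"""Naive sanity check of DNS names proposed for deletion.

Sketch only: NAMES is a placeholder, not the real cleanup list."""
import socket
import urllib.error
import urllib.request

NAMES = ["api.openstack.org", "rss.cdn.openstack.org"]  # placeholder entries

for name in NAMES:
    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(name, None)})
    except socket.gaierror:
        print(f"{name}: does not resolve, looks safe to drop")
        continue
    print(f"{name}: resolves to {', '.join(addrs)}")
    for scheme in ("https", "http"):
        req = urllib.request.Request(f"{scheme}://{name}/", method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                # resp.url reflects any redirect that was followed
                print(f"  {scheme}: HTTP {resp.status}, final URL {resp.url}")
                break  # stop after the first scheme that answers
        except (urllib.error.URLError, OSError) as exc:
            print(f"  {scheme}: no answer ({exc})")
```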
19:40:22 #topic Open Discussion
19:40:26 Anything else?
19:40:35 Vexxhost and nodepool
19:41:02 I meant to mention this in the main channel earlier today then didn't, but this is the classic reason for why we have redundancy
19:41:04 frickler: reported API issues and they seem to be ongoing
19:41:07 I don't think this is an emergency
19:41:25 however, it would be good to understand in case it's a systemic problem that would affect say gitea servers or review
19:41:59 seeing the node failures I'm also wondering whether disabling that region would help
19:42:44 No not an emergency just don't want it to linger and wondering when, or indeed, if we should act/escalate
19:43:53 yes i think the next step may be to set max-servers to 0 in those region(s) then try manually booting instances
19:43:56 and debug from there
19:44:32 then hopefully that gives us information that can be used to escalate effectively
19:45:18 Okay. That sounds reasonable.
19:48:27 sounds like that may be everything. Thank you everyone
19:48:37 We'll be back here same time and location next week
19:48:42 thanks!
19:48:51 and feel free to bring things up in irc or on the mailing list in the interim
19:48:53 #endmeeting
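As a postscript to the Open Discussion item, a sketch of the suggested manual boot test against the affected region using openstacksdk. The cloud, region, image, and flavor names are illustrative placeholders and would need to match the real clouds.yaml entry and the provider's offerings; the companion step of setting max-servers to 0 is a one-line edit to the affected provider pool in the nodepool configuration.

```python
#!/usr/bin/env python3
"""Manually boot (and clean up) a test instance in the affected region to
capture error details before escalating to the provider.

Sketch only: cloud, region, image, and flavor names are placeholders."""
import openstack

# Assumes a matching entry in clouds.yaml; names here are illustrative.
conn = openstack.connect(cloud="vexxhost", region_name="ca-ymq-1")

try:
    server = conn.create_server(
        name="infra-api-debug-test",
        image="Ubuntu 22.04",       # placeholder image name
        flavor="v3-standard-2",     # placeholder flavor name
        auto_ip=False,
        wait=True,
        timeout=600,
    )
    print(f"boot succeeded: {server.id} is {server.status}")
    conn.delete_server(server.id, wait=True)  # clean up the test node
except openstack.exceptions.SDKException as exc:
    # The error text here is the detail worth including when escalating.
    print(f"boot failed: {exc}")
```

Whatever error message a failed boot returns is the information that should make it possible to escalate effectively, as noted above.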