19:01:04 #startmeeting infra
19:01:04 Meeting started Tue Nov 29 19:01:04 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:04 The meeting name has been set to 'infra'
19:01:06 #link https://lists.opendev.org/pipermail/service-discuss/2022-November/000383.html Our Agenda
19:01:44 #topic Announcements
19:01:45 o/
19:02:22 There is a board meeting next week December 6 at 2100 UTC and weblate for openstack translations will be discussed if anyone is interested in attending
19:03:36 The summit cfp is also open. I noticed that SCaLE's cfp closes friday as well
19:03:43 and FOSDEM is doing things
19:04:24 anything else to announce?
19:05:20 #topic Bastion Host Updates
19:06:30 with the holiday weekend I've sort of lost where we are at on this
19:06:45 I think there were changes to manage rax dns with a small script
19:07:06 and openstacksdk updated to fix the networking issue when booting rax nodes
19:07:20 yep there's a stack for updating the node launcher @
19:07:32 #link https://review.opendev.org/q/topic:rax-rdns
19:07:51 #link https://review.opendev.org/865320 Improve launch-node deps and fix script bugs
19:07:53 also that
19:08:11 that has a tool for updating RAX rdns automatically when launching nodes, and also updates our dns outputs etc.
19:08:44 there's also
19:08:46 #link https://review.opendev.org/q/topic:bridge-osc
19:09:01 which updates the "openstack" command on the bastion host that doesn't currently work
19:09:22 and then
19:09:26 #link https://review.opendev.org/q/topic:bridge-ansible-update
19:09:42 though in theory we could just symlink from /usr/local/(s)bin to the osc in the launch env and not manage two separate copies
19:09:43 is a separate stack that updates ansible on the bridge
19:10:08 fungi: that's what https://review.opendev.org/c/opendev/system-config/+/865606 does :)
19:10:15 oh, perfect ;)
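[Editor's note: a minimal sketch of the symlink idea above, assuming the launch env is a virtualenv at the placeholder path /opt/launch-env; the real location on the bridge host may differ, and https://review.opendev.org/c/opendev/system-config/+/865606 is the change that actually addresses this.]

    # Point the system-wide "openstack" command at the client installed in the
    # launch-node virtualenv instead of maintaining a second copy of osc.
    # /opt/launch-env is a placeholder; substitute the real venv location.
    sudo ln -sf /opt/launch-env/bin/openstack /usr/local/bin/openstack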
19:11:36 cool. I'll do my best to pull these up after my appointment and review as many as I can
19:11:53 anything else with the bastion?
19:12:27 #link https://review.opendev.org/c/opendev/system-config/+/865784
19:12:42 is a tool for backing up bits of the host
19:13:25 that's about it. i think it's pretty close to being as much of a "normal" host as i think it can be
19:13:42 there's still
19:13:45 #link https://review.opendev.org/q/topic:prod-bastion-group
19:13:58 and that last link will do the parallel runs right?
19:14:05 to update jobs so they can run in parallel
19:14:08 yep
19:14:11 I'm thinking we should stabilize as much as we can prior to that as that alone demands a fair bit of attention
19:14:25 agree on that, not high priority
19:14:53 #topic Upgrading Old Servers
19:14:58 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes
19:15:08 I don't have anything new to add to this :( I keep finding distractions
19:15:29 #topic Mailman 3
19:15:35 #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes
19:15:38 #link https://review.opendev.org/c/opendev/system-config/+/865361
19:15:48 I believe fungi is just about ready to deploy the new server
19:15:57 then it's a matter of announcing and performing the swap over
19:16:01 well, the server exists, but the inventory change is in the gate now
19:16:09 dns is already there
19:16:10 it just merged :)
19:16:55 perfect
19:17:13 and i've added reverse dns and am making sure it's clear of any blocklists
19:17:44 so assuming the deploy from the inventory addition checks out, we should be able to announce a maintenance window now
19:18:14 is there anything you need from us at this point?
19:18:19 i'll float a draft etherpad in #opendev later today or early tomorrow
19:18:40 sounds good
19:18:46 when would folks want to do the maintenance? monday december 5?
19:18:55 that day should work for me
19:19:10 ++
19:20:03 as for things to cover in the announcement, people wanting to manage list moderation queues and configs or adjust their subscriptions will need to create accounts (but if they use the same address from the configs we've imported then those roles will be linked as soon as they follow the link from the resulting confirmation e-mail)
19:20:36 also we're unable to migrate any held messages from moderation queues, so list moderators will want to do a quick pass over theirs shortly before the maintenance window
19:21:29 other than that, and the brief outage and what the new ip addresses are, is there anything else i should cover in the announcement?
19:21:33 can we first stop incoming mails to avoid race conditions?
19:21:40 i can do that for zuul on sunday/monday
19:21:50 frickler: yes the etherpad link above covers the whole process
19:22:07 and I'm pretty sure stopping incoming mail is part of it?
19:22:27 but before the final moderation run?
19:22:27 that should force it to queue up and then get through to the new server when dns is updated
19:22:36 frickler: oh, specifically for moderation. For that I'm not sure
19:22:56 maybe not so relevant for these low volume lists, but possibly for openstack-discuss
19:23:02 the plan is to stop incoming mails by switching the hostnames for the sites to nonexistent addresses temporarily
19:23:23 because we can't easily turn off some sites and not others at the mta layer
19:24:05 but also to keep that as brief as possible, just long enough for dns to propagate and the imports to run
19:24:36 and then we'll switch to the proper addresses in dns so messages deferred on the senders' mtas will end up at the correct server
19:25:33 for openstack-discuss, i'll personally be moderating it anyway and can literally check the moderation queue via an /etc/hosts override after dns is swizzled before the import
19:25:34 ok, guess I should've read the etherpad first ;)
19:26:01 but that won't be migrated in this upcoming window, it'll be sometime early next year
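[Editor's note: a rough sketch of the cutover checks described above, using lists.opendev.org as an example site and 203.0.113.10 as a placeholder for the new server's address; the authoritative steps and real addresses are in the migration etherpad.]

    # Temporarily resolve the list site to the new server on one machine so the
    # web UI and moderation queues can be inspected before public DNS changes.
    echo "203.0.113.10 lists.opendev.org" | sudo tee -a /etc/hosts

    # Meanwhile, confirm what everyone else still resolves while inbound mail
    # queues up on the senders' MTAs waiting for the real record to appear.
    dig +short A lists.opendev.org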
19:27:39 #topic Quo Vadis Storyboard
19:27:51 I sent the followup to the thread that I promised (a little late sorry)
19:28:10 tried to summarize the position the opendev team is in and some ideas for improving the status quo
19:28:25 I'd love to hear feedback (ideally on the thread so that others can easily follow along)
19:28:54 frickler did mention that if we're going to invest further in the gitea deployment (and maybe if we don't anyway) we'll need to sort out the gitea fork situation
19:29:10 I followed that whole thing as it was happening a few weeks ago and didn't feel like we needed to take any drastic action at the time
19:29:27 But i agree it is worth being informed and up to date on that situation to ensure we're putting effort in the correct location
19:29:37 i guess people are still pushing to fork and it hasn't settled out yet?
19:29:53 yes I think a fork is in progress, but I'm not sure it is deployable yet
19:30:02 does it have a name?
19:30:11 yes, my main concern when I saw this was we might build something for gitea and then it turns out we would want/need the fork instead
19:30:16 forgejo
19:30:30 https://codeberg.org/forgejo/forgejo
19:30:50 not sure though whether that is _the_ fork or just one of multiple ones
19:30:58 they had to one-up gitea on unpronounceable names
19:31:18 it's Esperanto for "forge" they say
19:32:04 i saw
19:32:21 I don't think we need drastic action today either fwiw
19:32:37 but it would be good to evaluate both if we dig deeper into using gitea's extra functionality
19:33:32 some of the folks involved in that have a history of getting very excited about things and then dropping them months later, so i'm not holding my breath yet anyway
19:34:19 #topic Vexxhost instance rescuing
19:34:45 This is mostly on the agenda at this point to try and remind me to find time when jrosser's day overlaps with mine to gather info on the bfv setup they use for rescuing
19:35:03 then I can try and set that up in vexxhost and test it. If I can make that work then we can develop a rescue image with all the right settings maybe
19:35:11 but no new updates on this since the last time we met
19:35:14 (holidays will do that)
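[Editor's note: a sketch of what the boot-from-volume (bfv) rescue setup might look like, assuming nova's stable device rescue support (compute API microversion 2.87) and an uploaded rescue image named "rescue-image"; whether this matches vexxhost's actual setup is exactly what the conversation with jrosser is meant to confirm.]

    # Tag the image so nova will allow it to rescue servers, including
    # boot-from-volume servers, via the stable device rescue mechanism.
    openstack image set \
        --property hw_rescue_device=disk \
        --property hw_rescue_bus=virtio \
        rescue-image

    # Rescue a server with that image (needs a new enough compute API).
    openstack --os-compute-api-version 2.87 server rescue \
        --image rescue-image <server-uuid>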
19:35:21 #topic Gerrit 3.6 Upgrade
19:35:25 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.6
19:35:58 ianw: I've skimmed this doc a couple times at this point. The main new bit of feedback I've got (which is on the etherpad too) is that the latest 3.5 release which we upgraded to recently was made partially to fix a bunch of issues with copy approvals
19:36:25 I think it might be worth catching up on the state of copy approvals upstream (just to be sure there aren't any more bug fixes outstanding) then give it a go on our installation?
19:36:39 that way if our install finds new issues we have time to work upstream to address them
19:36:56 yeah i agree -- aiui that can be run online just fine right?
19:37:11 it looks likely to be a multiple-hour thing
19:37:18 ianw: yes, it is supposed to be able to run online in the background. Digging through logs for that might be the hard part to double check it was happy
19:37:56 the other piece which I don't think we've done is evaluate any other potentially breaking or user visible updates
19:38:06 just to try and identify any major things people might complain about
19:38:30 yeah, i need to spin up the node
19:38:43 then we can poke
19:38:57 3.7 is looking to be far more problematic too so I'm glad we aren't trying to make that jump yet
19:39:13 it has currently broken recheck comments for example
19:39:26 if you like, i can try running the copy-approvals when it slows down in a few hours, and monitor it
19:39:49 ianw: I think the first thing is to look at the changelog and open changes for 3.5 to see if we are missing any copy approvals updates
19:39:57 update our image if necessary but then ya running it
19:40:04 clarkb: upstream bug suggests there should be an easy fix for that in zuul; i'll be investigating it soon
19:40:31 ++ i'll check that out this morning and we can sync on it
19:40:35 corvus: it is a bit weird to me that we want zuul to address it? it's the comment-added event which has for a decade now included the content of the comment
19:40:50 corvus: I mean, it's great if we can work around it but I feel like that breaks a pretty base level expectation
19:41:28 ianw: sounds good. Anything else gerrit 3.6 upgrade related?
19:41:42 assuming the upstream bug is accurate, they've had this backwards compat in place for years at our request, so i'm not ready to ding them on that.
19:42:26 clarkb: nope, let's get that groundwork done then we can decide on update schedules
19:42:32 sounds good
19:42:37 anyway, give me a chance to actually look into it so i can speak from knowledge :)
19:42:41 corvus: ++
19:43:03 #topic Acme.sh failures
19:43:24 acme.sh switched to ecc certs/keys by default (from rsa) recently and broke our ability to renew things
19:43:47 ianw, fungi, and I all poked at it a bit and I think ianw was able to track it down to that specific change and wrote an upstream issue about it
19:43:51 #link https://github.com/acmesh-official/acme.sh/issues/4416
19:44:04 Basically file paths changed and now acme.sh can't find data it is looking for later
19:44:21 to address this we've pinned to the previous release: 3.0.5 and the change to do that landed recently
19:44:40 we should expect the certs that our cert checker complains about to refresh overnight today, and we can address that (maybe sooner)
19:45:03 One thing I wanted to ask is if we should try and explicitly set settings like that to avoid underlying changes impacting us
19:45:10 we can set the key type and length and so on explicitly
19:45:18 and that might allow us to avoid pinning?
19:45:56 The downside to this is 5 years from now when rsa 2048 is no longer safe <_<
19:46:07 we could -- this feels like a bit of a corner case because the ecc path seems to be a bit of a separate, opt-in thing, and upstream sort of half-switched it
19:47:09 they can't really make the tool suddenly start updating certs in a different place on disk (a directory with _ecc appended) because that would seem to break everything
19:47:22 maybe we wait for feedback on the issue before deciding on how to move forward post pin
19:47:36 I just wanted to call out the option of being explicit as an alternative to pinning
19:47:53 yeah; i think it's good we've pointed it out -- we've run from dev for several years and this is the first time we've had an issue
19:48:40 so i'll keep an eye, and if we can get back to a point of running from dev I think that's desirable to continue being a canary
19:48:46 we can always pin to the last known working thing easily
19:48:56 sounds good and thank you for digging into that yesterday
19:49:04 I had a note on your debugging change too, not sure if you saw
19:49:16 the one that updates driver.sh
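[Editor's note: a sketch of the "be explicit" alternative discussed above, using raw acme.sh invocations with the upstream-documented --keylength option; the domain, webroot, and checkout path are placeholders, and this is not the actual configuration of opendev's letsencrypt role / driver.sh.]

    # Ask for an RSA 2048 key explicitly instead of relying on the tool's
    # default, so a change of default (e.g. to ECC) can't silently move the
    # cert output paths out from under the deployment.
    ./acme.sh --issue --keylength 2048 -d acme.example.org -w /var/www/html

    # Or opt in to ECC deliberately; acme.sh then keeps the cert in the
    # corresponding _ecc directory.
    ./acme.sh --issue --keylength ec-256 -d acme.example.org -w /var/www/html

    # For comparison, the pin taken for now amounts to checking out the last
    # known-good release tag (the ansible role may apply it differently).
    git -C /opt/acme.sh checkout 3.0.5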
19:49:39 #topic Open Discussion
19:49:41 ok thanks, will go over those. just a couple of things that would have made it easier if it happens again
19:49:46 ++
19:50:13 corvus: ianw's fix to openstacksdk for launch node may be what we need for latest openstacksdk to work with nodepool too so I'll try to reprioritize testing that with your test tool
19:50:43 also linaro's new arm cloud appears to be near ready for use. This is being spun up to get us off old hardware that equinix wants to shut down
19:51:02 I expect we'll need to move the builder and the mirror node manually (but maybe linaro has some magic to move those vms? I doubt it though)
19:51:19 yeah pretty sure that will all be a rebuild
19:51:21 And a reminder that we'll turn off the iweb cloud at the end of the year.
19:51:26 that's ok, good test of the launch node changes :)
19:51:30 ++
19:52:38 last call for anything else
19:54:29 i've checked the new mm3 server's ip addresses against spamhaus and senderbase, and they're all clean
19:55:00 excellent
19:55:04 Thank you all for your time (in the meeting and working on OpenDev)! We'll be back next week. I should look at a calendar soon to see how December holidays and the new year impact our schedule.
19:55:08 announcement is being drafted in the migration plan etherpad and is nearly complete, so i'll give folks a heads up in #opendev once it's ready to proof
19:55:15 thanks clarkb!
19:55:27 #endmeeting