19:01:41 #startmeeting infra
19:01:41 Meeting started Tue Nov 1 19:01:41 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:41 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:41 The meeting name has been set to 'infra'
19:01:57 #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000376.html Our Agenda
19:02:02 #topic Announcements
19:02:09 There were no announcements so we can dive right in
19:02:39 #topic Topics
19:02:44 #topic Bastion Host Updates
19:02:54 #link https://review.opendev.org/q/topic:prod-bastion-group
19:03:01 #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:03:11 those are a couple of groups of changes to keep moving this along
19:03:40 * frickler should finally review some of those
19:03:55 frickler also discovered that the secrets management key is missing on the new host. That is something that should be migrated over and tested before we remove the old one
19:04:19 but I think we're really close to being able to finish this up. ianw, if you are around, anything else to add?
19:04:33 we should also agree when to move editing those from one host to the other
19:04:50 o/
19:04:58 ++ at this point I would probably say anything that can't be done on the new host is a bug and we should fix that as quickly as possible and use the new host
19:05:14 yes please move over anything from your home directories, etc. that you want
19:05:50 i've added a note on the secret key to
19:05:51 #link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-10
19:06:04 thanks for that -- i will be writing something up on that
19:06:36 I also need to review the virtualenv management change since that will ensure we have a working openstackclient for rax and others
19:07:39 yeah a couple of changes are out there just to clean up some final things
19:07:53 also the zuul reboot playbook ran successfully off the new bridge
19:08:04 ianw: are you o.k. with rebooting bridge01 after the openssl updates or is there some blocker for that?
19:08:27 (I ran the apt update earlier already)
19:08:29 one thing to consider when doing ^ is whether we have any infra-prod jobs that we don't want to conflict with
19:08:36 but I'm not aware of any urgent jobs at the moment
19:08:44 the gist is I think that we can get testing to use "bridge99.opendev.org" -- which is a nice proof that we're not hard-coding in references
19:09:25 i think it's fine to reboot -- sorry i've been out the last two days and not caught up, but i can babysit it soonish
19:09:43 sounds good, we can coordinate further after the meeting.
19:09:45 Anything else on this topic?
19:10:11 nope, thanks for the reviews and getting it this far!
19:10:33 #topic Upgrading Bionic Servers
19:10:51 at this point I think we've largely sorted out the jammy-related issues and we should be good to boot just about anything on jammy
19:11:00 #link https://review.opendev.org/c/opendev/system-config/+/862835/ Disable phased package updates
19:11:22 that is one remaining item though. Basically it says don't do phased updates, which will ensure that our jammy servers all get the same packages at the same time
19:11:43 rather than staggering them over time. I'm concerned the staggering will just lead to confusion about whether or not a package is related to unexpected behaviors
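A minimal sketch of what opting out of phased updates can look like on a jammy host, assuming apt's Always-Include-Phased-Updates option is the mechanism; the linked change (862835) may well implement this differently, e.g. via ansible:

    # Sketch only: opt this host out of phased updates so every server
    # installs the same package versions at the same time.
    cat <<'EOF' | sudo tee /etc/apt/apt.conf.d/99-phased-updates
    // Treat packages that are still being phased as immediately available.
    APT::Get::Always-Include-Phased-Updates "true";
    EOF

    # Confirm apt picked up the setting.
    apt-config dump | grep -i phased

With something like that in place, apt-get upgrade no longer defers packages that are mid-phase, which matches the goal of keeping the jammy servers in lockstep.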
19:12:27 https://review.opendev.org/c/opendev/zone-opendev.org/+/862941 and its depends-on are related to gitea-lb02 being brought up as a jammy node too (this is cleanup of old nodes)
19:12:49 Otherwise nothing extra to say. Just that we can (and probably should) do this for new servers, and replacing old servers with jammy is extra good too
19:13:03 I'm hoping I'll have time later this week to replace another server (maybe one of the gitea backends)
19:13:13 #topic Removing snapd
19:13:23 #link https://review.opendev.org/c/opendev/system-config/+/862834 Change to remove snapd from our servers
19:13:46 after we discussed this in our last meeting I poked around on snapcraft and in ubuntu package repositories, and I think there isn't much reason for us to have snapd installed on our servers
19:14:16 This change can and will affect a number of servers though, so it is worth double checking. I haven't done an audit to see which would be affected but we could do that if we think it is necessary
19:15:03 to do my filtering I looked for snaps maintained by canonical on snapcraft to see which ones were likely to be useful for us. And many of them continue to have actual packages or aren't useful on servers
19:15:10 Reviews very much welcome
19:15:45 #topic Mailman 3
19:16:01 Since our last meeting the upstream for the mailman 3 docker images did land my change to add lynx to the images
19:16:09 No responses on the other issues I filed though.
19:16:42 Unfortunately, I think this makes the question of whether or not we should fork more confusing, not easier. I'm leaning more towards forking at this point simply because I'm not sure how responsive upstream will be. But feedback there continues to be welcome
19:18:03 When fungi is back we should make a decision and move forward
19:18:14 #topic Updating base python docker images to use pip wheel
19:18:34 Upstream seems to be moving slowly on my bugfix PR. Some of that slowness is that changes to git happened at the same time and impacted their CI setup
19:18:45 Either way I think we should strongly consider their suggestion of using pip wheel though
19:18:55 #link https://review.opendev.org/c/opendev/system-config/+/862152
19:19:13 There is a nodepool and a dib change too which help illustrate that this change functions and doesn't regress features like sibling installs
19:19:36 It should be a noop for us, but makes us more resilient to pip changes if/when they happen in the future. Reviews very much welcome on this as well
19:20:43 #topic Etherpad service logging
19:20:56 ianw: did you have time to write the change to update etherpad logging to syslog yet?
19:21:40 oh no, sorry, totally distracted on that
19:21:42 will do
19:22:13 thanks
19:22:32 unrelated to the logging issue, I had to reboot etherpad after its db volume got remounted RO due to errors
19:22:46 after the reboot it mounted the volume just fine as far as I could tell and things have been happy since yesterday
19:22:55 (just a heads up, I don't think any action is necessary there)
19:23:02 #topic Unexpected Gerrit Reboot
19:23:14 This happened around 06:00 UTC-ish today
19:23:33 basically it looks like review02.o.o rebooted and when it came back it had no networking until ~13:00 UTC
19:23:48 we suspect something on the cloud side, which would explain the lack of networking for some time as well. But we haven't heard back on that yet
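A sketch of the kind of post-reboot checks that back up the timeline above; these are generic commands, not necessarily exactly what was run on review02:

    # When did the host reboot, and was it a clean shutdown?
    last -x reboot shutdown | head

    # What did the previous boot log before it went away?
    journalctl -b -1 -e

    # Does the current boot have working networking (including v6 routes)?
    ip -br addr
    ip -6 route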
19:24:10 do we have some support contact at vexxhost other than mnaser__ ?
19:24:28 frickler: mnaser__ has been the primary contact. There have been others in the past but I don't think they are at vexxhost anymore
19:25:24 If we think it is important I can ask if anyone at the foundation has contacts we might try
19:25:38 there's also the (likely unrelated) IPv6 routing issue, which I think is more important
19:25:40 but at this point things seem stable and we're mostly just interested in confirmation of our assumptions? Might be ok to wait a day
19:25:41 one thing i noticed was that corvus i think had to start the container?
19:26:05 ianw: yes, our docker-compose file doesn't specify a restart policy, which mimics the old pre-docker behavior of not starting automatically
19:26:21 frickler: re ipv6, that's a good point.
19:27:14 regarding the manual start, we assumed that that was intentional and agreeable behavior
19:27:30 i did perform some sanity checks to make sure the server looked okay before starting
19:27:44 (which is one of the benefits of that)
19:27:49 something to think about -- but also this is the first case i can think of since we migrated the review host to vexxhost that there's been what seems to be instability beyond our control
19:28:14 so yeah, no need to make urgent changes to handle unscheduled reboots at this point :)
19:28:17 considering that we seemed to lack network access anyway, I'm not sure it's super important to auto restart based on this event
19:28:20 we would've waited either way
19:28:44 the other thing worth considering is whether we want to have some local account to allow debugging via vnc
19:28:57 but i think honestly the main reason we didn't have it start on boot is so that if we stopped a service manually it didn't restart automatically. that can be achieved with a "restart: unless-stopped" policy. so really, there are two reasons not to start on boot, and we can evaluate whether we still like one, the other, both, or neither of them.
19:29:16 since my first assumption was a missing network device caused by a kernel update
19:29:33 frickler: the way we would normally handle that today is via a rescue instance
19:30:05 when you rescue an instance with nova it shuts down the instance, then boots another image and attaches the broken instance's disk to it as a device, which allows you to mount the partitions
19:30:30 it's a little bit of work, but the cases where we've had to resort to it are few, and it's probably worth keeping our images as simple as possible without user passwords?
19:30:36 except for boot-from-volume instances, which seem to be a bit more tricky?
19:30:41 have we ever done that with vexxhost?
19:30:45 frickler: oh, is bfv different?
19:31:10 at least it needs a recent compute api (>=ussuri iirc)
19:31:12 ianw: I'm not sure about doing it specifically in vexxhost. Testing it is a good idea, I suppose, before I/we declare it good enough
19:31:41 my concern with passwords on instances is that we don't have central auth, so rotating/changing/managing them is more difficult
19:31:45 i love not having local passwords. i hope it is good enough.
19:31:59 ya I'd much rather avoid it if possible
19:32:12 I was also wondering why we chose boot from volume, was that intentional?
19:32:21 I've made a note to myself to test instance rescue in vexxhost. Both bfv and not
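A sketch of what that rescue test could look like against a throwaway instance; the server and image names are placeholders, and the boot-from-volume variant assumes the >=ussuri compute API mentioned above (microversion 2.87 is where rescue of volume-backed servers landed):

    # Rescue a throwaway test instance: nova shuts it down, boots a rescue
    # image, and attaches the broken instance's disk so it can be mounted.
    openstack server rescue --image ubuntu-jammy test-rescue-target
    # ... inspect/repair from the rescue environment, then ...
    openstack server unrescue test-rescue-target

    # For a boot-from-volume instance, request a newer compute API
    # microversion; the rescue image may also need the hw_rescue_device /
    # hw_rescue_bus properties set for stable device rescue.
    openstack --os-compute-api-version 2.87 server rescue \
        --image ubuntu-jammy test-rescue-bfv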
19:32:50 i have a vague memory that it might be a flavor requirement, but i'm not sure
19:33:01 frickler: i'd have to go back and research, but i feel like it was a requirement of vexxhost
19:33:02 yes, I think at the time the current set of flavors had no disk
19:33:13 their latest flavors do have disk and can be booted without bfv
19:33:27 heh, i think that's three vague memories, maybe that makes one real one :)
19:33:34 I booted gitea-lb02 without bfv (but uploaded the jammy image to vexxhost as raw, allowing us to do bfv as well)
19:34:10 #action clarkb test vexxhost instance rescues
19:34:25 why don't we do that and come back to the idea of passwords for recovery once we know if ^ works
19:34:29 anything else on this subject?
19:34:35 ++
19:34:40 ack
19:35:01 also I'll use throwaway test instances, not anything prod-like :)
19:35:16 #topic OpenSSL v3
19:35:52 As everyone is probably aware, openssl v3 had a big security release today. It turned out to be a bit less scary than the CRITICAL label that was initially shared led everyone to believe (they downgraded it to high)
19:36:07 Since all but two of our servers are too old to have openssl v3 we are largely unaffected
19:36:24 all in all the impact is far more limited than feared, which is great
19:36:41 Also ubuntu seems to think the way they compile openssl with stack protections mitigates the RCE and this is only a DoS
19:37:58 #topic Upgrading Zookeeper
19:38:28 #link https://review.opendev.org/c/opendev/system-config/+/863089
19:38:38 I would like to upgrade zookeeper tomorrow
19:39:01 at first I thought that we could just let automation do it (which is still likely fine) but all the docs I can find suggest upgrading the leader, which our automation isn't aware of
19:39:41 That means my plan is to stop ansible via the emergency file on zk04-zk06 and do them one by one. Followers first, then the leader (currently zk05). Then merge that change and finally remove the hosts from the emergency file
19:39:58 if I could get reviews on the change and hear any concerns about that plan I'd appreciate it.
19:40:13 That said, it seems like zookeeper upgrades are meant to be uneventful if you go release to release
19:40:14 (upgrading the leader last, i think you missed a word)
19:40:24 yup, leader last I mean
19:40:53 the plan sounds fine to me and I'll try to review by your morning
19:41:16 i'll be around to help
19:41:19 thanks!
19:41:54 #topic Gitea Rebuild
19:42:09 There are golang compiler updates today as well and it seems worthwhile to rebuild gitea under them
19:42:19 I'll have that change up as soon as the meeting ends
19:42:38 I should be able to monitor that change as it lands and gets deployed today. But we should coordinate that with the bridge reboot
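Tying back to the zookeeper upgrade plan above, a minimal sketch of how the current leader can be double checked before and during the rolling upgrade, assuming the hosts are zk04-zk06.opendev.org and that the srvr four-letter command is whitelisted on them:

    # Ask each cluster member whether it is a follower or the leader;
    # upgrade the followers first and the leader last.
    for zk in zk04 zk05 zk06; do
        echo -n "$zk: "
        echo srvr | nc "$zk.opendev.org" 2181 | grep Mode
    done
    # Expected output: two hosts report "Mode: follower" and one reports
    # "Mode: leader" (zk05 at the time of this meeting).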
19:43:19 #topic Open Discussion
19:43:19 ++
19:44:06 It is probably worth mentioning that gitea as an upstream is going through a bit of a rough time. Their community has disagreements over the handling of trademarks and some individuals have talked about forking
19:44:27 :/
19:44:38 I've been trying to follow along as well as I can to understand any potential impact to us, and I'm not sure we're at a point where we need to take a stance or plan to change anything
19:44:54 but it is possible that we'll be in that position in the future whether we like it or not
19:44:59 on the zuul-sphinx bug that started occurring with the latest sphinx -- might need to think about how that works when including files, per https://sourceforge.net/p/docutils/bugs/459/
19:49:49 Sounds like that may be it?
19:50:06 Everyone can have 10 minutes for breakfast/lunch/dinner/sleep :)
19:50:20 thank you all for your time and we'll be back here same time and location next week
19:50:23 #endmeeting