19:01:51 #startmeeting infra
19:01:51 Meeting started Tue Nov 22 19:01:51 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:51 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:51 The meeting name has been set to 'infra'
19:02:00 #link https://lists.opendev.org/pipermail/service-discuss/2022-November/000381.html Our Agenda
19:02:22 #topic Announcements
19:02:35 fungi: are individual director nominations still open?
19:03:24 yes, until january
19:03:31 i think?
19:03:52 no, december 16
19:04:03 #link https://lists.openinfra.dev/pipermail/foundation/2022-November/003104.html 2023 Open Infrastructure Foundation Individual Director nominations are open
19:04:15 thanks
19:04:40 also cfp is open for the vancouver summit in june
19:04:57 #link https://lists.openinfra.dev/pipermail/foundation/2022-November/003105.html The CFP for the OpenInfra Summit 2023 is open
19:06:01 #topic Bastion Host Updates
19:06:08 Let's dive right in
19:06:30 the bastion is doing centrally managed known hosts now. We also have a dedicated venv for launch node now
19:07:10 fungi discovered that this venv isn't working with rax though. It needed older cinderclient, and with that fixed now needs something to address networking stuff with rax. iirc this was the problem that corvus and ianw ran into on old bridge? I think we need older osc maybe?
19:07:19 fungi: that may be why my osc is pinned back on bridge01 fwiw
19:07:59 well, the cinderclient version only gets in the way if you want to use launch-node's --volume option in rackspace, but it seems there's also a problem with the default networks behavior
19:08:22 right, I suspect that the reason I pinned back osc but not cinderclient is I was not doing volume stuff but was doing server stuff
19:08:24 ... that does sound familiar
19:08:29 rackspace doesn't have neutron api support, and for whatever reason osc thinks it should
19:08:43 maybe we need to add another option to our clouds.yaml for it?
19:08:52 ianw: I'm pretty sure it's the same issue that corvus ran into and then helped you solve when you hit it. And I suspect it was fixed for me on bridge01 with older osc
19:09:16 fungi: if you can figure that out you win a prize. unfortunately the options for stuff like that aren't super clear
19:09:34 (https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-10-24.log.html#t2022-10-24T22:56:16 was what i saw when bringing up new bridge)
19:09:59 looks like the server create command is hitting a problem in openstack.resource._translate_response() where it complains about the networks metadata
19:10:01 "Bad networks format"
19:10:12 yeah, that's it
19:11:46 we can solve the venv problem outside the meeting but I wanted to call it out
19:11:48 okay, so we think ~root/corvus-venv (on old bridge?) has a working set?
19:11:58 fungi: ya
19:12:03 yeah, that's enough for me to dig deeper
19:12:06 thanks
19:12:06 are there any other bridge updates or changes to call out?
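Since a working set of clients apparently lives in ~root/corvus-venv while the new launch venv misbehaves, one low-effort way to narrow down the culprit is to print the installed client versions from each venv and diff them. This is a minimal sketch only; the package list is a guess at the relevant clients, and the interpreter paths in the comments are placeholders rather than the real locations on either bridge.

    # Minimal sketch: print installed OpenStack client versions so the known
    # good venv can be diffed against the broken launch-node venv. Run it once
    # with each venv's interpreter, e.g.
    #   /path/to/venv/bin/python client_versions.py   (path is a placeholder)
    from importlib.metadata import PackageNotFoundError, version

    PACKAGES = (
        "python-openstackclient",
        "openstacksdk",
        "python-cinderclient",
        "python-novaclient",
        "keystoneauth1",
    )

    for name in PACKAGES:
        try:
            print(f"{name}=={version(name)}")
        except PackageNotFoundError:
            print(f"{name}: not installed")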
19:12:45 the ansible 6 work is close
19:13:00 #link https://review.opendev.org/q/topic:bridge-ansible-update
19:13:08 anything with a +1 can be reviewed :)
19:13:16 oh right I pulled up the change there that I haven't reviewed yet but saw it was huge compared to the others and got distracted :)
19:13:28 #link https://review.opendev.org/c/opendev/system-config/+/865092
19:13:48 is one for disabling known_hosts now that we have the global one
19:14:05 ianw: I think you can go ahead and approve that one if you like. I mostly didn't approve it just so that we could take a minute to consider if there was a valid use case for not doing it
19:15:04 ok, i can't think of one :) i have in my todo to look at making it a global thing ... because i don't think we should have any logging in between machines that we don't explicitly codify in system-config
19:15:12 #link https://review.opendev.org/q/topic:prod-bastion-group
19:15:28 has a couple of changes still if we want to push on the parallel running of jobs.
19:15:47 (it doesn't do anything in parallel, but sets up the dependencies so they could, in theory)
19:16:03 ++ I do want to push on that. But I also want to be able to observe things as they change
19:16:13 probably won't dig into that until next week
19:16:16 that is very high touch, and i don't have 100% confidence it wouldn't cause problems, so not a fire and forget
19:16:20 ++
19:16:52 other than that, i just want to write up something on the migration process, and i think it's (finally!) done
19:17:01 excellent. Thanks again
19:17:14 there are no more points i can think of that we could really automate further
19:18:05 sounds like we can move on to the next item
19:18:07 #topic Upgrading Servers
19:18:25 I was hoping to make progress on another server last week, then I got sick, and now this is behind a few other efforts :/
19:18:46 that said, one of those efforts is mailman3 and does include a server upgrade. Let's just skip ahead to talk about that
19:18:49 #topic Mailman 3
19:19:05 changes for mailman3 config management have landed
19:19:20 after fungi picked this back up and addressed some issues with newer mailman and django
19:19:46 I believe the next step is to launch a node, add it to our inventory and get mailman3 running on it. Then we can plan migrations for opendev and zuul?
19:19:57 er I guess they are already planned. We can go ahead and perform them :)
19:20:45 yeah, the migration plan is outlined here:
19:20:57 #link https://etherpad.opendev.org/p/mm3migration Mailman 3 Migration Plan
19:22:18 anything else to add to this?
19:22:59 sounds great, thanks!
19:23:01 i'm thinking since we said let's do go/no-go in the meeting today for a monday 2022-11-28 maintenance, i'm leaning toward no-go since i'm currently struggling to get the server created and expect to not be around much on thursday/friday (also a number of people are quite unlikely to see an announcement this week)
19:23:22 fungi: that is reasonable. no objection from me
19:23:36 yeah i wouldn't want to take this one on alone :)
19:24:21 later next week might work; we could do a different day than friday given clarkb's unavailability
19:24:45 also apologies for typos and slow response, i seem to be suffering some mighty lag/packet loss at the moment
19:25:23 yup later that week works for me. I'm just not around that friday
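For reference, the per-list work in the etherpad plan essentially comes down to importing each list's Mailman 2.1 configuration and its archives with the standard Mailman 3 / HyperKitty tooling. The sketch below only illustrates that general shape; the list address and file paths are placeholders, the Django settings/environment handling is elided, and the authoritative steps are the ones in the migration plan itself.

    # Rough sketch of a per-list import using the upstream Mailman 3 tools:
    # "mailman import21" for the old config.pck and the "hyperkitty_import"
    # Django management command for the mbox archives. List address and file
    # paths are placeholders; settings/env for django-admin are not shown.
    import subprocess

    LIST = "service-discuss@lists.opendev.org"                        # placeholder
    CONFIG_PCK = "/var/lib/mailman/lists/service-discuss/config.pck"  # placeholder
    ARCHIVE_MBOX = "/var/lib/mailman/archives/service-discuss.mbox"   # placeholder

    subprocess.run(["mailman", "import21", LIST, CONFIG_PCK], check=True)
    subprocess.run(
        ["django-admin", "hyperkitty_import", "-l", LIST, ARCHIVE_MBOX],
        check=True,
    )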
19:25:25 anyway, nothing else on this topic
19:25:26 (the 2nd)
19:25:42 mainly working to get the prod server booted and added to inventory
19:25:43 cool, let's move on to the next thing
19:25:53 #topic Using pip wheel in python base images
19:26:17 The change to update how our assemble system creates wheels in our base images has landed
19:26:36 At this point I don't expect any problems due to the testing I did with nodepool which covered siblings and all that
19:26:50 But if you do see problems please make note of them so that they can be debugged, but reverting is also fine
19:26:58 mostly just a heads up
19:28:35 #topic Vexxhost instance rescuing
19:28:55 I didn't want this to fall off our minds without some sort of conclusion
19:29:11 For normally booted nodes we can create a custom boot image that uses a different device label for /
19:29:27 we can create this with dib and upload it, or make a snapshot of an image we've done surgery on in the cloud.
19:29:58 However, for BFV nodes we would need a special image with special metadata flags and I have no idea what the appropriate flags would be
19:30:38 given that, do we want to change anything about how we operate in vexxhost? previously people had asked about console login
19:31:15 I'm personally hopeful that public clouds would make this work out of the box for users. But that isn't currently the case so I'm open to ideas
19:31:57 i guess review is the big one ...
19:32:17 i guess console login only helps if it gets to the point of starting tty's ... so not borked initrd etc.
19:32:35 ya it's definitely a subset of issues
19:32:58 maybe our effort is best spent trying to help mnaser figure out what that metadata for bfv instances is?
19:33:08 er metadata for bfv rescue images is
19:33:41 yes, being able to mount the broken server's rootfs on another server instance is really the only 100% solution
19:33:47 it does feel like there's not much replacement for it
19:34:16 i'm trying to think of things in production that have failed to boot
19:34:33 ianw: the lists server is the main one. But that's due to its ancientness
19:34:39 (which we are addressing by replacing it)
19:34:42 a disk being down in the large static volume and not mounting was one i can think of
19:35:24 i think i had to jump into rescue mode for that once
19:35:56 for that case, having a ramdisk image in the bootloader is useful
19:36:22 but not for cases where the hypervisor can't fire the bootloader at all (rackspace pv lists.o.o)
19:37:06 ya it definitely seems like rescuing is the most versatile tool. I'll see if I can learn anything more about how to configure ceph for it
19:37:26 in particular jrosser had input and we might be able to try setting metadata like jrosser described and see if it works :)
19:38:11 and if we can figure it out then other vexxhost users will potentially benefit too
19:38:15 so worthwhile I think
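On the BFV rescue metadata question, the flags being alluded to are most likely the stable-device-rescue image properties that newer Nova looks for (hw_rescue_bus / hw_rescue_device), though that is an assumption to confirm with the provider. A minimal sketch of setting them on a rescue image with openstacksdk, with placeholder cloud and image names:

    # Hypothetical sketch: tag a rescue image with the properties Nova's
    # stable device rescue support is believed to require when rescuing
    # boot-from-volume servers. Cloud name, image name, and the bus/device
    # values are placeholders to be confirmed with the cloud operator.
    import openstack

    conn = openstack.connect(cloud="vexxhost")  # placeholder cloud name
    image = conn.image.find_image("opendev-rescue", ignore_missing=False)  # placeholder image
    updated = conn.image.update_image(
        image, hw_rescue_bus="virtio", hw_rescue_device="disk"
    )
    print(updated.properties)

If that turns out to be the right metadata, rescuing a BFV server should then be a matter of something like "openstack server rescue --image <that image> <server>".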
19:39:09 #topic Quo vadis Storyboard
19:39:25 It's been about two weeks since I sent out another email asking for more input
19:39:35 #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000370.html
19:39:49 we've not had any responses since my last email. I think that means we should probably consider feedback gathered
19:40:51 At this point I think our next step is to work out amongst ourselves (but probably still on the mailing list?) what we think opendev's next steps should be
19:41:31 to summarize the feedback from our users, it seems there aren't any specifically interested in using storyboard or helping to maintain/develop it.
19:41:47 There are users interested in migrating from storyboard to launchpad
19:41:48 the options are really to upgrade it to something maintainable or somehow gracefully shut it down?
19:42:13 do we know whether there is any other storyboard deployment besides ours?
19:42:22 ianw: yes I think ultimately where that puts us is we need to decide if we are willing to maintain it ourselves (including necessary dev work). This would require a shift in our priorities considering it hasn't been updated yet
19:42:39 ianw: and if we don't want to do that, figure out if migrating to something else is feasible
19:42:56 frickler: there were some once upon a time but I'm not sure about today
19:44:08 as always people can fork or take up maintenance later if it's something they're using, regardless of whether we're maintaining it
19:45:03 right, I'm not too worried about others unless they are willing to help us. And so far no one has indicated that is something they can or want to do
19:45:26 I don't think we should make any decisions in the meeting. But I wanted to summarize where we've ended up and what our two likely paths forward appear to be
19:45:27 it feels like if we're going to expend a fair bit of effort, it might be worth thinking about putting that into figuring out something like using gitea issues
19:45:56 ianw: my main concern with that is I don't think we can effectively disable repo usage and allow issues
19:46:06 it's definitely something that can be investigated though
19:46:29 right now there are a lot of features in gitea which we avoid supporting because we aren't allowing people to have gitea accounts
19:46:47 and a number of features we explicitly don't want
19:47:07 as soon as we do allow gitea accounts, we need to more carefully go through the features and work out which ones can be turned off or which we're stuck supporting because people will end up using them
19:47:17 But we've got really good CI for gitea so testing setups shouldn't be difficult if people want to look into that.
19:47:48 also not allowing authenticated activities has shielded us from a number of security vulnerabilities in gitea
19:47:57 ++
19:48:37 anyway I do think we should continue the discussion on the mailing list. I can write a followup indicating that we have options like maintain it ourselves and do the work, or migrate to something else (lp and/or gitea), and try to outline the effort required for each?
19:48:51 Do we want that in a separate thread? Or should I keep rolling forward with the existing one
19:49:27 i do agree, but it feels like it would be useful time spent investigating that path
19:52:58 it probably also loops back to our authentication options, something that would also be useful to work on i guess
19:53:21 ya at the end of the day everything ends up intertwined somehow and we're doing our best to try and prioritize :)
19:53:43 (fwiw I think the recent priorities around gerrit and zuul and strong CI have been the right choice. They have enabled us to move more quickly on other things as necessary)
19:54:10 sounds like no strong opinion on where exactly I send the email. I'll take a look at the existing thread and decide if a follow up makes sense or not
19:54:19 but I'll try to get that out later today or tomorrow and we can pick up discussion there
19:54:41 agreed, i would want to find time to put into the sso spec if we went that route
19:56:19 #topic Open Discussion
19:56:24 Anything else before we run out of time
19:57:07 just thx to ianw for the mastodon setup
19:57:13 reminder that i don't expect to be around much thursday through the weekend
19:57:19 fungi: me too
19:57:21 that went much easier than I had expected
19:57:30 my kids actually have wednesday - monday out of school
19:57:33 I wish I had realized that sooner
19:57:51 well, if you're not around tomorrow i won't tell anyone ;)
19:58:11 ++ enjoy the turkeys
19:59:51 and we are at time. Thank you for your time everyone. We'll be back next week
19:59:54 thanks clarkb!
19:59:54 #endmeeting