19:01:04 #startmeeting infra
19:01:04 Meeting started Tue Oct 11 19:01:04 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:04 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:04 The meeting name has been set to 'infra'
19:01:06 \o
19:01:21 #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000364.html Our Agenda
19:01:24 #topic Announcements
19:01:52 The PTG is happening next week. Be aware of that as we make changes (best to avoid updating meetpad and etherpad for example)
19:02:16 Also encourage people that have problems to reach out to us directly. In the past we've gotten reports of trouble via a game of telephone and that has been difficult
19:03:19 Also, this morning I spilled a glass of water on my keyboard so I've been in disarray all day. Thankfully it was the desktop keyboard and not the laptop and I've got a spare Model M that has been plugged in
19:03:33 but still it's really weird to type on a new keyboard after having that one for a decade
19:03:50 the zuul default ansible version changed to 6 across all of our tenants as of late last week, and we'll be dropping ansible 5 support "real soon" so fixing related problems is time-sensitive
19:04:31 probably warrants another announcement if we can nail down a timeline for the ansible 5 removal change
19:04:52 I did at least warn of the ansible 5 removal in the email announcing the switch to 6 by default
19:05:00 yep
19:06:40 #topic Topics
19:06:49 #topic Bastion Host Updates
19:07:20 The changes to stop writing console log files on bridge landed yesterday. Looks like there was a small issue getting the flag name correct. ianw do we have an idea yet if that is working as expected?
19:08:07 and we still need a similar one for static.o.o right? or is that already up?
19:08:30 i just checked and i think so. there hasn't been a log file written since Oct 11 06:09 (UTC) which is about when all the periodic jobs cleared out
19:09:26 static has been done the same way, so it looks good too
19:09:39 awesome. That's one less thing to worry about now :)
19:10:24 heh yes, thank you for reviews. i think it was good to reach a bit more of a generic solution
19:10:59 the "tunnel console things via a socket and the ssh connection" changes are another option that is still on my todo list, and seems like a great thing to look into as well
19:11:11 one day ... :)
19:11:21 ya though I think we don't want to expose that on bridge or static due to how the protocol works
19:11:30 since it could be used to read other files?
19:11:38 I think what we've done not exposing it is correct for us
19:11:42 is intended for reading additional files
19:11:57 designed with that in mind anyway
19:12:33 The other stack of changes in flight here has to do with ansible in a venv
19:12:38 #link https://review.opendev.org/q/topic:bridge-ansible-venv
19:12:43 er... the current protocol shouldn't be able to read other files?
19:13:13 corvus: ya I don't think it does, but that was always part of the intention iirc. And I don't want to have to think about undoing any opening of things if that changes
19:13:38 i know it was at least a theoretical future use case
19:13:47 clarkb: yes, it was originally designed for that, but we haven't implemented it because we haven't figured out how to do it safely
19:13:58 gotcha
19:14:14 so if that's the concern -- just future-proofing, cool. but if there was a thought that we had a current vulnerability...
then i would like to explore that more. :)
19:14:41 oh no I don't think we have a current vulnerability. We're set up to avoid it should the zuul behavior change to what was (I thought anyway) the intended behavior
19:15:22 (and there's really 2 protocols here -- there's the websocket/finger protocol of user -> zuul, and the internal protocol of executor -> node; the former is the one we designed to allow reading other files in the future, and the second is what we just changed)
19:16:04 (though support for the former probably would need changes like the latter)
19:16:58 okay, so i think we're all on the same page that currently there is not the ability to read arbitrary files, but that we like the status quo of explicitly disabling log streaming on bridge because, among other things, that future-proofs us against eventually adding that feature. ya?
19:17:06 ++
19:17:40 cool, thx and sorry for the diversion. just wanted to make sure we didn't open something we didn't intend to. :)
19:18:04 ianw: for ansible in a venv did you manage to sort out using the first member of a singleton group as the hosts specification?
19:18:28 (specifically i was talking about https://review.opendev.org/c/zuul/zuul/+/542469 but let's not go into that further now :)
19:19:17 clarkb: thanks ... one step back i just approved the change reviewed by yourself and fungi to move the production ansible into a venv on the current bridge. so i'll watch that go in today. that's the "venv" bit of it really
19:20:01 and that preps us for being able to use newer ansible, right?
19:20:03 ya and in theory that should just switch over due to symlinking the venv install over to ansible
19:20:10 (that was my thought during review anyway)
19:20:29 yep, *in theory* it's a noop :)
19:20:39 fungi: sort of, we need to upgrade the python installation too (which is where the replacement node comes in and why the other group work is related)
19:20:41 the bits on top now are about upgrading to jammy, and abstracting the way we address the bastion host so we can switch the host more easily -- in this case to probably bridge01.opendev.org
19:21:11 anyway, i did establish that as a playbook matcher "groupname[0]" does seem to work to address the first member of a group
19:22:08 like `- hosts: bridgegroup[0]` means this is a play that runs on the first host in the bridge group?
19:22:37 (er, in the group named "bridgegroup"; i was trying to be clear and may have failed :)
19:22:37 and group member ordering is guaranteed deterministic (uses the order in which the members are added i guess) right?
19:22:38 ya, the idea being we can control what the bridge is in a single place (the bridgegroup group) but then only ever have a single entry in that group
19:22:50 yep -- https://review.opendev.org/c/opendev/system-config/+/858476
19:22:52 fungi: the idea is that it would be a singleton group
19:23:03 but to enforce that we would take the first entry everywhere
19:23:09 i see
19:23:18 why not just let it run on the whole group of 1?
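
For context, a minimal sketch of the group[0] host pattern being discussed, assuming an inventory group named "bridgegroup" with a single member; the group name and task are illustrative, not necessarily what the actual change uses:

  # Illustrative playbook: address only the first member of the group,
  # even though the group is expected to contain exactly one host.
  - hosts: bridgegroup[0]
    tasks:
      - name: Confirm which host Ansible selected
        ansible.builtin.debug:
          msg: "Running on {{ inventory_hostname }}"

The intent is that plays needing a single well-known bastion keep referring to the group name, while the [0] index makes "exactly one host" explicit.
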
19:23:49 the reason I was concerned with that is it makes the ansible really confusing when you need to address a specific node
19:23:59 like when grabbing the CA files
19:24:27 the ansible you express becomes "create a different CA on every member of the bridge group, but only distribute the CA files for the first group member"
19:24:43 if others prefer that I'm ok with that too, but I found it a bit confusing to read when I reviewed it
19:25:06 corvus: one problem i haven't dealt with yet is playbooks/bootstrap-bridge.yaml. that runs both under zuul, where the inventory is setup via the job, and in infra-prod, where the inventory is setup by opendev/base-jobs
19:25:55 i'm not sure whether or not i would have the same confusion, but i certainly see your point, and the solution seems good. now that i know the reasoning, i can be on board with that.
19:26:02 so basically both have to agree on the name/group. this is a bit annoying for clarkb's note of trying to use a different group name for the initial setup bastion host, and the production version
19:26:20 sorry, that wasn't intended for corvus: ... :)
19:26:56 oh hrm if using distinct groups for the top level ansible and nested ansible in CI is problematic I think we can just not do that
19:27:06 oh whew cause that's a hard question and i was struggling with that. glad i'm off the hook. :)
19:27:08 it was an idea I had when trying to sort out why the job needed to redefine the group
19:27:47 yeah, it is mostly explained in the comment at https://review.opendev.org/c/opendev/system-config/+/858476/9/zuul.d/system-config-run.yaml
19:28:32 anyway -- i will keep at it and see what we can come up with; i don't think we need a solution now
19:28:42 intuitively, having the group name be the same makes sense to me... so if that's a workable/livable option i would be in favor of that.
19:29:35 i think that's where i'm coming back to as well ...
19:30:13 and maybe keep a version of that comment explaining that we're using that as a group for the zuul playbook
19:30:24 sounds good to me
19:30:27 wfm
19:30:51 yes i will definitely do my usual probably-too-verbose commenting on all this :)
19:32:22 anyway, I think it's quite likely by this time next meeting we'll have a fully updated bridge, and an easier path when we want to rotate it out next time as well
19:32:39 sounds good. Thank you for working through all the little details of this
19:32:47 #topic Upgrading Bionic Servers
19:33:21 The expected fix for removing the ubuntu user has landed. Now just need to try booting a jammy control plane server again. I'm hoping to give that a go sometime this week.
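
Circling back to the venv change approved earlier in the bastion discussion, a rough sketch of what moving production Ansible into a venv and symlinking it into place could look like; the paths, version pin, and task layout are illustrative assumptions, not the actual system-config implementation:

  # Illustrative tasks only: install Ansible into a dedicated venv, then
  # symlink the entry point so existing tooling keeps calling ansible-playbook.
  - name: Install Ansible into a virtualenv
    ansible.builtin.pip:
      name: "ansible>=6,<7"
      virtualenv: /opt/ansible-venv
      virtualenv_command: "python3 -m venv"

  - name: Point the system ansible-playbook at the venv copy
    ansible.builtin.file:
      src: /opt/ansible-venv/bin/ansible-playbook
      dest: /usr/local/bin/ansible-playbook
      state: link
      force: true
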
19:33:32 Sounds like ianw may also give it a go
19:33:43 But other than that I didn't have any new updates here
19:33:58 we'll want it before we boot the new listserv at the very least
19:35:35 yup I was thinking I'd find something easy to replace as a guinea pig like a mirror maybe
19:35:44 but probably not until the end of this week
19:36:05 Let's keep moving as the last topic on the agenda is one that deserves discussion before we run out of time
19:36:07 #topic Mailman 3
19:36:41 fungi has edited the extra long strings on the production mailman2 site and has begun the process of copying data for reattempting the mm3 migration on a newly held test node with our forked images
19:37:03 new held node for this is 149.202.168.204, built from your container image fork
19:37:16 will hopefully kick off a new scripted import on it within the next hour or so
19:37:27 depending on how much longer the rsync runs
19:37:28 corvus: we noticed that a child change of https://review.opendev.org/c/opendev/system-config/+/860157 doesn't find the images that change builds. And were wondering if we got the bits wrong for telling zuul about the image
19:37:58 corvus: maybe if you get some time you can take a look at how the new image build jobs and system-config-run-mailman3 job are hooked up with the buildset registry and provides/requires and dependencies
19:38:15 we've worked around it by forcing the node hold change to rebuild the images itself
19:38:48 fungi: anything else you need from the rest of us? I expect it is largely just a wait for test results though
19:39:05 we've knocked out about all the remaining todo items, so we're probably ready to talk scheduling for lists.opendev.org and lists.zuul-ci.org production migrations
19:39:34 clarkb: let's continue that in #opendev
19:39:37 i did want to check a few more urls for possible easy/convenient redirects (things like list description pages which people tend to link in various places)
19:40:16 stuff not covered by keeping a copy of the pipermail archives hosted from the new server
19:40:18 corvus: yup don't need to solve that here
19:40:41 fungi: good idea, the existing redirects are probably not much help though as they redirect to content on disk but you probably want url redirects to mm3 urls for those
19:41:40 right. i think the list description pages are probably the only thing we really care about redirects for
19:42:04 the list indexes for the sites are just served from the root url of each vhost anyway
19:42:22 and i'm not too worried about redirecting old admin and moderator interface urls
19:42:43 makes sense
19:42:51 anything else on mm3 before we continue?
19:43:21 we should probably also confirm whether we want local logins for users or whether there's a desire to hold this for keycloak integration in order to avoid local credentials in mailman
19:44:06 i'm assuming we'd rather get the mm3 migration done and then look at keycloak integration after the fact, but just want to be sure everyone's on the same page there
19:44:13 you can subscribe to lists without creating a user (I did this with upstream mm3)
19:44:21 correct
19:44:30 we might even encourage users to do that if they never want to use the web ui for responding to things
19:44:44 but ya I wasn't too worried about a future switch over
19:44:49 just off the top of my head, it feels like if we allow local logins and then move to a more generic keycloak, we then have the problem of having to merge the local users too?
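
Backing up to the image job wiring mentioned above, a minimal sketch of how an image build job and a deployment test job are typically connected in Zuul via provides/requires plus a job dependency; the job and artifact names here are illustrative, not the actual system-config job definitions:

  # Illustrative Zuul job config: the build job advertises an artifact,
  # the consuming job requires it and depends on the build job so the
  # image is available through the buildset registry.
  - job:
      name: example-build-mailman3-image
      provides: mailman3-container-image

  - job:
      name: example-run-mailman3
      requires: mailman3-container-image
      dependencies:
        - example-build-mailman3-image

If any of those three pieces (provides, requires, dependencies) is missing or names don't match, the consuming job falls back to whatever image it can pull, which matches the symptom of the child change not finding the freshly built images.
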
19:44:53 list admins/moderators will need accounts though, and if someone wants to adjust their subscription preferences they'll need a login
19:45:24 ianw: yes we'd likely need to do that. The good thing is we should have email on both sides to align them at least
19:45:50 ianw: we'll have that either way. subscribers technically all have accounts, they just don't necessarily have login info for them unless they go through the password reset
19:46:35 ahh ok
19:46:43 is the login per list or per site or per installation? for mm2 it was per list iiuc
19:46:50 frickler: it's per installation
19:47:03 frickler: right, for mm3 it's system-wide
19:47:18 so not just all lists on a given site, but all mailman sites on that server
19:47:54 convenient for folks who interact with a lot of lists, especially across multiple domains on the sam ehost
19:48:00 same host
19:49:09 so if this is needed to set e.g. digest mode, I think we cannot delay it into the future
19:49:09 anyway, i didn't have anything else. we can mull that over, i expect we'll start doing migration scheduling after the ptg
19:49:21 frickler: correct
19:50:08 basically the options are 1. wait to migrate lists to mm3 until we have keycloak in production the way we want, or 2. migrate to mm3 and then integrate keycloak later and make sure accounts can be linked/merged as needed
19:50:18 right, I think some users will still need to create accounts, but a good chunk of them shouldn't need to which helps simplify things if we want to try and keep them simple like that
19:50:28 I'm fine with 2
19:50:37 ack
19:50:47 well, to reiterate, the accounts are precreated, whether the users have login info for them or not
19:51:02 fungi: for all users?
19:51:16 I guess the migration doesn't stick to not creating an account if it doesn't need to
19:51:52 if they're referenced in a config (admin, mod, existing subscription) then the import process creates their accounts. if they subscribe later an account is created the first time they do so
19:51:57 anyway I think it's fine to migrate them later since in this case we should have the info needed to make associations
19:52:38 also the mailing list is the sort of thing that can probably safely not have single sign on forever
19:53:07 we are running out of time and I do want to get to the last item on the agenda
19:53:13 we can return to this in #opendev if necessary
19:53:23 please do
19:53:24 #topic Updating OpenDev's base job nodeset to Jammy
19:53:51 It has been pointed out that OpenDev's base job nodeset is still Focal. Jammy has been out for about half a year now and has a .1 release. It should be stable enough for our jobs
19:54:05 But that opens questions about how we want to communicate and schedule the switch
19:54:16 yes, I came across that while looking to upgrade devstack jobs
19:54:42 I was thinking that we should avoid changing it before the PTG since that will just add a distraction during PTG week. But maybe we can do it the week after ish? Basically do a 2 week notice to service-announce and then swap?
19:54:47 openstack is actively switching from focal to jammy for testing now that their zed release is done
19:54:55 I think we'd want to run some tests with base-test before discussing details of scheduling?
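
For reference, a minimal sketch of how an individual job can pin its nodeset explicitly, either to stay on Focal after the default changes or to opt into Jammy early; the job names are illustrative, while ubuntu-focal and ubuntu-jammy are assumed to match the opendev nodeset labels being discussed:

  # Illustrative job definitions: jobs that care about the platform can
  # set a nodeset explicitly instead of inheriting the tenant default.
  - job:
      name: example-job-stays-on-focal
      nodeset: ubuntu-focal

  - job:
      name: example-job-opts-into-jammy
      nodeset: ubuntu-jammy
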
19:55:30 frickler: in the past we've done that (when the infra team managed this all for openstack) and the problem with that is it sets the expectation that we are responsible for making it work for every job
19:55:49 It was the xenial switch or maybe trusty switch that made me never want to do that again.
19:56:10 I think people should test what they are interested in and be explicit where they know they need to be (say for specific versions of python).
19:56:36 still we'd need to change base-test in order to allow for that?
19:56:45 #link https://review.opendev.org/c/opendev/base-jobs/+/860686 would be the change for that
19:56:48 frickler: no, any job can select the jammy nodeset
19:56:48 anything inheriting from our default nodeset which breaks when we change it has the option of overriding the nodeset it uses to the earlier value anyway
19:57:19 just as it can be adjusted to use the new value before our planned transition date
19:57:28 hmm, true that
19:57:37 I think updating base-test is a good idea to keep it in sync with base. But I don't think that is the method for testing this. base-test is for testing the roles in base
19:57:47 we know they work on jammy because projects like zuul already use jammy
19:57:51 so we don
19:58:00 er we don't need to test that base functionality
19:58:30 i agree i don't think this needs a base-test cycle since we know that the change won't break all jobs (because we can and have made the change explicitly elsewhere, and zuul performs syntax validation on the change)
19:58:42 in my mind, the main questions are when do we plan to switch it/how much advance notice do we want to provide users
19:58:51 fungi: ++
19:58:59 I think we should wait for after the PTG at the very least
19:59:18 wait for after the ptg to announce it, or for actually changing it?
19:59:32 actually changing it. Ideally we should announce whatever we decide on real soon now
19:59:46 2 week notice should be fine then. announce now
19:59:50 sounds good to me
19:59:54 ++
20:00:15 cool. I can work on a draft for service-announce after lunch today
20:00:22 (I'm happy to send that as I think most others get moderated)
20:00:28 however, we should be mindful of the zuul dropping ansible 5 situation as well, and whether we want those to coincide, or be announced together, or not compete
20:01:16 dropping ansible 5 has already been announced but without a hard date. I think it was a week or so from today that zuul had planned to drop ansible 5
20:01:38 agree, having a couple of days between them will help in distinguishing failure causes
20:02:00 we will need to manually restart zuul to pick up that change quicker than our weekly restarts. But that is easy to do
20:02:11 (also I don't think anything is using ansible 5 so should be an easy switch)
20:02:29 I'll work on a draft email for all that in a bit
20:02:34 thanks!
20:02:37 and we are at time
20:02:39 for mine i think it probably gets confusing to combine them as a single change, as they're not really related as such, so agree with doing separately
20:02:52 thanks everyone
20:02:59 thanks clarkb
20:03:02 feel free to continue discussion over in #opendev
20:03:05 #endmeeting