19:01:06 <clarkb> #startmeeting infra
19:01:06 <opendevmeet> Meeting started Tue Feb 14 19:01:06 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:06 <opendevmeet> The meeting name has been set to 'infra'
19:01:16 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VH5EQYJH3E2YKTP3K4IXQI27WRRSEUMR/ Our Agenda
19:01:31 <clarkb> #topic Announcements
19:02:04 <clarkb> It is Service Coordinator nomination time. I've not seen any nominations yet and the time period ends today. I suppose thats a gentle way of saying I should keep doing it?
19:02:33 <fungi> congratudolences!
19:02:36 <clarkb> If no one speaks up indicating their interest before I finish lunch today then I guess I'll make my nomination official after lunch
19:03:53 <clarkb> The other announcement today is that we'll cancel next week's meeting. fungi and I are traveling and busy next tuesday. Thats ~50% of our normal attendance so I think we can just skip
19:04:26 <ianw> ++
19:04:45 <clarkb> #topic Bastion Host Updates
19:04:51 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:04:56 <clarkb> #link https://review.opendev.org/q/topic:prod-bastion-group Remaining changes are part of parallel ansible runs on bridge
19:05:28 <clarkb> ianw: ^ you should just start nagging me to review that first set of changes. I keep putting it off due to distractions. I have been doing zuul reviews today and when I get tired of those I should do some opendev reviews too
19:05:49 <clarkb> (zuul early day due to overlap with europe is good then opendev late day due to overlap with au is good :) )
19:05:59 <ianw> :) i should loop back on the parallel stuff too
19:06:17 <ianw> it probably needs remerging etc.
19:06:31 <clarkb> are there any other bastion concerns?
19:07:11 <ianw> it being a jammy host it hits
19:07:14 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/872808
19:07:23 <ianw> with the old apt-key config
19:07:37 <ianw> that's all i can think of
19:07:40 <clarkb> oh I had a question about that which i guess I didn't post on the review directly (my bad)
19:07:50 <clarkb> specifically how new of an apt do we need to support that method of key trust
19:08:05 <clarkb> We would need it to work on bionic and newer iirc
19:08:15 <ianw> i think 1.4 which is >=bionic
19:08:42 <clarkb> and I guess reverting and making it distro release specific isn't too terrible either
19:08:58 <clarkb> I'll do a quick rereview after the meeting since i didn't record my previous thoughts properly
19:10:43 <clarkb> #topic Mailman 3
19:11:09 <clarkb> We're still poking at the site creation stuff last I saw, but there was one other thing that had a change to address it
19:11:15 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/873337 Fix warnings about missing migrations
19:11:31 <fungi> my testing on the previous held node was probably invalid because of the lingering db migration issue which i think is what also resulted in my container restart errors
19:12:05 <fungi> i thought i had approved 873337 already but i guess not
19:12:07 <fungi> approved now
19:12:23 <fungi> i'll get a new held node once we have new images
19:12:26 <clarkb> sounds good
19:12:38 <fungi> or i guess that fix won't need new images
19:12:49 <fungi> so i can recheck the dnm change as soon as that merges
19:12:58 <clarkb> Any other mailman related items? I think we've managed to chip away at most of it other than the site creation to fix vhosting (which makes sense since that is the complicated bit with db migrations)
19:13:34 <fungi> i don't have anything else, no. i still haven't had time to wrap my head around creating new sites with django migrations
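
(For reference, since site creation with django migrations came up just above: a minimal sketch of a data migration that adds a django.contrib.sites entry. This is not taken from the mailman3 deployment; the domain, migration dependency, and helper names below are illustrative assumptions, and the real vhost setup may need additional per-site objects.)

    # illustrative django data migration creating a new Site row
    from django.db import migrations


    def add_site(apps, schema_editor):
        # use the historical model, as django recommends in data migrations
        Site = apps.get_model('sites', 'Site')
        Site.objects.get_or_create(
            domain='lists.example.opendev.org',   # illustrative domain
            defaults={'name': 'lists.example.opendev.org'},
        )


    def remove_site(apps, schema_editor):
        Site = apps.get_model('sites', 'Site')
        Site.objects.filter(domain='lists.example.opendev.org').delete()


    class Migration(migrations.Migration):
        # assumed dependency on the sites app's latest built-in migration
        dependencies = [('sites', '0002_alter_domain_unique')]
        operations = [migrations.RunPython(add_site, remove_site)]
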
19:13:38 <clarkb> #topic Gerrit Updates
19:13:50 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs
19:14:06 <clarkb> This update has been announced. I think we can probably land it whenever we're confident jeepyb is happy (and last I saw it was working?)
19:14:14 <clarkb> ianw: ^ not sure if you had a specific plan for that one
19:14:52 <ianw> i think it might need to be manually applied as it will take longer than the job timeout
19:15:20 <clarkb> ianw: jeepyb will only update those 90ish repos? Oh except those files might be used in more than 90 repos
19:15:59 <clarkb> if we think increasing the timeout would work that seems fine, otherwise manual application also seems fine.
19:16:00 <fungi> we could chunk the change up into batches i guess, but manual manage-projects run seems fine to me
19:16:55 <ianw> it's *probably* ok, but still. might be an idea to 1) put in emergency 2) down gerrit 3) run a manual backup run 4) up gerrit 5) manually apply? 6) remove from emergency?
19:17:41 <clarkb> ianw: do we think we need a backup like that? The acls are all in git so theoretically we can just revert them if necessary
19:18:05 <clarkb> mostly just thinking that I'm not sure a gerrit downtime is necessary
19:19:06 <ianw> i could go either way; i was just thinking it's an unambiguous snapshot
19:19:40 <clarkb> I think I'm willing to trust the acl system's historical record here. We've relied on it in the past and can continue to do so
19:19:41 <ianw> i guess we have now double-checked all the acl files, and gerrit shouldn't let us merge anything it doesn't like
19:20:49 <fungi> right, as long as we merge it through gerrit rather than behind its back, the worst that should happen is manage-projects throws errors and we can't create new projects or update existing ones for a little while until we sort it out
19:21:26 <fungi> or would have to take manual action in order to do so at least
19:21:41 <clarkb> I guess doing a canary change with a smaller set of updates might be good if we're worried about getting syntax wrong etc
19:21:57 <clarkb> but ya I think a downtime for backups is overkill given gerrit's builtin checks and record keeping
19:22:22 <fungi> technically the syntax is already checked by the manage-projects test in our gerrit job, right?
19:22:38 <clarkb> fungi: yes, but using the rules we try to interpret from gerrit not gerrit itself
19:22:45 <clarkb> and this is a new set of rules so possible we got it wrong
19:23:11 <fungi> oh, i guess i thought we had an integration test creating a project in gerrit
19:23:30 <clarkb> not using our production acls
19:23:44 <clarkb> that would take too long to run probably unfortunately
19:23:54 <fungi> and the acl change doesn't update the test acl to match?
19:24:07 <clarkb> I don't think so.
19:24:20 <fungi> i didn't think to check that myself
19:24:48 <clarkb> they are decoupled. We test jeepyb + gerritlib in that bubble. We test our deployment of gerrit in system-config. And then we do simple linter type checks in project-config
19:24:57 <clarkb> this change is to project-config and doesn't impact the others
19:25:00 <ianw> i don't think we set any conditions in system-config, but we can
19:25:35 <clarkb> ya maybe that is better than doing a canary change
19:25:45 <clarkb> just to make sure we get the general syntax correct
19:26:02 <clarkb> But ya I'm not too worried about it given the ability to rollback etc
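
(For context on the general syntax being discussed: the deprecated per-label copy flags are replaced by a single copyCondition expression in each label section of project.config. The snippet below shows only the documented shape of that migration; it is not text from the 867931 change itself.)

    # before: deprecated boolean copy flags on a label
    [label "Code-Review"]
        copyAllScoresOnTrivialRebase = true
        copyAllScoresIfNoCodeChange = true

    # after: one copyCondition expression covering the same cases
    [label "Code-Review"]
        copyCondition = changekind:TRIVIAL_REBASE OR changekind:NO_CODE_CHANGE
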
19:26:13 <clarkb> There were two other Gerrit related items
19:26:27 <ianw> ok, will do, and then i'll plan to apply it manually just to really watch it closely and because of timeouts, but with no downtime
19:26:28 <clarkb> Both of which i've put on the Gerrit community meeting agenda for March 2 (8am pacific)
19:27:07 <clarkb> The first is Java 17 support. I have a change up to swap us to java 17 which works in our CI jobs. But you have to use an ugly java cli option to make it happen which seems at odds with their full compatibility statement
19:27:38 <clarkb> I'm hoping to get a better sense of that support in the community meeting and if thats the path forward I guess we roll with it
19:27:48 <clarkb> The other is the ssh connectivity problem with channel tracking
19:28:18 <clarkb> ianw: has been digging into this quite a bit and I think discovered a bug in the upstream implementation of channel tracking. That doesn't explain why ssh is unhappy though right? just that fixing the bug will get us better information when those cases happen?
19:29:09 <ianw> so i think the bug means that the workaround committed was actually not doing anything
19:29:15 <clarkb> #link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.
19:29:40 <ianw> #link https://gerrit-review.googlesource.com/c/gerrit/+/358314
19:29:42 <clarkb> ianw: "the workaround" is to disable channel tracking?
19:30:15 <ianw> specifically it's https://gerrit-review.googlesource.com/c/gerrit/+/238384
19:30:41 <ianw> what that is supposed to do is track when a ssh channel is opened in a variable
19:31:28 <ianw> then, if an unhandled channel error is raised by mina, it looks at what channel it was, and if that channel has been opened before, basically ignores it
19:31:58 <clarkb> ah but since it wasn't tracking it that error propagates. So your fix may be the actual fix too
19:32:29 <ianw> right, the "track when opened" was never running because it wasn't registered to receive the channel open events
19:32:46 <clarkb> In that case I can use the community meeting to beg for reviews if they haven't landed it by then
19:32:51 <clarkb> :)
19:32:59 <ianw> so ... it's a fix ... but it doesn't really seem to answer any questions of what's going on
19:33:25 <clarkb> which is why your extra logging change remains to collect that info and hopefully debug the underlying situation
19:33:48 <ianw> #link https://gerrit-review.googlesource.com/c/gerrit/+/357694
19:34:46 <ianw> right, yeah that change has logging for basically every channel event. but i'm not sure how much it helps now -- since we would be getting log messages when the channel is initialized from the prior change, which was mostly what we were interested in
19:35:44 <ianw> i don't know. i think maybe merge the "fix" and just move on with life and don't think too hard about it :)
19:35:58 <clarkb> works for me. I'll bring it up with gerrit if we don't manage to make progress before the meeting
19:36:25 <ianw> something is still not quite right in mina I think, but this probably isn't the context to find it
19:37:50 <clarkb> #topic Upgrading Servers
19:38:08 <clarkb> I'm trying to pick this up again and have begun looking at the gitea backends
19:38:22 <clarkb> A couple of things make this easier than I feared and one thing makes this painful :)
19:39:04 <clarkb> We control the gitea ansible group independently of what servers haproxy load balances to and gerrit replicates to. This means we can pretty easily spin up a new gitea on a new server running with a bunch of empty git repos
19:39:27 <clarkb> Then when we are happy with the state of the server add it to gerrit replication, force gerrit to replicate everything to that server, then wait
19:39:47 <clarkb> Then add the server to haproxy and probably remove an old server. Repeat in a loop
19:40:02 <clarkb> What makes this painful/difficult is ensuring gitea state is what we want it to be. Specifically for redirects
19:40:52 <clarkb> I poked around in a held gitea test node's db yesterday and I think we can construct the redirects from scratch given info we have, but one thing that complicates that is we need to create gitea orgs that don't exist in projects.yaml
19:41:22 <clarkb> essentially leading me to realize that bootstrapping that all from an empty state is probably more effort than necessary right now (though a noble exercise and maybe one we should get around to eventually)
19:41:43 <clarkb> instead I think we should stop gitea after the initial bring up then replace its db with a prod db
19:41:58 <clarkb> er replace its fresh db with a copy of a prod db from an old host
19:42:12 <clarkb> that will bring over the other orgs and redirects in theory.
19:42:38 <clarkb> What I'm concerned about doing this is that maybe we'll end up with stuff missing on disk. But since we never have to put the server into a public facing capacity until we are happy with it I think we just do that and see if it works
19:43:14 <clarkb> Looking at my current calendar and todo list maybe I can spin up that new server tomorrow, get it deployed as a blank gitea then start attempting to make it a prod like gitea on thursday
19:44:00 <clarkb> For things that are not gitea we have etherpad, nameservers, static, mirrors, and jitsimeet. Of those I think etherpad and nameservers are the priorities
19:44:04 <ianw> this is all because in the past we've made the gitea projects as usual via the api, but then they've been moved, which we've also done via gitea, which has internally applied db updates to reflect this on its instance, but when we're starting a new host we have no way of capturing this (at the moment, at least), right?
19:44:37 <clarkb> ianw: correct. We have a repo that captures the renames at that point in time but there is no tooling to apply that to gitea as a set of old orgs and redirects
19:45:06 <clarkb> I suppose as an alternative we could do inplace server upgrades. But I like to avoid those when we can
19:45:54 <ianw> it is always nice to validate we can start fresh
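
(For context on the bootstrapping gap described above: orgs that exist only as rename targets would have to be recreated on a fresh backend before redirects could point at them. A minimal sketch of what such a helper might look like against Gitea's REST API; no such tool exists today, and the host, token handling, and org names below are illustrative assumptions.)

    # hypothetical sketch: recreate orgs that exist only because of past
    # renames, so a fresh gitea backend has somewhere to hang redirects.
    # endpoints come from gitea's documented v1 REST api; everything else
    # (host, token, org list) is made up for illustration.
    import requests

    GITEA_URL = 'https://gitea99.opendev.org:3081'  # illustrative backend
    session = requests.Session()
    session.headers['Authorization'] = 'token REDACTED'

    # orgs referenced only by historical renames, e.g. gathered from the
    # rename records kept in project-config (names here are placeholders)
    retired_orgs = ['old-org-one', 'old-org-two']

    for org in retired_orgs:
        # only create the org if gitea does not already know about it
        if session.get(f'{GITEA_URL}/api/v1/orgs/{org}').status_code == 404:
            resp = session.post(f'{GITEA_URL}/api/v1/orgs',
                                json={'username': org})
            resp.raise_for_status()
            print(f'created placeholder org {org}')
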
19:46:17 <clarkb> For the other servers I'm thinking etherpad and nameservers are the other priorities. In particular I had some notes about doing the nameservers but am not really confident in the process for that. If anyone has time to think that through and write out a small plan that would be appreciated
19:46:27 <clarkb> I suspect I'm overcomplicating the effort to update the nameservers in my head
19:46:58 <clarkb> and yes help much appreciated. Thanks for all the help so far too
19:49:29 <ianw> ++ i can have a look at nameservers
19:49:58 <clarkb> #topic Quo vadis Storyboard
19:50:11 <clarkb> This topic like the service has become a victim of a lack of time
19:50:49 <clarkb> I don't have anything new here. But maybe we should have a meeting dedicated to this in order to create a forcing function to spend time on it
19:51:09 <clarkb> I'd suggest the PTG but the PTG conflicts with spring break around here so I'm trying to limit my PTG commitments :)
19:51:30 <clarkb> But maybe a higher bw call type setup the week before PTG or something?
19:52:19 <clarkb> Let me get through next week's travel and then try to put something together for that
19:52:32 <clarkb> #topic Open Discussion
19:53:02 <clarkb> As mentioned at the beginning of the meeting I'll make my service coordinator nomination official in an hour or so after lunch assuming no one beats me to it
19:53:35 <clarkb> Zuul's sqlalchemy 2.0 change merged earlier today. I may try to kick off a zuul restart sooner than the regularly scheduled weekend restart just to get that checked more quickly
19:56:16 <clarkb> anything else?
19:57:24 <ianw> not from me, thanks!
19:57:59 <clarkb> Thank you everyone for your time during this meeting but also for contributing to OpenDev. We'll skip next week's meeting and be back here in two weeks
19:58:01 <clarkb> #endmeeting