19:01:08 <clarkb> #startmeeting infra
19:01:08 <opendevmeet> Meeting started Tue Feb 28 19:01:08 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:08 <opendevmeet> The meeting name has been set to 'infra'
19:01:19 <clarkb> Hello everyone, it's been a couple of weeks since we had one of these
19:01:31 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V2UYFDWIGJPXVEJRLIAF7WUNUMDGCJCI/ Our Agenda
19:01:41 <clarkb> #topic Announcements
19:02:11 <clarkb> The only Service Coordinator nomination I saw was the one I sent in. I believe that makes me it again. But if there was another and I missed it please call that out soon
19:02:41 <fungi> thank you for your service!
19:03:20 <clarkb> #topic Topics
19:03:28 <clarkb> #topic Bastion Host Updates
19:03:34 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:03:51 <clarkb> This stack of changes got some edits after I did my first pass of reviews. I need to do another pass of reviews, hopefully this afternoon
19:04:03 <clarkb> ianw: anything specific to call out from that?
19:04:29 <ianw> nope, yeah just wants another look, i think i responded to all comments
19:04:56 <clarkb> any other bridge related activities?
19:06:40 <clarkb> #topic Mailman 3
19:06:56 <clarkb> The db migration stuff should be addressed now. Thank you ianw for that change
19:07:26 <clarkb> fungi got a response from upstream on how to create domains. Apparently you run some python script and don't do migrations. I still find it confusing, but we should be able to sort it out on a held node
19:07:34 <fungi> no progress on the domain piece yet, but one of the maintainers did get back to me with clearer explanations for site creation which doesn't look too complicated. unfortunately they also pointed out that postorius's host associations aren't really api-driven and we'll need to reverse engineer it from the webui code
19:07:41 <fungi> last week was a complete wash, trying to catch back up now that i'm home again
19:07:56 <clarkb> ya I'm in a similar boat as we were both traveling
19:08:13 <clarkb> The good news is that we have better directions now
19:08:44 <clarkb> fungi: other than needing to try out upstream's suggestions is there anything else we need to be thinking about here? anything that needs help?
19:09:00 <fungi> planning additional migrations
19:09:03 <fungi> and the upgrade
19:09:32 <fungi> upgrade change is already proposed, but we probably want to do the host separation deployment fixes first
19:09:36 <clarkb> and did we decide on a preferred order for that?
19:09:42 <clarkb> ack
19:09:55 <clarkb> domain host separation, upgrade, migrations?
19:10:00 <clarkb> er in that order I mean
19:10:04 <fungi> yeah, i think so
19:10:23 <fungi> i'm hoping to knock out airship, openinfra and starlingx in march if possible
19:10:32 <fungi> oh, and katacontainers
19:10:42 <fungi> that might be a bit ambitious, we'll see
19:10:44 <clarkb> sounds good
19:10:56 <fungi> aiming for openstack migration in april or maybe may
19:11:23 <fungi> and then we can clean up the old servers
19:11:42 <clarkb> exciting
19:11:54 <fungi> anyway, that's it for me on this topic
19:12:00 <clarkb> #topic Gerrit Updates
19:12:37 <clarkb> There are two long standing issues related to this: the java 17 switch and the ssh connection channel stuff, both of which I'll bring up at the gerrit community meeting in 2 days
19:12:54 <clarkb> Hopefully that gets both of those moving for us or at least gets us better direction
19:13:01 <corvus> ssh connection channel stuff?
19:13:14 <clarkb> #link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.
19:13:16 <clarkb> corvus: ^
19:13:43 <clarkb> it seems to be a minor issue but makes big scary warnings in the logs. ianw has run it down some and likely wrote a bug fix for it but last I checked it hasn't merged
19:13:52 <clarkb> though maybe I didn't set appropriate warning bells on that change to see it merge
19:13:58 <ianw> yeah no comments on it
19:14:23 <clarkb> #link https://gerrit-review.googlesource.com/c/gerrit/+/358314 Possible gerrit ssh channel fix
19:15:12 <clarkb> but ya I'll try to get some movement on that Thursday at 8am pacific in the gerrit community meeting
19:15:23 <ianw> thanks!
19:15:31 <clarkb> Yesterday we had some Gerrit fun too which ended up exposing a couple of things
19:15:51 <clarkb> The first is that after the change of base images I set the java package to the jre-headless package on debian which doesn't include debugging tools like jcmd
19:16:20 <clarkb> jcmd ended up being unnecessary to get a thread dump as I could do kill -3 instead. That said it seems like a good idea to have jdk tools in place since you don't know you'll need them until you need them
19:16:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/875553 Install full jdk on gerrit images
19:16:44 <clarkb> That change will add the extra tools by installing the full jdk headless package instead
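For context, a minimal sketch of the two thread dump approaches mentioned above; the docker-compose service name and the in-container java pid are assumptions for illustration, not details taken from the production setup:

    # SIGQUIT asks the JVM to print a full thread dump to its console log,
    # no JDK tools required; with docker-compose that output normally lands
    # in the container log. Service name "gerrit" is assumed.
    docker-compose kill -s SIGQUIT gerrit
    docker-compose logs --tail=500 gerrit

    # With the full JDK installed, jcmd can produce the same dump on demand
    # (assuming the java process is pid 1 inside the container):
    docker-compose exec gerrit jcmd 1 Thread.print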
19:16:47 <corvus> fyi the stream-events/"recheck" problem in gerrit 3.7 should be fixed in 3.7.1 (i have not verified this, but the expected fix is in the log for the latest release)
19:16:53 <corvus> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16475 stream-events issue with zuul expected fix is in gerrit 3.7.1
19:17:02 <clarkb> ack
19:17:35 <corvus> i like adding the tools to the img
19:17:37 <clarkb> The other thing that came up in my debugging of the issues yesterday is that several gerrit plugins expect to be able to write to review_site/data in a persistent manner
19:18:32 <clarkb> in particular the delete-project plugin "archives" deleted repos to data/ and deletes them from that location after some time. The plugin manager uses it for something I haven't figured out yet. Replication plugin uses it to persist replication tasks across gerrit restarts
19:18:49 <clarkb> it is this last one that is most interesting to us since I'm also working on gitea server replacements
19:18:53 <corvus> i also like the idea of bind-mounting that dir, even if not strictly necessary. our intent really was to use containers for "packaging convenience" and not really rely on volume management, etc. so bind-mounting to achieve the normal installation experience makes sense.
19:19:17 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/875570 bind mount gerrit's data dir
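A minimal sketch of the kind of bind mount being proposed there; the host and container paths are assumptions for illustration, not necessarily what the change uses:

    # docker-compose.yaml fragment; other existing mounts elided
    services:
      gerrit:
        volumes:
          # persist plugin state (replication queue, delete-project
          # archives, etc.) across container rebuilds
          - /home/gerrit2/review_site/data:/var/gerrit/data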
19:19:33 <fungi> technically it also highlighted a third thing: that there's still academic research interest in opendev activities!
19:20:02 <clarkb> I'll talk more about replication specific things when I get to gitea activities, but ya I think this is a good thing
19:21:23 <clarkb> Speaking of gerrit 3.7, one of the things we need to do is update our acls. ianw you sent email about the first change to do that. Did the change get applied yet? I think maybe not, as you weren't around last week?
19:21:35 <clarkb> no worries. Just want to get up to speed if that did happen
19:21:40 <corvus> i know the stream-events issue was an upgrade blocker; with that [presumably] resolved, is anything blocking an upgrade to 3.7.1 now?
19:21:55 <fungi> acls ;)
19:22:01 <clarkb> corvus: yes, we need to convert all our acls to 3.7 acceptable formats
19:22:09 <corvus> strictly speaking they shouldn't be a blocker
19:22:19 <clarkb> oh?
19:22:25 <fungi> true, just need to merge the transformation
19:22:27 <corvus> at least, from a tech standpoint. i can get behind us wanting to have them in place though.
19:22:49 <corvus> gerrit has backwards compat for the old stuff -- it's only if you want to change an acl that it comes into play
19:22:50 <clarkb> ya I think gerrit will start refusing to accept acl updates in the affected locations. And that is likely to be confusing for users
19:22:59 <corvus> yes that
19:23:00 <clarkb> best to get things converted upfront and avoid confusion
19:23:15 <corvus> so the "copy pasta" approach that happens so often would not go well
19:23:19 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs
19:23:31 <clarkb> this is the first step (but not a complete conversion of everything that needs doing)
19:23:32 <fungi> folks will be confused enough when they cargo-cult an old project creation change and it gets rejected
19:23:40 <corvus> fungi: yep
19:23:47 <ianw> sorry -- no that's not done yet
19:24:31 <clarkb> so ya I would suggest we do as much of the converting on 3.6 as we can. Then upgrade to 3.7.1 or newer
19:24:37 <ianw> we probably want to start our gerrit 3.7 upgrade checklist page
19:24:44 <fungi> so while not technically a blocker, setting a correct example with the current acls we have in project-config will hopefully defray some of that
19:24:50 <clarkb> ianw: ++
19:24:58 <ianw> i can do that, so we start to have a checklist of things we know to work on
19:25:13 <fungi> also probably our gerrit integration testing will break if our test acls don't at least have the correct format
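For anyone unfamiliar with the conversion being discussed, a generic illustration of its shape; this is not copied from an actual opendev ACL:

    # Before: flag-style copy rules in a project.config label section,
    # the deprecated form being cleaned up
    [label "Code-Review"]
        copyMinScore = true
        copyAllScoresOnTrivialRebase = true
        copyAllScoresIfNoCodeChange = true

    # After: the single copyCondition expression newer Gerrit expects
    [label "Code-Review"]
        copyCondition = is:MIN OR changekind:TRIVIAL_REBASE OR changekind:NO_CODE_CHANGE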
19:25:21 <clarkb> corvus: do you know if java 17 is required for 3.7?
19:25:28 <clarkb> 3.6 was the first release to "support" it
19:25:37 <clarkb> we may want to do that conversion pre 3.7 as well
19:25:50 <clarkb> and hopefully the gerrit community can clarify that in thursday's meeting
19:26:16 <corvus> clarkb: i don't know off hand
19:26:57 <clarkb> I'll try to run that down
19:27:50 <clarkb> Ok let's move on to the next topic which has some overlap
19:27:56 <clarkb> #topic Upgrading Old Servers
19:28:11 <clarkb> As mentioned in our last meeting I think we need to prioritize gitea backends, nameservers, and etherpad
19:28:31 <clarkb> I've started with gitea and have made quite a bit of progress. I'll try to summarize what I've done so far and then what still needs to be done
19:28:57 <clarkb> Gitea09 has been booted in vexxhost sjc1 using a modern v3 flavor there with 8vcpu and 32GB memory and a built-in 120GB disk (no BFV)
19:29:12 <clarkb> This is a bit larger than our old servers and I think we may end up running fewer gitea backends as a result
19:29:41 <fungi> well, also most of our gitea semi-outages have been due to memory exhaustion/swap thrash
19:30:05 <clarkb> I added the gitea09 server to our gitea group in ansible and let ansible deploy a complete gitea server, but without git repo content, to it. I then transplanted the database from gitea01 to gitea09 to preserve redirects
19:30:05 <fungi> so it might help there anyway
19:30:08 <clarkb> ++
19:30:37 <clarkb> After transplanting the database I discovered that some old orgs that are no longer in projects.yaml no longer had working logos. I fixed this by manually copying files for them
19:30:51 <clarkb> So far the db transplant and the copying of the ~4 logos are the only manual interventions I've had to do
19:31:21 <clarkb> I then added gitea09 to gerrit's replication config and the gerrit restarts yesterday picked that up. I triggered a full sync to gitea09 which appears to be near completion.
19:31:44 <fungi> i would stick with 8 backends if we can. the reason the recommended flavors changed is that the memory-to-cpu ratio in the underlying hardware is higher and so the provider had lots of ram going unused on the servers anyway
19:31:58 <clarkb> The next steps I've got in mind are to do another full resync (for all 9 giteas) to make sure the gerrit restarts and problems yesterday didn't introduce problems with replication
19:32:22 <clarkb> At that point I think we can add gitea09 to haproxy and have it in production
19:32:44 <clarkb> Then I would like to upgrade gitea to 1.18.5
19:32:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/875533 upgrade gitea to 1.18.5
19:33:20 <clarkb> In parallel with that I'd also like to update Gerrit to autoreload replication configs
19:33:36 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/874340 Gerrit replication autoreloading.
19:33:49 <clarkb> This will allow us to add more giteas and remove old giteas without gerrit restarts.
19:34:15 <clarkb> Previously we had removed autoreloading because we had noticed it would lose replication tasks on reload and the giteas would not all be in sync
19:34:35 <clarkb> However, the data/ dir storage used by the replication plugin should mitigate this and after discussion with nasserg I think if we bind mount that properly we should be good
19:34:59 <fungi> yeah, i'm happy to try it again
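The autoreload behaviour under discussion is a setting in the replication plugin's configuration; a rough sketch, where the remote stanza is a placeholder rather than the real production config:

    # etc/replication.config fragment
    [gerrit]
        # pick up edits to this file without a gerrit restart
        autoReload = true

    [remote "gitea09"]
        # placeholder url; the real remotes carry the production paths
        url = git@gitea09.opendev.org:opendev/${name}.git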
19:35:18 <clarkb> Once all of ^ is settled we can remove gitea08 and test this hypothesis. Then I'll probably boot 3 more giteas and build them out assembly line style and we can add them to production in bulk
19:35:22 <fungi> the issues it caused last time were only as disruptive as they were because it took us so long to identify the underlying cause
19:35:30 <fungi> we'll be on the lookout for it this time anyway
19:35:33 <clarkb> ++
19:35:37 <corvus> (i just want to clarify that a gerrit restart (via docker stop/start/restart) shouldn't delete the data dir -- only a docker-compose down + up would do that)
19:35:44 <clarkb> corvus: correct
19:35:59 <clarkb> this was exercised yesterday with the restarts that didn't pull new images
19:36:17 <clarkb> which confirmed the replication tasks persist across those restarts if data/ content is preserved
19:36:20 <fungi> however, older gerrit versions didn't persist that queue to disk at all, with obvious repercussions
19:36:29 <corvus> okay, i just wanted to get that out there and make sure we still expect that there have been some improvements (eg, replication actually writing to that dir) to make us think it will work better now
19:36:46 <corvus> clarkb, fungi: excellent that explains it perfectly, thanks
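A quick sketch of the distinction being drawn there, with the service name assumed for illustration:

    # Restarting keeps the same container, so files written inside the
    # container filesystem (an un-bind-mounted data/ dir, for example) survive:
    docker-compose restart gerrit

    # down removes the container; the one recreated by up -d starts from the
    # image again, so only bind-mounted paths carry state over:
    docker-compose down
    docker-compose up -d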
19:37:01 <clarkb> and then if we need/want another 4 I figure I'll do those in another batch together too
19:37:25 <clarkb> The only other consideration is when to move the gitea db backups off of gitea01. I figure I can do that once gitea09 is in production and running happily
19:37:30 <ianw> with the logos, iirc, we walk the list of orgs via the api and copy in the logos. was the problem the old orgs had been setup to an old logo that wasn't copied?
19:37:32 <clarkb> that way we can remove gitea01 at any point
19:37:43 <clarkb> ianw: the problem is the old orgs are not in projects.yaml anymore
19:37:55 <clarkb> ianw: and so they really only exist in the context of our rename history and in the gitea db
19:37:59 <fungi> yeah, it's an issue with renamed orgs
19:38:10 <clarkb> ianw: all I had to do was copy the opendev logo file into a file named after the old org name
19:38:35 <ianw> i'm just wondering if we should codify that in the logo-setting role
19:39:00 <clarkb> ianw: we probably can. We'd need to inspect the rename history files to know what those orgs are I think
19:39:01 <ianw> it doesn't look at projects.yaml, but the gitea api. but it could then also have an extra list of "old" names to also copy
19:39:22 <clarkb> it's also an easy manual workaround so I didn't want to hold things up on it
19:39:27 <clarkb> but happy for the help/improvements
19:39:58 <clarkb> but ya I think we've got a process largely sorted out now except for the gerrit replication config updates. Reviews on those changes much appreciated
19:40:26 <clarkb> and I'll keep working on this for the next bit until we've removed the old gitea servers. Then I can look at the next server type to upgrade.
19:41:02 <clarkb> any other questions/ideas/comments about gitea or server upgrades?
19:41:25 <clarkb> oh I have one actually. There is a request for a project rename. I'd like to request we do not rename anything until the gitea replacements are done
19:41:48 <clarkb> That's just one extra moving part I don't want to have to keep track of as I go through this :)
19:42:50 <clarkb> #topic Handing over x/virtualdpu to OpenStack Ironic PTL
19:43:10 <clarkb> is it dpu or pdu? I may have typoed. fungi want to fill us in?
19:43:19 <fungi> this should hopefully be fairly straightforward, but i'll start with some background
19:43:54 <fungi> once upon a time, developers at internap wrote a virtual pdu project, and the openstack ironic project came to depend on it
19:44:21 <fungi> virtualpdu, as it was called, was never officially added to openstack, but ironic still relies on it
19:44:44 <fungi> the internap developers moved on, and no longer work at internap nor on virtualpdu, so it's basically abandoned
19:45:10 <fungi> the openstack ironic developers would like to adopt it, but reaching people with control of the current acl was... hard
19:46:08 <fungi> rpittau was finally able to get a response from mathieu mitchell, who expressed approval of the ironic team taking control of the repository, but since then has not replied further nor updated the acl
19:46:53 <fungi> no other virtualpdu maintainers replied at all (but all were cc'd)
19:46:56 <fungi> i have copies of the e-mail messages, complete with received headers, in case there's some dispute over things
19:47:23 <clarkb> My suggestion would be to publicly post intent to do the change in ownership to service-announce and openstack-discuss, cc the old maintainers, and post a date when we'll make the change. They can concede or fight it in that period and if we hear nothing we make the change
19:47:42 <fungi> probably the next step is to post the intent to hand over access to openstack on a mailing list as well as cc the listed maintainer addresses, and set a date
19:47:54 <clarkb> I don't think anyone has nefarious intent here, it's a useful tool and the old group isn't interested. The new group is, so we should support that
19:47:57 <fungi> if we hear no objections, then move forward granting control
19:48:12 <clarkb> fungi: ++
19:48:57 <ianw> all seems fine -- if there was negative feedback it would be harder
19:49:12 <fungi> exactly
19:49:40 <fungi> so, openstack-discuss? service-announce? suggestions on what mailing list(s) is/are most appropriate to post the notice of intent?
19:50:13 <clarkb> I feel like service-announce at least since it's something happening at the opendev level
19:50:16 <fungi> we don't really have a codified process since this is basically the first time it has ever come up
19:50:31 <clarkb> but then openstack-discuss too might help get the email in front of the right eyeballs if someone did want to object
19:50:46 <fungi> yeah, i can post copies to both of those
19:51:34 <fungi> mainly bringing it up in the meeting here to make sure we've got some consensus among opendev sysadmins on a prototype process, since this is potential precedent for future similar cases
19:52:18 <fungi> seems like we have no objections to the proposal anyway
19:52:28 <clarkb> yup. I think starting by contacting maintainers directly and trying to resolve it without opendev involvement is the first step. Then if we don't get objections but also don't resolve it, making a public announcement of the plan with a period to object is a good process
19:52:36 <ianw> ++
19:52:44 <fungi> i'll try to get something sent out to mailing lists and the current maintainers tomorrow in that case
19:53:32 <fungi> nothing else from me on this topic, unless anyone has questions
19:53:40 <clarkb> #topic Works on ARM feedback
19:53:48 <clarkb> we are almost out of time and I wanted to get to this really quickly
19:54:06 <clarkb> The works on arm folks have asked us to talk about what we've done the last 6 months with their program.
19:54:14 <clarkb> I've started a draft response in an etherpad
19:54:17 <clarkb> #link https://etherpad.opendev.org/p/3DcVXw0PBOknv1bgyZWh
19:54:33 <clarkb> unfortunately we don't actually have 6 months of use, which I try to clarify in the email
19:54:48 <clarkb> ianw: maybe we tighten up the bit fungi had concerns about, then one of us can send that soon?
19:55:03 <ianw> ++
19:55:49 <fungi> also some cleanup as soon as your vhost change deploys
19:55:52 <clarkb> ok cool, let's sync on that after the meeting
19:56:00 <clarkb> fungi: yup
19:56:05 <clarkb> #topic Open Discussion
19:56:07 <clarkb> Anything else?
19:57:43 <ianw> not from me, thanks for once again running the meeting!
19:57:49 <clarkb> Sounds like that may be it. Thank you for your time today. We'll be back here same time and location next week
19:57:56 <clarkb> #endmeeting