19:01:08 <clarkb> #startmeeting infra
19:01:08 <opendevmeet> Meeting started Tue Feb 28 19:01:08 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:08 <opendevmeet> The meeting name has been set to 'infra'
19:01:19 <clarkb> Hello everyone, it's been a couple of weeks since we had one of these
19:01:31 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V2UYFDWIGJPXVEJRLIAF7WUNUMDGCJCI/ Our Agenda
19:01:41 <clarkb> #topic Announcements
19:02:11 <clarkb> The only Service Coordinator nomination I saw was the one I sent in. I believe that makes me it again. But if there was another and I missed it please call that out soon
19:02:41 <fungi> thank you for your service!
19:03:20 <clarkb> #topic Topics
19:03:28 <clarkb> #topic Bastion Host Updates
19:03:34 <clarkb> #link https://review.opendev.org/q/topic:bridge-backups
19:03:51 <clarkb> This stack of changes got some edits after I did my first pass of reviews. I need to do another review pass, hopefully this afternoon
19:04:03 <clarkb> ianw: anything specific to call out from that?
19:04:29 <ianw> nope, yeah just wants another look, i think i responded to all comments
19:04:56 <clarkb> any other bridge related activities?
19:06:40 <clarkb> #topic Mailman 3
19:06:56 <clarkb> The db migration stuff should be addressed now. Thank you ianw for that change
19:07:26 <clarkb> fungi got a response from upstream on how to create domains. Apparently you run some python script and don't do migrations. I still find it confusing, but we should be able to sort it out on a held node
19:07:34 <fungi> no progress on the domain piece yet, but one of the maintainers did get back to me with clearer explanations for site creation, which doesn't look too complicated. unfortunately they also pointed out that postorius's host associations aren't really api-driven and we'll need to reverse engineer it from the webui code
19:07:41 <fungi> last week was a complete wash, trying to catch back up now that i'm home again
19:07:56 <clarkb> ya I'm in a similar boat as we were both traveling
19:08:13 <clarkb> The good news is that we have better directions now
19:08:44 <clarkb> fungi: other than needing to try out upstream's suggestions is there anything else we need to be thinking about here? anything that needs help?
19:09:00 <fungi> planning additional migrations
19:09:03 <fungi> and the upgrade
19:09:32 <fungi> upgrade change is already proposed, but we probably want to do the host separation deployment fixes first
19:09:36 <clarkb> and did we decide on a preferred order for that?
19:09:42 <clarkb> ack
19:09:55 <clarkb> domain host separation, upgrade, migrations ?
19:10:00 <clarkb> er in that order I mean
19:10:04 <fungi> yeah, i think so
19:10:23 <fungi> i'm hoping to knock out airship, openinfra and starlingx in march if possible
19:10:32 <fungi> oh, and katacontainers
19:10:42 <fungi> that might be a bit ambitious, we'll see
19:10:44 <clarkb> sounds good
19:10:56 <fungi> aiming for openstack migration in april or maybe may
19:11:23 <fungi> and then we can clean up the old servers
19:11:42 <clarkb> exciting
19:11:54 <fungi> anyway, that's it for me on this topic
19:12:00 <clarkb> #topic Gerrit Updates
19:12:37 <clarkb> There are two long-standing issues related to this: the java 17 switch and the ssh connection channel stuff. I'll bring both up at the gerrit community meeting in 2 days
19:12:54 <clarkb> Hopefully that gets both of those moving for us or at least better direction
19:13:01 <corvus> ssh connection channel stuff?
19:13:14 <clarkb> #link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.
19:13:16 <clarkb> corvus: ^
19:13:43 <clarkb> it seems to be a minor issue but makes big scary warnings in the logs. ianw has run it down some and wrote a likely fix for it, but last I checked it hasn't merged
19:13:52 <clarkb> though maybe I didn't set up the right notifications on that change to see it merge
19:13:58 <ianw> yeah no comments on it
19:14:23 <clarkb> #link https://gerrit-review.googlesource.com/c/gerrit/+/358314 Possible gerrit ssh channel fix
19:15:12 <clarkb> but ya I'll try to get some movement on that Thursday at 8am pacific in the gerrit community meeting
19:15:23 <ianw> thanks!
19:15:31 <clarkb> Yesterday we had some Gerrit fun too which ended up exposing a couple of things
19:15:51 <clarkb> The first is that after the change of base images I set the java package to the jre-headless package on debian, which doesn't include debugging tools like jcmd
19:16:20 <clarkb> jcmd ended up being unnecessary to get a thread dump as I could do kill -3 instead. That said it seems like a good idea to have jdk tools in place since you don't know you'll need them until you need them
19:16:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/875553 Install full jdk on gerrit images
19:16:44 <clarkb> That change will add the extra tools by installing the full jdk headless package instead
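For reference, either of these grabs a thread dump from the running JVM (the pid placeholder below is illustrative); the second is the sort of tool that change adds:

    kill -3 <gerrit-jvm-pid>             # SIGQUIT; the JVM writes the thread dump to its console/error log
    jcmd <gerrit-jvm-pid> Thread.print   # jcmd ships with the jdk tools, not the bare jre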
19:16:47 <corvus> fyi the stream-events/"recheck" problem in gerrit 3.7 should be fixed in 3.7.1 (i have not verified this, but the expected fix is in the log for the latest release)
19:16:53 <corvus> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16475 stream-events issue with zuul expected fix is in gerrit 3.7.1
19:17:02 <clarkb> ack
19:17:35 <corvus> i like adding the tools to the img
19:17:37 <clarkb> The other thing that came up in my debugging of the issues yesterday is that several gerrit plugins expect to be able to write to review_site/data in a persistent manner
19:18:32 <clarkb> in particular the delete-project plugin "archives" deleted repos to data/ and deletes them from that location after some time. The plugin manager uses it for something I haven't figured out yet. Replication plugin uses it to persist replication tasks across gerrit restarts
19:18:49 <clarkb> it is this last one that is most interesting to us since I'm also working on gitea server replacements
19:18:53 <corvus> i also like the idea of bind-mounting that dir, even if not strictly necessary.  our intent really was to use containers for "packaging convenience" and not really rely on volume management, etc.  so bind-mounting to achieve the normal installation experience makes sense.
19:19:17 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/875570 bind mount gerrit's data dir
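As a rough sketch of what that change amounts to (the paths here are illustrative, not necessarily the exact ones in system-config), the gerrit compose file gains a bind mount along these lines:

    volumes:
      # keep review_site/data on the host so replication, delete-project and
      # plugin-manager state survives a docker-compose down/up
      - /home/gerrit2/review_site/data:/var/gerrit/data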
19:19:33 <fungi> technically it also highlighted a third thing: that there's still academic research interest in opendev activities!
19:20:02 <clarkb> I'll talk more about replication specific things when I get to gitea activities, but ya I think this is a good thing
19:21:23 <clarkb> Speaking of gerrit 3.7 one of the things we need to do is update our acls. ianw you sent email about the first change to do that. Did the change get applied yet? I think maybe not as you weren't around last week?
19:21:35 <clarkb> no worries. Just want to get up to speed if that did happen
19:21:40 <corvus> i know the stream-events issue was an upgrade blocker; with that [presumably] resolved, is anything blocking an upgrade to 3.7.1 now?
19:21:55 <fungi> acls ;)
19:22:01 <clarkb> corvus: yes, we need to convert all our acls to 3.7 acceptable formats
19:22:09 <corvus> strictly speaking they shouldn't be a blocker
19:22:19 <clarkb> oh?
19:22:25 <fungi> true, just need to merge the transformation
19:22:27 <corvus> at least, from a tech standpoint.  i can get behind us wanting to have them in place though.
19:22:49 <corvus> gerrit has backwards compat for the old stuff -- it's only if you want to change an acl that it comes into play
19:22:50 <clarkb> ya I think gerrit will start refusing to accept acl updates in the affected locations. And that is likely to be confusing for users
19:22:59 <corvus> yes that
19:23:00 <clarkb> best to get things converted upfront and avoid confusion
19:23:15 <corvus> so the "copy pasta" approach that happens so often would not go well
19:23:19 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs
19:23:31 <clarkb> this is the first step (but not a complete conversion of everything that needs doing)
19:23:32 <fungi> folks will be confused enough when they cargo-cult an old project creation change and it gets rejected
19:23:40 <corvus> fungi: yep
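For anyone following along, the conversion is roughly of this shape; the label and predicate below are illustrative rather than a specific OpenDev ACL:

    # pre-3.7 flag, which 3.7 will start refusing in new ACL updates
    [label "Verified"]
        copyAllScoresIfNoCodeChange = true

    # 3.7-style equivalent
    [label "Verified"]
        copyCondition = changekind:NO_CODE_CHANGE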
19:23:47 <ianw> sorry -- no that's not done yet
19:24:31 <clarkb> so ya I would suggest we do as much of the converting on 3.6 as we can. Then upgrade to 3.7.1 or newer
19:24:37 <ianw> we probably want to start our gerrit 3.7 upgrade checklist page
19:24:44 <fungi> so while not technically a blocker, setting a correct example with the current acls we have in project-config will hopefully head off some of that
19:24:50 <clarkb> ianw: ++
19:24:58 <ianw> i can do that, so we start to have a checklist of things we know to work on
19:25:13 <fungi> also probably our gerrit integration testing will break if our test acls don't at least have the correct format
19:25:21 <clarkb> corvus: do you know if java 17 is required for 3.7?
19:25:28 <clarkb> 3.6 was the first release to "support" it
19:25:37 <clarkb> we may want to do that conversion pre 3.7 as well
19:25:50 <clarkb> and hopefully gerrit community can clarify that in thursday's meeting
19:26:16 <corvus> clarkb: i don't know off hand
19:26:57 <clarkb> I'll try to run that down
19:27:50 <clarkb> Ok let's move on to the next topic which has some overlap
19:27:56 <clarkb> #topic Upgrading Old Servers
19:28:11 <clarkb> As mentioned in our last meeting I think we need to prioritize gitea backends, nameservers, and etherpad
19:28:31 <clarkb> I've started with gitea and have made quite a bit of progress. I'll try to summarize what I've done so far and then what still needs to be done
19:28:57 <clarkb> Gitea09 has been booted in vexxhost sjc1 using a modern v3 flavor there with 8vcpu, 32GB memory, and a built-in 120GB disk (no BFV)
19:29:12 <clarkb> This is a bit larger than our old servers and I think we may end up running fewer gitea backends as a result
19:29:41 <fungi> well, also most of our gitea semi-outages have been due to memory exhaustion/swap thrash
19:30:05 <clarkb> I added the gitea09 server to our gitea group in ansible and let ansible deploy a complete gitea server to it, minus the git repo content. I then transplanted the database from gitea01 to gitea09 to preserve redirects
19:30:05 <fungi> so it might help there anyway
19:30:08 <clarkb> ++
19:30:37 <clarkb> After transplanting the database I discovered that some old orgs that are no longer in projects.yaml no longer had working logos. I fixed this by manually copying files for them
19:30:51 <clarkb> So far the db transplant and the copying of the ~4 logos are the only manual interventions I've had to do
19:31:21 <clarkb> I then added gitea09 to gerrit's replication config and the gerrit restarts yesterday picked that up. I triggered a full sync to gitea09 which appears to be near completion.
19:31:44 <fungi> i would stick with 8 backends if we can. the reason the recommended flavors changed is that the memory-to-cpu ratio in the underlying hardware is higher and so the provider had lots of ram going unused on the servers anyway
19:31:58 <clarkb> The next steps I've got in mind are to do another full resync (for all 9 giteas) to make sure yesterday's gerrit restarts and problems didn't leave replication in a bad state
19:32:22 <clarkb> At that point I think we can add gitea09 to haproxy and have it in production
19:32:44 <clarkb> Then I would like to upgrade gitea to 1.18.5
19:32:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/875533 upgrade gitea to 1.18.5
19:33:20 <clarkb> In parallel with that I'd also like to update Gerrit to autoreload replication configs
19:33:36 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/874340 Gerrit replication autoreloading.
19:33:49 <clarkb> This will allow us to add more giteas and remove old giteas without gerrit restarts.
19:34:15 <clarkb> Previously we had removed autoreloading because we had noticed it would lose replication tasks on reload and the giteas would not all be in sync
19:34:35 <clarkb> However, the data/ dir storage used by the replication plugin should mitigate this and after discussion with nasserg I think if we bind mount that properly we should be good
19:34:59 <fungi> yeah, i'm happy to try it again
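For context, the knob in question is the replication plugin's own setting; a minimal sketch of the relevant bit of replication.config (the actual change may differ) would be:

    [gerrit]
        # re-read replication.config when it changes instead of requiring a Gerrit restart
        autoReload = true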
19:35:18 <clarkb> Once all of ^ is settled we can remove gitea08 and test this hypothesis. Then I'll probably boot 3 more giteas and build them out assembly line style and we can add them to production in bulk
19:35:22 <fungi> the issues it caused last time were only as disruptive as they were because it took us so long to identify the underlying cause
19:35:30 <fungi> we'll be on the lookout for it this time anyway
19:35:33 <clarkb> ++
19:35:37 <corvus> (i just want to clarify that a gerrit restart (via docker stop/start/restart)  shouldn't delete the data dir -- only a docker-compose down + up would do that)
19:35:44 <clarkb> corvus: correct
19:35:59 <clarkb> this was exercised yesterday with the restarts that didn't pull new images
19:36:17 <clarkb> which confirmed the replication tasks persist across those restarts if data/ content is preserved
19:36:20 <fungi> however, older gerrit versions didn't persist that queue to disk at all, with obvious repercussions
19:36:29 <corvus> okay, i just wanted to get that out there and make sure there have been some improvements (eg, replication actually writing to that dir) that make us think it will work better now
19:36:46 <corvus> clarkb fungi: excellent, that explains it perfectly, thanks
19:37:01 <clarkb> and then if we need/want another 4 I figure I'll do those in another batch together too
19:37:25 <clarkb> The only other consideration is when to move the gitea db backups off of gitea01. I figure I can do that once gitea09 is in production and running happily
19:37:30 <ianw> with the logos, iirc, we walk the list of orgs via the api and copy in the logos.  was the problem that the old orgs had been set up with an old logo that wasn't copied?
19:37:32 <clarkb> that way we can remove gitea01 at any point
19:37:43 <clarkb> ianw: the problem is the old orgs are not in projects.yaml anymore
19:37:55 <clarkb> ianw: and so they really only exist in the context of our rename history and in the gitea db
19:37:59 <fungi> yeah, it's an issue with renamed orgs
19:38:10 <clarkb> ianw: all I had to do was copy the opendev logo file into a file named after the old org name
19:38:35 <ianw> i'm just wondering if we should codify that in the logo-setting role
19:39:00 <clarkb> ianw: we probably can. We'd need to inspect the rename history files to know what those orgs are I think
19:39:01 <ianw> it doesn't look at projects.yaml, but the gitea api.  but it could then also have an extra list of "old" names to also copy
19:39:22 <clarkb> it's also an easy manual workaround so I didn't want to hold things up on it
19:39:27 <clarkb> but happy for the help/improvements
19:39:58 <clarkb> but ya I think we've got a process largely sorted out now except for the gerrit replication config updates. Reviews on those changes much appreciated
19:40:26 <clarkb> and I'll keep working on this for the next bit until we've removed the old gitea servers. Then I can look at the next server type to upgrade.
19:41:02 <clarkb> any other questions/ideas/comments about gitea or server upgrades?
19:41:25 <clarkb> oh I have one actually. There is a request for a project rename. I'd like to request we do not rename anything until the gitea replacements are done
19:41:48 <clarkb> Thats just one extra moving part I don't want to have to keep track of as I go through this :)
19:42:50 <clarkb> #topic Handing over x/virtualdpu to OpenStack Ironic PTL
19:43:10 <clarkb> is it dpu or pdu? I may have typoed. fungi want to fill us in?
19:43:19 <fungi> this should hopefully be fairly straightforward, but i'll start with some background
19:43:54 <fungi> once upon a time, developers at internap wrote a virtual pdu project, and the openstack ironic project came to depend on it
19:44:21 <fungi> virtualpdu, as it was called, was never officially added to openstack, but ironic still relies on it
19:44:44 <fungi> the internap developers moved on, and no longer work at internap nor on virtualpdu, so it's basically abandoned
19:45:10 <fungi> the openstack ironic developers would like to adopt it, but reaching people with control of the current acl was... hard
19:46:08 <fungi> rpittau was finally able to get a response from mathieu mitchell, who expressed approval of the ironic team taking control of the repository, but since then has not replied further nor updated the acl
19:46:53 <fungi> no other virtualpdu maintainers replied at all (but all were cc'd)
19:46:56 <fungi> i have copies of the e-mail messages, complete with received headers, in case there's some dispute over things
19:47:23 <clarkb> My suggestion would be to publicly post intent to do the change in ownership to service-announce and openstack-discuss, cc the old maintainers, and give a date when we'll make the change. They can concede or fight it in that period and if we hear nothing we make the change
19:47:42 <fungi> probably the next step is to post the intent to hand over access to openstack on a mailing list as well as cc the listed maintainer addresses, and set a date
19:47:54 <clarkb> I don't think anyone has nefarious intent here, it's a useful tool and the old group isn't interested. The new group is, so we should support that
19:47:57 <fungi> if we hear no objections, then move forward granting control
19:48:12 <clarkb> fungi: ++
19:48:57 <ianw> all seems fine -- if there was negative feedback would be harder
19:49:12 <fungi> exactly
19:49:40 <fungi> so, openstack-discuss? service-announce? suggestions on what mailing list(s) is/are most appropriate to post the notice of intent?
19:50:13 <clarkb> I feel like service-announce at least since it's something happening at the opendev level
19:50:16 <fungi> we don't really have a codified process since this is basically the first time it has ever come up
19:50:31 <clarkb> but then openstack-discuss too might help get the email in front of the right eyeballs if someone did want to object
19:50:46 <fungi> yeah, i can post copies to both of those
19:51:34 <fungi> mainly bringing it up in the meeting here to make sure we've got some consensus among opendev sysadmins on a prototype process, since this is potential precedent for future similar cases
19:52:18 <fungi> seems like we have no objections over the proposal anyway
19:52:28 <clarkb> yup. I think starting by contacting maintainers directly and trying to resolve it without opendev involvement is the first step. Then if we don't get objections but also don't resolve it, making a public announcement of the plan with a period to object is a good process
19:52:36 <ianw> ++
19:52:44 <fungi> i'll try to get something sent out to mailing lists and the current maintainers tomorrow in that case
19:53:32 <fungi> nothing else from me on this topic, unless anyone has questions
19:53:40 <clarkb> #topic Works on ARM feedback
19:53:48 <clarkb> we are almost out of time and I wanted to get to this really quickly
19:54:06 <clarkb> The works on arm folks have asked us to talk about what we've done the last 6 months with their program.
19:54:14 <clarkb> I've started a draft response in an etherpad
19:54:17 <clarkb> #link https://etherpad.opendev.org/p/3DcVXw0PBOknv1bgyZWh
19:54:33 <clarkb> unfortunately we don't actually have 6 months of use, which i try to clarify in the email
19:54:48 <clarkb> ianw: maybe we tighten up the bit fungi had concerns about, then one of us can send that soon?
19:55:03 <ianw> ++
19:55:49 <fungi> also some cleanup as soon as your vhost change deploys
19:55:52 <clarkb> ok cool lets sync on that after the meeting
19:56:00 <clarkb> fungi: yup
19:56:05 <clarkb> #topic Open Discussion
19:56:07 <clarkb> Anything else?
19:57:43 <ianw> not from me, thanks for once again running the meeting!
19:57:49 <clarkb> Sounds like that may be it. Thank you for your time today. We'll be back here same time and location next week
19:57:56 <clarkb> #endmeeting