Tuesday, 2023-02-28

*** kopecmartin_ is now known as kopecmartin15:00
clarkbThe OpenDev team meeting begins in a couple of minutes19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Feb 28 19:01:08 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkbHello everyone, it's been a couple of weeks since we had one of these19:01
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V2UYFDWIGJPXVEJRLIAF7WUNUMDGCJCI/ Our Agenda19:01
clarkb#topic Announcements19:01
clarkbThe only Service Coordinator nomination I saw was the one I sent in. I believe that makes me it again. But if there was another and I missed it please call that out soon19:02
fungithank you for your service!19:02
clarkb#topic Topics19:03
clarkb#topic Bastion Host Updates19:03
clarkb#link https://review.opendev.org/q/topic:bridge-backups19:03
clarkbThis stack of changes got some edits after I did my first pass of reviews. I need to do another pass of reviews. Hopefully this afternoon19:03
clarkbianw: anything specific to call out from that?19:04
ianwnope, yeah just wants another look, i think i responded to all comments19:04
clarkbany other bridge related activities?19:04
clarkb#topic Mailman 319:06
clarkbThe db migration stuff should be addressed now. Thank you ianw for that change19:06
clarkbfungi got a response from upstream on how to create domains. Apparently you run some python script and don't do migrations. I still find it confusing, but should be able to sort out on a held node19:07
fungino progress on the domain piece yet, but one of the maintainers did get back to me with clearer explanations for site creation which doesn't look too complicated. unfortunately also pointed out that postorius's host associations aren't really api-driven and we'll need to reverse engineer it from the webui code19:07
fungilast week was a complete wash, trying to catch back up now that i'm home again19:07
clarkbya I'm in a similar boat as we were both traveling19:07
clarkbThe good news is that we have better directions now19:08
clarkbfungi: other than needing to try out upstream's suggestions is there anything else we need to be thinking about here? anything that needs help?19:08
fungiplanning additional migrations19:09
fungiand the upgrade19:09
fungiupgrade change is already proposed, but we probably want to do the host separation deployment fixes first19:09
clarkband did we decide on a preferred order for that?19:09
clarkback19:09
clarkbdomain host separation, upgrade, migrations ?19:09
clarkber in that order I mean19:10
fungiyeah, i think so19:10
fungii'm hoping to knock out airship, openinfra and starlingx in march if possible19:10
fungioh, and katacontainers19:10
fungithat might be a bit ambitious, we'll see19:10
clarkbsounds good19:10
fungiaiming for openstack migration in april or maybe may19:10
fungiand then we can clean up the old servers19:11
clarkbexciting19:11
fungianyway, that's it for me on this topic19:11
clarkb#topic Gerrit Updates19:12
clarkbThere are two long standing issues related to this. The java 17 switch and the ssh connection channel stuff. Both of which I'll bring up at the gerrit community meeting in 2 days19:12
clarkbHopefully that gets both of those moving for us or at least better direction19:12
corvusssh connection channel stuff?19:13
clarkb#link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks.19:13
clarkbcorvus: ^19:13
clarkbit seems to be a minor issue but makes big scary warnings in the logs. ianw has run it down some and likely wrote a bug fix for it but last I checked it hasn't merged19:13
clarkbthough maybe I didn't set appropriate warning bells on that change to see it merge19:13
ianwyeah no comments on it19:13
clarkb#link https://gerrit-review.googlesource.com/c/gerrit/+/358314 Possible gerrit ssh channel fix19:14
clarkbbut ya I'll try to get some movement on that Thursday at 8am pacific in the gerrit community meeting19:15
ianwthanks!19:15
clarkbYesterday we had some Gerrit fun too which ended up exposing a couple of things19:15
clarkbThe first is that after the change of base images I set the java package to the jre-headless package on debian which doesn't include debugging tools like jcmd19:15
clarkbjcmd ended up being unnecessary to get a thread dump as I could do kill -3 instead. That said it seems like a good idea to have jdk tools in place since you don't know you'll need them until you need them19:16
clarkb#link https://review.opendev.org/c/opendev/system-config/+/875553 Install full jdk on gerrit images19:16
clarkbThat change will add the extra tools by installing the full jdk headless package instead19:16
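A minimal sketch of the `kill -3` approach mentioned above, assuming a HotSpot-style JVM that writes a thread dump to its standard output when it receives SIGQUIT. The function name `request_thread_dump` and the injectable `kill` parameter are hypothetical, added only so the example can be exercised without a running JVM; the PID must be the Java process as seen from wherever this code runs.

```python
import os
import signal


def request_thread_dump(pid, kill=os.kill):
    """Ask a JVM to print a thread dump by sending it SIGQUIT (signal 3).

    The JVM keeps running; the dump appears on the JVM's own stdout
    (for a container, typically in the container's log output).
    `kill` is injectable purely for testing (hypothetical helper design).
    """
    kill(pid, signal.SIGQUIT)
```

With the full JDK installed, `jcmd <pid> Thread.print` is an alternative way to obtain the same kind of dump.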
corvusfyi the stream-events/"recheck" problem in gerrit 3.7 should be fixed in 3.7.1 (i have not verified this, but the expected fix is in the log for the latest release)19:16
corvus#link https://bugs.chromium.org/p/gerrit/issues/detail?id=16475 stream-events issue with zuul expected fix is in gerrit 3.7.119:16
clarkback19:17
corvusi like adding the tools to the img19:17
clarkbThe other thing that came up in my debugging of the issues yesterday is that several gerrit plugins expect to be able to write to review_site/data in a persistent manner19:17
clarkbin particular the delete-project plugin "archives" deleted repos to data/ and deletes them from that location after some time. The plugin manager uses it for something I haven't figured out yet. Replication plugin uses it to persist replication tasks across gerrit restarts19:18
clarkbit is this last one that is most interesting to us since I'm also working on gitea server replacements19:18
corvusi also like the idea of bind-mounting that dir, even if not strictly necessary.  our intent really was to use containers for "packaging convenience" and not really rely on volume management, etc.  so bind-mounting to achieve the normal installation experience makes sense.19:18
clarkb#link https://review.opendev.org/c/opendev/system-config/+/875570 bind mount gerrit's data dir19:19
fungitechnically it also highlighted a third thing: that there's still academic research interest in opendev activities!19:19
clarkbI'll talk more about replication specific things when I get to gitea activities, but ya I think this is a good thing19:20
clarkbSpeaking of gerrit 3.7 one of the things we need to do is update our acls. ianw you sent email about the first change to do that. Did the change get applied yet? I think maybe not as you weren't around last week?19:21
clarkbno worries. Just want to get up to speed if that did happen19:21
corvusi know the stream-events issue was an upgrade blocker; with that [presumably] resolved, is anything blocking an upgrade to 3.7.1 now?19:21
fungiacls ;)19:21
clarkbcorvus: yes, we need to convert all our acls to 3.7 acceptable formats19:22
corvusstrictly speaking they shouldn't be a blocker19:22
clarkboh?19:22
fungitrue, just need to merge the transformation19:22
corvusat least, from a tech standpoint.  i can get behind us wanting to have them in place though.19:22
corvusgerrit has backwards compat for the old stuff -- it's only if you want to change an acl that it comes into play19:22
clarkbya I think gerrit will start refusing to accept acl updates in the affected locations. And that is likely to be confusing for users19:22
corvusyes that19:22
clarkbbest to get things converted upfront and avoid confusion19:23
corvusso the "copy pasta" approach that happens so often would not go well19:23
clarkb#link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs19:23
clarkbthis is the first step (but not a complete conversion of everything that needs doing)19:23
fungifolks will be confused enough when they cargo-cult an old project creation change and it gets rejected19:23
corvusfungi: yep19:23
ianwsorry -- no that's not done yet19:23
clarkbso ya I would suggest we do as much of the converting on 3.6 as we can. Then upgrade to 3.7.1 or newer19:24
ianwwe probably want to start our gerrit 3.7 upgrade checklist page19:24
fungiso while not technically a blocker, setting a correct example with the current acls we have in project-config will hopefully defray some of that19:24
clarkbianw: ++19:24
ianwi can do that, so we start to have a checklist of things we know to work on19:24
fungialso probably our gerrit integration testing will break if our test acls don't at least have the correct format19:25
clarkbcorvus: do you know if java 17 is required for 3.7?19:25
clarkb3.6 was the first release to "support" it19:25
clarkbwe may want to do that conversion pre 3.7 as well19:25
clarkband hopefully gerrit community can clarify that in thursday's meeting19:25
corvusclarkb: i don't know off hand19:26
clarkbI'll try to run that down19:26
clarkbOk let's move on to the next topic which has some overlap19:27
clarkb#topic Upgrading Old Servers19:27
clarkbAs mentioned in our last meeting I think we need to prioritize gitea backends, nameservers, and etherpad19:28
clarkbI've started with gitea and have made quite a bit of progress. I'll try to summarize what I've done so far and then what still needs to be done19:28
clarkbGitea09 has been booted in vexxhost sjc1 using a modern v3 flavor there with 8vcpu and 32GB memory and built in 120GB disk (no BFV)19:28
clarkbThis is a bit larger than our old servers and I think we may end up running fewer gitea backends as a result19:29
fungiwell, also most of our gitea semi-outages have been due to memory exhaustion/swap thrash19:29
clarkbI added the gitea09 server to our gitea group in ansible and let ansible deploy a complete gitea server but without git repo content to it. I then transplanted the database from gitea01 to gitea09 to preserve redirects19:30
fungiso it might help there anyway19:30
clarkb++19:30
clarkbAfter transplanting the database I discovered that some old orgs that are no longer in projects.yaml no longer had working logos. I fixed this by manually copying files for them19:30
clarkbSo far the db transplant and the copying of the ~4 logos are the only manual interventions I've had to do19:30
clarkbI then added gitea09 to gerrit's replication config and the gerrit restarts yesterday picked that up. I triggered a full sync to gitea09 which appears to be near completion.19:31
fungii would stick with 8 backends if we can. the reason the recommended flavors changed is that the memory-to-cpu ratio in the underlying hardware is higher and so the provider had lots of ram going unused on the servers anyway19:31
clarkbThe next steps I've got in mind are to do another full resync (for all 9 giteas) to make sure the gerrit restarts and problems yesterday didn't introduce problems with replication19:31
clarkbAt that point I think we can add gitea09 to haproxy and have it in production19:32
clarkbThen I would like to upgrade gitea to 1.18.519:32
clarkb#link https://review.opendev.org/c/opendev/system-config/+/875533 upgrade gitea to 1.18.519:32
clarkbConcurrent to that I'd also like to update Gerrit to autoreload replication configs19:33
clarkb#link https://review.opendev.org/c/opendev/system-config/+/874340 Gerrit replication autoreloading.19:33
clarkbThis will allow us to add more giteas and remove old giteas without gerrit restarts.19:33
clarkbPreviously we had removed autoreloading because we had noticed it would lose replication tasks on reload and the giteas would not all be in sync19:34
clarkbHowever, the data/ dir storage used by the replication plugin should mitigate this and after discussion with nasserg I think if we bind mount that properly we should be good19:34
fungiyeah, i'm happy to try it again19:34
clarkbOnce all of ^ is settled we can remove gitea08 and test this hypothesis. Then I'll probably boot 3 more giteas and build them out assembly line style and we can add them to production in bulk19:35
fungithe issues it caused last time were only as disruptive as they were because it took us so long to identify the underlying cause19:35
fungiwe'll be on the lookout for it this time anyway19:35
clarkb++19:35
corvus(i just want to clarify that a gerrit restart (via docker stop/start/restart)  shouldn't delete the data dir -- only a docker-compose down + up would do that)19:35
clarkbcorvus: correct19:35
clarkbthis was exercised yesterday with the restarts that didn't pull new images19:35
clarkbwhich confirmed the replication tasks persist across those restarts if data/ content is preserved19:36
fungihowever, older gerrit versions didn't persist that queue to disk at all, with obvious repercussions19:36
corvusokay, i just wanted to get that out there and make sure we still expect that there have been some improvements (eg, replication actually writing to that dir) to make us think it will work better now19:36
corvusclarkbfungi excellent that explains it perfectly, thanks19:36
clarkband then if we need/want another 4 I figure I'll do those in another batch together too19:37
clarkbThe only other consideration is when to move the gitea db backups off of gitea01. I figure I can do that once gitea09 is in production and running happily19:37
ianwwith the logos, iirc, we walk the list of orgs via the api and copy in the logos.  was the problem the old orgs had been setup to an old logo that wasn't copied?19:37
clarkbthat way we can remove gitea01 at any point19:37
clarkbianw: the problem is the old orgs are not in projects.yaml anymore19:37
clarkbianw: and so they really only exist in the context of our rename history and in the gitea db19:37
fungiyeah, it's an issue with renamed orgs19:37
clarkbianw: all I had to do was copy the opendev logo file into a file named after the old org name19:38
ianwi'm just wondering if we should codify that in the logo-setting role19:38
clarkbianw: we probably can. We'd need to inspect the rename history files to know what those orgs are I think19:39
ianwit doesn't look at projects.yaml, but the gitea api.  but it could then also have an extra list of "old" names to also copy19:39
clarkbit's also an easy manual workaround so I didn't want to hold things up on it19:39
clarkbbut happy for the help/improvements19:39
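The manual logo workaround described above, and the idea of giving the logo-setting role an extra list of "old" names, could be sketched roughly as follows. This is only an illustration under assumptions: the function name `copy_logos_for_old_orgs`, the `.png` suffix, and the one-file-per-org layout are hypothetical and are not taken from the actual role or from gitea's storage layout.

```python
import shutil
from pathlib import Path


def copy_logos_for_old_orgs(logo_dir, default_logo, old_org_names, suffix=".png"):
    """Copy default_logo to <logo_dir>/<org><suffix> for each old org name.

    old_org_names is the hypothetical extra list of renamed orgs that no
    longer appear in projects.yaml. Returns the file names that were created.
    """
    logo_dir = Path(logo_dir)
    created = []
    for org in old_org_names:
        target = logo_dir / f"{org}{suffix}"
        if target.exists():
            continue  # never overwrite a logo that is already in place
        shutil.copyfile(default_logo, target)
        created.append(target.name)
    return created
```

Skipping existing files keeps the operation safe to rerun, which matters if it is executed on every deployment.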
clarkbbut ya I think we've got a process largely sorted out now except for the gerrit replication config updates. Reviews on those changes much appreciated19:39
clarkband I'll keep working on this for the next bit until we've removed the old gitea servers. Then I can look at the next server type to upgrade.19:40
clarkbany other questions/ideas/comments about gitea or server upgrades?19:41
clarkboh I have one actually. There is a request for a project rename. I'd like to request we do not rename anything until the gitea replacements are done19:41
clarkbThat's just one extra moving part I don't want to have to keep track of as I go through this :)19:41
clarkb#topic Handing over x/virtualdpu to OpenStack Ironic PTL19:42
clarkbis it dpu or pdu? I may have typoed. fungi want to fill us in?19:43
fungithis should hopefully be fairly straightforward, but i'll start with some background19:43
fungionce upon a time, developers at internap wrote a virtual pdu project, and the openstack ironic project came to depend on it19:43
fungivirtualpdu, as it was called, was never officially added to openstack, but ironic still relies on it19:44
fungithe internap developers moved on, and no longer work at internap nor on virtualpdu, so it's basically abandoned19:44
fungithe openstack ironic developers would like to adopt it, but reaching people with control of the current acl was... hard19:45
fungirpittau was finally able to get a response from mathieu mitchell, who expressed approval of the ironic team taking control of the repository, but since then has not replied further nor updated the acl19:46
fungino other virtualpdu maintainers replied at all (but all were cc'd)19:46
fungii have copies of the e-mail messages, complete with received headers, in case there's some dispute over things19:46
clarkbMy suggestion would be to publicly post intent to do the change in ownership to service-announce and openstack-discuss and cc the old maintainers and post a date when we'll make the change. They can concede or fight it in that period and if we hear nothing we make the change19:47
fungiprobably the next step is to post the intent to hand over access to openstack on a mailing list as well as cc the listed maintainer addresses, and set a date19:47
clarkbI don't think anyone has nefarious intent here, it's a useful tool and the old group isn't interested. New group is so we should support that19:47
fungiif we hear no objections, then move forward granting control19:47
clarkbfungi: ++19:48
ianwall seems fine -- if there was negative feedback would be harder19:48
fungiexactly19:49
fungiso, openstack-discuss? service-announce? suggestions on what mailing list(s) is/are most appropriate to post the notice of intent?19:49
clarkbI feel like service-announce at least since it's something happening at the opendev level19:50
fungiwe don't really have a codified process since this is basically the first time it has ever come up19:50
clarkbbut then openstack-discuss too might help get the email in front of the right eyeballs if someone did want to object19:50
fungiyeah, i can post copies to both of those19:50
fungimainly bringing it up in the meeting here to make sure we've got some consensus among opendev sysadmins on a prototype process, since this is potential precedent for future similar cases19:51
fungiseems like we have no objections over the proposal anyway19:52
clarkbyup. I think starting by contacting maintainers directly and trying to resolve it without opendev involvement is the first step. Then if we don't get objections but also don't resolve it making a public announcement of the plan with a period to object is a good process19:52
ianw++19:52
fungii'll try to get something sent out to mailing lists and the current maintainers tomorrow in that case19:52
funginothing else from me on this topic, unless anyone has questions19:53
clarkb#topic Works on ARM feedback19:53
clarkbwe are almost out of time and I wanted to get to this really quickly19:53
clarkbThe works on arm folks have asked us to talk about what we've done the last 6 months with their program.19:54
clarkbI've started a draft response in an etherpad19:54
clarkb#link https://etherpad.opendev.org/p/3DcVXw0PBOknv1bgyZWh19:54
clarkbunfortunately we don't actually have 6 months of use, which i try to clarify in the email19:54
clarkbianw: maybe we tighten up the bit fungi had concerns about then one of us can send that soon?19:54
ianw++19:55
fungialso some cleanup as soon as your vhost change deploys19:55
clarkbok cool let's sync on that after the meeting19:55
clarkbfungi: yup19:56
clarkb#topic Open Discussion19:56
clarkbAnything else?19:56
ianwnot from me, thanks for once again running the meeting!19:57
clarkbSounds like that may be it. Thank you for your time today. We'll be back here same time and location next week19:57
clarkb#endmeeting19:57
opendevmeetMeeting ended Tue Feb 28 19:57:56 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:57
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-28-19.01.html19:57
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-28-19.01.txt19:57
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-28-19.01.log.html19:57
fungithanks clarkb !19:57
