19:01:24 #startmeeting infra
19:01:24 Meeting started Tue May 10 19:01:24 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 The meeting name has been set to 'infra'
19:01:31 #link https://lists.opendev.org/pipermail/service-discuss/2022-May/000335.html Our Agenda
19:01:33 o/
19:01:39 \o
19:01:46 I'm going to add a couple of items to this agenda at the end due to things that came up over the last couple of days. But shouldn't add too much time
19:01:51 #topic Announcements
19:02:14 We are about a month away from the Open Infra Summit. Mostly a heads up that that is happening since it's been a while since there was an in person event, which tends to impact us in weird ways
19:02:36 Etherpad will likely see more usage for example and CI will be quiet. Though maybe not, because who knows what in person events will be like these days
19:02:51 Anyway that runs June 7-9
19:03:18 I'm not sure I'm ready to experience jet lag again
19:04:27 #topic Actions from last meeting
19:04:31 #link http://eavesdrop.openstack.org/meetings/infra/2022/infra.2022-05-03-19.01.txt minutes from last meeting
19:04:44 I'm back to thinking of dropping this portion of the meeting again :) No actions to call out
19:04:49 #topic Topics
19:05:00 #topic Improving OpenDev's CD throughput
19:05:32 A few of us had a good discussion yesterday. I'm not publicly writing notes down anywhere yet just because the idea was to be candid and it needs filtration
19:05:48 But there were good ideas around what sorts of things we might do post zuulv6
19:06:01 feel free to PM me directly and I'm happy to fill individuals in
19:06:17 I'm hoping I'll have time to get to some of those improvements this week or next as well
19:06:34 But I think making those changes is a good idea before we dive into further zuul controls
19:06:53 but if anyone feels strongly the other way let me know
19:07:11 Anything else on this topic?
19:07:59 not from me
19:08:13 #topic Container Maintenance
19:08:53 I was debugging the etherpad database stuff yesterday (we'll talk about that in depth later in the meeting) and noticed that the mariadb containers take an env flag to check for necessary upgrades on the db content during startup
19:09:36 One of the items on the container maintenance list is mariadb upgrades. I half suspect that performing those may be as simple as setting the flag in our docker-compose.yaml files and then updating the mariadb image version. It isn't clear to me if we have to step through versions one by one though, and we should still test it
19:09:55 But that was a good discovery and I think it gives us a straightforward path towards running newer mariadb in our containers
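For reference, a minimal docker-compose.yaml sketch of the kind of change described above, assuming the flag in question is the upstream mariadb image's MARIADB_AUTO_UPGRADE environment variable; the service name and image tag here are illustrative, not the actual opendev configuration:

    services:
      mariadb:
        # Bumping this tag is the actual version upgrade.
        image: mariadb:10.6
        environment:
          # Asks the image entrypoint to run mariadb-upgrade against an
          # existing datadir on startup when the server version has changed.
          MARIADB_AUTO_UPGRADE: 1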
19:10:36 If anyone wants to look into that further I suspect the gerrit review upgrade job is a good place to start because we can modify the mariadb version at the same time we upgrade gerrit and check that it updates properly
19:11:00 there is also some content in that db for reviewed file flags iirc, so it's not just upgrading an empty db
19:11:31 That was all I had on this topic.
19:11:38 #topic Spring cleaning old reviews
19:12:19 thank you everyone for the topic:system-config-cleanup reviews. I think the only one left in there has some ongoing debugging (gerrit's diff timeout and how stackalytics triggers it)
19:12:34 clarkb: ^ that's good info, and i owe a checklist for the 3.5 upgrade, with downgrade instructions from previous weeks that i haven't gotten to. i will add that there and think about it more
19:13:17 ianw: fwiw I don't think upgrading mariadb when we upgrade gerrit is necessarily how we want to approach it in prod. But the testing gives us a good framework to exercise mariadb upgrades too
19:13:40 system-config is down to under 150 open changes. I think we've trimmed the list by about 50% since we started
19:13:58 When I have time I'll continue to try and review things and update changes that should be resurrected, abandon those we don't need, etc
19:14:30 Despite not being 10 open changes like we probably all want, this is still really good progress. Thanks everyone for the help
19:14:43 #topic Automating the build process of our PPAs
19:14:57 ianw has recently been looking at automating the build process for our packages hosted in our PPAs
19:15:09 specifically openafs and vhd-util
19:15:19 ianw: maybe you can give us a rundown on what that looks like?
19:15:48 I am excited for this as I always find deb packaging to involve secret magic and it requires a lot of effort for me to figure it out from scratch each time. Having it encoded in the automated tooling will help a lot with that
19:16:16 yes, when adding jammy support i decided that manually uploading was a bit of a gap in our gitops
19:17:02 also with the way the repos for these packages got organized, you can mostly follow the recommended distro package building workflows with them too
19:17:18 we have two projects to hold the .deb source generation bits, and now we automatically upload them to ubuntu PPA for building
19:17:21 #link https://opendev.org/opendev/infra-openafs-deb
19:17:36 #link https://opendev.org/opendev/infra-vhd-util-deb
19:18:05 branch-per-distro-release
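For reference, the manual Launchpad PPA workflow this automates is roughly the following; the PPA name and the package version string here are illustrative placeholders, not the real ones:

    # From the packaging branch: build a signed source package...
    debuild -S -sa
    # ...and upload it to the PPA; Launchpad then builds the binaries.
    dput ppa:example-team/example-ppa ../openafs_1.8.8-1~jammy1_source.changes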
19:18:32 they run jobs that are defined in openstack-zuul-jobs and are configured in project-config (because it's a promote job, it needs a secret, so the config needs to be in a trusted repo)
19:19:14 I don't think the secret needs to be in a trusted repo? It just needs to run in a post config context?
19:19:19 *post merge context
19:19:27 #link https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/jobs.yaml#L1397
19:19:43 yeah, we could also pass-to-parent the secret if being used by a playbook in another repo
19:20:11 similar to how some other publication jobs are set up to use project-specific creds
19:20:16 yep, i think that could work too
19:20:39 openstack-zuul-jobs isn't really the best place for this, but it's where we have the extant rpm build jobs
19:21:04 those jobs publish to locations with openstack/* in them, so moving them is a multi-step process (but possible)
19:21:18 ya so I guess we're combining the desire to have ozj manage the package build directions with wanting to run the job on events to the actual deb repos
19:21:46 longer term it would be nice to do it out of the opendev tenant instead, but what's there is working great
19:21:58 right, the jobs are generic enough to build our debs, but not generic enough that i feel like they belong in zuul-jobs
19:22:04 My only concern doing it that way is that ozj has a separate set of reviewers, but currently that set isn't anything I'd worry about
19:22:31 yeah, o-z-j was intended to be openstack-specific central job configuration
19:22:33 just something we'll have to keep in mind if adding people to ozj
19:22:41 and this isn't really openstack-focused, it's more for opendev
19:22:42 anyway, so i've built everything in a testing environment and it all works
19:22:47 #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/840572
19:23:11 fungi: right, maybe for that reason we should put the jobs in the deb repos themselves?
19:23:16 is the final change that pushes things to the "production" PPA (i.e. the ones that are consumed by production jobs, and nodepool for dib, etc.)
19:23:19 and have them manage their own secrets
19:24:11 there's one secret for both PPAs, they're not separate?
19:24:39 ianw: each repo can encode that secret in the repo I mean
19:24:46 it will be two zuul secrets, possibly with the same data
19:24:46 right, we could put the job configuration in system-config instead of o-z-j but i'm not all that worried about it in the near term
19:25:17 but ya I'm also not worried about it in the short term. More of a "it would be good to lay this out without relying on openstack core reviewers being trustworthy for package builds too"
19:25:37 clarkb: yeah, putting it in the repo became a bit annoying because of the very branchy nature of it. currently there's nothing on the master branch, and we have a distro branch
19:25:55 ah in that case using system-config to centralize things is probably fine
19:26:05 basically do what you are doing with ozj but put it in system-config?
19:26:18 then the overlap with people managing packages matches people managing config management
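A rough sketch of the Zuul configuration shape being discussed, wherever it ends up living: a secret attached to a job that runs in the promote pipeline, with pass-to-parent available if the consuming playbook lives in a parent job in another repo. All names below are illustrative, not the actual job definitions:

    - secret:
        name: infra-deb-ppa-creds
        data:
          # Launchpad/GPG credentials, encrypted with the project's public key.
          gpg_private_key: !encrypted/pkcs1-oaep <ciphertext>

    - job:
        name: upload-deb-to-ppa
        description: Upload the built source package to the PPA.
        run: playbooks/upload-to-ppa.yaml
        secrets:
          - name: ppa_creds
            secret: infra-deb-ppa-creds
            # Exposes the secret to parent-job playbooks, e.g. a generic
            # upload job defined in another repo.
            pass-to-parent: true

    - project:
        promote:
          jobs:
            - upload-deb-to-ppa

A job in an untrusted repo that uses a secret is limited to post-review pipelines such as promote, which is the post-merge context mentioned above.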
19:27:23 system-config is probably a more natural home for both the rpm and deb generation
19:28:14 i feel like the openafs rpm generation ended up there because it's closely tied to the wheel generation, where the wheel builders copy back their results to afs
19:28:31 Regardless of where things end up longer term I'm very excited for this as it will help my noob packaging brain when we need to do updates :0-
19:28:33 er :)
19:28:48 all the wheel jobs probably belong somewhere !o-z-j too, now
19:28:53 yeah, this is awesome work
19:28:57 ianw: ya but I think we can have the wheels consume from the ppa too; then it doesn't really matter where the ppa itself is managed
19:29:30 yep that's right, but i just think that it was how it ended up there, years and years ago
19:29:36 gotcha
19:30:02 alright anything else on this topic?
19:30:09 anyway, so it sounds like we're generally happy if i merge 840572 at some point today and switch production
19:30:30 ya then plan for reconfiguring the home of the job and secret longer term.
19:30:31 then i'll clean up the testing ppas, and add some documentation to system-config
19:30:43 and yeah, make todo notes about the cleanup
19:30:46 That might be a bit ugly to do but I bet enough has merged now that not merging the other change won't help anything
19:30:56 so proceeding is fine with me
19:31:23 thanks again, this is exciting
19:31:29 #topic Jammy Support
19:31:55 I think the last major piece missing from our Jammy support in CI is the wheel mirrors. The previous topic is ensuring we can build those, so we should be on track for getting this solved.
19:32:14 Is there anything else related to Jammy that we need to address? Did dib handle the phased package updates on jammy yet?
19:32:34 oh, no, sorry, i got distracted on that
19:33:00 that's on the todo list too
19:33:15 we put a short term hack in place for us though
19:33:31 ya opendev is fine. More considering that DIB users probably want a similar fix, then we can revert our local fix
19:33:46 I've been trying to encourage others to push fixes as far upstream as possible so am taking my own advice :)
19:34:21 Please do let us know if you find new Jammy behaviors that are unexpected or problematic. We've already run into at least one (phased package updates)
19:34:27 but I guess for now we're good with Jammy and can move on
19:34:34 #topic Keycloak container image updates
19:34:43 we'll also have to tackle the reprepro issue at some point
19:34:48 ++
19:34:56 probably by upgrading the mirror-update node to jammy
19:35:04 remind me what the reprepro issue is again? libzstd?
19:35:10 #undo
19:35:10 Removing item from minutes: #topic Keycloak container image updates
19:35:12 fungi: ya that
19:35:18 well it doesn't seem jammy fixes it
19:35:24 frickler: oh I missed that
19:36:02 interesting
19:36:20 frickler: has upstream reprepro commented further on the issue? last I checked the bug was pretty bare
19:36:41 #link https://bugs.launchpad.net/ubuntu/+source/reprepro/+bug/1968198 Reprepro jammy issue
19:36:51 oh wait, maybe I'm mixing that up with reprepro vs. phased updates
19:37:05 oh yes reprepro didn't support phased updates either
19:37:15 but for our purposes that is fine. Definitely something upstream reprepro will likely need to fix though
19:37:27 (since we are telling our nodes to ignore phased updates and just update to latest right away)
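For context, that node-side override amounts to an apt configuration drop-in along these lines. Whether this is exactly the hack in place is an assumption and the file name is illustrative, but Always-Include-Phased-Updates is the knob recent apt releases provide for opting out of phasing:

    # /etc/apt/apt.conf.d/99-phased-updates (illustrative path)
    # Always take the newest version instead of waiting for the phased rollout.
    APT::Get::Always-Include-Phased-Updates "true";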
19:37:27 but luckily we don't need reprepro to support phased updates right now
19:37:34 right, that
19:38:21 is there a debian bug? i wonder what the relationship to ubuntu w.r.t. reprepro is
19:38:29 still the libzstd related issue remains
19:38:46 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=984690 is for phased updates
19:38:48 ianw: I'm not sure, google produced the bug I found above
19:39:19 right, the suspicion is that by upgrading the mirror-update server to jammy, reprepro will have new enough libzstd and no longer error on some packages
19:40:29 can we deploy a second server to test that or do we need to upgrade in place?
19:40:57 frickler: we can deploy a second server, have it only run the jammy updates, and stop those on the original. Everything is written to afs, which both servers can mount at the same time; they just shouldn't write to it at the same time
19:40:59 i think as long as we coordinate locks between them, we can have more than one server in existence
19:41:11 then if we're happy with it tell the new server to run all of the mirror updates and stop the old server
19:41:45 sounds like a plan
19:42:04 yeah, coordinating locks between them might be easier said than done
19:42:19 ya I was thinking more like removing cron jobs from one server entirely and adding them to the other server
19:42:25 since a reboot will kill the lock
19:42:30 it's certainly possible, but the ansible is written to put all cron jobs on both hosts
19:42:34 better to just not attempt the sync in the first place
19:42:43 just moving ubuntu mirroring should be independent of the other mirrors at least
19:42:50 ianw: We should be able to modify that with a when clause and check the distro or something
19:43:06 when: distro == jammy and when distro != jammy
19:43:13 it's a bit of work but doable
19:43:48 yep, it would probably be a good ansible enhancement to have a config dict that said where each job went based on hostname
19:43:50 ianw: yeah, i didn't mean make them capable of automatically coordinating locks, just that we would make manual changes on the servers to ensure only one is writing at a time
19:44:11 but getting more fancy could certainly be cool too
19:44:17 that way it could be all deployed on one host when we test, and split up easily in production
19:44:28 another todo list item.
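A sketch of that "when clause" idea, splitting the cron jobs by which host should run them; the task names, script paths, and schedules are illustrative, and a config dict keyed by hostname could replace the hardcoded conditions later:

    - name: Install ubuntu reprepro cron job (jammy server only)
      ansible.builtin.cron:
        name: reprepro-ubuntu
        minute: "15"
        hour: "*/4"
        job: /usr/local/bin/reprepro-mirror-update ubuntu
      when: ansible_facts['distribution_release'] == 'jammy'

    - name: Install the remaining mirror cron jobs (old server only)
      ansible.builtin.cron:
        name: reprepro-debian
        minute: "45"
        hour: "*/4"
        job: /usr/local/bin/reprepro-mirror-update debian
      when: ansible_facts['distribution_release'] != 'jammy'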
19:44:54 anyway, i can look at getting jammy nodes going as we bring up the openafs bits over the next day or two
19:45:16 thanks
19:45:30 #topic Keycloak container image updates
19:45:45 Ok I noticed recently that keycloak had stopped publishing to dockerhub and is only publishing to quay.io now
19:46:06 We updated our deployment to switch to quay.io and that will pull a newer image as well. I didn't check if it ended up auto updating the service yet. I should do that
19:46:27 08593cb05ab0 quay.io/keycloak/keycloak:legacy "/opt/jboss/tools/do…" 26 hours ago Up 26 hours keycloak-docker_keycloak_1
19:46:29 looks like yes it restarted the container
19:46:39 assuming it still works then that transition went well.
19:47:01 i just logged into zuul
19:47:19 The other thing to call out here is the :legacy tag. The legacy tag means keycloak + wildfly. Their non legacy images do not use wildfly anymore and appear to be more configurable. However, our testing shows we can't just do a straight conversion from legacy to non legacy
19:47:49 It would probably be worthwhile to dig into the new non legacy image setup to see if that would work better for us. I know one reason people are excited for it is the image is about half the size of the legacy wildfly image. Though i don't think that is a major problem for us
19:47:59 the only non-standard thing we're doing there is adjusting the listening port so it works with host networking
19:48:13 hopefully it's straightforward to figure out how to do that with whatever replaced it
19:48:16 ah ok so maybe we need to adjust that in a different way since wildfly isn't doing the listening
19:48:18 ++
19:48:34 s/listening port/listening addr/
19:48:49 I mostly wanted to bring this up as I'm not the best person to test wildfly currently (I should fix that but ETOOMANYTHINGS)
19:49:05 sounds like we're good currently and if we can update the image and make CI happy we can probably convert to the new thing as well
19:49:18 ++
19:49:47 #topic Etherpad Database Problems
19:50:05 Yesterday we upgraded Etherpad to 1.8.18 from 1.8.17. And it promptly broke on a database settings issue
19:50:29 The problem is that our schema's default charset and collation method were set to utf8 and not utf8mb4 and utf8mb4_bin respectively
19:50:53 1.8.18 updated its ueberdb dependency, which updated how it does logging, and it was actually the logging that crashed everything
19:50:56 (even though the table we're using was set to those instead)
19:51:08 Previously it seems that etherpad logged this issue but then continued because the db table itself was fine
19:51:52 also i have to say, it's kinda silly that they check the database default character set and collation instead of the table being used
19:51:54 but this update broke logging in that context which caused the whole thing to crash. To work around this we updated the default charset and collation to what etherpad wants. This shouldn't affect anything as it only affects CREATE TABLE (the table is already created), LOAD DATA (we don't load data out of band; etherpad writes to the db itself), and stored routines, of which I don't think
19:51:56 there are any
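The workaround amounts to a one-line schema change along these lines; the database name is assumed, and only the schema-level defaults change, so no existing data is rewritten:

    ALTER DATABASE etherpad CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;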
19:52:05 #link https://github.com/ether/ueberDB/issues/266
19:52:10 I filed this issue upstream calling out the problem
19:52:41 hopefully they can debug why their logging changes cause the whole thing to crash and fix it, but we should be good going forward. Our CI didn't catch it because mariadb must default to utf8mb4 and utf8mb4_bin when creating new databases
19:52:51 our old schema from the land before time didn't have that goodness so it broke in prod
19:53:00 But please do call out any utf8 issues with etherpad if you notice them
19:53:09 #topic acme.sh failing to issue new certs
19:53:34 This was the thing I debugged this morning. tl;dr is that this is another weird side effect problem in the tool, like etherpad's
19:53:57 specifically acme.sh is trying to create new keys for our certs because it thinks the key size we are requesting doesn't match the existing key size
19:54:26 but that is because they are transitioning from being implicit that empty key size means 2048 to explicitly stating key size is 2048, and the comparison between "" and 2048 fails, triggering the new key creation
19:54:35 #link https://github.com/acmesh-official/acme.sh/issues/4077 wrote an upstream issue
19:54:40 #link https://github.com/acmesh-official/acme.sh/pull/4078 and pushed a potential fix
19:55:11 I think we can go manually edit the key config files and set Le_Keylength to 2048 or "2048" and acme.sh would continue to function. Similar to changing the db charset out of band
19:55:30 but I think that the PR I pushed upstream should also address this. One thing that isn't clear to me is how to test this
19:55:31 ++ thanks for that. the other option is to pull an acme.sh with your change in it
19:55:38 ianw: ^ maybe you have ideas on testing?
19:56:24 ianw: oh we can point it to my fork on github I guess? I was wondering about that because I think ansible always updates acme.sh currently so I can't make local edits. I could locally edit them and manually run the issue command, but that won't tie into the whole dns syncing, which would only half test it
19:56:56 aiui it's really something that's occurring on hosts that already have a cert and are renewing?
19:57:01 ianw: correct
19:57:02 yes
19:57:16 new hosts should be fine since they will write out the 2048 key length in their configs from the start
19:57:33 that's two parts of the ci that aren't really tested :/ we test the general issuance, but never actually get the key
19:58:03 I think for a day or two we can probably live with waiting on upstream to accept or reject my change (and they have their own CI system)
19:58:08 but also we don't test upgrading from a key written by acme.sh to a newer acme.sh version
19:58:15 but after that we should maybe consider manually editing the file and then letting a prod run run?
19:58:33 and yeah, we have nearly a month before this becomes critical
19:58:35 *manually edit the config file to set Le_Keylength
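That manual workaround would look roughly like this on each affected host; the acme.sh config location and the domain are assumptions (the tool's default layout is ~/.acme.sh/<domain>/<domain>.conf):

    # Pin the key length explicitly so the "" vs 2048 comparison no longer
    # forces a fresh key at renewal time.
    sed -i "s/^Le_Keylength=.*/Le_Keylength='2048'/" \
        /root/.acme.sh/example.opendev.org/example.opendev.org.conf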
19:59:06 yeah, we could add some ansible to rm the autogenerated config file (and then leave behind a .stamp file so that it doesn't run again until we want it to) and then it *should* just reissue
19:59:09 it's mostly that I don't want stuff like this to pile up against all the summit prep and travel that I'm going to be busy with in a week or two :)
19:59:44 ianw: will it not require a --force in that case?
19:59:45 that might be a good enhancement to the ansible anyway: give us a flag that we can set that says "ok, reissue everything on the next run"
19:59:52 because that is the other option. We could set --force and remove it in the future
20:00:13 but ya we don't have to solve that in this meeting and we are at time
20:00:16 #topic Open Discussion
20:00:22 i'll have to page it back in, but we can remove some config and it will regenerate
20:00:23 Really quickly before we end, is there anything important that was missed?
20:00:25 just a heads up that we had another gerrit user end up with a changed lp/uo openid, so i retired their old account and scrubbed the e-mail address so that it could be reused with a new account
20:00:53 frickler mentioned fedora and opensuse mirroring are broken
20:01:12 I believe opensuse mirroring is broken on OBS mirroring. I suspect we'll just stop mirroring that content and fixing opensuse isn't urgent
20:01:20 but the fedora mirroring may merit a closer look
20:01:25 i saw that, i'll take a look. i have a change out still to prune some of the fedora mirrors too
20:01:38 ya I think I +2'd that one
20:01:40 #link https://review.opendev.org/c/opendev/system-config/+/837637
20:02:02 yeah i might try fixing this and then merge that while i'm looking at it
20:03:02 sounds good
20:03:05 thank you everyone!
20:03:06 thanks clarkb!
20:03:10 we'll be here next week same time and location
20:03:17 oh, then I'll revoke my +w
20:03:20 #endmeeting