19:01:24 #startmeeting infra
19:01:24 Meeting started Tue May 10 19:01:24 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 The meeting name has been set to 'infra'
19:01:31 #link https://lists.opendev.org/pipermail/service-discuss/2022-May/000335.html Our Agenda
19:01:33 o/
19:01:39 \o
19:01:46 I'm going to add a couple of items to this agenda at the end due to things that came up over the last couple of days. But shouldn't add too much time
19:01:51 #topic Announcements
19:02:14 We are about a month away from the Open Infra Summit. Mostly a heads up that that is happening since it's been a while since there was an in person event, which tends to impact us in weird ways
19:02:36 Etherpad will likely see more usage for example and CI will be quiet. Though maybe not, because who knows what in person events will be like these days
19:02:51 Anyway that runs June 7-9
19:03:18 I'm not sure I'm ready to experience jet lag again
19:04:27 #topic Actions from last meeting
19:04:31 #link http://eavesdrop.openstack.org/meetings/infra/2022/infra.2022-05-03-19.01.txt minutes from last meeting
19:04:44 I'm back to thinking of dropping this portion of the meeting again :) No actions to call out
19:04:49 #topic Topics
19:05:00 #topic Improving OpenDev's CD throughput
19:05:32 A few of us had a good discussion yesterday. I'm not publicly writing notes down anywhere yet just because the idea was to be candid and it needs filtration
19:05:48 But there were good ideas around what sorts of things we might do post zuulv6
19:06:01 feel free to PM me directly and I'm happy to fill individuals in
19:06:17 I'm hoping I'll have time to get to some of those improvements this week or next as well
19:06:34 But I think making those changes is a good idea before we dive into further zuul controls
19:06:53 but if anyone feels strongly the other way let me know
19:07:11 Anything else on this topic?
19:07:59 not from me
19:08:13 #topic Container Maintenance
19:08:53 I was debugging the etherpad database stuff yesterday (we'll talk about that in depth later in the meeting) and noticed that the mariadb containers take an env flag to check for necessary upgrades on the db content during startup
19:09:36 One of the items on the container maintenance list is mariadb upgrades. I half suspect that performing those may be as simple as setting the flag in our docker-compose.yaml files and then updating the mariadb image version. It isn't clear to me if we have to step through versions one by one though, and we should still test it
19:09:55 But that was a good discovery and I think it gives us a straightforward path towards running newer mariadb in our containers
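For reference, a minimal docker-compose.yaml sketch of the kind of change described above, assuming the flag in question is the upstream mariadb image's MARIADB_AUTO_UPGRADE environment variable; the service name and image tag here are illustrative, not the actual opendev configuration:

    services:
      mariadb:
        # Bumping this tag is the actual version upgrade.
        image: mariadb:10.6
        environment:
          # Asks the image entrypoint to run mariadb-upgrade against an
          # existing datadir on startup when the server version has changed.
          MARIADB_AUTO_UPGRADE: 1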
19:10:36 If anyone wants to look into that further I suspect the gerrit review upgrade job is a good place to start because we can modify the mariadb version at the same time we upgrade gerrit and check that it updates properly
19:11:00 there is also some content in that db for reviewed file flags iirc, so it's not just upgrading an empty db
19:11:31 That was all I had on this topic.
19:11:38 #topic Spring cleaning old reviews
19:12:19 thank you everyone for the topic:system-config-cleanup reviews. I think the only one left in there has some ongoing debugging (gerrit's diff timeout and how stackalytics triggers it)
19:12:34 clarkb: ^ that's good info, and i owe a checklist for the 3.5 upgrade, with downgrade instructions from previous weeks that i haven't gotten to. i will add that there and think about it more
19:13:17 ianw: fwiw I don't think upgrading mariadb when we upgrade gerrit is necessarily how we want to approach it in prod. But the testing gives us a good framework to exercise mariadb upgrades too
19:13:40 system-config is down to under 150 open changes. I think we've trimmed the list by about 50% since we started
19:13:58 When I have time I'll continue to try and review things and update changes that should be resurrected, abandon those we don't need, etc
19:14:30 Despite not being 10 open changes like we probably all want, this is still really good progress. Thanks everyone for the help
19:14:43 #topic Automating the build process of our PPAs
19:14:57 ianw has recently been looking at automating the build process for our packages hosted in our PPAs
19:15:09 specifically openafs and vhd-util
19:15:19 ianw: maybe you can give us a rundown on what that looks like?
19:15:48 I am excited for this as I always find deb packaging to involve secret magic and it requires a lot of effort for me to figure it out from scratch each time. Having it encoded in the automated tooling will help a lot with that
19:16:16 yes, when adding jammy support i decided that manually uploading was a bit of a gap in our gitops
19:17:02 also with the way the repos for these packages got organized, you can mostly follow the recommended distro package building workflows with them too
19:17:18 we have two projects to hold the .deb source generation bits, and now we automatically upload them to ubuntu PPA for building
19:17:21 #link https://opendev.org/opendev/infra-openafs-deb
19:17:36 #link https://opendev.org/opendev/infra-vhd-util-deb
19:18:05 branch-per-distro-release
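For reference, the manual Launchpad PPA workflow this automates is roughly the following; the PPA name and the package version string here are illustrative placeholders, not the real ones:

    # From the packaging branch: build a signed source package...
    debuild -S -sa
    # ...and upload it to the PPA; Launchpad then builds the binaries.
    dput ppa:example-team/example-ppa ../openafs_1.8.8-1~jammy1_source.changes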
19:18:32 they run jobs that are defined in openstack-zuul-jobs and are configured in project-config (because it's a promote job, it needs a secret, so the config needs to be in a trusted repo)
19:19:14 I don't think the secret needs to be in a trusted repo? It just needs to run in a post config context?
19:19:19 *post merge context
19:19:27 #link https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/zuul.d/jobs.yaml#L1397
19:19:43 yeah, we could also pass-to-parent the secret if being used by a playbook in another repo
19:20:11 similar to how some other publication jobs are set up to use project-specific creds
19:20:16 yep, i think that could work too
19:20:39 openstack-zuul-jobs isn't really the best place for this, but it's where we have the extant rpm build jobs
19:21:04 those jobs publish to locations with openstack/* in them, so moving them is a multi-step process (but possible)
19:21:18 ya so I guess we're combining the desire to have ozj manage the package build directions with wanting to run the job on events to the actual deb repos
19:21:46 longer term it would be nice to do it out of the opendev tenant instead, but what's there is working great
19:21:58 right, the jobs are generic enough to build our debs, but not generic enough that i feel like they belong in zuul-jobs
19:22:04 My only concern doing it that way is that ozj has a separate set of reviewers, but currently that set isn't anything I'd worry about
19:22:31 yeah, o-z-j was intended to be openstack-specific central job configuration
19:22:33 just something we'll have to keep in mind if adding people to ozj
19:22:41 and this isn't really openstack-focused, it's more for opendev
19:22:42 anyway, so i've built everything in a testing environment and it all works
19:22:47 #link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/840572
19:23:11 fungi: right, maybe for that reason we should put the jobs in the deb repos themselves?
19:23:16 is the final change that pushes things to the "production" PPA (i.e. the ones that are consumed by production jobs, and nodepool for dib, etc.)
19:23:19 and have them manage their own secrets
19:24:11 there's one secret for both PPAs, they're not separate?
19:24:39 ianw: each repo can encode that secret in the repo I mean
19:24:46 it will be two zuul secrets, possibly with the same data
19:24:46 right, we could put the job configuration in system-config instead of o-z-j but i'm not all that worried about it in the near term
19:25:17 but ya I'm also not worried about it in the short term. More of a "it would be good to lay this out without relying on openstack core reviewers being trustworthy for package builds too"
19:25:37 clarkb: yeah, putting it in the repo became a bit annoying because of the very branchy nature of it. currently there's nothing on the master branch, and we have a distro branch
19:25:55 ah in that case using system-config to centralize things is probably fine
19:26:05 basically do what you are doing with ozj but put it in system-config?
19:26:18 then the overlap with people managing packages matches people managing config management
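A rough sketch of the Zuul configuration shape being discussed, wherever it ends up living: a secret attached to a job that runs in the promote pipeline, with pass-to-parent available if the consuming playbook lives in a parent job in another repo. All names below are illustrative, not the actual job definitions:

    - secret:
        name: infra-deb-ppa-creds
        data:
          # Launchpad/GPG credentials, encrypted with the project's public key.
          gpg_private_key: !encrypted/pkcs1-oaep <ciphertext>

    - job:
        name: upload-deb-to-ppa
        description: Upload the built source package to the PPA.
        run: playbooks/upload-to-ppa.yaml
        secrets:
          - name: ppa_creds
            secret: infra-deb-ppa-creds
            # Exposes the secret to parent-job playbooks, e.g. a generic
            # upload job defined in another repo.
            pass-to-parent: true

    - project:
        promote:
          jobs:
            - upload-deb-to-ppa

A job in an untrusted repo that uses a secret is limited to post-review pipelines such as promote, which is the post-merge context mentioned above.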
19:27:23 system-config is probably a more natural home for both the rpm and deb generation
19:28:14 i feel like the openafs rpm generation ended up there because it's closely tied to the wheel generation, where the wheel builders copy back their results to afs
19:28:31 Regardless of where things end up longer term I'm very excited for this as it will help my noob packaging brain when we need to do updates :0-
19:28:33 er :)
19:28:48 all the wheel jobs probably belong somewhere !o-z-j too, now
19:28:53 yeah, this is awesome work
19:28:57 ianw: ya but I think we can have the wheels consume from the ppa too; then it doesn't really matter where the ppa itself is managed
19:29:30 yep that's right, but i just think that it was how it ended up there, years and years ago
19:29:36 gotcha
19:30:02 alright anything else on this topic?
19:30:09 anyway, so it sounds like we're generally happy if i merge 840572 at some point today and switch production
19:30:30 ya then plan for reconfiguring the home of the job and secret longer term.
19:30:31 then i'll clean up the testing ppas, and add some documentation to system-config
19:30:43 and yeah, make todo notes about the cleanup
19:30:46 That might be a bit ugly to do but I bet enough has merged now that not merging the other change won't help anything
19:30:56 so proceeding is fine with me
19:31:23 thanks again, this is exciting
19:31:29 #topic Jammy Support
19:31:55 I think the last major piece missing from our Jammy support in CI is the wheel mirrors. The previous topic is ensuring we can build those, so we should be on track for getting this solved.
19:32:14 Is there anything else related to Jammy that we need to address? Did dib handle the phased package updates on jammy yet?
19:32:34 oh, no, sorry, i got distracted on that
19:33:00 that's on the todo list too
19:33:15 we put a short term hack in place for us though
19:33:31 ya opendev is fine. More considering that DIB users probably want a similar fix, then we can revert our local fix
19:33:46 I've been trying to encourage others to push fixes as far upstream as possible so am taking my own advice :)
19:34:21 Please do let us know if you find new Jammy behaviors that are unexpected or problematic. We've already run into at least one (phased package updates)
19:34:27 but I guess for now we're good with Jammy and can move on
19:34:34 #topic Keycloak container image updates
19:34:43 we'll also have to tackle the reprepro issue at some point
19:34:48 ++
19:34:56 probably by upgrading the mirror-update node to jammy
19:35:04 remind me what the reprepro issue is again? libzstd?
19:35:10 #undo
19:35:10 Removing item from minutes: #topic Keycloak container image updates
19:35:12 fungi: ya that
19:35:18 well it doesn't seem jammy fixes it
19:35:24 frickler: oh I missed that
19:36:02 interesting
19:36:20 frickler: has upstream reprepro commented further on the issue? last I checked the bug was pretty bare
19:36:41 #link https://bugs.launchpad.net/ubuntu/+source/reprepro/+bug/1968198 Reprepro jammy issue
19:36:51 oh wait, maybe I'm mixing that up with reprepro vs. phased updates
19:37:05 oh yes reprepro didn't support phased updates either
19:37:15 but for our purposes that is fine. Definitely something upstream reprepro will likely need to fix though
19:37:27 (since we are telling our nodes to ignore phased updates and just update to latest right away)
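For context, that node-side override amounts to an apt configuration drop-in along these lines. Whether this is exactly the hack in place is an assumption and the file name is illustrative, but Always-Include-Phased-Updates is the knob recent apt releases provide for opting out of phasing:

    # /etc/apt/apt.conf.d/99-phased-updates (illustrative path)
    # Always take the newest version instead of waiting for the phased rollout.
    APT::Get::Always-Include-Phased-Updates "true";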
19:37:27 but luckily we don't need reprepro to support phased updates right now
19:37:34 right, that
19:38:21 is there a debian bug? i wonder what the relationship to ubuntu w.r.t. reprepro is
19:38:29 still the libzstd related issue remains
19:38:46 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=984690 is for phased updates
19:38:48 ianw: I'm not sure, google produced the bug I found above
19:39:19 right, the suspicion is that by upgrading the mirror-update server to jammy, reprepro will have new enough libzstd and no longer error on some packages
19:40:29 can we deploy a second server to test that or do we need to upgrade in place?
19:40:57 frickler: we can deploy a second server, have it only run the jammy updates, and stop those on the original. Everything is written to afs, which both servers can mount at the same time; they just shouldn't write to it at the same time
19:40:59 i think as long as we coordinate locks between them, we can have more than one server in existence
19:41:11 then if we're happy with it tell the new server to run all of the mirror updates and stop the old server
19:41:45 sounds like a plan
19:42:04 yeah, coordinating locks between them might be easier said than done
19:42:19 ya I was thinking more like removing cron jobs from one server entirely and adding them to the other server
19:42:25 since a reboot will kill the lock
19:42:30 it's certainly possible, but the ansible is written to put all cron jobs on both hosts
19:42:34 better to just not attempt the sync in the first place
19:42:43 just moving ubuntu mirroring should be independent of the other mirrors at least
19:42:50 ianw: We should be able to modify that with a when clause and check the distro or something
19:43:06 when: distro == jammy and when distro != jammy
19:43:13 it's a bit of work but doable
19:43:48 yep, it would probably be a good ansible enhancement to have a config dict that said where each job went based on hostname
19:43:50 ianw: yeah, i didn't mean make them capable of automatically coordinating locks, just that we would make manual changes on the servers to ensure only one is writing at a time
19:44:11 but getting more fancy could certainly be cool too
19:44:17 that way it could be all deployed on one host when we test, and split up easily in production
19:44:28 another todo list item.
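A sketch of that "when clause" idea, splitting the cron jobs by which host should run them; the task names, script paths, and schedules are illustrative, and a config dict keyed by hostname could replace the hardcoded conditions later:

    - name: Install ubuntu reprepro cron job (jammy server only)
      ansible.builtin.cron:
        name: reprepro-ubuntu
        minute: "15"
        hour: "*/4"
        job: /usr/local/bin/reprepro-mirror-update ubuntu
      when: ansible_facts['distribution_release'] == 'jammy'

    - name: Install the remaining mirror cron jobs (old server only)
      ansible.builtin.cron:
        name: reprepro-debian
        minute: "45"
        hour: "*/4"
        job: /usr/local/bin/reprepro-mirror-update debian
      when: ansible_facts['distribution_release'] != 'jammy'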
19:44:54 anyway, i can look at getting jammy nodes going as we bring up the openafs bits over the next day or two
19:45:16 thanks
19:45:30 #topic Keycloak container image updates
19:45:45 Ok I noticed recently that keycloak had stopped publishing to dockerhub and is only publishing to quay.io now
19:46:06 We updated our deployment to switch to quay.io and that will pull a newer image as well. I didn't check if it ended up auto updating the service yet. I should do that
19:46:27 08593cb05ab0 quay.io/keycloak/keycloak:legacy "/opt/jboss/tools/do…" 26 hours ago Up 26 hours keycloak-docker_keycloak_1
19:46:29 looks like yes it restarted the container
19:46:39 assuming it still works then that transition went well.
19:47:01 i just logged into zuul
19:47:19 The other thing to call out here is the :legacy tag. The legacy tag means keycloak + wildfly. Their non legacy images do not use wildfly anymore and appear to be more configurable. However, our testing shows we can't just do a straight conversion from legacy to non legacy
19:47:49 It would probably be worthwhile to dig into the new non legacy image setup to see if that would work better for us. I know one reason people are excited for it is the image is about half the size of the legacy wildfly image. Though i don't think that is a major problem for us
19:47:59 the only non-standard thing we're doing there is adjusting the listening port so it works with host networking
19:48:13 hopefully it's straightforward to figure out how to do that with whatever replaced it
19:48:16 ah ok so maybe we need to adjust that in a different way since wildfly isn't doing the listening
19:48:18 ++
19:48:34 s/listening port/listening addr/
19:48:49 I mostly wanted to bring this up as I'm not the best person to test wildfly currently (I should fix that but ETOOMANYTHINGS)
19:49:05 sounds like we're good currently and if we can update the image and make CI happy we can probably convert to the new thing as well
19:49:18 ++
19:49:47 #topic Etherpad Database Problems
19:50:05 Yesterday we upgraded Etherpad to 1.8.18 from 1.8.17. And it promptly broke on a database settings issue
19:50:29 The problem is that our schema's default charset and collation method were set to utf8 and not utf8mb4 and utf8mb4_bin respectively
19:50:53 1.8.18 updated its ueberdb dependency, which updated how it does logging, and it was actually the logging that crashed everything
19:50:56 (even though the table we're using was set to those instead)
19:51:08 Previously it seems that etherpad logged this issue but then continued because the db table itself was fine
19:51:52 also i have to say, it's kinda silly that they check the database default character set and collation instead of the table being used
19:51:54 but this update broke logging in that context which caused the whole thing to crash. To work around this we updated the default charset and collation to what etherpad wants. This shouldn't affect anything as it only affects CREATE TABLE (the table is already created), LOAD DATA (we don't load data out of band; etherpad writes to the db itself), and stored routines, of which I don't think
19:51:56 there are any
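The workaround amounts to a one-line schema change along these lines; the database name is assumed, and only the schema-level defaults change, so no existing data is rewritten:

    ALTER DATABASE etherpad CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;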
19:52:05 #link https://github.com/ether/ueberDB/issues/266
19:52:10 I filed this issue upstream calling out the problem
19:52:41 hopefully they can debug why their logging changes cause the whole thing to crash and fix it, but we should be good going forward. Our CI didn't catch it because mariadb must default to utf8mb4 and utf8mb4_bin when creating new databases
19:52:51 our old schema from the land before time didn't have that goodness so it broke in prod
19:53:00 But please do call out any utf8 issues with etherpad if you notice them
19:53:09 #topic acme.sh failing to issue new certs
19:53:34 This was the thing I debugged this morning. tl;dr is that this is another weird side effect problem in the tool, like etherpad's
19:53:57 specifically acme.sh is trying to create new keys for our certs because it thinks the key size we are requesting doesn't match the existing key size
19:54:26 but that is because they are transitioning from being implicit that empty key size means 2048 to explicitly stating key size is 2048, and the comparison between "" and 2048 fails, triggering the new key creation
19:54:35 #link https://github.com/acmesh-official/acme.sh/issues/4077 wrote an upstream issue
19:54:40 #link https://github.com/acmesh-official/acme.sh/pull/4078 and pushed a potential fix
19:55:11 I think we can go manually edit the key config files and set Le_Keylength to 2048 or "2048" and acme.sh would continue to function. Similar to changing the db charset out of band
19:55:30 but I think that the PR I pushed upstream should also address this. One thing that isn't clear to me is how to test this
19:55:31 ++ thanks for that. the other option is to pull an acme.sh with your change in it
19:55:38 ianw: ^ maybe you have ideas on testing?
19:56:24 ianw: oh we can point it to my fork on github I guess? I was wondering about that because I think ansible always updates acme.sh currently so I can't make local edits. I could locally edit them and manually run the issue command, but that won't tie into the whole dns syncing, which would only half test it
19:56:56 aiui it's really something that's occurring on hosts that already have a cert and are renewing?
19:57:01 ianw: correct
19:57:02 yes
19:57:16 new hosts should be fine since they will write out the 2048 key length in their configs from the start
19:57:33 that's two parts of the ci that aren't really tested :/ we test the general issuance, but never actually get the key
19:58:03 I think for a day or two we can probably live with waiting on upstream to accept or reject my change (and they have their own CI system)
19:58:08 but also we don't test upgrading from a key written by acme.sh to a newer acme.sh version
19:58:15 but after that we should maybe consider manually editing the file and then letting a prod run run?
19:58:33 and yeah, we have nearly a month before this becomes critical
19:58:35 *manually edit the config file to set Le_Keylength
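That manual workaround would look roughly like this on each affected host; the acme.sh config location and the domain are assumptions (the tool's default layout is ~/.acme.sh/<domain>/<domain>.conf):

    # Pin the key length explicitly so the "" vs 2048 comparison no longer
    # forces a fresh key at renewal time.
    sed -i "s/^Le_Keylength=.*/Le_Keylength='2048'/" \
        /root/.acme.sh/example.opendev.org/example.opendev.org.conf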
19:59:06 yeah, we could add some ansible to rm the autogenerated config file (and then leave behind a .stamp file so that it doesn't run again until we want it to) and then it *should* just reissue
19:59:09 it's mostly that I don't want stuff like this to pile up against all the summit prep and travel that I'm going to be busy with in a week or two :)
19:59:44 ianw: will it not require a --force in that case?
19:59:45 that might be a good enhancement to the ansible anyway: give us a flag that we can set that says "ok, reissue everything on the next run"
19:59:52 because that is the other option. We could set --force and remove it in the future
20:00:13 but ya we don't have to solve that in this meeting and we are at time
20:00:16 #topic Open Discussion
20:00:22 i'll have to page it back in, but we can remove some config and it will regenerate
20:00:23 Really quickly before we end, is there anything important that was missed?
20:00:25 just a heads up that we had another gerrit user end up with a changed lp/uo openid, so i retired their old account and scrubbed the e-mail address so that it could be reused with a new account
20:00:53 frickler mentioned fedora and opensuse mirroring are broken
20:01:12 I believe opensuse mirroring is broken on OBS mirroring. I suspect we'll just stop mirroring that content and fixing opensuse isn't urgent
20:01:20 but the fedora mirroring may merit a closer look
20:01:25 i saw that, i'll take a look. i have a change out still to prune some of the fedora mirrors too
20:01:38 ya I think I +2'd that one
20:01:40 #link https://review.opendev.org/c/opendev/system-config/+/837637
20:02:02 yeah i might try fixing this and then merge that while i'm looking at it
20:03:02 sounds good
20:03:05 thank you everyone!
20:03:06 thanks clarkb!
20:03:10 we'll be here next week same time and location
20:03:17 oh, then I'll revoke my +w
20:03:20 #endmeeting