19:01:10 <clarkb> #startmeeting infra
19:01:10 <opendevmeet> Meeting started Tue Oct 12 19:01:10 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 <opendevmeet> The meeting name has been set to 'infra'
19:01:12 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-October/000288.html Our Agenda
19:01:19 <clarkb> #topic Announcements
19:01:33 <clarkb> I forgot to mention this in the agenda but next week is the PTG
19:01:52 <clarkb> I requested a short amount of time for ourselves largely as office hours that other projects can jump into to discuss stuff with us
19:02:13 <clarkb> I plan to be there and should get an etherpad together today. If the times work out for you feel free to join, otherwise I think we'll have it covered
19:02:52 <clarkb> But also keep in mind that is happening next week and we should avoid changes to meetpad/etherpad if possible
19:03:26 <clarkb> #topic Actions from last meeting
19:03:32 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-10-05-19.01.txt minutes from last meeting
19:03:46 <clarkb> We had actions last week but they were all related to specs. Let's just jump into specs discussion then :)
19:03:51 <clarkb> #topic Specs
19:04:20 <clarkb> First up I did manage to update the prometheus spec based on feedback on how to run it.
Ended up settling on using the built binary to avoid docker weirdness and old versions in distros
19:04:27 <clarkb> #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement
19:04:43 <clarkb> corvus brought up a good concern which is that we should ensure that it can run on old old distros and I confirmed it seems to work on xenial
19:04:53 <clarkb> fungi: I was going to work with you to check it on trusty when you have time
19:05:00 <fungi> ahh, yep
19:05:05 <clarkb> I didn't want to touch the remaining trusty node without you being around
19:05:10 <fungi> i should have time tomorrow
19:05:15 <clarkb> Cool I'll ping you tomorrow then.
19:05:27 <clarkb> My plan is to approve this spec if nothing comes up by end of day Thursday for me
19:05:38 <clarkb> if you haven't reviewed the spec and would like to, now is the time to do that
19:05:46 <clarkb> and we can note any trusty problems before landing it too
19:05:56 <clarkb> Next up is the mailman 3 spec
19:05:58 <clarkb> #link https://review.opendev.org/810990 Mailman 3 spec
19:06:23 <clarkb> I've reviewed this and the plan seems straightforward. Essentially spin up a new machine running mm3. Then migrate existing mm2 vhosts into it as users are ready starting with opendev
19:06:43 <clarkb> If other infra-root can review this spec that would be much appreciated
19:07:07 <clarkb> fungi: ^ anything else to add on mailman 3?
19:07:43 <fungi> nah, the migration tools are fairly honed from what i understand, so other than new interfaces and probably some new message formatting in places, users shouldn't really be impacted
19:08:22 <clarkb> thank you for putting that together.
I'm excited to be able to use the new frontend
19:08:32 <clarkb> #topic Topics
19:08:39 <clarkb> #topic Improving OpenDev CD throughput
19:08:41 <fungi> i guess the actual steps for the cut-over could stand to be drilled down into a bit, deciding how we want to go about making sure deliveries to the list get queued up while we're copying things over
19:08:59 <fungi> but we can work that out as we get closer
19:09:02 <clarkb> ++
19:09:23 <clarkb> ianw: you mentioned trying to pick this up again. I think the next step is largely in making that first change in the stack mergeable (by running jobs for it somehow?)
19:10:17 <clarkb> note that corvus pointed out the change in zuul you thought would fix it likely won't as the playbooks aren't changing for those jobs
19:10:46 <ianw> yeah i added an update to a readme
19:11:15 <clarkb> ah ok, I should go back and rereview then
19:11:17 <ianw> (late last night ... and i just realised that file doesn't trigger anything either :)
19:11:33 <ianw> i'll try something else! but yep, i did respond to all comments
19:11:41 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/807672 is the first change in the sequence
19:11:50 <clarkb> cool and I'll look at rereviewing things this afternoon
19:13:07 <clarkb> #topic Gerrit account cleanups
19:13:44 <clarkb> This is something that has gone on the back burner with zuul updates, gerrit upgrades, gitea upgrades, openstack releases etc.
I haven't forgotten about it and will try and pick it up next week if the PTG is quiet for me (I expect it to be but you never know with an event)
19:14:03 <clarkb> Really just noting that I still intend on getting to this but it is an easy punt because it doesn't typically immediately affect stuff
19:14:21 <clarkb> #topic Gerrit Project Renames
19:14:37 <clarkb> We announced last week that we would rename gerrit projects Friday October 15 at 18:00 UTC
19:14:57 <clarkb> All of our project renaming testing continues to function, so we should largely be mechanically ready for this
19:15:20 <clarkb> The one thing that has been noted is that we need to update project metadata in gitea after renames to update descriptions and urls and storyboard links
19:15:57 <clarkb> I think the easiest way to do that would be to run the gitea project management with the full update flag set after we rename. Either using a subset projects.yaml or just doing it for everything (which could take hours)
19:16:10 <clarkb> fungi: ^ you were thinking about this too, did you have a sense for how you wanted to approach it?
19:16:45 <fungi> it's still not clear to me why we can't update specific projects
19:16:57 <fungi> though i suppose we do need to perform a full run at least once to catch up
19:17:28 <clarkb> fungi: the fundamental issue is that the input to renaming and the input to setting project metadata are different. The rename playbook takes that simple yaml file with old and new names. The project metadata takes projects.yaml
19:17:42 <clarkb> This is why I think it is simpler to do it as two distinct steps.
19:17:59 <fungi> oh, it doesn't filter projects.yaml by specific entries?
19:18:15 <clarkb> no projects.yaml is not referenced at all in the rename process
19:18:32 <clarkb> Then to make things more complicated projects.yaml doesn't get updated until after the rename is done and we merge the associated changes
19:18:38 <fungi> oh, i see, we would normally update projects.yaml after renaming
19:18:51 <clarkb> and since it takes hours to do the force update we don't do those
19:18:54 <fungi> so would need to run the metadata update after that
19:18:56 <clarkb> er we don't automatically do those
19:19:02 <clarkb> yup
19:19:25 <fungi> but we could tell it to only update the projects which had been renamed, once things are up and the projects.yaml update merges, right?
19:19:51 <fungi> rather than telling it to update every project listed in the file
19:20:00 <clarkb> fungi: the current code does not support that. We could hack it in by running the update against an edited projects.yaml. But even that might be complicated since I think the playbook syncs project config directly
19:20:13 <clarkb> this is the bit I was hoping someone would have time to look at
19:20:33 <clarkb> I think it is ok to do the metadata update as a separate step post rename, but ya we should devise a method of making it less expensive
19:20:41 <fungi> ahh, okay, so we need a way to tell the metadata update script to filter its actions to specific project names, i guess
19:20:46 <clarkb> ya that
19:21:33 <fungi> seems like that shouldn't be too hard, we could probably have it use the rename file as the filter
19:21:53 <fungi> i'll see if i can figure out what we need to be able to do that bit
19:21:55 <clarkb> essentially our rename process becomes: run rename playbook, restart things and land project-config updates, ensure zuul is happy, manually run the gitea repo management playbook hopefully not in the most expensive configuration possible
19:22:00 <clarkb> and it is that very last bit that we need someone to look at
19:22:11 <clarkb> thanks!
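[Editor's sketch, not part of the meeting] The filtering idea fungi describes above could look roughly like the following. This is only a minimal illustration under stated assumptions: the rename records are assumed to be a list of mappings with "old" and "new" keys, and the projects.yaml entries are assumed to be mappings with a "project" key. The function names and the sample project names are hypothetical and do not come from the actual playbooks.

```python
def renamed_project_names(rename_records):
    """Collect the post-rename names from rename records.

    Assumed record shape (hypothetical): {"old": "...", "new": "..."}.
    """
    return {record["new"] for record in rename_records}


def filter_projects(projects, rename_records):
    """Keep only the project entries whose name was the target of a rename.

    Assumed entry shape (hypothetical): {"project": "...", ...}.
    """
    wanted = renamed_project_names(rename_records)
    return [entry for entry in projects if entry["project"] in wanted]


if __name__ == "__main__":
    # Illustrative data only; these names are made up.
    projects = [
        {"project": "example/alpha", "description": "Alpha"},
        {"project": "example/beta", "description": "Beta"},
        {"project": "example/gamma", "description": "Gamma"},
    ]
    renames = [
        {"old": "legacy/alpha", "new": "example/alpha"},
        {"old": "legacy/gamma", "new": "example/gamma"},
    ]
    for entry in filter_projects(projects, renames):
        print(entry["project"])
```

Because the function takes a plain list of records, records from several historical rename files could be concatenated and passed in together, which matches the idea of catching up without a full sync.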
19:22:46 <fungi> i guess we could even avoid doing a full sync this time by feeding it the historical rename files too
19:23:06 <clarkb> fungi: yup, we could go through and pull out all the names that need updating as a subset of the whole
19:23:10 <clarkb> since we have those records
19:23:52 <clarkb> I'll work on an etherpad process doc as well as reviewing the project-config proposal and writing up the record files for this rename tomorrow
19:24:32 <clarkb> Anything else to cover on this topic?
19:25:01 <fungi> i don't think so
19:25:08 <clarkb> #topic Gerrit 3.3 upgrade
19:25:36 <clarkb> This went exceptionally well. I'm still looking around wondering what happened and if it is too good to be true :)
19:25:44 <clarkb> the 2.13 -> 3.2 upgrade has scarred me
19:26:18 <clarkb> I've been trying to add hashtag:gerrit-3.3 to changes related to the upgrade and the cleanup afterwards. Feel free to add this hashtag to your changes too
19:26:56 <clarkb> At this point the major change remaining is the 3.2 image cleanup change.
19:27:03 <fungi> i'm trying to run down what looks like a new comment-related crash in gertty, probably related to the upgrade to 3.3
19:27:33 <clarkb> Do we have opinions on when we'll feel comfortable dropping the 3.2 images?
19:28:04 <ianw> dropping the jobs won't purge the images from dockerhub though will it?
19:28:21 <clarkb> ianw: it will not, however the images in docker hub will eventually get aged out and deleted on that end
19:28:30 <clarkb> (I forget what the timing on that is with dockerhub's new policy)
19:29:19 <clarkb> https://review.opendev.org/c/opendev/system-config/+/813074 is the change.
I'm feeling more and more confident in 3.3 since we haven't had any issues yet that make me think revert
19:29:35 <fungi> iirc, the image ageout has to do with when it last got a download
19:29:37 <clarkb> maybe we go ahead and land it and we can always restore it again later if necessary or just use the existing 3.2 tag until that ages out in docker hub
19:29:44 <ianw> yep; we can always rebuild too, and have local copies. so i think 813074 is probably gtg
19:29:49 <clarkb> fungi: ya and our test jobs for 3.2 download that tag
19:30:08 <clarkb> ianw: ok I've removed the WIP
19:30:19 <fungi> i'm fine dropping them at any time, yeah. it's increasingly unlikely we'd try to roll back at this point
19:30:30 <fungi> we're nearing the 48-hour mark
19:30:30 <clarkb> cool please review the change then :)
19:30:51 <fungi> yeah, i was, but... gertty crashing on one of those changes distracted me ;)
19:31:06 <ianw> the attention set seems to be working
19:31:21 <clarkb> I've also got https://review.opendev.org/c/opendev/system-config/+/813534/ up which is semi related in that the changes to update post upgrade hit a bug where we set the heap limit to 96gb in testing which doesn't work if the jvm tries to allocate memory on an 8gb instance
19:31:50 <clarkb> ianw: ^ I made a new ps on that fixing an issue I caught when writing the followups we talked about yesterday. The followups are there too. I think the whole stack deserves careful review to ensure we don't accidentally remove anything.
19:32:12 <clarkb> oh it just occurred to me that the gerrit -> review group rename needs to be checked against the private group vars.
Let me WIP that until that is done
19:33:09 <clarkb> ianw: oh neat it is telling me to review the CD Improvement change since you pushed a new ps
19:33:26 <clarkb> I really think the attention set has the potential for being very powerful, just need to sort out how to make it work for us
19:33:53 <ianw> yeah i've been careful to unclick people when voting, which isn't something that needs attention
19:34:30 <clarkb> The last thing I had on this topic is pointing out that we are already testing the 3.3 -> 3.4 upgrade in CI now :) We should start thinking about scheduling that upgrade next. Probably really look at that post PTG so in 2 weeks?
19:34:48 <clarkb> ianw: using the modify button at the bottom of the comment window thing?
19:34:51 <ianw> i did add a note on that @
19:34:55 <ianw> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=15154
19:35:08 <ianw> still every bug seems to go into the polygerrit category there :/
19:35:45 <ianw> i guess maybe this is polygerrit; i don't know who owns it
19:36:04 <clarkb> ianw: their bug tracker is broken and the normal issue type can't be submitted because no one is assigned to receive notifications for them or some such
19:36:20 <clarkb> so it defaults to polygerrit and you have to hope it gets in front of the right people. But I agree this could be a polygerrit issue
19:36:45 <ianw> clarkb: yep; i think that's going to be the major issue -- if everyone adds your attention when they +1/+2 your change, your attention list becomes less useful
19:37:20 <ianw> also if anyone pops up with dashboard issues see
19:37:24 <ianw> #link https://groups.google.com/g/repo-discuss/c/565rD1Sjiag
19:37:45 <clarkb> might be worth an email to opendev-discuss calling out the modify action and the dashboard stuff
19:37:47 <ianw> basically; /#/dashboard/... doesn't work, /dashboard/... does. unclear if this is a bug or feature
19:38:08 <ianw> sure i can draft something
19:38:15 <clarkb> thanks
19:38:56 <fungi> service-discuss?
19:39:10 <clarkb> fungi: yup sorry. Every other -discuss is name-discuss
19:39:33 <fungi> cool, just making sure you didn't mean some other-discuss
19:39:41 <fungi> (like openstack-discuss)
19:40:20 <clarkb> Thanks again to everyone who helped make this upgrade happen. I think we're in a really good place as far as gerrit goes. We can upgrade with minimal impact and in some cases downgrade. Many of the things we do with gerrit like project creation, project renaming, etc are tested. We even have upgrade testing
19:40:30 <clarkb> Oh and the new server etc
19:40:44 <clarkb> We've come a long way since we were on 2.13 a year ago
19:42:12 <clarkb> #topic Open Discussion
19:42:29 <clarkb> That was it for the agenda.
19:42:35 <clarkb> Anything else?
19:42:56 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/812622
19:43:17 <ianw> fungi: ^ maybe you could double check i didn't fat finger anything; that's from the lock issues i think we discussed last week
19:43:20 <clarkb> oh for some reason I thought that had landed and noticed we still had the error that should fix
19:43:31 <clarkb> but I guess it hasn't landed and ++ to reviewing it and getting it in to fix those conflicts
19:43:34 <fungi> checking
19:43:36 <ianw> borg verify lock issues i mean
19:44:05 <fungi> and yeah, i too saw the cron message and thought we had already fixed it, so good to know!
19:44:15 <clarkb> also gitea01 started failing backups again which makes me wonder about networking in vexxhost again :)
19:44:26 <ianw> the recent -devel job failure had me thinking about bridge upgrades too
19:45:10 <ianw> at one time it seemed to be under a lot of pressure for its small size, but i don't recall any issues recently
19:45:12 <clarkb> One tricky thing I remembered about bridge updates is we have ssh rules around bridge connectivity iirc. We may need to spin up a new bridge, then update all the things to talk to it, then swap over?
19:45:34 <clarkb> ianw: the effort to parallelize the CD stuff could have us wanting a bigger server again
19:45:50 <clarkb> ianw: might be a good idea to get that work done first, monitor resource needs and size appropriately for a new server?
19:46:06 <ianw> i thought it might be a good time to start thinking about using zuul-secrets in more places
19:47:52 <clarkb> ianw: to avoid needing the bastion?
19:47:56 <corvus> we could easily >4x the load on bridge before it's a problem.
19:49:39 <clarkb> ya the main issue that seems to affect bridge performance is having leaked ssh connections pile up ansible rules which causes system load to grow
19:49:45 <clarkb> when that isn't happening it doesn't need a ton of resources
19:50:08 <ianw> clarkb: yep; it would be a different way of working but I think moves us even more towards a "gitops" model
19:51:07 <clarkb> ya I think my biggest struggle there is still around manually running stuff. It seems super useful to be able to do that when things pop up and zuul doesn't have a great answer to that (yet?)
19:52:07 <corvus> i think the cd story gets a lot better if we can remove the ansible module restrictions for untrusted jobs on the executor (but that's a v5+ idea)
19:52:15 <fungi> and the "leaked" (really indefinitely hung) ansible ssh processes seem to crop up when we have pathological server failures where ssh authentication begins but the login never completes
19:52:53 <fungi> rebooting whatever server is stuck generally clears it up
19:53:44 <clarkb> if we want to hold off a bit on updating bridge with the idea that zuul improves before we need to upgrade I'd be ok with that. But we should probably write down a concrete set of things we can go to zuul about making better for that.
The modules thing is another good one
19:54:02 <frickler> before we end, I also want to mention the issue with suds-jurko, which is mostly masked in CI because we have an old wheel built for it
19:54:05 <ianw> clarkb: yep, i agree, but i imagine we could have some sort of "escape-hatch" where we do something like replicate the key and have a way to manually run ansible that provides the decrypted secrets
19:54:10 <clarkb> but also "zuul please run this job now" bypass as sysadmins would also be nice alternative to the escape hatch bridge gives us
19:54:25 <clarkb> ianw: ya that could work too
19:54:32 <fungi> worth noting, we'll be unable to upgrade to the upcoming ansible release until we're on focal, or else we need to use a nondefault python3 with it on bridge
19:54:45 <clarkb> frickler: oh ya that's a good call out
19:54:50 <frickler> so maybe we want to expire wheels after some time or have some other way to check whether they still can be built
19:55:05 <ianw> frickler: ahh, i have a spec for that i think :)
19:55:16 <corvus> i'm unaware of the suds-jurko issue, is there a summary?
19:55:26 <clarkb> we built a wheel for suds-jurko some time ago with older setuptools and have that in our wheel mirror. But suds-jurko doesn't build with current setuptools. This means running openstacky things outside of our CI system is problematic as that package doesn't install
19:55:28 <clarkb> corvus: ^
19:55:35 <ianw> #link https://review.opendev.org/#/c/703916/
19:55:36 <frickler> the wheel was built with setuptools < 58, with 58 it fails
19:56:03 <corvus> thx
19:56:15 <fungi> a sort of toctou problem, i guess
19:56:16 <clarkb> and ya doing a fresh wipe of the wheel mirrors periodically would be a good way to expose that stuff in CI
19:56:17 <ianw> that doesn't exactly cover this scenario, but related.
basically that we are an append-only copy that grows indefinitely
19:56:25 <frickler> also https://bugs.launchpad.net/cinder/+bug/1946340
19:56:43 <frickler> we first noticed it with fedora, because we don't seem to have a wheel for py39
19:56:53 <clarkb> we probably don't need daily rebuilds but weekly or monthly might be a good approach. And solve the indefinite growth problem too
19:57:44 <clarkb> separately suds-jurko has been unmaintained for like 7 years...
19:58:03 <clarkb> and should be replaced too
19:58:18 <fungi> one of half a dozen (at least) dead dependencies which setuptools 58 turned up in various projects i know about
19:58:27 <frickler> yes, in that specific case, there's suds-community as a replacement
19:58:54 <frickler> https://review.opendev.org/c/openstack/requirements/+/813302 lists what fails in u-c
19:59:40 <frickler> with a job that doesn't use the pre-built wheels
20:00:12 <clarkb> We are at time. Feel free to continue conversation in #opendev or on the mailing list.
20:00:39 <clarkb> Thank you everyone for listening and participating. We'll probably be around next week since I don't expect a ton of direct PTG involvement. But if that changes I'll try to send an email about it
20:00:49 <clarkb> #endmeeting
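[Editor's sketch, not part of the meeting] The wheel expiry idea raised near the end (expire cached wheels after some time so that build breakage such as the suds-jurko case surfaces in CI) could be approximated as below. This is a hedged illustration only: it assumes a build timestamp is available for each wheel, and the function name, the 30-day default, and the filenames are hypothetical rather than anything the wheel mirror actually provides.

```python
from datetime import datetime, timedelta, timezone


def stale_wheels(build_times, now, max_age_days=30):
    """Return wheel filenames built more than max_age_days before `now`.

    build_times maps a wheel filename to the (timezone-aware) datetime it
    was built. Both the mapping and the 30-day default are assumptions
    made for this sketch.
    """
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, built in build_times.items() if built < cutoff)


if __name__ == "__main__":
    # Illustrative data only; filenames and dates are made up.
    now = datetime(2021, 10, 12, tzinfo=timezone.utc)
    build_times = {
        "old_package-0.6-py3-none-any.whl": datetime(2020, 1, 1, tzinfo=timezone.utc),
        "new_package-1.0-py3-none-any.whl": datetime(2021, 10, 1, tzinfo=timezone.utc),
    }
    for name in stale_wheels(build_times, now):
        print("would rebuild or expire:", name)
```

A weekly or monthly job could feed the resulting list into a rebuild step, so that a wheel that no longer builds with current setuptools fails visibly instead of being served from the cache indefinitely.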