19:00:05 <clarkb> #startmeeting infra
19:00:05 <opendevmeet> Meeting started Tue Jan 7 19:00:05 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:05 <opendevmeet> The meeting name has been set to 'infra'
19:00:17 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/M5YEAMYDHKFBK7BJKHIQVFKGXJEW3KZF/ Our Agenda
19:00:23 <clarkb> #topic Announcements
19:00:35 <clarkb> Welcome to 2025. Apologies for interrupting the winter slumber
19:00:54 <tonyb> I won't be around much next week
19:01:17 <fungi> (or summer slumber in tonyb's case)
19:01:27 <clarkb> tonyb: ack thanks for the heads up
19:01:28 <tonyb> (Taking an actual vacation)
19:01:52 <clarkb> I didn't really have anything to announce today. Anyone else with an announcement?
19:02:24 <clarkb> #topic Zuul-launcher image builds
19:02:41 <clarkb> corvus: I feel like the holidays were just long enough that I've forgotten where we were on this
19:02:52 <clarkb> I know there was the raw image handling and testing
19:03:40 <corvus> mostly waiting on the #niz stack (which is in flaky-test fixing mode now); reviews welcome but not necessary at this time
19:03:46 <fungi> i wiped my brain clean like an etch-a-sketch, so don't feel bad
19:03:48 <corvus> that stack adds the web ui and more image lifecycle stuff
19:03:56 <corvus> so it'll be good to have that in place before more live testing
19:04:10 <clarkb> corvus: did the API stuff land (and presumably get deployed through our weekly deployments)?
19:04:25 <corvus> nope, that's interleaved in that stack
19:04:47 <corvus> it's completely separate so i think it's actually okay to do a single-phase merge for that instead of our usual two-phase
19:05:19 <clarkb> ok /me scribbles a note to try and review that stack if paperwork gets done
19:05:37 <corvus> more urgent than that though is the dockerhub image stuff :)
19:05:46 <clarkb> yup that has its own agenda item today
19:05:52 <corvus> i don't relish the idea of trying to get all that merged before that :)
19:06:12 <clarkb> anything else on this topic or should we continue on so that we can get to the container image mirroring?
19:06:22 <corvus> continue i think
19:06:29 <clarkb> #topic Deploying new Noble Servers
19:06:52 <clarkb> my podman prep change that updates install-docker to install podman and docker compose on Noble landed last week
19:07:13 <clarkb> we don't have any noble servers running containers yet so that was a noop (and I spot checked the deployments to ensure I didn't miss anything)
19:07:28 <clarkb> but that means the next step is to deploy a new server that uses containers on Noble
19:07:44 <fungi> what's a good candidate?
19:07:46 <clarkb> my hope is that I'll get beginning of the year paperwork stuff done tomorrowish and I can start on a new paste deployment late this week
19:07:49 <tonyb> I did test my ansible-devel changes which use install-docker on noble
19:07:51 <fungi> lodgeit/paste?
19:07:58 <clarkb> fungi: my plan is paste since that is representative with a database but also small and simple
19:08:00 <fungi> or maybe mediawiki
19:08:11 <fungi> yeah, paste feels right as a canary
19:08:37 <clarkb> tonyb: have you seen any problems with it or is it working in CI for your ansible-devel stuff?
19:09:05 <clarkb> I guess that is the main callout on this topic today. The big change is in but I haven't put it to use in production yet. If you do put it to use and have feedback that is very much welcome
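For context, a minimal sketch of the kind of conditional the install-docker change introduces on Noble; the task layout and the docker-compose-v2 package name are illustrative assumptions, not the actual role code in opendev/system-config:

    # Illustrative only: approximates the behavior described above.
    # The real install-docker role may structure this differently.
    - name: Install podman and docker compose on Ubuntu Noble
      ansible.builtin.package:
        name:
          - podman
          - docker-compose-v2   # assumed package; provides "docker compose"
        state: present
      when:
        - ansible_facts['distribution'] == 'Ubuntu'
        - ansible_facts['distribution_release'] == 'noble'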
19:09:22 <tonyb> clarkb: No issues, but I don't actually *use* compose; it installs though
19:09:38 <fungi> is there some possibility of the ansible-devel job getting back into a passing state again? that's exciting
19:09:40 <clarkb> ya I think on bridge (which ansible-devel works on) it's minimal use of containers
19:10:53 <tonyb> fungi: if nothing else it sets somewhat of a timeline for dropping xenial and moving bridge to noble
19:11:10 <clarkb> so ya let me know if you notice oddities, but I think we can continue on
19:11:13 <fungi> that would be great
19:11:16 <tonyb> both of which are needed before we can update the ansible version
19:11:34 <clarkb> #topic Upgrading old servers
19:11:43 <clarkb> the discussion seems to be trending into this topic anyway
19:11:55 <clarkb> ya the issue with newer ansible is it won't run on remote hosts with older python
19:12:13 <clarkb> historically ansible has maintained a wide array of python support for the remote side (the control side was more restricted)
19:12:27 <clarkb> but that has changed recently with a quite reduced set of supported python versions
19:13:34 <clarkb> anyway, other than the noble work above is there anything else to be aware of for upgrading servers? I think we're going to end up needing to focus on this for the first half of the year
19:13:51 <clarkb> but I suspect that a lot of those jumps will be to noble so getting that working well upfront seems worthwhile
19:13:56 <tonyb> I wanted to say I started looking at mediawiki again
19:14:26 <tonyb> I'd like to send an announcement to service-announce this week
19:14:31 <tonyb> #link https://etherpad.opendev.org/p/opendev-wiki-announce
19:14:46 <clarkb> tonyb: are there new patches to review yet? (I haven't followed the irc notifications too closely the last few weeks)
19:14:56 <clarkb> but ya announcing that soonish seems good
19:15:22 <tonyb> No new patches, but I addressed a bunch of feedback yesterday
19:16:29 <frickler> I've looked at some old content on the wiki recently, and I do wonder whether it would be better to start fresh
19:16:37 <clarkb> ok I'll try to catch back up on the announcement and the review stack today or tomorrow-ish so that the end-of-week announcement schedule isn't held up by me (though I had looked at them previously and it's probably fine to proceed)
19:16:46 <frickler> and possibly just leave a read-only copy available somewhere
19:17:10 <JayF> As long as that read-only copy will exist as long as the new wiki will, that sounds like an excellent idea
19:17:28 <fungi> tonyb: exciting that you're so close! announcement looks fine other than shoehorning you into using jammy, you might end up wanting noble depending on the timeline
19:17:36 <tonyb> where were you two a year ago ;P
19:18:01 <clarkb> I think I'm on the fence about starting over
19:18:10 <corvus> i don't think we should start over
19:18:12 <clarkb> anyone can simply redo any old content on the wiki as it stands, and that avoids needing to maintain two wikis
19:18:30 <corvus> clarkb: ++
19:18:37 <clarkb> and it's not like starting over prevents things from becoming stale all over again. The fundamental problem is that the content needs curation, and that doesn't change
19:18:53 <fungi> i don't see starting over as a solution, it doesn't solve the problem of the wiki containing old and abandoned content, merely resets the starting point for that
19:19:18 <JayF> That's true in the most general of senses, but the reality is we have years-old content in many places. That content is unlikely to *ever* be curated, so having it not carried over to confuse people would be nice.
19:19:19 <fungi> it will continue to be a problem
19:19:36 <clarkb> JayF: right, but anyone can just go and archive that content today right?
19:19:38 <JayF> It's a lot easier to leave something behind than it is to delete something -- it's very hard to know when it's appropriate to remove it
19:19:47 <clarkb> we don't need to start over and host a special frozen wiki
19:19:47 <tonyb> fungi: I think I'd like to stick with Jammy and do the OS change once we're on the latest (mediawiki) LTS
19:20:35 <clarkb> (mediawiki archives things when you delete them aiui, so you can just delete them if you're 80% certain they should be deleted)
19:20:48 <fungi> better would be to actively delete any concerning old content, since starting from scratch doesn't prevent new content from ceasing to be cared for and ending up in the same state. or do we need semi-annual domain changes in order to force content refreshes?
19:21:08 <corvus> if, let's say, the openstack project feels pretty strongly about reducing the confusion caused by outdated articles, one approach would be a one-time mass deletion/archiving of those articles.
19:21:46 <corvus> or, something more like what wikipedia does: mass-additions of "this may be outdated" banners to articles.
19:21:55 <JayF> fungi: wiki.dalmatian.openstack.org here we come? /s
19:22:15 <JayF> I think a one-time mass update with "this is old" banners, or archiving old information, is a good idea
19:22:24 <fungi> there's probably a mw plugin to automatically insert admonitions in pages that haven't seen an edit in x years
19:22:31 <clarkb> https://www.mediawiki.org/wiki/Manual:Archive_table says things get archived when deleted
19:22:47 <clarkb> so ya I think we can keep the current content and its curators can delete as necessary
19:23:06 <clarkb> hashar can probably confirm that for us, and you can test with a throwaway page
19:23:14 <frickler> makes sense, I'll try to take a look at that
19:23:24 <tonyb> frickler: Thanks.
19:23:51 <clarkb> alright, anything else related to server upgrades?
19:24:32 <tonyb> not from me
19:24:53 <clarkb> #topic Mirroring Useful Container Images
19:25:05 <clarkb> the docker hub rate limit problems continue to plague us and others
19:25:31 <clarkb> corvus has made progress in setting up jobs to mirror useful images from docker hub to another registry (quay in this case) to alleviate the problem
19:25:39 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/938508 Initial change to mirror images that are generically useful
19:26:41 <clarkb> I have some thoughts on improving the tags that are mirrored, but I think that is good for a followup
19:26:50 <clarkb> for a first pass we should start smallish and make sure everything is working first?
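The core of such a mirroring job is essentially a registry-to-registry copy; a minimal sketch with skopeo, using an assumed image/tag and the destination org the meeting settles on below (the actual job definitions live in 938508 and may be structured quite differently):

    # Copy all architectures of one tag from Docker Hub to quay.io.
    # Image, tag, and org are placeholders; auth flags omitted.
    skopeo copy --all \
        docker://docker.io/library/python:3.12-bookworm \
        docker://quay.io/opendevmirror/python:3.12-bookworm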
19:27:36 <clarkb> corvus: those jobs will run in the periodic pipeline which means they will trigger at 0200 utc iirc and then fight for resources with all of the rest of the periodic jobs
19:28:05 <clarkb> just wondering if we should be careful when we merge that so that we can check nothing is exposed and quickly regen the api key if that does happen?
19:28:39 <corvus> yeah... do we have an earlier pipeline so we can jump the gun? :) or we can just see if the time is a problem.
19:29:04 <clarkb> corvus: we have the hourly opendev pipeline which might work temporarily? Also I realized that the python- images already have images in opendevorg
19:29:16 <corvus> they must be out of date though
19:29:19 <clarkb> I don't think those tags exist so I'm not sure that is a problem, but wondering if we should be careful with those collisions to start
19:29:29 <clarkb> ya they would be old versions
19:30:03 <clarkb> gerrit will collide too
19:30:14 <frickler> downstream I added a second daily pipeline that runs 3h earlier for image builds that are then consumed by "normal" daily jobs, maybe we want to do something similar?
19:30:18 <corvus> i think mirroring existing or non-existing tags is what we want...
19:30:37 <clarkb> corvus: ya I think it's fine for python-
19:30:39 <corvus> clarkb: interesting point on gerrit; maybe we should prefix with "mirror" or something? or even make a new org?
19:30:56 <clarkb> corvus: ya I think for things like gerrit we need to namespace further, either with a prefix or a new org
19:31:09 <clarkb> we could also use different tags but I suspect that would be more confusing
19:31:32 <corvus> how about new org: opendevmirror?
19:31:37 <clarkb> corvus: I like that
19:31:51 <fungi> it's clear enough as to what it is, wfm
19:32:08 <corvus> frickler: ack; sounds like a good solution if 0200 is a problem
19:32:27 <frickler> +1 to opendevmirror
19:32:31 <clarkb> ok so make a new org to namespace things and avoid collisions with stuff opendev wants to host itself eventually. Keep the initial list small like we've got currently. Then follow up with additional tags etc
19:32:48 <tonyb> ++ on opendevmirror
19:32:52 <fungi> adding a new timer trigger pipeline is cheap if we decide there is sufficient need to warrant it
19:33:25 <corvus> sounds good; any of those images we want to say don't belong there and should instead be handled by a job in the zuul tenant?
19:33:30 <frickler> another reason to do that: run before the normal periodic rush eats up more rate limits?
19:33:48 <fungi> i guess it's worth monitoring for failures to decide on the separate pipeline
19:33:59 <clarkb> corvus: I think the only one that opendev doesn't consume today is httpd
19:34:05 <corvus> frickler: yeah, that's the main problem i could see from using 0200, but don't know how bad it will be yet
19:34:07 <clarkb> corvus: but that seems generic enough that I'm happy for us to have it
19:34:15 <clarkb> (we use the gerrit image in gerritlib testing iirc)
19:34:54 <corvus> #action corvus make opendevmirror quay.org and update 938508
19:35:08 <corvus> #undo
19:35:18 <corvus> #action corvus make opendevmirror quay.io org and update 938508
19:35:56 <clarkb> anything else on this topic?
19:35:57 <fungi> wrt separate vs existing pipeline, i have no objection other than not wanting to prematurely overengineer it
19:36:09 <corvus> fungi: ++
19:36:31 <clarkb> ya I wouldn't go out of our way to add a pipeline just yet, but if an alternative to periodic already exists that might be a good option to start
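If 0200 contention does become a problem, frickler's earlier-daily-pipeline idea might look roughly like this in Zuul configuration; the pipeline name and trigger time here are assumptions for illustration, not a proposed change:

    # Hypothetical second daily pipeline firing three hours before the
    # normal periodic rush, so mirror jobs run before rate limits are eaten.
    - pipeline:
        name: periodic-mirror
        manager: independent
        precedence: low
        trigger:
          timer:
            - time: '0 23 * * *'   # 23:00 UTC, 3h ahead of periodic's 02:00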
19:37:13 <clarkb> ok there are a few more things I want to get into so let's keep moving
19:37:15 <clarkb> #topic Gerrit H2 Cache File Growth
19:37:31 <clarkb> Just before we all enjoyed some holiday time off we restarted Gerrit thinking it would be fine since we weren't changing the image
19:37:47 <fungi> it's just a restart, wcpgw?
19:37:54 <clarkb> turns out we were wrong, and the underlying issue appears to be the growth of the git_file_diff and gerrit_file_diff h2 database cache backing files
19:37:59 <clarkb> one of them was over 200GB iirc
19:38:25 <clarkb> on startup Gerrit attempts to do db cleanup to prune caches down to size, but this only affects the content within the db and not the db file itself
19:38:51 <clarkb> however I suspect that h2 performs very poorly when the backing file is that size, and we had problems. We stopped Gerrit again and then moved the caches aside, forcing gerrit to start over with clean cache files
19:39:01 <clarkb> last I checked those cache files had already regrown to about 20GB in size
19:39:01 <fungi> thankfully hashar was no stranger to this problem
19:39:25 <clarkb> ya hashar points out the default h2 compaction time is like 200ms, which isn't enough for files of this size to be compacted down to a reasonable size
19:39:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/938000 Suggested workaround from Hashar improves compaction when we do shutdown
19:39:49 <clarkb> hashar's suggestion is that we allow compaction to run for up to 15 seconds instead. This compaction only runs on Gerrit shutdown though
19:40:16 <clarkb> which means if we aren't shutting down Gerrit often the files could still grow quite a bit. I'm thinking it's still a good toggle to change though as it should help when we do shut down
19:40:32 <fungi> and which i suppose could get skipped also in an unplanned outage
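For reference, the knob in question is H2's MAX_COMPACT_TIME database setting (default 200 ms, matching the figure hashar cites), which can be raised via the database URL; exactly how 938000 wires this into Gerrit's cache configuration may differ from this sketch, and the path below is illustrative:

    # H2 compacts for at most MAX_COMPACT_TIME ms when the database closes;
    # 15000 ms gives large cache files a real chance to shrink on shutdown.
    jdbc:h2:/var/gerrit/cache/git_file_diff;MAX_COMPACT_TIME=15000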
19:40:40 <clarkb> but it's also got me thinking maybe we should revert my changes that allowed those caches to be bigger (pruned daily but still leading to fragmentation on disk)
19:41:06 <fungi> have we observed performance improvements from the larger caches?
19:41:11 <clarkb> the goal with ^ is that maybe daily pruning with smaller limits will reduce the fragmentation of the h2 backing files which leads to the growth
19:41:18 <clarkb> fungi: not really, no
19:41:33 <clarkb> fungi: I was hoping that doing so would speed up gerrit startups because it wouldn't need to prune so much on startup
19:41:35 <fungi> i don't object to longer shutdown times if there's a runtime performance improvement to balance them out
19:41:46 <clarkb> but it seems the slowness may have been related to the size of the backing file all along
19:42:24 <corvus> if we revert your change, would that actually help this? the db may still produce just as much fragmentation garbage
19:42:56 <clarkb> corvus: right, it isn't clear the revert would help significantly since the cache is only pruned once a day and the sizes I picked were based on ~1 day of growth anyway
19:43:28 <fungi> also all of this occurred during what is traditionally our slowest activity time of the year
19:43:55 <fungi> so we may not have great anecdotal experiences with its impact either way
19:43:57 <clarkb> the limit is 2GB today on one of them, which means that is our floor. In theory it may grow up to 4GB in size before its daily pruning down to 2GB. If we revert my change the old limit was 256MB iirc, so we'd prune from 2GB-ish down to 256MB-ish
19:44:15 <clarkb> but I'm happy to change one thing at a time if we just want to start with increasing the compaction time
19:44:42 <clarkb> that would look something like updating the configs for that h2 setting, stopping gerrit, starting gerrit, then probably stopping gerrit again to see if compaction does what we expect before starting gerrit again
19:44:59 <clarkb> a bit back and forth/flappy, but I think important to actually observe the improvement
19:45:25 <fungi> sounds fine to me
19:45:34 <clarkb> anyway, if that seems reasonable leave a review on the change above (938000) and I'm happy to approve the change and drive those restarts at an appropriate time
19:45:46 <fungi> i doubt our user experience is granular enough to notice the flapping
19:46:13 <clarkb> looks like corvus and tonyb already voted in favor so I'll proceed with the change-one-thing-at-a-time plan for now, and that one thing is increased compaction time
19:46:21 <clarkb> #topic Rax-ord Noble Nodes with 1 VCPU
19:46:48 <clarkb> I've kept this agenda item because I wanted to follow up and check if anyone had looked into a sanity check for our base pre playbook to early-fail instances with only one vcpu on rax xen
19:47:15 <clarkb> I suspect this is a straightforward bit of ansible that looks at ansible facts, but we do want to be careful to test it with base-test first to avoid unexpected fallout
19:48:47 <clarkb> sounds like no. That's fine, and the problem is intermittent. I'll probably drop this from next week's agenda and we can put it back if we need to (eg further debugging or the problem gets worse)
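Since the log says this is a straightforward fact-based check, here is a rough sketch of what such a task in the base pre playbook could look like; the fact-based condition and wording are assumptions about how the situation would be detected, not reviewed code:

    # Hypothetical base-pre sanity check: abort early on Xen nodes that
    # booted with only one VCPU so the job is retried on a fresh node.
    - name: Fail early on misbooted single-VCPU nodes
      ansible.builtin.assert:
        that:
          - ansible_facts['processor_vcpus'] | int > 1
        fail_msg: >-
          Node has only one VCPU; likely a bad rax-ord Xen boot,
          failing early before running the job payload.
      when: ansible_facts['virtualization_type'] | default('') == 'xen'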
19:48:54 <clarkb> #topic Service Coordinator Election
19:49:07 <clarkb> it is almost that time of year again where we need to elect a service coordinator for OpenDev
19:49:34 <clarkb> In the meeting agenda I wrote down this proposal: Nominations Open From February 4, 2025 to February 18, 2025. Voting February 19, 2025 to February 26, 2025. All times and dates will be UTC based.
19:50:01 <clarkb> this is basically a year after the first election in 2024 so should line up to 6 months after the last election
19:50:56 <clarkb> if that schedule doesn't work for some reason (holiday, travel etc) please let me know between now and our next meeting, but I think we can probably make this plan official next week if nothing comes up before then
19:51:10 <tonyb> ++
19:51:23 <clarkb> and start thinking about whether or not you'd like to run. I'm happy to support anyone that may be interested in taking on the role.
19:51:43 <clarkb> #topic Beginning of the Year (Virtual) Meetup
19:52:14 <clarkb> and for the last agenda item I'd like to try and do something similar to the pre-ptg we did early 2024. I know we said we should do more of these and then we didn't... but I think doing something like that early in the year is a good idea at the very least
19:52:47 <tonyb> Sounds good to me
19:53:07 <clarkb> Looking at a calendar I think one of the last two weeks of January would work for me, so something like 21-23 or 28-30 ish
19:53:35 <clarkb> february is harder for me with random dentist and doctor appointments scattered through the month, though I'm sure we can make something work if January doesn't
19:53:54 <clarkb> any opinions on willingness / ability to participate and, if able, when works best?
19:54:39 <fungi> i've got some travel going on for the 15th through the 20th, but that should be doable for me
19:54:52 <fungi> in january i mean
19:54:53 * frickler is still very unclear on the ability part, will need to decide short term
19:55:00 <tonyb> 21-23 would be my preference as I can be more flexible with my awake hours that week, which may make it easier/possible to get us all in "one place"
19:55:42 <corvus> lunar new year is jan 29. early 20s sounds good.
19:56:44 <clarkb> ok let's pencil in the days of 21-23. I will start working on compiling some agenda content and then we can nail down what hours work best as we get closer and have a better understanding of total content
19:56:59 <clarkb> frickler: and I guess let me know when you have better clarity
19:57:10 <frickler> sure
19:57:12 <tonyb> clarkb: perfect
19:57:21 <clarkb> #topic Open Discussion
19:57:24 <clarkb> Anything else?
19:58:02 <corvus> gosh there's a lot of steps to set up a quay org
19:58:35 <clarkb> oh I was also going to try and bring up the h2 db thing upstream
19:58:45 <clarkb> just to see if any other gerrit folks have input in addition to hashar
19:58:51 <fungi> there were some extra steps just to (re)use my existing quay/rh account
19:59:22 <corvus> apparently there's a lot of "inviting" accounts and users to join teams, which means a lot of clicking buttons in emails
19:59:34 <corvus> some infra-root folks should have some email invites
19:59:41 <fungi> they seemed to want me to fit my job role and position into some preset list that didn't even have "other" options
19:59:53 <corvus> and we may need to revisit the set of infra-root that own these orgs
20:00:00 <clarkb> corvus: yup I see an invite
20:00:01 <fungi> i'm now an "it - operations, engineer"
20:00:11 <clarkb> I'll look at that after lunch
20:00:16 <corvus> oh, yes, the root account is now a "System Administrator" in "IT Operations"
20:00:22 <clarkb> haha
20:00:37 <clarkb> and we are at time
20:00:39 <tonyb> LOL
20:00:39 <clarkb> thank you everyone
20:00:47 <clarkb> #endmeeting