19:00:05 <clarkb> #startmeeting infra
19:00:05 <opendevmeet> Meeting started Tue Jan 7 19:00:05 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:05 <opendevmeet> The meeting name has been set to 'infra'
19:00:17 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/M5YEAMYDHKFBK7BJKHIQVFKGXJEW3KZF/ Our Agenda
19:00:23 <clarkb> #topic Announcements
19:00:35 <clarkb> Welcome to 2025. Apologies for interrupting the winter slumber
19:00:54 <tonyb> I won't be around much next week
19:01:17 <fungi> (or summer slumber in tonyb's case)
19:01:27 <clarkb> tonyb: ack thanks for the heads up
19:01:28 <tonyb> (Taking an actual vacation)
19:01:52 <clarkb> I didn't really have anything to announce today. Anyone else with an announcement?
19:02:24 <clarkb> #topic Zuul-launcher image builds
19:02:41 <clarkb> corvus: I feel like the holidays were just long enough that I've forgotten where we were on this
19:02:52 <clarkb> I know there was the raw image handling and testing
19:03:40 <corvus> mostly waiting on the #niz stack (which is in flaky-test fixing mode now); reviews welcome but not necessary at this time
19:03:46 <fungi> i wiped my brain clean like an etch-a-sketch, so don't feel bad
19:03:48 <corvus> that stack adds the web ui and more image lifecycle stuff
19:03:56 <corvus> so it'll be good to have that in place before more live testing
19:04:10 <clarkb> corvus: did the API stuff land (and presumably get deployed through our weekly deployments)?
19:04:25 <corvus> nope, that's interleaved in that stack
19:04:47 <corvus> it's completely separate so i think it's actually okay to do a single-phase merge for that instead of our usual two-phase
19:05:19 <clarkb> ok /me scribbles a note to try and review that stack if paperwork gets done
19:05:37 <corvus> more urgent than that though is the dockerhub image stuff :)
19:05:46 <clarkb> yup that has its own agenda item today
19:05:52 <corvus> i don't relish the idea of trying to get all that merged before that :)
19:06:12 <clarkb> anything else on this topic or should we continue on so that we can get to the container image mirroring?
19:06:22 <corvus> continue i think
19:06:29 <clarkb> #topic Deploying new Noble Servers
19:06:52 <clarkb> my podman prep change that updates install-docker to install podman and docker compose on Noble landed last week
19:07:13 <clarkb> we don't have any noble servers running containers yet so that was a noop (and I spot checked the deployments to ensure I didn't miss anything)
19:07:28 <clarkb> but that means the next step is to deploy a new server that uses containers on Noble
19:07:44 <fungi> what's a good candidate?
19:07:46 <clarkb> my hope is that I'll get beginning of the year paperwork stuff done tomorrowish and I can start on a new paste deployment late this week
19:07:49 <tonyb> I did test my ansible-devel changes which use install-docker on noble
19:07:51 <fungi> lodgeit/paste?
19:07:58 <clarkb> fungi: my plan is paste since that is representative with a database but also small and simple
19:08:00 <fungi> or maybe mediawiki
19:08:11 <fungi> yeah, paste feels right as a canary
19:08:37 <clarkb> tonyb: have you seen any problems with it or is it working in CI for your ansible-devel stuff?
19:09:05 <clarkb> I guess that is the main callout on this topic today. The big change is in but I haven't put it to use in production yet. If you do put it to use and have feedback that is very much welcome
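For context, a minimal sketch of the kind of conditional the install-docker change introduces on Noble; the task layout and the docker-compose-v2 package name are illustrative assumptions, not the actual role code in opendev/system-config:

    # Illustrative only: approximates the behavior described above.
    # The real install-docker role may structure this differently.
    - name: Install podman and docker compose on Ubuntu Noble
      ansible.builtin.package:
        name:
          - podman
          - docker-compose-v2   # assumed package; provides "docker compose"
        state: present
      when:
        - ansible_facts['distribution'] == 'Ubuntu'
        - ansible_facts['distribution_release'] == 'noble'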
19:09:22 <tonyb> clarkb: No issues, but I don't actually *use* compose; it installs though
19:09:38 <fungi> is there some possibility of the ansible-devel job getting back into a passing state again? that's exciting
19:09:40 <clarkb> ya I think on bridge (which ansible-devel works on) it's minimal use of containers
19:10:53 <tonyb> fungi: if nothing else it sets somewhat of a timeline for dropping xenial and moving bridge to noble
19:11:10 <clarkb> so ya let me know if you notice oddities, but I think we can continue on
19:11:13 <fungi> that would be great
19:11:16 <tonyb> both of which are needed before we can update the ansible version
19:11:34 <clarkb> #topic Upgrading old servers
19:11:43 <clarkb> the discussion seems to be trending into this topic anyway
19:11:55 <clarkb> ya the issue with newer ansible is it won't run on remote hosts with older python
19:12:13 <clarkb> historically ansible has maintained a wide array of python support for the remote side (the control side was more restricted)
19:12:27 <clarkb> but that has changed recently with a quite reduced set of supported python versions
19:13:34 <clarkb> anyway, other than the noble work above is there anything else to be aware of for upgrading servers? I think we're going to end up needing to focus on this for the first half of the year
19:13:51 <clarkb> but I suspect that a lot of those jumps will be to noble so getting that working well upfront seems worthwhile
19:13:56 <tonyb> I wanted to say I started looking at mediawiki again
19:14:26 <tonyb> I'd like to send an announcement to service-announce this week
19:14:31 <tonyb> #link https://etherpad.opendev.org/p/opendev-wiki-announce
19:14:46 <clarkb> tonyb: are there new patches to review yet? (I haven't followed the irc notifications too closely the last few weeks)
19:14:56 <clarkb> but ya announcing that soonish seems good
19:15:22 <tonyb> No new patches, but I addressed a bunch of feedback yesterday
19:16:29 <frickler> I've looked at some old content on the wiki recently, and I do wonder whether it would be better to start fresh
19:16:37 <clarkb> ok I'll try to catch back up on the announcement and the review stack today or tomorrow-ish so that the end-of-week announcement schedule isn't held up by me (though I had looked at them previously and it's probably fine to proceed)
19:16:46 <frickler> and possibly just leave a read-only copy available somewhere
19:17:10 <JayF> As long as that read-only copy will exist as long as the new wiki will, that sounds like an excellent idea
19:17:28 <fungi> tonyb: exciting that you're so close! announcement looks fine other than shoehorning you into using jammy, you might end up wanting noble depending on the timeline
19:17:36 <tonyb> where were you two a year ago ;P
19:18:01 <clarkb> I think I'm on the fence about starting over
19:18:10 <corvus> i don't think we should start over
19:18:12 <clarkb> anyone can simply redo any old content on the wiki as it stands, and that avoids needing to maintain two wikis
19:18:30 <corvus> clarkb: ++
19:18:37 <clarkb> and it's not like starting over prevents things from becoming stale all over again. The fundamental problem is that the content needs curation, and that doesn't change
19:18:53 <fungi> i don't see starting over as a solution, it doesn't solve the problem of the wiki containing old and abandoned content, merely resets the starting point for that
19:19:18 <JayF> That's true in the most general of senses, but the reality is we have years-old content in many places. That content is unlikely to *ever* be curated, so having it not carried over to confuse people would be nice.
19:19:19 <fungi> it will continue to be a problem
19:19:36 <clarkb> JayF: right, but anyone can just go and archive that content today right?
19:19:38 <JayF> It's a lot easier to leave something behind than it is to delete something -- it's very hard to know when it's appropriate to remove it
19:19:47 <clarkb> we don't need to start over and host a special frozen wiki
19:19:47 <tonyb> fungi: I think I'd like to stick with Jammy and do the OS change once we're on the latest (mediawiki) LTS
19:20:35 <clarkb> (mediawiki archives things when you delete them aiui, so you can just delete them if you're 80% certain they should be deleted)
19:20:48 <fungi> better would be to actively delete any concerning old content, since starting from scratch doesn't prevent new content from ceasing to be cared for and ending up in the same state. or do we need semi-annual domain changes in order to force content refreshes?
19:21:08 <corvus> if, let's say, the openstack project feels pretty strongly about reducing the confusion caused by outdated articles, one approach would be a one-time mass deletion/archiving of those articles.
19:21:46 <corvus> or, something more like what wikipedia does: mass-additions of "this may be outdated" banners to articles.
19:21:55 <JayF> fungi: wiki.dalmatian.openstack.org here we come? /s
19:22:15 <JayF> I think a one-time mass update with "this is old" banners, or archiving old information, is a good idea
19:22:24 <fungi> there's probably a mw plugin to automatically insert admonitions in pages that haven't seen an edit in x years
19:22:31 <clarkb> https://www.mediawiki.org/wiki/Manual:Archive_table says things get archived when deleted
19:22:47 <clarkb> so ya I think we can keep the current content and its curators can delete as necessary
19:23:06 <clarkb> hashar can probably confirm that for us, and you can test with a throwaway page
19:23:14 <frickler> makes sense, I'll try to take a look at that
19:23:24 <tonyb> frickler: Thanks.
19:23:51 <clarkb> alright, anything else related to server upgrades?
19:24:32 <tonyb> not from me
19:24:53 <clarkb> #topic Mirroring Useful Container Images
19:25:05 <clarkb> the docker hub rate limit problems continue to plague us and others
19:25:31 <clarkb> corvus has made progress in setting up jobs to mirror useful images from docker hub to another registry (quay in this case) to alleviate the problem
19:25:39 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/938508 Initial change to mirror images that are generically useful
19:26:41 <clarkb> I have some thoughts on improving the tags that are mirrored, but I think that is good for a followup
19:26:50 <clarkb> for a first pass we should start smallish and make sure everything is working first?
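The core of such a mirroring job is essentially a registry-to-registry copy; a minimal sketch with skopeo, using an assumed image/tag and the destination org the meeting settles on below (the actual job definitions live in 938508 and may be structured quite differently):

    # Copy all architectures of one tag from Docker Hub to quay.io.
    # Image, tag, and org are placeholders; auth flags omitted.
    skopeo copy --all \
        docker://docker.io/library/python:3.12-bookworm \
        docker://quay.io/opendevmirror/python:3.12-bookworm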
19:27:36 <clarkb> corvus: those jobs will run in the periodic pipeline which means they will trigger at 0200 utc iirc and then fight for resources with all of the rest of the periodic jobs
19:28:05 <clarkb> just wondering if we should be careful when we merge that so that we can check nothing is exposed and quickly regen the api key if that does happen?
19:28:39 <corvus> yeah... do we have an earlier pipeline so we can jump the gun? :) or we can just see if the time is a problem.
19:29:04 <clarkb> corvus: we have the hourly opendev pipeline which might work temporarily? Also I realized that the python- images already have images in opendevorg
19:29:16 <corvus> they must be out of date though
19:29:19 <clarkb> I don't think those tags exist so I'm not sure that is a problem, but wondering if we should be careful with those collisions to start
19:29:29 <clarkb> ya they would be old versions
19:30:03 <clarkb> gerrit will collide too
19:30:14 <frickler> downstream I added a second daily pipeline that runs 3h earlier for image builds that are then consumed by "normal" daily jobs, maybe we want to do something similar?
19:30:18 <corvus> i think mirroring existing or non-existing tags is what we want...
19:30:37 <clarkb> corvus: ya I think it's fine for python-
19:30:39 <corvus> clarkb: interesting point on gerrit; maybe we should prefix with "mirror" or something? or even make a new org?
19:30:56 <clarkb> corvus: ya I think for things like gerrit we need to namespace further, either with a prefix or a new org
19:31:09 <clarkb> we could also use different tags but I suspect that would be more confusing
19:31:32 <corvus> how about new org: opendevmirror?
19:31:37 <clarkb> corvus: I like that
19:31:51 <fungi> it's clear enough as to what it is, wfm
19:32:08 <corvus> frickler: ack; sounds like a good solution if 0200 is a problem
19:32:27 <frickler> +1 to opendevmirror
19:32:31 <clarkb> ok so make a new org to namespace things and avoid collisions with stuff opendev wants to host itself eventually. Keep the initial list small like we've got currently. Then follow up with additional tags etc
19:32:48 <tonyb> ++ on opendevmirror
19:32:52 <fungi> adding a new timer trigger pipeline is cheap if we decide there is sufficient need to warrant it
19:33:25 <corvus> sounds good; any of those images we want to say don't belong there and should instead be handled by a job in the zuul tenant?
19:33:30 <frickler> another reason to do that: run before the normal periodic rush eats up more rate limits?
19:33:48 <fungi> i guess it's worth monitoring for failures to decide on the separate pipeline
19:33:59 <clarkb> corvus: I think the only one that opendev doesn't consume today is httpd
19:34:05 <corvus> frickler: yeah, that's the main problem i could see from using 0200, but don't know how bad it will be yet
19:34:07 <clarkb> corvus: but that seems generic enough that I'm happy for us to have it
19:34:15 <clarkb> (we use the gerrit image in gerritlib testing iirc)
19:34:54 <corvus> #action corvus make opendevmirror quay.org and update 938508
19:35:08 <corvus> #undo
19:35:18 <corvus> #action corvus make opendevmirror quay.io org and update 938508
19:35:56 <clarkb> anything else on this topic?
19:35:57 <fungi> wrt separate vs existing pipeline, i have no objection other than not wanting to prematurely overengineer it
19:36:09 <corvus> fungi: ++
19:36:31 <clarkb> ya I wouldn't go out of our way to add a pipeline just yet, but if an alternative to periodic already exists that might be a good option to start
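If 0200 contention does become a problem, frickler's earlier-daily-pipeline idea might look roughly like this in Zuul configuration; the pipeline name and trigger time here are assumptions for illustration, not a proposed change:

    # Hypothetical second daily pipeline firing three hours before the
    # normal periodic rush, so mirror jobs run before rate limits are eaten.
    - pipeline:
        name: periodic-mirror
        manager: independent
        precedence: low
        trigger:
          timer:
            - time: '0 23 * * *'   # 23:00 UTC, 3h ahead of periodic's 02:00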
19:37:13 <clarkb> ok there are a few more things I want to get into so let's keep moving
19:37:15 <clarkb> #topic Gerrit H2 Cache File Growth
19:37:31 <clarkb> Just before we all enjoyed some holiday time off we restarted Gerrit thinking it would be fine since we weren't changing the image
19:37:47 <fungi> it's just a restart, wcpgw?
19:37:54 <clarkb> turns out we were wrong, and the underlying issue appears to be the growth of the git_file_diff and gerrit_file_diff h2 database cache backing files
19:37:59 <clarkb> one of them was over 200GB iirc
19:38:25 <clarkb> on startup Gerrit attempts to do db cleanup to prune caches down to size, but this only affects the content within the db and not the db file itself
19:38:51 <clarkb> however I suspect that h2 performs very poorly when the backing file is that size, and we had problems. We stopped Gerrit again and then moved the caches aside, forcing gerrit to start over with clean cache files
19:39:01 <clarkb> last I checked those cache files had already regrown to about 20GB in size
19:39:01 <fungi> thankfully hashar was no stranger to this problem
19:39:25 <clarkb> ya hashar points out the default h2 compaction time is like 200ms, which isn't enough for files of this size to be compacted down to a reasonable size
19:39:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/938000 Suggested workaround from Hashar improves compaction when we do shutdown
19:39:49 <clarkb> hashar's suggestion is that we allow compaction to run for up to 15 seconds instead. This compaction only runs on Gerrit shutdown though
19:40:16 <clarkb> which means if we aren't shutting down Gerrit often the files could still grow quite a bit. I'm thinking it's still a good toggle to change though as it should help when we do shut down
19:40:32 <fungi> and which i suppose could get skipped also in an unplanned outage
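For reference, the knob in question is H2's MAX_COMPACT_TIME database setting (default 200 ms, matching the figure hashar cites), which can be raised via the database URL; exactly how 938000 wires this into Gerrit's cache configuration may differ from this sketch, and the path below is illustrative:

    # H2 compacts for at most MAX_COMPACT_TIME ms when the database closes;
    # 15000 ms gives large cache files a real chance to shrink on shutdown.
    jdbc:h2:/var/gerrit/cache/git_file_diff;MAX_COMPACT_TIME=15000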
19:40:40 <clarkb> but it's also got me thinking maybe we should revert my changes that allowed those caches to be bigger (pruned daily but still leading to fragmentation on disk)
19:41:06 <fungi> have we observed performance improvements from the larger caches?
19:41:11 <clarkb> the goal with ^ is that maybe daily pruning with smaller limits will reduce the fragmentation of the h2 backing files which leads to the growth
19:41:18 <clarkb> fungi: not really, no
19:41:33 <clarkb> fungi: I was hoping that doing so would speed up gerrit startups because it wouldn't need to prune so much on startup
19:41:35 <fungi> i don't object to longer shutdown times if there's a runtime performance improvement to balance them out
19:41:46 <clarkb> but it seems the slowness may have been related to the size of the backing file all along
19:42:24 <corvus> if we revert your change, would that actually help this? the db may still produce just as much fragmentation garbage
19:42:56 <clarkb> corvus: right, it isn't clear the revert would help significantly since the cache is only pruned once a day and the sizes I picked were based on ~1 day of growth anyway
19:43:28 <fungi> also all of this occurred during what is traditionally our slowest activity time of the year
19:43:55 <fungi> so we may not have great anecdotal experiences with its impact either way
19:43:57 <clarkb> the limit is 2GB today on one of them, which means that is our floor. In theory it may grow up to 4GB in size before its daily pruning down to 2GB. If we revert my change the old limit was 256MB iirc, so we'd prune from 2GB-ish down to 256MB-ish
19:44:15 <clarkb> but I'm happy to change one thing at a time if we just want to start with increasing the compaction time
19:44:42 <clarkb> that would look something like updating the configs for that h2 setting, stopping gerrit, starting gerrit, then probably stopping gerrit again to see if compaction does what we expect before starting gerrit again
19:44:59 <clarkb> a bit back and forth/flappy, but I think important to actually observe the improvement
19:45:25 <fungi> sounds fine to me
19:45:34 <clarkb> anyway, if that seems reasonable leave a review on the change above (938000) and I'm happy to approve the change and drive those restarts at an appropriate time
19:45:46 <fungi> i doubt our user experience is granular enough to notice the flapping
19:46:13 <clarkb> looks like corvus and tonyb already voted in favor so I'll proceed with the change-one-thing-at-a-time plan for now, and that one thing is increased compaction time
19:46:21 <clarkb> #topic Rax-ord Noble Nodes with 1 VCPU
19:46:48 <clarkb> I've kept this agenda item because I wanted to follow up and check if anyone had looked into a sanity check for our base pre playbook to early-fail instances with only one vcpu on rax xen
19:47:15 <clarkb> I suspect this is a straightforward bit of ansible that looks at ansible facts, but we do want to be careful to test it with base-test first to avoid unexpected fallout
19:48:47 <clarkb> sounds like no. That's fine, and the problem is intermittent. I'll probably drop this from next week's agenda and we can put it back if we need to (eg further debugging or the problem gets worse)
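Since the log says this is a straightforward fact-based check, here is a rough sketch of what such a task in the base pre playbook could look like; the fact-based condition and wording are assumptions about how the situation would be detected, not reviewed code:

    # Hypothetical base-pre sanity check: abort early on Xen nodes that
    # booted with only one VCPU so the job is retried on a fresh node.
    - name: Fail early on misbooted single-VCPU nodes
      ansible.builtin.assert:
        that:
          - ansible_facts['processor_vcpus'] | int > 1
        fail_msg: >-
          Node has only one VCPU; likely a bad rax-ord Xen boot,
          failing early before running the job payload.
      when: ansible_facts['virtualization_type'] | default('') == 'xen'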
19:48:54 <clarkb> #topic Service Coordinator Election
19:49:07 <clarkb> it is almost that time of year again where we need to elect a service coordinator for OpenDev
19:49:34 <clarkb> In the meeting agenda I wrote down this proposal: Nominations Open From February 4, 2025 to February 18, 2025. Voting February 19, 2025 to February 26, 2025. All times and dates will be UTC based.
19:50:01 <clarkb> this is basically a year after the first election in 2024 so should line up to 6 months after the last election
19:50:56 <clarkb> if that schedule doesn't work for some reason (holiday, travel etc) please let me know between now and our next meeting, but I think we can probably make this plan official next week if nothing comes up before then
19:51:10 <tonyb> ++
19:51:23 <clarkb> and start thinking about whether or not you'd like to run. I'm happy to support anyone that may be interested in taking on the role.
19:51:43 <clarkb> #topic Beginning of the Year (Virtual) Meetup
19:52:14 <clarkb> and for the last agenda item I'd like to try and do something similar to the pre-ptg we did early 2024. I know we said we should do more of these and then we didn't... but I think doing something like that early in the year is a good idea at the very least
19:52:47 <tonyb> Sounds good to me
19:53:07 <clarkb> Looking at a calendar I think one of the last two weeks of January would work for me, so something like 21-23 or 28-30 ish
19:53:35 <clarkb> february is harder for me with random dentist and doctor appointments scattered through the month, though I'm sure we can make something work if January doesn't
19:53:54 <clarkb> any opinions on willingness / ability to participate and, if able, when works best?
19:54:39 <fungi> i've got some travel going on for the 15th through the 20th, but that should be doable for me
19:54:52 <fungi> in january i mean
19:54:53 * frickler is still very unclear on the ability part, will need to decide short term
19:55:00 <tonyb> 21-23 would be my preference as I can be more flexible with my awake hours that week, which may make it easier/possible to get us all in "one place"
19:55:42 <corvus> lunar new year is jan 29. early 20s sounds good.
19:56:44 <clarkb> ok let's pencil in the days of 21-23. I will start working on compiling some agenda content and then we can nail down what hours work best as we get closer and have a better understanding of total content
19:56:59 <clarkb> frickler: and I guess let me know when you have better clarity
19:57:10 <frickler> sure
19:57:12 <tonyb> clarkb: perfect
19:57:21 <clarkb> #topic Open Discussion
19:57:24 <clarkb> Anything else?
19:58:02 <corvus> gosh there's a lot of steps to set up a quay org
19:58:35 <clarkb> oh I was also going to try and bring up the h2 db thing upstream
19:58:45 <clarkb> just to see if any other gerrit folks have input in addition to hashar
19:58:51 <fungi> there were some extra steps just to (re)use my existing quay/rh account
19:59:22 <corvus> apparently there's a lot of "inviting" accounts and users to join teams, which means a lot of clicking buttons in emails
19:59:34 <corvus> some infra-root folks should have some email invites
19:59:41 <fungi> they seemed to want me to fit my job role and position into some preset list that didn't even have "other" options
19:59:53 <corvus> and we may need to revisit the set of infra-root that own these orgs
20:00:00 <clarkb> corvus: yup I see an invite
20:00:01 <fungi> i'm now an "it - operations, engineer"
20:00:11 <clarkb> I'll look at that after lunch
20:00:16 <corvus> oh, yes, the root account is now a "System Administrator" in "IT Operations"
20:00:22 <clarkb> haha
20:00:37 <clarkb> and we are at time
20:00:39 <tonyb> LOL
20:00:39 <clarkb> thank you everyone
20:00:47 <clarkb> #endmeeting