19:00:15 <clarkb> #startmeeting infra
19:00:15 <opendevmeet> Meeting started Tue Oct 22 19:00:15 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:15 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:15 <opendevmeet> The meeting name has been set to 'infra'
19:00:21 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QGD26LEKHTM3AI6HTETDZWG6NQVM7ALV/ Our Agenda
19:00:27 <clarkb> #topic Announcements
19:00:33 <clarkb> #link https://www.socallinuxexpo.org/scale/22x/events/open-infra-days CFP for Open Infra Days event at SCaLE is open until November 1
19:01:20 <clarkb> Sounds like the zuul presentation at the recent open infra days in Indiana was well received. I've been told I should encourage all y'all with good ideas to propose presentations for the SCaLE event
19:01:59 <clarkb> also as I just mentioned the PTG is happening this week
19:02:10 * frickler is watching with one eye or so
19:02:16 <clarkb> please be careful making changes particularly to meetpad or etherpad and ptgbot
19:02:27 <fungi> yeah, i didn't get a chance to attend any zuul talks at oid-na but was glad to see there were some
19:02:36 <corvus> o/
19:02:59 <clarkb> along those lines I put meetpad02 and jvb02 in the emergency file because jitsi meet just cut new releases and docker images. Having those servers in the emergency file should ensure that we don't update them when our daily infra-prod jobs run in about 6 hours
19:03:18 <clarkb> once the PTG is over we can remove those servers and let them upgrade normally. This just avoids any problems as meetpad has been working pretty well so far
19:03:45 <clarkb> #topic Zuul-launcher image builds
19:04:04 <clarkb> Diving right into the agenda I feel like I need to catch up on the state of things here. Anything new to report corvus ?
19:04:28 <corvus> at this point i think we can say the image build/upload is successful
19:05:09 <corvus> i think there is an opportunity to improve the throughput of the download/upload cycle on the launcher, but we're missing some log detail to confirm that
19:05:09 <fungi> yay!
19:05:27 <corvus> i even went as far as to try launching a node from the uploaded image
19:05:43 <fungi> i mean yay-success, of course, not yay-missing-logs ;)
19:05:44 <corvus> that almost worked, but the launcher didn't provide enough info to the executor to actually use the node
19:06:12 <corvus> technically we did run a job on a zuul-launcher-launched node, it just failed.  :)
19:06:33 <corvus> we just (right before this meeting) merged changes to address both of those things
19:06:43 <clarkb> ok was going to ask if the problem was in our configs or in zuul itself
19:06:47 <corvus> so i will retry the upload to get better logs, and retry the job to see what is to be seen there
19:07:20 <corvus> 2 things aside from the above:
19:08:11 <corvus> 1) i have a suspicion that the x-delete-after is not working.  maybe that's not honored by the swift cli when it's doing segmented uploads, or maybe the cloud doesn't support that.  i still need to confirm that with the most recent uploads, and then triage which of those things it is.
19:08:41 <corvus> 2) image build jobs for other images are still waiting for tonyb or someone to start on that (no rush, but it's ready for work to start whenever)
19:08:48 <corvus> oh bonus #3:
19:09:14 <corvus> 3) i don't think we're doing periodic builds yet; but we can; so i or someone should hook up the jobs to that pipeline (that's a simple zuul.yaml change)
19:09:29 <clarkb> re 1) probably a good idea to debug before we add a lot of image builds (just to keep the total amount of data as small as possible)
19:10:01 <corvus> yep -- though to be clear, we can work on the jobs for the other platforms and not upload the images yet
19:10:11 <corvus> (so #1 is not a blocker for #2)
19:10:22 <clarkb> got it, upload is a distinct step and we can start with simply doing builds
19:10:27 <corvus> ++
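As a point of reference for the x-delete-after question in item 1 above, this is roughly how an expiry header gets attached to a segmented upload with the swift CLI; the container name, object name, segment size, and TTL below are made up for illustration, and whether the header survives segmentation on this cloud is exactly what still needs confirming:

```shell
# Hypothetical example: upload a large image in 1 GiB segments and ask swift
# to expire it after 7 days. Names and values are illustrative only.
swift upload --segment-size 1073741824 \
  --header "X-Delete-After: 604800" \
  example-images debian-bookworm.qcow2

# Then check whether the expiry actually landed on the manifest object
# (it shows up as an X-Delete-At timestamp if honored).
swift stat example-images debian-bookworm.qcow2
```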
19:11:28 <corvus> i think that's about it for updates (i could yak longer, but that's the critical path)
19:11:35 <clarkb> thank you for the update
19:11:37 * tonyb promises to write at least one job this week
19:11:44 <clarkb> #topic OpenStack OpenAPI spec publishing
19:11:51 <clarkb> #link https://review.opendev.org/921934
19:12:09 <clarkb> I kept this on the agenda to make sure we don't lose track of it, and I was hoping to maybe catch up with them at the PTG but I'm not sure the timing will work out for that
19:12:42 <clarkb> the sdk team met yesterday during TC time
19:12:55 <clarkb> so maybe we just need to follow up after the PTG and see what is next
19:13:06 <clarkb> (there aren't any new comments in response to frickler or myself on the change)
19:13:09 <clarkb> any other thoughts on this?
19:13:42 <fungi> i have none
19:13:49 <clarkb> #topic Backup Server Pruning
19:14:19 <clarkb> we discussed options for reducing disk consumption on the smaller of the two backup servers two weeks ago, but then I went on a last-minute international trip and haven't had a chance to do it yet
19:14:45 <clarkb> good news is tomorrow is a quiet day in my PTG schedule so I'm hoping I can sit down and carefully trim out the backup targets for old/ancient/gone servers
19:14:50 <clarkb> ask01, ethercalc02, etherpad01, gitea01, lists, review-dev01, and review01
19:15:07 <clarkb> that is the list I'll be looking at; probably ethercalc to start since it seems the least impactful
19:15:18 <fungi> i think we already had consensus to remove those, but just to reiterate that list sounds good to me
19:15:57 <fungi> i'd volunteer to help but my dance card is full until at least mid-next week
19:16:00 <clarkb> ya between now and tomorrow is a good time to chime in if you think that we should replace the backing volume instead and keep those around or $otheridea
19:16:03 <tonyb> ++
19:16:13 <clarkb> but my intention is to simply clear those out and ensure we're recovering the expected disk space to start
19:16:59 <clarkb> we should have server snapshots and the other backup server too so risk seems low
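To make the plan above concrete, here is a rough sketch of the kind of cleanup involved; the repository path below is an assumption about the backup server layout, not the actual location, and ethercalc02 is just the first target clarkb mentioned:

```shell
# Hypothetical cleanup of a retired server's borg repository; the path is
# assumed for illustration. borg prompts for an explicit confirmation before
# deleting an entire repository.
du -sh /opt/backups/borg-ethercalc02
borg delete /opt/backups/borg-ethercalc02

# Confirm the expected space was actually recovered on the backup volume.
df -h /opt/backups
```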
19:17:21 <clarkb> #topic Updating Gerrit Cache Sizes
19:17:31 <clarkb> last Friday we upgraded Gerrit to pick up some bugfixes
19:17:43 <clarkb> when gerrit started up it complained about a number of caches being oversized and needing pruning
19:17:56 <clarkb> it turns out that gerrit prunes them automatically at 0100 but also on startup
19:18:27 <clarkb> https://paste.opendev.org/show/bk4pTIuQLCsWaF3dVVF7/ is the relevant logged output which shows several related caches were much larger than their configured sizes (defaults all)
19:18:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/932763 increase sizes for four gerrit caches
19:18:56 <clarkb> I pushed this change to update the cache sizes, based on the data in those logs and the documentation, to what I hope is a larger, more reasonable, and performant set of sizes
19:19:24 <clarkb> updating this config will require another gerrit restart so this isn't a rush. it may be good to try to get it done soon after the PTG though, since dev work should ramp up then and give us an idea of whether or not this is helpful
19:20:22 <clarkb> probably the main concern is that we're increasing the size of some memory caches too, but they seem clearly too small and that is likely impacting performance
19:20:25 <fungi> out of curiosity, i wonder if anyone has observed worse performance with the aggressively small cache target sizes
19:20:47 <fungi> but also no clue how recently this started complaining
19:20:56 <clarkb> fungi: I suspect that this is why we don't get diffs for a few minutes on gerrit startup. Gerrit is marking all of the cached data for those diffs as stale and it takes a while to repopulate
19:21:19 <fungi> does it persist caches over restarts? prune during startup?
19:21:38 <clarkb> fungi: the disk caches are persisted over restarts but it prunes them to the configured size on startup
19:21:49 <fungi> and also once a day
19:21:58 <clarkb> "Cache jdbc:h2:file:///var/gerrit/cache/gerrit_file_diff size (2.51g) is greater than maxSize (128.00m), pruning" basically all this content is marked invalid at startup
19:22:20 <clarkb> by increasing that cache size to 3g as proposed I suspect/hope that the next restart won't prune and we'll get diffs on startup
19:22:21 <fungi> so maybe if people have been observing sluggishness after 01z daily that could be an explanation
19:22:26 <clarkb> or if it prunes it will do so minimally
19:22:51 <fungi> that sounds like a great test
19:23:45 <clarkb> anyway comments welcome and definitely open to suggestions on size if we have different interpretations of the docs or concerns about memory consumption
19:23:58 <clarkb> and if we can reach general consensus a restart early next week would be great
19:24:33 <frickler> I already +2d, early next week sgtm
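For anyone reviewing 932763, the shape of the change is roughly the following; the snippet is illustrative rather than a copy of the change (it only shows the file-diff cache discussed above, while the real change touches four caches, and the site path is assumed):

```shell
# gerrit.config is git-config format, so a disk limit bump for the file
# diff cache could be expressed like this (site path assumed):
git config --file /var/gerrit/etc/gerrit.config \
  cache.gerrit_file_diff.diskLimit 3g

# After the restart, per-cache disk/memory usage and hit ratios can be
# inspected to see whether the pruning warnings go away:
ssh -p 29418 review.opendev.org gerrit show-caches
```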
19:24:55 <clarkb> #topic Upgrading old servers
19:25:04 <clarkb> tonyb: not sure if you are still around. Any updates on the wiki changes?
19:25:38 <clarkb> I don't see new patchsets. Any other updates?
19:26:08 <fungi> i ended up adding some ua filters to the existing set in order to hopefully get a handle on ai training scrapers overrunning it
19:26:16 <fungi> on the production server that is
19:26:55 <clarkb> oh ya tonyb mentioned those would need syncing as part of the redeployment
19:27:02 <fungi> tonyb mentioned adding those bots to the robots.txt in an update to his changes, since most of those bots should be well-behaved but the old server doesn't present a robots.txt at all
19:27:28 <fungi> i think the load average was up around 50 when i was looking into the problem
19:27:31 <clarkb> I'm guessing tonyb managed to go on that run so we don't need to wait around
19:27:46 <clarkb> fungi: I'm guessing that your edits improved things based on my ability to edit the agenda yesterday :)
19:27:59 <fungi> well, i also fully rebooted the server
19:28:41 <fungi> load average is still pretty high, around 10 at the moment, but the reboot did seem to fix the inability to authenticate via openid
19:29:24 <fungi> anyway, the sooner we're able to move forward with the container replacement, the easier this all gets
19:29:40 <clarkb> and until the AI training wars subside we're likely to need to make continuous updates
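As a rough illustration of the robots.txt approach tonyb mentioned for the well-behaved crawlers (the user agents listed are common AI scrapers, not necessarily the exact set in his change, and the badly-behaved ones still need the Apache UA filters fungi added on the server):

```shell
# Hypothetical robots.txt fragment; the real bot list may differ.
cat <<'EOF' >> robots.txt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
EOF
```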
19:31:04 <clarkb> #topic Docker compose plugin with podman service for servers
19:31:24 <clarkb> I don't think anyone has pushed up a change to start testing this with say paste/lodgeit but that is the current proposed plan
19:31:39 <clarkb> if I'm wrong about that please correct me and point out what needs reviewing or if there are any other questions
19:31:54 <corvus> i share that understanding
19:31:58 <fungi> i don't recall seeing a change yet
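For whoever picks up the paste/lodgeit test, the general shape of the approach is the standard podman-socket setup sketched below; this is not anything we have committed anywhere yet, just the usual way the docker CLI and compose plugin get pointed at podman:

```shell
# Run podman's Docker-compatible API socket, point the docker CLI and the
# compose plugin at it, then bring up the existing compose file unchanged.
# Illustrative only; our actual role/playbook changes are still to be written.
sudo systemctl enable --now podman.socket
export DOCKER_HOST=unix:///run/podman/podman.sock
docker compose up -d
```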
19:32:40 <clarkb> #topic Open Discussion
19:32:45 <clarkb> Anything else?
19:33:02 <fungi> i've got nothing
19:34:07 <clarkb> it may be worth mentioning that I'll be out around veterans day weekend. I can't remember if I'm back Tuesday or Wednesday though
19:34:21 <clarkb> also thanksgiving is about a month away for those of us in the US
19:34:47 <frickler> EU switches back from DST next sunday
19:35:00 <clarkb> looks like I'll be back tuesday so no missed meeting for me and I expect to be around tuesday before thanksgiving
19:35:05 <fungi> i think it's a couple of weeks out that the usa does the same
19:35:20 <clarkb> yes we're a week later than the EU
19:35:35 <fungi> november 3, yep
19:36:21 <clarkb> keep those date changes in mind and as far as I can tell we should have meetings for the next month and a half or so
19:36:28 <clarkb> s/date/timezone/
19:36:45 <clarkb> I'll give it a few more minutes but we can end early if there is nothing else
19:38:59 <clarkb> thank you for your time everyone! have a productive PTG and we'll see you back here next week
19:39:02 <clarkb> #endmeeting