19:00:15 <clarkb> #startmeeting infra
19:00:15 <opendevmeet> Meeting started Tue Oct 22 19:00:15 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:15 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:15 <opendevmeet> The meeting name has been set to 'infra'
19:00:21 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QGD26LEKHTM3AI6HTETDZWG6NQVM7ALV/ Our Agenda
19:00:27 <clarkb> #topic Announcements
19:00:33 <clarkb> #link https://www.socallinuxexpo.org/scale/22x/events/open-infra-days CFP for Open Infra Days event at SCaLE is open until November 1
19:01:20 <clarkb> Sounds like the zuul presentation at the recent open infra days in indiana was well received. I've been told I should encourage all y'all with good ideas to propose presentations for the SCaLE event
19:01:59 <clarkb> also as I just mentioned the PTG is happening this week
19:02:10 * frickler is watching with one eye or so
19:02:16 <clarkb> please be careful making changes, particularly to meetpad or etherpad and ptgbot
19:02:27 <fungi> yeah, i didn't get a chance to attend any zuul talks at oid-na but was glad to see there were some
19:02:36 <corvus> o/
19:02:59 <clarkb> along those lines I put meetpad02 and jvb02 in the emergency file because jitsi meet just cut new releases and docker images. Having those servers in the emergency file should ensure that we don't update when our daily infra-prod jobs run in about 6 hours
19:03:18 <clarkb> once the PTG is over we can remove those servers and let them upgrade normally. This just avoids any problems as meetpad has been working pretty well so far
19:03:45 <clarkb> #topic Zuul-launcher image builds
19:04:04 <clarkb> Diving right into the agenda, I feel like I need to catch up on the state of things here. Anything new to report, corvus?
19:04:28 <corvus> at this point i think we can say the image build/upload is successful
19:05:09 <corvus> i think there is an opportunity to improve the throughput of the download/upload cycle on the launcher, but we're missing some log detail to confirm that
19:05:09 <fungi> yay!
19:05:27 <corvus> i even went as far as to try launching a node from the uploaded image
19:05:43 <fungi> i mean yay-success, of course, not yay-missing-logs ;)
19:05:44 <corvus> that almost worked, but the launcher didn't provide enough info to the executor to actually use the node
19:06:12 <corvus> technically we did run a job on a zuul-launcher-launched node, it just failed. :)
19:06:33 <corvus> we just (right before this meeting) merged changes to address both of those things
19:06:43 <clarkb> ok, I was going to ask if the problem was in our configs or in zuul itself
19:06:47 <corvus> so i will retry the upload to get better logs, and retry the job to see what is to be seen there
19:07:20 <corvus> 2 things aside from the above:
19:08:11 <corvus> 1) i have a suspicion that the x-delete-after is not working. maybe that's not honored by the swift cli when it's doing segmented uploads, or maybe the cloud doesn't support that. i still need to confirm that with the most recent uploads, and then triage which of those things it is.
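A minimal sketch of how that suspicion could be spot-checked with the swift CLI, assuming placeholder container and object names rather than the real image upload paths:

    # upload with a 24 hour expiry, split into 1GiB segments
    swift upload --segment-size 1073741824 --header "X-Delete-After: 86400" my-images my-image.qcow2

    # the manifest object should report an X-Delete-At timestamp if the expiry was applied
    swift stat my-images my-image.qcow2

    # segments default to a separate <container>_segments container; check one of those too
    swift list my-images_segments
    swift stat my-images_segments <one-of-the-listed-segment-objects>
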
19:08:41 <corvus> 2) image build jobs for other images are still waiting for tonyb or someone to start on that (no rush, but it's ready for work to start whenever)
19:08:48 <corvus> oh, bonus #3:
19:09:14 <corvus> 3) i don't think we're doing periodic builds yet; but we can; so i or someone should hook up the jobs to that pipeline (that's a simple zuul.yaml change)
19:09:29 <clarkb> re 1) probably a good idea to debug before we add a lot of image builds (just to keep the total amount of data as small as possible)
19:10:01 <corvus> yep -- though to be clear, we can work on the jobs for the other platforms and not upload the images yet
19:10:11 <corvus> (so #1 is not a blocker for #2)
19:10:22 <clarkb> got it, upload is a distinct step and we can start with simply doing builds
19:10:27 <corvus> ++
19:11:28 <corvus> i think that's about it for updates (i could yak longer, but that's the critical path)
19:11:35 <clarkb> thank you for the update
19:11:37 * tonyb promises to write at least one job this week
19:11:44 <clarkb> #topic OpenStack OpenAPI spec publishing
19:11:51 <clarkb> #link https://review.opendev.org/921934
19:12:09 <clarkb> I kept this on the agenda to make sure we don't lose track of it, and I was hoping to maybe catch up at the PTG, but I'm not sure timing will work out for that
19:12:42 <clarkb> the sdk team met yesterday during TC time
19:12:55 <clarkb> so maybe we just need to follow up after the PTG and see what is next
19:13:06 <clarkb> (there aren't any new comments in response to frickler or myself on the change)
19:13:09 <clarkb> any other thoughts on this?
19:13:42 <fungi> i have none
19:13:49 <clarkb> #topic Backup Server Pruning
19:14:19 <clarkb> we discussed options for reducing disk consumption on the smaller of the two backup servers 2 weeks ago, then I went on a last-minute international trip and haven't had a chance to do that
19:14:45 <clarkb> good news is tomorrow is a quiet day in my PTG schedule, so I'm hoping I can sit down and carefully trim out the backup targets for old/ancient/gone servers
19:14:50 <clarkb> ask01, ethercalc02, etherpad01, gitea01, lists, review-dev01, and review01
19:15:07 <clarkb> that is the list I'll be looking at, probably ethercalc to start since it seems the least impactful
19:15:18 <fungi> i think we already had consensus to remove those, but just to reiterate, that list sounds good to me
19:15:57 <fungi> i'd volunteer to help but my dance card is full until at least mid-next week
19:16:00 <clarkb> ya, between now and tomorrow is a good time to chime in if you think that we should replace the backing volume instead and keep those around, or $otheridea
19:16:03 <tonyb> ++
19:16:13 <clarkb> but my intention is to simply clear those out and ensure we're recovering the expected disk space to start
19:16:59 <clarkb> we should have server snapshots and the other backup server too, so risk seems low
19:17:21 <clarkb> #topic Updating Gerrit Cache Sizes
19:17:31 <clarkb> last Friday we upgraded Gerrit to pick up some bugfixes
19:17:43 <clarkb> when gerrit started up it complained about a number of caches being oversized and needing pruning
19:17:56 <clarkb> it turns out that gerrit prunes them automatically at 0100 but also on startup
19:18:27 <clarkb> https://paste.opendev.org/show/bk4pTIuQLCsWaF3dVVF7/ is the relevant logged output, which shows several related caches were much larger than their configured sizes (all defaults)
19:18:32 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/932763 increase sizes for four gerrit caches
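As a rough illustration of the kind of settings involved (gerrit_file_diff is the cache named in the pruning output linked above; the exact limits and the other three caches touched by the actual change may differ), persistent cache limits are raised in gerrit.config along these lines:

    [cache "gerrit_file_diff"]
        diskLimit = 3g
    [cache "git_file_diff"]
        # second cache shown for illustration only
        diskLimit = 3g
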
19:18:56 <clarkb> I pushed this change to update the cache sizes, based on the data in those logs and the documentation, to what I hope is a larger, more reasonable, and performant set of sizes
19:19:24 <clarkb> updating this config will require another gerrit restart, so this isn't a rush. May be good to try and get it done after the PTG though, as dev work should ramp up and give us an idea of whether or not this is helpful
19:20:22 <clarkb> probably the main concern is that we're increasing the size of some memory caches too, but they seem clearly too small and this is likely impacting performance
19:20:25 <fungi> out of curiosity, i wonder if anyone has observed worse performance with the aggressively small cache target sizes
19:20:47 <fungi> but also no clue how recently this started complaining
19:20:56 <clarkb> fungi: I suspect that this is why we don't get diffs for a few minutes on gerrit startup. Gerrit is marking all of the cached data for those diffs as stale and it takes a while to repopulate
19:21:19 <fungi> does it persist caches over restarts? prune during startup?
19:21:38 <clarkb> fungi: the disk caches are persisted over restarts, but it prunes them to the configured size on startup
19:21:49 <fungi> and also once a day
19:21:58 <clarkb> "Cache jdbc:h2:file:///var/gerrit/cache/gerrit_file_diff size (2.51g) is greater than maxSize (128.00m), pruning" basically all this content is marked invalid at startup
19:22:20 <clarkb> by increasing that cache size to 3g as proposed, I suspect/hope that the next restart won't prune and we'll get diffs on startup
19:22:21 <fungi> so maybe if people have been observing sluggishness after 01z daily that could be an explanation
19:22:26 <clarkb> or if it prunes it will do so minimally
19:22:51 <fungi> that sounds like a great test
19:23:45 <clarkb> anyway, comments welcome and definitely open to suggestions on sizes if we have different interpretations of the docs or concerns about memory consumption
19:23:58 <clarkb> and if we can reach general consensus, a restart early next week would be great
19:24:33 <frickler> I already +2d, early next week sgtm
19:24:55 <clarkb> #topic Upgrading old servers
19:25:04 <clarkb> tonyb: not sure if you are still around. Any updates on the wiki changes?
19:25:38 <clarkb> I don't see new patchsets. Any other updates?
19:26:08 <fungi> i ended up adding some ua filters to the existing set in order to hopefully get a handle on ai training scrapers overrunning it
19:26:16 <fungi> on the production server, that is
19:26:55 <clarkb> oh ya, tonyb mentioned those would need syncing as part of the redeployment
19:27:02 <fungi> tonyb mentioned adding those bots to the robots.txt in an update to his changes, since most of those bots should be well-behaved, but the old server doesn't present a robots.txt at all
19:27:28 <fungi> i think the load average was up around 50 when i was looking into the problem
19:27:31 <clarkb> I'm guessing tonyb managed to go on that run so we don't need to wait around
19:27:46 <clarkb> fungi: I'm guessing that your edits improved things, based on my ability to edit the agenda yesterday :)
19:27:59 <fungi> well, i also fully rebooted the server
19:28:41 <fungi> load average is still pretty high, around 10 at the moment, but the reboot did seem to fix the inability to authenticate via openid
19:29:24 <fungi> anyway, the sooner we're able to move forward with the container replacement, the easier this all gets
19:29:40 <clarkb> and until the AI training wars subside we're likely to need to make continuous updates
19:31:04 <clarkb> #topic Docker compose plugin with podman service for servers
19:31:24 <clarkb> I don't think anyone has pushed up a change to start testing this with, say, paste/lodgeit, but that is the current proposed plan
19:31:39 <clarkb> if I'm wrong about that, please correct me and point out what needs reviewing, or if there are any other questions
19:31:54 <corvus> i share that understanding
19:31:58 <fungi> i don't recall seeing a change yet
19:32:40 <clarkb> #topic Open Discussion
19:32:45 <clarkb> Anything else?
19:33:02 <fungi> i've got nothing
19:34:07 <clarkb> it may be worth mentioning that I'll be out around veterans day weekend. I can't remember if I'm back Tuesday or Wednesday though
19:34:21 <clarkb> also, thanksgiving is about a month away for those of us in the US
19:34:47 <frickler> EU switches back from DST next sunday
19:35:00 <clarkb> looks like I'll be back tuesday, so no missed meeting for me, and I expect to be around the tuesday before thanksgiving
19:35:05 <fungi> i think it's a couple of weeks out that the usa does the same
19:35:20 <clarkb> yes, we're a week later than the EU
19:35:35 <fungi> november 3, yep
19:36:21 <clarkb> keep those date changes in mind, and as far as I can tell we should have meetings for the next month and a half or so
19:36:28 <clarkb> s/date/timezone/
19:36:45 <clarkb> I'll give it a few more minutes, but we can end early if there is nothing else
19:38:59 <clarkb> thank you for your time everyone! have a productive PTG and we'll see you back here next week
19:39:02 <clarkb> #endmeeting
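For reference, the docker compose plugin with podman approach discussed above generally amounts to pointing the compose plugin at podman's Docker-compatible API socket. A minimal sketch on a systemd-based host (the socket path can vary by distribution, and this is not the actual system-config implementation):

    # enable podman's Docker-compatible API socket
    sudo systemctl enable --now podman.socket

    # point the docker compose plugin at that socket instead of dockerd
    export DOCKER_HOST=unix:///run/podman/podman.sock
    docker compose up -d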