19:00:22 <clarkb> #startmeeting infra
19:00:22 <opendevmeet> Meeting started Tue Feb 25 19:00:22 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:22 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:22 <opendevmeet> The meeting name has been set to 'infra'
19:00:29 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ZEEHBRE5DEZGXFXPGE4MYFH4NYGRUOIP/ Our Agenda
19:00:36 <clarkb> #topic Announcements
19:01:33 <clarkb> I didn't have anything to announce (there is service coordinator election stuff but I've given that a full agenda topic for later)
19:02:16 <fungi> #link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/message/I2XP4T2C47TEODOH4JYVUZNEWK33R3PN/ draft governance documents and feedback calls for proposed OpenInfra/LF merge
19:03:30 <clarkb> thanks, seems like that may be it
19:03:44 <clarkb> a good reminder to keep an eye on that whole thread as well
19:04:04 <clarkb> #topic Zuul-launcher image builds
19:04:21 <clarkb> as mentioned previously corvus has been attempting to dogfood the zuul-launcher system with a test change in zuul itself
19:04:26 <fungi> yeah, one thing i wish hyperkitty had was a good way to deep-link to a message within a thread view
19:04:26 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/940824 Somewhat successful dogfooding in this zuul change
19:04:50 <clarkb> previously there were issues with quota limits and other bugs, but now we've got actual job runs that succeed on the images
19:05:21 <clarkb> that's pretty cool and great progress on the underlying migration of nodepool into zuul
19:05:33 <clarkb> I think there are still some quota problems with the latest buildset but some jobs ran
19:05:47 <clarkb> and the work to understand quotas has started in zuul-launcher so this should only get better
19:06:55 <clarkb> not sure if corvus has anything else to add, but ya good progress
19:08:44 <clarkb> #topic Fixing known_hosts generation on bridge
19:09:11 <clarkb> when deploying tracing02 last week I discovered that sometimes ssh known_hosts isn't updated when ansible runs in the infra-prod-base job
19:09:42 <clarkb> Eventually I was able to track that down to the code updating known_hosts running against system-config content that was already on bridge without updating it first
19:10:05 <clarkb> sometimes ansible and ssh would work (known_hosts would update) because the load balancer jobs for zuul and gitea would sometimes run before the infra-prod-run job
19:10:13 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942307
19:11:09 <clarkb> This change aims to fix that by updating system-config as part of bridge bootstrapping, and then we have to ensure the other jobs for load balancers (and the zuul db) don't run concurrently and contend for the system-config on disk
19:12:03 <clarkb> testing this is a bit difficult without just doing it. fungi and corvus have reviewed and +2'd the change. Do we want to proceed with the update nowish or wait for some better time? The hourly jobs do run the bootstrap bridge job so we should start to get feedback pretty quickly
19:12:26 <fungi> i'd be fine going ahead with it
19:12:59 <clarkb> ok after the meeting I've got lunch but then maybe we go for it afterwards if there are no objections or suggestions for a different approach? That would be at about 2100 UTC
19:14:08 <clarkb> #topic Upgrading old servers
19:14:29 <clarkb> Now that I'm running the meeting I realize this topic and the next one can be folded together so why don't I just do that
19:14:42 <clarkb> I've continued to try and upgrade servers from focal to noble, with the most recent one being tracing02
19:15:08 <clarkb> everything is switched over with zuul talking to the new server, dns is cleaned up/updated, the last step is to delete the old server which I'll try to do today
19:15:27 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint has some high level details on the backlog todo list there. Help is super appreciated
19:15:40 <clarkb> tonyb not sure if you are awake yet, but anything to update from your side of things?
19:15:46 <clarkb> anything we can do to be useful etc?
19:16:13 <tonyb> Nope nothing from me
19:16:26 <clarkb> #topic Redeploying raxflex resources
19:16:44 <clarkb> sort of related but with different motivation is that we should redeploy our raxflex ci resources
19:17:42 <clarkb> there are two drivers for this. The first is that doing so in sjc3 will get us updated networking with 1500 MTUs. The other is that there is a new dfw3 region we can use, but its tenants are different from the ones we are currently using in sjc3. The same tenants in dfw3 are available in sjc3, so if we redeploy we align with dfw3 and get updated networking
19:18:05 <fungi> yeah, i was hoping to hear back from folks who work on it as to whether we're okay to direct attach server instances to the PUBLICNET network since that was working in sjc3 at one point, though now it errors in both regions
19:18:53 <fungi> if i don't need to create all the additional network/router/et cetera boilerplate to handle floating-ip in the new projects, i'd rather not
19:19:01 <clarkb> once we sort out ^ we can deploy new mirrors, then we can roll over the nodepool configs
19:19:18 <fungi> yes
19:19:50 <clarkb> have we asked cardoe yet? cardoe seems good at running down those questions
19:21:43 <clarkb> in any case I suspect ^ may be the next step if we haven't yet
19:21:47 <fungi> no, i was trying to get the attention of cloudnull or dan_with
19:22:22 <fungi> but can try to see if cardoe is able to at least find out, even though he doesn't work on that environment
19:22:37 <clarkb> ya cardoe seems to know who to talk to and is often available on irc
19:22:42 <clarkb> a good combo for us if not the most efficient
19:22:51 <clarkb> anything else on this topic?
19:22:57 <fungi> not from me
19:23:19 <clarkb> #topic Running certcheck on bridge
19:23:26 <clarkb> I think I just saw changes for this today
19:23:51 <fungi> yeah, i split up my earlier change to add the git version to bridge first
19:23:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/939187 and its parent
19:24:16 <fungi> but up for discussion if we want a toggle based on platform version or something
19:24:32 <clarkb> probably just needs reviews at this point? Any other considerations to call out?
19:25:01 <fungi> also not sure how (or if) we'd want to go about stopping certcheck on cacti once we're good with how it's working on bridge, ansible to remove the cronjob only? clean up the git checkout too?
19:25:19 <fungi> or just manually delete stuff?
19:25:58 <fungi> similar for cleaning up the git deployment on bridge if we go down this path to switch it to the distro package later
19:26:10 <clarkb> I think we need to stop applying updates to cacti, otherwise we don't get the benefit of switching to running this on bridge. And if we do that the lists will become stale over time so we should disable it
19:26:16 <clarkb> I don't know that we need to do any cleanup beyond disabling it
19:26:34 <clarkb> fungi: the other thing I notice is the change doesn't update our testing, which is already doing certcheck stuff on bridge I think
19:26:43 <fungi> i realized that my earlier version of 939187 just did it all at once (moved off cacti onto bridge and switched from git to distro package), but it hadn't been getting reviews
19:26:47 <clarkb> we probably want to clean up whatever special case code does ^ and rely on the production path in testing
19:27:53 <clarkb> playbooks/zuul/templates/gate-groups.yaml.j2 is the file. I can leave a review after the meeting
19:28:55 <fungi> thanks
19:29:27 <clarkb> anything else?
19:29:48 <fungi> anyway, if people hadn't reviewed the earlier version of 939187 and are actually in favor of going back to a big-bang switch for install method and server at the same time, i'm happy to revert today's split
19:30:12 <clarkb> ack
19:31:26 <fungi> but yes, that's all from me on this
19:31:36 <clarkb> #topic Service Coordinator Election
19:31:56 <clarkb> The nomination period ended and, as promised, I nominated myself just before the time ran out as no one else had
19:32:16 <fungi> thanks!
19:32:20 <clarkb> I don't see any other nominations here https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/
19:32:36 <clarkb> so that means I'm it by default. If I've missed a nominee somehow please bring that up asap, but I don't think I have
19:33:01 <fungi> welcome back!
19:33:38 <clarkb> in six months we'll have another election and I'd be thrilled if someone else took on the role :)
19:33:57 <clarkb> if you're interested feel free to reach out and we can talk. We can probably even do a more gradual transition if that works
19:34:12 <clarkb> #topic Using more robust Gitea caches
19:34:16 <corvus> clarkb: congratulations!
19:35:13 <clarkb> About a week and a half ago or so mnaser pointed out that some requests to gitea13 were really slow. Then a refresh would be fast. This had us suspecting the gitea caches, as that is pretty normal for uncached things. However, this was different because the time delta was huge, like 20 seconds instead of half a second
19:35:43 <clarkb> and the same page would do it occasionally within a relatively short period of time, which seemed at odds with the caching behavior (it evicts after 16 hours not half an hour)
19:36:47 <clarkb> so I dug around in the gitea source code and found the memory cache is implemented as a single Go hashmap. It sounds like massive Go hashmaps can create problems for the Go GC system. My suspicion now is that AI crawlers (and other usage of gitea) are slowly causing that hashmap to grow to some crazy size and eventually impacting GC with pause-the-world behavior, leading to these long requests
19:38:41 <clarkb> I've since dug into alternative options to the 'memory' cache adapter and there are three: redis, memcached, and twoqueue. Redis is no longer open source, and twoqueue is implemented within gitea as another memory cache but using a more sophisticated implementation that allows you to configure a maximum entry count. I initially rejected these because redis isn't open source
19:38:43 <clarkb> (but apparently valkey is an option) and twoqueue would still be running with Go GC. This led to a change implementing memcached as the cache system
19:38:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942650 Set up gitea with memcached for caching
19:39:26 <clarkb> the upside to memcached is it is a fairly simple service, you can limit the total memory consumption, and it is open source. The downsides are that it is another container image to pull from docker hub (I would mirror it before we go to production if we decide to go to production with it)
19:39:59 <clarkb> so I guess I'm asking for reviews and feedback on this. I think we could also use twoqueue instead and see if we still have problems after setting a maximum cache entry limit
19:40:19 <corvus> i ❤️ memcached
19:40:23 <clarkb> that is probably simplest from an implementation perspective, and if it doesn't work we're in no worse position than today and could switch to memcached at that point
19:40:43 <fungi> yeah, on its face this seems like a fine approach
19:41:05 <corvus> this is still distributed right, we'd put one memcached on each gitea server?
19:41:05 <clarkb> cool, I think that change is ready for review at this point as long as the new test case I added is functional
19:41:10 <clarkb> corvus: correct
19:41:45 <corvus> would centralized help? let multiple gitea servers share cache and reduce io/cpu?
19:41:54 <corvus> (at the cost of bandwidth and us needing to make that HA)
19:42:31 <clarkb> corvus: I'm not sure it would, because we use hashed IPs for load balancing so we could have very different cache needs on one gitea vs another
19:42:34 <fungi> i guess that change is still meant to be wip since it seems to have some debugging enabled that may not make sense in production and also at least one todo comment
19:42:37 <clarkb> so having separate caches makes sense to me
19:43:02 <corvus> clarkb: yeah, i guess i was imagining that, like, lots of gitea servers might get a request for the nova README
19:43:44 <clarkb> fungi: the debugging should be disabled in the latest patchset (-v really isn't debugging, it's not even that verbose) and the TODO is something I noticed adjacent to the change but not directly related to it
19:43:54 <fungi> i think lately it's that all the gitea servers get crawled constantly for every possible url they can serve, so caches on the whole are of questionable value
19:43:56 <clarkb> corvus: I'm also not sure if the keys they use are stable across instances
19:44:23 <clarkb> corvus: I think they should be because we use a consistent database (so project ids should be the same?) but I'm not positive of that
19:44:25 <corvus> i definitely think that starting with unshared/distributed caching is best, since it's the smallest delta from our previously working system and is the smallest change to fix the current problem. was mostly wondering if we should explore shared caching since that was the actual design goal of memcached and so it's natural to wonder if it's a good fit here.
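For context, gitea selects its cache backend in the [cache] section of app.ini. A minimal sketch of what the memcached option could look like follows; the host/port and TTL values are illustrative assumptions, not necessarily what 942650 proposes:

    [cache]
    ; use an external memcached instance instead of the default in-process map
    ADAPTER = memcache
    ; assumed local memcached container; the address is illustrative
    HOST = 127.0.0.1:11211
    ; entry lifetime; gitea's default is 16 hours
    ITEM_TTL = 16h

The twoqueue alternative discussed above stays in-process but bounds the entry count (ADAPTER = twoqueue with a size limit in HOST), trading external moving parts for continued exposure to Go GC behavior.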
19:44:29 <clarkb> I think the safest thing for now is distributed caches
19:44:36 <clarkb> ah
19:44:49 <corvus> yes agree, just can't help thinking another step ahead :)
19:44:49 <clarkb> I think it would definitely be possible if/when we use a shared db backend
19:45:00 <clarkb> it's not completely clear to me if we can until then
19:45:13 <corvus> ack
19:45:43 <clarkb> fungi: we have to have a cache (there is no disabled caching option) and if my hunches are correct the default is actively making things worse, so unfortunately we need to go with something that is less bad at least
19:46:50 <clarkb> it does sound like we're happy to use memcached, so maybe I should go ahead and propose a change to mirror that container image today and then update the change tomorrow to pull from the mirror
19:46:59 <fungi> yeah, i just meant tuning the cache for performance may not yield much difference
19:47:09 <clarkb> ah
19:47:15 <corvus> (even a basic LRU cache wouldn't be terrible for "crawl everything"; anything non-ai-crawler would still heat up part of the cache and keep it in memory)
19:47:58 <corvus> (but the cs in me says there's almost certainly a better algorithm for that)
19:48:29 <clarkb> evicting based on total requests since entry would probably work well here
19:48:36 <clarkb> since the AI things grab a page once before scanning again next week
19:48:36 <corvus> ++
19:49:07 <clarkb> thanks for the feedback, I think I see the path forward here
19:49:09 <clarkb> #topic Working through our TODO list
19:49:18 <clarkb> A reminder that I'm trying to keep a rough high level todo list on this etherpad
19:49:23 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:49:35 <clarkb> a good place for anyone to check if they'd like to get involved or if any of us get bored
19:49:49 <clarkb> I intend on editing that list to include parallelizing infra-prod-* jobs
19:49:59 <clarkb> #topic Open Discussion
19:50:12 <fungi> i've started looking into mailman log rotation
19:50:27 <clarkb> The Matrix Foundation may be shutting down the OFTC irc bridge at the end of March if they don't get additional funding. Something for people who rely on that bridge for irc access to be aware of
19:50:42 <fungi> mailman-core specifically, since mailman-web is just django apps and django already handles rotation for those
19:50:58 <fungi> i ran across this bug, which worries me: https://gitlab.com/mailman/mailman/-/issues/931
19:51:09 <clarkb> fungi: I think the main questions I had about log rotation were ensuring that mailman doesn't have some mechanism it wants to use for that and figuring out what retention should be. I feel like mail things tend to move more slowly than web things so longer retention might be appropriate?
19:51:38 <fungi> seems like the lmtp handler logs to smtp.log and the command to tell mailman to reopen log files doesn't fully work
19:52:00 <clarkb> fungi: copytruncate should address that right?
19:52:18 <clarkb> iirc it exists because some services don't handle rotation gracefully?
19:52:33 <fungi> yeah, looking in a separate bug report https://gitlab.com/mailman/mailman/-/issues/1078 copytruncate is suggested
19:55:21 <clarkb> sounds like that may be everything?
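As a point of reference for the copytruncate suggestion above, a minimal logrotate stanza might look like the sketch below; the log path is only an assumption about where the mailman-core container's logs land on the host, and the retention values are illustrative:

    /var/lib/mailman/core/var/logs/*.log {
        weekly
        rotate 12
        compress
        delaycompress
        missingok
        notifempty
        # copytruncate copies the live log aside and truncates it in place, so the
        # writing process keeps its existing file descriptor and no reopen is needed
        copytruncate
    }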
19:55:21 <fungi> it's not clear to me that it fully works even then, whether we also need to do a `mailman reopen` or just sighup, et cetera
19:55:36 <clarkb> ya there is a comment there indicating copytruncate may still be problematic
19:55:56 <clarkb> though that is surprising to me since that should keep the existing fd and path. It's the rotated logs that are new and detached from the process
19:57:05 <fungi> well, you still need the logging process to gracefully reorient to the start of the file after truncation and not, e.g., seek to the old end address
19:57:37 <fungi> but following those discussions it seems like it's partly about python logging base behaviors and handling multiple streams
19:57:55 <fungi> which probably handles that okay
19:58:05 <fungi> handles the truncation okay i mean
19:58:20 <clarkb> hrm, if it's python logging in play then I wonder if we need a python logging config instead
19:58:33 <clarkb> that is what we typically do for services like zuul, nodepool, etc and set the rotation schedule in that config
19:59:00 <fungi> i want to say we've seen something like this before, where sighup to the process isn't propagating to logging from different libraries that get imported
19:59:37 <fungi> at least the aiosmtpd logging that ends up in smtp.log seems to be via python logging
20:00:14 <clarkb> oh, but mailman would have to support loading a logging config from disk right?
20:00:22 <clarkb> python logging doesn't have a way to do that straight out of the library iirc
20:00:49 <clarkb> we are at time. The logging thing probably deserves some testing, which we can do via a held node living long enough to do rotations
20:00:56 <clarkb> thank you everyone for your time!
20:01:03 <fungi> https://github.com/aio-libs/aiosmtpd/issues/278 has some further details
20:01:09 <fungi> will follow up in #opendev
20:01:09 <clarkb> feel free to continue discussion in #opendev and/or on the mailing list
20:01:11 <clarkb> #endmeeting