19:00:22 <clarkb> #startmeeting infra
19:00:22 <opendevmeet> Meeting started Tue Feb 25 19:00:22 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:22 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:22 <opendevmeet> The meeting name has been set to 'infra'
19:00:29 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ZEEHBRE5DEZGXFXPGE4MYFH4NYGRUOIP/ Our Agenda
19:00:36 <clarkb> #topic Announcements
19:01:33 <clarkb> I didn't have anything to announce (there is service coordinator election stuff but I've given that a full agenda topic for later)
19:02:16 <fungi> #link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/message/I2XP4T2C47TEODOH4JYVUZNEWK33R3PN/ draft governance documents and feedback calls for proposed OpenInfra/LF merge
19:03:30 <clarkb> thanks, seems like that may be it
19:03:44 <clarkb> a good reminder to keep an eye on that whole thread as well
19:04:04 <clarkb> #topic Zuul-launcher image builds
19:04:21 <clarkb> as mentioned previously corvus has been attempting to dogfood the zuul-launcher system with a test change in zuul itself
19:04:26 <fungi> yeah, one thing i wish hyperkitty had was a good way to deep-link to a message within a thread view
19:04:26 <clarkb> #link https://review.opendev.org/c/zuul/zuul/+/940824 Somewhat successful dogfooding in this zuul change
19:04:50 <clarkb> previously there were issues with quota limits and other bugs, but now we've got actual job runs that succeed on the images
19:05:21 <clarkb> that's pretty cool and great progress on the underlying migration of nodepool into zuul
19:05:33 <clarkb> I think there are still some quota problems with the latest buildset but some jobs ran
19:05:47 <clarkb> and the work to understand quotas has started in zuul-launcher so this should only get better
19:06:55 <clarkb> not sure if corvus has anything else to add, but ya good progress
19:08:44 <clarkb> #topic Fixing known_hosts generation on bridge
19:09:11 <clarkb> when deploying tracing02 last week I discovered that sometimes ssh known_hosts isn't updated when ansible runs in the infra-prod-base job
19:09:42 <clarkb> Eventually I was able to track that down to the code updating known_hosts running against system-config content that was already on bridge without updating it first
19:10:05 <clarkb> sometimes ansible and ssh would work (known_hosts would update) because the load balancer jobs for zuul and gitea would sometimes run before the infra-prod-run job
19:10:13 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942307
19:11:09 <clarkb> This change aims to fix that by updating system-config as part of bridge bootstrapping, and then we have to ensure the other jobs for load balancers (and the zuul db) don't run concurrently and contend for the system-config on disk
19:12:03 <clarkb> testing this is a bit difficult without just doing it. fungi and corvus have reviewed and +2'd the change. Do we want to proceed with the update nowish or wait for some better time? The hourly jobs do run the bootstrap bridge job so we should start to get feedback pretty quickly
19:12:26 <fungi> i'd be fine going ahead with it
19:12:59 <clarkb> ok after the meeting I've got lunch but then maybe we go for it afterwards if there are no objections or suggestions for a different approach? That would be at about 2100 UTC
19:14:08 <clarkb> #topic Upgrading old servers
19:14:29 <clarkb> Now that I'm running the meeting I realize this topic and the next one can be folded together so why don't I just do that
19:14:42 <clarkb> I've continued to try and upgrade servers from focal to noble, with the most recent one being tracing02
19:15:08 <clarkb> everything is switched over with zuul talking to the new server, dns is cleaned up/updated, the last step is to delete the old server which I'll try to do today
19:15:27 <clarkb> #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint has some high level details on the backlog todo list there. Help is super appreciated
19:15:40 <clarkb> tonyb not sure if you are awake yet, but anything to update from your side of things?
19:15:46 <clarkb> anything we can do to be useful etc?
19:16:13 <tonyb> Nope nothing from me
19:16:26 <clarkb> #topic Redeploying raxflex resources
19:16:44 <clarkb> sort of related but with different motivation is that we should redeploy our raxflex ci resources
19:17:42 <clarkb> there are two drivers for this. The first is that doing so in sjc3 will get us updated networking with 1500 MTUs. The other is that there is a new dfw3 region we can use, but its tenants are different from the ones we are currently using in sjc3. The same tenants in dfw3 are available in sjc3, so if we redeploy we align with dfw3 and get updated networking
19:18:05 <fungi> yeah, i was hoping to hear back from folks who work on it as to whether we're okay to direct attach server instances to the PUBLICNET network since that was working in sjc3 at one point, though now it errors in both regions
19:18:53 <fungi> if i don't need to create all the additional network/router/et cetera boilerplate to handle floating-ip in the new projects, i'd rather not
19:19:01 <clarkb> once we sort out ^ we can deploy new mirrors, then we can roll over the nodepool configs
19:19:18 <fungi> yes
19:19:50 <clarkb> have we asked cardoe yet? cardoe seems good at running down those questions
19:21:43 <clarkb> in any case I suspect ^ may be the next step if we haven't yet
19:21:47 <fungi> no, i was trying to get the attention of cloudnull or dan_with
19:22:22 <fungi> but can try to see if cardoe is able to at least find out, even though he doesn't work on that environment
19:22:37 <clarkb> ya cardoe seems to know who to talk to and is often available on irc
19:22:42 <clarkb> a good combo for us if not the most efficient
19:22:51 <clarkb> anything else on this topic?
19:22:57 <fungi> not from me
19:23:19 <clarkb> #topic Running certcheck on bridge
19:23:26 <clarkb> I think I just saw changes for this today
19:23:51 <fungi> yeah, i split up my earlier change to add the git version to bridge first
19:23:51 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/939187 and its parent
19:24:16 <fungi> but up for discussion if we want a toggle based on platform version or something
19:24:32 <clarkb> probably just needs reviews at this point? Any other considerations to call out?
19:25:01 <fungi> also not sure how (or if) we'd want to go about stopping certcheck on cacti once we're good with how it's working on bridge, ansible to remove the cronjob only? clean up the git checkout too?
19:25:19 <fungi> or just manually delete stuff?
19:25:58 <fungi> similar for cleaning up the git deployment on bridge if we go down this path to switch it to the distro package later
19:26:10 <clarkb> I think we need to stop applying updates to cacti, otherwise we don't get the benefit of switching to running this on bridge. And if we do that the lists will become stale over time so we should disable it
19:26:16 <clarkb> I don't know that we need to do any cleanup beyond disabling it
19:26:34 <clarkb> fungi: the other thing I notice is the change doesn't update our testing, which is already doing certcheck stuff on bridge I think
19:26:43 <fungi> i realized that my earlier version of 939187 just did it all at once (moved off cacti onto bridge and switched from git to distro package), but it hadn't been getting reviews
19:26:47 <clarkb> we probably want to clean up whatever special case code does ^ and rely on the production path in testing
19:27:53 <clarkb> playbooks/zuul/templates/gate-groups.yaml.j2 is the file. I can leave a review after the meeting
19:28:55 <fungi> thanks
19:29:27 <clarkb> anything else?
19:29:48 <fungi> anyway, if people hadn't reviewed the earlier version of 939187 and are actually in favor of going back to a big-bang switch for install method and server at the same time, i'm happy to revert today's split
19:30:12 <clarkb> ack
19:31:26 <fungi> but yes, that's all from me on this
19:31:36 <clarkb> #topic Service Coordinator Election
19:31:56 <clarkb> The nomination period ended and, as promised, I nominated myself just before the time ran out as no one else had
19:32:16 <fungi> thanks!
19:32:20 <clarkb> I don't see any other nominations here https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/
19:32:36 <clarkb> so that means I'm it by default. If I've missed a nominee somehow please bring that up asap, but I don't think I have
19:33:01 <fungi> welcome back!
19:33:38 <clarkb> in six months we'll have another election and I'd be thrilled if someone else took on the role :)
19:33:57 <clarkb> if you're interested feel free to reach out and we can talk. We can probably even do a more gradual transition if that works
19:34:12 <clarkb> #topic Using more robust Gitea caches
19:34:16 <corvus> clarkb: congratulations!
19:35:13 <clarkb> About a week and a half ago or so mnaser pointed out that some requests to gitea13 were really slow. Then a refresh would be fast. This had us suspecting the gitea caches, as that is pretty normal for uncached things. However, this was different because the time delta was huge, like 20 seconds instead of half a second
19:35:43 <clarkb> and the same page would do it occasionally within a relatively short period of time, which seemed at odds with the caching behavior (it evicts after 16 hours not half an hour)
19:36:47 <clarkb> so I dug around in the gitea source code and found the memory cache is implemented as a single Go hashmap. It sounds like massive Go hashmaps can create problems for the Go GC system. My suspicion now is that AI crawlers (and other usage of gitea) are slowly causing that hashmap to grow to some crazy size and eventually impacting GC with pause-the-world behavior, leading to these long requests
19:38:41 <clarkb> I've since dug into alternative options to the 'memory' cache adapter and there are three: redis, memcached, and twoqueue. Redis is no longer open source, and twoqueue is implemented within gitea as another memory cache but using a more sophisticated implementation that allows you to configure a maximum entry count. I initially rejected these because redis isn't open source
19:38:43 <clarkb> (but apparently valkey is an option) and twoqueue would still be running with Go GC. This led to a change implementing memcached as the cache system
19:38:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/942650 Set up gitea with memcached for caching
19:39:26 <clarkb> the upside to memcached is it is a fairly simple service, you can limit the total memory consumption, and it is open source. The downsides are that it is another container image to pull from docker hub (I would mirror it before we go to production if we decide to go to production with it)
19:39:59 <clarkb> so I guess I'm asking for reviews and feedback on this. I think we could also use twoqueue instead and see if we still have problems after setting a maximum cache entry limit
19:40:19 <corvus> i ❤️ memcached
19:40:23 <clarkb> that is probably simplest from an implementation perspective, and if it doesn't work we're in no worse position than today and could switch to memcached at that point
19:40:43 <fungi> yeah, on its face this seems like a fine approach
19:41:05 <corvus> this is still distributed right, we'd put one memcached on each gitea server?
19:41:05 <clarkb> cool, I think that change is ready for review at this point as long as the new test case I added is functional
19:41:10 <clarkb> corvus: correct
19:41:45 <corvus> would centralized help? let multiple gitea servers share cache and reduce io/cpu?
19:41:54 <corvus> (at the cost of bandwidth and us needing to make that HA)
19:42:31 <clarkb> corvus: I'm not sure it would, because we use hashed IPs for load balancing so we could have very different cache needs on one gitea vs another
19:42:34 <fungi> i guess that change is still meant to be wip since it seems to have some debugging enabled that may not make sense in production and also at least one todo comment
19:42:37 <clarkb> so having separate caches makes sense to me
19:43:02 <corvus> clarkb: yeah, i guess i was imagining that, like, lots of gitea servers might get a request for the nova README
19:43:44 <clarkb> fungi: the debugging should be disabled in the latest patchset (-v really isn't debugging, it's not even that verbose) and the TODO is something I noticed adjacent to the change but not directly related to it
19:43:54 <fungi> i think lately it's that all the gitea servers get crawled constantly for every possible url they can serve, so caches on the whole are of questionable value
19:43:56 <clarkb> corvus: I'm also not sure if the keys they use are stable across instances
19:44:23 <clarkb> corvus: I think they should be because we use a consistent database (so project ids should be the same?) but I'm not positive of that
19:44:25 <corvus> i definitely think that starting with unshared/distributed caching is best, since it's the smallest delta from our previously working system and is the smallest change to fix the current problem. was mostly wondering if we should explore shared caching since that was the actual design goal of memcached and so it's natural to wonder if it's a good fit here.
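For context, gitea selects its cache backend in the [cache] section of app.ini. A minimal sketch of what the memcached option could look like follows; the host/port and TTL values are illustrative assumptions, not necessarily what 942650 proposes:

    [cache]
    ; use an external memcached instance instead of the default in-process map
    ADAPTER = memcache
    ; assumed local memcached container; the address is illustrative
    HOST = 127.0.0.1:11211
    ; entry lifetime; gitea's default is 16 hours
    ITEM_TTL = 16h

The twoqueue alternative discussed above stays in-process but bounds the entry count (ADAPTER = twoqueue with a size limit in HOST), trading external moving parts for continued exposure to Go GC behavior.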
19:44:29 <clarkb> I think the safest thing for now is distributed caches
19:44:36 <clarkb> ah
19:44:49 <corvus> yes agree, just can't help thinking another step ahead :)
19:44:49 <clarkb> I think it would definitely be possible if/when we use a shared db backend
19:45:00 <clarkb> it's not completely clear to me if we can until then
19:45:13 <corvus> ack
19:45:43 <clarkb> fungi: we have to have a cache (there is no disabled caching option) and if my hunches are correct the default is actively making things worse, so unfortunately we need to go with something that is less bad at least
19:46:50 <clarkb> it does sound like we're happy to use memcached, so maybe I should go ahead and propose a change to mirror that container image today and then update the change tomorrow to pull from the mirror
19:46:59 <fungi> yeah, i just meant tuning the cache for performance may not yield much difference
19:47:09 <clarkb> ah
19:47:15 <corvus> (even a basic LRU cache wouldn't be terrible for "crawl everything"; anything non-ai-crawler would still heat up part of the cache and keep it in memory)
19:47:58 <corvus> (but the cs in me says there's almost certainly a better algorithm for that)
19:48:29 <clarkb> evicting based on total requests since entry would probably work well here
19:48:36 <clarkb> since the AI things grab a page once before scanning again next week
19:48:36 <corvus> ++
19:49:07 <clarkb> thanks for the feedback, I think I see the path forward here
19:49:09 <clarkb> #topic Working through our TODO list
19:49:18 <clarkb> A reminder that I'm trying to keep a rough high level todo list on this etherpad
19:49:23 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:49:35 <clarkb> a good place for anyone to check if they'd like to get involved or if any of us get bored
19:49:49 <clarkb> I intend on editing that list to include parallelizing infra-prod-* jobs
19:49:59 <clarkb> #topic Open Discussion
19:50:12 <fungi> i've started looking into mailman log rotation
19:50:27 <clarkb> The Matrix Foundation may be shutting down the OFTC irc bridge at the end of March if they don't get additional funding. Something for people who rely on that bridge for irc access to be aware of
19:50:42 <fungi> mailman-core specifically, since mailman-web is just django apps and django already handles rotation for those
19:50:58 <fungi> i ran across this bug, which worries me: https://gitlab.com/mailman/mailman/-/issues/931
19:51:09 <clarkb> fungi: I think the main questions I had about log rotation were ensuring that mailman doesn't have some mechanism it wants to use for that and figuring out what retention should be. I feel like mail things tend to move more slowly than web things so longer retention might be appropriate?
19:51:38 <fungi> seems like the lmtp handler logs to smtp.log and the command to tell mailman to reopen log files doesn't fully work
19:52:00 <clarkb> fungi: copytruncate should address that right?
19:52:18 <clarkb> iirc it exists because some services don't handle rotation gracefully?
19:52:33 <fungi> yeah, looking in a separate bug report https://gitlab.com/mailman/mailman/-/issues/1078 copytruncate is suggested
19:55:21 <clarkb> sounds like that may be everything?
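As a point of reference for the copytruncate suggestion above, a minimal logrotate stanza might look like the sketch below; the log path is only an assumption about where the mailman-core container's logs land on the host, and the retention values are illustrative:

    /var/lib/mailman/core/var/logs/*.log {
        weekly
        rotate 12
        compress
        delaycompress
        missingok
        notifempty
        # copytruncate copies the live log aside and truncates it in place, so the
        # writing process keeps its existing file descriptor and no reopen is needed
        copytruncate
    }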
19:55:21 <fungi> it's not clear to me that it fully works even then, whether we also need to do a `mailman reopen` or just sighup, et cetera
19:55:36 <clarkb> ya there is a comment there indicating copytruncate may still be problematic
19:55:56 <clarkb> though that is surprising to me since that should keep the existing fd and path. It's the rotated logs that are new and detached from the process
19:57:05 <fungi> well, you still need the logging process to gracefully reorient to the start of the file after truncation and not, e.g., seek to the old end address
19:57:37 <fungi> but following those discussions it seems like it's partly about python logging base behaviors and handling multiple streams
19:57:55 <fungi> which probably handles that okay
19:58:05 <fungi> handles the truncation okay i mean
19:58:20 <clarkb> hrm, if it's python logging in play then I wonder if we need a python logging config instead
19:58:33 <clarkb> that is what we typically do for services like zuul, nodepool, etc and set the rotation schedule in that config
19:59:00 <fungi> i want to say we've seen something like this before, where sighup to the process isn't propagating to logging from different libraries that get imported
19:59:37 <fungi> at least the aiosmtpd logging that ends up in smtp.log seems to be via python logging
20:00:14 <clarkb> oh, but mailman would have to support loading a logging config from disk right?
20:00:22 <clarkb> python logging doesn't have a way to do that straight out of the library iirc
20:00:49 <clarkb> we are at time. The logging thing probably deserves some testing, which we can do via a held node living long enough to do rotations
20:00:56 <clarkb> thank you everyone for your time!
20:01:03 <fungi> https://github.com/aio-libs/aiosmtpd/issues/278 has some further details
20:01:09 <fungi> will follow up in #opendev
20:01:09 <clarkb> feel free to continue discussion in #opendev and/or on the mailing list
20:01:11 <clarkb> #endmeeting