19:00:11 <clarkb> #startmeeting infra
19:00:11 <opendevmeet> Meeting started Tue Jan 28 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:11 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:11 <opendevmeet> The meeting name has been set to 'infra'
19:00:17 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5OE5J5DUQJXZWZ67O7CLANUCFWY7RNXB/ Our Agenda
19:00:21 <clarkb> #topic Announcements
19:00:27 <clarkb> I don't have anything to announce. Did anyone else?
19:00:43 <clarkb> I guess there is an openinfra party at FOSDEM for people who will be there
19:03:06 <clarkb> #topic Zuul-launcher image builds
19:03:35 <clarkb> The most recent Zuul upgrade pulled in sufficient updates to get api management of zuul launcher image builds in place
19:03:48 <clarkb> there was a little blip where we needed to deploy some extra config for zuul web to make that work
19:04:03 <clarkb> but since that was sorted out corvus has been able to trigger new image builds via the ui
19:04:49 <corvus> i just triggered another set; that will tell us if we're aging out old ones correctly
19:04:56 <clarkb> I think the latest is that old existing builds didn't have sufficient metadata to get automatically cleared out, but new ones should
19:04:58 <clarkb> ya that
19:05:17 <clarkb> anything else we should be aware of on this topic? Or areas that need help?
19:05:43 <corvus> nope; new image jobs are still welcome any time
19:05:56 <corvus> but that's not blocking yet
19:06:04 <clarkb> ack and thank you for sorting out zuul web post upgrade
19:06:13 <corvus> np
19:06:25 <clarkb> #topic Upgrading old servers
19:06:40 <clarkb> I'm going to fold the noble work under this topic now that it's largely sorted out
19:06:50 <clarkb> some things I want to note:
19:07:42 <clarkb> New noble servers will deploy with borg 1.2.8 and backup to backup servers running borg 1.1.18. That will be the situation until we deploy new noble backup servers running borg 1.4.x. At that point we can convert servers capable of running 1.4.x to backup to those new backup servers (I believe this will include noble and jammy servers)
19:08:14 <clarkb> some services may need their docker-compose.yaml files updated to set an 'always' restart policy though most already use always
19:08:38 <clarkb> and fungi did dig into podman packaging for ubuntu and doesn't think that upgrading podman requires container restarts (upgrading dockerd packages did)
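For reference, the restart policy mentioned above is a one-line setting per service in its docker-compose.yaml; a minimal sketch with placeholder service and image names (not copied from any actual system-config file):

    services:
      someservice:                                  # placeholder service name
        image: quay.io/example/someservice:latest   # placeholder image reference
        restart: always                             # bring the container back up after package upgrades and host reboots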
19:09:15 <clarkb> So far paste02 seems to be happy and after the initial set of problems (understanding reboot behavior, borg backups, and general podman support) I think we're in good shape to continue deploying new things on noble
19:09:40 <clarkb> we also converted lodgeit over to publishing to quay.io and speculative testing of those container images works with podman (this was the expectation but it is good to see in practice)
19:10:19 <clarkb> at this point I think what I'd like to do is find some more complicated service to upgrade to noble and run under podman. My ultimate goal is to redeploy review, but considering how much we learned from paste02 I think finding another canary first is a good idea
19:10:46 <clarkb> so far I've been thinking codesearch or grafana would be good simple updates like paste. But better than that might be something like zuul servers (scheduler and/or executors?)
19:10:56 <clarkb> not sure if anyone had thoughts on that. I'm open to input
19:11:56 <clarkb> won't get to that today though so we have time to chime in. Hoping later this week
19:12:04 <clarkb> tonyb: anything new with wiki to discuss?
19:12:43 <corvus> no objections to using zuul as a canary -- but one thought: i don't think we'd want to have the different components on different systems for long
19:13:05 <clarkb> that makes sense so probably need to do the whole thing in a coordinated fashion
19:13:14 <corvus> also, start with executor :)
19:13:20 <clarkb> I'll keep that in mind and look at the list again to see if there are any other better candidates
19:13:28 <corvus> most likely to introduce novel issues
19:13:55 <clarkb> ack
19:14:03 <corvus> otoh -- all of zuul is regularly tested with podman
19:14:04 <corvus> so that's nice
19:15:35 <clarkb> sounds like that may be it for this topic (we can swing back around later if there is time and a need)
19:15:38 <clarkb> #topic Switch to quay.io/opendevmirror images where possible
19:16:00 <clarkb> one thing I noticed when trying to upgrade gerrit and gitea and do lodgeit/paste work is that using the mirrored images from quay really does help with job reliability
19:17:00 <clarkb> One "easy" approach here is to switch our use of mariadb from docker.io to quay.io. Doing so does cause the database to restart, so keep that in mind. So far we have converted gitea and lodgeit. Other services to convert include refstack, gerrit, etherpad, the zuul db system, and probably others I'm forgetting
19:17:22 <clarkb> I may just go ahead and try and push changes up for all of these cases I can find then we can land them as we feel is appropriate
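The mariadb conversions amount to a one-line image swap in each service's docker-compose.yaml; a rough sketch (tag and surrounding options are illustrative, not the exact system-config contents):

    services:
      mariadb:
        # was: image: mariadb:10.11   (implicitly pulled from docker.io)
        image: quay.io/opendevmirror/mariadb:10.11  # mirrored copy; avoids docker.io pull limits
        restart: always                             # compose recreates the container on image change, hence the db restart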
19:17:51 <clarkb> then separately we may wish to switch our dockerfile image builds over to base images hosted on quay as well. For example with the python base and builder images
19:18:15 <clarkb> one thing to keep in mind with doing this is we'll lose the ability to speculatively test those base images against our image builds. I think this is something we can live with while we transition over to quay in general
19:18:48 <clarkb> speculative image building is far more useful with the actual service images and while we may have used the speculative state once or twice in the past to test base image updates I don't think they are as critical
19:19:23 <clarkb> just keep that in mind as a class of optimization we can apply to improve reliability in our ci system
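On the Dockerfile side the change being described is re-pointing the FROM lines at quay-hosted copies of the base/builder images; a sketch of the usual two-stage pattern (registry paths and tag are my assumption, not quoted from a specific repo):

    # builder stage: installs build deps and assembles wheels
    FROM quay.io/opendevorg/python-builder:3.11-bookworm AS builder
    COPY . /tmp/src
    RUN assemble

    # final stage: installs just the built wheels onto the slim base image
    FROM quay.io/opendevorg/python-base:3.11-bookworm
    COPY --from=builder /output/ /output
    RUN /output/install-from-bindep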
19:19:34 <clarkb> #topic Unpinning our Grafana deployment
19:20:27 <clarkb> At some point (last year?) we updated grafana and some of our graphs stopped working. There were problems with CORS I guess. Anyway I've pushed up some changes to improve testing of grafana in system config so that we can inspect this better and have a change to bump up to the newest version of the current major release we are on
19:20:33 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/940073
19:21:01 <clarkb> I then held a node: https://217.182.143.14/ and through that I was able to track down the problem (or at least I think I did)
19:21:27 <clarkb> the issue seems to be specific to grafana dashboards that use queries to look up info and dynamically build out the dashboard. These queries hit CORS errors
19:21:51 <clarkb> it turns out that we can have grafana proxy the requests to graphite to mitigate this problem: https://review.opendev.org/c/openstack/project-config/+/940276
19:22:10 <clarkb> that doesn't seem to break the graphs on the current version and I think makes the graphs work with the latest version
19:22:42 <clarkb> long story short I think if we land 940276 and confirm things continue to work with the current deployment then the next step is upgrading to the latest version of the current major release
19:23:09 <clarkb> I don't want to go straight to the next major release because we get warnings about graphs requiring angular and that has been deprecated so sorting that out will be the next battle before upgrading to the latest major release
19:23:16 <clarkb> reviews welcome
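For context, the proxy change is a datasource-level setting rather than anything in the individual dashboards; a sketch of Grafana datasource provisioning along those lines (the exact file contents and names are illustrative, not a quote of 940276):

    apiVersion: 1
    datasources:
      - name: Graphite
        type: graphite
        url: https://graphite.opendev.org
        # "proxy" makes grafana's backend fetch the data server-side, so the browser
        # never issues cross-origin requests to graphite and CORS never comes into play;
        # "direct" has the browser query graphite itself and so needs working CORS headers
        access: proxy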
19:23:35 <corvus> what about adding cors to graphite?
19:24:07 <clarkb> I suspect that would work as well.
19:24:25 <corvus> we have some headers there already
19:24:52 <clarkb> basically we would need to add grafana.opendev.org to the allowed origins list I think
19:25:04 <corvus> https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/graphite/templates/graphite-statsd.conf.j2#L55-L58
19:25:28 <clarkb> oh hrm we already set it to *
19:25:33 <clarkb> so why is this breaking
19:25:36 <corvus> which dashboard failed?
19:25:54 <clarkb> corvus: dib status or any of the nodepool provider specific ones (the base nodepool one works)
19:26:15 <clarkb> I believe it was an OPTIONS request which we set that header on
19:26:34 <clarkb> but ya maybe there is something subtly wrong in that header config for graphite
19:26:46 <clarkb> https://217.182.143.14/d/f3089338b3/nodepool3a-dib-status?orgId=1 this is an example failure
19:28:25 <clarkb> looking at the response headers I don't see the allowed origins in there. So maybe that config is ineffective for some reason. Definitely something to debug further if we prefer that approach (it would make integration easier overall probably)
19:29:23 <corvus> hrm
19:29:43 <corvus> i wonder if it has to do with the post
19:29:57 <clarkb> oh ya that post seems to have no data in my firefox debugger
19:30:07 <clarkb> I didn't look at it previously because it didn't hard fail like the OPTIONS request
19:30:25 <clarkb> we might need to check server logs on graphite to untangle this too
19:30:25 <corvus> the browser sees a post request, so first it does an OPTIONS request, and maybe the options isn't handled by graphite so apache just passes through the 400 without adding the headers
19:30:48 <fungi> could that dashboard's configuration be subtly incompatible with newer grafana resulting in an empty query?
19:31:19 <clarkb> fungi: I don't think so because converting grafana to proxy instead of direct communication to graphite works
19:31:24 <corvus> so yeah, if it's something like that, then we might need to tell apache to return those headers even on 400 errors
19:31:37 <fungi> ah, didn't realize the proxy solution had been tried already
19:31:52 <clarkb> yes proxy solution is working based on testing in the changes linked above
19:32:12 <clarkb> corvus: do you mean nginx on the graphite host? but ya perhaps we aren't responding properly and firefox gets sad
19:32:32 <clarkb> I can try and dig more by looking at graphite web server logs later today
19:32:43 <corvus> heh that is an nginx config isn't it :)
19:33:07 <clarkb> ya iirc graphite is this big monolithic container with all the things in one image and we're just configuring what is there
19:33:17 <corvus> ah
19:33:51 <corvus> curl -i -X OPTIONS 'https://graphite.opendev.org/metrics/find?from=1738070749&until=1738092351'
19:33:56 <corvus> that returns 400 with no headers
19:34:16 <corvus> (it's missing the "query" parameter, which would presumably be included in the POST request)
19:34:36 <corvus> so that's my theory -- we're not adding the CORS headers on failing OPTIONS requests
19:34:54 <clarkb> seems plausible and likely something we can fix with the right nginx config
19:35:01 <corvus> ++
19:35:18 <corvus> i think that's worth doing in preference to the proxy
19:35:30 <fungi> sounds reasonable to me
19:35:31 <clarkb> agreed since that makes this work more broadly rather than only when you have a special proxy in front
19:35:37 <clarkb> I can try and dig into that more later today
19:35:38 <corvus> should be a little more efficient, and we already try to have that work
19:35:48 <corvus> that too
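Before moving on, a rough sketch of what the graphite-side fix could look like in that nginx template, using add_header's "always" parameter (plain add_header is skipped on error responses) and answering preflights directly; the header list and placement are illustrative:

    location / {
        # "always" emits these on 4xx/5xx responses too
        add_header Access-Control-Allow-Origin "*" always;
        add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
        add_header Access-Control-Allow-Headers "origin, authorization, accept, content-type" always;

        # answer the CORS preflight here rather than letting graphite-web
        # return a 400 because the OPTIONS request lacks a query parameter
        if ($request_method = OPTIONS) {
            return 204;
        }
    }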
19:36:20 <clarkb> #topic Increasing Exim's Acceptable Line Length
19:36:26 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/940248
19:36:47 <clarkb> tl;dr here is that our own exim on lists.opendev.org is generating bounces when people send messages with header lines that are too long
19:37:18 <clarkb> the thing I'm confused about is why I haven't been kicked out of the list yet but maybe I don't bounce often enough for this to trip me over the limit
19:37:28 <clarkb> anyway the change seems fine to me if this is part of noble's defaults anyway
19:37:30 <fungi> yes, latest discovery among the various challenges mailman has delivering messages
19:37:41 <clarkb> any reason we shouldn't proceed with landing it?
19:38:12 <fungi> this particular failure seems to occur infrequently, but since it's technically an rfc violation and a default (if newer) exim behavior we'd be overriding, i wanted to make sure we had some consensus
19:38:45 <clarkb> I'm ok with rfc violations if experience says the real world is violating the rfc
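For reference, newer exim exposes this as a main-section option; a very rough sketch of the kind of override 940248 proposes (the option name is my recollection of the newer exim knob and the value is a placeholder, so defer to the actual change and the exim docs):

    # Debian/Ubuntu split-config layout, e.g. a snippet under conf.d/main/ (placement illustrative)
    # RFC 5322 caps lines at 998 octets; raise exim's enforcement so mailman can
    # relay real-world messages whose header lines exceed that
    message_linelength_limit = 1000000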
19:38:51 <fungi> related to this, i'm starting to think that the bounce processing experiment is leading to more work for openstack-discuss than i have time for
19:39:38 <fungi> broad rejections like that one are leading to artificially inflated bounce scores for subscribers, and the separate probe messages are clearly confusing some subscribers
19:40:10 <fungi> yesterday we had one reply to their probe's verp address, which resulted in disabling their subscription, for example
19:40:14 <clarkb> on the other hand if we didn't do this we'd still bounce and not detect this problem?
19:40:33 <clarkb> and we'd just not deliver emails at all for those with lines that are too long?
19:40:34 <fungi> yes, unless we looked at logs
19:40:58 <fungi> well, another big issue with the verp probe messages is that they helpfully include a copy of the most recent bounce for that subscriber
19:41:13 <fungi> and bounce ndrs often helpfully include a copy of the message that bounced
19:41:36 <fungi> so if the message was bounced for having spammy content, the verp probe message quite often gets rejected by the mta for including the same content
19:42:39 <clarkb> that seems like a bug in the probe implementation used by mailman?
19:42:59 <clarkb> if all they are trying to do is determine if the address is valid, probing with as neutral a set of inputs as possible seems ideal
19:43:05 <clarkb> but we're not likely to fix that ourselves
19:43:28 <fungi> but worse than that, since exim will attempt to deliver multiple recipients at the same destination mta in one batch, and then bounce any resulting failures back to mailman in a single ndr, the resulting list of rcpt rejection messages sometimes ends up in the verp probe messages sent to subscribers, so for example a user subscribed from a domain handled by gmail will get a
19:43:30 <fungi> list of many/most of the other subscriber addresses whose domains are also managed by gmail
19:43:37 <clarkb> anyway I'm good with landing that change despite it being an rfc violation. It's the default in noble and clearly people are violating it. This seems like a case where we're best off being flexible for real world inputs
19:44:16 <clarkb> then separately if we want to disable the bounce processing on that list and ask mailman upstream if these are bugs I'm good with that too
19:44:37 <fungi> yeah, i think if we're going to continue to try to have mailman not modify messages sent to it, we need exim to not enforce restrictions on what mailman can send through it
19:45:45 <fungi> i'll go ahead and disable bounce processing on openstack-discuss for now, mostly out of concern for potential future disclosure of subscriber addresses to other subscribers, and post to the mailman-users list looking into possible ways to improve that problem
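A sketch of flipping that switch from the mailman shell (the process_bounces attribute name is from memory of Mailman 3's bounce-processing support, not a quote of the actual procedure):

    $ mailman shell -l openstack-discuss@lists.openstack.org
    >>> m.process_bounces = False   # stop scoring bounces and sending VERP probes for this list
    >>> commit()                    # persist the change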
19:45:57 <clarkb> the change has my +2 anyone else want to review it before we proceed? I guess we're going to apply that to exim everywhere that supports it not just mailman but that also seems fine
19:46:26 <clarkb> sounds good
19:46:34 <clarkb> we've got a few more topics to cover so I'll keep moving
19:46:39 <clarkb> #topic Running certcheck on bridge
19:46:46 <clarkb> fungi: is there a change for this yet?
19:47:13 <fungi> ah, no i don't think i've written it yet
19:47:38 <clarkb> cool just making sure I haven't missed anything. It's been a busy few weeks
19:47:45 <clarkb> #topic Service Coordinator Election
19:47:52 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/
19:48:01 <clarkb> I made the plan official and haven't seen any complaints
19:48:15 <clarkb> Now is a great time to consider if you'd like to run and feel free to reach out if you have questions making that determination
19:48:32 <clarkb> I remain happy for someone else to step in and will happily support anyone who does
19:50:29 <clarkb> Nominations Open From February 4, 2025 to February 18, 2025
19:50:38 <clarkb> all dates will be treated with a UTC clock
19:50:44 <clarkb> consider yourselves warned :)
19:50:49 <clarkb> #topic Beginning of the Year (Virtual) Meetup Recap
19:50:54 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:51:28 <clarkb> we covered a number of topics and I tried to take notes on what we discussed (others helped too thanks!)
19:51:40 <clarkb> from those notes I then distilled things down into a todo list at the end of the etherpad
19:51:57 <clarkb> the idea there is to make it easy to find what sorts of things we said we should do without reading through the more in depth notes
19:52:08 <clarkb> I'm hoping that helps us get more things done over the course of the year
19:52:27 <clarkb> if there are any topics in particular that you think need better notes or clarification let me know and I'll do my best to fill in details
19:53:19 <clarkb> otherwise I think it was a successful use of our time. Thank you to everyone who participated
19:53:33 <clarkb> feels good to have some alignment on general todo items and approaches for those issues
19:53:40 <clarkb> #topic Open Discussion
19:54:02 <clarkb> PBR is likely to get a new release soon that adds better support for pyproject.toml usage and python3.12's lack of setuptools
19:54:14 <clarkb> it is intended to be backward compatible so ideally no one even notices
19:54:16 <clarkb> but be aware of that
19:54:23 <fungi> i did enjoy our vmeetup, thanks for organizing it!
19:55:13 <fungi> and yeah, we've got a stack of changes to bindep exercising various aspects of pbr's use in pyproject.toml
19:55:27 <fungi> that has proven exceedingly useful
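For anyone curious what that looks like, the usual shape of a pbr project's build-system table in pyproject.toml is roughly the following (a generic sketch with illustrative version pins, not the actual bindep changes):

    [build-system]
    # listing both here gets them installed into the isolated build environment,
    # so python3.12's lack of a preinstalled setuptools doesn't matter; pbr still
    # supplies versioning and metadata on top of setuptools
    requires = ["pbr>=6.0.0", "setuptools>=64.0.0"]
    build-backend = "setuptools.build_meta"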
19:58:10 <clarkb> as a heads up today is pretty amazing weather wise so I'm going to try and pop out after lunch for some outside time. It all changes thursday/friday and goes back to regularly scheduled precipitation
19:58:22 <fungi> enjoy!
19:58:34 <clarkb> sounds like that may be everything
19:58:40 <clarkb> thanks again for your time running opendev
19:58:48 <clarkb> I'll see you back here next week same time and location
19:58:50 <clarkb> #endmeeting