Tuesday, 2025-01-28

clarkbalmost meeting time18:58
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Jan 28 19:00:11 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5OE5J5DUQJXZWZ67O7CLANUCFWY7RNXB/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI don't have anything to announce. Did anyone else?19:00
clarkbI guess there is an openinfra party at FOSDEM for people who will be there19:00
clarkb#topic Zuul-launcher image builds19:03
clarkbThe most recent Zuul upgrade pulled in sufficient updates to get api management of zuul launcher image builds in place19:03
clarkbthere was a little blip where we needed to deploy some extra config for zuul web to make that work19:03
clarkbbut since that was sorted out corvus has been able to trigger new image builds via the ui19:04
corvusi just triggered another set; that will tell us if we're aging out old ones correctly19:04
clarkbI think the latest is that old existing builds didn't have sufficient metadata to get automatically cleared out but new ones should19:04
clarkbya that19:04
clarkbanything else we should be aware of on this topic? Or areas that need help?19:05
corvusnope; new image jobs are still welcome any time19:05
corvusbut that's not blocking yet19:05
clarkback and thank you for sorting out zuul web post upgrade19:06
corvusnp19:06
clarkb#topic Upgrading old servers19:06
clarkbI'm going to fold the noble work under this topic now that it's largely sorted out19:06
clarkbsome things I want to note:19:06
clarkbNew noble servers will deploy with borg 1.2.8 and backup to backup servers running borg 1.1.18. That will be the situation until we deploy new noble backup servers running borg 1.4.x. At that point we can convert servers capable of running 1.4.x to backup to those new backup servers (I believe this will include noble and jammy servers)19:07
clarkbsome services may need their docker-compose.yaml files updated to set an 'always' restart policy though most already use always19:08
clarkband fungi did dig into podman packaging for ubuntu and doesn't think that upgrading podman requires container restarts (upgrading dockerd packages did)19:08
clarkbSo far paste02 seems to be happy and after the initial set of problems (understanding reboot behavior, borg backups, and general podman support) I think we're in good shape to continue deploying new things on noble19:09
clarkbwe also converted lodgeit over to publishing to quay.io and speculative testing of those container images works with podman (this was the expectation but it is good to see in practice)19:09
clarkbat this point I think what I'd like to do is find some more complicated service to upgrade to noble and run under podman. My ultimate goal is to redeploy review, but considering how much we learned from paste02 I think finding another canary first is a good idea19:10
clarkbso far I've been thinking codesearch or grafana would be good simple updates like paste. But better than that might be something like zuul servers (scheduler and or executors?)19:10
clarkbnot sure if anyone had thoughts on that. I'm open to input19:10
clarkbwon't get to that today though so we have time to chime in. Hoping later this week though19:11
clarkbtonyb: anything new with wiki to discuss?19:12
corvusno objections to using zuul as a canary -- but one thought: i don't think we'd want to have the different components on different systems for long19:12
clarkbthat makes sense so probably need to do the whole thing in a coordinated fashion19:13
corvusalso, start with executor :)19:13
clarkbI'll keep that in mind and look at the list again to see if there are any other better candidates19:13
corvusmost likely to introduce novel issues19:13
clarkback19:13
corvusotoh -- all of zuul is regularly tested with podman19:14
corvusso that's nice19:14
clarkbsounds like that may be it for this topic (we can swing back around later if there is time and need to)19:15
clarkb#topic Switch to quay.io/opendevmirror images where possible19:15
clarkbone thing I noticed when trying to upgrade gerrit and gitea and do lodgeit/paste work is that using the mirrored images from quay really does help with job reliability19:16
clarkbOne "easy" approach here is to switch our use of mariadb from docker to quay.io. Doing so does cause the database to restart though so keep that in mind. So far we have converted gitea and lodgeit. Other services to do this to include refstack, gerrit, etherpad, the zuul db system, and probably others I'm forgetting19:17
clarkbI may just go ahead and try and push changes up for all of these cases I can find then we can land them as we feel is appropriate19:17
clarkbthen separately we may wish to switch our dockerfile image builds over to base images hosted on quay as well. For example with the python base and builder images19:17
clarkbone thing to keep in mind with doing this is we'll lose the ability to speculatively test those base images against our image builds. I think this is something we can live with while we transition over to quay in general19:18
clarkbspeculative image building is far more useful with the actual service images and while we may have used the speculative state once or twice in the past to test base image updates I don't think they are as critical19:18
clarkbjust keep that in mind as a class of optimization we can apply to improve reliability in our ci system19:19
clarkb#topic Unpinning our Grafana deployment19:19
clarkbAt some point (last year?) we updated grafana and some of our graphs stopped working. There were problems with CORS I guess. Anyway I've pushed up some changes to improve testing of grafana in system config so that we can inspect this better and have a change to bump up to the newest version of the current major release we are on19:20
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94007319:20
clarkbI then held a node: https://217.182.143.14/ and through that was able to track down the problem (or at least I think I did)19:21
clarkbthe issue seems to be specific to grafana dashboards that use queries to look up info and dynamically build out the dashboard. These queries hit CORS errors19:21
clarkbit turns out that we can have grafana proxy the requests to graphite to mitigate this problem: https://review.opendev.org/c/openstack/project-config/+/94027619:21
clarkbthat doesn't seem to break the graphs on the current version and I think makes the graphs work with the latest version19:22
clarkblong story short I think if we land 940276 and confirm things continue to work with the current deployment then the next step is upgrading to the latest version of the current major release19:22
clarkbI don't want to go straight to the next major release because we get warnings about graphs requiring angular and that has been deprecated so sorting that out will be the next battle before upgrading to the latest major release19:23
clarkbreviews welcome19:23
corvuswhat about adding cors to graphite?19:23
clarkbI suspect that would work as well.19:24
corvuswe have some headers there already19:24
clarkbbasically we would need to add grafana.opendev.org to the allowed origins list I think19:24
corvushttps://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/graphite/templates/graphite-statsd.conf.j2#L55-L5819:25
clarkboh hrm we already set it to *19:25
clarkbso why is this breaking19:25
corvuswhich dashboard failed?19:25
clarkbcorvus: dib status or any of the nodepool provider specific ones (the base nodepool one works)19:25
clarkbI believe it was an OPTIONS request which we set that header on19:26
clarkbbut ya maybe there is something subtly wrong in that header config for graphite19:26
clarkbhttps://217.182.143.14/d/f3089338b3/nodepool3a-dib-status?orgId=1 this is an example failure19:26
clarkblooking at the response headers I don't see the allowed origins in there. So maybe that config is ineffective for some reason. Definitely something to debug further if we prefer that approach (it would make integration easier overall probably)19:28
corvushrm19:29
corvusi wonder if it has to do with the post19:29
clarkboh ya that post seems to have no data in my firefox debugger19:29
clarkbI didn't look at it previously because it didn't hard fail like the OPTIONS request19:30
clarkbwe might need to check server logs on graphite to untangle this too19:30
corvusthe browser sees a post request, so first it does an OPTIONS request, and maybe the options isn't handled by graphite so apache just passes through the 400 without adding the headers19:30
fungicould that dashboard's configuration be subtly incompatible with newer grafana resulting in an empty query?19:30
clarkbfungi: I don't think so because converting grafana to proxy instead of direct communication to graphite works19:31
corvusso yeah, if it's something like that, then we might need to tell apache to return those headers even on 400 errors19:31
fungiah, didn't realize the proxy solution had been tried already19:31
clarkbyes proxy solution is working based on testing in the changes linked above19:31
clarkbcorvus: do you mean nginx on the graphite host? but ya perhaps we aren't responding properly and firefox gets sad19:32
clarkbI can try and dig more by looking at graphite web server logs later today19:32
corvusheh that is an nginx config isn't it :)19:32
clarkbya iirc graphite is this big monolithic container with all the things in one image and we're just configuring what is there19:33
corvusah19:33
corvuscurl -i -X OPTIONS 'https://graphite.opendev.org/metrics/find?from=1738070749&until=1738092351'19:33
corvusthat returns 400 with no headers19:33
corvus(it's missing the "query" parameter, which would presumably be included in the POST request)19:34
corvusso that's my theory -- we're not adding the CORS headers on failing OPTIONS requests19:34
clarkbseems plausible and likely something we can fix with the right nginx config19:34
corvus++19:35
corvusi think that's worth doing in preference to the proxy19:35
fungisounds reasonable to me19:35
clarkbagreed since that makes this work more broadly and not only if you have special proxies19:35
clarkbI can try and dig into that more later today19:35
corvusshould be a little more efficient, and we already try to have that work19:35
corvusthat too19:35
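Editorial sketch of the theory above, in Python: a simplified model of how a browser decides whether to accept a CORS preflight (OPTIONS) response. It assumes no credentials are sent and only looks at the status code and the Access-Control-Allow-Origin header. The function name and the example origin are hypothetical, and nothing here reflects the actual nginx configuration on the graphite host. As far as the Fetch standard reads, the preflight response must also carry a 2xx status, so if that reading is right the fix may need to answer OPTIONS successfully as well as attach the headers.

```python
def preflight_allows(status, headers, origin):
    """Simplified model of whether a browser accepts a CORS preflight response.

    Assumptions: no credentials are sent, and only the status code and the
    Access-Control-Allow-Origin header are considered (method and header
    allow-lists are ignored).
    """
    # Header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    allowed = normalised.get("access-control-allow-origin")
    if allowed is None:
        # No allow-origin header at all: the browser blocks the request.
        return False
    origin_ok = allowed == "*" or allowed == origin
    status_ok = 200 <= status <= 299
    return origin_ok and status_ok


if __name__ == "__main__":
    origin = "https://grafana.opendev.org"  # hypothetical requesting origin
    # A 400 with no CORS headers, like the curl result quoted above.
    print(preflight_allows(400, {}, origin))
    # A successful preflight that allows any origin.
    print(preflight_allows(204, {"Access-Control-Allow-Origin": "*"}, origin))
```

Under this model the 400 response with no headers is rejected, which matches the failure seen in the browser debugger.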
clarkb#topic Increasing Exim's Acceptable Line Length 19:36
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94024819:36
clarkbtl;dr here is that our own exim on lists.opendev.org is generating bounces when people send messages with header lines that are too long19:36
clarkbthe thing I'm confused about is why I haven't been kicked out of the list yet but maybe I don't bounce often enough for this to trip me over the limit19:37
clarkbanyway the change seems fine to me if this is part of noble's defaults anyway19:37
fungiyes, latest discovery among the various challenges mailman has delivering messages19:37
clarkbany reason we shouldn't proceed with landing it?19:37
fungithis particular failure seems to occur infrequently, but since it's technically an rfc violation and a default (if newer) exim behavior we'd be overriding, i wanted to make sure we had some consensus19:38
clarkbI'm ok with rfc violations if experience says the real world is violating the rfc19:38
fungirelated to this, i'm starting to think that the bounce processing experiment is leading to more work for openstack-discuss than i have time for19:38
fungibroad rejections like that one are leading to artificially inflated bounce scores for subscribers, and the separate probe messages are clearly confusing some subscribers19:39
fungiyesterday we had one reply to their probe's verp address, which resulted in disabling their subscription, for example19:40
clarkbon the other hand if we didn't do this we'd still bounce and not detect this problem?19:40
clarkband we'd just not deliver emails at all for those with lines that are too long?19:40
fungiyes, unless we looked at logs19:40
fungiwell, another big issue with the verp probe messages is that they helpfully include a copy of the most recent bounce for that subscriber19:40
fungiand bounce ndrs often helpfully include a copy of the message that bounced19:41
fungiso if the message was bounced for having spammy content, the verp probe message quite often gets rejected by the mta for including the same content19:41
clarkbthat seems like a bug in the probe implementation used by mailman?19:42
clarkbif all they are trying to do is determine if the address is valid probing with as neutral a set of inputs seems ideal19:42
clarkbbut we're not likely to fix that ourselves19:43
fungibut worse than that, since exim will attempt to deliver multiple recipients at the same destination mta in one batch, and then bounce any resulting failures back to mailman in a single ndr, the resulting list of rcpt rejection messages sometimes ends up in the verp probe messages sent to subscribers, so for example a user subscribed from a domain handled by gmail will get a19:43
fungilist of many/most of the other subscriber addresses whose domains are also managed by gmail19:43
clarkbanyway I'm good with landing that change despite it being an rfc violation. It's the default in noble and clearly people are violating it. This seems like a case where we're best off being flexible for real world inputs19:43
clarkbthen separately if we want to disable the bounce processing on that list and ask mailman if these are bugs I'm good with that too19:44
fungiyeah, i think if we're going to continue to try to have mailman not modify messages sent to it, we need exim to not enforce restrictions on what mailman can send through it19:44
fungii'll go ahead and disable bounce processing on openstack-discuss for now, mostly out of concern for potential future disclosure of subscriber addresses to other subscribers, and post to the mailman-users list looking into possible ways to improve that problem19:45
clarkbthe change has my +2 anyone else want to review it before we proceed? I guess we're going to apply that to exim everywhere that supports it not just mailman but that also seems fine19:45
clarkbsounds good19:46
clarkbwe've got a few more topics to cover so I'll keep moving19:46
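Editorial sketch of the line length rule discussed above, in Python. RFC 5322 limits each line of a message to 998 characters excluding the CRLF, and that is the value assumed here. The limit Exim actually enforces, and how the linked change adjusts it, are not shown in this log, so the function name and default are hypothetical. It only illustrates how one might spot the offending lines in a raw message.

```python
RFC5322_MAX_LINE = 998  # characters per line, excluding the trailing CRLF


def overlong_lines(raw_message, limit=RFC5322_MAX_LINE):
    """Return (line_number, length) for every line longer than limit.

    raw_message is the message as bytes with CRLF line endings.
    """
    results = []
    for number, line in enumerate(raw_message.split(b"\r\n"), start=1):
        if len(line) > limit:
            results.append((number, len(line)))
    return results


if __name__ == "__main__":
    # A made-up message whose second header line is 1008 bytes long.
    message = b"Subject: ok\r\nX-Long: " + b"a" * 1000 + b"\r\n\r\nbody\r\n"
    print(overlong_lines(message))
```

Running a check like this against a bounced message would show which header tripped the limit.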
clarkb#topic Running certcheck on bridge19:46
clarkbfungi: is there a change for this yet?19:46
fungiah, no i don't think i've written it yet19:47
clarkbcool just making sure I haven't missed anything. It's been a busy few weeks19:47
clarkb#topic Service Coordinator Election19:47
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/19:47
clarkbI made the plan official and haven't seen any complaints19:48
clarkbNow is a great time to consider if you'd like to run and feel free to reach out if you have questions making that determination19:48
clarkbI remain happy for someone else to step in and will happily support anyone who does19:48
clarkbNominations Open From February 4, 2025 to February 18, 202519:50
clarkball dates will be treated with a UTC clock19:50
clarkbconsider yourselves warned :)19:50
clarkb#topic Beginning of the Year (Virtual) Meetup Recap19:50
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:50
clarkbwe covered a number of topics and I tried to take notes on what we discussed (others helped too thanks!)19:51
clarkbfrom those notes I then distilled things down into a todo list at the end of the etherpad19:51
clarkbthe idea there is to make it easy to find what sorts of things we said we should do without reading through the more in depth notes19:51
clarkbI'm hoping that helps us get more things done over the course of the year19:52
clarkbif there are any topics in particular that you think need better notes or clarification let me know and I'll do my best to fill in details19:52
clarkbotherwise I think it was a successful use of our time. Thank you to everyone who participated19:53
clarkbfeels good to have some alignment on general todo items and approaches for those issues19:53
clarkb#topic Open Discussion19:53
clarkbPBR is likely to get a new release soon that adds better support for pyproject.toml usage and python3.12's lack of setuptools19:54
clarkbit is intended to be backward compatible so ideally no one even notices19:54
clarkbbut be aware of that19:54
fungii did enjoy our vmeetup, thanks for organizing it!19:54
fungiand yeah, we've got a stack of changes to bindep exercising various aspects of pbr's use in pyproject.toml19:55
fungithat has proven exceedingly useful19:55
clarkbas a heads up today is pretty amazing weather wise so I'm going to try and pop out after lunch for some outside time. It all changes thursday/friday and goes back to regularly scheduled precipitation19:58
fungienjoy!19:58
clarkbsounds like that may be everything19:58
clarkbthanks again for your time running opendev19:58
clarkbI'll see you back here next week same time and location19:58
clarkb#endmeeting19:58
opendevmeetMeeting ended Tue Jan 28 19:58:50 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:58
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-01-28-19.00.html19:58
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-01-28-19.00.txt19:58
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-01-28-19.00.log.html19:58
