clarkb | almost meeting time | 18:58 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Jan 28 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5OE5J5DUQJXZWZ67O7CLANUCFWY7RNXB/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | I don't have anything to announce. Did anyone else? | 19:00 |
clarkb | I guess there is an openinfra party at FOSDEM for people who will be there | 19:00 |
clarkb | #topic Zuul-launcher image builds | 19:03 |
clarkb | The most recent Zuul upgrade pulled in sufficient updates to get api management of zuul launcher image builds in place | 19:03 |
clarkb | there was a little blip where we needed to deploy some extra config for zuul web to make that work | 19:03 |
clarkb | but since that was sorted out corvus has been able to trigger new image builds via the ui | 19:04 |
corvus | i just triggered another set; that will tell us if we're aging out old ones correctly | 19:04 |
clarkb | I think the latest is that old existing builds didn't have sufficient metadata to get automatically cleared out, but new ones should | 19:04 |
clarkb | ya that | 19:04 |
clarkb | anything else we should be aware of on this topic? Or areas that need help? | 19:05 |
corvus | nope; new image jobs are still welcome any time | 19:05 |
corvus | but that's not blocking yet | 19:05 |
clarkb | ack and thank you for sorting out zuul web post upgrade | 19:06 |
corvus | np | 19:06 |
clarkb | #topic Upgrading old servers | 19:06 |
clarkb | I'm going to fold the noble work under this topic now that it's largely sorted out | 19:06 |
clarkb | some things I want to note: | 19:06 |
clarkb | New noble servers will deploy with borg 1.2.8 and back up to backup servers running borg 1.1.18. That will be the situation until we deploy new noble backup servers running borg 1.4.x. At that point we can convert servers capable of running 1.4.x to back up to those new backup servers (I believe this will include noble and jammy servers) | 19:07 |
clarkb | some services may need their docker-compose.yaml files updated to set an 'always' restart policy though most already use always | 19:08 |
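For reference, the relevant bit is just the per-service restart policy in that service's docker-compose.yaml, roughly like the sketch below; the service name and image are placeholders, not a specific opendev service.

```yaml
# Placeholder sketch: "restart: always" makes the container come back
# automatically after a host reboot or a container runtime restart.
services:
  someservice:
    image: quay.io/opendevorg/someservice:latest
    restart: always
```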
clarkb | and fungi did dig into podman packaging for ubuntu and doesn't think that upgrading podman requires container restarts (upgrading dockerd packages did) | 19:08 |
clarkb | So far paste02 seems to be happy and after the initial set of problems (understanding reboot behavior, borg backups, and general podman support) I think we're in good shape to continue deploying new things on noble | 19:09 |
clarkb | we also converted lodgeit over to publishing to quay.io and speculative testing of those container images works with podman (this was the expectation but it is good to see in practice) | 19:09 |
clarkb | at this point I think what I'd like to do is find some more complicated service to upgrade to noble and run under podman. My ultimate goal is to redeploy review, but considering how much we learned from paste02 I think finding another canary first is a good idea | 19:10 |
clarkb | so far I've been thinking codesearch or grafana would be good, simple updates like paste. But better than that might be something like zuul servers (scheduler and/or executors?) | 19:10 |
clarkb | not sure if anyone had thoughts on that. I'm open to input | 19:10 |
clarkb | won't get to that today, so we have time to chime in. Hoping to get to it later this week though | 19:11 |
clarkb | tonyb: anything new with wiki to discuss? | 19:12 |
corvus | no objections to using zuul as a canary -- but one thought: i don't think we'd want to have the different components on different systems for long | 19:12 |
clarkb | that makes sense so probably need to do the whole thing in a coordinated fashion | 19:13 |
corvus | also, start with executor :) | 19:13 |
clarkb | I'll keep that in mind and look at the list again to see if there are any other better candidates | 19:13 |
corvus | most likely to introduce novel issues | 19:13 |
clarkb | ack | 19:13 |
corvus | otoh -- all of zuul is regularly tested with podman | 19:14 |
corvus | so that's nice | 19:14 |
clarkb | sounds like that may be it for this topic (we can swing back around later if there is time and need to) | 19:15 |
clarkb | #topic Switch to quay.io/opendevmirror images where possible | 19:15 |
clarkb | one thing I noticed when trying to upgrade gerrit and gitea and do the lodgeit/paste work is that using the mirrored images from quay really does help with job reliability | 19:16 |
clarkb | One "easy" approach here is to switch our mariadb images from Docker Hub to quay.io. Doing so does cause the database to restart though, so keep that in mind. So far we have converted gitea and lodgeit. Other services to convert include refstack, gerrit, etherpad, the zuul db system, and probably others I'm forgetting | 19:17 |
clarkb | I may just go ahead and try and push changes up for all of these cases I can find then we can land them as we feel is appropriate | 19:17 |
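As an illustration of the kind of change involved (the mirror namespace comes from the topic title; the tag and extra settings are placeholders, keep whatever each service already pins), the diff is usually a one-line image swap in the service's docker-compose.yaml:

```yaml
# Placeholder sketch: pull the mirrored image from quay.io instead of
# Docker Hub; note the database container restarts when this is applied.
services:
  mariadb:
    image: quay.io/opendevmirror/mariadb:10.11   # was: mariadb:10.11
    restart: always
```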
clarkb | then separately we may wish to switch our dockerfile image builds over to base images hosted on quay as well. For example with the python base and builder images | 19:17 |
clarkb | one thing to keep in mind with doing this is we'll lose the ability to speculatively test those base images against our image builds. I think this is something we can live with while we transition over to quay in general | 19:18 |
clarkb | speculative image building is far more useful with the actual service images and while we may have used the speculative state once or twice in the past to test base image updates I don't think they are as critical | 19:18 |
clarkb | just keep that in mind as a class of optimization we can apply to improve reliability in our ci system | 19:19 |
clarkb | #topic Unpinning our Grafana deployment | 19:19 |
clarkb | At some point (last year?) we updated grafana and some of our graphs stopped working. There were problems with CORS I guess. Anyway I've pushed up some changes to improve testing of grafana in system config so that we can inspect this better and have a change to bump up to the newest version of the current major release we are on | 19:20 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/940073 | 19:20 |
clarkb | I then held a node: https://217.182.143.14/ and through that was able to track down the problem (or at least I think I did) | 19:21 |
clarkb | the issue seems to be specific to grafana dashboards that use queries to look up info and dynamically build out the dashboard. These queries hit CORS errors | 19:21 |
clarkb | it turns out that we can have grafana proxy the requests to graphite to mitigate this problem: https://review.opendev.org/c/openstack/project-config/+/940276 | 19:21 |
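For context, the difference boils down to the datasource's access mode. A minimal sketch in Grafana's datasource provisioning format, assuming that is roughly how 940276 expresses it (the name and URL here are illustrative):

```yaml
# Placeholder sketch: "proxy" makes the grafana backend fetch from graphite
# server-side, so the browser never issues the cross-origin requests that
# were hitting CORS errors; "direct" is the in-browser mode.
apiVersion: 1
datasources:
  - name: graphite
    type: graphite
    url: https://graphite.opendev.org
    access: proxy
```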
clarkb | that doesn't seem to break the graphs on the current version and I think makes the graphs work with the latest version | 19:22 |
clarkb | long story short I think if we land 940276 and confirm things continue to work with the current deployment then the next step is upgrading to the latest version of the current major release | 19:22 |
clarkb | I don't want to go straight to the next major release because we get warnings about graphs requiring angular and that has been deprecated so sorting that out will be the next battle before upgrading to the latest major release | 19:23 |
clarkb | reviews welcome | 19:23 |
corvus | what about adding cors to graphite? | 19:23 |
clarkb | I suspect that would work as well. | 19:24 |
corvus | we have some headers there already | 19:24 |
clarkb | basically we would need to add grafana.opendev.org to the allowed origins list I think | 19:24 |
corvus | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/graphite/templates/graphite-statsd.conf.j2#L55-L58 | 19:25 |
clarkb | oh hrm we already set it to * | 19:25 |
clarkb | so why is this breaking | 19:25 |
corvus | which dashobard failed? | 19:25 |
clarkb | corvus: dib status or any of the nodepool provider specific ones (the base nodepool one works) | 19:25 |
clarkb | I believe it was an OPTIONS request which we set that header on | 19:26 |
clarkb | but ya maybe there is something subtly wrong in that header config for graphite | 19:26 |
clarkb | https://217.182.143.14/d/f3089338b3/nodepool3a-dib-status?orgId=1 this is an example failure | 19:26 |
clarkb | looking at the response headers I don't see the allowed origins in there. So maybe that config is ineffective for some reason. Definitely something to debug further if we prefer that approach (it would make integration easier overall probably) | 19:28 |
corvus | hrm | 19:29 |
corvus | i wonder if it has to do with the post | 19:29 |
clarkb | oh ya that post seems to have no data in my firefox debugger | 19:29 |
clarkb | I didn't look at it previously because it didn't hard fail like the OPTIONS request | 19:30 |
clarkb | we might need to check server logs on graphite to untangle this too | 19:30 |
corvus | the browser sees a post request, so first it does an OPTIONS request, and maybe the options isn't handled by graphite so apache just passes through the 400 without adding the headers | 19:30 |
fungi | could that dashboard's configuration be subtly incompatible with newer grafana resulting in an empty query? | 19:30 |
clarkb | fungi: I don't think so because converting grafana to proxy instead of direct communication to graphite works | 19:31 |
corvus | so yeah, if it's something like that, then we might need to tell apache to return those headers even on 400 errors | 19:31 |
fungi | ah, didn't realize the proxy solution had been tried already | 19:31 |
clarkb | yes proxy solution is working based on testing in the changes linked above | 19:31 |
clarkb | corvus: do you mean nginx on the graphite host? but ya perhaps we aren't responding properly and firefox gets sad | 19:32 |
clarkb | I can try and dig more by looking at graphite web server logs later today | 19:32 |
corvus | heh that is an nginx config isn't it :) | 19:32 |
clarkb | ya iirc graphite is this big monolithic container with all the things in one image and we're just configuring what is there | 19:33 |
corvus | ah | 19:33 |
corvus | curl -i -X OPTIONS 'https://graphite.opendev.org/metrics/find?from=1738070749&until=1738092351' | 19:33 |
corvus | that returns 400 with no headers | 19:33 |
corvus | (it's missing the "query" parameter, which would presumably be included in the POST request) | 19:34 |
corvus | so that's my theory -- we're not adding the CORS headers on failing OPTIONS requests | 19:34 |
clarkb | seems plausible and likely something we can fix with the right nginx config | 19:34 |
corvus | ++ | 19:35 |
corvus | i think that's worth doing in preference to the proxy | 19:35 |
fungi | sounds reasonable to me | 19:35 |
clarkb | agreed, since that makes this work more broadly rather than only when you have special proxies in place | 19:35 |
clarkb | I can try and dig into that more later today | 19:35 |
corvus | should be a little more efficient, and we already try to have that work | 19:35 |
corvus | that too | 19:35 |
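A minimal sketch of what "return those headers even on errors" could look like in nginx, assuming the graphite container's config exposes a location block we can adjust (the upstream address is a placeholder); the key detail is the `always` parameter on add_header, which makes nginx emit the header on 4xx/5xx responses too:

```nginx
# Placeholder sketch for the graphite-web vhost: send CORS headers on every
# response, including the 400 a bare OPTIONS preflight currently gets back.
location / {
    add_header Access-Control-Allow-Origin "*" always;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
    add_header Access-Control-Allow-Headers "Content-Type, Accept" always;

    proxy_pass http://127.0.0.1:8080;  # placeholder graphite-web backend
}
```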
clarkb | #topic Increasing Exim's Acceptable Line Length | 19:36 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/940248 | 19:36 |
clarkb | tl;dr here is that our own exim on lists.opendev.org is generating bounces when people send messages with header lines that are too long | 19:36 |
clarkb | the thing I'm confused about is why I haven't been kicked out of the list yet but maybe I don't bounce often enough for this to trip me over the limit | 19:37 |
clarkb | anyway the change seems fine to me if this is part of noble's defaults anyway | 19:37 |
fungi | yes, latest discovery among the various challenges mailman has delivering messages | 19:37 |
clarkb | any reason we shouldn't proceed with landing it? | 19:37 |
fungi | this particular failure seems to occur infrequently, but since it's technically an rfc violation and a default (if newer) exim behavior we'd be overriding, i wanted to make sure we had some consensus | 19:38 |
clarkb | I'm ok with rfc violations if experience says the real world is violating the rfc | 19:38 |
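As a rough sketch only (not confirmed against what 940248 actually does): if the knob in question is Exim's message_linelength_limit main option, which defaults to the RFC 5322 limit of 998 characters, relaxing it would look something like the snippet below; the file path and value are placeholders.

```
# Placeholder: raise the per-line length Exim will accept so over-long
# header lines get delivered instead of bounced.
# e.g. in an /etc/exim4 main-section config snippet on the lists server
message_linelength_limit = 5000
```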
fungi | related to this, i'm starting to think that the bounce processing experiment is leading to more work for openstack-discuss than i have time for | 19:38 |
fungi | broad rejections like that one are leading to artificially inflated bounce scores for subscribers, and the separate probe messages are clearly confusing some subscribers | 19:39 |
fungi | yesterday we had one reply to their probe's verp address, which resulted in disabling their subscription, for example | 19:40 |
clarkb | on the other hand if we didn't do this we'd still bounce and not detect this problem? | 19:40 |
clarkb | and we'd just not deliver emails at all for those with lines that are too long? | 19:40 |
fungi | yes, unless we looked at logs | 19:40 |
fungi | well, another big issue with the verp probe messages is that they helpfully include a copy of the most recent bounce for that subscriber | 19:40 |
fungi | and bounce ndrs often helpfully include a copy of the message that bounced | 19:41 |
fungi | so if the message was bounced for having spammy content, the verp probe message quite often gets rejected by the mta for including the same content | 19:41 |
clarkb | that seems like a bug in the probe implementation used by mailman? | 19:42 |
clarkb | if all they are trying to do is determine if the address is valid probing with as neutral a set of inputs seems ideal | 19:42 |
clarkb | but we're not likely to fix that ourselves | 19:43 |
fungi | but worse than that, since exim will attempt to deliver multiple recipients at the same destination mta in one batch, and then bounce any resulting failures back to mailman in a single ndr, the resulting list of rcpt rejection messages sometimes ends up in the verp probe messages sent to subscribers, so for example a user subscribed from a domain handled by gmail will get a list of many/most of the other subscriber addresses whose domains are also managed by gmail | 19:43 |
clarkb | anyway I'm good with landing that change despite it being an rfc violation. It's the default in noble and clearly people are violating it. This seems like a case where we're best off being flexible for real world inputs | 19:43 |
clarkb | then separately if we want to disable the bounce processing on that list and ask mailman if these are bugs I'm good with that too | 19:44 |
fungi | yeah, i think if we're going to continue to try to have mailman not modify messages sent to it, we need exim to not enforce restrictions on what mailman can send through it | 19:44 |
fungi | i'll go ahead and disable bounce processing on openstack-discuss for now, mostly out of concern for potential future disclosure of subscriber addresses to other subscribers, and post to the mailman-users list looking into possible ways to improve that problem | 19:45 |
clarkb | the change has my +2, anyone else want to review it before we proceed? I guess we're going to apply that to exim everywhere that supports it, not just mailman, but that also seems fine | 19:45 |
clarkb | sounds good | 19:46 |
clarkb | we've got a few more topics to cover so I'll keep moving | 19:46 |
clarkb | #topic Running certcheck on bridge | 19:46 |
clarkb | fungi: is there a change for this yet? | 19:46 |
fungi | ah, no i don't think i've written it yet | 19:47 |
clarkb | cool just making sure I haven't missed anything. It's been a busy few weeks | 19:47 |
clarkb | #topic Service Coordinator Election | 19:47 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/NGS2APEFQB45OCJCQ645P5N6XCH52BXW/ | 19:47 |
clarkb | I made the plan official and haven't seen any complaints | 19:48 |
clarkb | Now is a great time to consider if you'd like to run and feel free to reach out if you have questions making that determination | 19:48 |
clarkb | I remain happy for someone else to step in and will happily support anyone who does | 19:48 |
clarkb | Nominations Open From February 4, 2025 to February 18, 2025 | 19:50 |
clarkb | all dates will be treated with a UTC clock | 19:50 |
clarkb | consider yourselves warned :) | 19:50 |
clarkb | #topic Beginning of the Year (Virtual) Meetup Recap | 19:50 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:50 |
clarkb | we covered a number of topics and I tried to take notes on what we discussed (others helped too thanks!) | 19:51 |
clarkb | from those notes I then distilled things down into a todo list at the end of the etherpad | 19:51 |
clarkb | the idea there is to make it easy to find what sorts of things we said we should do without reading through the more in depth notes | 19:51 |
clarkb | I'm hoping that helps us get more things done over the course of the year | 19:52 |
clarkb | if there are any topics in particular that you think need better notes or clarification let me know and I'll do my best to fill in details | 19:52 |
clarkb | otherwise I think it was a successful use of our time. Thank you to everyone who participated | 19:53 |
clarkb | feels good to have some alignment on general todo items and approaches for those issues | 19:53 |
clarkb | #topic Open Discussion | 19:53 |
clarkb | PBR is likely to get a new release soon that adds better support for pyproject.toml usage and python3.12's lack of setuptools | 19:54 |
clarkb | it is intended to be backward compatible so ideally no one even notices | 19:54 |
clarkb | but be aware of that | 19:54 |
fungi | i did enjoy our vmeetup, thanks for organizing it! | 19:54 |
fungi | and yeah, we've got a stack of changes to bindep exercising various aspects of pbr's use in pyproject.toml | 19:55 |
fungi | that has proven exceedingly useful | 19:55 |
clarkb | as a heads up today is pretty amazing weather wise so I'm going to try and pop out after lunch for some outside time. It all changes thursday/friday and goes back to regularly scheduled precipitation | 19:58 |
fungi | enjoy! | 19:58 |
clarkb | sounds like that may be everything | 19:58 |
clarkb | thanks again for your time running opendev | 19:58 |
clarkb | I'll see you back here next week same time and location | 19:58 |
clarkb | #endmeeting | 19:58 |
opendevmeet | Meeting ended Tue Jan 28 19:58:50 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:58 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-01-28-19.00.html | 19:58 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-01-28-19.00.txt | 19:58 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-01-28-19.00.log.html | 19:58 |