Tuesday, 2026-03-24

@mnasiadka:matrix.orgHmm, I just had three POST_FAILURE in Kolla gate, any idea where to look to get more insight than no logs at all? (e.g. https://zuul.opendev.org/t/openstack/build/2f8944418bd248e1814b96ae0635243b)13:44
@fungicide:matrix.orgmnasiadka: i usually grep the build id against the zuul-executor debug logs on ze01-ze12 to find where it ran and narrow down the exception13:48
@mnasiadka:matrix.orgfungi: I was thinking to do that, but thought maybe there's some centralised log management - I'll have a look, thanks :)13:49
@fungicide:matrix.orgthere's not (yet anyway)13:49
@fungicide:matrix.orgin theory you can look in the scheduler logs to identify which executor ran the build, but it's easier for me to just check all 12 of them in a scripted loop13:50
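The scripted loop fungi describes might look like this sketch; the `find_build` helper is hypothetical, and the `ze01`-`ze12` hostnames and `/var/log/zuul/executor-debug.log` path are assumptions for illustration, not verified paths:

```shell
#!/bin/sh
# find_build: grep a zuul build id out of one or more debug log files,
# printing the matching file names and lines. (Hypothetical helper.)
find_build() {
    build="$1"
    shift
    grep -H "build: ${build}" "$@"
}

# In practice you would run the grep over ssh against each executor;
# host names and the log path here are assumptions for illustration.
scan_executors() {
    build="$1"
    for n in 01 02 03 04 05 06 07 08 09 10 11 12; do
        ssh "ze${n}.opendev.org" \
            "grep -l 'build: ${build}' /var/log/zuul/executor-debug.log" \
            >/dev/null 2>&1 && echo "build ${build} ran on ze${n}"
    done
}
```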
@tafkamax:matrix.orgAs stated previously, in my org we are trying to test out vector for all monitoring: e.g. it reads data from prometheus metrics, logs from paths and so on, syslog from port 514 and whatnot. https://vector.dev13:51
@tafkamax:matrix.organd view it all from grafana13:52
@fungicide:matrix.orgi would mainly worry about the volume/rate of debug logs. just one of our executors generated 1.7m lines of debug logging yesterday, and that was a relatively "quiet" day for us. we not only have 12 of those, but also 8 mergers, 2 launchers, 2 schedulers (one scheduler generated almost 10m lines of debug logs yesterday)... and that's just the zuul components13:59
@fungicide:matrix.orga "busy" day for us could conceivably see an order of magnitude more log lines14:00
@fungicide:matrix.orgapache worker slots seem to be exhausted on lists01, i'm working to see if i can get it back on track14:09
@fungicide:matrix.orgit feels like apache workers were getting into swap thrash, given how long it's taking to stop them, seems like they're being slow to page their memory back in to free it14:13
@fungicide:matrix.orgwe probably ought to think about either getting the server more memory (in-place resize if possible, i can't remember if we can do it in that region) or splitting the mailman components (hyperkitty, postorius and mailman-core can all run on separate servers if we want)14:16
@fungicide:matrix.org(mariadb too, of course)14:16
@fungicide:matrix.orgsince most of the new load seems to be from crawlers digging through every corner of the archives, moving hyperkitty to its own server would allow us to scale it independently without impacting the reputation of the ip address e-mails are coming from (though also we could turn lists01 into a smarthost and send/receive through it while only leaving exim there)14:18
@harbott.osism.tech:regio.chatseems the POST_FAILUREs are very widely spread, I'll try to look at executors now, too14:30
@fungicide:matrix.orgthanks!14:30
@fungicide:matrix.orgguessing one of the swift providers is having problems14:30
@harbott.osism.tech:regio.chatyes, failure in swift upload. but I'm not sure yet how to identify which one is affected14:35
@harbott.osism.tech:regio.chatif someone knows better than me, check build e622a14892134d30bcb513a613836b94 on ze0714:36
@harbott.osism.tech:regio.chatah, spotted it. `ovh_gra`. will make a patch to disable14:37
@fungicide:matrix.org`2026-03-24 14:21:54,663 DEBUG zuul.AnsibleJob: [e: d2fb0d9967794642a0b6554e4bd3dbc5] [build: e622a14892134d30bcb513a613836b94]       name: Upload swift logs to ovh_gra`14:39
@fungicide:matrix.orgyep, that one14:39
@fungicide:matrix.orgstanding by to fast-approve14:39
@harbott.osism.tech:regio.chatlooks like ovh_bhs is also affected, will disable both. not sure if amorin is around and can check their end?14:44
@fungicide:matrix.orgthanks. i don't see any incidents on https://public-cloud.status-ovhcloud.com/ yet nor e-mail to infra-root about anything related14:45
@harbott.osism.tech:regio.chat`Unresolved incident: [DE/SGP/RBX/YYZ/MUM][Storage] - S3 maintenance notification.` on https://public-cloud.status-ovhcloud.com/14:45
@fungicide:matrix.orgarnaudm looks likely to be amorin's matrix nick, if that helps highlight this14:46
@fungicide:matrix.orgJens Harbott: yes, that didn't mention swift nor the regions we're using, so i assumed it was unrelated14:47
@fungicide:matrix.orgbut i suppose it still could be14:47
-@gerrit:opendev.org- Dr. Jens Harbott proposed: [opendev/base-jobs] 981907: Disable job log uploads to ovh swift https://review.opendev.org/c/opendev/base-jobs/+/98190714:48
@harbott.osism.tech:regio.chatah, right, S3 and swift are always the same thing in my mind since we provide both via ceph ;)14:49
@fungicide:matrix.orgit's possible we may need to bypass zuul and merge that in gerrit14:49
@arnaudm:matrix.orghey, yes it's me :)14:51
@arnaudm:matrix.orgswift is having an issue?14:51
@fungicide:matrix.orgarnaudm: we're just seeing swift upload failures in bhs and gra but nothing obvious on the ovh cloud status page14:52
@arnaudm:matrix.orgdo you have any e.g. of calls that are failing, so I can ask the team?14:52
@harbott.osism.tech:regio.chatyes, looks like it, lots of logs uploads failing, we don't see details due to `no_log` in ansible14:52
@fungicide:matrix.orgi think we'd have to perform manual tests to reproduce it14:52
@arnaudm:matrix.orgI am asking the team anyway, maybe they know something14:57
-@gerrit:opendev.org- Zuul merged on behalf of Dr. Jens Harbott: [opendev/base-jobs] 981907: Disable job log uploads to ovh swift https://review.opendev.org/c/opendev/base-jobs/+/98190714:58
@harbott.osism.tech:regio.chatseems we were lucky? or did the change fix uploads for itself already?14:58
@harbott.osism.tech:regio.chatfungi: maybe worth a status notice that rechecks are fine now? can you phrase something?14:59
@fungicide:matrix.org#status notice Recent POST_FAILURE job results with no logs were due to upload errors in one of our providers, which has been temporarily disabled now so rechecking those should be safe15:02
@status:opendev.org@fungicide:matrix.org: sending notice15:02
@clarkb:matrix.orgthe source code is https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py so even without logging we can probably narrow down the number of things that could be the problem15:03
-@status:opendev.org- NOTICE: Recent POST_FAILURE job results with no logs were due to upload errors in one of our providers, which has been temporarily disabled now so rechecking those should be safe15:05
@status:opendev.org@fungicide:matrix.org: finished sending notice15:05
@clarkb:matrix.orgfungi: thank you for sending the meeting agenda. Did you want to run it or should I? (I should be able to no problem)15:06
@fungicide:matrix.orgup to you, i just figured you didn't need to worry about sending out the agenda on your day off15:08
@clarkb:matrix.organd then the other thing I want to bring up for people to start thinking about is the Gerrit upgrade schedule. I don't think April 5/6 is reasonable for 3.12 anymore: they just released new bugfix releases that I would like to upgrade to, and we have that cache that Luca says we should disable. We should also consider extending the docker compose shutdown timeout to make the transition to h2 v2 safer. All of that involves Gerrit updates on 3.11 as well as at least one restart of the prod service, and I'm not sure we want to do that the week before the openstack release? I will work on getting changes up for the cache disablement, the new release versions, and the docker config today15:08
@fungicide:matrix.orgsounds good15:08
@clarkb:matrix.orgin my head it may be better to upgrade to 3.12 ~April 10, then maybe just do 3.13 after fungi returns in May?15:09
@arnaudm:matrix.orgdo we agree that the project id you try to upload to is the one starting with dcaa?15:10
@clarkb:matrix.orgarnaudm: I think so here is an example successful upload from the other day: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee2/openstack/ee2e3130a8e5435e9c643c48682156db/job-output.txt15:14
@clarkb:matrix.orgfungi: are there any changes or tasks I can be helpful with to get static server hosting into a properly managed state again? Or maybe we're already there? (I don't know what the backlog is remaining on that)15:15
@fungicide:matrix.orgi think we're there for now, other than deciding if we want to keep the present server split/sizes15:17
@clarkb:matrix.orgI guess the goaccess report generation is still outstanding as well and that depends on how we want to have servers split?15:17
@fungicide:matrix.orgi need to shift to figuring out what we can do to get mailman into a better place performance-wise15:17
@fungicide:matrix.orgbut yes, the split makes a problem for goaccess stats at the moment15:18
@clarkb:matrix.orgre lists I wonder if we should consider adapting the anubis change to lists?15:18
@fungicide:matrix.orgthough also the swarm has probably rendered those stats useless anyway15:18
@clarkb:matrix.organd re static split I think if its working I'm wary to change it at the moment particularly with the release happening15:18
@clarkb:matrix.orgso maybe we should think about ways to deal with things in a split setup like this?15:19
@fungicide:matrix.orgClark: anubis could be a good fit for lists, since we already proxy all that in apache15:19
@clarkb:matrix.orgyup and I think the service actually does rely on javascript unlike much of the static.o.o hosted content15:19
@clarkb:matrix.orgfungi: is that something that would be helpful for me to work on? I'm currently in weekly todo list curation mode so let me know. I think what i have proposed should be straightforward to adapt to lists either way15:23
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981917: Update Gerrit images to 3.11.10, 3.12.6, and 3.13.5 https://review.opendev.org/c/opendev/system-config/+/98191715:44
@fungicide:matrix.orgClark: yeah maybe, my to do list is a little fuller than usual with the openstack release looming15:47
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981922: Disable the Gerrit per request ref cache https://review.opendev.org/c/opendev/system-config/+/98192215:53
@clarkb:matrix.orgfungi: ok I will work on porting that to lists15:57
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981923: Increase Gerrit's stop grace period to 15 minutes https://review.opendev.org/c/opendev/system-config/+/98192315:58
@clarkb:matrix.orgThose are the three Gerrit changes I mentioned wanting to make as part of the upgrade prep process. I think we can bundle them all into a single restart of the production server ahead of the upgrade whenever we feel that is safe to do around the openstack release15:58
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Approvers group https://review.opendev.org/c/openstack/project-config/+/98192416:06
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981925: Allow OpenStack-Ansible Cores to change WIP state https://review.opendev.org/c/openstack/project-config/+/98192516:08
@clarkb:matrix.orgmnaser: thanks for the review on the anubis stuff. Also good to know that a single instance can work. Reading the doc you linked it looks like the backend http vhost configuration needs to do the routing to make that work though which might not play nice with the static server setup we've currently got. But something to keep in mind as doable and add to the testing pile16:34
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981932: Apply Anubis to the Mailman lists server https://review.opendev.org/c/opendev/system-config/+/98193216:53
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Approvers group https://review.opendev.org/c/openstack/project-config/+/98192416:56
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Approvers group https://review.opendev.org/c/openstack/project-config/+/98192416:58
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981925: Allow OpenStack-Ansible Cores to change WIP state https://review.opendev.org/c/openstack/project-config/+/98192517:15
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981944: Stop Venus gating https://review.opendev.org/c/openstack/project-config/+/98194417:26
@mnasiadka:matrix.orgClark: should I proceed with mirror01.iad3.openmetal removal somewhere this week - or we want to wait more?18:00
@clarkb:matrix.orgmnasiadka: I think it is ok to proceed18:01
@clarkb:matrix.orgmnasiadka: the general process order is drop the dns records, drop the host from the system-config inventory (and host var etc), then we can openstack server delete the node from bridge18:01
@mnasiadka:matrix.orgClark: that's what I thought :)18:01
@clarkb:matrix.orgI'm happy to do that last step together if we want to sanity check things before committing to deleting anything18:01
@mnasiadka:matrix.org(as per the process)18:01
@mnasiadka:matrix.orgOk, I'll raise the changes18:01
@clarkb:matrix.orgsounds good thanks!18:02
@clarkb:matrix.orgI like to do a process similar to what people describe for SQL delete-from-where: basically do a select-from-where first and make sure the expected records come back. In this case I like to openstack server show $name/uuid, then if the host looks correct replace server show with server delete.18:03
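A minimal sketch of that habit, with the show/delete pair factored into a helper so it works with any CLI. The `openstack server` usage is the real case from the discussion; the `safe_delete` helper itself is hypothetical:

```shell
#!/bin/sh
# safe_delete: "select before delete" for CLIs. Runs "<tool> show <target>"
# first; only if that succeeds (the record exists, and in practice you have
# eyeballed the show output) does it run "<tool> delete <target>".
safe_delete() {
    tool="$1"
    target="$2"
    if "$tool" show "$target"; then
        "$tool" delete "$target"
    else
        echo "refusing to delete: '$target' not found by '$tool show'" >&2
        return 1
    fi
}

# Real-world shape (commented out; subcommand folded into a wrapper):
#   os_server() { openstack server "$@"; }
#   os_server show mirror01.iad3.openmetal.opendev.org   # inspect first
#   safe_delete os_server mirror01.iad3.openmetal.opendev.org
```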
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 981956: Drop mirror01.iad3.openmetal https://review.opendev.org/c/opendev/zone-opendev.org/+/98195618:04
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 981958: Remove mirror01.iad3.openmetal https://review.opendev.org/c/opendev/system-config/+/98195818:05
@fungicide:matrix.orgi have the exact same habit, fwiw18:07
-@gerrit:opendev.org- Jason Paroly proposed: [openstack/diskimage-builder] 981961: Fix dead code in MBR module https://review.opendev.org/c/openstack/diskimage-builder/+/98196118:14
-@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981962: Retire Venus Project https://review.opendev.org/c/openstack/project-config/+/98196218:16
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981932: Apply Anubis to the Mailman lists server https://review.opendev.org/c/opendev/system-config/+/98193218:22
@clarkb:matrix.orgmnasiadka: noted a couple of extra cleanups to add to https://review.opendev.org/c/opendev/system-config/+/98195818:40
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981932: Apply Anubis to the Mailman lists server https://review.opendev.org/c/opendev/system-config/+/98193218:59
-@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 981958: Remove mirror01.iad3.openmetal https://review.opendev.org/c/opendev/system-config/+/98195820:04
@mnasiadka:matrix.org> <@clarkb:matrix.org> mnasiadka: noted a couple of extra cleanups to add to https://review.opendev.org/c/opendev/system-config/+/98195820:39
Updated the patch, I totally forgot the handler (and the host_vars)
@clarkb:matrix.orgThanks I'll rereview after lunch20:39
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 982012: Force lists to fail in order to hold a node https://review.opendev.org/c/opendev/system-config/+/98201221:11
@clarkb:matrix.orgfungi: ^ I put an autohold in place for that so we can check the anubis setup against the running held node21:12
@fungicide:matrix.orgoh thanks!21:24
@clarkb:matrix.orghttps://23.253.159.78/ is the server and I got an anubis page. But then I got a 400 error, I think possibly because mailman wants the vhost name in the request? I'll try with an /etc/hosts override. Also firefox has updated its bad-ssl-certificate warning to be far more scary21:50
@clarkb:matrix.orgyup, using an /etc/hosts override makes things work. All the lists are there (with empty archives) and I can click around, and I don't get the anubis page after the initial hash calculation21:51
@clarkb:matrix.orgfungi: I tested with firefox. Do you think we should do testing with chrom*? In any case I suspect this is deployable21:52
@clarkb:matrix.orgthanos is complementary to prometheus and runs alongside it to compact data and eventually ship it off to object storage. Looking at their docs it appears quite complicated though22:06
@clarkb:matrix.orgmimir appears to be similar. They both rely on prometheus to scrape the data then remote write it to the alternative storage system (not sure how you get it back out again)22:08
@clarkb:matrix.orgI guess the upside to both is they appear to work in conjunction with prometheus. So we could theoretically deploy prometheus in the naive low-retention manner that we have currently available to us, then work on the expansion to long term storage via one of these tools?22:09
@clarkb:matrix.orgaha in the case of mimir their api is compatible with the prometheus api. So I think once the data goes to mimir you just point grafana at mimir instead of prometheus as the data source for your metrics queries and visualization22:10
@clarkb:matrix.orgprometheus then is really just a bus for the metrics. Mimir seems to take over after they are collected22:10
@clarkb:matrix.orgI think we should probably not let this hold up getting something working with prometheus even if it isn't ideal. 60 days of retention is still better than the 0 we have now (public facing). Then we can look into mimir et al to supplement things later?22:14
@clarkb:matrix.orghttps://grafana.com/docs/mimir/latest/set-up/migrate/migrate-from-thanos-or-prometheus/#overview this indicates it should be possible for mimir at least without losing any data (beyond what has already rolled off due to retention not being as long as desired in the first place)22:16
@clarkb:matrix.orghrm, thanos supports downsampling and compression where mimir does not appear to, so thanos may be the more technically capable option22:17
@clarkb:matrix.orgeither way it does seem to complicate the setup a bit. But seems doable22:19
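The staged approach discussed above could look roughly like this config sketch: run Prometheus with short local retention, and later bolt on long-term storage by adding a `remote_write` stanza pointing at Mimir (whose ingestion endpoint speaks Prometheus remote write), then repoint Grafana's data source at Mimir's Prometheus-compatible query API. The retention value, scrape target, and Mimir URL here are all hypothetical placeholders:

```yaml
# prometheus.yml sketch; retention itself is set by a startup flag, e.g.:
#   prometheus --storage.tsdb.retention.time=60d
scrape_configs:
  - job_name: node            # hypothetical scrape target
    static_configs:
      - targets: ['localhost:9100']

# Added later, once a long-term store exists; the URL is a placeholder.
remote_write:
  - url: http://mimir.example.org/api/v1/push
```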
@fungicide:matrix.orgClark: if it's working with ff it's likely fine, but i can try it out tomorrow with other browsers i have on hand too22:31
@fungicide:matrix.organd yeah, you need working hostname resolution, it's essentially using the host header to determine which site's data you want to access22:32
@clarkb:matrix.orghttps://thanos.io/v41.0/thanos/quick-tutorial.md/ this seems to cover the various thanos services reasonably well. We would need to run the sidecar with prometheus, then the querier and storage gateway. Compaction can be done periodically. But again seems like we can add this on and they discuss how you can reduce your prometheus retention as thanos takes on the long term storage22:33
@clarkb:matrix.orgfungi: yup I figured that was the issue since I know we are vhosting in mailman itself22:33
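For reference, the /etc/hosts override being discussed is just a line mapping a list site's hostname to the held node's IP, so the browser sends the Host header that mailman's vhost routing expects. The hostname below is one of the real list sites; adjust it per site being tested:

```
# /etc/hosts on the machine running the browser
23.253.159.78   lists.opendev.org
```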

Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!