| @mnasiadka:matrix.org | Hmm, I just had three POST_FAILURE in Kolla gate, any idea where to look to get more insight than no logs at all? (e.g. https://zuul.opendev.org/t/openstack/build/2f8944418bd248e1814b96ae0635243b) | 13:44 |
|---|---|---|
| @fungicide:matrix.org | mnasiadka: i usually grep the build id against the zuul-executor debug logs on ze01-ze12 to find where it ran and narrow down the exception | 13:48 |
| @mnasiadka:matrix.org | fungi: I was thinking to do that, but thought maybe there's some centralised log management - I'll have a look, thanks :) | 13:49 |
| @fungicide:matrix.org | there's not (yet anyway) | 13:49 |
| @fungicide:matrix.org | in theory you can look in the scheduler logs to identify which executor ran the build, but it's easier for me to just check all 12 of them in a scripted loop | 13:50 |
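A minimal sketch of the scripted loop fungi describes, using the build id from the failed Kolla job linked above; the executor hostnames and the debug log path are assumptions, not confirmed anywhere in this log:

```bash
# Hypothetical: assumes executors are reachable as zeNN.opendev.org and
# write debug logs to /var/log/zuul/executor-debug.log.
BUILD=2f8944418bd248e1814b96ae0635243b
for n in $(seq -w 1 12); do
  echo "=== ze${n} ==="
  ssh "ze${n}.opendev.org" "grep ${BUILD} /var/log/zuul/executor-debug.log" || true
done
```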
| @tafkamax:matrix.org | As stated previously, in my org we are trying to test out vector for all monitoring, e.g. it reads prometheus metrics, logs from file paths, syslog on port 514, and so on. https://vector.dev | 13:51 |
| @tafkamax:matrix.org | and view it all from grafana | 13:51 |
| @fungicide:matrix.org | i would mainly worry about the volume/rate of debug logs. just one of our executors generated 1.7m lines of debug logging yesterday, and that was a relatively "quiet" day for us. we not only have 12 of those, but also 8 mergers, 2 launchers, 2 schedulers (one scheduler generated almost 10m lines of debug logs yesterday)... and that's just the zuul components | 13:59 |
| @fungicide:matrix.org | a "busy" day for us could conceivably see an order of magnitude more log lines | 14:00 |
| @fungicide:matrix.org | apache worker slots seem to be exhausted on lists01, i'm working to see if i can get it back on track | 14:09 |
| @fungicide:matrix.org | it feels like the apache workers were getting into swap thrash; given how long it's taking to stop them, it seems like they're being slow to page their memory back in so it can be freed | 14:13 |
| @fungicide:matrix.org | we probably ought to think about either getting the server more memory (in-place resize if possible, i can't remember if we can do it in that region) or splitting the mailman components (hyperkitty, postorius and mailman-core can all run on separate servers if we want) | 14:16 |
| @fungicide:matrix.org | (mariadb too, of course) | 14:16 |
| @fungicide:matrix.org | since most of the new load seems to be from crawlers digging through every corner of the archives, moving hyperkitty to its own server would allow us to scale it independently without impacting the reputation of the ip address e-mails are coming from (though also we could turn lists01 into a smarthost and send/receive through it while only leaving exim there) | 14:18 |
| @harbott.osism.tech:regio.chat | seems the POST_FAILUREs are very widely spread, I'll try to look at executors now, too | 14:30 |
| @fungicide:matrix.org | thanks! | 14:30 |
| @fungicide:matrix.org | guessing one of the swift providers is having problems | 14:30 |
| @harbott.osism.tech:regio.chat | yes, failure in swift upload. but I'm not sure yet how to identify which one is affected | 14:35 |
| @harbott.osism.tech:regio.chat | if someone knows better than me, check build e622a14892134d30bcb513a613836b94 on ze07 | 14:36 |
| @harbott.osism.tech:regio.chat | ah, spotted it. `ovh_gra`. will make a patch to disable | 14:37 |
| @fungicide:matrix.org | `2026-03-24 14:21:54,663 DEBUG zuul.AnsibleJob: [e: d2fb0d9967794642a0b6554e4bd3dbc5] [build: e622a14892134d30bcb513a613836b94] name: Upload swift logs to ovh_gra` | 14:39 |
| @fungicide:matrix.org | yep, that one | 14:39 |
| @fungicide:matrix.org | standing by to fast-approve | 14:39 |
| @harbott.osism.tech:regio.chat | looks like ovh_bhs is also affected, will disable both. not sure if amorin is around and can check their end? | 14:44 |
| @fungicide:matrix.org | thanks. i don't see any incidents on https://public-cloud.status-ovhcloud.com/ yet nor e-mail to infra-root about anything related | 14:45 |
| @harbott.osism.tech:regio.chat | `Unresolved incident: [DE/SGP/RBX/YYZ/MUM][Storage] - S3 maintenance notification.` on https://public-cloud.status-ovhcloud.com/ | 14:45 |
| @fungicide:matrix.org | arnaudm looks likely to be amorin's matrix nick, if that helps highlight this | 14:46 |
| @fungicide:matrix.org | Jens Harbott: yes, that didn't mention swift nor the regions we're using, so i assumed it was unrelated | 14:47 |
| @fungicide:matrix.org | but i suppose it still could be | 14:47 |
| -@gerrit:opendev.org- Dr. Jens Harbott proposed: [opendev/base-jobs] 981907: Disable job log uploads to ovh swift https://review.opendev.org/c/opendev/base-jobs/+/981907 | 14:48 | |
| @harbott.osism.tech:regio.chat | ah, right, S3 and swift are always the same thing in my mind when providing both via ceph ;) | 14:49 |
| @fungicide:matrix.org | it's possible we may need to bypass zuul and merge that in gerrit | 14:49 |
| @arnaudm:matrix.org | hey, yes it's me :) | 14:51 |
| @arnaudm:matrix.org | swift is having an issue? | 14:51 |
| @fungicide:matrix.org | arnaudm: we're just seeing swift upload failures in bhs and gra but nothing obvious on the ovh cloud status page | 14:52 |
| @arnaudm:matrix.org | do you have any e.g. of calls that are failing, so I can ask the team? | 14:52 |
| @harbott.osism.tech:regio.chat | yes, looks like it, lots of logs uploads failing, we don't see details due to `no_log` in ansible | 14:52 |
| @fungicide:matrix.org | i think we'd have to perform manual tests to reproduce it | 14:52 |
| @arnaudm:matrix.org | I am asking the team anyway, maybe they know something | 14:57 |
| -@gerrit:opendev.org- Zuul merged on behalf of Dr. Jens Harbott: [opendev/base-jobs] 981907: Disable job log uploads to ovh swift https://review.opendev.org/c/opendev/base-jobs/+/981907 | 14:58 | |
| @harbott.osism.tech:regio.chat | seems we were lucky? or did the change fix uploads for itself already? | 14:58 |
| @harbott.osism.tech:regio.chat | fungi: maybe worth a status notice that rechecks are fine now? can you phrase something? | 14:59 |
| @fungicide:matrix.org | #status notice Recent POST_FAILURE job results with no logs were due to upload errors in one of our providers, which has been temporarily disabled now so rechecking those should be safe | 15:02 |
| @status:opendev.org | @fungicide:matrix.org: sending notice | 15:02 |
| @clarkb:matrix.org | the source code is https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py so even without logging we can probably narrow down the number of things that could be the problem | 15:03 |
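Since the upload role runs under `no_log`, one way to get a visible error is the kind of hand-run upload fungi suggests above. This is only a sketch: the `ovh-gra` clouds.yaml entry and the container name are placeholders, not the real configuration:

```bash
# Hypothetical manual reproduction of a swift upload failure.
echo "upload test $(date -u +%s)" > /tmp/upload-test.txt
openstack --os-cloud ovh-gra container create zuul-upload-test
openstack --os-cloud ovh-gra object create zuul-upload-test /tmp/upload-test.txt
```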
| -@status:opendev.org- NOTICE: Recent POST_FAILURE job results with no logs were due to upload errors in one of our providers, which has been temporarily disabled now so rechecking those should be safe | 15:05 | |
| @status:opendev.org | @fungicide:matrix.org: finished sending notice | 15:05 |
| @clarkb:matrix.org | fungi: thank you for sending the meeting agenda. Did you want to run it or should I? (I should be able to no problem) | 15:06 |
| @fungicide:matrix.org | up to you, i just figured you didn't need to worry about sending out the agenda on your day off | 15:08 |
| @clarkb:matrix.org | and then the other thing I want to bring up for people to start thinking about is the Gerrit upgrade schedule. I don't think April 5/6 is reasonable for 3.12 anymore; the reason is they just released new bugfix releases that I would like to upgrade to, and we have that cache that luca says we should disable. And we should consider extending the docker compose shutdown timeout to make the transition to h2 v2 safer. All of that involves Gerrit updates on 3.11 as well as at least one restart of the prod service, and I'm not sure we want to do that the week before the openstack release. I will work on getting changes up for the cache disablement, the new release versions, and the docker config today | 15:08 |
| @fungicide:matrix.org | sounds good | 15:08 |
| @clarkb:matrix.org | in my head it may be better to upgrade to 3.12 ~April 10, then maybe just do 3.13 after fungi returns in May? | 15:09 |
| @arnaudm:matrix.org | do we agree that the project id you're trying to upload to is the one starting with dcaa? | 15:10 |
| @clarkb:matrix.org | arnaudm: I think so here is an example successful upload from the other day: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ee2/openstack/ee2e3130a8e5435e9c643c48682156db/job-output.txt | 15:14 |
| @clarkb:matrix.org | fungi: are there any changes or tasks I can be helpful with to get static server hosting into a properly managed state again? Or maybe we're already there? (I don't know what the backlog is remaining on that) | 15:15 |
| @fungicide:matrix.org | i think we're there for now, other than deciding if we want to keep the present server split/sizes | 15:17 |
| @clarkb:matrix.org | I guess the goaccess report generation is still outstanding as well and that depends on how we want to have servers split? | 15:17 |
| @fungicide:matrix.org | i need to shift to figuring out what we can do to get mailman into a better place performance-wise | 15:17 |
| @fungicide:matrix.org | but yes, the split makes a problem for goaccess stats at the moment | 15:18 |
| @clarkb:matrix.org | re lists I wonder if we should consider adapting the anubis change to lists? | 15:18 |
| @fungicide:matrix.org | though also the swarm has probably rendered those stats useless anyway | 15:18 |
| @clarkb:matrix.org | and re the static split, I think if it's working I'm wary to change it at the moment, particularly with the release happening | 15:18 |
| @clarkb:matrix.org | so maybe we should think about ways to deal with things in a split setup like this? | 15:19 |
| @fungicide:matrix.org | Clark: anubis could be a good fit for lists, since we already proxy all that in apache | 15:19 |
| @clarkb:matrix.org | yup and I think the service actually does rely on javascript unlike much of the static.o.o hosted content | 15:19 |
| @clarkb:matrix.org | fungi: is that something that would be helpful for me to work on? I'm currently in weekly todo list curation mode so let me know. I think what i have proposed should be straightforward to adapt to lists either way | 15:23 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981917: Update Gerrit images to 3.11.10, 3.12.6, and 3.13.5 https://review.opendev.org/c/opendev/system-config/+/981917 | 15:44 | |
| @fungicide:matrix.org | Clark: yeah maybe, my to do list is a little fuller than usual with the openstack release looming | 15:47 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981922: Disable the Gerrit per request ref cache https://review.opendev.org/c/opendev/system-config/+/981922 | 15:53 | |
| @clarkb:matrix.org | fungi: ok I will work on porting that to lists | 15:57 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981923: Increase Gerrit's stop grace period to 15 minutes https://review.opendev.org/c/opendev/system-config/+/981923 | 15:58 | |
| @clarkb:matrix.org | Those are the three Gerrit changes I mentioned wanting to make as part of the upgrade prep process. I think we can bundle them all into a single restart of the production server ahead of the upgrade whenever we feel that is safe to do around the openstack release | 15:58 |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Approvers group https://review.opendev.org/c/openstack/project-config/+/981924 | 16:06 | |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981925: Allow OpenStack-Ansible Cores to change WIP state https://review.opendev.org/c/openstack/project-config/+/981925 | 16:08 | |
| @clarkb:matrix.org | mnaser: thanks for the review on the anubis stuff. Also good to know that a single instance can work. Reading the doc you linked it looks like the backend http vhost configuration needs to do the routing to make that work though which might not play nice with the static server setup we've currently got. But something to keep in mind as doable and add to the testing pile | 16:34 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981932: Apply Anubis to the Mailman lists server https://review.opendev.org/c/opendev/system-config/+/981932 | 16:53 | |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Approvers group https://review.opendev.org/c/openstack/project-config/+/981924 | 16:56 | |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981924: Introduce OpenStack-Ansible Approvers group https://review.opendev.org/c/openstack/project-config/+/981924 | 16:58 | |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981925: Allow OpenStack-Ansible Cores to change WIP state https://review.opendev.org/c/openstack/project-config/+/981925 | 17:15 | |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981944: Stop Venus gating https://review.opendev.org/c/openstack/project-config/+/981944 | 17:26 | |
| @mnasiadka:matrix.org | Clark: should I proceed with mirror01.iad3.openmetal removal somewhere this week - or do we want to wait more? | 18:00 |
| @clarkb:matrix.org | mnasiadka: I think it is ok to proceed | 18:01 |
| @clarkb:matrix.org | mnasiadka: the general process order is drop the dns records, drop the host from the system-config inventory (and host var etc), then we can openstack server delete the node from bridge | 18:01 |
| @mnasiadka:matrix.org | Clark: that's what I thought :) | 18:01 |
| @clarkb:matrix.org | I'm happy to do that last step together if we want to sanity check things before committing to deleting anything | 18:01 |
| @mnasiadka:matrix.org | (as per the process) | 18:01 |
| @mnasiadka:matrix.org | Ok, I'll raise the changes | 18:01 |
| @clarkb:matrix.org | sounds good thanks! | 18:02 |
| @clarkb:matrix.org | I like to follow a process similar to what people describe for SQL DELETE ... WHERE: first run a SELECT with the same WHERE clause and make sure the expected records come back. In this case I like to openstack server show $name/uuid, then if the host looks correct rerun with server show replaced by server delete. | 18:03 |
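Spelled out as commands, the habit looks like this; the server name matches the mirror being removed in this conversation, though the exact name registered in nova is an assumption:

```bash
# "SELECT before DELETE": inspect first, then rerun with show -> delete.
openstack server show mirror01.iad3.openmetal
# ...confirm the uuid, addresses, and image look like the right host...
openstack server delete mirror01.iad3.openmetal
```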
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 981956: Drop mirror01.iad3.openmetal https://review.opendev.org/c/opendev/zone-opendev.org/+/981956 | 18:04 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 981958: Remove mirror01.iad3.openmetal https://review.opendev.org/c/opendev/system-config/+/981958 | 18:05 | |
| @fungicide:matrix.org | i have the exact same habit, fwiw | 18:07 |
| -@gerrit:opendev.org- Jason Paroly proposed: [openstack/diskimage-builder] 981961: Fix dead code in MBR module https://review.opendev.org/c/openstack/diskimage-builder/+/981961 | 18:14 | |
| -@gerrit:opendev.org- Dmitriy Rabotyagov proposed: [openstack/project-config] 981962: Retire Venus Project https://review.opendev.org/c/openstack/project-config/+/981962 | 18:16 | |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981932: Apply Anubis to the Mailman lists server https://review.opendev.org/c/opendev/system-config/+/981932 | 18:22 | |
| @clarkb:matrix.org | mnasiadka: noted a couple of extra cleanups to add to https://review.opendev.org/c/opendev/system-config/+/981958 | 18:40 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 981932: Apply Anubis to the Mailman lists server https://review.opendev.org/c/opendev/system-config/+/981932 | 18:59 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 981958: Remove mirror01.iad3.openmetal https://review.opendev.org/c/opendev/system-config/+/981958 | 20:04 | |
| @mnasiadka:matrix.org | > <@clarkb:matrix.org> mnasiadka: noted a couple of extra cleanups to add to https://review.opendev.org/c/opendev/system-config/+/981958 | 20:39 |
| | Updated the patch, I totally forgot the handler (and the host_vars) | |
| @clarkb:matrix.org | Thanks I'll rereview after lunch | 20:39 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 982012: Force lists to fail in order to hold a node https://review.opendev.org/c/opendev/system-config/+/982012 | 21:11 | |
| @clarkb:matrix.org | fungi: ^ I put an autohold in place for that so we can check the anubis setup against the running held node | 21:12 |
| @fungicide:matrix.org | oh thanks! | 21:24 |
| @clarkb:matrix.org | https://23.253.159.78/ is the server and I got an anubis page. But then I got a 400 error, I think possibly because mailman wants the vhost name in the request? I'll try with an /etc/hosts override. Also, firefox has updated its bad-ssl-certificate warning to be far more scary | 21:50 |
| @clarkb:matrix.org | yup using an /etc/hosts override makes things work. It seems to be working too. All the lists are there (with empty archives) and I can click around and i don't get the anubis page after the initial hash calculation | 21:51 |
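For reference, the override would look something like the line below; `lists.opendev.org` is an assumption here, since the log doesn't say which vhost was tested against the held node:

```bash
# Point the production list vhost at the held node's IP for testing.
echo "23.253.159.78 lists.opendev.org" | sudo tee -a /etc/hosts
```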
| @clarkb:matrix.org | fungi: I tested with firefox. Do you think we should do testing with chrom*? In any case I suspect this is deployable | 21:52 |
| @clarkb:matrix.org | thanos is complementary to prometheus and runs alongside it to compact data and eventually ship it off to object storage. Looking at their docs it appears quite complicated though | 22:06 |
| @clarkb:matrix.org | mimir appears to be similar. They both rely on prometheus to scrape the data then remote write it to the alternative storage system (not sure how you get it back out again) | 22:08 |
| @clarkb:matrix.org | I guess the upside to both is they appear to work in conjunction with prometheus. So we could theoretically deploy prometheus in the naive low-retention manner that we have currently available to us, then work on the expansion to long term storage via one of these tools? | 22:09 |
| @clarkb:matrix.org | aha in the case of mimir their api is compatible with the prometheus api. So I think once the data goes to mimir you just point grafana at mimir instead of prometheus as the data source for your metrics queries and visualization | 22:10 |
| @clarkb:matrix.org | prometheus then is really just a bus for the metrics. Mimir seems to take over after they are collected | 22:10 |
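To illustrate the API compatibility: a PromQL query against a hypothetical mimir deployment hits the same HTTP API grafana would use against prometheus. The host, port, and `/prometheus` path prefix here are assumptions based on mimir's documented defaults, not an existing deployment:

```bash
# Same /api/v1/query endpoint prometheus exposes, served by mimir.
curl -sG 'http://mimir.example.org:9009/prometheus/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool
```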
| @clarkb:matrix.org | I think we should probably not let this hold up getting something working with prometheus even if it isn't ideal. 60 days of retention is still better than the 0 we have now (public facing). Then we can look into mimir et al to supplement things later? | 22:14 |
| @clarkb:matrix.org | https://grafana.com/docs/mimir/latest/set-up/migrate/migrate-from-thanos-or-prometheus/#overview this indicates it should be possible for mimir at least without losing any data (beyond what has already rolled off due to retention not being as long as desired in the first place) | 22:16 |
| @clarkb:matrix.org | hrm, thanos supports downsampling and compression; mimir does not appear to. So thanos may be the more technically capable of the two | 22:17 |
| @clarkb:matrix.org | either way it does seem to complicate the setup a bit. But seems doable | 22:19 |
| @fungicide:matrix.org | Clark: if it's working with ff it's likely fine, but i can try it out tomorrow with other browsers i have on hand too | 22:31 |
| @fungicide:matrix.org | and yeah, you need working hostname resolution, it's essentially using the host header to determine which site's data you want to access | 22:32 |
| @clarkb:matrix.org | https://thanos.io/v41.0/thanos/quick-tutorial.md/ this seems to cover the various thanos services reasonably well. We would need to run the sidecar with prometheus, then the querier and storage gateway. Compaction can be done periodically. But again seems like we can add this on and they discuss how you can reduce your prometheus retention as thanos takes on the long term storage | 22:33 |
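A rough sketch of those pieces as commands, each of which would run as its own long-lived service; the flags are real thanos options, but the paths, endpoints, and the `bucket.yml` objstore config file are placeholders, not a tested deployment:

```bash
# Sidecar next to prometheus: uploads TSDB blocks to object storage.
thanos sidecar --tsdb.path /var/lib/prometheus \
  --prometheus.url http://localhost:9090 \
  --objstore.config-file bucket.yml
# Storage gateway: serves historical blocks back out of the bucket.
thanos store --objstore.config-file bucket.yml
# Querier: fans queries out across the sidecar and the store gateway.
thanos query --endpoint sidecar:10901 --endpoint store:10901
# Compaction runs once and exits, so it can be cron'd periodically.
thanos compact --objstore.config-file bucket.yml
```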
| @clarkb:matrix.org | fungi: yup I figured that was the issue since I know we are vhosting in mailman itself | 22:33 |