| clarkb | mnasiadka: I left a comment on the zuul-jobs change. I don't know that the change is wrong, but I want to make sure we've got some sort of idea of how we're getting to the end state we want (which this change may be one of potentially many steps towards) | 00:09 |
|---|---|---|
| *** | liuxie is now known as liushy | 02:34 |
| mnasiadka | clarkb: commented back | 05:17 |
| mnasiadka | frickler, tonyb: are you willing to do a second review on https://review.opendev.org/c/opendev/zuul-providers/+/967962 ? I think that should fix the iptables.service not running on rocky hosts | 05:20 |
| priteau | Good morning. We are seeing lots of timeouts in Kayobe CI since yesterday. Are there issues with mirrors? | 09:09 |
| priteau | https://zuul.opendev.org/t/openstack/build/44e5376c969d4ca08f1823515c317218/log/primary/ansible/overcloud-deploy#27957-27964 | 09:10 |
| priteau | https://zuul.opendev.org/t/openstack/build/025eed087a274169b59967b0e74dbed6/log/primary/ansible/seed-deploy-pre-upgrade#18763 | 09:10 |
| priteau | https://zuul.opendev.org/t/openstack/build/fb560ac624fe4356acb3db9b000fbc6f/log/primary/ansible/overcloud-deploy#37810-37817 | 09:10 |
| priteau | I keep seeing rax.opendev.org in all these failures, but it is not always the same region | 09:11 |
| priteau | Actually one has just failed with OVH: https://zuul.opendev.org/t/openstack/build/815abbe97dfd4730b0a9c24d6a622c14/log/primary/ansible/overcloud-deploy#27488-27495 | 09:35 |
| priteau | So maybe it isn't an infrastructure issue | 09:36 |
| mnasiadka | priteau: we don’t see that in Kolla - because we stopped using the local Apache/mod_proxy servers (which you refer to as mirrors) due to issues that we have seen after upgrading to docker-ce 29 | 09:45 |
| priteau | mnasiadka: was it the same issue? (timeouts) | 10:10 |
| mnasiadka | priteau: I think so, or some image layer fetch problems - I don’t remember in detail | 10:12 |
| priteau | mnasiadka: thanks, I will make the same change in kayobe | 10:44 |
| priteau | Looking better with the fix | 13:43 |
| fungi | priteau: mnasiadka: looks like that's a proxy to quay.io? sounds like the mirror servers are not getting responses from there fast enough | 14:29 |
| fungi | i wonder if there's something more global going on, i'm getting "no server is available to handle your request" errors from github this morning too | 14:30 |
| mnasiadka | fungi: it might also be docker-ce 29.0 or a new containerd version that is more picky or does things differently - but last time when we had these problems, disabling usage of the proxy in docker config made things work - so Kolla-Ansible has been running like that for a couple of weeks | 14:34 |
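The workaround described above amounts to removing the CI proxy from Docker's `registry-mirrors` list in `/etc/docker/daemon.json`, so pulls go straight to the upstream registry. A minimal sketch (the commented-out mirror URL is hypothetical, not the actual opendev mirror address):

```
{
  "registry-mirrors": []
}
```

With the proxy enabled, the list would instead contain an entry such as `"https://mirror.example.opendev.org:4447"`; emptying it (and restarting the Docker daemon) is what "disabling usage of the proxy" refers to here.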
| fungi | makes sense. also i think our proxy caches are not well suited for large blobs like images from container registries. there's just not enough local storage to keep enough of the content hot | 14:35 |
| fungi | all the proxies combined share around 60gb (we have htcacheclean set to delete anything more than that so the volume doesn't fill up) | 14:38 |
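For context, the ~60GB cap fungi mentions is the kind of limit typically enforced by running htcacheclean against mod_cache_disk's cache root. A hedged sketch (the cache path is an assumption, not the actual opendev configuration):

```
# mod_cache_disk fragment for the proxy vhost (path is an assumption):
CacheRoot /var/cache/apache2/proxy
CacheEnable disk /

# htcacheclean run as a daemon (-d60 = wake every 60 minutes), nicely
# (-n), deleting empty directories (-t), pruning toward a 60GB limit:
#   htcacheclean -d60 -n -t -l 60G -p /var/cache/apache2/proxy
```

With large container image blobs, a cap this size means layers are evicted quickly, which matches the observation that the cache can't keep enough content hot.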
| mnasiadka | Ah, right - that makes sense | 14:41 |
| mnasiadka | fungi: need another +2 on https://review.opendev.org/c/opendev/zuul-providers/+/967962 - willing to have a look? | 14:41 |
| fungi | mnasiadka: looks fine but what's the reason for doing it on all systemd systems except gentoo? was putting it outside that conditional check problematic? | 14:44 |
| fungi | alternatively, since you say it only seems to be happening on rocky, why not just in the conditional check for rhel derivatives? | 14:45 |
| mnasiadka | fungi: we basically get Rocky 10 with firewalld - and it seems iptables.service doesn’t get started at all (even though it’s enabled) | 14:45 |
| mnasiadka | Sure, I can update the change to do it only on RHEL derivatives | 14:45 |
| fungi | right, i understood that much from the commit message, mostly just wondering what the rationale is for the particular conditional you nested it in | 14:45 |
| fungi | i would either have expected it to be applied to all systemd systems, or to a narrower set covering the platform where you observed the problem | 14:46 |
| mnasiadka | That felt like the most reasonable place without introducing another conditional | 14:46 |
| fungi | but the spot you picked was basically "all systemd systems except gentoo" | 14:46 |
| mnasiadka | Yeah, now I see that :-) | 14:47 |
| mnasiadka | L54 sounds like all RHEL derivatives sort of | 14:47 |
| fungi | right | 14:47 |
| mnasiadka | So, should we have it for RHEL derivatives or for all? I can also make it only for Rocky if it makes any sense | 14:48 |
| fungi | i would say move it up between lines 42 and 43 and just apply it to all systemd systems, if we think it could have the potential to happen on other platforms that start installing firewalld in the future | 14:48 |
| fungi | but i could also be convinced that having it more tightly scoped would be safer | 14:49 |
| opendevreview | Michal Nasiadka proposed opendev/zuul-providers master: nodepool-base: Mask firewalld unit https://review.opendev.org/c/opendev/zuul-providers/+/967962 | 14:49 |
| mnasiadka | I don’t think firewalld is even installed on any other distro or version, and since we rely on iptables.service it makes sense to make sure it doesn’t get in the way | 14:50 |
| mnasiadka | (And systemctl doesn’t do any checks - just links firewalld.service to /dev/null) | 14:50 |
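The point about masking can be illustrated without systemd at all: masking a unit is just a symlink from the unit name to /dev/null, which is why it succeeds even on systems where the unit isn't installed. A minimal sketch using a temp directory in place of /etc/systemd/system:

```shell
# Simulate what "systemctl mask firewalld" does under the hood:
# systemd records the mask as a symlink from the unit name to
# /dev/null. A temp dir stands in for /etc/systemd/system so this
# is safe to run anywhere, with or without firewalld present.
unitdir=$(mktemp -d)
ln -sf /dev/null "$unitdir/firewalld.service"
target=$(readlink "$unitdir/firewalld.service")
echo "firewalld.service -> $target"   # prints: firewalld.service -> /dev/null
rm -rf "$unitdir"
```

Because systemd only checks the symlink at start time, no validation of the unit happens when masking, matching the observation above.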
| fungi | let's see if clarkb agrees when he wakes up | 14:51 |
| fungi | then hopefully we can get it merged | 14:51 |
| mnasiadka | Great, thanks | 14:58 |
| clarkb | I think firewalld is installable on debuntu (but ufw is default) then is default on rhel derivatives. Masking it globally seems fine as long as systemd doesn't complain when it isn't present (and apparently it doesn't) | 15:49 |
| clarkb | I've approved it | 15:49 |
| fungi | heading out briefly to grab lunch, bbiab | 16:13 |
| clarkb | I will be headed to that doctors appointment shortly as well | 16:25 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update developer.openstack.org vhost AllowOverride Rules https://review.opendev.org/c/opendev/system-config/+/969838 | 18:51 |
| opendevreview | Clark Boylan proposed openstack/project-config master: Update Jeepyb's Gerrit builds to Gerrit 3.11 https://review.opendev.org/c/openstack/project-config/+/969846 | 19:43 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update infra-prod review and manage-projects deps for new Gerrit https://review.opendev.org/c/opendev/system-config/+/969847 | 19:49 |
| clarkb | those two changes are being staged for after the gerrit upgrade and shouldn't be landed just yet. I've put them on the etherpad notes so that we don't forget though | 19:50 |
| opendevreview | Merged opendev/system-config master: Update developer.openstack.org vhost AllowOverride Rules https://review.opendev.org/c/opendev/system-config/+/969838 | 20:03 |
| clarkb | fungi: ^ that applied to apache's sites enabled content but the server was not restarted (it may have been reloaded, I'm not sure yet). Do we know if a restart is necessary to pick that up? | 20:31 |
| fungi | usually a reload should be sufficient for any vhost configuration | 20:32 |
| clarkb | ok I'll see if bridge logs indicate that a reload was performed | 20:32 |
| clarkb | RUNNING HANDLER [static : Reload apache2] changed: [static02.opendev.org] "changed": true "ActiveExitTimestamp": "Wed 2025-11-19 18:55:26 UTC" | 20:36 |
| clarkb | I believe it did reload so if we think that is sufficient we should be all set | 20:37 |
| clarkb | and redirects continue to work for me | 20:37 |
| clarkb | looking at the apache docs ExtendedStatus (in the same doc file as AllowOverride) indicates it cannot be updated with a graceful restart. But AllowOverride doesn't say anything about it so I suspect that it is fine with a reload | 20:41 |
| fungi | argh, 967962 got killed by a single arm64 image timeout | 21:19 |
| fungi | no, wait, two of them | 21:19 |
| fungi | during the same task that we previously saw being really slow, where i suspected an implicit background fsync was blocking the next write | 21:21 |
| fungi | or at least one of them did. the other timed out closer to the end of the job but that same fstab copy took ~97 minutes to complete | 21:22 |
| fungi | is there any more interest in merging https://review.opendev.org/968386 so we can start collecting dstat data? | 21:23 |
| fungi | would have been great to know if the node was steadily writing to disk during that 97 minutes of silence | 21:24 |
| clarkb | fungi: I've approved it now | 21:35 |
| clarkb | it == the dstat change | 21:35 |
| fungi | cool, thanks! | 21:35 |