| opendevreview | Merged openstack/diskimage-builder master: root password for dynamic-login made simpler https://review.opendev.org/c/openstack/diskimage-builder/+/961449 | 00:44 |
|---|---|---|
| sean-k-mooney | fungi: so the error Connection failed: [Errno 113] EHOSTUNREACH | 12:30 |
| frickler | infra-root: seems we are getting nodes that have two ethernet interfaces in raxflex, see e.g. https://zuul.opendev.org/t/openstack/build/488552424c654e2e8a4d8c5cb82f02a0/log/compute1/logs/worlddump-latest.txt | 12:30 |
| sean-k-mooney | often means arp or dns resolution is not working | 12:30 |
| frickler | this may be related to latest cloud config changes | 12:30 |
| sean-k-mooney | but not always, as it can be just a routing issue | 12:31 |
| sean-k-mooney | https://acbbb7942ca0556fa51c-bd29254d8f6365fc838eabec881efe79.ssl.cf1.rackcdn.com/openstack/488552424c654e2e8a4d8c5cb82f02a0/compute1/logs/worlddump-latest.txt | 12:31 |
| sean-k-mooney | ah you just pasted that too | 12:31 |
| frickler | sean-k-mooney: but there should only be one interface in the first place, I think that is the root cause of the issue | 12:31 |
| sean-k-mooney | so ya we have 2 routes for the same subnet | 12:31 |
| sean-k-mooney | and the second interface is listed first | 12:31 |
| sean-k-mooney | ya | 12:32 |
| sean-k-mooney | so we are expecting to use 10.0.16.0/20 dev ens4 proto kernel scope link src 10.0.16.132 | 12:32 |
| sean-k-mooney | i think, based on the ansible vars that i looked at | 12:32 |
| frickler | I'm not sure yet whether this is a new issue in zuul-launcher or caused by our config changes, will wait for someone else to take a closer look | 12:33 |
| sean-k-mooney | ack, i'll also wait for an update, but what i'm thinking is this might also be related to how linux does filtering of reply-path traffic | 12:36 |
| sean-k-mooney | we do not have world dumps from both hosts but we do have the routing info in https://acbbb7942ca0556fa51c-bd29254d8f6365fc838eabec881efe79.ssl.cf1.rackcdn.com/openstack/488552424c654e2e8a4d8c5cb82f02a0/zuul-info/zuul-info.compute1.txt and | 12:48 |
| sean-k-mooney | https://acbbb7942ca0556fa51c-bd29254d8f6365fc838eabec881efe79.ssl.cf1.rackcdn.com/openstack/488552424c654e2e8a4d8c5cb82f02a0/zuul-info/zuul-info.controller.txt | 12:48 |
| sean-k-mooney | while ens4 is the default in both cases, the first 10.0.16.0/20 route on the compute is via ens3 and it's via ens4 on the other host | 12:49 |
| mnasiadka | frickler: that might also explain networking issues I see on multi node Kolla-Ansible jobs | 12:51 |
| sean-k-mooney | so that's why i think it might be related to net.ipv4.conf.all.rp_filter | 12:52 |
| sean-k-mooney | obviously if we only expect 1 port that's also a problem | 12:52 |
| sean-k-mooney | but i think setting sysctl -w net.ipv4.conf.all.rp_filter=2 might have allowed it to work (see the rp_filter sketch after the log) | 12:53 |
| sean-k-mooney | well or 0 | 12:53 |
| sean-k-mooney | https://github.com/torvalds/linux/blob/master/Documentation/networking/ip-sysctl.rst?plain=1#L1972-L1991 | 12:56 |
| fungi | sean-k-mooney: our iptables rules may be configured to reject blocked connections with icmp "host unreachable" responses, which would account for that too, though we should be using the "administratively prohibited" code instead | 13:24 |
| fungi | but yeah, if we've suddenly grown a second network interface on those nodes with another ipv4 default route, that could cause quite a bit of chaos | 13:27 |
| fungi | https://zuul.opendev.org/t/openstack/build/488552424c654e2e8a4d8c5cb82f02a0/log/zuul-info/zuul-info.compute1.txt#44-46 | 13:29 |
| fungi | yeah, so only one default route but the gateway is on the same lan as both ens3 and ens4 interfaces | 13:30 |
| fungi | depending on whether the kernel knows to always reply from the same interface the connection came in on, there could be quite a bit of craziness | 13:32 |
| fungi | looks like ens3 is probably the interface we're connecting to, but the default route is "via ens4" so the kernel is probably replying from the ens3 ip address but with the ens4 mac, which would cause constant arp overwrites on the gateway | 13:33 |
| fungi | i think https://zuul.opendev.org/t/openstack/build/488552424c654e2e8a4d8c5cb82f02a0/log/zuul-info/inventory.yaml#196 indicates that the floating ip is bound to the ens3 interface but i'm not positive | 13:36 |
| frickler | I'm pretty convinced https://review.opendev.org/c/opendev/system-config/+/961537 is the trigger for the extraneous interface | 13:40 |
| frickler | not overriding the default, but adding another one | 13:41 |
| fungi | before that, the flex clouds were refusing to boot anything, insisting we needed to now specify a network (we still don't know what changed a few days ago that caused it to start happening) | 13:43 |
| fungi | i think none of us realized that specifying the existing network and setting it as the default interface would result in the instance getting a second interface on the same network | 13:46 |
| fungi | frickler: do you happen to know what the correct syntax would have been? | 13:46 |
| frickler | I've never used that part of clouds.yaml, it might also be specific to how the zuul-launcher is invoking the sdk | 14:03 |
| fungi | we discussed the option of setting it in zuul-launcher configuration rather than clouds.yaml, maybe that would have worked the way we expected... | 14:08 |
| clarkb | fungi: ya my take on this is sdk or the cloud is doing something wrong and we're being treated poorly by the tools :) | 14:43 |
| clarkb | well the problem originated with the cloud rejecting all boot attempts because multiple networks are present | 14:43 |
| clarkb | so we were getting 0 interfaces and failed boots as a result. That's bug 1 | 14:44 |
| clarkb | we assumed we could work around this by specifying an explicit network and listing that as the default interface. Apparently that was wrong; that is bug 2 | 14:44 |
| clarkb | in both cases I don't think we, the user, have done anything wrong. We need to track down where the tools are breaking I guess | 14:44 |
| clarkb | the first two questions I have are: do we have two fips or one? and did this work as expected when the change first landed and we've regressed in a new way or did the "fix" produce this behavior from the start? | 14:46 |
| fungi | i guess we could find a build result that ran in a flex region just after the launchers were restarted onto the config change | 14:47 |
| fungi | in order to see if there was just one interface originally | 14:48 |
| fungi | as to the multiple fip question, i guess we can just ask for details on those ports for a currently-booted instance? | 14:49 |
| clarkb | yup I think server list / server show would answer the fip count question | 14:49 |
| clarkb | also do we know if this affects all of the rax-flex regions or just one? | 14:51 |
| clarkb | we might be able to figure that out via server list/ server show too | 14:51 |
| fungi | i've only seen the one example so far | 14:51 |
| clarkb | server list seems to show it affecting dfw3 and sjc3. No instances in iad3 right now. And they only have one fip (see the port/fip sketch after the log) | 14:52 |
| clarkb | ok zuul launcher does supply label.networks to the server instance creation. I wonder if the sdk's latest release changed how it handles defaults for that, and we went from "a nullish value means autoselect" to "a null value means we supply an explicit null and the cloud can no longer automatically select" | 14:56 |
| clarkb | but then by using the clouds.yaml override we're again flipping back into some auto select behavior that has gone wrong | 14:57 |
| clarkb | hrm no we were already supplying the network value for this cloud | 14:57 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Stop supplying the network value for rax-flex https://review.opendev.org/c/opendev/zuul-providers/+/961811 | 15:01 |
| clarkb | infra-root ^ I think we can try that without restarting any services and see if it produces a better result | 15:01 |
| clarkb | I still don't understand why this is happening, but I figure that is a low cost change that is easy to revert etc to check if we get a happier state | 15:02 |
| corvus | that makes me wonder why the clouds.yaml fix changed anything. | 15:02 |
| clarkb | corvus: exactly | 15:02 |
| clarkb | I still suspect a bug in the recent sdk release changing behaviors around this stuff | 15:03 |
| clarkb | maybe the launcher network value isn't supplying sufficient info like nat_destination or default_interface and the sdk is confused? | 15:03 |
| corvus | clarkb: yesterday i was ambivalent, but now i think all this stuff should go in zuul-providers so it's easier to change. | 15:03 |
| opendevreview | Merged opendev/zuul-providers master: Stop supplying the network value for rax-flex https://review.opendev.org/c/opendev/zuul-providers/+/961811 | 15:04 |
| corvus | i +3d that change, but i kind of think the next thing we do should be to move that stuff out of clouds.yaml. | 15:04 |
| corvus | i mean, after the dust settles. | 15:04 |
| clarkb | corvus: I agree, except that it was already there and wasn't working so we need to figure out how to make it work (if this naive update does fix it) | 15:04 |
| corvus | yes, perhaps there are settings needed in clouds.yaml that we don't support in zuul-launcher. | 15:05 |
| clarkb | one upside to putting things like this in clouds.yaml is that it makes it easier to manually try to reproduce, but I think having the config in zuul-providers is likely to be less confusing in the long run if we're consistent about it, so I'm willing to sort out extra flags for openstack client when necessary | 15:05 |
| corvus | yeah. i don't feel strongly about it. just noting that it took a few hours to get the update in clouds.yaml and a few seconds in zuul-providers. :) | 15:06 |
| clarkb | SJC3 is building a handful of nodes. Not sure if those would've used the old or new config | 15:06 |
| clarkb | ++ | 15:06 |
| clarkb | np0bbc0c68f75f4 cloud uuid 67e69f3d-a259-421e-94db-d67e851a894f has one interface and one fip I think | 15:07 |
| clarkb | that is in sjc3 | 15:07 |
| clarkb | so I think this did "fix" it | 15:07 |
| clarkb | I have no idea why at this moment. The change seemed to start after last weekend's zuul-launcher restart, which would've picked up this new release of openstacksdk for the first time https://pypi.org/project/openstacksdk/4.7.1/ | 15:08 |
| clarkb | as far as we can tell the network resources in the clouds themselves have not updated in this time period, so cloud-side changes to the resources themselves don't appear to be at fault | 15:09 |
| clarkb | could be that the cloud-side api code did change in a meaningful way though | 15:09 |
| clarkb | frickler: ^ do you know what if anything openstacksdk 4.7.1 might have changed around network selection and utilization | 15:09 |
| clarkb | corvus: I wonder if we can simply change networks:\n - opendevzuul-network1 to networks:\n - name: opendevzuul-network1\n default_interface: true\n nat_destination: true in the zuul launcher config and have it pass through the same attributes as clouds.yaml? | 15:13 |
| clarkb | there might be a schema update necessary first | 15:13 |
| clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L712-L717 I think this explains why we got two interfaces | 15:14 |
| clarkb | we explicitly asked for one nic on the zuul-provider listed network, then the clouds.yaml config must imply a second interface (possibly because I set default_interface or nat_destination and either one may trigger creation of a nic?) | 15:15 |
| clarkb | still not clear why explicitly creating a nic like we did before would start failing at the beginning of the week. | 15:16 |
| clarkb | oh I see we don't supply the zuul-provider network info beyond that explicit nic list so yes it seems very likely that the nic information is incomplete (at least according to the sdk or cloud) | 15:17 |
| clarkb | openstacksdk diff between 4.7.0 and 4.7.1 looks unremarkable so now I'm back to thinking something changed in the cloud | 15:19 |
| clarkb | corvus: I'm beginning to wonder if maybe https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L714 is the source of the original issue. Specifically I'm wondering if we cached a network id value that was invalid somehow (possibly due to an api issue) | 15:25 |
| corvus | clarkb: that is cached indefinitely because the network id should never change | 15:26 |
| corvus | (so if it did, then that's certainly a very unexpected event) | 15:27 |
| corvus | though... the exception is that we clear the cache on some errors | 15:28 |
| corvus | that would handle the case where someone deleted the network and replaced it with a new network with the same name | 15:29 |
| clarkb | agreed. Part of my suspicion for that is that, reading through the network and nic related code in openstacksdk, it seems like the sdk converts the network values into nics similar to what the launcher is already doing | 15:29 |
| corvus | so.. that is expected. but since we didn't do that, then ... :) | 15:29 |
| clarkb | corvus: we manage that opendevzuul-network1 network ourselves | 15:29 |
| clarkb | ya that | 15:29 |
| fungi | heading out to run a lunch errand, shouldn't be too long but once i get back i'll start on the mirror cleanup tasks and then final prep for our 20:00 utc mailman server maintenance | 15:29 |
| clarkb | we do have ansible automatically deploy that though so maybe something went wrong there? | 15:29 |
| corvus | if you list the networks thru the api is there a timestamp? | 15:30 |
| clarkb | checking | 15:31 |
| clarkb | updated_at \| 2025-06-25T15:55:47Z for DFW3 | 15:31 |
| clarkb | that is within a minute of the SJC3 updated_at value, and IAD3 was updated on August 20 | 15:32 |
| clarkb | corvus: https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_compute.py#L980-L983 this is the area of code where I think openstacksdk is now currently doing roughly the same thing we should've been doing with launcher previously | 15:34 |
| clarkb | which is why I'm now wondering if we had a data error | 15:34 |
| corvus | if zuul got an invalid net-id and never got an openstack.exceptions.BadRequestException it would have kept using it. | 15:37 |
| corvus | https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L724 is the check for that | 15:38 |
| clarkb | If we want I think we can remove the clouds.yaml update, revert 961811 then restart the launcher and see if things just work again | 15:39 |
| corvus | i think that's worth doing. do you read the sdk code as suggesting that during the time we had both in place, we may have been using only the launcher-provided data? | 15:40 |
| corvus | because i don't think https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_compute.py#L980-L983 will have run until just now, since the launcher will have been supplying the nics arg | 15:40 |
| corvus | and if that's true, that suggests one of two possibilities: 1) your change to clouds.yaml to add the other network settings had some effect other than what we're looking at here in the create server call, or 2) the thing that actually fixed it was the restart and cache clearing. | 15:42 |
| Clark[m] | I'm having what is now becoming a ritual morning ISP packet loss problem, argh. corvus: looking at that code, _findNetwork raises Exception if the network is None. I think we may cache that nullish value? but since we raised Exception and not BadRequestException maybe we use that going forward | 15:42 |
| Clark[m] | I suspect that we want to clear the cache in that situation too | 15:42 |
| Clark[m] | The other thought is the comment there indicates that exception occurs on 400 errors. Maybe we also need to handle 500 errors? | 15:42 |
| Clark[m] | as for coinciding with the weekly restarts I guess the thought there is we could've had this issue during the startup process and if it was something persistent on the cloud side both launchers could've cached the same results | 15:44 |
| Clark[m] | (this is all still a bunch of hunches, I think we have to revert back to the config state we were in previously and restart to really start to blame this stuff. But I think the story is feasible) | 15:45 |
| corvus | Clark: the functools lru_cache will not cache the exception (see the lru_cache sketch after the log) | 15:45 |
| corvus | i agree that it may be worth expanding the cache clearing if we're seeing a new error that warrants it, but i want to be careful and not just add all 5xx errors -- i want to know it really could be related to sending bad data. because with the behavior we see from some clouds, i worry if we're too aggressive we could just nullify the caches altogether. | 15:47 |
| clarkb | looks like client.get_network calls an internal find_network method with ignore_missing=True. That ignore_missing=True parameter is what causes a None return if no result is found rather than raising an exception. That explains why we have that test in _findNetwork() | 15:51 |
| clarkb | I thought my irc connection was happier but it's still iffy... | 15:51 |
| clarkb | corvus: I'm thinking maybe we can add debug logging to _findNetwork after line 790 since in theory we're calling that almost never. Then revert back to the old config state (no networks specified in clouds.yaml and network specified in zuul-provider) then restart on that and see what we get? | 15:53 |
| clarkb | then if we run into a similar situation again that extra logging should hopefully expose what the source of the issue was? | 15:53 |
| clarkb | if we immediately start failing then we probably aren't on the right hunch. If things work then the hunch is probably a good one and we just need more info on where things are going sideways? | 15:53 |
| corvus | clarkb: yep sounds good. want me to monkeypatch that in? | 15:58 |
| corvus | oh | 15:58 |
| corvus | no yeah, i can do that | 15:58 |
| corvus | i'd monkeypatch it, then clear the caches to force it to re-run. | 15:58 |
| corvus | or do you just want to merge a change and let it go in with the restarts? | 15:59 |
| clarkb | I think the main issue is that we have to restart either way to clear out the clouds.yaml config? | 16:00 |
| clarkb | so it's a question of: do we update clouds.yaml and the zuul-provider config and restart with the debug info, or without it and then monkey patch? | 16:00 |
| clarkb | considering we're going to restart automatically in a few hours maybe we should rely on that process more so that we're not in its way (or we don't have it undo our work) | 16:01 |
| corvus | yeah, just not sure if you want to do that like right now real quick, or get both changes merged and restart (which is like... later this afternoon) | 16:01 |
| corvus | ack. if you can write those changes, i'm happy to review/approve | 16:01 |
| clarkb | I think later is fine. Things are working right now. I'll push up the two revert changes | 16:01 |
| corvus | and the debug line too pls. | 16:02 |
| corvus | my workspace is not conducive to writing that atm. :) | 16:02 |
| clarkb | will do | 16:02 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Revert "Select the network to use in raxflex" https://review.opendev.org/c/opendev/system-config/+/961815 | 16:03 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Revert "Stop supplying the network value for rax-flex" https://review.opendev.org/c/opendev/zuul-providers/+/961816 | 16:05 |
| clarkb | remote: https://review.opendev.org/c/zuul/zuul/+/961817 Add debug logging for openstack network lookups | 16:12 |
| clarkb | my opportunity for a bike ride for the next few days is basically in the next 15 minutes or so. I think I'm going to pop out nowish to take advantage of that but then should be back to help with ^ and lists stuff etc | 16:13 |
| clarkb | if anyone sees problems with those three changes feel free to push updates I don't mind | 16:14 |
| corvus | all lgtm and approved | 16:14 |
| opendevreview | Merged opendev/zuul-providers master: Revert "Stop supplying the network value for rax-flex" https://review.opendev.org/c/opendev/zuul-providers/+/961816 | 16:14 |
| opendevreview | Merged opendev/system-config master: Revert "Select the network to use in raxflex" https://review.opendev.org/c/opendev/system-config/+/961815 | 16:36 |
| fungi | back just in time to see that it's all back to a wait-and-see | 17:18 |
| Clark[m] | fungi: I can't recall if putting lists01 in the emergency file was on the plan doc but it might be a good idea to do so. Also due to a scheduling conflict I have to do the school run at ~2105 UTC | 18:13 |
| fungi | yep, it's step 1 in fact. i'll do that in a sec | 18:16 |
| fungi | planning to do a penultimate rsync starting in a little over an hour | 18:17 |
| fungi | it's in the disable list now | 18:17 |
| clarkb | corvus: my change is hitting a test error. I'm going to look into it `AttributeError: 'FakeOpenstackProviderEndpoint' object has no attribute 'provider'` | 18:34 |
| clarkb | I used the same attributes that the exception uses to generate its message but I guess we don't have it faked out? | 18:34 |
| clarkb | I've updated my zuul launcher change to fix it | 18:43 |
| fungi | infra-root: in precisely one hour i'll be starting our lists01 maintenance as described at https://etherpad.opendev.org/p/2025-09-mailman-volume-maintenance | 18:59 |
| fungi | status notice All hosted mailing lists are undergoing maintenance for the next hour: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ | 19:01 |
| fungi | at 20:00 utc i'll send something like this ^ to irc | 19:01 |
| Clark[m] | Lgtm. I'm eating lunch now so that I'm not distracted by hunger in an hour | 19:01 |
| fungi | i've cleaned up stretch and bullseye-backports from the mirror.debian volume, working on the same for mirror.debian-security now | 19:03 |
| clarkb | fwiw I thought about whether or not it is an issue to allow multiple interfaces on the same network to be attached to a node. I don't think it is as you may assign them to different namespaces or pass them through to VMs hosted by the instance | 19:14 |
| clarkb | that said I do think there is a small bug in openstacksdk: I think explicit network lists should override, not supplement, the networks provided in clouds.yaml | 19:14 |
| clarkb | the reason for this is that you can already override most clouds.yaml options by supplying explicit values (think api versions or even credentials) | 19:15 |
| clarkb | but also if you want to override and not add you have to rewrite your clouds.yaml file which seems like a pain. That said this is a minor issue and I don't think the internals of openstacksdk can currently distinguish what was passed explicitly vs via clouds.yaml right now so would require some refactoring | 19:16 |
| fungi | there is a potential problem for ipv4 routing where you may connect remotely to the machine on a different address than the one through which its default route lies | 19:17 |
| fungi | ipv6 doesn't have that problem | 19:18 |
| clarkb | fungi: I think if you're doing pass through or separate namespaces you avoid that problem though | 19:18 |
| clarkb | as each network stack is effectively decoupled from the other and they are both going to see that interface attached to their bubble as the default route | 19:19 |
| fungi | yeah, it's really just for services listening directly on the interface | 19:19 |
| corvus | clarkb: are you sure we weren't overriding clouds.yaml? https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_compute.py#L980-L983 reads like the caller wins if they supply nics, and that should have been happening | 19:19 |
| clarkb | corvus: I guess I'm not 100% certain but it's the only explanation I can come up with for why we got two nics | 19:21 |
| clarkb | corvus: one from the clouds.yaml definition and the other from the zuul-provider network list (that gets passed to openstacksdk as a nics list; see the create_server sketch after the log) | 19:21 |
| corvus | yeaah -- maybe there's something happening at another level | 19:27 |
| fungi | penultimate mailman rsync is in progress now, should finish by the top of the hour | 19:33 |
| fungi | #status notice All hosted mailing lists are undergoing maintenance for the next hour: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ | 20:00 |
| opendevstatus | fungi: sending notice | 20:00 |
| -opendevstatus- | NOTICE: All hosted mailing lists are undergoing maintenance for the next hour: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ | 20:00 |
| fungi | the irony of linking to the ml archive in that isn't lost on me | 20:00 |
| clarkb | heh | 20:01 |
| opendevstatus | fungi: finished sending notice | 20:02 |
| clarkb | looks like things are shutdown at this point | 20:05 |
| fungi | yep, final rsync is already underway | 20:05 |
| fungi | once that's done, i'll start the containers again and send a test post for the maintenance conclusion | 20:05 |
| fungi | i already have it queued up | 20:05 |
| fungi | if earlier rsyncs were any indication, this should finish around 20:25 utc | 20:07 |
| fungi | maybe sooner since the data shouldn't be changing this time | 20:07 |
| fungi | hoping i'll have the maintenance wrapped up by half-past | 20:07 |
| clarkb | ack I'm following along just holler if I can be useful | 20:08 |
| fungi | especially if the faster filesystem means quicker container startup | 20:08 |
| fungi | will do, so far this is all going to plan | 20:08 |
| fungi | done and starting | 20:16 |
| fungi | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ is loading for me now | 20:18 |
| fungi | as is https://lists.opendev.org/mailman3/lists/service-announce.lists.opendev.org/ | 20:19 |
| fungi | so that covers both hyperkitty and postorius | 20:19 |
| fungi | sending the completion e-mail | 20:19 |
| clarkb | lists.zuul-ci.org archives also load for me (just checking a different vhost for completeness) | 20:20 |
| clarkb | which list is the completion email being sent to? I'm not seeing it yet | 20:22 |
| fungi | service-announce. i'm about to start cross-referencing logs | 20:22 |
| fungi | 2025-09-19 20:19:40 1uzhZx-0003R2-Qh => service-announce@lists.opendev.org R=dnslookup T=remote_smtp H=lists.opendev.org [2001:4800:7813:516:be76:4eff:fe04:5423] X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no DN="C=UK,O=Exim Developers,CN=lists01.opendev.org" C="250 OK id=1uzha0-006Dzl-A3" | 20:25 |
| fungi | that's my end | 20:25 |
| clarkb | '2025-09-19 20:19:40 1uzha0-006Dzl-A3 <= fungi@yuggoth.org' that is in exim's mainlog file | 20:25 |
| fungi | 2025-09-19 20:25:44 1uzha0-006Dzl-A3 == service-announce@lists.opendev.org R=mailman_router T=mailman_transport defer (-54): retry time not reached for any host for 'lists.opendev.org' | 20:26 |
| clarkb | and then ^ that ya | 20:26 |
| clarkb | is it possible that exim noticed that we shut things down and is simply waiting to try deliveries again? | 20:27 |
| fungi | since exim was up and running while mailman was down, i bet it tried to deliver some spam | 20:27 |
| fungi | and yeah, it'll retry in a bit | 20:27 |
| clarkb | looking at https://www.exim.org/exim-html-current/doc/html/spec_html/ch-retry_configuration.html I'm still not quite sure what I should be looking at in the exim config to know when it is likely to retry heh | 20:30 |
| clarkb | I think we retry every 15 minutes for 2 hours then back off (see the retry-rule sketch after the log) | 20:31 |
| corvus | exim -qff if you want to process the queue | 20:32 |
| clarkb | so around 20:34 we should expect it to try again | 20:32 |
| clarkb | corvus: thanks! I suspect it will try on its own in just a minute or two at this point | 20:32 |
| clarkb | oh yup I just got the email | 20:32 |
| fungi | yeah, i'm not in any hurry, i blocked out to 21:00 in the announcement anyway | 20:32 |
| fungi | ah perfect | 20:32 |
| fungi | as did i | 20:32 |
| fungi | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/#UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V shows it now too | 20:33 |
| clarkb | now we'll have to see if the performance is better. I noticed that iowait was somewhat high around the time things were starting up but that may be residual due to all the startup actions. But also the web ui was responsive for me during that time so could also be that we still have iowait but we're processing io requests quickly enough that we don't notice as much | 20:34 |
| clarkb | time will tell | 20:34 |
| fungi | other than cleaning up the temporary /var/cache/var_lib_mailman.old directory and taking the server back out of the disable list, the maintenance is done | 20:34 |
| fungi | system load seems a bit lower, and iowait, while bursty, is not sustained at a significant percent of cpu right now | 20:35 |
| clarkb | ya that may be an indication that when we need the disk we really need it, and now we're able to get through those requests more quickly | 20:36 |
| fungi | i just went through the moderation queues for about a dozen lists discarding some spam and everything was snappy | 20:36 |
| clarkb | nice | 20:36 |
| fungi | there have been days recently where i'd tell my browser to load the moderation queue for a list, then wait 2 minutes for the page to render | 20:37 |
| fungi | then select some messages to discard, and wait a couple more minutes for it to do that | 20:38 |
| fungi | i'm going to go ahead and self-approve https://review.opendev.org/961528 to clear the mirror.openeuler volume contents | 20:42 |
| clarkb | sounds good. Is that the last cleanup of the known cleanups that we can do at this point? | 20:43 |
| fungi | i think so, other than maybe going through puppet/ceph mirrors and some of the wheel volumes | 20:44 |
| fungi | we still have bionic arm64 wheels for example | 20:44 |
| fungi | and xenial amd64 | 20:44 |
| fungi | openeuler's the biggest cleanup opportunity though at the moment | 20:46 |
| fungi | 337gb of data | 20:46 |
| fungi | i've taken lists01 back out of the emergency disable list now, but haven't terminated the screen session nor deleted the moved original data directory yet | 20:49 |
| clarkb | 337gb is not small (thats almost a whole centos 9 stream) | 20:51 |
| fungi | yeah, it's nearly 10% of our total data | 20:52 |
| clarkb | for the wheel caches/mirrors we never added noble (or centos 10 stream or rocky linux etc) and it seems to work. I think that enough of the python ecosystem caught up with needing to publish wheels that we just don't have problems there anymore. We might even be able to look into cleaning up wheels for other things too | 20:52 |
| clarkb | I know as you go back in time in terms of python versions wheels were less common though so maybe its best to let them die on the vine instead | 20:52 |
| fungi | right, i think we stop adding new wheels and clean up the old ones when we drop images/nodes for those platforms | 20:53 |
| fungi | so could clean up the wheel volumes for xenial-amd64 and bionic-arm64 but probably makes sense to batch those up with other cleanups since they're small to begin with | 20:54 |
| fungi | xenial is under 10gb and it's the largest of them | 20:54 |
| fungi | pre-noble, we didn't add wheel mirrors for jammy either | 20:56 |
| fungi | debian bullseye, ubuntu focal and centos 9 stream are the newest | 20:56 |
| fungi | oh, we have a wheel mirror volume for debian buster too | 20:57 |
| clarkb | huh we'll be able to clear out everything but centos 9 pretty soon probably (relative to how long we've had up xenial) | 20:57 |
| opendevreview | Merged opendev/system-config master: Stop updating and delete OpenEuler mirror content https://review.opendev.org/c/opendev/system-config/+/961528 | 21:11 |
| | *** dmellado9 is now known as dmellado | 22:01 |
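
On the duplicate-route and rp_filter theory from the log, the following is a minimal diagnostic sketch, assuming a Linux node whose iproute2 supports JSON output (`ip -j route`) and root access if you want to change the sysctl; the loose-filtering value 2 is the one mentioned in the discussion, everything else is illustrative.

```python
# Minimal diagnostic sketch (assumes Linux, iproute2 with JSON output, and
# root for the optional write). It flags subnets reachable via more than one
# interface, which is the situation described in the log, and prints the
# rp_filter mode (0 = no source validation, 1 = strict, 2 = loose).
import json
import subprocess
from collections import defaultdict
from pathlib import Path

def routes_by_subnet():
    out = subprocess.run(["ip", "-j", "route"], capture_output=True,
                         text=True, check=True).stdout
    by_dst = defaultdict(list)
    for r in json.loads(out):
        by_dst[r.get("dst", "default")].append(r.get("dev"))
    return by_dst

def rp_filter_mode(iface="all"):
    return int(Path(f"/proc/sys/net/ipv4/conf/{iface}/rp_filter").read_text())

if __name__ == "__main__":
    for dst, devs in routes_by_subnet().items():
        if len(devs) > 1:
            print(f"{dst} is reachable via multiple interfaces: {devs}")
    print("net.ipv4.conf.all.rp_filter =", rp_filter_mode())
    # Loosening it would be Path(...).write_text("2\n"), i.e. the same effect
    # as `sysctl -w net.ipv4.conf.all.rp_filter=2` from the log.
```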
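For the "one fip or two?" question, the server list / server show check can also be scripted with openstacksdk. A sketch, assuming a clouds.yaml entry named `rax-flex-sjc3` (an illustrative name, not necessarily the real one) and that listing ports by `device_id` and floating IPs via the network proxy behaves as documented:

```python
# Sketch: count ports and floating IPs per server via openstacksdk.
# The cloud name "rax-flex-sjc3" is an assumed, illustrative clouds.yaml entry.
import openstack

conn = openstack.connect(cloud="rax-flex-sjc3")

for server in conn.compute.servers():
    ports = list(conn.network.ports(device_id=server.id))      # NICs on the VM
    port_ids = {p.id for p in ports}
    fips = [fip for fip in conn.network.ips() if fip.port_id in port_ids]
    print(f"{server.name}: {len(ports)} port(s), {len(fips)} floating ip(s)")
```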
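The nics-versus-clouds.yaml interplay debated in the log comes down to which input drives server networking. Below is a sketch of the two call styles, assuming the shade-derived cloud layer where an explicit `nics` list is expected to win over any `networks` configured in clouds.yaml; the network name comes from the log, while the cloud, image, and flavor names are purely illustrative, and this is not a claim about exactly what zuul-launcher sends.

```python
# Sketch of two ways to request server networking through the openstacksdk
# cloud layer. Assumes an explicit nics list overrides clouds.yaml "networks".
import openstack

conn = openstack.connect(cloud="rax-flex-sjc3")   # illustrative cloud name

# Style 1: explicit nic list (roughly the launcher approach: resolve the
# label's network name to an id and pass it through).
net = conn.get_network("opendevzuul-network1")
server = conn.create_server(
    name="node-explicit-nic",
    image="debian-trixie",            # illustrative image/flavor names
    flavor="gp.0.2.2",
    nics=[{"net-id": net["id"]}],
    wait=True,
)

# Style 2: no nics at all; the sdk falls back to the networks configured in
# clouds.yaml (default_interface / nat_destination) or cloud auto-allocation.
server = conn.create_server(
    name="node-config-driven",
    image="debian-trixie",
    flavor="gp.0.2.2",
    wait=True,
)
```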
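The caching point made at 15:45 is a general property of `functools.lru_cache`: a call that raises is not remembered, so the lookup reruns next time, while a call that returns (even a stale id) is kept until `cache_clear()` runs. A self-contained toy illustration, with a dict standing in for the real network API:

```python
# Toy illustration: lru_cache keeps successful results indefinitely but never
# caches a raised exception.
import functools

FAKE_CLOUD = {}   # name -> network dict; stands in for the network API
lookups = 0

@functools.lru_cache(maxsize=None)
def find_network_id(name):
    global lookups
    lookups += 1
    net = FAKE_CLOUD.get(name)     # like find_network(..., ignore_missing=True)
    if net is None:                # returns None instead of raising
        raise Exception(f"Unable to find network {name}")
    return net["id"]

# Missing network: the exception is raised on every call, nothing is cached.
for _ in range(3):
    try:
        find_network_id("opendevzuul-network1")
    except Exception:
        pass
assert lookups == 3

# Once a result is returned it sticks, even if the cloud-side network changes.
FAKE_CLOUD["opendevzuul-network1"] = {"id": "old-uuid"}
assert find_network_id("opendevzuul-network1") == "old-uuid"
FAKE_CLOUD["opendevzuul-network1"] = {"id": "new-uuid"}
assert find_network_id("opendevzuul-network1") == "old-uuid"   # stale, cached
find_network_id.cache_clear()                                  # explicit reset
assert find_network_id("opendevzuul-network1") == "new-uuid"
```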
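The "every 15 minutes for 2 hours then back off" reading matches the stock Exim retry rule `F,2h,15m; G,16h,1h,1.5; F,4d,6h`; assuming the server uses that default (the log never confirms its actual retry configuration), the sketch below turns the rule into a schedule of offsets from the first failed delivery. As noted in the log, `exim -qff` simply forces an immediate queue run instead of waiting for the next slot.

```python
# Sketch of the retry schedule implied by the assumed default Exim rule
# "F,2h,15m; G,16h,1h,1.5; F,4d,6h": fixed 15-minute retries for the first two
# hours, then geometric backoff (1h, then *1.5 each time) until 16 hours,
# then every 6 hours until the message is 4 days old and gets bounced.
from datetime import timedelta

def retry_offsets():
    offsets, t = [], timedelta()
    while t < timedelta(hours=2):            # F,2h,15m
        t += timedelta(minutes=15)
        offsets.append(t)
    step = timedelta(hours=1)
    while t < timedelta(hours=16):           # G,16h,1h,1.5
        t += step
        offsets.append(t)
        step *= 1.5
    while t < timedelta(days=4):             # F,4d,6h
        t += timedelta(hours=6)
        offsets.append(t)
    return offsets

if __name__ == "__main__":
    for dt in retry_offsets()[:12]:
        print(dt)
```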