| opendevreview | Merged openstack/diskimage-builder master: root password for dynamic-login made simpler https://review.opendev.org/c/openstack/diskimage-builder/+/961449 | 00:44 |
|---|---|---|
| sean-k-mooney | fungi: so the error Connection failed: [Errno 113] EHOSTUNREACH | 12:30 |
| frickler | infra-root: seems we are getting nodes that have two ethernet interfaces in raxflex, see e.g. https://zuul.opendev.org/t/openstack/build/488552424c654e2e8a4d8c5cb82f02a0/log/compute1/logs/worlddump-latest.txt | 12:30 |
| sean-k-mooney | often means arp or dns resolution is not working | 12:30 |
| frickler | this may be related to latest cloud config changes | 12:30 |
| sean-k-mooney | but not always, as it can be just a routing issue | 12:31 |
| sean-k-mooney | https://acbbb7942ca0556fa51c-bd29254d8f6365fc838eabec881efe79.ssl.cf1.rackcdn.com/openstack/488552424c654e2e8a4d8c5cb82f02a0/compute1/logs/worlddump-latest.txt | 12:31 |
| sean-k-mooney | ah you just pasted that too | 12:31 |
| frickler | sean-k-mooney: but there should only be one interface in the first place, I think that is the root cause of the issue | 12:31 |
| sean-k-mooney | so ya we have 2 routes for the same subnet | 12:31 |
| sean-k-mooney | and the second interface is listed first | 12:31 |
| sean-k-mooney | ya | 12:32 |
| sean-k-mooney | so we are expecting to use 10.0.16.0/20 dev ens4 proto kernel scope link src 10.0.16.132 | 12:32 |
| sean-k-mooney | i think, based on the ansible vars that i looked at | 12:32 |
| frickler | I'm not sure yet whether this is a new issue in zuul-launcher or caused by our config changes, will wait for someone else to take a closer look | 12:33 |
| sean-k-mooney | ack, i'll also wait for an update, but what i'm thinking is this might also be related to how linux does filtering of reply-path traffic | 12:36 |
| sean-k-mooney | we do not have world dumps from both hosts but we do have the routing info in https://acbbb7942ca0556fa51c-bd29254d8f6365fc838eabec881efe79.ssl.cf1.rackcdn.com/openstack/488552424c654e2e8a4d8c5cb82f02a0/zuul-info/zuul-info.compute1.txt and | 12:48 |
| sean-k-mooney | https://acbbb7942ca0556fa51c-bd29254d8f6365fc838eabec881efe79.ssl.cf1.rackcdn.com/openstack/488552424c654e2e8a4d8c5cb82f02a0/zuul-info/zuul-info.controller.txt | 12:48 |
| sean-k-mooney | while ens4 is the default in both cases, the first 10.0.16.0/20 route on the compute is via ens3 and it's via ens4 on the other host | 12:49 |
| mnasiadka | frickler: that might also explain networking issues I see on multi node Kolla-Ansible jobs | 12:51 |
| sean-k-mooney | so that's why i think it might be related to net.ipv4.conf.all.rp_filter | 12:52 |
| sean-k-mooney | obviously if we only expect 1 port that's also a problem | 12:52 |
| sean-k-mooney | but i think setting sysctl -w net.ipv4.conf.all.rp_filter=2 might have allowed it to work (see the rp_filter sketch after the log) | 12:53 |
| sean-k-mooney | well or 0 | 12:53 |
| sean-k-mooney | https://github.com/torvalds/linux/blob/master/Documentation/networking/ip-sysctl.rst?plain=1#L1972-L1991 | 12:56 |
| fungi | sean-k-mooney: our iptables rules may be configured to reject blocked connections with icmp "host unreachable" responses, which would account for that too, though we should be using the "administratively prohibited" code instead | 13:24 |
| fungi | but yeah, if we've suddenly grown a second network interface on those nodes with another ipv4 default route, that could cause quite a bit of chaos | 13:27 |
| fungi | https://zuul.opendev.org/t/openstack/build/488552424c654e2e8a4d8c5cb82f02a0/log/zuul-info/zuul-info.compute1.txt#44-46 | 13:29 |
| fungi | yeah, so only one default route but the gateway is on the same lan as both ens3 and ens4 interfaces | 13:30 |
| fungi | depending on whether the kernel knows to always reply from the same interface the connection came in on, there could be quite a bit of craziness | 13:32 |
| fungi | looks like ens3 is probably the interface we're connecting to, but the default route is "via ens4" so the kernel is probably replying from the ens3 ip address but with the ens4 mac, which would cause constant arp overwrites on the gateway | 13:33 |
| fungi | i think https://zuul.opendev.org/t/openstack/build/488552424c654e2e8a4d8c5cb82f02a0/log/zuul-info/inventory.yaml#196 indicates that the floating ip is bound to the ens3 interface but i'm not positive | 13:36 |
| frickler | I'm pretty convinced https://review.opendev.org/c/opendev/system-config/+/961537 is the trigger for the extraneous interface | 13:40 |
| frickler | not overriding the default, but adding another one | 13:41 |
| fungi | before that, the flex clouds were refusing to boot anything, insisting we needed to now specify a network (we still don't know what changed a few days ago that caused it to start happening) | 13:43 |
| fungi | i think none of us realized that specifying the existing network and setting it as the default interface would result in the instance getting a second interface on the same network | 13:46 |
| fungi | frickler: do you happen to know what the correct syntax would have been? | 13:46 |
| frickler | I've never used that part of clouds.yaml, it might also be specific to how the zuul-launcher is invoking the sdk | 14:03 |
| fungi | we discussed the option of setting it in zuul-launcher configuration rather than clouds.yaml, maybe that would have worked the way we expected... | 14:08 |
| clarkb | fungi: ya my take on this is sdk or the cloud is doing something wrong and we're being treated poorly by the tools :) | 14:43 |
| clarkb | well the problem originated with the cloud rejecting all boot attempts because multiple networks are present | 14:43 |
| clarkb | so we were getting 0 interfaces and failed boots as a result. That's bug 1 | 14:44 |
| clarkb | we assumed we could work around this by specifying an explicit network and listing that as the default interface. Apparently that was wrong; that is bug 2 | 14:44 |
| clarkb | in both cases I don't think we, the user, have done anything wrong. We need to track down where the tools are breaking I guess | 14:44 |
| clarkb | the first two questions I have are: do we have two fips or one? and did this work as expected when the change first landed and we've regressed in a new way or did the "fix" produce this behavior from the start? | 14:46 |
| fungi | i guess we could find a build result that ran in a flex region just after the launchers were restarted onto the config change | 14:47 |
| fungi | in order to see if there was just one interface originally | 14:48 |
| fungi | as to the multiple fip question, i guess we can just ask for details on those ports for a currently-booted instance? | 14:49 |
| clarkb | yup I think server list / server show would answer the fip count question | 14:49 |
| clarkb | also do we know if this affects all of the rax-flex regions or just one? | 14:51 |
| clarkb | we might be able to figure that out via server list/ server show too | 14:51 |
| fungi | i've only seen the one example so far | 14:51 |
| clarkb | server list seems to show it affecting dfw3 and sjc3. No instances in iad3 right now. And they only have one fip (see the port/fip sketch after the log) | 14:52 |
| clarkb | ok zuul launcher does supply label.networks to the server instance creation. I wonder if the sdk's latest release changed how it handles defaults for that, and we went from "a nullish value means autoselect" to "a null value means we supply an explicit null and the cloud can no longer automatically select" | 14:56 |
| clarkb | but then by using the clouds.yaml override we're again flipping back into some auto select behavior that has gone wrong | 14:57 |
| clarkb | hrm no we were already supplying the network value for this cloud | 14:57 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Stop supplying the network value for rax-flex https://review.opendev.org/c/opendev/zuul-providers/+/961811 | 15:01 |
| clarkb | infra-root ^ I think we can try that without restarting any services and see if it produces a better result | 15:01 |
| clarkb | I still don't understand why this is happening, but I figure that is a low cost change that is easy to revert etc to check if we get a happier state | 15:02 |
| corvus | that makes me wonder why the clouds.yaml fix changed anything. | 15:02 |
| clarkb | corvus: exactly | 15:02 |
| clarkb | I still suspect a bug in the recent sdk release changing behaviors around this stuff | 15:03 |
| clarkb | maybe the launcher network value isn't supplying sufficient info like nat_destination or default_interface and the sdk is confused? | 15:03 |
| corvus | clarkb: yesterday i was ambivalent, but now i think all this stuff should go in zuul-providers so it's easier to change. | 15:03 |
| opendevreview | Merged opendev/zuul-providers master: Stop supplying the network value for rax-flex https://review.opendev.org/c/opendev/zuul-providers/+/961811 | 15:04 |
| corvus | i +3d that change, but i kind of think the next thing we do should be to move that stuff out of clouds.yaml. | 15:04 |
| corvus | i mean, after the dust settles. | 15:04 |
| clarkb | corvus: I agree, except that it was already there and wasn't working so we need to figure out how to make it work (if this naive update does fix it) | 15:04 |
| corvus | yes, perhaps there are settings needed in clouds.yaml that we don't support in zuul-launcher. | 15:05 |
| clarkb | one upside to putting things like this in clouds.yaml is that it makes it easier to manually try to reproduce, but I think having the config in zuul-providers is likely to be less confusing in the long run if we're consistent about it, so I'm willing to sort out extra flags for openstack client when necessary | 15:05 |
| corvus | yeah. i don't feel strongly about it. just noting that it took a few hours to get the update in clouds.yaml and a few seconds in zuul-providers. :) | 15:06 |
| clarkb | SJC3 is building a handful of nodes. Not sure if those would've used the old or new config | 15:06 |
| clarkb | ++ | 15:06 |
| clarkb | np0bbc0c68f75f4 cloud uuid 67e69f3d-a259-421e-94db-d67e851a894f has one interface and one fip I think | 15:07 |
| clarkb | that is in sjc3 | 15:07 |
| clarkb | so I think this did "fix" it | 15:07 |
| clarkb | I have no idea why at this moment. The change seemed to start after last weekend's zuul-launcher restart, which would've picked up this new release of openstacksdk for the first time https://pypi.org/project/openstacksdk/4.7.1/ | 15:08 |
| clarkb | as far as we can tell the network resources in the clouds themselves have not updated in this time period, so cloud-side changes to the resources themselves don't appear to be at fault | 15:09 |
| clarkb | could be that the cloud-side api code did change in a meaningful way though | 15:09 |
| clarkb | frickler: ^ do you know what if anything openstacksdk 4.7.1 might have changed around network selection and utilization | 15:09 |
| clarkb | corvus: I wonder if we can simply change networks:\n - opendevzuul-network1 to networks:\n - name: opendevzuul-network1\n default_interface: true\n nat_destination: true in the zuul launcher config and have it pass through the same attributes as clouds.yaml? | 15:13 |
| clarkb | there might be a schema update necessary first | 15:13 |
| clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L712-L717 I think this explains why we got two interfaces | 15:14 |
| clarkb | we explicitly asked for one nic on the zuul-provider listed network, then the clouds.yaml config must imply a second interface (possibly because I set default_interface or nat_destination and either one may trigger creation of a nic?) | 15:15 |
| clarkb | still not clear why explicitly creating a nic like we did before would start failing at the beginning of the week. | 15:16 |
| clarkb | oh I see we don't supply the zuul-provider network info beyond that explicit nic list so yes it seems very likely that the nic information is incomplete (at least according to the sdk or cloud) | 15:17 |
| clarkb | openstacksdk diff between 4.7.0 and 4.7.1 looks unremarkable so now I'm back to thinking something changed in the cloud | 15:19 |
| clarkb | corvus: I'm beginning to wonder if maybe https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L714 is the source of the original issue. Specifically I'm wondering if we cached a network id value that was invalid somehow (possibly due to an api issue) | 15:25 |
| corvus | clarkb: that is cached indefinitely because the network id should never change | 15:26 |
| corvus | (so if it did, then that's certainly a very unexpected event) | 15:27 |
| corvus | though... the exception is that we clear the cache on some errors | 15:28 |
| corvus | that would handle the case where someone deleted the network and replaced it with a new network with the same name | 15:29 |
| clarkb | agreed. Part of my suspicion for that is that, reading through the network and nic related code in openstacksdk, it seems like the sdk converts the network values into nics similar to what the launcher is already doing | 15:29 |
| corvus | so.. that is expected. but since we didn't do that, then ... :) | 15:29 |
| clarkb | corvus: we manage that opendevzuul-network1 network ourselves | 15:29 |
| clarkb | ya that | 15:29 |
| fungi | heading out to run a lunch errand, shouldn't be too long but once i get back i'll start on the mirror cleanup tasks and then final prep for our 20:00 utc mailman server maintenance | 15:29 |
| clarkb | we do have ansible automatically deploy that though so maybe something went wrong there? | 15:29 |
| corvus | if you list the networks thru the api is there a timestamp? | 15:30 |
| clarkb | checking | 15:31 |
| clarkb | updated_at \| 2025-06-25T15:55:47Z for DFW3 | 15:31 |
| clarkb | that is within a minute of the SJC3 updated_at value, and IAD3 was updated on August 20 | 15:32 |
| clarkb | corvus: https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_compute.py#L980-L983 this is the area of code where I think openstacksdk is now currently doing roughly the same thing we should've been doing with launcher previously | 15:34 |
| clarkb | which is why I'm now wondering if we had a data error | 15:34 |
| corvus | if zuul got an invalid net-id and never got an openstack.exceptions.BadRequestException it would have kept using it. | 15:37 |
| corvus | https://opendev.org/zuul/zuul/src/branch/master/zuul/driver/openstack/openstackendpoint.py#L724 is the check for that | 15:38 |
| clarkb | If we want I think we can remove the clouds.yaml update, revert 961811 then restart the launcher and see if things just work again | 15:39 |
| corvus | i think that's worth doing. do you read the sdk code as suggesting that during the time we had both in place, we may have been using only the launcher-provided data? | 15:40 |
| corvus | because i don't think https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_compute.py#L980-L983 will have run until just now, since the launcher will have been supplying the nics arg | 15:40 |
| corvus | and if that's true, that suggests one of two possibilities: 1) your change to clouds.yaml to add the other network settings had some effect other than what we're looking at here in the create server call, or 2) the thing that actually fixed it was the restart and cache clearing. | 15:42 |
| Clark[m] | I'm having what is now becoming a ritual morning ISP packet loss problem, argh. corvus: looking at that code, _findNetwork raises Exception if the network is None. I think we may cache that nullish value? but since we raised Exception and not BadRequestException maybe we use that going forward | 15:42 |
| Clark[m] | I suspect that we want to clear the cache in that situation too | 15:42 |
| Clark[m] | The other thought is the comment there indicates that exception occurs on 400 errors. Maybe we also need to handle 500 errors? | 15:42 |
| Clark[m] | as for coinciding with the weekly restarts I guess the thought there is we could've had this issue during the startup process and if it was something persistent on the cloud side both launchers could've cached the same results | 15:44 |
| Clark[m] | (this is all still a bunch of hunches, I think we have to revert back to the config state we were in previously and restart to really start to blame this stuff. But I think the story is feasible) | 15:45 |
| corvus | Clark: the functools lru_cache will not cache the exception (see the lru_cache sketch after the log) | 15:45 |
| corvus | i agree that it may be worth expanding the cache clearing if we're seeing a new error that warrants it, but i want to be careful and not just add all 5xx errors -- i want to know it really could be related to sending bad data. because with the behavior we see from some clouds, i worry if we're too aggressive we could just nullify the caches altogether. | 15:47 |
| clarkb | looks like client.get_network calls an internal find_network method with ignore_missing=True. That ignore_missing=True parameter is what causes a None return if no result is found rather than raising an exception. That explains why we have that test in _findNetwork() | 15:51 |
| clarkb | I thought my irc connection was happier but it's still iffy... | 15:51 |
| clarkb | corvus: I'm thinking maybe we can add debug logging to _findNetwork after line 790 since in theory we're calling that almost never. Then revert back to the old config state (no networks specified in clouds.yaml and network specified in zuul-provider) then restart on that and see what we get? | 15:53 |
| clarkb | then if we run into a similar situation again that extra logging should hopefully expose what the source of the issue was? | 15:53 |
| clarkb | if we immediately start failing then we probably aren't on the right hunch. If things work then the hunch is probably a good one and we just need more info on where things are going sideways? | 15:53 |
| corvus | clarkb: yep sounds good. want me to monkeypatch that in? | 15:58 |
| corvus | oh | 15:58 |
| corvus | no yeah, i can do that | 15:58 |
| corvus | i'd monkeypatch it, then clear the caches to force it to re-run. | 15:58 |
| corvus | or do you just want to merge a change and let it go in with the restarts? | 15:59 |
| clarkb | I think the main issue is that we have to restart either way to clear out the clouds.yaml config? | 16:00 |
| clarkb | so it's a question of: do we update clouds.yaml and the zuul-provider config and restart with the debug info, or without it and then monkey patch? | 16:00 |
| clarkb | considering we're going to restart automatically in a few hours maybe we should rely on that process more so that we're not in its way (or we don't have it undo our work) | 16:01 |
| corvus | yeah, just not sure if you want to do that like right now real quick, or get both changes merged and restart (which is like... later this afternoon) | 16:01 |
| corvus | ack. if you can write those changes, i'm happy to review/approve | 16:01 |
| clarkb | I think later is fine. Things are working right now. I'll push up the two revert changes | 16:01 |
| corvus | and the debug line too pls. | 16:02 |
| corvus | my workspace is not conducive to writing that atm. :) | 16:02 |
| clarkb | will do | 16:02 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Revert "Select the network to use in raxflex" https://review.opendev.org/c/opendev/system-config/+/961815 | 16:03 |
| opendevreview | Clark Boylan proposed opendev/zuul-providers master: Revert "Stop supplying the network value for rax-flex" https://review.opendev.org/c/opendev/zuul-providers/+/961816 | 16:05 |
| clarkb | remote: https://review.opendev.org/c/zuul/zuul/+/961817 Add debug logging for openstack network lookups | 16:12 |
| clarkb | my opportunity for a bike ride for the next few days is basically in the next 15 minutes or so. I think I'm going to pop out nowish to take advantage of that but then should be back to help with ^ and lists stuff etc | 16:13 |
| clarkb | if anyone sees problems with those three changes feel free to push updates I don't mind | 16:14 |
| corvus | all lgtm and approved | 16:14 |
| opendevreview | Merged opendev/zuul-providers master: Revert "Stop supplying the network value for rax-flex" https://review.opendev.org/c/opendev/zuul-providers/+/961816 | 16:14 |
| opendevreview | Merged opendev/system-config master: Revert "Select the network to use in raxflex" https://review.opendev.org/c/opendev/system-config/+/961815 | 16:36 |
| fungi | back just in time to see that it's all back to a wait-and-see | 17:18 |
| Clark[m] | fungi: I can't recall if putting lists01 in the emergency file was on the plan doc but it might be a good idea to do so. Also due to a scheduling conflict I have to do the school run at ~2105 UTC | 18:13 |
| fungi | yep, it's step 1 in fact. i'll do that in a sec | 18:16 |
| fungi | planning to do a penultimate rsync starting in a little over an hour | 18:17 |
| fungi | it's in the disable list now | 18:17 |
| clarkb | corvus: my change is hitting a test error. I'm going to look into it `AttributeError: 'FakeOpenstackProviderEndpoint' object has no attribute 'provider'` | 18:34 |
| clarkb | I used the same attributes that the exception uses to generate its message but I guess we don't have it faked out? | 18:34 |
| clarkb | I've updated my zuul launcher change to fix it | 18:43 |
| fungi | infra-root: in precisely one hour i'll be starting our lists01 maintenance as described at https://etherpad.opendev.org/p/2025-09-mailman-volume-maintenance | 18:59 |
| fungi | status notice All hosted mailing lists are undergoing maintenance for the next hour: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ | 19:01 |
| fungi | at 20:00 utc i'll send something like this ^ to irc | 19:01 |
| Clark[m] | Lgtm. I'm eating lunch now so that I'm not distracted by hunger in an hour | 19:01 |
| fungi | i've cleaned up stretch and bullseye-backports from the mirror.debian volume, working on the same for mirror.debian-security now | 19:03 |
| clarkb | fwiw I thought about whether or not it is an issue to allow multiple interfaces on the same network to be attached to a node. I don't think it is as you may assign them to different namespaces or pass them through to VMs hosted by the instance | 19:14 |
| clarkb | that said I do think there is a small bug in openstacksdk: I think explicit network lists should override, not supplement, the networks provided in clouds.yaml | 19:14 |
| clarkb | the reason for this is that you can already override most clouds.yaml options by supplying explicit values (think api versions or even credentials) | 19:15 |
| clarkb | but also if you want to override and not add you have to rewrite your clouds.yaml file which seems like a pain. That said this is a minor issue and I don't think the internals of openstacksdk can currently distinguish what was passed explicitly vs via clouds.yaml right now so would require some refactoring | 19:16 |
| fungi | there is a potential problem for ipv4 routing where you may connect remotely to the machine on a different address than the one through which its default route lies | 19:17 |
| fungi | ipv6 doesn't have that problem | 19:18 |
| clarkb | fungi: I think if you're doing pass through or separate namespaces you avoid that problem though | 19:18 |
| clarkb | as each network stack is effectively decoupled from the other and they are both going to see that interface attached to their bubble as the default route | 19:19 |
| fungi | yeah, it's really just for services listening directly on the interface | 19:19 |
| corvus | clarkb: are you sure we weren't overriding clouds.yaml? https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud/_compute.py#L980-L983 reads like the caller wins if they supply nics, and that should have been happening | 19:19 |
| clarkb | corvus: I guess I'm not 100% certain but it's the only explanation I can come up with for why we got two nics | 19:21 |
| clarkb | corvus: one from the clouds.yaml definition and the other from the zuul-provider network list (that gets passed to openstacksdk as a nics list; see the create_server sketch after the log) | 19:21 |
| corvus | yeaah -- maybe there's something happening at another level | 19:27 |
| fungi | penultimate mailman rsync is in progress now, should finish by the top of the hour | 19:33 |
| fungi | #status notice All hosted mailing lists are undergoing maintenance for the next hour: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ | 20:00 |
| opendevstatus | fungi: sending notice | 20:00 |
| -opendevstatus- | NOTICE: All hosted mailing lists are undergoing maintenance for the next hour: https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ | 20:00 |
| fungi | the irony of linking to the ml archive in that isn't lost on me | 20:00 |
| clarkb | heh | 20:01 |
| opendevstatus | fungi: finished sending notice | 20:02 |
| clarkb | looks like things are shutdown at this point | 20:05 |
| fungi | yep, final rsync is already underway | 20:05 |
| fungi | once that's done, i'll start the containers again and send a test post for the maintenance conclusion | 20:05 |
| fungi | i already have it queued up | 20:05 |
| fungi | if earlier rsyncs were any indication, this should finish around 20:25 utc | 20:07 |
| fungi | maybe sooner since the data shouldn't be changing this time | 20:07 |
| fungi | hoping i'll have the maintenance wrapped up by half-past | 20:07 |
| clarkb | ack I'm following along just holler if I can be useful | 20:08 |
| fungi | especially if the faster filesystem means quicker container startup | 20:08 |
| fungi | will do, so far this is all going to plan | 20:08 |
| fungi | done and starting | 20:16 |
| fungi | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/message/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/ is loading for me now | 20:18 |
| fungi | as is https://lists.opendev.org/mailman3/lists/service-announce.lists.opendev.org/ | 20:19 |
| fungi | so that covers both hyperkitty and postorius | 20:19 |
| fungi | sending the completion e-mail | 20:19 |
| clarkb | lists.zuul-ci.org archives also load for me (just checking a different vhost for completeness) | 20:20 |
| clarkb | which list is the completion email being sent to? I'm not seeing it yet | 20:22 |
| fungi | service-announce. i'm about to start cross-referencing logs | 20:22 |
| fungi | 2025-09-19 20:19:40 1uzhZx-0003R2-Qh => service-announce@lists.opendev.org R=dnslookup T=remote_smtp H=lists.opendev.org [2001:4800:7813:516:be76:4eff:fe04:5423] X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no DN="C=UK,O=Exim Developers,CN=lists01.opendev.org" C="250 OK id=1uzha0-006Dzl-A3" | 20:25 |
| fungi | that's my end | 20:25 |
| clarkb | '2025-09-19 20:19:40 1uzha0-006Dzl-A3 <= fungi@yuggoth.org' that is in exim's mainlog file | 20:25 |
| fungi | 2025-09-19 20:25:44 1uzha0-006Dzl-A3 == service-announce@lists.opendev.org R=mailman_router T=mailman_transport defer (-54): retry time not reached for any host for 'lists.opendev.org' | 20:26 |
| clarkb | and then ^ that ya | 20:26 |
| clarkb | is it possible that exim noticed that we shut things down and is simply waiting to try deliveries again? | 20:27 |
| fungi | since exim was up and running while mailman was down, i bet it tried to deliver some spam | 20:27 |
| fungi | and yeah, it'll retry in a bit | 20:27 |
| clarkb | looking at https://www.exim.org/exim-html-current/doc/html/spec_html/ch-retry_configuration.html I'm still not quite sure what I should be looking at in the exim config to know when it is likely to retry heh | 20:30 |
| clarkb | I think we retry every 15 minutes for 2 hours then back off (see the retry-rule sketch after the log) | 20:31 |
| corvus | exim -qff if you want to process the queue | 20:32 |
| clarkb | so around 20:34 we should expect it to try again | 20:32 |
| clarkb | corvus: thanks! I suspect it will try on its own in just a minute or two at this point | 20:32 |
| clarkb | oh yup I just got the email | 20:32 |
| fungi | yeah, i'm not in any hurry, i blocked out to 21:00 in the announcement anyway | 20:32 |
| fungi | ah perfect | 20:32 |
| fungi | as did i | 20:32 |
| fungi | https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V/#UTMXRWWTE5WA3IF6WS3BIEJAORI2D62V shows it now too | 20:33 |
| clarkb | now we'll have to see if the performance is better. I noticed that iowait was somewhat high around the time things were starting up but that may be residual due to all the startup actions. But also the web ui was responsive for me during that time so could also be that we still have iowait but we're processing io requests quickly enough that we don't notice as much | 20:34 |
| clarkb | time will tell | 20:34 |
| fungi | other than cleaning up the temporary /var/cache/var_lib_mailman.old directory and taking the server back out of the disable list, the maintenance is done | 20:34 |
| fungi | system load seems a bit lower, and iowait, while bursty, is not sustained at a significant percent of cpu right now | 20:35 |
| clarkb | ya that may be an indication that when we need the disk we really need it, and now we're able to get through those requests more quickly | 20:36 |
| fungi | i just went through the moderation queues for about a dozen lists discarding some spam and everything was snappy | 20:36 |
| clarkb | nice | 20:36 |
| fungi | there have been days recently where i'd tell my browser to load the moderation queue for a list, then wait 2 minutes for the page to render | 20:37 |
| fungi | then select some messages to discard, and wait a couple more minutes for it to do that | 20:38 |
| fungi | i'm going to go ahead and self-approve https://review.opendev.org/961528 to clear the mirror.openeuler volume contents | 20:42 |
| clarkb | sounds good. Is that the last cleanup of the known cleanups that we can do at this point? | 20:43 |
| fungi | i think so, other than maybe going through puppet/ceph mirrors and some of the wheel volumes | 20:44 |
| fungi | we still have bionic arm64 wheels for example | 20:44 |
| fungi | and xenial amd64 | 20:44 |
| fungi | openeuler's the biggest cleanup opportunity though at the moment | 20:46 |
| fungi | 337gb of data | 20:46 |
| fungi | i've taken lists01 back out of the emergency disable list now, but haven't terminated the screen session nor deleted the moved original data directory yet | 20:49 |
| clarkb | 337gb is not small (thats almost a whole centos 9 stream) | 20:51 |
| fungi | yeah, it's nearly 10% of our total data | 20:52 |
| clarkb | for the wheel caches/mirrors we never added noble (or centos 10 stream or rocky linux etc) and it seems to work. I think that enough of the python ecosystem caught up with needing to publish wheels that we just don't have problems there anymore. We might even be able to look into cleaning up wheels for other things too | 20:52 |
| clarkb | I know as you go back in time in terms of python versions wheels were less common though so maybe its best to let them die on the vine instead | 20:52 |
| fungi | right, i think we stop adding new wheels and clean up the old ones when we drop images/nodes for those platforms | 20:53 |
| fungi | so could clean up the wheel volumes for xenial-amd64 and bionic-arm64 but probably makes sense to batch those up with other cleanups since they're small to begin with | 20:54 |
| fungi | xenial is under 10gb and it's the largest of them | 20:54 |
| fungi | pre-noble, we didn't add wheel mirrors for jammy either | 20:56 |
| fungi | debian bullseye, ubuntu focal and centos 9 stream are the newest | 20:56 |
| fungi | oh, we have a wheel mirror volume for debian buster too | 20:57 |
| clarkb | huh we'll be able to clear out everything but centos 9 pretty soon probably (relative to how long we've had up xenial) | 20:57 |
| opendevreview | Merged opendev/system-config master: Stop updating and delete OpenEuler mirror content https://review.opendev.org/c/opendev/system-config/+/961528 | 21:11 |
| | *** dmellado9 is now known as dmellado | 22:01 |
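
On the duplicate-route and rp_filter theory from the log, the following is a minimal diagnostic sketch, assuming a Linux node whose iproute2 supports JSON output (`ip -j route`) and root access if you want to change the sysctl; the loose-filtering value 2 is the one mentioned in the discussion, everything else is illustrative.

```python
# Minimal diagnostic sketch (assumes Linux, iproute2 with JSON output, and
# root for the optional write). It flags subnets reachable via more than one
# interface, which is the situation described in the log, and prints the
# rp_filter mode (0 = no source validation, 1 = strict, 2 = loose).
import json
import subprocess
from collections import defaultdict
from pathlib import Path

def routes_by_subnet():
    out = subprocess.run(["ip", "-j", "route"], capture_output=True,
                         text=True, check=True).stdout
    by_dst = defaultdict(list)
    for r in json.loads(out):
        by_dst[r.get("dst", "default")].append(r.get("dev"))
    return by_dst

def rp_filter_mode(iface="all"):
    return int(Path(f"/proc/sys/net/ipv4/conf/{iface}/rp_filter").read_text())

if __name__ == "__main__":
    for dst, devs in routes_by_subnet().items():
        if len(devs) > 1:
            print(f"{dst} is reachable via multiple interfaces: {devs}")
    print("net.ipv4.conf.all.rp_filter =", rp_filter_mode())
    # Loosening it would be Path(...).write_text("2\n"), i.e. the same effect
    # as `sysctl -w net.ipv4.conf.all.rp_filter=2` from the log.
```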
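For the "one fip or two?" question, the server list / server show check can also be scripted with openstacksdk. A sketch, assuming a clouds.yaml entry named `rax-flex-sjc3` (an illustrative name, not necessarily the real one) and that listing ports by `device_id` and floating IPs via the network proxy behaves as documented:

```python
# Sketch: count ports and floating IPs per server via openstacksdk.
# The cloud name "rax-flex-sjc3" is an assumed, illustrative clouds.yaml entry.
import openstack

conn = openstack.connect(cloud="rax-flex-sjc3")

for server in conn.compute.servers():
    ports = list(conn.network.ports(device_id=server.id))      # NICs on the VM
    port_ids = {p.id for p in ports}
    fips = [fip for fip in conn.network.ips() if fip.port_id in port_ids]
    print(f"{server.name}: {len(ports)} port(s), {len(fips)} floating ip(s)")
```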
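The nics-versus-clouds.yaml interplay debated in the log comes down to which input drives server networking. Below is a sketch of the two call styles, assuming the shade-derived cloud layer where an explicit `nics` list is expected to win over any `networks` configured in clouds.yaml; the network name comes from the log, while the cloud, image, and flavor names are purely illustrative, and this is not a claim about exactly what zuul-launcher sends.

```python
# Sketch of two ways to request server networking through the openstacksdk
# cloud layer. Assumes an explicit nics list overrides clouds.yaml "networks".
import openstack

conn = openstack.connect(cloud="rax-flex-sjc3")   # illustrative cloud name

# Style 1: explicit nic list (roughly the launcher approach: resolve the
# label's network name to an id and pass it through).
net = conn.get_network("opendevzuul-network1")
server = conn.create_server(
    name="node-explicit-nic",
    image="debian-trixie",            # illustrative image/flavor names
    flavor="gp.0.2.2",
    nics=[{"net-id": net["id"]}],
    wait=True,
)

# Style 2: no nics at all; the sdk falls back to the networks configured in
# clouds.yaml (default_interface / nat_destination) or cloud auto-allocation.
server = conn.create_server(
    name="node-config-driven",
    image="debian-trixie",
    flavor="gp.0.2.2",
    wait=True,
)
```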
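The caching point made at 15:45 is a general property of `functools.lru_cache`: a call that raises is not remembered, so the lookup reruns next time, while a call that returns (even a stale id) is kept until `cache_clear()` runs. A self-contained toy illustration, with a dict standing in for the real network API:

```python
# Toy illustration: lru_cache keeps successful results indefinitely but never
# caches a raised exception.
import functools

FAKE_CLOUD = {}   # name -> network dict; stands in for the network API
lookups = 0

@functools.lru_cache(maxsize=None)
def find_network_id(name):
    global lookups
    lookups += 1
    net = FAKE_CLOUD.get(name)     # like find_network(..., ignore_missing=True)
    if net is None:                # returns None instead of raising
        raise Exception(f"Unable to find network {name}")
    return net["id"]

# Missing network: the exception is raised on every call, nothing is cached.
for _ in range(3):
    try:
        find_network_id("opendevzuul-network1")
    except Exception:
        pass
assert lookups == 3

# Once a result is returned it sticks, even if the cloud-side network changes.
FAKE_CLOUD["opendevzuul-network1"] = {"id": "old-uuid"}
assert find_network_id("opendevzuul-network1") == "old-uuid"
FAKE_CLOUD["opendevzuul-network1"] = {"id": "new-uuid"}
assert find_network_id("opendevzuul-network1") == "old-uuid"   # stale, cached
find_network_id.cache_clear()                                  # explicit reset
assert find_network_id("opendevzuul-network1") == "new-uuid"
```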
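The "every 15 minutes for 2 hours then back off" reading matches the stock Exim retry rule `F,2h,15m; G,16h,1h,1.5; F,4d,6h`; assuming the server uses that default (the log never confirms its actual retry configuration), the sketch below turns the rule into a schedule of offsets from the first failed delivery. As noted in the log, `exim -qff` simply forces an immediate queue run instead of waiting for the next slot.

```python
# Sketch of the retry schedule implied by the assumed default Exim rule
# "F,2h,15m; G,16h,1h,1.5; F,4d,6h": fixed 15-minute retries for the first two
# hours, then geometric backoff (1h, then *1.5 each time) until 16 hours,
# then every 6 hours until the message is 4 days old and gets bounced.
from datetime import timedelta

def retry_offsets():
    offsets, t = [], timedelta()
    while t < timedelta(hours=2):            # F,2h,15m
        t += timedelta(minutes=15)
        offsets.append(t)
    step = timedelta(hours=1)
    while t < timedelta(hours=16):           # G,16h,1h,1.5
        t += step
        offsets.append(t)
        step *= 1.5
    while t < timedelta(days=4):             # F,4d,6h
        t += timedelta(hours=6)
        offsets.append(t)
    return offsets

if __name__ == "__main__":
    for dt in retry_offsets()[:12]:
        print(dt)
```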