opendevreview | Merged opendev/system-config master: Accessbot fix for running on Python 3.12 https://review.opendev.org/c/opendev/system-config/+/945657 | 00:04 |
opendevreview | Merged opendev/irc-meetings master: Move Large Scale SIG meeting one hour earlier https://review.opendev.org/c/opendev/irc-meetings/+/945633 | 07:49 |
*** sfinucan is now known as stephenfin | 10:23 |
fungi | looks like infra-prod-run-accessbot succeeded on deploy for 945657 | 12:41 |
opendevreview | Merged opendev/system-config master: docs: Switch a mailing list to default moderation https://review.opendev.org/c/opendev/system-config/+/944893 | 13:06 |
fungi | i ended up doing that ^ for another list just now | 13:07 |
fungi | https://lists.openstack.org/archives/list/legal-discuss@lists.openstack.org/message/AJFK6RDQMR27E45TV4HT6HO2D3QSCKPH/ | 13:08 |
fungi | infra-root: jamesdenton_: progress report on the replaced mirror.dfw.rax.opendev.org network performance... checking the graphs in cacti the new server instance does not seem to have its eth1 traffic constrained the way we saw impacting the old one, so we can probably restore our max-servers value there to normal now | 13:49 |
fungi | i'll push up a change for it | 13:49 |
jamesdenton_alt | thanks fungi | 13:49 |
jamesdenton_alt | still looking to investigate further on my end | 13:49 |
fungi | we haven't deleted the old server instance, merely shifted all our traffic to a replacement as of just before midnight utc | 13:50 |
jamesdenton_alt | ok cool | 13:50 |
fungi | here's what the graph looks like now: https://fungi.yuggoth.org/tmp/dfw-mirror-traffic-post-replacement.png | 13:53 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Restore max-servers in rax-dfw https://review.opendev.org/c/openstack/project-config/+/945707 | 13:58 |
corvus | wow it managed a 1Gbps spike. that's a lot different than the old one. | 14:07 |
corvus | very comparable to the other 2 regions now | 14:07 |
fungi | yep | 14:36 |
corvus | nice hunch clarkb :) | 14:42 |
opendevreview | Dan Smith proposed openstack/project-config master: Add viommu support to vexxhost ubuntu-noble https://review.opendev.org/c/openstack/project-config/+/945715 | 15:00 |
clarkb | I've approved the restoration of max-servers in dfw. rax iad's mirror is on the todo list for replacement due to its age. I'm going to start on that since I already have half the ingredients sorted out in my gnu screen mixing bowl | 15:11 |
clarkb | but first catching up on morning things. Looks like accessbot is happy again. Thank you for sorting that out fungi | 15:14 |
opendevreview | Merged openstack/project-config master: Restore max-servers in rax-dfw https://review.opendev.org/c/openstack/project-config/+/945707 | 15:18 |
opendevreview | Dan Smith proposed openstack/project-config master: Add viommu support to vexxhost ubuntu-noble https://review.opendev.org/c/openstack/project-config/+/945715 | 15:38 |
clarkb | as a heads up I have a family lunch thing today so will be out midday for a bit | 15:42 |
fungi | take all the time you need. i'll probably need to cook at some point but i don't plan to go anywhere today | 15:43 |
clarkb | also selinux, apparmor, xmonad, and grub just updated. This could be an exciting reboot | 15:49 |
clarkb | (yes tumbleweed runs both selinux and apparmor apparently) | 15:49 |
clarkb | new iad mirror is launching. And I survived a local reboot | 16:14 |
fungi | sounds like i will actually be disappearing briefly this afternoon, maybe the same time as clarkb, but it probably shouldn't be for more than 45 minutes | 16:33 |
clarkb | things seem quiet so probably not a big deal | 16:33 |
fungi | yeah, that was my reasoning as well | 16:33 |
fungi | we decided going out to grab a quick mid-afternoon meal at the pub around the corner would be less impact to day than trying to cook dinner | 16:34 |
fungi | chris is trying to work around her day full of appointments and i have a massive to-do list i'm trying to tackle | 16:35 |
slittle | https://review.opendev.org/c/starlingx/root/+/945717 seems to be stuck | 16:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add the new Noble IAD rackspace Mirror to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/945727 | 16:39 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add the new Noble IAD Rackspace Mirror to Inventory https://review.opendev.org/c/opendev/system-config/+/945728 | 16:42 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Switch rackspace iad to the new noble mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/945730 | 16:44 |
fungi | looking | 16:44 |
clarkb | slittle: fungi: its parent isn't approved | 16:45 |
fungi | slittle: you haven't approved the parent change https://review.opendev.org/c/starlingx/root/+/945710 yet | 16:45 |
clarkb | infra-root I think the series of three changes above to add the new iad mirror is ready to go | 16:48 |
clarkb | I'm happy to +A them as things update if they get reviews | 16:48 |
clarkb | then I'll push out cleanup changes when we're happy with the new system. I'm not cleaning up the new mirror in dfw because jamesdenton_alt was still looking at it | 16:49 |
jamesdenton_alt | thank you | 16:49 |
clarkb | er cleaning up the old mirror but I think people understood | 16:49 |
fungi | where "cleaning up" means `openstack server delete ...` | 16:51 |
fungi | yeah, we can hold that step off as long as needed | 16:51 |
slittle | ah, I see it | 16:56 |
opendevreview | Dan Smith proposed openstack/project-config master: Add viommu support to vexxhost ubuntu-noble https://review.opendev.org/c/openstack/project-config/+/945715 | 17:22 |
opendevreview | Merged opendev/zone-opendev.org master: Add the new Noble IAD rackspace Mirror to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/945727 | 17:29 |
clarkb | that name resolves, I'm approving the inventory update now | 17:35 |
opendevreview | Merged openstack/project-config master: Add viommu support to vexxhost ubuntu-noble https://review.opendev.org/c/openstack/project-config/+/945715 | 17:38 |
fungi | that ovh login notification just now was me, i was just checking the billing since i needed to do the same for my own personal account | 17:39 |
clarkb | ack thanks for letting us know | 17:42 |
fungi | looks like our credits are still active | 17:43 |
clarkb | the matrix eavesdrop bot uses matrix-nio and is all async which makes me suspect it is less likely to have python3.12 problems. I can approve it after lunch today if I have time | 17:48 |
opendevreview | Merged opendev/system-config master: Add the new Noble IAD Rackspace Mirror to Inventory https://review.opendev.org/c/opendev/system-config/+/945728 | 18:17 |
Clark[m] | I'm off to lunch so will have to check the new mirror afterwards | 18:50 |
fungi | i'm heading out shortly as well, but should be back fairly soon, by 20:00 utc for sure | 19:03 |
fungi | okay, out for a bit, brb | 19:17 |
fungi | been back for a few, seems like everything's still quiet | 20:24 |
clarkb | https://mirror02.iad.rax.opendev.org/ is up for me I'll approve the dns update next | 20:29 |
fungi | cool | 20:30 |
fungi | i agree, has content | 20:30 |
opendevreview | Merged opendev/zone-opendev.org master: Switch rackspace iad to the new noble mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/945730 | 20:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove the old rax iad mirror from the inventory https://review.opendev.org/c/opendev/system-config/+/945774 | 20:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Cleanup mirror01.iad.rax.opendev.org records https://review.opendev.org/c/opendev/zone-opendev.org/+/945776 | 20:40 |
clarkb | whenever we're happy with the new instance I think it is safe to merge those two changes | 20:41 |
clarkb | then I can delete the old instance | 20:41 |
clarkb | and I'm going to approve the matrix eavesdrop change now | 20:42 |
clarkb | the other thing I've got on my mind is booting a new gerrit server. I was going to do that this week but then the noble kernel issues hit. I'll probably plan to do it next week kernel update or not | 21:17 |
clarkb | I know that we have had discussion about whether or not the gerrit server should consider a location move. Every location seems to have its own downsides but the current one seems relatively minor to me (but of course I'm biased as my home isp doesn't offer ipv6 connectivity) | 21:18 |
clarkb | infra-root maybe give that a consideration and say something if you think the gerrit server shouldn't be a new server in vexxhost ymq | 21:19 |
opendevreview | Merged opendev/system-config master: Update matrix eavesdrop runtime to python3.12 https://review.opendev.org/c/opendev/system-config/+/944402 | 21:19 |
clarkb | the upsides for that location are the size of the bootable VM and general performance keeps up with our current demands. The downside is that ipv6 routing to there is not working from some networks | 21:19 |
fungi | we moved it to vexxhost because, back when we were experiencing significant unbounded memory leaks, we needed a very large flavor... do we still? | 21:20 |
clarkb | that did enqueue the eavesdrop deployment job so once that deploys I can go send a message to the zuul room and see that it gets logged | 21:20 |
fungi | even if we leave review.o.o in the same provider, should we consider smaller flavor for the replacement? | 21:21 |
clarkb | memory consumption has reduced on average but I believe that we still see it spike when there is a lot of querying going on | 21:21 |
clarkb | I'm not personally comfortable with going smaller just with what we know of gerrit historically and how it behaves | 21:22 |
clarkb | when you need memory you really need it | 21:22 |
clarkb | it is also particularly useful when doing things like reindexing | 21:22 |
fungi | fair, though i haven't seen it even come close to 50% of its available memory lately | 21:22 |
clarkb | which makes upgrades much easier and shortens downtime | 21:22 |
clarkb | we only let gerrit use up to 3/4 fwiw | 21:23 |
fungi | in part, i think, because we limit what the jvm is allowed to allocate? | 21:23 |
clarkb | then the rest we try to keep for filesystem caching and the like | 21:23 |
fungi | ah okay | 21:23 |
clarkb | yup exactly the limit is 96gb iirc and we have 128gb | 21:23 |
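(Editorial aside: the memory split quoted above can be sanity-checked with a quick sketch. The 128 GB / 96 GB figures and the "up to 3/4" ratio are taken from the conversation; the 3/4 split is this deployment's choice, not a Gerrit default.)

```python
# Sanity check of the Gerrit server memory split described above:
# JVM heap cap vs. total system RAM, with the remainder left for
# the filesystem cache. Figures are as quoted in the chat.
total_gb = 128
heap_limit_gb = 96
fs_cache_gb = total_gb - heap_limit_gb

assert heap_limit_gb / total_gb == 0.75  # "we only let gerrit use up to 3/4"
print(f"heap limit: {heap_limit_gb} GB; left for fs cache and rest: {fs_cache_gb} GB")
```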
corvus | funny you should say that; i just switched to my laptop while i wait for my desktop oom killer to make up its mind (zuul unit test run went awry) | 21:24 |
corvus | "when you need memory you need it" | 21:24 |
clarkb | I'll put it this way: we spent years fighting to squeeze gerrit into a small box and managing gerrit is much easier today for many reasons but one of which I strongly suspect is we stopped squeezing it to fit | 21:25 |
fungi | fair, with approaching 2 months uptime it's using about 1/3 ram active, 1/3 buffers+cache, and 1/3 unused | 21:25 |
clarkb | so unless there is a strong reason to downsize then I'd like to stay at the current allocation | 21:26 |
fungi | yeah, if the provider isn't asking us to scale back, i'm not overly concerned | 21:26 |
corvus | ah, it finally made a choice. my desktop session. nice. | 21:26 |
corvus | clarkb: sgtm | 21:26 |
fungi | you didn't need it anyway | 21:27 |
clarkb | https://meetings.opendev.org/irclogs/%23zuul/%23zuul.2025-03-27.log matrix-eavesdrop appears to still work after being updated | 21:28 |
clarkb | that last message from me was after the restart | 21:28 |
fungi | awesome | 21:29 |
clarkb | corvus: this way you can still check the test results maybe :) | 21:29 |
corvus | if only i had run them in screen :) | 21:31 |
clarkb | how did https://review.opendev.org/c/opendev/lodgeit/+/944410 and https://review.opendev.org/c/opendev/lodgeit/+/945135 happen | 21:31 |
clarkb | I wrote the same change twice, fun | 21:31 |
clarkb | I'm going to double check they are identical and abandon the unreviewed one. Then maybe we can proceed with upgrading lodgeit | 21:32 |
clarkb | they do appear to be identical | 21:33 |
clarkb | I abandoned 944410 and approved 945135 | 21:35 |
fungi | sgtm | 21:37 |
clarkb | and then https://review.opendev.org/c/opendev/system-config/+/945774 is ready to start cleaning up the old iad mirror whenever we feel we're ready. I'm happy to wait for tomorrow and double check cacti shows the new server handling periodic job demand before cleaning stuff up | 21:40 |
clarkb | that does seem to be a good test for the mirrors. The spike from those jobs on dfw was really noticeable | 21:41 |
fungi | agreed | 21:42 |
ianw | (btw ipv6 to ymq from .au has been generally fine for me. i do remember one blip) | 21:52 |
fungi | seems like it's been worse from western eu | 21:53 |
clarkb | ianw: I think it is frickler's isp that complains the most out of the people we've heard from. And I guess technically the prefix isn't being advertised properly but every other isp seems to work so I dunno | 21:53 |
opendevreview | Merged opendev/lodgeit master: Bump lodgeit up to python3.12 https://review.opendev.org/c/opendev/lodgeit/+/945135 | 22:02 |
clarkb | if that requires manual intervention to deploy the update I'll do that shortly (hourly jobs are enqueued now so not a rush I don't think) | 22:04 |
clarkb | https://discuss.python.org/t/upcoming-changes-in-the-pypa-wheel-project/85967/12 | 22:05 |
clarkb | oh ya that merged to lodgeit not system-config so I think it does need a manual bump to go before the daily periodics. I'll get on that shortly | 22:05 |
clarkb | 8f68b3c9a806 is the current lodgeit image that is running | 22:08 |
clarkb | Image updated and service restarted. I can still load https://paste.opendev.org/show/bVee2HZdsSTODhvEse6U/ from fungi yesterday | 22:10 |
clarkb | https://paste.opendev.org/show/bEv2tbiu2zwZCkCg3NoF/ and this new paste pasted | 22:11 |
fungi | yep, seems to be working just fine | 22:12 |
clarkb | fwiw it does appear that the service is getting crawled by yet another one of our friends. But I'm not surprised | 22:13 |
clarkb | this one is using aws ip addrs | 22:13 |
clarkb | just something to keep in mind if the service starts struggling to be responsive | 22:13 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi https://review.opendev.org/c/opendev/lodgeit/+/944805 | 22:23 |
clarkb | that needed a rebase to go on top of the 3.12 update. I'll recheck the system-config change too | 22:23 |
clarkb | the lodgeit paste granian change did pass testing. However it seems like running more than one worker might not be safe for the database: https://zuul.opendev.org/t/openstack/build/5a73ef6d31e6439fa3577f0e697fb7a3/log/paste99.opendev.org/containers/docker-lodgeit.log I guess I'll just switch to a single worker instead | 23:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run lodgeit with granian https://review.opendev.org/c/opendev/system-config/+/944806 | 23:27 |
corvus | 2025-03-27 23:27:10,370 ERROR zuul.Launcher: [e: 53eb1c7d3c4247aabc30f29959586571] [req: 4aeac8dff75d4e25b2166a684a2cd819] Error in creating the server. Compute service reports fault: Build of instance 6abcc5cf-a0af-41f6-a277-c53e0b6ca66c aborted: Failed to allocate the network(s), not rescheduling. | 23:28 |
corvus | that's on raxflex-dfw3 | 23:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run lodgeit with granian https://review.opendev.org/c/opendev/system-config/+/944806 | 23:28 |
corvus | is that something we should anticipate? or is that an expected error? | 23:28 |
clarkb | corvus: that is a new error to me. I wonder if that is due to there not being an available floating ip? | 23:29 |
clarkb | corvus: the cloud launcher creates a /24 network for the internal network which is ~252 useable addresses and we have a max server count of 32 | 23:30 |
clarkb | I doubt that the problem is our internal network which is why I suspect the problem is the floating ip attachment piece | 23:30 |
clarkb | however, depending on the lease time for the IPs in that /24 maybe we are running out of addrs. I'll try to check that | 23:31 |
corvus | https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all | 23:31 |
corvus | doesn't look like we hit the max servers in nodepool recently; and there would have been only 2 extra from zuul launcher | 23:31 |
clarkb | oh its a /20 actually not a /24 | 23:32 |
clarkb | there is no subnet pool associated to our subnet so not sure how to check dhcp details | 23:35 |
clarkb | I guess we can ssh into a server and check the lease time from there | 23:35 |
clarkb | but that doesn't show pool usage | 23:35 |
corvus | looks like that error has shown up in nodepool 5 times over the past 3 days. in dfw3 and sjc3. | 23:36 |
corvus | spot checked the most recent incident there too, and it doesn't look like heavy use at the time | 23:37 |
corvus | the retry in zuul-launcher succeeded in sjc3. so it failed over from dfw3 to sjc3. | 23:40 |
clarkb | option dhcp-lease-time 43200; | 23:41 |
clarkb | a 12 hours lease time is a bit long but we'd have to get through 4k ish nodes within that time frame to run out of addrs which seems like a lot | 23:41 |
clarkb | all that to say it seems unlikely that not having enough free internal ips is at fault. But doing subnet lists the public networks are much smaller so I'm thinking its more likely we couldn't get a floating ip | 23:42 |
corvus | we have gone through a total of 1600 notes in the almost-24h period of today utc time, across all providers. | 23:44 |
corvus | wow that was wrong | 23:44 |
corvus | we have gone through a total of 16000 nodes in the almost-24h period of today utc time, across all providers. | 23:44 |
corvus | that's more like it | 23:44 |
corvus | so i agree, it seems unlikely we would hit that limit in one flex region | 23:45 |
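(Editorial aside: the exhaustion reasoning above works out as follows. The /20 internal subnet and 12-hour DHCP lease are the figures quoted in the chat; the prefix used here is a placeholder, and the real DHCP pool may reserve a few more addresses.)

```python
import ipaddress

# Back-of-the-envelope check of the address-exhaustion theory: could
# node churn plausibly empty a /20 within one 12-hour lease window?
subnet = ipaddress.ip_network("10.0.0.0/20")  # placeholder prefix; size is what matters
usable = subnet.num_addresses - 2  # minus network and broadcast addresses
lease_hours = 12

# Exhausting the pool would require cycling through more nodes than
# there are addresses before the earliest leases expire.
print(f"usable addresses in a /20: {usable}")
print(f"node churn needed to exhaust: ~{usable / lease_hours:.0f}/hour")
```

For scale, the ~16000 nodes/day quoted above is spread across all providers, so any single flex region sees far less than the ~341 nodes/hour this would require.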
clarkb | it's possible jamesdenton_alt would be willing to check logs on the cloud side if you have the failed instance uuid | 23:47 |
clarkb | I guess the other thing to check is where that error originates within nova | 23:48 |
clarkb | that might give a clue | 23:48 |
corvus | jamesdenton_alt: `Error in creating the server. Compute service reports fault: Build of instance 6abcc5cf-a0af-41f6-a277-c53e0b6ca66c aborted: Failed to allocate the network(s), not rescheduling.` | 23:48 |
corvus | jamesdenton_alt: ^ that's an example of an error that we seem to get a few times a day in flex | 23:48 |
clarkb | _build_networks_for_instance() in nova is what raises the exception that leads to that message | 23:50 |
clarkb | and looking at that I think this is building the underlying network not the end result which attaches a floating ip | 23:51 |
clarkb | so while I still think that we likely aren't running out of IPs something else in that process of attaching the networking is going wrong maybe? | 23:51 |
clarkb | that method has built in retries inside nova too so maybe something they expect to fail occasionally | 23:52 |
clarkb | and it calls out to neutron | 23:52 |
clarkb | so ya I suspect that the real interesting info is in the neutron logs but we're isolated from that in what bubbles up to us | 23:53 |
corvus | sounds like noting the error for our operator friends is the right thing :) | 23:54 |
clarkb | ++ | 23:54 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!