Thursday, 2025-03-27

00:04 <opendevreview> Merged opendev/system-config master: Accessbot fix for running on Python 3.12  https://review.opendev.org/c/opendev/system-config/+/945657
07:49 <opendevreview> Merged opendev/irc-meetings master: Move Large Scale SIG meeting one hour earlier  https://review.opendev.org/c/opendev/irc-meetings/+/945633
10:23 *** sfinucan is now known as stephenfin
12:41 <fungi> looks like infra-prod-run-accessbot succeeded on deploy for 945657
13:06 <opendevreview> Merged opendev/system-config master: docs: Switch a mailing list to default moderation  https://review.opendev.org/c/opendev/system-config/+/944893
13:07 <fungi> i ended up doing that ^ for another list just now
13:08 <fungi> https://lists.openstack.org/archives/list/legal-discuss@lists.openstack.org/message/AJFK6RDQMR27E45TV4HT6HO2D3QSCKPH/
13:49 <fungi> infra-root: jamesdenton_: progress report on the replaced mirror.dfw.rax.opendev.org network performance... checking the graphs in cacti the new server instance does not seem to have its eth1 traffic constrained the way we saw impacting the old one, so we can probably restore our max-servers value there to normal now
13:49 <fungi> i'll push up a change for it
13:49 <jamesdenton_alt> thanks fungi
13:49 <jamesdenton_alt> still looking to investigate further on my end
13:50 <fungi> we haven't deleted the old server instance, merely shifted all our traffic to a replacement as of just before midnight utc
13:50 <jamesdenton_alt> ok cool
13:53 <fungi> here's what the graph looks like now: https://fungi.yuggoth.org/tmp/dfw-mirror-traffic-post-replacement.png
13:58 <opendevreview> Jeremy Stanley proposed openstack/project-config master: Restore max-servers in rax-dfw  https://review.opendev.org/c/openstack/project-config/+/945707
14:07 <corvus> wow it managed a 1Gbps spike.  that's a lot different than the old one.
14:07 <corvus> very comparable to the other 2 regions now
14:36 <fungi> yep
14:42 <corvus> nice hunch clarkb  :)
15:00 <opendevreview> Dan Smith proposed openstack/project-config master: Add viommu support to vexxhost ubuntu-noble  https://review.opendev.org/c/openstack/project-config/+/945715
15:11 <clarkb> I've approved the restoration of max-servers in dfw. rax iad's mirror is on the todo list for replacement due to its age. I'm going to start on that since I already have half the ingredients sorted out in my gnu screen mixing bowl
15:14 <clarkb> but first catching up on morning things. Looks like accessbot is happy again. Thank you for sorting that out fungi
15:18 <opendevreview> Merged openstack/project-config master: Restore max-servers in rax-dfw  https://review.opendev.org/c/openstack/project-config/+/945707
15:38 <opendevreview> Dan Smith proposed openstack/project-config master: Add viommu support to vexxhost ubuntu-noble  https://review.opendev.org/c/openstack/project-config/+/945715
15:42 <clarkb> as a heads up I have a family lunch thing today so will be out midday for a bit
15:43 <fungi> take all the time you need. i'll probably need to cook at some point but i don't plan to go anywhere today
15:49 <clarkb> also selinux, apparmor, xmonad, and grub just updated. This could be an exciting reboot
15:49 <clarkb> (yes tumbleweed runs both selinux and apparmor apparently)
16:14 <clarkb> new iad mirror is launching. And I survived a local reboot
16:33 <fungi> sounds like i will actually be disappearing briefly this afternoon, maybe the same time as clarkb, but it probably shouldn't be for more than 45 minutes
16:33 <clarkb> things seem quiet so probably not a big deal
16:33 <fungi> yeah, that was my reasoning as well
16:34 <fungi> we decided going out to grab a quick mid-afternoon meal at the pub around the corner would be less impact to the day than trying to cook dinner
16:35 <fungi> chris is trying to work around her day full of appointments and i have a massive to-do list i'm trying to tackle
16:38 <slittle> https://review.opendev.org/c/starlingx/root/+/945717   seems to be stuck
16:39 <opendevreview> Clark Boylan proposed opendev/zone-opendev.org master: Add the new Noble IAD rackspace Mirror to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/945727
16:42 <opendevreview> Clark Boylan proposed opendev/system-config master: Add the new Noble IAD Rackspace Mirror to Inventory  https://review.opendev.org/c/opendev/system-config/+/945728
16:44 <opendevreview> Clark Boylan proposed opendev/zone-opendev.org master: Switch rackspace iad to the new noble mirror  https://review.opendev.org/c/opendev/zone-opendev.org/+/945730
16:44 <fungi> looking
16:45 <clarkb> slittle: fungi: its parent isn't approved
16:45 <fungi> slittle: you haven't approved the parent change https://review.opendev.org/c/starlingx/root/+/945710 yet
16:48 <clarkb> infra-root I think the series of three changes above to add the new iad mirror is ready to go
16:48 <clarkb> I'm happy to +A them as things update if they get reviews
16:49 <clarkb> then I'll push out cleanup changes when we're happy with the new system. I'm not cleaning up the new mirror in dfw because jamesdenton_alt was still looking at it
16:49 <jamesdenton_alt> thank you
16:49 <clarkb> er cleaning up the old mirror but I think people understood
16:51 <fungi> where "cleaning up" means `openstack server delete ...`
16:51 <fungi> yeah, we can hold that step off as long as needed
16:56 <slittle> ah, I see it
17:22 <opendevreview> Dan Smith proposed openstack/project-config master: Add viommu support to vexxhost ubuntu-noble  https://review.opendev.org/c/openstack/project-config/+/945715
17:29 <opendevreview> Merged opendev/zone-opendev.org master: Add the new Noble IAD rackspace Mirror to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/945727
17:35 <clarkb> that name resolves. I'm approving the inventory update now
17:38 <opendevreview> Merged openstack/project-config master: Add viommu support to vexxhost ubuntu-noble  https://review.opendev.org/c/openstack/project-config/+/945715
17:39 <fungi> that ovh login notification just now was me, i was just checking the billing since i needed to do the same for my own personal account
17:42 <clarkb> ack thanks for letting us know
17:43 <fungi> looks like our credits are still active
17:48 <clarkb> the matrix eavesdrop bot uses matrix-nio and is all async which makes me suspect it is less likely to have python3.12 problems. I can approve it after lunch today if I have time
18:17 <opendevreview> Merged opendev/system-config master: Add the new Noble IAD Rackspace Mirror to Inventory  https://review.opendev.org/c/opendev/system-config/+/945728
18:50 <Clark[m]> I'm off to lunch so will have to check the new mirror afterwards
19:03 <fungi> i'm heading out shortly as well, but should be back fairly soon, by 20:00 utc for sure
19:17 <fungi> okay, out for a bit, brb
20:24 <fungi> been back for a few, seems like everything's still quiet
20:29 <clarkb> https://mirror02.iad.rax.opendev.org/ is up for me I'll approve the dns update next
20:30 <fungi> cool
20:30 <fungi> i agree, has content
20:31 <opendevreview> Merged opendev/zone-opendev.org master: Switch rackspace iad to the new noble mirror  https://review.opendev.org/c/opendev/zone-opendev.org/+/945730
20:38 <opendevreview> Clark Boylan proposed opendev/system-config master: Remove the old rax iad mirror from the inventory  https://review.opendev.org/c/opendev/system-config/+/945774
20:40 <opendevreview> Clark Boylan proposed opendev/zone-opendev.org master: Cleanup mirror01.iad.rax.opendev.org records  https://review.opendev.org/c/opendev/zone-opendev.org/+/945776
20:41 <clarkb> whenever we're happy with the new instance I think it is safe to merge those two changes
20:41 <clarkb> then I can delete the old instance
20:42 <clarkb> and I'm going to approve the matrix eavesdrop change now
21:17 <clarkb> the other thing I've got on my mind is booting a new gerrit server. I was going to do that this week but then the noble kernel issues hit. I'll probably plan to do it next week kernel update or not
21:18 <clarkb> I know that we have had discussion about whether or not the gerrit server should consider a location move. Every location seems to have its own downsides but the current one seems relatively minor to me (but of course I'm biased as my home isp doesn't offer ipv6 connectivity)
21:19 <clarkb> infra-root maybe give that a consideration and say something if you think the gerrit server shouldn't be a new server in vexxhost ymq
21:19 <opendevreview> Merged opendev/system-config master: Update matrix eavesdrop runtime to python3.12  https://review.opendev.org/c/opendev/system-config/+/944402
21:19 <clarkb> the upsides for that location are the size of the bootable VM and general performance keeps up with our current demands. The downside is that ipv6 routing to there is not working from some networks
21:20 <fungi> we moved it to vexxhost because, back when we were experiencing significant unbounded memory leaks, we needed a very large flavor... do we still?
21:20 <clarkb> that did enqueue the eavesdrop deployment job so once that deploys I can go send a message to the zuul room and see that it gets logged
21:21 <fungi> even if we leave review.o.o in the same provider, should we consider a smaller flavor for the replacement?
21:21 <clarkb> memory consumption has reduced on average but I believe that we still see it spike when there is a lot of querying going on
21:22 <clarkb> I'm not personally comfortable with going smaller just with what we know of gerrit historically and how it behaves
21:22 <clarkb> when you need memory you really need it
21:22 <clarkb> it is also particularly useful when doing things like reindexing
21:22 <fungi> fair, though i haven't seen it even come close to 50% of its available memory lately
21:22 <clarkb> which makes upgrades much easier and shortens downtime
21:23 <clarkb> we only let gerrit use up to 3/4 fwiw
21:23 <fungi> in part, i think, because we limit what the jvm is allowed to allocate?
21:23 <clarkb> then the rest we try to keep for filesystem caching and the like
21:23 <fungi> ah okay
21:23 <clarkb> yup exactly the limit is 96gb iirc and we have 128gb
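[The 3/4 split described above maps to Gerrit's JVM heap cap in gerrit.config; a sketch of what such a setting looks like, using the numbers from this conversation (the actual file on review.opendev.org is not shown in this log):

```ini
# gerrit.config sketch (the real file lives in the site's etc/ directory)
[container]
    # Cap the JVM heap at ~3/4 of the host's 128GB of RAM,
    # leaving the remainder for the kernel page cache and other processes.
    heapLimit = 96g
```
]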
21:24 <corvus> funny you should say that; i just switched to my laptop while i wait for my desktop oom killer to make up its mind (zuul unit test run went awry)
21:24 <corvus> "when you need memory you need it"
21:25 <clarkb> I'll put it this way: we spent years fighting to squeeze gerrit into a small box and managing gerrit is much easier today for many reasons but one of which I strongly suspect is we stopped squeezing it to fit
21:25 <fungi> fair, with approaching 2 months uptime it's using about 1/3 ram active, 1/3 buffers+cache, and 1/3 unused
21:26 <clarkb> so unless there is a strong reason to downsize then I'd like to stay at the current allocation
21:26 <fungi> yeah, if the provider isn't asking us to scale back, i'm not overly concerned
21:26 <corvus> ah, it finally made a choice.  my desktop session.  nice.
21:26 <corvus> clarkb: sgtm
21:27 <fungi> you didn't need it anyway
21:28 <clarkb> https://meetings.opendev.org/irclogs/%23zuul/%23zuul.2025-03-27.log matrix-eavesdrop appears to still work after being updated
21:28 <clarkb> that last message from me was after the restart
21:29 <fungi> awesome
21:29 <clarkb> corvus: this way you can still check the test results maybe :)
21:31 <corvus> if only i had run them in screen :)
21:31 <clarkb> how did https://review.opendev.org/c/opendev/lodgeit/+/944410 and https://review.opendev.org/c/opendev/lodgeit/+/945135 happen
21:31 <clarkb> I wrote the same change twice fun
21:32 <clarkb> I'm going to double check they are identical and abandon the unreviewed one. Then maybe we can proceed with upgrading lodgeit
21:33 <clarkb> they do appear to be identical
21:35 <clarkb> I abandoned 944410 and approved 945135
21:37 <fungi> sgtm
21:40 <clarkb> and then https://review.opendev.org/c/opendev/system-config/+/945774 is ready to start cleaning up the old iad mirror whenever we feel we're ready. I'm happy to wait for tomorrow and double check cacti shows the new server handling periodic job demand before cleaning stuff up
21:41 <clarkb> that does seem to be a good test for the mirrors. The spike from those jobs on dfw was really noticeable
21:42 <fungi> agreed
21:52 <ianw> (btw ipv6 to ymq from .au has been generally fine for me.  i do remember one blip)
21:53 <fungi> seems like it's been worse from western eu
21:53 <clarkb> ianw: I think it is frickler's isp that complains the most out of the people we've heard from. And I guess technically the prefix isn't being advertised properly but every other isp seems to work so I dunno
22:02 <opendevreview> Merged opendev/lodgeit master: Bump lodgeit up to python3.12  https://review.opendev.org/c/opendev/lodgeit/+/945135
22:04 <clarkb> if that requires manual intervention to deploy the update I'll do that shortly (hourly jobs are enqueued now so not a rush I don't think)
22:05 <clarkb> https://discuss.python.org/t/upcoming-changes-in-the-pypa-wheel-project/85967/12
22:05 <clarkb> oh ya that merged to lodgeit not system-config so I think it does need a manual bump to go before the daily periodics. I'll get on that shortly
22:08 <clarkb> 8f68b3c9a806 is the current lodgeit image that is running
22:10 <clarkb> Image updated and service restarted. I can still load https://paste.opendev.org/show/bVee2HZdsSTODhvEse6U/ from fungi yesterday
22:11 <clarkb> https://paste.opendev.org/show/bEv2tbiu2zwZCkCg3NoF/ and this new paste pasted
22:12 <fungi> yep, seems to be working just fine
22:13 <clarkb> fwiw it does appear that the service is getting crawled by yet another one of our friends. But I'm not surprised
22:13 <clarkb> this one is using aws ip addrs
22:13 <clarkb> just something to keep in mind if the service starts struggling to be responsive
22:23 <opendevreview> Clark Boylan proposed opendev/lodgeit master: Run lodgeit with granian instead of uwsgi  https://review.opendev.org/c/opendev/lodgeit/+/944805
22:23 <clarkb> that needed a rebase to go on top of the 3.12 update. I'll recheck the system-config change too
23:26 <clarkb> the lodgeit paste granian change did pass testing. However it seems like running more than one worker might not be safe for the database: https://zuul.opendev.org/t/openstack/build/5a73ef6d31e6439fa3577f0e697fb7a3/log/paste99.opendev.org/containers/docker-lodgeit.log I guess I'll just switch to a single worker instead
23:27 <opendevreview> Clark Boylan proposed opendev/system-config master: Run lodgeit with granian  https://review.opendev.org/c/opendev/system-config/+/944806
23:28 <corvus> 2025-03-27 23:27:10,370 ERROR zuul.Launcher: [e: 53eb1c7d3c4247aabc30f29959586571] [req: 4aeac8dff75d4e25b2166a684a2cd819] Error in creating the server. Compute service reports fault: Build of instance 6abcc5cf-a0af-41f6-a277-c53e0b6ca66c aborted: Failed to allocate the network(s), not rescheduling.
23:28 <corvus> that's on raxflex-dfw3
23:28 <opendevreview> Clark Boylan proposed opendev/system-config master: Run lodgeit with granian  https://review.opendev.org/c/opendev/system-config/+/944806
23:28 <corvus> is that something we should anticipate?  or is that an expected error?
23:29 <clarkb> corvus: that is a new error to me. I wonder if that is due to there not being an available floating ip?
23:30 <clarkb> corvus: the cloud launcher creates a /24 network for the internal network which is ~252 useable addresses and we have a max server count of 32
23:30 <clarkb> I doubt that the problem is our internal network which is why I suspect the problem is the floating ip attachment piece
23:31 <clarkb> however, depending on the lease time for the IPs in that /24 maybe we are running out of addrs. I'll try to check that
23:31 <corvus> https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&from=now-6h&to=now&timezone=utc&var-region=$__all
23:31 <corvus> doesn't look like we hit the max servers in nodepool recently; and there would have been only 2 extra from zuul launcher
23:32 <clarkb> oh its a /20 actually not a /24
23:35 <clarkb> there is no subnet pool associated to our subnet so not sure how to check dhcp details
23:35 <clarkb> I guess we can ssh into a server and check the lease time from there
23:35 <clarkb> but that doesn't show pool usage
23:36 <corvus> looks like that error has shown up in nodepool 5 times over the past 3 days.  in dfw3 and sjc3.
23:37 <corvus> spot checked the most recent incident there too, and it doesn't look like heavy use at the time
23:40 <corvus> the retry in zuul-launcher succeeded in sjc3.  so it failed over from dfw3 to sjc3.
23:41 <clarkb> option dhcp-lease-time 43200;
23:41 <clarkb> a 12 hour lease time is a bit long but we'd have to get through 4k ish nodes within that time frame to run out of addrs which seems like a lot
23:42 <clarkb> all that to say it seems unlikely that not having enough free internal ips is at fault. But doing subnet lists the public networks are much smaller so I'm thinking its more likely we couldn't get a floating ip
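[The back-of-envelope math above (a /20 internal network, a 12-hour DHCP lease) can be sanity-checked with a short Python sketch; the 10.0.0.0/20 prefix is an assumed placeholder, only the prefix length matters:

```python
import ipaddress

# Assumed placeholder prefix; only the /20 length is relevant here.
subnet = ipaddress.ip_network("10.0.0.0/20")
usable = subnet.num_addresses - 2   # minus network and broadcast addresses
lease_hours = 43200 / 3600          # from "option dhcp-lease-time 43200;"

print(usable)       # 4094
print(lease_hours)  # 12.0
# Exhausting the pool would take ~4k distinct nodes inside a single
# 12-hour lease window, far beyond a max-servers of 32 in the region.
```
]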
23:44 <corvus> we have gone through a total of 1600 notes in the almost-24h period of today utc time, across all providers.
23:44 <corvus> wow that was wrong
23:44 <corvus> we have gone through a total of 16000 nodes in the almost-24h period of today utc time, across all providers.
23:44 <corvus> that's more like it
23:45 <corvus> so i agree, it seems unlikely we would hit that limit in one flex region
23:47 <clarkb> is it possible jamesdenton_alt would be willing to check logs on the cloud side if you have the failed instance uuid
23:48 <clarkb> I guess the other thing to check is where that error originates within nova
23:48 <clarkb> that might give a clue
23:48 <corvus> jamesdenton_alt:  `Error in creating the server. Compute service reports fault: Build of instance 6abcc5cf-a0af-41f6-a277-c53e0b6ca66c aborted: Failed to allocate the network(s), not rescheduling.`
23:48 <corvus> jamesdenton_alt: ^ that's an example of an error that we seem to get a few times a day in flex
23:50 <clarkb> _build_networks_for_instance() in nova is what raises the exception that leads to that message
23:51 <clarkb> and looking at that I think this is building the underlying network not the end result which attaches a floating ip
23:51 <clarkb> so while I still think that we likely aren't running out of IPs something else in that process of attaching the networking is going wrong maybe?
23:52 <clarkb> that method has built in retries inside nova too so maybe something they expect to fail occasionally
23:52 <clarkb> and it calls out to neutron
23:53 <clarkb> so ya I suspect that the real interesting info is in the neutron logs but we're isolated from that in what bubbles up to us
23:54 <corvus> sounds like noting the error for our operator friends is the right thing :)
23:54 <clarkb> ++
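[The nova behavior traced above (retry a neutron-backed network allocation a few times, then abort the build) is a retry-then-abort loop; a simplified illustrative sketch of that pattern, not nova's actual code, with all names invented for illustration:

```python
# Simplified sketch of the retry-then-abort pattern discussed above;
# all names are invented, this is not nova's implementation.
class NetworkAllocationError(Exception):
    pass

def allocate_network(attempts_until_success):
    # Stand-in for the call out to neutron: fails a fixed number of
    # times before finally succeeding.
    state = {"calls": 0}
    def call():
        state["calls"] += 1
        if state["calls"] < attempts_until_success:
            raise NetworkAllocationError("port create failed")
        return "net-allocated"
    return call

def build_networks(call, retries=3):
    for attempt in range(1, retries + 1):
        try:
            return call()
        except NetworkAllocationError:
            if attempt == retries:
                # In nova this surfaces to the user as the server fault
                # "Failed to allocate the network(s), not rescheduling."
                raise

call = allocate_network(2)
print(build_networks(call))  # succeeds on the second attempt: net-allocated
```
]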

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!