Wednesday, 2025-11-19

opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:16
noonedeadpunkthis indeed was like what you attempted, and I never managed to work on an implementation07:34
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:35
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:37
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618707:37
mnasiadkaclarkb: looking at trixie arm64 patch - I think there are some storage iops issues again in the OSUOSL provider…12:48
opendevreviewMichal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role  https://review.opendev.org/c/zuul/zuul-jobs/+/96618713:03
dtantsurhey folks, how do I know why these builds failed? I cannot find any hints: https://zuul.opendev.org/t/openstack/build/b95e332cef70428f8aca06889906e394 https://zuul.opendev.org/t/openstack/build/9c4bd30268a9432c912d808ae6f2efc315:34
dtantsur(to make matters worse, both contain critical fixes for the images we build, so the images are still broken)15:35
dtantsurAh, so one of the playbooks timed out... which does not explain anything to me, unfortunately15:36
dtantsurBut it looks like uploading images to tarballs.o.o may be broken15:37
opendevreviewJeremy Stanley proposed opendev/system-config master: DNM: Trigger some channel log collisions  https://review.opendev.org/c/opendev/system-config/+/96770815:41
fungidtantsur: looking15:42
clarkbmnasiadka: osuosl shares a timezone with me. I suspect if there were problems in the middle of the night they may not be immediately handled. That said it's now morning here and we can either try again to see if the issue was temporary (though it appears to have occurred ~4 times already) or just directly ask osuosl to take a look15:47
clarkbRamereth[m]: ^ fyi we're seeing slow image builds in the osuosl cloud which may imply some sort of iops issue. Not sure if you're aware of anything going on (realizing it's early still)15:47
fungidtantsur: TASK [Copy files from /home/zuul/src/opendev.org/openstack/ironic-python-agent/UPLOAD_RAW on node] took 18 minutes, i think that's where the bulk of the time was spent on the post play that timed out15:48
fungiTASK [Copy files from /home/zuul/src/opendev.org/openstack/ironic-python-agent/UPLOAD_TAR on node] took a further 12 minutes before it reached the timeout15:49
fungidtantsur: so either the files are a lot larger or bandwidth between the executor and job node was more constrained than usual or there were i/o problems reading on the job node or writing to the local disk on the executor15:52
dtantsurI don't think the files have increased in size recently (and the patch does not change the size)15:53
fungithat analysis was specific to the first example. for the second example just the copy from UPLOAD_RAW task alone ran almost 28 minutes, leaving about 2 minutes for the UPLOAD_TAR copy before it got killed15:55
clarkbfstrim was able to trim almost 1GB of data according to the dib log. Unfortunately, that doesn't tell us how large the result is15:56
fungii'll check for commonalities between these (same executor, same cloud provider/region, et cetera), maybe there's a correlation15:56
mnasiadkaclarkb: well, I thought that if there’s some slowness in the middle of the night - it might be worse in daytime :)15:58
fungidtantsur: both builds ran in our openmetal provider, so that's one possible thread to pull on. checking executors next15:59
fungione ran on ze06 and one on ze10 so i don't suspect it's an executor-specific issue15:59
fungii wonder if i/o is slow in openmetal right now, or if it's impacted by some network issue16:00
clarkbpulling packages was quick: Fetched 231 MB in 2s (112 MB/s)16:00
fungiyeah, so maybe inbound network connectivity is fine but outbound is constrained?16:01
clarkbor specific to the path between these clouds16:01
fungiin this case the slow transfers were from openmetal to the executors16:01
fungimaybe worth trying to pull a large file from the openmetal mirror to an executor16:02
fungii need to step away for a moment, but can try that in a few minutes16:02
clarkbhttps://zuul.opendev.org/t/openstack/build/b95e332cef70428f8aca06889906e394/log/job-output.txt#6106 this says the initramfs file is 302MB so not massive16:03
clarkb(there are a few other things copied too, so we haven't ruled out total file size being huge yet, but it's looking less and less likely that is the issue)16:04
clarkbI'm getting ~400-500KBps from the mirror to my local machine16:09
clarkb(it is much easier to test that than constraining the test to the executor(s))16:09
clarkbwhich doesn't explain 20 minutes for a ~300MB transfer but does probably point at a problem16:10
priteaufungi: Following up on the {tarballs,releases}.openstack.org issues I mentioned yesterday, I am not sure they are related to Cloudflare actually. What we see in Kayobe CI is an occasional "The handshake operation timed out" which I think would happen once DNS resolution is complete anyway. Examples from just earlier today:16:10
priteauhttps://9646a7fb82b47fbe6288-a22e2178400a1d74c0dfc0d0570ba9cf.ssl.cf2.rackcdn.com/openstack/e30a4c91bd504fb38196e56dfc18b9de/primary/ansible/tenks-deploy16:10
priteauhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_97a/openstack/97a23553e11845d08dd4ae3fd923f181/primary/ansible/overcloud-deploy-pre-upgrade16:10
priteauAlso just for regular browsing of https://releases.openstack.org/ it feels very slow16:11
clarkbpriteau: if you link to the logs within the zuul web ui it makes things a lot easier because you can link directly to the lines with the issues and we can more easily navigate to other information like where the job ran etc16:11
priteauSorry, I don't use this feature often, checking16:13
priteauhttps://zuul.opendev.org/t/openstack/build/e30a4c91bd504fb38196e56dfc18b9de/log/primary/ansible/tenks-deploy#2463-246816:13
priteauhttps://zuul.opendev.org/t/openstack/build/97a23553e11845d08dd4ae3fd923f181/log/primary/ansible/overcloud-deploy-pre-upgrade#30143-3014416:14
clarkbthat job did not run in openmetal so unlikely to be directly related to the potential network issues there16:14
priteauIt only happens occasionally, but often enough to require regular rechecks16:15
priteauI know, we should add retries on our http fetches16:15
clarkblooks like you have retries on the second one16:16
priteauYou're right, it failed multiple times then16:17
priteauI am just wondering if the server is overloaded16:17
clarkbthe first request is to https://releases.openstack.org/constraints/upper/master which redirects you to https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt The second is to https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz which redirects you to16:19
clarkbhttps://tarballs.opendev.org/openstack/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz. releases.openstack.org, tarballs.openstack.org and tarballs.opendev.org are all hosted on the same afs backed service (opendev.org is not)16:19
clarkbso it does seem likely that the common issue is in the static file hosting backed by afs. Considering the error reported is an inability to establish an ssl connection, the problem is probably on the frontend (so not an afs problem)16:19
clarkbon static.opendev.org we have a server limit of 32 and we have 32 child pids for apache2, so yes, it seems likely we're getting all the server slots filled up, probably by ai crawlers16:21
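A quick way to sanity-check that diagnosis is to compare the number of running Apache workers against the configured limit; a minimal sketch, assuming a Debian-style apache2 layout:

    # count running apache2 child processes
    ps -C apache2 --no-headers | wc -l
    # find the configured process/connection limits
    grep -ri -e serverlimit -e maxrequestworkers /etc/apache2/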
clarkbwe recently-ish bumped the total number of connections on that server and that helped quite a bit. But maybe we should increase the limits further16:22
clarkbultimately the problem now is that there is an arms race on the internet to collect as much data as quickly as possible, consequences be damned; that is someone else's problem. We're the collateral damage16:22
clarkbin any case server load seems reasonable so I think we can bump up those limits16:23
clarkbthen we wait another month until we can't keep up anymore and decide if we can bump limits further. Eventually can decide if we round robin or load balance across more servers16:23
fungishould we multiply all the tuning values we set in https://review.opendev.org/c/opendev/system-config/+/962973 or just some of them?16:26
clarkbfungi: I'm thinking we multiply just the process and connection limit. I have a change almost ready16:28
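For context, the kind of mpm_event tuning being discussed looks roughly like the sketch below; the values shown are illustrative, not the exact numbers in the change:

    # ServerLimit caps child processes, MaxRequestWorkers caps simultaneous
    # connections; ServerLimit * ThreadsPerChild must cover MaxRequestWorkers
    <IfModule mpm_event_module>
        ThreadsPerChild      64
        ServerLimit          64
        MaxRequestWorkers  4096
    </IfModule>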
fungiah cool, standing by to review16:28
fungithough i'm going to need to disappear to run errands shortly16:28
clarkbfungi: on the openmetal side of things if you are able to confirm slow data transfer off of the mirror then we can consider simply disabling that cloud for now and sending them an email16:28
fungiyeah, checking16:29
opendevreviewClark Boylan proposed opendev/system-config master: Increase static webserver limits to 4096 connections  https://review.opendev.org/c/opendev/system-config/+/96771116:30
priteauThanks a lot clarkb, I will report if it helps16:32
clarkband my thought for only bumping the process and connection limit is that it gives us room for emergency increases via the max thread bump later if we need to16:32
fungii temporarily created http://mirror.iad3.openmetal.opendev.org/1gb-test.dat as a test file and am retrieving it with wget on ze1216:33
fungitransfer rate is around 200KB/s but goes up and down quite a bit16:34
fungijust made a smaller 1mb version and average transfer rate fetching it was 277 KB/s16:35
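A sketch of that kind of spot check, with hypothetical paths (the dd runs on the mirror, the wget on an executor):

    # on the mirror: drop a throwaway test file into the web root
    dd if=/dev/urandom of=/var/www/mirror/1gb-test.dat bs=1M count=1024
    # on the executor: fetch it and note the transfer rate wget reports
    wget -O /dev/null http://mirror.iad3.openmetal.opendev.org/1gb-test.dat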
clarkbat 200KB/s we can expect a 302MB file to take almost 1600 seconds to transfer16:35
clarkbwhich is in line with the timing on one of the two test cases, so ya, I suspect this is our culprit16:36
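The arithmetic behind that estimate, treating 1 MB as 1024 KB:

    # 302 MB at ~200 KB/s
    echo "302 * 1024 / 200" | bc   # ~1546 seconds, i.e. roughly 26 minutes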
fungii'll push a change to temporarily turn down that provider16:36
opendevreviewJeremy Stanley proposed opendev/zuul-providers master: Temporarily turn down OpenMetal  https://review.opendev.org/c/opendev/zuul-providers/+/96771716:41
opendevreviewJeremy Stanley proposed opendev/zuul-providers master: Revert "Temporarily turn down OpenMetal"  https://review.opendev.org/c/opendev/zuul-providers/+/96771816:41
fungiokay, heading out to run errands, but will try to make it quick. back soon16:42
opendevreviewMerged opendev/zuul-providers master: Temporarily turn down OpenMetal  https://review.opendev.org/c/opendev/zuul-providers/+/96771716:44
clarkbdtantsur: ^ that is the workaround for now. I'm currently drafting an email to that cloud to see if we can correct it properly16:45
dtantsurThank you!16:47
dtantsurIs it possible to somehow restart the jobs? I'd really hate to hurry a dummy change through the gate to get the images fixed.16:47
clarkbI think the answer is it depends on the jobs and the buildset. We can reenqueue the entire buildset associated with those jobs but that will run all of the jobs in the buildset, and if there are issues with idempotency then we shouldn't do that16:49
clarkbI think publish-openstack-python-branch-tarball is only safe if nothing else has merged since16:50
clarkbsince we don't want to rollback that data16:50
clarkbI want to say the github replication is safe as it takes the state of the world and pushes it but I'm not certain of that16:51
clarkbhttps://zuul.opendev.org/t/openstack/buildset/f1db495797004110af2396d07bd18057 is what I'm looking at16:51
clarkbI think in both cases we're safe if nothing else has merged. It becomes trickier if other changes have merged (including to other branches)16:51
dtantsurThis was the last thing that merged, yes https://review.opendev.org/q/project:openstack/ironic-python-agent+status:merged16:52
dtantsurand I don't see anything in the gate now16:52
clarkbdtantsur: ack give me a few and I'll reenqueue that run16:53
clarkbI think I might be able to do that if I login as admin in the web ui so I'll try that first16:53
clarkbdtantsur: https://zuul.opendev.org/t/openstack/buildset/f1db495797004110af2396d07bd18057 this is the correct set of failures right?16:56
clarkbdtantsur: I'm about to reenqueue them so want to double check with you that that is the correct state we want to rerun first16:56
dtantsurclarkb: yep, that's it16:56
clarkbdtantsur: ok things should be enqueued again16:57
dtantsurThanks!!16:57
clarkbyou're welcome. Sorry for the trouble16:59
clarkbinfra-root my draft email to openmetal https://etherpad.opendev.org/p/Xd6JERl87Kr1zFnl3rHN17:18
clarkbdtantsur: looks like the jobs succeeded this time around17:20
dtantsurYep, success, thanks again!17:21
dtantsur(I've been waiting for new images to show up on tarballs.o.o, which I guess takes a bit)17:21
clarkbdtantsur: afs publishes every 5 minutes17:21
clarkbusing cron, so it's the next 5 minute block + time to publish17:22
clarkbshould be done within 10 minutes of job completion typically17:22
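In broad strokes, that publication step is a cron-driven AFS volume release; a hypothetical sketch (the schedule, script name, and volume name are made up, not the actual OpenDev configuration):

    # crontab on the AFS release host
    */5 * * * * /usr/local/bin/release-tarballs.sh
    # where the script essentially runs:
    #   vos release -id project.tarballs
    # so the read-only replicas serving tarballs.opendev.org pick up new files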
dtantsurHmm, I think it's 20 minutes already17:22
clarkbhttps://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc the 'project vos release timers' graph captures this17:23
clarkblooks like your larger image data caused a longer vos release (it took 13 minutes) but appears to be done now according to that graph17:24
dtantsurI see some of the files (e.g. ipa-debian-master.kernel) from yesterday still, while most are updated17:24
dtantsurstrange17:24
dtantsurwhat do you see in https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/?17:24
dtantsurmaybe something caches them on the way to me?17:24
clarkbhttps://tarballs.opendev.org/openstack/ironic-python-agent/dib/?C=M;O=D they look updated to me there17:24
clarkbtwo new files from 17:02 UTC today17:25
dtantsurThese two, but not the rest17:25
dtantsurchecking https://ddbcd145bedf294c8288-f0a55fc4957fe55450e72f1f6d277d79.ssl.cf1.rackcdn.com/openstack/4fdd4e83f2754616826a7bc82adbbba4/job-output.txt, it looks like all files were uploaded to AFS17:26
clarkbya the subdir contains updated manifests17:26
clarkbhttps://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/ipa-centos9-master.d/ and https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/ipa-debian-master.d/ are up to date17:26
dtantsurah, now everything is updated, sorry for the noise17:26
clarkback I'm guessing your early checks cached something and it took a few for it to renew the cached data17:27
dtantsurcould be. will wait half an hour next time just to be sure17:27
fungiclarkb: minor note on the draft e-mail but lgtm overall, thanks!18:04
clarkbfungi: yup just made the edit that I think you're suggesting. Does that look right? If so I'll go ahead and send this out and cc infra rooters on it18:04
fungiyep, perfect. thanks again!18:05
clarkbemail is sent18:12
clarkboh also I meant to mention that reenqueuing the buildset was super easy via the zuul web ui18:15
clarkbdefinitely preferable to figuring out the cli command and getting all the input data correct18:15
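For comparison, the CLI path would look roughly like the zuul-client invocation below; the pipeline, ref, and revision are placeholders rather than the values actually used:

    # re-enqueue a ref-updated buildset (e.g. a post pipeline run); needs admin auth
    zuul-client --zuul-url https://zuul.opendev.org enqueue-ref \
        --tenant openstack \
        --pipeline post \
        --project openstack/ironic-python-agent \
        --ref refs/heads/master \
        --newrev <merged-commit-sha>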
clarkblooks like the static.o.o connection limit bump change has two +2's I'm going to approve it now18:19
clarkbI'll double check it deploys properly and the server doesn't look sad afterwards18:19
fungiyeah, i've done reenqueue via webui a few times and it's super easy18:20
opendevreviewMerged opendev/system-config master: Increase static webserver limits to 4096 connections  https://review.opendev.org/c/opendev/system-config/+/96771118:52
clarkbif apache isn't restarted by the deployment of ^ I will do that as I think increasing this limit requires a proper restart18:53
clarkbthe restart was automatic18:56
clarkband I'm able to reach releases.openstack.org as well as tarballs.openstack.org. I will check back in a bit to see if all of the processes are being used18:57
clarkbso far I've seen us running up to 27 child processes and now we are down to 24. This means we aren't exercising the new limit yet (which is good; maybe that means we were right on the edge before, which would also explain why this happens infrequently)19:20
clarkbok we're over the old limit now and system load seems fine19:40
fungiyeah, checking in and load average on the server is under 1.020:15
clarkbtonyb: looking at a calendar, how does December 7 for me and December 8 for you look if we start thinking about the gerrit 3.11 upgrade? I don't want to do November 30 because it's the tail end of a holiday weekend for me20:16
clarkbI'm thinking that should be a decent amount of time to work through testing and all that and we can announce it soonish if we think that is a workable date. Probably ~2100 UTC December 7?20:16
fungii'm available for that window, and agree it gives us time to prep20:37
clarkbI pulled up the gerrit 3.11 release notes right before lunch and I'm remembering why this is taking so long. There is a lot of information to digest and prepare for. That said I suspect that the actual amount of change is not that bad and probably no worse than any other upgrade20:56
clarkbI also thought about upgrading straight to 3.12 but the reason not to do that is the java 21 transition. It will be easier to do that in place on 3.11 I think20:56
dmsimard[m]hi, just following up, I was firefighting most of the day and didn't get a chance to sit down with Arnaud to talk about the flavors but we plan to talk about it soon, I'll let you know when I have an update22:10
clarkbdmsimard[m]: thanks!22:11
clarkbinfra-root I'm slowly getting https://etherpad.opendev.org/p/gerrit-upgrade-3.11 into shape. So far I've gone through the existing known issues and breaking changes and added a couple of newer items from the release notes and have also started evaluating some of them with notes. For others where work needs to be done I'm trying to capture that with explicit TODOs22:14
clarkbbut there are still a number of release note entries I need to read through and decide if they need to go on the etherpad or not22:15
fungiclarkb: looks like openmetal replied, but also in testing now i get 85.3 MB/s average where before i was getting orders of magnitude slower transfer rates. i wonder if it could have sped up after we stopped running jobs there, and maybe something about our workload was competing for limited bandwidth?22:44
clarkbfungi: that is an interesting theory. I tested all of this after the change disabling the region landed, but existing running jobs would keep going until complete so we'd have a time period of overlap?22:45
fungiperhaps22:46
clarkbfungi: do you want to respond to them with what you've just found and we can suggest that as a potential cause? Probably reenable it and then monitor from there?22:46
fungior it could be that we were competing for network infrastructure with another customer there22:46
clarkbya, could also be a different noisy neighbor that is now quiet. Considering the web crawling activity on the internet that wouldn't surprise me at all22:46
fungiyeah, can reply in a few minutes22:46
clarkbthanks. I'm happy to as well, but figured if you've already rerun tests then you're far ahead of me22:47
opendevreviewMerged opendev/zuul-providers master: Revert "Temporarily turn down OpenMetal"  https://review.opendev.org/c/opendev/zuul-providers/+/96771823:02
fungii've replied to ramon23:04
fungi(cc'ing everyone still)23:04
clarkbthanks again23:04
fungioh, though they won't receive it because openmetal.io is using gmail, so it bounced back to me for not using a corporatey enough mailserver23:05
clarkbI think I've gotten through the entire 3.11 release notes document and have done my best to call out things that need further attention in the etherpad23:05
clarkbfungi: I can respond with your content23:05
fungithanks23:05
clarkb(if that is the best way to handle this)23:05
fungiwfm, sure23:05
clarkbdone23:06
clarkbso now I have a good number of TODO items to look into on the held nodes23:06
clarkbplease feel free to look over the release notes and call out anything I missed, or identify issues with the evaluations I've already done on the etherpad, or volunteer to dig into any of these items. That said, I'll be doing my best to look into things myself over the next few days23:07
clarkb3.11 has a new feature where you can tell it to not automatically run the online reindex upon upgrade. The documentation says this is largely geared at HA sites and managing zero downtime upgrades, but I think we can take advantage of this to make downgrades less painful if we need to do them. Basically upgrade to version N+1 but don't reindex yet. Make sure everything seems to be23:11
clarkbworking then manually trigger online reindexing when ready. That way if you find a problem you can downgrade without taking the time to do an offline reindex to reindex back to version N's index versions23:11
clarkb3.11 has no new index versions so this isn't a good version to test that assumption on as it should noop anyway23:11
clarkbbut something to keep in mind for the future, and I've written a note about it23:11
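If the knob in question is the index.onlineUpgrade setting (an assumption; the release notes may describe a different mechanism), the gerrit.config side of that approach would look roughly like:

    # don't start online reindexing automatically when the new version boots
    [index]
        onlineUpgrade = false
    # then, once the upgrade looks healthy, kick off reindexing manually,
    # e.g. with the "gerrit index start" SSH command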
corvusclarkb: fungi er, i tested it a while ago (some time after clarkb sent email) and got something very slow (slower than fungi).  sorry i don't know the time; i didn't think it was interesting enough to mention.23:20
clarkbcorvus: thanks, we'll just have to monitor it and see if it comes back23:20
clarkbit is entirely possible something on the internet fell over and got fixed too. Internet connectivity is such fun to debug23:21
*** diablo_rojo_phone is now known as Guest3164723:58
