| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:16 |
|---|---|---|
| noonedeadpunk | this is indeed like what you attempted, and I never managed to work on the implementation | 07:34 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:35 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:37 |
| mnasiadka | clarkb: looking at trixie arm64 patch - I think there are some storage iops issues again in the OSUOSL provider… | 12:48 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 13:03 |
| dtantsur | hey folks, how do I know why these builds failed? I cannot find any hints: https://zuul.opendev.org/t/openstack/build/b95e332cef70428f8aca06889906e394 https://zuul.opendev.org/t/openstack/build/9c4bd30268a9432c912d808ae6f2efc3 | 15:34 |
| dtantsur | (to make matters worse, both contain critical fixes for the images we build, so the images are still broken) | 15:35 |
| dtantsur | Ah, so one of the playbooks timed out... which does not explain anything to me, unfortunately | 15:36 |
| dtantsur | But it looks like uploading images to tarballs.o.o may be broken | 15:37 |
| opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Trigger some channel log collisions https://review.opendev.org/c/opendev/system-config/+/967708 | 15:41 |
| fungi | dtantsur: looking | 15:42 |
| clarkb | mnasiadka: osuosl shares a timezone with me. I suspect if there were problems in the middle of the night they may not be immediately handled. That said, it's now morning here and we can either try again to see if the issue was temporary (though it appears to have occurred ~4 times already) or just directly ask osuosl to take a look | 15:47 |
| clarkb | Ramereth[m]: ^ fyi we're seeing slow image builds in the osuosl cloud which may imply some sort of iops issue. Not sure if you're aware of anything going on (realizing it's early still) | 15:47 |
| fungi | dtantsur: TASK [Copy files from /home/zuul/src/opendev.org/openstack/ironic-python-agent/UPLOAD_RAW on node] took 18 minutes, i think that's where the bulk of the time was spent on the post play that timed out | 15:48 |
| fungi | TASK [Copy files from /home/zuul/src/opendev.org/openstack/ironic-python-agent/UPLOAD_TAR on node] took a further 12 minutes before it reached the timeout | 15:49 |
| fungi | dtantsur: so either the files are a lot larger, or bandwidth between the executor and job node was more constrained than usual, or there were i/o problems reading on the job node or writing to the local disk on the executor | 15:52 |
| dtantsur | I don't think the files have increased in size recently (and the patch does not change the size) | 15:53 |
| fungi | that analysis was specific to the first example. for the second example just the copy from UPLOAD_RAW task alone ran almost 28 minutes, leaving about 2 minutes for the UPLOAD_TAR copy before it got killed | 15:55 |
| clarkb | fstrim was able to trim almost 1GB of data according to the dib log. Unfortunately, that doesn't tell us how large the result is | 15:56 |
| fungi | i'll check for commonalities between these (same executor, same cloud provider/region, et cetera), maybe there's a correlation | 15:56 |
| mnasiadka | clarkb: well, I thought that if there’s some slowness in the middle of the night - it might be worse in daytime :) | 15:58 |
| fungi | dtantsur: both builds ran in our openmetal provider, so that's one possible thread to pull on. checking executors next | 15:59 |
| fungi | one ran on ze06 and one on ze10 so i don't suspect it's an executor-specific issue | 15:59 |
| fungi | i wonder if i/o is slow in openmetal right now, or if it's impacted by some network issue | 16:00 |
| clarkb | pulling packages was quick Fetched 231 MB in 2s (112 MB/s) | 16:00 |
| fungi | yeah, so maybe inbound network connectivity is fine but outbound is constrained? | 16:01 |
| clarkb | or specific to the path between these clouds | 16:01 |
| fungi | in this case the slow transfers were from openmetal to the executors | 16:01 |
| fungi | maybe worth trying to pull a large file from the openmetal mirror to an executor | 16:02 |
| fungi | i need to step away for a moment, but can try that in a few minutes | 16:02 |
| clarkb | https://zuul.opendev.org/t/openstack/build/b95e332cef70428f8aca06889906e394/log/job-output.txt#6106 this says the initramfs file is 302MB so not massive | 16:03 |
| clarkb | (there are a few other things copied too, so we haven't ruled out total file size being huge yet, but it's looking less and less likely that that is the issue) | 16:04 |
| clarkb | I'm getting ~400-500KBps from the mirror to my local machine | 16:09 |
| clarkb | (it is much easier to test that than to constrain the test to the executor(s)) | 16:09 |
| clarkb | which doesn't explain 20 minutes for a ~300MB transfer but does probably point at a problem | 16:10 |
| priteau | fungi: Following up on the {tarballs,releases}.openstack.org issues I mentioned yesterday, I am not sure they are actually related to Cloudflare. What we see in Kayobe CI is an occasional "The handshake operation timed out", which I think would only happen after DNS resolution has already completed. Examples from just earlier today: | 16:10 |
| priteau | https://9646a7fb82b47fbe6288-a22e2178400a1d74c0dfc0d0570ba9cf.ssl.cf2.rackcdn.com/openstack/e30a4c91bd504fb38196e56dfc18b9de/primary/ansible/tenks-deploy | 16:10 |
| priteau | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_97a/openstack/97a23553e11845d08dd4ae3fd923f181/primary/ansible/overcloud-deploy-pre-upgrade | 16:10 |
| priteau | Also, just regular browsing of https://releases.openstack.org/ feels very slow | 16:11 |
| clarkb | priteau: if you link to the logs within the zuul web ui it makes things a lot easier because you can link directly to the lines with the issues and we can more easily navigate to other information like where the job ran etc | 16:11 |
| priteau | Sorry, I don't use this feature often, checking | 16:13 |
| priteau | https://zuul.opendev.org/t/openstack/build/e30a4c91bd504fb38196e56dfc18b9de/log/primary/ansible/tenks-deploy#2463-2468 | 16:13 |
| priteau | https://zuul.opendev.org/t/openstack/build/97a23553e11845d08dd4ae3fd923f181/log/primary/ansible/overcloud-deploy-pre-upgrade#30143-30144 | 16:14 |
| clarkb | that job did not run in openmetal so unlikely to be directly related to the potential network issues there | 16:14 |
| priteau | It only happens occasionally, but often enough to require regular rechecks | 16:15 |
| priteau | I know, we should add retries on our http fetches | 16:15 |
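A minimal sketch of the retry-with-backoff idea priteau mentions here, assuming a Python requests-based client; the function name, parameters, and retry policy are illustrative assumptions, and the actual Kayobe CI tasks may implement retries through a different mechanism entirely.

```python
# Minimal retry-with-backoff sketch, assuming a Python requests client.
# fetch_with_retries and its retry parameters are illustrative assumptions,
# not the mechanism the CI jobs actually use.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_with_retries(url: str, attempts: int = 3, timeout: int = 60) -> bytes:
    session = requests.Session()
    retry = Retry(
        total=attempts,
        backoff_factor=2,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.content

# Example, using one of the URLs discussed just below:
# data = fetch_with_retries("https://releases.openstack.org/constraints/upper/master")
```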
| clarkb | looks like you have retries on the second one | 16:16 |
| priteau | You're right, it failed multiple times then | 16:17 |
| priteau | I am just wondering if the server is overloaded | 16:17 |
| clarkb | the first request is to https://releases.openstack.org/constraints/upper/master which redirects you to https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt The second is to https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz which redirects you to | 16:19 |
| clarkb | https://tarballs.opendev.org/openstack/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz. releases.openstack.org, tarballs.openstack.org and tarballs.opendev.org are all hosted on the same afs backed service (opendev.org is not) | 16:19 |
| clarkb | so it does seem likely that the common issue is in the static file hosting backed by afs. Given that the error reported is an inability to establish an ssl connection, the problem is probably on the frontend (so not an afs problem) | 16:19 |
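As a quick way to reproduce the redirect chains clarkb traces above, a hedged Python sketch (the URLs are the ones quoted in the chat; using `requests` here is an assumed tooling choice, not what was actually run):

```python
# Follow the redirect chains for the two failing URLs to confirm which
# backend ultimately serves them. The URLs are the ones quoted above.
import requests

for url in (
    "https://releases.openstack.org/constraints/upper/master",
    "https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz",
):
    resp = requests.head(url, allow_redirects=True, timeout=30)
    hops = [r.url for r in resp.history] + [resp.url]
    print(" -> ".join(hops))

# Per the discussion: releases.openstack.org redirects to opendev.org (a
# separate service), while tarballs.openstack.org redirects to
# tarballs.opendev.org, which sits on the same AFS-backed static host as
# releases.openstack.org itself.
```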
| clarkb | on static.opendev.org we have a server limit of 32 and we have 32 child pids for apache2, so yes, it seems likely we're getting all the server slots filled up, probably by ai crawlers | 16:21 |
| clarkb | we recently-ish bumped the total number of connections on that server and that helped quite a bit. But maybe we should increase the limits further | 16:22 |
| clarkb | ultimately the problem now is that there is an arms race on the internet to collect as much data as quickly as possible, consequences be damned; that is someone else's problem. We're the collateral damage | 16:22 |
| clarkb | in any case server load seems reasonable so I think we can bump up those limits | 16:23 |
| clarkb | then we wait another month until we can't keep up anymore and decide if we can bump limits further. Eventually we can decide if we round robin or load balance across more servers | 16:23 |
| fungi | should we multiply all the tuning values we set in https://review.opendev.org/c/opendev/system-config/+/962973 or just some of them? | 16:26 |
| clarkb | fungi: I'm thinking we multiply just the process and connection limit. I have a change almost ready | 16:28 |
| fungi | ah cool, standing by to review | 16:28 |
| fungi | though i'm going to need to disappear to run errands shortly | 16:28 |
| clarkb | fungi: on the openmetal side of things if you are able to confirm slow data transfer off of the mirror then we can consider simply disabling that cloud for now and sending them an email | 16:28 |
| fungi | yeah, checking | 16:29 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Increase static webserver limits to 4096 connections https://review.opendev.org/c/opendev/system-config/+/967711 | 16:30 |
| priteau | Thanks a lot clarkb, I will report if it helps | 16:32 |
| clarkb | and my thought for only bumping the process and connection limit is that it gives us room for emergency increases via the max thread bump later if we need to | 16:32 |
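For context, a hedged worked example of the Apache event-MPM arithmetic behind this reasoning: only the old 32-process limit and the 4096-connection target come from the chat, and the threads-per-child figure below is an assumed placeholder, not the value in the real config.

```python
# Worked example of the event-MPM relationship behind this tuning:
# MaxRequestWorkers must not exceed ServerLimit * ThreadsPerChild.
# Only the 32-process old limit and the 4096 target come from the chat;
# threads_per_child here is an assumed placeholder.
threads_per_child = 64            # assumption, not the real configured value
old_server_limit = 32             # from the chat: 32 child pids, server limit 32
new_max_request_workers = 4096    # target from change 967711

new_server_limit = new_max_request_workers // threads_per_child
print(f"ServerLimit {old_server_limit} -> {new_server_limit}")

# Leaving ThreadsPerChild alone keeps it available as a later "emergency"
# knob: doubling it would roughly double capacity without another process bump.
print(f"capacity if threads later doubled: {new_server_limit * threads_per_child * 2}")
```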
| fungi | i temporarily created http://mirror.iad3.openmetal.opendev.org/1gb-test.dat as a test file and am retrieving it with wget on ze12 | 16:33 |
| fungi | transfer rate is around 200KB/s but goes up and down quite a bit | 16:34 |
| fungi | just made a smaller 1mb version and average transfer rate fetching it was 277 KB/s | 16:35 |
| clarkb | at 200KB/s we can expect a 302MB file to take almost 1600 seconds to transfer | 16:35 |
| clarkb | which is in line with the timing on one of the two test cases so ya I suspect this is our culprit | 16:36 |
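A quick sanity check of that estimate, using only the numbers quoted earlier in the log (the ~302MB initramfs and the ~200KB/s rate fungi measured from the executor):

```python
# Sanity check of the transfer-time estimate using numbers quoted above:
# a ~302MB initramfs at the ~200KB/s rate observed from ze12.
size_bytes = 302 * 1024 * 1024    # ~302MB initramfs from the job log
rate_bytes_per_s = 200 * 1024     # ~200KB/s wget rate from the executor
seconds = size_bytes / rate_bytes_per_s
print(f"~{seconds:.0f}s (~{seconds / 60:.0f} minutes)")  # ~1546s, about 26 minutes
# That lines up with the ~28 minute UPLOAD_RAW copy in the second failed build.
```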
| fungi | i'll push a change to temporarily turn down that provider | 16:36 |
| opendevreview | Jeremy Stanley proposed opendev/zuul-providers master: Temporarily turn down OpenMetal https://review.opendev.org/c/opendev/zuul-providers/+/967717 | 16:41 |
| opendevreview | Jeremy Stanley proposed opendev/zuul-providers master: Revert "Temporarily turn down OpenMetal" https://review.opendev.org/c/opendev/zuul-providers/+/967718 | 16:41 |
| fungi | okay, heading out to run errands, but will try to make it quick. back soon | 16:42 |
| opendevreview | Merged opendev/zuul-providers master: Temporarily turn down OpenMetal https://review.opendev.org/c/opendev/zuul-providers/+/967717 | 16:44 |
| clarkb | dtantsur: ^ that is the workaround for now. I'm currently drafting an email to that cloud to see if we can correct it properly | 16:45 |
| dtantsur | Thank you! | 16:47 |
| dtantsur | Is it possible to somehow restart the jobs? I'd really hate to hurry a dummy change through the gate to get the images fixed. | 16:47 |
| clarkb | I think the answer is it depends on the jobs and the buildset. We can reenqueue the entire buildset associated with those jobs but that will run all of the jobs in the buildset and if there are issues with idempotency then we shouldn't do that | 16:49 |
| clarkb | I think publish-openstack-python-branch-tarball is only safe if nothing else has merged since | 16:50 |
| clarkb | since we don't want to rollback that data | 16:50 |
| clarkb | I want to say the github replication is safe as it takes the state of the world and pushes it but I'm not certain of that | 16:51 |
| clarkb | https://zuul.opendev.org/t/openstack/buildset/f1db495797004110af2396d07bd18057 is what I'm looking at | 16:51 |
| clarkb | I think in both cases we're safe if nothing else has merged. It becomes trickier if other changes have merged (including to other branches) | 16:51 |
| dtantsur | This was the last thing that merged, yes https://review.opendev.org/q/project:openstack/ironic-python-agent+status:merged | 16:52 |
| dtantsur | and I don't see anything in the gate now | 16:52 |
| clarkb | dtantsur: ack give me a few and I'll reenqueue that run | 16:53 |
| clarkb | I think I might be able to do that if I login as admin in the web ui so I'll try that first | 16:53 |
| clarkb | dtantsur: https://zuul.opendev.org/t/openstack/buildset/f1db495797004110af2396d07bd18057 this is the correct set of failures right? | 16:56 |
| clarkb | dtantsur: I'm about to reenqueue them so want to double check with you that that is the correct state we want to rerun first | 16:56 |
| dtantsur | clarkb: yep, that's it | 16:56 |
| clarkb | dtantsur: ok things should be enqueued again | 16:57 |
| dtantsur | Thanks!! | 16:57 |
| clarkb | you're welcome. Sorry for the trouble | 16:59 |
| clarkb | infra-root my draft email to openmetal https://etherpad.opendev.org/p/Xd6JERl87Kr1zFnl3rHN | 17:18 |
| clarkb | dtantsur: looks like the jobs succeeded this time around | 17:20 |
| dtantsur | Yep, success, thanks again! | 17:21 |
| dtantsur | (I've been waiting for new images to show up on tarballs.o.o, which I guess takes a bit) | 17:21 |
| clarkb | dtantsur: afs publishes every 5 minutes | 17:21 |
| clarkb | using cron, so it's the next 5-minute block + time to publish | 17:22 |
| clarkb | should be done within 10 minutes of job completion typically | 17:22 |
| dtantsur | Hmm, I think it's 20 minutes already | 17:22 |
| clarkb | https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc the 'project vos release timers' graph captures this | 17:23 |
| clarkb | looks like your larger image data caused a longer vos release (it took 13 minutes) but appears to be done now according to that graph | 17:24 |
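A back-of-the-envelope timeline for why the files could still look stale ~20 minutes after the job finished, combining the 5-minute cron cadence mentioned above with the 13-minute vos release from the graph (a rough sketch, not an exact schedule):

```python
# Back-of-the-envelope publish delay: up to one 5-minute cron interval before
# the publish job starts, plus the ~13 minute vos release from the graph.
worst_case_cron_wait_min = 5   # publish runs from cron every 5 minutes
vos_release_min = 13           # duration reported by the grafana graph above
print(f"worst case: ~{worst_case_cron_wait_min + vos_release_min} minutes after job completion")
# So files still looking stale around the 20 minute mark is consistent with a
# publish that had only just finished.
```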
| dtantsur | I see some of the files (e.g. ipa-debian-master.kernel) from yesterday still, while most are updated | 17:24 |
| dtantsur | strange | 17:24 |
| dtantsur | what do you see in https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/? | 17:24 |
| dtantsur | maybe something caches them on the way to me? | 17:24 |
| clarkb | https://tarballs.opendev.org/openstack/ironic-python-agent/dib/?C=M;O=D they look updated to me there | 17:24 |
| clarkb | two new files from 17:02 UTC today | 17:25 |
| dtantsur | These two, but not the rest | 17:25 |
| dtantsur | checking https://ddbcd145bedf294c8288-f0a55fc4957fe55450e72f1f6d277d79.ssl.cf1.rackcdn.com/openstack/4fdd4e83f2754616826a7bc82adbbba4/job-output.txt, it looks like all files were uploaded to AFS | 17:26 |
| clarkb | ya the subdir contains updated manifests | 17:26 |
| clarkb | https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/ipa-centos9-master.d/ and https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/ipa-debian-master.d/ are up to date | 17:26 |
| dtantsur | ah, now everything is updated, sorry for the noise | 17:26 |
| clarkb | ack I'm guessing your early checks cached something and it took a few for it to renew the cached data | 17:27 |
| dtantsur | could be. will wait half an hour next time just to be sure | 17:27 |
| fungi | clarkb: minor note on the draft e-mail but lgtm overall, thanks! | 18:04 |
| clarkb | fungi: yup just made the edit that I think you're suggesting. Does that look right? If so I'll go ahead and send this out and cc infra rooters on it | 18:04 |
| fungi | yep, perfect. thanks again! | 18:05 |
| clarkb | email is sent | 18:12 |
| clarkb | oh also I meant to mention that reenqueuing the buildset was super easy via the zuul web ui | 18:15 |
| clarkb | definitely preferable to figuring out the cli command and getting all the input data correct | 18:15 |
| clarkb | looks like the static.o.o connection limit bump change has two +2's I'm going to approve it now | 18:19 |
| clarkb | I'll double check it deploys properly and the server doesn't look sad afterwards | 18:19 |
| fungi | yeah, i've done reenqueue via webui a few times and it's super easy | 18:20 |
| opendevreview | Merged opendev/system-config master: Increase static webserver limits to 4096 connections https://review.opendev.org/c/opendev/system-config/+/967711 | 18:52 |
| clarkb | if apache isn't restarted by the deployment of ^ I will do that as I think increasing this limit requires a proper restart | 18:53 |
| clarkb | the restart was automatic | 18:56 |
| clarkb | and I'm able to reach releases.openstack.org as well as tarballs.openstack.org. I will check back in a bit to see if all of the processes are being used | 18:57 |
| clarkb | so far I've seen us running up to 27 child processes and now we are down to 24. This means we aren't exercising the new limit (which is good as maybe that means we were right on the edge? that would explain why this happens infrequently too) | 19:20 |
| clarkb | ok we're over the old limit now and system load seems fine | 19:40 |
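A rough sketch of the kind of check being described here (counting apache2 children against the configured limit); it assumes a Linux host with /proc and workers named apache2, and the limit value is the one quoted earlier in the log:

```python
# Count running apache2 processes and compare against the configured limit.
# Assumes a Linux host with /proc and workers named "apache2"; the old limit
# of 32 is the value quoted earlier in the log.
import os

def count_procs(name: str) -> int:
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            continue  # process exited while we were scanning
    return count

old_limit = 32  # previous server limit; note the parent process is included in the count
running = count_procs("apache2")
print(f"{running} apache2 processes (old limit was {old_limit})")
```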
| fungi | yeah, checking in and load average on the server is under 1.0 | 20:15 |
| clarkb | tonyb: looking at a calendar, how does December 7 for me and December 8 for you look if we start thinking about the gerrit 3.11 upgrade? I don't want to do November 30 because it's the tail end of a holiday weekend for me | 20:16 |
| clarkb | I'm thinking that should be a decent amount of time to work through testing and all that and we can announce it soonish if we think that is a workable date. Probably ~2100 UTC December 7? | 20:16 |
| fungi | i'm available for that window, and agree it gives us time to prep | 20:37 |
| clarkb | I pulled up the gerrit 3.11 release notes right before lunch and I'm remembering why this is taking so long. There is a lot of information to digest and prepare for. That said I suspect that the actual amount of change is not that bad and probably no worse than any other upgrade | 20:56 |
| clarkb | I also thought about upgrading straight to 3.12 but the reason not to do that is the java 21 transition. It will be easier to do that in place on 3.11 I think | 20:56 |
| dmsimard[m] | hi, just following up, I was firefighting most of the day and didn't get a chance to sit down with Arnaud to talk about the flavors but we plan to talk about it soon, I'll let you know when I have an update | 22:10 |
| clarkb | dmsimard[m]: thanks! | 22:11 |
| clarkb | infra-root I'm slowly getting https://etherpad.opendev.org/p/gerrit-upgrade-3.11 into shape. So far I've gone through the existing known issues and breaking changes and added a couple of newer items from the release notes and have also started evaluating some of them with notes. For others where work needs to be done I'm trying to capture that with explicit TODOs | 22:14 |
| clarkb | but there are still a number of release note entries I need to read through and decide if they need to go on the etherpad or not | 22:15 |
| fungi | clarkb: looks like openmetal replied, but also in testing now i get 85.3 MB/s average where before i was getting orders of magnitude slower transfer rates. i wonder if it could have sped up after we stopped running jobs there, and maybe something about our workload was competing for limited bandwidth? | 22:44 |
| clarkb | fungi: that is an interesting theory. I tested all of this after the change disabling the region landed, but existing running jobs would keep going until complete so we'd have a time period of overlap? | 22:45 |
| fungi | perhaps | 22:46 |
| clarkb | fungi: do you want to respond to them with what you've just found and we can suggest that as a potential cause? Probably reenable it and then monitor from there? | 22:46 |
| fungi | or it could be that we were competing for network infrastructure with another customer there | 22:46 |
| clarkb | ya could also be a different noisy neighbor that is now quiet. Considering the web crawling activity on the internet that wouldn't surprise me at all | 22:46 |
| fungi | yeah, can reply in a few minutes | 22:46 |
| clarkb | thanks. I'm happy to as well, but figured if you've already rerun tests then you're far ahead of me | 22:47 |
| opendevreview | Merged opendev/zuul-providers master: Revert "Temporarily turn down OpenMetal" https://review.opendev.org/c/opendev/zuul-providers/+/967718 | 23:02 |
| fungi | i've replied to ramon | 23:04 |
| fungi | (cc'ing everyone still) | 23:04 |
| clarkb | thanks again | 23:04 |
| fungi | oh, though they won't receive it because openmetal.io is using gmail, so it bounced back to me for not using a corporatey enough mailserver | 23:05 |
| clarkb | I think I've gotten through the entire 3.11 release notes document and have done my best to call out things that need further attention in the etherpad | 23:05 |
| clarkb | fungi: I can respond with your content | 23:05 |
| fungi | thanks | 23:05 |
| clarkb | (if that is the best way to handle this) | 23:05 |
| fungi | wfm, sure | 23:05 |
| clarkb | done | 23:06 |
| clarkb | so now I have a good number of TODO items to look into on the held nodes | 23:06 |
| clarkb | please feel free to look over the release notes and call out anything I missed, or identify issues with the evaluations I've already done on the etherpad, or volunteer to dig into any of these items. That said, I'll be doing my best to look into things myself over the next few days | 23:07 |
| clarkb | 3.11 has a new feature where you can tell it to not automatically run the online reindex upon upgrade. The documentation says this is largely geared at ha sites and managing zero downtime upgrades, but I think we can take advantage of this to make downgrades less painful if we need to do them. Basically upgrade to version N+1 but don't reindex yet. Make sure everything seems to be | 23:11 |
| clarkb | working then manually trigger online reindexing when ready. That way if you find a problem you can downgrade without taking the time to do an offline reindex back to version N's index versions | 23:11 |
| clarkb | 3.11 has no new index versions so this isn't a good version to test that assumption on as it should noop anyway | 23:11 |
| clarkb | but something to keep in mind for the future and I've written a note about it | 23:11 |
| corvus | clarkb: fungi er, i tested it a while ago (some time after clarkb sent email) and got something very slow (slower than fungi). sorry i don't know the time; i didn't think it was interesting enough to mention. | 23:20 |
| clarkb | corvus: thanks, we'll just have to monitor it and see if it comes back | 23:20 |
| clarkb | it is entirely possible something on the internet fell over and got fixed too. Internet connectivity is such fun to debug | 23:21 |
| *** | diablo_rojo_phone is now known as Guest31647 | 23:58 |