| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:16 |
|---|---|---|
| noonedeadpunk | this is indeed like what you attempted, and I never managed to work on the implementation | 07:34 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:35 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 07:37 |
| mnasiadka | clarkb: looking at trixie arm64 patch - I think there are some storage iops issues again in the OSUOSL provider… | 12:48 |
| opendevreview | Michal Nasiadka proposed zuul/zuul-jobs master: Use mirror_info in configure-mirrors role https://review.opendev.org/c/zuul/zuul-jobs/+/966187 | 13:03 |
| dtantsur | hey folks, how do I know why these builds failed? I cannot find any hints: https://zuul.opendev.org/t/openstack/build/b95e332cef70428f8aca06889906e394 https://zuul.opendev.org/t/openstack/build/9c4bd30268a9432c912d808ae6f2efc3 | 15:34 |
| dtantsur | (to make matters worse, both contain critical fixes for the images we build, so the images are still broken) | 15:35 |
| dtantsur | Ah, so one of the playbooks timed out... which does not explain anything to me, unfortunately | 15:36 |
| dtantsur | But it looks like uploading images to tarballs.o.o may be broken | 15:37 |
| opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Trigger some channel log collisions https://review.opendev.org/c/opendev/system-config/+/967708 | 15:41 |
| fungi | dtantsur: looking | 15:42 |
| clarkb | mnasiadka: osuosl shares a timezone with me. I suspect if there were problems in the middle of the night they may not be immediately handled. That said, it's now morning here and we can either try again to see if the issue was temporary (though it appears to have occurred ~4 times already) or just directly ask osuosl to take a look | 15:47 |
| clarkb | Ramereth[m]: ^ fyi we're seeing slow image builds in the osuosl cloud which may imply some sort of iops issue. Not sure if you're aware of anything going on (realizing it's early still) | 15:47 |
| fungi | dtantsur: TASK [Copy files from /home/zuul/src/opendev.org/openstack/ironic-python-agent/UPLOAD_RAW on node] took 18 minutes, i think that's where the bulk of the time was spent on the post play that timed out | 15:48 |
| fungi | TASK [Copy files from /home/zuul/src/opendev.org/openstack/ironic-python-agent/UPLOAD_TAR on node] took a further 12 minutes before it reached the timeout | 15:49 |
| fungi | dtantsur: so either the files are a lot larger, or bandwidth between the executor and job node was more constrained than usual, or there were i/o problems reading on the job node or writing to the local disk on the executor | 15:52 |
| dtantsur | I don't think the files have increased in size recently (and the patch does not change the size) | 15:53 |
| fungi | that analysis was specific to the first example. for the second example just the copy from UPLOAD_RAW task alone ran almost 28 minutes, leaving about 2 minutes for the UPLOAD_TAR copy before it got killed | 15:55 |
| clarkb | fstrim was able to trim almost 1GB of data according to the dib log. Unfortunately, that doesn't tell us how large the result is | 15:56 |
| fungi | i'll check for commonalities between these (same executor, same cloud provider/region, et cetera), maybe there's a correlation | 15:56 |
| mnasiadka | clarkb: well, I thought that if there’s some slowness in the middle of the night - it might be worse in daytime :) | 15:58 |
| fungi | dtantsur: both builds ran in our openmetal provider, so that's one possible thread to pull on. checking executors next | 15:59 |
| fungi | one ran on ze06 and one on ze10 so i don't suspect it's an executor-specific issue | 15:59 |
| fungi | i wonder if i/o is slow in openmetal right now, or if it's impacted by some network issue | 16:00 |
| clarkb | pulling packages was quick Fetched 231 MB in 2s (112 MB/s) | 16:00 |
| fungi | yeah, so maybe inbound network connectivity is fine but outbound is constrained? | 16:01 |
| clarkb | or specific to the path between these clouds | 16:01 |
| fungi | in this case the slow transfers were from openmetal to the executors | 16:01 |
| fungi | maybe worth trying to pull a large file from the openmetal mirror to an executor | 16:02 |
| fungi | i need to step away for a moment, but can try that in a few minutes | 16:02 |
| clarkb | https://zuul.opendev.org/t/openstack/build/b95e332cef70428f8aca06889906e394/log/job-output.txt#6106 this says the initramfs file is 302MB so not massive | 16:03 |
| clarkb | (there are a few other things copied too, so we haven't ruled out total file size being huge yet, but it's looking less and less likely that that is the issue) | 16:04 |
| clarkb | I'm getting ~400-500KBps from the mirror to my local machine | 16:09 |
| clarkb | (it is much easier to test that than to constrain the test to the executor(s)) | 16:09 |
| clarkb | which doesn't explain 20 minutes for a ~300MB transfer but does probably point at a problem | 16:10 |
| priteau | fungi: Following up on the {tarballs,releases}.openstack.org issues I mentioned yesterday, I am not sure they are actually related to Cloudflare. What we see in Kayobe CI is an occasional "The handshake operation timed out", which I think would only happen after DNS resolution has already completed. Examples from just earlier today: | 16:10 |
| priteau | https://9646a7fb82b47fbe6288-a22e2178400a1d74c0dfc0d0570ba9cf.ssl.cf2.rackcdn.com/openstack/e30a4c91bd504fb38196e56dfc18b9de/primary/ansible/tenks-deploy | 16:10 |
| priteau | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_97a/openstack/97a23553e11845d08dd4ae3fd923f181/primary/ansible/overcloud-deploy-pre-upgrade | 16:10 |
| priteau | Also, just regular browsing of https://releases.openstack.org/ feels very slow | 16:11 |
| clarkb | priteau: if you link to the logs within the zuul web ui it makes things a lot easier because you can link directly to the lines with the issues and we can more easily navigate to other information like where the job ran etc | 16:11 |
| priteau | Sorry, I don't use this feature often, checking | 16:13 |
| priteau | https://zuul.opendev.org/t/openstack/build/e30a4c91bd504fb38196e56dfc18b9de/log/primary/ansible/tenks-deploy#2463-2468 | 16:13 |
| priteau | https://zuul.opendev.org/t/openstack/build/97a23553e11845d08dd4ae3fd923f181/log/primary/ansible/overcloud-deploy-pre-upgrade#30143-30144 | 16:14 |
| clarkb | that job did not run in openmetal so unlikely to be directly related to the potential network issues there | 16:14 |
| priteau | It only happens occasionally, but often enough to require regular rechecks | 16:15 |
| priteau | I know, we should add retries on our http fetches | 16:15 |
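A minimal sketch of the retry-with-backoff idea priteau mentions here, assuming a Python requests-based client; the function name, parameters, and retry policy are illustrative assumptions, and the actual Kayobe CI tasks may implement retries through a different mechanism entirely.

```python
# Minimal retry-with-backoff sketch, assuming a Python requests client.
# fetch_with_retries and its retry parameters are illustrative assumptions,
# not the mechanism the CI jobs actually use.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_with_retries(url: str, attempts: int = 3, timeout: int = 60) -> bytes:
    session = requests.Session()
    retry = Retry(
        total=attempts,
        backoff_factor=2,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.content

# Example, using one of the URLs discussed just below:
# data = fetch_with_retries("https://releases.openstack.org/constraints/upper/master")
```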
| clarkb | looks like you have retries on the second one | 16:16 |
| priteau | You're right, it failed multiple times then | 16:17 |
| priteau | I am just wondering if the server is overloaded | 16:17 |
| clarkb | the first request is to https://releases.openstack.org/constraints/upper/master which redirects you to https://opendev.org/openstack/requirements/raw/branch/master/upper-constraints.txt The second is to https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz which redirects you to | 16:19 |
| clarkb | https://tarballs.opendev.org/openstack/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz. releases.openstack.org, tarballs.openstack.org and tarballs.opendev.org are all hosted on the same afs backed service (opendev.org is not) | 16:19 |
| clarkb | so it does seem likely that the common issue is in the static file hosting backed by afs. Given that the error reported is an inability to establish an ssl connection, the problem is probably on the frontend (so not an afs problem) | 16:19 |
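As a quick way to reproduce the redirect chains clarkb traces above, a hedged Python sketch (the URLs are the ones quoted in the chat; using `requests` here is an assumed tooling choice, not what was actually run):

```python
# Follow the redirect chains for the two failing URLs to confirm which
# backend ultimately serves them. The URLs are the ones quoted above.
import requests

for url in (
    "https://releases.openstack.org/constraints/upper/master",
    "https://tarballs.openstack.org/ironic-python-agent/tinyipa/files/tinyipa-stable-2025.1.vmlinuz",
):
    resp = requests.head(url, allow_redirects=True, timeout=30)
    hops = [r.url for r in resp.history] + [resp.url]
    print(" -> ".join(hops))

# Per the discussion: releases.openstack.org redirects to opendev.org (a
# separate service), while tarballs.openstack.org redirects to
# tarballs.opendev.org, which sits on the same AFS-backed static host as
# releases.openstack.org itself.
```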
| clarkb | on static.opendev.org we have a server limit of 32 and we have 32 child pids for apache2, so yes, it seems likely we're getting all the server slots filled up, probably by ai crawlers | 16:21 |
| clarkb | we recently-ish bumped the total number of connections on that server and that helped quite a bit. But maybe we should increase the limits further | 16:22 |
| clarkb | ultimately the problem now is that there is an arms race on the internet to collect as much data as quickly as possible, consequences be damned; that is someone else's problem. We're the collateral damage | 16:22 |
| clarkb | in any case server load seems reasonable so I think we can bump up those limits | 16:23 |
| clarkb | then we wait another month until we can't keep up anymore and decide if we can bump limits further. Eventually we can decide if we round robin or load balance across more servers | 16:23 |
| fungi | should we multiply all the tuning values we set in https://review.opendev.org/c/opendev/system-config/+/962973 or just some of them? | 16:26 |
| clarkb | fungi: I'm thinking we multiply just the process and connection limit. I have a change almost ready | 16:28 |
| fungi | ah cool, standing by to review | 16:28 |
| fungi | though i'm going to need to disappear to run errands shortly | 16:28 |
| clarkb | fungi: on the openmetal side of things if you are able to confirm slow data transfer off of the mirror then we can consider simply disabling that cloud for now and sending them an email | 16:28 |
| fungi | yeah, checking | 16:29 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Increase static webserver limits to 4096 connections https://review.opendev.org/c/opendev/system-config/+/967711 | 16:30 |
| priteau | Thanks a lot clarkb, I will report if it helps | 16:32 |
| clarkb | and my thought for only bumping the process and connection limit is that it gives us room for emergency increases via the max thread bump later if we need to | 16:32 |
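For context, a hedged worked example of the Apache event-MPM arithmetic behind this reasoning: only the old 32-process limit and the 4096-connection target come from the chat, and the threads-per-child figure below is an assumed placeholder, not the value in the real config.

```python
# Worked example of the event-MPM relationship behind this tuning:
# MaxRequestWorkers must not exceed ServerLimit * ThreadsPerChild.
# Only the 32-process old limit and the 4096 target come from the chat;
# threads_per_child here is an assumed placeholder.
threads_per_child = 64            # assumption, not the real configured value
old_server_limit = 32             # from the chat: 32 child pids, server limit 32
new_max_request_workers = 4096    # target from change 967711

new_server_limit = new_max_request_workers // threads_per_child
print(f"ServerLimit {old_server_limit} -> {new_server_limit}")

# Leaving ThreadsPerChild alone keeps it available as a later "emergency"
# knob: doubling it would roughly double capacity without another process bump.
print(f"capacity if threads later doubled: {new_server_limit * threads_per_child * 2}")
```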
| fungi | i temporarily created http://mirror.iad3.openmetal.opendev.org/1gb-test.dat as a test file and am retrieving it with wget on ze12 | 16:33 |
| fungi | transfer rate is around 200KB/s but goes up and down quite a bit | 16:34 |
| fungi | just made a smaller 1mb version and average transfer rate fetching it was 277 KB/s | 16:35 |
| clarkb | at 200KB/s we can expect a 302MB file to take almost 1600 seconds to transfer | 16:35 |
| clarkb | which is in line with the timing on one of the two test cases so ya I suspect this is our culprit | 16:36 |
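A quick sanity check of that estimate, using only the numbers quoted earlier in the log (the ~302MB initramfs and the ~200KB/s rate fungi measured from the executor):

```python
# Sanity check of the transfer-time estimate using numbers quoted above:
# a ~302MB initramfs at the ~200KB/s rate observed from ze12.
size_bytes = 302 * 1024 * 1024    # ~302MB initramfs from the job log
rate_bytes_per_s = 200 * 1024     # ~200KB/s wget rate from the executor
seconds = size_bytes / rate_bytes_per_s
print(f"~{seconds:.0f}s (~{seconds / 60:.0f} minutes)")  # ~1546s, about 26 minutes
# That lines up with the ~28 minute UPLOAD_RAW copy in the second failed build.
```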
| fungi | i'll push a change to temporarily turn down that provider | 16:36 |
| opendevreview | Jeremy Stanley proposed opendev/zuul-providers master: Temporarily turn down OpenMetal https://review.opendev.org/c/opendev/zuul-providers/+/967717 | 16:41 |
| opendevreview | Jeremy Stanley proposed opendev/zuul-providers master: Revert "Temporarily turn down OpenMetal" https://review.opendev.org/c/opendev/zuul-providers/+/967718 | 16:41 |
| fungi | okay, heading out to run errands, but will try to make it quick. back soon | 16:42 |
| opendevreview | Merged opendev/zuul-providers master: Temporarily turn down OpenMetal https://review.opendev.org/c/opendev/zuul-providers/+/967717 | 16:44 |
| clarkb | dtantsur: ^ that is the workaround for now. I'm currently drafting an email to that cloud to see if we can correct it properly | 16:45 |
| dtantsur | Thank you! | 16:47 |
| dtantsur | Is it possible to somehow restart the jobs? I'd really hate to hurry a dummy change through the gate to get the images fixed. | 16:47 |
| clarkb | I think the answer is it depends on the jobs and the buildset. We can reenqueue the entire buildset associated with those jobs but that will run all of the jobs in the buildset and if there are issues with idempotency then we shouldn't do that | 16:49 |
| clarkb | I think publish-openstack-python-branch-tarball is only safe if nothing else has merged since | 16:50 |
| clarkb | since we don't want to rollback that data | 16:50 |
| clarkb | I want to say the github replication is safe as it takes the state of the world and pushes it but I'm not certain of that | 16:51 |
| clarkb | https://zuul.opendev.org/t/openstack/buildset/f1db495797004110af2396d07bd18057 is what I'm looking at | 16:51 |
| clarkb | I think in both cases we're safe if nothing else has merged. It becomes trickier if other changes have merged (including to other branches) | 16:51 |
| dtantsur | This was the last thing that merged, yes https://review.opendev.org/q/project:openstack/ironic-python-agent+status:merged | 16:52 |
| dtantsur | and I don't see anything in the gate now | 16:52 |
| clarkb | dtantsur: ack give me a few and I'll reenqueue that run | 16:53 |
| clarkb | I think I might be able to do that if I login as admin in the web ui so I'll try that first | 16:53 |
| clarkb | dtantsur: https://zuul.opendev.org/t/openstack/buildset/f1db495797004110af2396d07bd18057 this is the correct set of failures right? | 16:56 |
| clarkb | dtantsur: I'm about to reenqueue them so want to double check with you that that is the correct state we want to rerun first | 16:56 |
| dtantsur | clarkb: yep, that's it | 16:56 |
| clarkb | dtantsur: ok things should be enqueued again | 16:57 |
| dtantsur | Thanks!! | 16:57 |
| clarkb | you're welcome. Sorry for the trouble | 16:59 |
| clarkb | infra-root my draft email to openmetal https://etherpad.opendev.org/p/Xd6JERl87Kr1zFnl3rHN | 17:18 |
| clarkb | dtantsur: looks like the jobs succeeded this time around | 17:20 |
| dtantsur | Yep, success, thanks again! | 17:21 |
| dtantsur | (I've been waiting for new images to show up on tarballs.o.o, which I guess takes a bit) | 17:21 |
| clarkb | dtantsur: afs publishes every 5 minutes | 17:21 |
| clarkb | using cron, so it's the next 5-minute block + time to publish | 17:22 |
| clarkb | should be done within 10 minutes of job completion typically | 17:22 |
| dtantsur | Hmm, I think it's 20 minutes already | 17:22 |
| clarkb | https://grafana.opendev.org/d/9871b26303/afs?orgId=1&from=now-6h&to=now&timezone=utc the 'project vos release timers' graph captures this | 17:23 |
| clarkb | looks like your larger image data caused a longer vos release (it took 13 minutes) but appears to be done now according to that graph | 17:24 |
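A back-of-the-envelope timeline for why the files could still look stale ~20 minutes after the job finished, combining the 5-minute cron cadence mentioned above with the 13-minute vos release from the graph (a rough sketch, not an exact schedule):

```python
# Back-of-the-envelope publish delay: up to one 5-minute cron interval before
# the publish job starts, plus the ~13 minute vos release from the graph.
worst_case_cron_wait_min = 5   # publish runs from cron every 5 minutes
vos_release_min = 13           # duration reported by the grafana graph above
print(f"worst case: ~{worst_case_cron_wait_min + vos_release_min} minutes after job completion")
# So files still looking stale around the 20 minute mark is consistent with a
# publish that had only just finished.
```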
| dtantsur | I see some of the files (e.g. ipa-debian-master.kernel) from yesterday still, while most are updated | 17:24 |
| dtantsur | strange | 17:24 |
| dtantsur | what do you see in https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/? | 17:24 |
| dtantsur | maybe something caches them on the way to me? | 17:24 |
| clarkb | https://tarballs.opendev.org/openstack/ironic-python-agent/dib/?C=M;O=D they look updated to me there | 17:24 |
| clarkb | two new files from 17:02 UTC today | 17:25 |
| dtantsur | These two, but not the rest | 17:25 |
| dtantsur | checking https://ddbcd145bedf294c8288-f0a55fc4957fe55450e72f1f6d277d79.ssl.cf1.rackcdn.com/openstack/4fdd4e83f2754616826a7bc82adbbba4/job-output.txt, it looks like all files were uploaded to AFS | 17:26 |
| clarkb | ya the subdir contains updated manifests | 17:26 |
| clarkb | https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/ipa-centos9-master.d/ and https://tarballs.opendev.org/openstack/ironic-python-agent/dib/files/ipa-debian-master.d/ are up to date | 17:26 |
| dtantsur | ah, now everything is updated, sorry for the noise | 17:26 |
| clarkb | ack I'm guessing your early checks cached something and it took a few for it to renew the cached data | 17:27 |
| dtantsur | could be. will wait half an hour next time just to be sure | 17:27 |
| fungi | clarkb: minor note on the draft e-mail but lgtm overall, thanks! | 18:04 |
| clarkb | fungi: yup just made the edit that I think you're suggesting. Does that look right? If so I'll go ahead and send this out and cc infra rooters on it | 18:04 |
| fungi | yep, perfect. thanks again! | 18:05 |
| clarkb | email is sent | 18:12 |
| clarkb | oh also I meant to mention that reenqueuing the buildset was super easy via the zuul web ui | 18:15 |
| clarkb | definitely preferable to figuring out the cli command and getting all the input data correct | 18:15 |
| clarkb | looks like the static.o.o connection limit bump change has two +2's I'm going to approve it now | 18:19 |
| clarkb | I'll double check it deploys properly and the server doesn't look sad afterwards | 18:19 |
| fungi | yeah, i've done reenqueue via webui a few times and it's super easy | 18:20 |
| opendevreview | Merged opendev/system-config master: Increase static webserver limits to 4096 connections https://review.opendev.org/c/opendev/system-config/+/967711 | 18:52 |
| clarkb | if apache isn't restarted by the deployment of ^ I will do that as I think increasing this limit requires a proper restart | 18:53 |
| clarkb | the restart was automatic | 18:56 |
| clarkb | and I'm able to reach releases.openstack.org as well as tarballs.openstack.org. I will check back in a bit to see if all of the processes are being used | 18:57 |
| clarkb | so far I've seen us running up to 27 child processes and now we are down to 24. This means we aren't exercising the new limit (which is good as maybe that means we were right on the edge? that would explain why this happens infrequently too) | 19:20 |
| clarkb | ok we're over the old limit now and system load seems fine | 19:40 |
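A rough sketch of the kind of check being described here (counting apache2 children against the configured limit); it assumes a Linux host with /proc and workers named apache2, and the limit value is the one quoted earlier in the log:

```python
# Count running apache2 processes and compare against the configured limit.
# Assumes a Linux host with /proc and workers named "apache2"; the old limit
# of 32 is the value quoted earlier in the log.
import os

def count_procs(name: str) -> int:
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            continue  # process exited while we were scanning
    return count

old_limit = 32  # previous server limit; note the parent process is included in the count
running = count_procs("apache2")
print(f"{running} apache2 processes (old limit was {old_limit})")
```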
| fungi | yeah, checking in and load average on the server is under 1.0 | 20:15 |
| clarkb | tonyb: looking at a calendar, how does December 7 for me and December 8 for you look if we start thinking about the gerrit 3.11 upgrade? I don't want to do November 30 because it's the tail end of a holiday weekend for me | 20:16 |
| clarkb | I'm thinking that should be a decent amount of time to work through testing and all that and we can announce it soonish if we think that is a workable date. Probably ~2100 UTC December 7? | 20:16 |
| fungi | i'm available for that window, and agree it gives us time to prep | 20:37 |
| clarkb | I pulled up the gerrit 3.11 release notes right before lunch and I'm remembering why this is taking so long. There is a lot of information to digest and prepare for. That said I suspect that the actual amount of change is not that bad and probably no worse than any other upgrade | 20:56 |
| clarkb | I also thought about upgrading straight to 3.12 but the reason not to do that is the java 21 transition. It will be easier to do that in place on 3.11 I think | 20:56 |
| dmsimard[m] | hi, just following up, I was firefighting most of the day and didn't get a chance to sit down with Arnaud to talk about the flavors but we plan to talk about it soon, I'll let you know when I have an update | 22:10 |
| clarkb | dmsimard[m]: thanks! | 22:11 |
| clarkb | infra-root I'm slowly getting https://etherpad.opendev.org/p/gerrit-upgrade-3.11 into shape. So far I've gone through the existing known issues and breaking changes and added a couple of newer items from the release notes and have also started evaluating some of them with notes. For others where work needs to be done I'm trying to capture that with explicit TODOs | 22:14 |
| clarkb | but there are still a number of release note entries I need to read through and decide if they need to go on the etherpad or not | 22:15 |
| fungi | clarkb: looks like openmetal replied, but also in testing now i get 85.3 MB/s average where before i was getting orders of magnitude slower transfer rates. i wonder if it could have sped up after we stopped running jobs there, and maybe something about our workload was competing for limited bandwidth? | 22:44 |
| clarkb | fungi: that is an interesting theory. I tested all of this after the change disabling the region landed, but existing running jobs would keep going until complete so we'd have a time period of overlap? | 22:45 |
| fungi | perhaps | 22:46 |
| clarkb | fungi: do you want to respond to them with what you've just found and we can suggest that as a potential cause? Probably reenable it and then monitor from there? | 22:46 |
| fungi | or it could be that we were competing for network infrastructure with another customer there | 22:46 |
| clarkb | ya could also be a different noisy neighbor that is now quiet. Considering the web crawling activity on the internet that wouldn't surprise me at all | 22:46 |
| fungi | yeah, can reply in a few minutes | 22:46 |
| clarkb | thanks. I'm happy to as well, but figured if you've already rerun tests then you're far ahead of me | 22:47 |
| opendevreview | Merged opendev/zuul-providers master: Revert "Temporarily turn down OpenMetal" https://review.opendev.org/c/opendev/zuul-providers/+/967718 | 23:02 |
| fungi | i've replied to ramon | 23:04 |
| fungi | (cc'ing everyone still) | 23:04 |
| clarkb | thanks again | 23:04 |
| fungi | oh, though they won't receive it because openmetal.io is using gmail, so it bounced back to me for not using a corporatey enough mailserver | 23:05 |
| clarkb | I think I've gotten through the entire 3.11 release notes document and have done my best to call out things that need further attention in the etherpad | 23:05 |
| clarkb | fungi: I can respond with your content | 23:05 |
| fungi | thanks | 23:05 |
| clarkb | (if that is the best way to handle this) | 23:05 |
| fungi | wfm, sure | 23:05 |
| clarkb | done | 23:06 |
| clarkb | so now I have a good number of TODO items to look into on the held nodes | 23:06 |
| clarkb | please feel free to look over the release notes and call out anything I missed, or identify issues with the evaluations I've already done on the etherpad, or volunteer to dig into any of these items. That said, I'll be doing my best to look into things myself over the next few days | 23:07 |
| clarkb | 3.11 has a new feature where you can tell it to not automatically run the online reindex upon upgrade. The documentation says this is largely geared at ha sites and managing zero downtime upgrades, but I think we can take advantage of this to make downgrades less painful if we need to do them. Basically upgrade to version N+1 but don't reindex yet. Make sure everything seems to be | 23:11 |
| clarkb | working then manually trigger online reindexing when ready. That way if you find a problem you can downgrade without taking the time to do an offline reindex back to version N's index versions | 23:11 |
| clarkb | 3.11 has no new index versions so this isn't a good version to test that assumption on as it should noop anyway | 23:11 |
| clarkb | but something to keep in mind for the future and I've written a note about it | 23:11 |
| corvus | clarkb: fungi er, i tested it a while ago (some time after clarkb sent email) and got something very slow (slower than fungi). sorry i don't know the time; i didn't think it was interesting enough to mention. | 23:20 |
| clarkb | corvus: thanks, we'll just have to monitor it and see if it comes back | 23:20 |
| clarkb | it is entirely possible something on the internet fell over and got fixed too. Internet connectivity is such fun to debug | 23:21 |
| *** | diablo_rojo_phone is now known as Guest31647 | 23:58 |