frickler | fwiw I've seen that apache restart issue before, but didn't look into a solution yet | 08:12 |
---|---|---|
opendevreview | Merged zuul/zuul-jobs master: Add upload-image-swift role https://review.opendev.org/c/zuul/zuul-jobs/+/944812 | 10:22 |
opendevreview | Merged zuul/zuul-jobs master: Add upload-image-s3 role https://review.opendev.org/c/zuul/zuul-jobs/+/944813 | 10:22 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 11:06 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 11:10 |
opendevreview | Fabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs https://review.opendev.org/c/zuul/zuul-jobs/+/943586 | 11:13 |
mnasiadka | clarkb, fungi : I'd like to thank whoever fixed the aarch64 nodes to be fast again :) | 11:54 |
Clark[m] | mnasiadka we redeployed the mirror and nodepool builder in osuosl and noticed they were quick. We didn't change anything on our end so must've been Ramereth[m] | 13:37 |
fungi | clarkb: the python3.13.2t is a pyenv bug where it's selecting a free-threading build of python because that looks like the highest "version number" | 13:45 |
Clark[m] | Ya I remembered you pushed a workaround to zuul-jobs that I thought had merged (it hasn't). I need to depends on that and revert my python version selector | 13:47 |
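A minimal sketch of the version-selection pitfall mentioned above (hypothetical version strings, not pyenv's actual code): a plain lexicographic sort ranks the free-threaded build "3.13.2t" above "3.13.2", so a naive "pick the highest" selection grabs it:

```python
# Hypothetical illustration, not pyenv's real logic: sorting version
# strings lexicographically makes "3.13.2t" (free-threaded) sort last,
# so a naive "pick the highest" selection chooses it over "3.13.2".
versions = ["3.12.9", "3.13.2", "3.13.2t"]
highest = sorted(versions)[-1]
print(highest)  # → 3.13.2t
```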
corvus | looks like we have zuul-launcher image builds for all the clouds: https://zuul.opendev.org/t/zuul/image/debian-bullseye | 14:40 |
corvus | i think that page could probably use a column showing the format for each of the build artifacts. but reading between the lines, it looks like ovh and raxflex use qcow2 and vexxhost and openmetal use raw? does that sound right? | 14:41 |
corvus | then we have two builds for each of those (current and previous) | 14:41 |
clarkb | corvus: looking at our nodepool builder config our base default is qcow2 so then only clouds that override that would get raw or vhd | 14:42 |
clarkb | so I think that is about right. vexxhost and openmetal use ceph to back the disks and ceph doesn't play as well with qcow2 as it does with raw | 14:43 |
clarkb | doubling up on the copy on write stuff doesn't work | 14:43 |
clarkb | so ya that lgtm | 14:43 |
corvus | ah that's why. i wondered. :) | 14:43 |
clarkb | infra-root how do we feel about upgrading gitea today to the latest bugfix release: https://review.opendev.org/c/opendev/system-config/+/945414 fungi has +2'd the change and screenshots lgtm : https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ed2/openstack/ed2d3a106d304d57ace4a9c901c12bac/bridge99.opendev.org/screenshots/ | 14:59 |
clarkb | the main consideration is probably the openstack release next week | 14:59 |
clarkb | I'm getting my morning started but I expect to be around until thunderstorms knock out my power (and I hope that doesn't happen). But that isn't expected until this afternoon so I think I'm good with it | 15:04 |
fungi | this seems like a fine time for a gitea bugfix point release upgrade | 15:05 |
fungi | i'm around to help test | 15:05 |
clarkb | ok I'll give it a few more minutes for feedback then can hit +A | 15:07 |
clarkb | I approved it | 15:15 |
frickler | a week until the openstack release should be fine to find any possible issues and fix them if needed, but I agree the risk seems low, too | 15:28 |
clarkb | ya if this was an upgrade to 1.24 I would wait | 15:30 |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Cap max-servers for rax clouds https://review.opendev.org/c/openstack/project-config/+/945623 | 15:30 |
clarkb | but 1.23.5 to 1.23.6 should be much safer | 15:30 |
frickler | infra-root: ^^ that's my proposal to see if it helps with the timeouts I mentioned yesterday, let me know what you think | 15:30 |
clarkb | +2 from me | 15:31 |
clarkb | seems like an easy safe experiment | 15:31 |
clarkb | frickler: my zuul python 3.13 test job is very slowly running pip install for requirements. I jumped on that test node 104.130.253.199 and it is almost idle. The problem seems to be with io to/from the mirror? | 15:38 |
clarkb | frickler: might be worth checking if the mirror is being overwhelmed or has some other issues. In the past we noticed similar slowness which was mitigated by switching to the internal rax network interfaces. But maybe now those internal interfaces are having problems. I don't know just noting what I observe | 15:39 |
clarkb | of course reducing total test nodes would reduce demand on the mirrors if they are overwhelmed that would still help | 15:40 |
fungi | could be ai crawlers | 15:40 |
clarkb | ha yes why didn't I consider that | 15:42 |
Ramereth[m] | <Clark[m]> "mnasiadka we redeployed the..." <- I did make some changes in the backend storage, glad you noticed an improvement! Can you say how much of an improvement it was? | 15:47 |
mnasiadka | Ramereth[m]: previously kolla build jobs timed out after 2.5/3 hours - now they finish in like 30-40 minutes :) | 15:47 |
Ramereth[m] | sweet, are they using ephemeral disks for the VMs? | 15:48 |
clarkb | Ramereth[m]: we're using the opendev.large flavor, however that boots | 15:50 |
clarkb | for our image builder node we use a persistent volume and builds went from several hours to around an hour | 15:50 |
Ramereth[m] | clarkb: yeah, so I added additional SSD nodes to Ceph and migrated that pool over to SSD. I also reduced the replication from 3 to 2 to save the space needed to use SSDs | 15:51 |
clarkb | it seems to have helped | 15:52 |
clarkb | re mirror.dfw system load is low and tailing some access logs I see lots of what I expect are normal pip requests | 15:53 |
clarkb | dnf is requesting centos 9 packages over port 80 and not 443 | 15:53 |
Ramereth[m] | clarkb: btw did something happen to nl01.opendev.org? I had a nagios check to see when we had any images we needed to manually cleanup but it's no longer working | 15:54 |
clarkb | Ramereth[m]: oh sorry I deleted it and nl05.opendev.org took its place | 15:55 |
clarkb | it didn't occur to me that external users may be using those in automated checks. But if you just update the hostname in the fqdn it should go back to working | 15:55 |
clarkb | (I've been working through server replacements to update their base OSes) | 15:55 |
clarkb | doing a bigger tail -f dragnet on mirror.dfw I don't see anything that looks like obvious ai web crawler traffic | 15:56 |
Ramereth[m] | <clarkb> "Ramereth: oh sorry I deleted..." <- Confirmed this is working again, thanks! | 16:02 |
clarkb | and good luck with the storms today | 16:06 |
Ramereth[m] | Yeah, I hope it won't be bad | 16:07 |
clarkb | I managed to get half the garage cleared out so the car can move inside today. | 16:07 |
frickler | clarkb: yes, I suspected either network or disk io. for kolla the timeouts mostly happen while pulling container images via the mirror node | 16:23 |
opendevreview | Thierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting one hour earlier https://review.opendev.org/c/opendev/irc-meetings/+/945633 | 16:36 |
clarkb | one thought I had is maybe afs is slow but load average on the afs server looks ok and we see slowness pulling from pypi too which doesn't involve afs | 17:25 |
clarkb | so ya this is kinda pointing at some sort of io (network and/or disk) slowness in rax I think | 17:26 |
fungi | container images aren't coming from afs either right? | 17:26 |
clarkb | no. Only distro packages | 17:26 |
*** iurygregory_ is now known as iurygregory | 17:26 |
clarkb | I think it is possible that afs is impacted by the same io issues but I suspect that it isn't the cause | 17:26 |
clarkb | I think we should consider frickler's change to dial back use of rax and see if lower demand helps on the whole | 17:27 |
fungi | yeah, but as i said yesterday this might also be an opportunity to ask about dialling up use of flex at the same time, essentially starting to shift capacity in that direction | 17:31 |
clarkb | fwiw I did ask jamesdenton a little while ago and he said he would check. But I haven't heard back on that yet, so I don't want to push the issue | 17:32 |
fungi | sure, that's totally fair | 17:33 |
opendevreview | Merged opendev/system-config master: Update to Gitea 1.23.6 https://review.opendev.org/c/opendev/system-config/+/945414 | 17:33 |
fungi | that ^ should start deploying straight away | 17:33 |
clarkb | gitea09 is in the process of restarting now | 17:37 |
fungi | yeah, i just got the "service unavailable" page for it | 17:37 |
fungi | and it's back online now | 17:38 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ confirmed | 17:38 |
fungi | Powered by Gitea Version: v1.23.6 | 17:38 |
clarkb | git clone works for me too | 17:38 |
clarkb | 10 is upgraded now too | 17:40 |
clarkb | jamesdenton: re the conversation above we've seen jobs in rax classic (particularly dfw but maybe in iad and ord) become very slow. Initial investigation appears to show most of the time "lost" is performing network IO to grab things like python packages through our local caching proxy to pypi.org. But also through the cache proxy to quay (I think kolla is using quay now and not docker hub) | 17:43 |
clarkb | jamesdenton: not sure if old rax is something you have insight into, but figured we'd pass that along. | 17:43 |
fungi | deploy succeeded | 17:48 |
fungi | all backends should be on 1.23.6 now | 17:48 |
clarkb | they appear to be from my checks | 17:49 |
clarkb | I guess the last thing to check is replication. I'm looking for a new patchset that has been pushed since 17:48 to check with | 17:56 |
jamesdenton | thanks clarkb. We are also seeing a few reports of slowness in another application, so it's on my list. In terms of bumping quota for Flex, i will start a thread for that | 17:56 |
clarkb | jamesdenton: thanks. Let us know if we can help with the debugging. | 17:57 |
clarkb | no new patchsets yet /me looks to see if any changes need a new patchset | 17:58 |
jamesdenton | clarkb are your resources primary in DFW? | 18:00 |
clarkb | jamesdenton: we notice most of the slowness in CI jobs which are in dfw, iad and ord. But many of our control plane services do run in dfw primarily including zuul and nodepool that coordinate all of the test resources | 18:01 |
jamesdenton | gotcha | 18:01 |
jamesdenton | clarkb if you come across one again, let me know if it's possible to log in and troubleshoot | 18:03 |
clarkb | jamesdenton: sure let me see if I can find one | 18:03 |
clarkb | jamesdenton_: I asked nodepool to hold (not delete) the nodes for https://zuul.opendev.org/t/openstack/stream/60e83b5589fb47efa47e6d819debbe76?logfile=console.log but it will only do so if the job fails. If you open up that link you'll see it taking almost 2 minutes to download a 41MB package | 18:12 |
clarkb | of course the problem may be elsewhere still but starting with what we know and trying to work from there | 18:13 |
clarkb | the server is named np0040280911 with ip address 23.253.108.101 | 18:13 |
clarkb | I guess I can hop on the server and run a speedtest when the job is done assuming it fails | 18:13 |
clarkb | and try to see if we can isolate the slowness | 18:14 |
fungi | i suppose you could sabotage the running build carefully to trigger the autohold while causing zuul to retry it | 18:16 |
fungi | maybe surgically kill an ssh connection from the executor or something | 18:17 |
clarkb | I thought about that but the job is running at the top of the gate | 18:17 |
clarkb | another option would be to just boot a test node manually | 18:17 |
jamesdenton_ | i can boot a node, no problem | 18:18 |
clarkb | anyway I don't want to force that one to fail if it will otherwise succeed | 18:18 |
jamesdenton_ | i will add this to the list :) | 18:18 |
fungi | got it | 18:18 |
fungi | though if you can cause a retry that shouldn't reset the gate queue, just delay it | 18:19 |
clarkb | true. I guess the trick is in being confident you'll cause a retry and not a failure | 18:19 |
jamesdenton_alt | 2025-03-26 17:59:11.511852 | compute1 | Fetched 166 MB in 8min 18s (333 kB/s) | 18:19 |
jamesdenton_alt | i'm guessing that's not reasonable? | 18:19 |
jamesdenton_alt | my irc bouncer is bouncing, sorry. | 18:20 |
clarkb | ya I think the slowness is those fetches. At first I suspected that openafs may be the problem (those packages are on an openafs filesystem also hosted in dfw) but it appears that fetches for pypi packages through our mirror are also slow and that is a pass-through caching proxy to pypi.org | 18:20 |
clarkb | it could be that the mirror itself is a problem but system load there was reasonable and my checks for ai web crawlers didn't show that being a problem | 18:21 |
clarkb | but I suppose if the mirror specifically had networking io trouble that would be felt across many/most/all of the tests | 18:21 |
clarkb | the mirror is mirror02.dfw.rax.opendev.org | 18:22 |
fungi | dmesg also doesn't indicate any recent incidents on it | 18:22 |
jamesdenton_alt | and that's a mirror you manage? 10.208.224.195? | 18:23 |
fungi | correct, job nodes are communicating with it across the 10.x network for better performance | 18:23 |
jamesdenton_alt | the 10.x "servicenet" network? | 18:24 |
fungi | but it's also globally reachable at 104.130.140.186 | 18:24 |
clarkb | though mirror-int.dfw.rax.opendev.org isn't resolving for me right now | 18:24 |
clarkb | I thought we put that in public dns | 18:24 |
fungi | resolves for me | 18:24 |
jamesdenton_alt | i resolved to a cname - mirror02-int.dfw.rax.opendev.org. | 18:24 |
fungi | $ host mirror-int.dfw.rax.opendev.org | 18:24 |
fungi | mirror-int.dfw.rax.opendev.org is an alias for mirror02-int.dfw.rax.opendev.org. | 18:24 |
fungi | mirror02-int.dfw.rax.opendev.org has address 10.208.224.195 | 18:25 |
clarkb | ya this must be a local dns problem | 18:25 |
clarkb | if I dig against google it resolves | 18:25 |
fungi | hopefully we haven't broken dnssec signatures for that | 18:25 |
clarkb | but ya that is a node that lives in dfw to serve as a cache for resources that test jobs consume. It caches pypi packages, container image layers, and distro packages | 18:25 |
fungi | though if we had, i shouldn't be able to resolve it either | 18:25 |
clarkb | ya I think it's more likely that my over complicated ad blocking rules are tripping on it for some reason | 18:26 |
jamesdenton_alt | i wonder if things improve over that public interface vs snet interface... hmm | 18:27 |
clarkb | jamesdenton_alt: the reason we use the internal network is we had similar problems a few years ago at this point over the public network | 18:27 |
clarkb | switching to the private network addressed it | 18:27 |
jamesdenton_alt | of course. lol | 18:27 |
clarkb | but its possible the other way around is true now | 18:27 |
fungi | i think it was qos forcing different limits on the global side | 18:28 |
jamesdenton_alt | possible | 18:28 |
clarkb | on that mirror if I wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img I get 28.9MB/s | 18:28 |
fungi | the flavor-based network rate limiting was a little janky, if memory serves | 18:28 |
clarkb | that would go over the public network | 18:28 |
clarkb | and seem to be much quicker | 18:29 |
clarkb | so ya maybe something to do with the private networking? | 18:29 |
fungi | we should be able to test across the 10.x address from one of our control-plane servers in that region | 18:29 |
jamesdenton_alt | yeah, it's possible there's an issue with the servicenet network, somewhere | 18:29 |
clarkb | reminds me of that time infracloud's switch became a hub | 18:30 |
jamesdenton_alt | that sounds like a bad time. | 18:31 |
clarkb | let me try something like scp'ing that image file over the service net between rax-dfw nodes | 18:31 |
clarkb | I'm downloading it again to get more datapoint first | 18:32 |
clarkb | second time was 29.0MB/s | 18:32 |
clarkb | so that is consistent | 18:32 |
clarkb | currently getting just above 400KBps | 18:33 |
fungi | ideally try to grab something via https, since traffic shaping might impact throughput differently for each protocol | 18:34 |
clarkb | sure but we already have data that that is slow | 18:34 |
clarkb | and this way I can show that it isn't our http service | 18:34 |
fungi | oh, right | 18:34 |
clarkb | now down to 200KB/s or so | 18:34 |
clarkb | so ya this seems like an internal network is super slow for some reason problem. | 18:35 |
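For scale, the rates quoted above (roughly 29MB/s via the public interface versus around 0.4MB/s over servicenet at that moment) differ by about 70x; a quick sanity check of those figures:

```python
# Rough ratio of the observed transfer rates (numbers taken from the log
# above; the internal figure fluctuated between ~0.2 and ~0.4 MB/s).
public_mb_s = 29.0    # wget of the noble cloud image over the public interface
internal_mb_s = 0.4   # scp over the 10.x servicenet address
print(round(public_mb_s / internal_mb_s))  # → 72 (public path ~72x faster)
```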
fungi | though we do have other webservers listening on 10.x in that region too | 18:35 |
fungi | but if you can reproduce it with scp too then good enough | 18:35 |
jamesdenton_alt | is this between the mirror and other hosts? | 18:35 |
clarkb | I'm running the scp from paste.opendev.org to mirror.dfw.rax.opendev.org over the internal network address | 18:35 |
fungi | so yes | 18:36 |
fungi | might also try between e.g. paste and codesearch or something, to see if it's impacting other hosts besides mirror.dfw.rax | 18:37 |
clarkb | ack let me cancel this download as it seems we have enough data here so far | 18:37 |
clarkb | codesarch got 33MB/s pulling from ubuntu | 18:38 |
clarkb | paste scp'ing from codesearch got 85.9MB/s | 18:39 |
clarkb | so seems more specific to the mirror node (or maybe its network etc) | 18:39 |
fungi | so one possibility (cacti is not easy to check at the moment) is we're simply maxing out the bandwidth quota for the internal interface of the mirror server | 18:42 |
fungi | another is maybe noisy neighbor on the same compute host choking the physical interface (not sure what the wired topology looks like there so hard to guess) | 18:43 |
clarkb | or something occurring at the switch level too | 18:43 |
clarkb | fetching things over the public interface is speedy so I don't think the server itself is sad. This seems isolated to that specific network | 18:43 |
fungi | but given that the prior observations were this only occurred at times of peak utilization for job resources, that would fit with interface rate limits | 18:44 |
clarkb | I went ahead and deleted the noble image from all three nodes so that it's not consuming disk space unnecessarily | 18:44 |
clarkb | it's also possible we only notice during those times because there are so many more jobs running during those periods | 18:45 |
fungi | maybe this has only become an issue lately because some jobs have started to pull more content through our mirrors | 18:45 |
fungi | and that's started to push us over some rate limit | 18:45 |
fungi | i'll see if i can get a port forward up to check cacti | 18:46 |
jamesdenton_alt | clarkb fungi i will dig around and see what qos policies might be applied to servicenet interfaces, and if i can find the hypervisor, i'll check the stats | 18:48 |
fungi | thanks jamesdenton_alt! | 18:48 |
clarkb | ++ | 18:48 |
fungi | argh, i think my ff may dislike the ssl options on the cacti server now | 18:53 |
clarkb | the unfortunate thing with network trouble like this it can be so many things. Maybe interfaces renegotiated at 10mbps for some reason. Or switch is acting like a hub because arp caches are full. Or qos rules are unexpectedly taking effect. Or we've got significant packet loss impacting window sizes so they never grow beyond a very minimal value | 18:53 |
fungi | oh, conveniently, the cacti server doesn't redirect http to https | 18:54 |
clarkb | fun story. In a previous life I was a network person and some team couldn't figure out why microsoft ftp server was terribly slow. Turns out that the implementation at the time used an incredibly small window size which meant any downloads beyond about half a ms of rtt would always be terribly slow | 18:54 |
clarkb | that was a really fun one to debug because testing it in the datacenter it was always fine | 18:55 |
clarkb | the workaround was you could edit the registry to increase the window size because of course | 18:56 |
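The FTP story above is the bandwidth-delay product in action: with a fixed receive window, achievable TCP throughput is capped at roughly window/RTT, so a tiny window is fine in the datacenter but crippling over a WAN. A sketch with hypothetical window and RTT values:

```python
# Back-of-the-envelope bandwidth-delay product: max TCP throughput with a
# fixed receive window is about window_bytes / rtt. Values are hypothetical.
def max_throughput_mbps(window_bytes, rtt_ms):
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000

print(round(max_throughput_mbps(8_192, 0.5), 1))  # → 131.1 Mbps at 0.5 ms RTT
print(round(max_throughput_mbps(8_192, 50), 2))   # → 1.31 Mbps at 50 ms RTT
```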
fungi | eth1 traffic does look like it's topping out in weird ways around 200Mbps | 18:56 |
clarkb | and historically I'm guessing it didn't do that? | 18:56 |
clarkb | either beacuse we were always under that limit or because we were well over it? | 18:57 |
clarkb | I need to pop out for lunch now. Back in a bit | 18:57 |
fungi | trying to get to the longer-term graphs | 18:57 |
fungi | here we go: https://fungi.yuggoth.org/tmp/dfw-mirror-traffic.png | 19:11 |
fungi | so for the past ~2 months we've been seeing about double the bandwidth utilization as we did all last year | 19:11 |
Clark[m] | My tails of Apache logs didn't show anything that stood out to me as abnormal or wrong but there is a lot of it which may be worth checking. Another thought is I got 85MB/s between nodes on private networks which is much greater than 200mbps | 19:14 |
Clark[m] | It could be possible that whatever problem here has us keeping connections open longer so that snmp records it? Whereas before we'd have shorter much smaller blips? | 19:14 |
fungi | well, again, this is specifically on the internal 10.x interface (eth1) so is unlikely to be a scraper/bot | 19:15 |
Clark[m] | 85 * 8 / 200 is about 3.4x more throughput | 19:15 |
Clark[m] | True | 19:15 |
fungi | my money's on jobs pulling more container images or other large files from the cache | 19:16 |
Clark[m] | Anyway 200mbps seems low to me as a cap given the speed I got between codesearch and paste | 19:16 |
fungi | maybe in response to dockerhub issues | 19:16 |
Clark[m] | So even if there is more data passing here the cap seems unusually low | 19:16 |
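The arithmetic behind that comparison (85MB/s measured between codesearch and paste, against the apparent 200Mbps ceiling on the mirror's eth1):

```python
# Unit check for the comparison above: convert the measured scp rate in
# MB/s to Mbps and compare against the apparent eth1 ceiling.
scp_mb_per_s = 85          # codesearch -> paste over the private network
scp_mbps = scp_mb_per_s * 8
eth1_cap_mbps = 200        # apparent ceiling seen in the cacti graphs
print(scp_mbps, round(scp_mbps / eth1_cap_mbps, 1))  # → 680 3.4
```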
fungi | some rate-limiting mechanisms have a bit of a delay to kick in, too | 19:17 |
fungi | though the server is using performance1-8 which claims to have rxtx_factor=1600.0 | 19:17 |
fungi | which i think is supposed to be Mbps? | 19:18 |
fungi | but maybe it doesn't apply the same limits on the internal network either | 19:18 |
Clark[m] | We can drop the use of -int interfaces and see what happens | 19:19 |
jamesdenton_ | mind PM'ing me the UUID of the VM? | 19:19 |
fungi | even adding the eth0 and eth1 inbound+outbound traffic together it's nowhere near 1600Mbps. maybe 20% of it at most | 19:20 |
jamesdenton_ | thanks fungi | 19:21 |
Clark[m] | 200mbps is suspiciously similar to lacp over two 100mbps links. But I have no knowledge of the interface setup on the hypervisors | 19:23 |
fungi | well, the graphs do show peaking over 200 and then getting squeezed back down, which is typical behavior for some traffic shapers | 19:24 |
fungi | there's often a bit of a delay before the rate limiting kicks in | 19:25 |
fungi | oh, i already said that | 19:25 |
fungi | anyway, if it was a 2x100 bond or something i'd expect it to look more like a hard limit | 19:25 |
fungi | which this doesn't | 19:26 |
opendevreview | Clark Boylan proposed openstack/project-config master: Stop using the -int interface on rax mirrors https://review.opendev.org/c/openstack/project-config/+/945651 | 19:51 |
clarkb | that's maybe the nuclear option. We could do a setup where only dfw doesn't use -int | 19:51 |
clarkb | but I figure consistency is a good thing if we want to make the change so went with the rip it out option instead | 19:51 |
corvus | clarkb: the traffic graphs should be based on interface byte counters, so tcp session duration shouldn't affect it (they should be counting everything). | 19:57 |
clarkb | ah so that value is the 5 minute average for the interface | 19:58 |
fungi | right, snmp checks the interface to see the bytes tx and rx totals, then 5 minutes later it checks them again and subtracts the previous values | 19:58 |
clarkb | and not a point in time | 19:58 |
fungi | so yes effectively a ~5-minute mean (there's a bit of jitter due to snmp request timing, but not enough to be more than statistical noise) | 19:59 |
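A minimal sketch of how those 5-minute counter deltas become the Mbps values on the graphs (counter wrap handling omitted; the example numbers are illustrative):

```python
# Convert two SNMP interface byte-counter samples into the ~5-minute mean
# bit rate that cacti plots. Real pollers also handle 32/64-bit counter wrap.
def mean_mbps(octets_t0, octets_t1, interval_s=300):
    return (octets_t1 - octets_t0) * 8 / interval_s / 1_000_000

# 7.5 GB moved in one 5-minute window averages out to 200 Mbps
print(mean_mbps(0, 7_500_000_000))  # → 200.0
```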
fungi | trying to infer rate limiters from such readings is sort of a gut feel thing, because the mean represents the minimum possible for a spike within that timeframe, so you have to assume there were probably points between where it went higher, though you can't really know by how much | 20:02 |
fungi | in some cases rate limiters exhibit a sort of hysteresis, if the period is longer than the sample duration it can even be visible as a harmonic | 20:04 |
clarkb | I think our options are: see if jamesdenton_ finds anything wrong with the network, land https://review.opendev.org/c/openstack/project-config/+/945651 and use a different network, or reduce the total test nodes to reduce overall demand, or some combo of the three | 20:04 |
clarkb | I did briefly consider multiple mirrors using A/AAAA dns round robins too | 20:04 |
clarkb | but that seems like a lot of effort for now. That may become an option in the future though | 20:04 |
fungi | that would probably work well enough too, yeah | 20:04 |
clarkb | I do think that letting things run in sad mode is not ideal so at the very least we should probably land https://review.opendev.org/c/openstack/project-config/+/945623 | 20:05 |
clarkb | and if that doesn't help proceed with the switch to public interfaces? | 20:06 |
clarkb | or just do both and send it | 20:06 |
fungi | for now, i've approved 945623 to temporarily lower our use of rackspace classic | 20:09 |
fungi | if this is bandwidth contention, it will likely have the desired effect | 20:10 |
corvus | heh i would have started with 651 | 20:16 |
corvus | because if 651 fixes it, then we're in an optimal state. if 623 fixes it, we're suboptimal and we still have work to do (ie, land 651) | 20:16 |
opendevreview | Merged openstack/project-config master: Cap max-servers for rax clouds https://review.opendev.org/c/openstack/project-config/+/945623 | 20:17 |
corvus | not a problem, just that we're guaranteed to need another patch after 623 | 20:17 |
fungi | my main concern is that 945651 could make things worse | 20:20 |
fungi | given we don't yet know the cause | 20:20 |
fungi | whereas with 945623 we can at least pretty confidently guess a minimum bound on how much worse that will be | 20:21 |
fungi | er, maximum bound | 20:22 |
clarkb | accessbot, matrix eavesdrop, and limnoria are all up next for python 3.12. Have a preference for which one goes first? maybe matrix eavesdrop? I think the impact there is pretty low since matrix has logs too | 20:23 |
fungi | and per my review comment, it could actually improve throughput if it results in fewer job timeouts and rechecks | 20:23 |
fungi | accessbot should have the least impact, it's not really continuously used | 20:24 |
fungi | it really only matters if someone adds/changes its configs | 20:25 |
fungi | maybe after it updates, we could solicit another chanop volunteer and test adding them | 20:26 |
clarkb | wfm I can +A that change now | 20:26 |
clarkb | done | 20:26 |
corvus | fungi: i don't think lowering overall capacity is acceptable long-term. it may improve throughput on one project but at the cost of others. | 20:28 |
corvus | so what's the next step after observing 623? if it makes things better, i don't think we can leave it there. we need to find a better fix, so i'd propose that if it does make things better, we revert it and see if 651 fixes things. | 20:30 |
corvus | (and then, if 651 fails to make things as good as 623, we put 623 back and regroup) | 20:30 |
clarkb | that sounds reasonable to me | 20:30 |
fungi | well, my hope was that we hear back as to what the actual cause of the problem is, and then either restore the missing capacity in those regions or get equivalent capacity in flex | 20:32 |
corvus | an external fix would be ideal, yes | 20:33 |
fungi | if it really is some sort of aggregate network capacity limit we're running up against, then either it can be fixed within the provider or it can't, and if it can't then we may need to look at other options for reducing our bandwidth utilization for that mirror (clarkb's suggestion of dns round-robin between two mirror servers in each region seems like a reasonable solution there) | 20:34 |
fungi | my current supposition as to why throughput on eth0 is good and throughput on eth1 is bad is that the graphs show comparatively little contention for eth0 at the moment, but if we shift all our traffic from eth1 to eth0 then i see a few possible outcomes: | 20:36 |
fungi | 1. the rate limit on eth0 is the same as it is on eth1... this will make matters worse because now we're adding the already existing public traffic talking to pypi, dockerhub, etc together with the traffic to test nodes | 20:37 |
fungi | 2. the rate limit is shared between eth1 and eth0... this will neither improve nor worsen performance (but it seems unlikely since under that model we should be seeing performance issues for eth0 already and aren't) | 20:38 |
clarkb | I suspect that we won't make matters much worse if measured from a job success rate standpoint | 20:40 |
fungi | 3. the rate limit on eth0 is higher than it is on eth1 (or whatever performance issue impacting the host isn't affecting the physical path eth0 relies on)... this is the only scenario where i would expect 945651 to help | 20:40 |
clarkb | would we prefer I update the project-config change to only modify dfw to reduce the blast radius? | 20:41 |
corvus | (i'm assuming there's a 4: eth0 is lower than eth1, where 651 makes thing worse as well) | 20:41 |
clarkb | I guess another option is to boot a new mirror and switch the current record to it alone | 20:42 |
fungi | yeah, i mean there's a long tail of scenarios where things could get worse, i didn't want to enumerate them all ;) | 20:42 |
clarkb | if the problem is with that specific hypervisor we have a good chance of landing on a different hypervisor and the new mirror would be happier | 20:42 |
corvus | i agree, i think 2 is unlikely and we can put it at the bottom of our list for when we've exhausted other hypotheses... | 20:42 |
fungi | i can reestablish my port forward to cacti and see if iad and ord are exhibiting a similar traffic pattern on eth1 | 20:42 |
clarkb | fungi: that would be helpful | 20:43 |
corvus | maybe i should port-forward and take a look too | 20:43 |
fungi | iad looks a lot healthier by comparison | 20:44 |
fungi | i see traffic spiking more naturally and going up as high as 800Mbps at one point today | 20:44 |
corvus | have we only noticed job problems in dfw? | 20:45 |
clarkb | dfw jobs are definitely where I've noticed it | 20:45 |
corvus | ord is spikey up to 500Mbps | 20:46 |
fungi | ord also looks more like iad, with traffic spiking up as high as 700Mbps earlier in the week and 600Mbps when daily periodic jobs kicked off | 20:46 |
corvus | i like the idea of booting a new mirror, in dfw only, seeing if it behaves better, and if so, switching to it, and if not, then round-robining with the existing one | 20:47 |
fungi | i've been going off anecdotes and frickler's observations so far, it sounded like he thought all of rax classic was impacted, but if the anomalous eth1 traffic limiting i'm seeing on the graphs is tied to the cause then i would only expect dfw to be impacted | 20:47 |
corvus | because unless someone says "oh ord and iad are failing jobs too" then this is looking like a dfw issue | 20:47 |
corvus | all right | 20:49 |
corvus | so what jobs are failing? | 20:49 |
corvus | i need to know what to look up in zuul so i can figure out if this is dfw only or hits other regions | 20:49 |
clarkb | kolla jobs, I think I had a zuul unittest job timeout. The job I put a hold on (not sure if it failed but I'll check) was a cinder grenade job | 20:49 |
corvus | https://zuul.opendev.org/t/openstack/builds?project=openstack%2Fkolla&result=TIMED_OUT&skip=0 | 20:50 |
corvus | something like that? | 20:50 |
corvus | oh but maybe only tox jobs? | 20:50 |
clarkb | ya I think this expresses itself as timeouts as pip install and apt-get install etc are what are particularly slow | 20:51 |
fungi | overall the evidence i've seen so far was very slow download times in dfw jobs pulling content from dfw, probably jobs with low timeouts are more likely to get nudged over the edge from that | 20:51 |
clarkb | the cinder job I put a hold on succeeded. It took 2 hours. | 20:51 |
clarkb | fungi: yup | 20:51 |
fungi | er, pulling content from the dfw mirror | 20:51 |
fungi | in jobs with high timeouts due to expectation of longer runtimes for their tests, slow package downloads are more likely to be noise in the total duration | 20:52 |
fungi | whereas for normally quick jobs they represent a more outsized proportion of the overall job runtime | 20:53 |
corvus | okay, if we look at all the kolla jobs, it's not helpful. the distribution of timeouts looks like our cloud distribution. so i'll try looking just at tox jobs that timed out | 20:54 |
fungi | though very quick jobs may show less incidence courtesy of us having a comparatively long default timeout too, so they can probably absorb slow package downloads | 20:54 |
corvus | okay i don't know what to look for | 20:55 |
fungi | i'd wager tox-based unit tests are in the sweet spot of being more likely to hover on the verge of their configured timeouts already | 20:55 |
fungi | whereas linter jobs probably have so much headroom in the default timeout that they don't exceed it even with this problem | 20:55 |
clarkb | I suspect the cacti data is enough info to focus on dfw | 20:56 |
clarkb | and not overthink the analysis of which locations are affected | 20:56 |
clarkb | we know dfw is affected. This seems reflected through job logs, direct testing, and cacti | 20:56 |
fungi | https://zuul.opendev.org/t/openstack/builds?result=TIMED_OUT&skip=0 does show quite a few timed_out unit test jobs | 20:57 |
corvus | there are no openstack-tox-py312 timeouts in openstack/kolla this year | 20:57 |
clarkb | using cacti we can rule out the other two but have not done so with the other two criteria | 20:57 |
clarkb | given that I'm happy to update my change to select mirrors to only change things for dfw and then monitor. I'm also happy to spin up a new mirror in dfw under the assumption the old mirror is in a sad hardware state somehow and a new one is less likely to end up in the same situation | 20:59 |
fungi | an openstack-tox-py312 example i just pulled up for neutron ran in ord: https://zuul.opendev.org/t/openstack/build/3ae2224e82eb49a3b41584f719dfe1f3 | 20:59 |
fungi | random sample, maybe unrelated | 20:59 |
corvus | fungi: yeah, good idea -- though i see basically everything except dfw in that list. based on that, i'd say we need to look at bhs1 | 20:59 |
clarkb | fungi: https://zuul.opendev.org/t/openstack/build/3ae2224e82eb49a3b41584f719dfe1f3/log/job-output.txt#344-345 it only took a few seconds to install packages though | 20:59 |
corvus | so i'm leaning toward what clarkb said -- let's just presume this is a dfw problem based on the cacti graphs, and if someone can come up with a better way of identifying affected jobs so we can confirm the actual affected regions, that would be great. | 21:00 |
clarkb | https://zuul.opendev.org/t/openstack/build/3ae2224e82eb49a3b41584f719dfe1f3/log/job-output.txt#545-642 and there | 21:00 |
clarkb | oh wait the second link needs to have a longer selection | 21:01 |
clarkb | but it's on the order of half a minute total not half a minute per package | 21:01 |
opendevreview | James E. Blair proposed openstack/project-config master: Restore max-servers in rax-ord and rax-iad https://review.opendev.org/c/openstack/project-config/+/945654 | 21:02 |
corvus | i think we should do that ^ and boot a new mirror server in dfw | 21:02 |
clarkb | +2 on that change from me. I can start spinning up the mirror unless someone else wants to | 21:02 |
fungi | already approved | 21:03 |
clarkb | I have lots of recent practice with launching nodes | 21:03 |
fungi | the second timed_out build i selected at random was in rax-dfw (functional testing for cinder): https://zuul.opendev.org/t/openstack/build/7cfc840183b5421a8db60a05c6dae8c1 | 21:04 |
clarkb | https://zuul.opendev.org/t/openstack/build/7cfc840183b5421a8db60a05c6dae8c1/log/job-output.txt#371-372 note the time difference to the first example | 21:04 |
clarkb | that took almost 12.5 minutes | 21:04 |
clarkb | anyway I'm going to focus on a new mirror for a bit | 21:04 |
fungi | thanks! | 21:05 |
clarkb | it will be noble because I may as well | 21:05 |
clarkb | maybe you can push up a change to reduce the ttl the two cname records for the mirror? | 21:05 |
fungi | on it | 21:05 |
clarkb | and get that deployed while I'm building the new server. Thanks | 21:05 |
fungi | oh, actually do we need that? | 21:08 |
fungi | yeah, i guess it'll still make the additional cnames take effect sooner | 21:08 |
fungi | or not, the systems we care about in that regard will have cold caches anyway | 21:09 |
fungi | clarkb: ^ we're not altering the existing cname records, only adding new ones, right? | 21:09 |
clarkb | no I was thinking about altering the existing ones to have smaller ttls | 21:09 |
clarkb | you're right that nodes would start with cold caches so it would only be a problem for ongoing lookups after initial job startup | 21:10 |
clarkb | which maybe matters less | 21:10 |
fungi | how short would you want them for this purpose? something super short like 30 seconds so a node will go back and forth between different mirrors? | 21:10 |
fungi | we're not going to see even distribution this way regardless, i don't think fiddling with the ttls will change that | 21:11 |
clarkb | I usually reduce to 300 seconds | 21:12 |
clarkb | I'm fine with not changing it if you think it is unnecessary | 21:13 |
opendevreview | Merged opendev/system-config master: Update the IRC accessbot to python3.12 https://review.opendev.org/c/opendev/system-config/+/944405 | 21:13 |
fungi | i guess my question is would you want to keep both cnames at a low ttl indefinitely in this case? | 21:13 |
clarkb | the new server is booting and I've created its volume. Once booted I'll attach the volume and get it mounted properly then start pushing changes to enroll it | 21:13 |
clarkb | no I was thinking it would just be for the change over to the new server | 21:13 |
clarkb | and the ttl could be increased to our hour long default later | 21:14 |
fungi | oh, you're going to replace the existing server as well as add a second new server? | 21:14 |
clarkb | I'm only going to replace the server right now | 21:14 |
clarkb | per the plan corvus proposed | 21:15 |
fungi | so right now the cnames point to mirror02, but we're going to round-robin between new 03 and 04? | 21:15 |
* fungi re-reads the plan, had previously read the earlier round-robin suggestion into it | 21:15 | |
clarkb | my intention was to boot a new server and update mirror to point at mirror03. then see if that server does better | 21:15 |
clarkb | if it does then we're done. if it doesn't then we can round robin 02 and 03 | 21:15 |
clarkb | from 20:47:20 | 21:16 |
corvus | that's the plan in my brain too | 21:16 |
fungi | okay, that's definitely where i got confused. when corvus said "and boot a new mirror server in dfw" i thought he meant so we could round-robin requests between the existing one and the new one, not for replacing the existing one with a new one | 21:16 |
corvus | por que no los dos | 21:16 |
corvus | two plans for the price of one | 21:16 |
opendevreview | Jeremy Stanley proposed opendev/zone-opendev.org master: Preemptively lower TTL for mirror*.dfw.rax CNAMEs https://review.opendev.org/c/opendev/zone-opendev.org/+/945655 | 21:16 |
fungi | now we're on the same page ;) | 21:17 |
corvus | +3 | 21:18 |
fungi | looks like i stuck the ttl at the wrong tabstop in one of the two records, but bind won't care and we're removing them again shortly regardless | 21:19 |
clarkb | it's between the name and the IN, that's all that matters | 21:19 |
fungi | or no, one of them just has more tabs than the other | 21:19 |
fungi | yeah, it's all whitespace | 21:19 |
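The record layout being discussed can be sketched as a BIND zone-file fragment. The names, TTL, and target below are illustrative stand-ins, not the actual zone contents:

```
; Illustrative BIND master-file fragment: an explicit per-record TTL
; sits between the owner name and the IN class field. Fields are
; separated by any amount of whitespace, so mismatched tabstops are
; purely cosmetic and bind does not care.
mirror.dfw.rax      300   IN   CNAME   mirror03.dfw.rax.opendev.org.
mirror02.dfw.rax    300   IN   CNAME   mirror03.dfw.rax.opendev.org.
```

With the TTL lowered to 300 seconds ahead of the cutover, resolvers cache the old answer for at most five minutes, so pointing the CNAME at a replacement server takes effect quickly; the TTL can be raised back to the default afterwards.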
opendevreview | Merged openstack/project-config master: Restore max-servers in rax-ord and rax-iad https://review.opendev.org/c/openstack/project-config/+/945654 | 21:19 |
fungi | just my ocd getting the better of me again | 21:20 |
fungi | the accessbot change failed infra-prod-run-accessbot in deploy | 21:22 |
fungi | TASK [Run accessbot] non-zero return code | 21:23 |
fungi | https://paste.opendev.org/show/bVee2HZdsSTODhvEse6U/ | 21:24 |
clarkb | the server add volume command is terribly slow. Hasn't returned yet | 21:25 |
fungi | AttributeError: module 'ssl' has no attribute 'wrap_socket' | 21:25 |
clarkb | bah I guess we revert. I seem to remember fixing similar issues elsewhere. jeepyb maybe | 21:25 |
clarkb | it's a solvable problem but does need some effort so a revert is probably best | 21:25 |
fungi | i'm trying to recall | 21:25 |
clarkb | lsblk shows the volume on the server. Just waiting for volume list to agree then I'll proceed with formatting it and all that fun stuff | 21:26 |
fungi | in this case we're passing a wrapper to irc.connection.Factory() | 21:26 |
opendevreview | Merged opendev/zone-opendev.org master: Preemptively lower TTL for mirror*.dfw.rax CNAMEs https://review.opendev.org/c/opendev/zone-opendev.org/+/945655 | 21:28 |
fungi | looks like we fixed it here: https://opendev.org/openstack/project-config/src/branch/master/tools/check_irc_access.py#L145-L151 | 21:30 |
fungi | i'll propose a similar patch | 21:30 |
fungi | https://review.opendev.org/c/openstack/project-config/+/926825 was where you did it | 21:31 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12 https://review.opendev.org/c/opendev/system-config/+/945657 | 21:38 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add new noble rax mirror to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/945658 | 21:38 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a new noble mirror in rax https://review.opendev.org/c/opendev/system-config/+/945659 | 21:39 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Point mirror.dfw.rax at the new noble mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/945660 | 21:40 |
clarkb | I think those three changes should do it | 21:40 |
fungi | all three lgtm, approved the first | 21:41 |
fungi | and with that, i need to go cook dinner | 21:41 |
corvus | +2 on all 3 | 21:44 |
opendevreview | Merged opendev/zone-opendev.org master: Add new noble rax mirror to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/945658 | 21:45 |
clarkb | I'll approve the inventory update as soon as ^ deploys | 21:46 |
clarkb | and records resolve | 21:46 |
clarkb | fungi: re https://review.opendev.org/c/opendev/system-config/+/945657/1/docker/accessbot/accessbot.py the comment has a note about our testing from before that was true then but isn't true in this context | 21:50 |
clarkb | do you want to update the comment? or try another approach or maybe we just revert for now? | 21:50 |
clarkb | records are resolving for me now I'm approving the inventory update | 21:52 |
clarkb | and I'll take the wait for that as an opportunity for a break | 21:53 |
fungi | https://github.com/jaraco/irc/pull/221/files seems to be the corresponding reference update | 22:02 |
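For context on the error being fixed here: Python 3.12 removed the long-deprecated module-level `ssl.wrap_socket()`, so code that used it now raises `AttributeError` and has to build an `ssl.SSLContext` and use its bound `wrap_socket` method instead. A minimal sketch of that style of fix (the helper name and relaxed verification settings are illustrative, not the actual accessbot code):

```python
import ssl


def make_ssl_wrapper():
    # Hypothetical helper: on Python 3.12+ the old ssl.wrap_socket()
    # module function is gone, so construct an SSLContext and return
    # its bound wrap_socket method. The result is a plain callable
    # that can be passed wherever a socket wrapper is expected, e.g.
    # irc.connection.Factory(wrapper=make_ssl_wrapper()).
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    return context.wrap_socket


wrapper = make_ssl_wrapper()
print(callable(wrapper))  # → True
```

Note that `PROTOCOL_TLS_CLIENT` enables hostname checking and certificate verification by default, which is usually what an IRC client connecting over TLS wants anyway.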
opendevreview | Jeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12 https://review.opendev.org/c/opendev/system-config/+/945657 | 22:07 |
clarkb | fungi: +2 from me on ^ do you think we should just go ahead and approve it? | 22:18 |
opendevreview | Merged opendev/system-config master: Add a new noble mirror in rax https://review.opendev.org/c/opendev/system-config/+/945659 | 22:41 |
clarkb | it's going to be a little bit until that fully deploys the server (and new mirrors do an afs build too). But once apache is serving content I expect I can approve the next change to flip dns over | 22:44 |
clarkb | I was just able to get 22.8MB/s copying from mirror02 to paste over the 10 network interface | 22:46 |
clarkb | I'm now going to compare with mirror03 | 22:47 |
clarkb | 87MB/s | 22:48 |
fungi | clarkb: yeah, if 945657 doesn't run successfully (it should retrigger that deploy job), then i can revisit it again tomorrow. if it's broken for a few days it's not the end of the world | 22:54 |
clarkb | ok I approved it | 22:55 |
fungi | it only ever runs if we change the script or the config for it, which happens maybe on a ~monthly cadence at most these days | 22:56 |
fungi | if we have an urgent config update to land for it, then we always have the revert option anyway | 22:57 |
clarkb | we probably run it daily too? | 22:57 |
clarkb | actually I think I see a bug | 22:57 |
fungi | oh | 22:57 |
clarkb | review updated | 22:58 |
fungi | d'oh! | 22:59 |
fungi | now that's almost as embarrassing as leaving a clearly commented test configuration when copying into the production script | 22:59 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12 https://review.opendev.org/c/opendev/system-config/+/945657 | 23:00 |
fungi | thanks again! | 23:00 |
* fungi sighs | 23:00 | |
clarkb | if you like you can apply the same fix to the test you found the bad fix from | 23:01 |
clarkb | that would confirm this works before we approve it if you'd like to double check. Otherwise I can reapprove | 23:02 |
clarkb | +2'd for now anyway | 23:02 |
fungi | oh, good idea | 23:03 |
clarkb | almost to the mirror deployment job now | 23:03 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Update SSL use in our IRC access check script https://review.opendev.org/c/openstack/project-config/+/945662 | 23:07 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12 https://review.opendev.org/c/opendev/system-config/+/945657 | 23:07 |
clarkb | I believe the project-config change is self testing (that's how the fix came up: we wanted to default to noble for the job and it broke) | 23:09 |
clarkb | openafs is building on mirror03 | 23:09 |
fungi | yeah, that's why i added the depends-on | 23:10 |
fungi | - project-config-irc-access https://zuul.opendev.org/t/openstack/build/def8d5ab69964d4ebb5a5e9af718925b : SUCCESS in 6m 30s | 23:18 |
clarkb | I think you can approve that one. I +2'd | 23:18 |
fungi | i guess it worked | 23:18 |
clarkb | then approve the other | 23:19 |
clarkb | the mirror is no longer running gcc so should hopefully complete the deployment job soon | 23:19 |
fungi | i need to knock off for the evening, but will check in on the accessbot deploy job in the morning before my meeting | 23:20 |
clarkb | https://mirror03.dfw.rax.opendev.org/ that has stuff now | 23:20 |
clarkb | I'm going to approve the dns swap and that way periodic jobs at 02:00 should generate data for us | 23:21 |
corvus | nice | 23:22 |
opendevreview | Merged opendev/zone-opendev.org master: Point mirror.dfw.rax at the new noble mirror https://review.opendev.org/c/opendev/zone-opendev.org/+/945660 | 23:25 |
opendevreview | Merged openstack/project-config master: Update SSL use in our IRC access check script https://review.opendev.org/c/openstack/project-config/+/945662 | 23:32 |
corvus | i guess the mirror hosts are in cacti by their cnames, so we'll just see the graphs cutover nowish | 23:35 |
clarkb | ya new mirror is what I get out of dns now | 23:37 |
corvus | new data apparent in cacti (can see the discontinuity) | 23:54 |
clarkb | corvus: are we anywhere near the total bw from before? or is it too early to say if this is different? | 23:57 |
corvus | it was pretty low anyway (due to time of day? plus 623?). also, i guess we would expect a slowly (2 hours?) rolling cutover anyway with tcp connections to the old server | 23:58 |
corvus | https://imgur.com/nutfRgA | 23:59 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!