Wednesday, 2025-03-26

fricklerfwiw I've seen that apache restart issue before, but didn't look into a solution yet08:12
opendevreviewMerged zuul/zuul-jobs master: Add upload-image-swift role  https://review.opendev.org/c/zuul/zuul-jobs/+/94481210:22
opendevreviewMerged zuul/zuul-jobs master: Add upload-image-s3 role  https://review.opendev.org/c/zuul/zuul-jobs/+/94481310:22
opendevreviewFabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/94358611:06
opendevreviewFabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/94358611:10
opendevreviewFabien Boucher proposed zuul/zuul-jobs master: zuul_log_path: allow override for emit-job-header and upload-logs  https://review.opendev.org/c/zuul/zuul-jobs/+/94358611:13
mnasiadkaclarkb, fungi : I'd like to thank whoever fixed the aarch64 nodes to be fast again :)11:54
Clark[m]mnasiadka we redeployed the mirror and nodepool builder in osuosl and noticed they were quick. We didn't change anything on our end so must've been Ramereth[m]13:37
fungiclarkb: the python3.13.2t is a pyenv bug where it's selecting a free-threading build of python because that looks like the highest "version number"13:45
Clark[m]Ya I remembered you pushed a workaround to zuul-jobs that I thought had merged (it hasn't). I need to depends on that and revert my python version selector13:47
corvuslooks like we have zuul-launcher image builds for all the clouds: https://zuul.opendev.org/t/zuul/image/debian-bullseye14:40
corvusi think that page could probably use a column showing the format for each of the build artifacts.  but reading between the lines, it looks like ovh and raxflex use qcow2 and vexxhost and openmetal use raw?  does that sound right?14:41
corvusthen we have two builds for each of those (current and previous)14:41
clarkbcorvus: looking at our nodepool builder config our base default is qcow2 so then only clouds that override that would get raw or vhd14:42
clarkbso I think that is about right. vexxhost and openmetal use ceph to back the disks and ceph doesn't play as well with qcow2 as it does with raw14:43
clarkbdoubling up on the copy on write stuff doesn't work14:43
clarkbso ya that lgtm14:43
corvusah that's why.  i wondered.  :)14:43
clarkb*copy on write14:46
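For reference, the qcow2 vs raw distinction discussed above is visible in an image's first four bytes; a minimal sketch of telling them apart (filenames and helper name are illustrative):

```python
QCOW2_MAGIC = b"QFI\xfb"  # 0x514649FB, the magic at the start of every qcow2 image

def image_format(path):
    """Guess 'qcow2' vs 'raw' from the file header; raw images carry no magic."""
    with open(path, "rb") as f:
        return "qcow2" if f.read(4) == QCOW2_MAGIC else "raw"
```

This is how tools like qemu-img distinguish formats without trusting the filename extension.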
clarkbinfra-root how do we feel about upgrading gitea today to the latest bugfix release: https://review.opendev.org/c/opendev/system-config/+/945414 fungi has +2'd the change and screenshots lgtm : https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ed2/openstack/ed2d3a106d304d57ace4a9c901c12bac/bridge99.opendev.org/screenshots/14:59
clarkbthe main consideration is probably the openstack release next week14:59
clarkbI'm getting my morning started but I expect to be around until thunderstorms knock out my power (and I hope that doesn't happen). But that isn't expected until this afternoon so I think I'm good with it15:04
fungithis seems like a fine time for a gitea bugfix point release upgrade15:05
fungii'm around to help test15:05
clarkbok I'll give it a few more minutes for feedback then can hit +A15:07
clarkbI approved it15:15
fricklera week until the openstack release should be fine to find any possible issues and fix them if needed, but I agree the risk seems low, too15:28
clarkbya if this was an upgrade to 1.24 I would wait15:30
opendevreviewDr. Jens Harbott proposed openstack/project-config master: Cap max-servers for rax clouds  https://review.opendev.org/c/openstack/project-config/+/94562315:30
clarkbbut 1.23.5 to 1.23.6 should be much safer15:30
fricklerinfra-root: ^^ that's my proposal to see if it helps with the timeouts I mentioned yesterday, let me know what you think15:30
clarkb+2 from me15:31
clarkbseems like an easy safe experiment15:31
clarkbfrickler: my zuul python 3.13 test job is very slowly running pip install for requirements. I jumped on that test node 104.130.253.199 and it is almost idle. The problem seems to be with io to/from the mirror?15:38
clarkbfrickler: might be worth checking if the mirror is being overwhelmed or has some other issues. In the past we noticed similar slowness which was mitigated by switching to the internal rax network interfaces. But maybe now those internal interfaces are having problems. I don't know just noting what I observe15:39
clarkbof course reducing total test nodes would reduce demand on the mirrors if they are overwhelmed that would still help15:40
fungicould be ai crawlers15:40
clarkbha yes why didn't I consider that15:42
Ramereth[m]<Clark[m]> "mnasiadka we redeployed the..." <- I did make some changes in the backend storage, glad you noticed an improvement! Can you say how much of an improvement it was?15:47
mnasiadkaRamereth[m]: previously kolla build jobs timed out after 2.5/3 hours - now they finish in like 30-40 minutes :)15:47
Ramereth[m]sweet, are they using ephemeral disks for the VMs?15:48
clarkbRamereth[m]: we're using the opendev.large flavor, however that boots15:50
clarkbfor our image builder node we use a persistent volume and builds went from several hours to around an hour15:50
Ramereth[m]clarkb: yeah, so I added additional SSD nodes to Ceph and migrated that pool over to SSD. I also reduced the replication from 3 to 2 to save the space needed to use SSDs15:51
clarkbit seems to have helped15:52
clarkbre mirror.dfw system load is low and tailing some access logs I see lots of what I expect are normal pip requests15:53
clarkbdnf is requesting centos 9 packages over port 80 and not 44315:53
Ramereth[m]clarkb: btw did something happen to nl01.opendev.org? I had a nagios check to see when we had any images we needed to manually cleanup but it's no longer working15:54
clarkbRamereth[m]: oh sorry I deleted it and nl05.opendev.org took its place15:55
clarkbit didn't occur to me that external users may be using those in automated checks. But if you just update the hostname in the fqdn it should go back to working15:55
clarkb(I've been working through server replacements to update their base OSes)15:55
clarkbdoing a bigger tail -f dragnet on mirror.dfw I don't see anything that looks like obvious ai web crawler traffic15:56
Ramereth[m]<clarkb> "Ramereth: oh sorry I deleted..." <- Confirmed this is working again, thanks!16:02
clarkband good luck with the storms today16:06
Ramereth[m]Yeah, I hope it won't be bad16:07
clarkbI managed to get half the garage cleared out so the car can move inside today.16:07
fricklerclarkb: yes, I suspected either network or disk io. for kolla the timeouts mostly happen while pulling container images via the mirror node16:23
opendevreviewThierry Carrez proposed opendev/irc-meetings master: Move Large Scale SIG meeting one hour earlier  https://review.opendev.org/c/opendev/irc-meetings/+/94563316:36
clarkbone thought I had is maybe afs is slow but load average on the afs server looks ok and we see slowness pulling from pypi too which doesn't involve afs17:25
clarkbso ya this is kinda pointing at some sort of io (network and/or disk) slowness in rax I think17:26
fungicontainer images aren't coming from afs either right?17:26
clarkbno. Only distro packages17:26
*** iurygregory_ is now known as iurygregory17:26
clarkbI think it is possible that afs is impacted by the same io issues but I suspect that it isn't the cause17:26
clarkbI think we should consider frickler's change to dial back use of rax and see if lower demand helps on the whole17:27
fungiyeah, but as i said yesterday this might also be an opportunity to ask about dialling up use of flex at the same time, essentially starting to shift capacity in that direction17:31
clarkbfwiw I did ask jamesdenton a little while ago and he said he would check. But haven't heard back on that yet. So I don't want to push the issue17:32
fungisure, that's totally fair17:33
opendevreviewMerged opendev/system-config master: Update to Gitea 1.23.6  https://review.opendev.org/c/opendev/system-config/+/94541417:33
fungithat ^ should start deploying straight away17:33
clarkbgitea09 is in the process of restarting now17:37
fungiyeah, i just got the "service unavailable" page for it17:37
fungiand it's back online now17:38
clarkbhttps://gitea09.opendev.org:3081/opendev/system-config/ confirmed17:38
fungiPowered by Gitea Version: v1.23.617:38
clarkbgit clone works for me too17:38
clarkb10 is upgraded now too17:40
clarkbjamesdenton: re the conversation above we've seen jobs in rax classic (particularly dfw but maybe in iad and ord) become very slow. Initial investigation appears to show most of the time "lost" is performing network IO to grab things like python packages through our local caching proxy to pypi.org. But also through the cache proxy to quay (I think kolla is using quay now and not17:43
clarkbdocker hub)17:43
clarkbjamesdenton: not sure if old rax is something you have insight into, but figured we'd pass that along.17:43
fungideploy succeeded17:48
fungiall backends should be on 1.23.6 now17:48
clarkbthey appear to be from my checks17:49
clarkbI guess the last thing to check is replication. I'm looking for a new patchset that has been pushed since 17:48 to check with17:56
jamesdentonthanks clarkb. We are also seeing a few reports of slowness in another application, so it's on my list. In terms of bumping quota for Flex, i will start a thread for that17:56
clarkbjamesdenton: thanks. Let us know if we can help with the debugging.17:57
clarkbno new patchsets yet /me looks to see if any changes need a new patchset17:58
jamesdentonclarkb are your resources primary in DFW?18:00
clarkbjamesdenton: we notice most of the slowness in CI jobs which are in dfw, iad and ord. But many of our control plane services do run in dfw primarily including zuul and nodepool that coordinate all of the test resources18:01
jamesdentongotcha18:01
jamesdentonclarkb if you come across one again, let me know if it's possible to log in and troubleshoot18:03
clarkbjamesdenton: sure let me see if I can find one18:03
clarkbjamesdenton_: I asked nodepool to hold (not delete) the nodes for https://zuul.opendev.org/t/openstack/stream/60e83b5589fb47efa47e6d819debbe76?logfile=console.log but it will only do so if the job fails. If you open up that link you'll see it taking almost 2 minutes to download a 41MB package18:12
clarkbof course the problem may be elsewhere still but starting with what we know and trying to work from there18:13
clarkbthe server is named np0040280911 with ip address 23.253.108.10118:13
clarkbI guess I can hop on the server and run a speedtest when the job is done assuming it fails18:13
clarkband try to see if we can isolate the slowness18:14
fungii suppose you could sabotage the running build carefully to trigger the autohold while causing zuul to retry it18:16
fungimaybe surgically kill an ssh connection from the executor or something18:17
clarkbI thought about that but the job is running at the top of the gate18:17
clarkbanother option would be to just boot a test node manually18:17
jamesdenton_i can boot a node, no problem18:18
clarkbanyway I don't want to force that one to fail if it will otherwise succeed18:18
jamesdenton_i will add this to the list :)18:18
fungigot it18:18
fungithough if you can cause a retry that shouldn't reset the gate queue, just delay it18:19
clarkbtrue. I guess the trick is in being confident you'll cause a retry and not a failure18:19
jamesdenton_alt2025-03-26 17:59:11.511852 | compute1 | Fetched 166 MB in 8min 18s (333 kB/s)18:19
jamesdenton_alti'm guessing that's not reasonable?18:19
jamesdenton_altmy irc bouncer is bouncing, sorry.18:20
clarkbya I think the slowness is those fetches. At first I suspected that openafs may be the problem (those packages are on an openafs filesystem also hosted in dfw) but it appears that fetches for pypi packages through our mirror is also slow and that is a pass through caching proxy to pypi.org18:20
clarkbit could be that the mirror itself is a problem but system load there was reasonable and my checks for ai web crawlers didn't show that being a problem18:21
clarkbbut I suppose if the mirror specifically had networking io trouble that would be felt across many/most/all of the tests18:21
clarkbthe mirror is mirror02.dfw.rax.opendev.org 18:22
fungidmesg also doesn't indicate any recent incidents on it18:22
jamesdenton_altand that's a mirror you manage? 10.208.224.195?18:23
fungicorrect, job nodes are communicating with it across the 10.x network for better performance18:23
jamesdenton_altthe 10.x "servicenet" network?18:24
fungibut it's also globally reachable at 104.130.140.18618:24
clarkbthough mirror-int.dfw.rax.opendev.org isn't resolving for me right now18:24
clarkbI thought we put that in public dns18:24
fungiresolves for me18:24
jamesdenton_alti resolved to a cname - mirror02-int.dfw.rax.opendev.org.18:24
fungi$ host mirror-int.dfw.rax.opendev.org18:24
fungimirror-int.dfw.rax.opendev.org is an alias for mirror02-int.dfw.rax.opendev.org.18:24
fungimirror02-int.dfw.rax.opendev.org has address 10.208.224.19518:25
clarkbya this must be a local dns problem18:25
clarkbif I dig against google it resolves18:25
fungihopefully we haven't broken dnssec signatures for that18:25
clarkbbut ya that is a node that lives in dfw to serve as a cache for resources that test jobs consume. It caches pypi packages, container image layers, and distro packages18:25
fungithough if we had, i shouldn't be able to resolve it either18:25
clarkbya I think its more likely that my over complicated ad blocking rules are tripping on it for some reason18:26
jamesdenton_alti wonder if things improve over that public interface vs snet interface... hmm18:27
clarkbjamesdenton_alt: the reason we use the internal network is we had similar problems a few years ago over the public network18:27
clarkbswitching to the private network addressed it18:27
jamesdenton_altof course. lol18:27
clarkbbut its possible the other way around is true now18:27
fungii think it was qos forcing different limits on the global side18:28
jamesdenton_altpossible18:28
clarkbon that mirror if I wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img I get 28.9MB/s18:28
fungithe flavor-based network rate limiting was a little janky, if memory serves18:28
clarkbthat would go over the public network18:28
clarkband seem to be much quicker18:29
clarkbso ya maybe something to do with the private networking?18:29
fungiwe should be able to test across the 10.x address from one of our control-plane servers in that region18:29
jamesdenton_altyeah, it's possible there's an issue with the servicenet network, somewhere18:29
clarkbreminds me of that time infracloud's switch became a hub18:30
jamesdenton_altthat sounds like a bad time.18:31
clarkblet me try something like scp'ing that image file over the service net between rax-dfw nodes18:31
clarkbI'm downloading it again to get more datapoint first18:32
clarkbsecond time was 29.0MB/s18:32
clarkbso that is consistent18:32
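The ad-hoc wget/scp timings above boil down to bytes over elapsed time; a sketch of the same measurement in Python (the URL argument and helper names are illustrative, not part of any tooling mentioned in the log):

```python
import time
import urllib.request

def rate_mbs(nbytes, seconds):
    """Bytes over seconds -> decimal MB/s, the unit wget reports."""
    return nbytes / 1e6 / seconds

def timed_fetch(url, chunk=1 << 20):
    """Download a URL in 1 MiB chunks and return the observed throughput in MB/s."""
    start = time.monotonic()
    total = 0
    with urllib.request.urlopen(url) as resp:
        while True:
            data = resp.read(chunk)
            if not data:
                break
            total += len(data)
    return rate_mbs(total, time.monotonic() - start)
```

For scale: the 41 MB package that took almost 2 minutes in the held job works out to roughly 0.34 MB/s, versus the ~29 MB/s seen over the public interface.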
clarkbcurrently getting just above 400KBps18:33
fungiideally try to grab something via https, since traffic shaping might impact througput differently for each protocol18:34
clarkbsure but we already have data that that is slow18:34
clarkband this way I can show that it isn't our http service18:34
fungioh, right18:34
clarkbnow down to 200KB/s or so18:34
clarkbso ya this seems like an internal network is super slow for some reason problem.18:35
fungithough we do have other webservers listening on 10.x in that region too18:35
fungibut if you can reproduce it with scp too then good enough18:35
jamesdenton_altis this between the mirror and other hosts?18:35
clarkbI'm running the scp from paste.opendev.org to mirror.dfw.rax.opendev.org over the internal network address18:35
fungiso yes18:36
fungimight also try between e.g. paste and codesearch or something, to see if it's impacting other hosts besides mirror.dfw.rax18:37
clarkback let me cancel this download as it seems we have enough data here so far18:37
clarkbcodesearch got 33MB/s pulling from ubuntu18:38
clarkbpaste scp'ing from codesearch got 85.9MB/s18:39
clarkbso seems more specific to the mirror node (or maybe its network etc)18:39
fungiso one possibility (cacti is not easy to check at the moment) is we're simply maxing out the bandwidth quota for the internal interface of the mirror server18:42
fungianother is maybe noisy neighbor on the same compute host choking the physical interface (not sure what the wired topology looks like there so hard to guess)18:43
clarkbor something occurring at the switch level too18:43
clarkbfetching things over the public interface is speedy so I don't think the server itself is sad. This seems isolated to that specific network18:43
fungibut given that the prior observations were this only occurred at times of peak utilization for job resources, that would fit with interface rate limits18:44
clarkbI went ahead and deleted the noble image from all three nodes so that its not consuming disk space unnecessarily18:44
clarkbits also possible we only notice during those times because there are so many more jobs running during those periods18:45
fungimaybe this has only become an issue lately because some jobs have started to pull more content through our mirrors18:45
fungiand that's started to push us over some rate limit18:45
fungii'll see if i can get a port forward up to check cacti18:46
jamesdenton_altclarkb fungi i will dig around and see what qos policies might be applied to servicenet interfaces, and if i can find the hypervisor, i'll check the stats18:48
fungithanks jamesdenton_alt!18:48
clarkb++18:48
fungiargh, i think my ff may dislike the ssl options on the cacti server now18:53
clarkbthe unfortunate thing with network trouble like this it can be so many things. Maybe interfaces renegotiated at 10mbps for some reason. Or switch is acting like a hub because arp caches are full. Or qos rules are unexpectedly taking effect. Or we've got significant packet loss impacting window sizes so they never grow beyond a very minimal value18:53
fungioh, conveniently, the cacti server doesn't redirect http to https18:54
clarkbfun story. In a previous life I was a network person and some team couldn't figure out why microsoft ftp server was terribly slow. Turns out that the implementation at the time used an incredibly small window size which meant any downloads beyond like half a ms of rtt would always be terribly slow18:54
clarkbthat was a really fun one to debug because testing it in the datacenter it was always fine18:55
clarkbthe workaround was you could edit the registry to increase the window size because of course18:56
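The window-size story above is the bandwidth-delay product at work: TCP can have at most one receive window in flight per round trip, so a small fixed window caps throughput as RTT grows. A worked sketch (assuming no window scaling):

```python
def tcp_throughput_ceiling_bps(window_bytes, rtt_seconds):
    """Max TCP throughput: at most one full window delivered per round trip."""
    return window_bytes * 8 / rtt_seconds
```

A classic 64 KiB window over a 50 ms WAN path caps out around 10.5 Mbps, while the same window over a sub-millisecond datacenter RTT is effectively unlimited, which is why it always tested fine inside the datacenter.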
fungieth1 traffic does look like it's topping out in weird ways around 200Mbps18:56
clarkband historically I'm guessing it didn't do that?18:56
clarkbeither beacuse we were always under that limit or because we were well over it?18:57
clarkbI need to pop out for lunch now. Back in a bit18:57
fungitrying to get to the longer-term graphs18:57
fungihere we go: https://fungi.yuggoth.org/tmp/dfw-mirror-traffic.png19:11
fungiso for the past ~2 months we've been seeing about double the bandwidth utilization as we did all last year19:11
Clark[m]My tails of Apache logs didn't show anything that stood out to me as abnormal or wrong but there is a lot of it which may be worth checking. Another thought is I got 85MB/s between nodes on private networks which is much greater than 200mbps19:14
Clark[m]It could be that whatever the problem is here has us keeping connections open longer so that snmp records it? Whereas before we'd have shorter, much smaller blips?19:14
fungiwell, again, this is specifically on the internal 10.x interface (eth1) so is unlikely to be a scraper/bot19:15
Clark[m]85 * 8 / 200 is about 3.5x more throughput19:15
Clark[m]True19:15
fungimy money's on jobs pulling more container images or other large files from the cache19:16
Clark[m]Anyway 200mbps seems low to me as a cap given the speed I got between codesearch and paste19:16
fungimaybe in response to dockerhub issues19:16
Clark[m]So even if there is more data passing here the cap seems unusually low19:16
fungisome rate-limiting mechanisms have a bit of a delay to kick in, too19:17
fungithough the server is using performance1-8 which claims to have rxtx_factor=1600.019:17
fungiwhich i think is supposed to be Mbps?19:18
fungibut maybe it doesn't apply the same limits on the internal network either19:18
Clark[m]We can drop the use of -int interfaces and see what happens19:19
jamesdenton_mind PM'ing me the UUID of the VM?19:19
fungieven adding the eth0 and eth1 inbound+outbound traffic together it's nowhere near 1600Mbps. maybe 20% of it at most19:20
jamesdenton_thanks fungi19:21
Clark[m]200mbps is suspiciously similar to lacp over two 100mbps links. But I have no knowledge of the interface setup on the hypervisors19:23
fungiwell, the graphs do show peaking over 200 and then getting squeezed back down, which is typical behavior for some traffic shapers19:24
fungithere's often a bit of a delay before the rate limiting kicks in19:25
fungioh, i already said that19:25
fungianyway, if it was a 2x100 bond or something i'd expect it to look more like a hard limit19:25
fungiwhich this doesn't19:26
opendevreviewClark Boylan proposed openstack/project-config master: Stop using the -int interface on rax mirrors  https://review.opendev.org/c/openstack/project-config/+/94565119:51
clarkbthat's maybe the nuclear option. We could do a setup where only dfw doesn't use -int19:51
clarkbbut I figure consistency is a good thing if we want to make the change so went with the rip it out option instead19:51
corvusclarkb: the traffic graphs should be based on interface byte counters, so tcp session duration shouldn't affect it (they should be counting everything).19:57
clarkbah so that value is the 5 minute average for the interface19:58
fungiright, snmp checks the interface to see the bytes tx and rx totals, then 5 minutes later it checks them again and subtracts the previous values19:58
clarkband not a point in time19:58
fungiso yes effectively a ~5-minute mean (there's a bit of jitter due to snmp request timing, but not enough to be more than statistical noise)19:59
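The counter-delta arithmetic behind those cacti graphs can be sketched as follows, assuming 64-bit octet counters (ifHCInOctets-style) polled at a fixed interval; the modulo handles a single counter wrap between samples:

```python
def interface_rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bits/sec between two SNMP octet-counter samples, tolerating one wrap."""
    delta = (curr_octets - prev_octets) % (1 << counter_bits)
    return delta * 8 / interval_s
```

For scale: a 7.5 GB counter delta over a 300 s polling interval is exactly the ~200 Mbps plateau observed on eth1.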
fungitrying to infer rate limiters from such readings is sort of a gut feel thing, because the mean represents the minimum possible for a spike within that timeframe, so you have to assume there were probably points between where it went higher, though you can't really know by how much20:02
fungiin some cases rate limiters exhibit a sort of hysteresis, if the period is longer than the sample duration it can even be visible as a harmonic20:04
clarkbI think our options are: see if jamesdenton_ finds anything wrong with the network, land https://review.opendev.org/c/openstack/project-config/+/945651 and use a different network, or reduce the total test nodes to reduce overall demand, or some combo of the three20:04
clarkbI did briefly consider multiple mirrors using A/AAAA dns round robins too20:04
clarkbbut that seems like a lot of effort for now. That may become an option in the future though20:04
fungithat would probably work well enough too, yeah20:04
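A client-side sketch of that A/AAAA round-robin idea (addresses hypothetical); in practice most resolvers already rotate the record order per query, so this only illustrates the load-spreading effect:

```python
import itertools

def round_robin(addresses):
    """Return a callable that yields mirror addresses in rotating order."""
    pool = itertools.cycle(addresses)
    return lambda: next(pool)
```

Usage: `pick = round_robin(["203.0.113.10", "203.0.113.11"])`, then each `pick()` alternates between the two mirrors.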
clarkbI do think that letting things run in sad mode is not ideal so at the very least we should probably land https://review.opendev.org/c/openstack/project-config/+/94562320:05
clarkband if that doesn't help proceed with the switch to public interfaces?20:06
clarkbor just do both and send it20:06
fungifor now, i've approved 945623 to temporarily lower our use of rackspace classic20:09
fungiif this is bandwidth contention, it will likely have the desired effect20:10
corvusheh i would have started with 65120:16
corvusbecause if 651 fixes it, then we're in an optimal state.  if 623 fixes it, we're suboptimal and we still have work to do (ie, land 651)20:16
opendevreviewMerged openstack/project-config master: Cap max-servers for rax clouds  https://review.opendev.org/c/openstack/project-config/+/94562320:17
corvusnot a problem, just that we're guaranteed to need another patch after 62320:17
fungimy main concern is that 945651 could make things worse20:20
fungigiven we don't yet know the cause20:20
fungiwhereas with 945623 we can at least pretty confidently guess a minimum bound on how much worse that will be20:21
fungier, maximum bound20:22
clarkbaccessbot, matrix eavesdrop, and limnoria are all up next for python 3.12. Have a preference for which one goes first? maybe matrix eavesdrop? I think the impact there is pretty low since matrix has logs too20:23
fungiand per my review comment, it could actually improve throughput if it results in fewer job timeouts and rechecks20:23
fungiaccessbot should have the least impact, it's not really continuously used20:24
fungiit really only matters if someone adds/changes its configs20:25
fungimaybe after it updates, we could solicit another chanop volunteer and test adding them20:26
clarkbwfm I can +A that change now20:26
clarkbdone20:26
corvusfungi: i don't think lowering overall capacity is acceptable long-term.  it may improve throughput on one project but at the cost of others.20:28
corvusso what's the next step after observing 623?  if it makes things better, i don't think we can leave it there.  we need to find a better fix, so i'd propose that if it does make things better, we revert it and see if 651 fixes things.20:30
corvus(and then, if 651 fails to make things as good as 623, we put 623 back and regroup)20:30
clarkbthat sounds reasonable to me20:30
fungiwell, my hope was that we hear back as to what the actual cause of the problem is, and then either restore the missing capacity in those regions or get equivalent capacity in flex20:32
corvusan external fix would be ideal, yes20:33
fungiif it really is some sort of aggregate network capacity limit we're running up against, then either it can be fixed within the provider or it can't, and if it can't then we may need to look at other options for reducing our bandwidth utilization for that mirror (clarkb's suggestion of dns round-robin between two mirror servers in each region seems like a reasonable solution there)20:34
fungimy current supposition as to why throughput on eth0 is good and throughput on eth1 is bad is that the graphs show comparatively little contention for eth0 at the moment, but if we shift all our traffic from eth1 to eth0 then i see a few possible outcomes:20:36
fungi1. the rate limit on eth0 is the same as it is on eth1... this will make matters worse because now we're adding the already existing public traffic talking to pypi, dockerhub, etc together with the traffic to test nodes20:37
fungi2. the rate limit is shared between eth1 and eth0... this will neither improve nor worsen performance (but it seems unlikely since under that model we should be seeing performance issues for eth0 already and aren't)20:38
clarkbI suspect that we won't make matters much worse if measured from a job success rate standpoint20:40
fungi3. the rate limit on eth0 is higher than it is on eth1 (or whatever peformance issue impacting the host isn't effecting the physical path eth0 relies on)... this is the only scenario where i would expect 945651 to help20:40
clarkbwould we prefer I update the project-config change to only modify dfw to reduce the blast radius?20:41
corvus(i'm assuming there's a 4: eth0 is lower than eth1, where 651 makes thing worse as well)20:41
clarkbI guess another option is to boot a new mirror and switch the current record to it alone20:42
fungiyeah, i mean there's a long tail of scenarios where things could get worse, i didn't want to enumerate them all ;)20:42
clarkbif the problem is with that specific hypervisor we have a good chance of landing on a different hypervisor and the new mirror would be happier20:42
corvusi agree, i think 2 is unlikely and we can put it at the bottom of our list for when we've exhausted other hypotheses... 20:42
fungii can reestablish my port forward to cacti and see if iad and ord are exhibiting a similar traffic pattern on eth120:42
clarkbfungi: that would be helpful20:43
corvusmaybe i should port-forward and take a look too20:43
fungiiad looks a lot healthier by comparison20:44
fungii see traffic spiking more naturally and going up as high as 800Mbps at one point today20:44
corvushave we only noticed job problems in dfw?20:45
clarkbdfw jobs are definitely where I've noticed it20:45
corvusord is spikey up to 500Mbps20:46
fungiord also looks more like iad, with traffic spiking up as high as 700Mbps earlier in the week and 600Mbps when daily periodic jobs kicked off20:46
corvusi like the idea of booting a new mirror, in dfw only, seeing if it behaves better, and if so, switching to it, and if not, then round-robining with the existing one20:47
fungii've been going off anecdotes and frickler's observations so far, it sounded like he thought all of rax classic was impacted, but if the anomalous eth1 traffic limiting i'm seeing on the graphs is tied to the cause then i would only expect dfw to be impacted20:47
corvusbecause unless someone says "oh ord and iad are failing jobs too" then this is looking like a dfw issue20:47
corvusall right20:49
corvusso what jobs are failing?20:49
corvusi need to know what to look up in zuul so i can figure out if this is dfw only or hits other regions20:49
clarkbkolla jobs, I think I had a zuul unittest job timeout. The job I put a hold on (not sure if it failed but I'll check) was a cinder grenade job20:49
corvushttps://zuul.opendev.org/t/openstack/builds?project=openstack%2Fkolla&result=TIMED_OUT&skip=020:50
corvussomething like that?20:50
corvusoh but maybe only tox jobs?20:50
clarkbya I think this expresses itself as timeouts as pip install and apt-get install etc are what are particularly slow20:51
fungioverall the evidence i've seen so far was very slow download times in dfw jobs pulling content from dfw, probably jobs with low timeouts are more likely to get nudged over the edge from that20:51
clarkbthe cinder job I put a hold on succeeded. It took 2 hours.20:51
clarkbfungi: yup20:51
fungier, pulling content from the dfw mirror20:51
fungiin jobs with high timeouts due to expectation of longer runtimes for their tests, slow package downloads are more likely to be noise in the total duration20:52
fungiwhereas for normally quick jobs they represent a more outsized proportion of the overall job runtime20:53
corvusokay, if we look at all the kolla jobs, it's not helpful.  the distribution of timeouts looks like our cloud distribution.  so i'll try looking just at tox jobs that timed out20:54
fungithough very quick jobs may show less incidence courtesy of us having a comparatively long default timeout too, so they can probably absorb slow package downloads20:54
corvusokay i don't know what to look for20:55
fungii'd wager tox-based unit tests are in the sweet spot of being more likely to hover on the verge of their configured timeouts already20:55
fungiwhereas linter jobs probably have so much headroom in the default timeout that they don't exceed it even with this problem20:55
clarkbI suspect the cacti data is enough info to focus on dfw20:56
clarkband not overthink the analysis of which locations are affected20:56
clarkbwe know dfw is affected. This seems reflected through job logs, direct testing, and cacti20:56
fungihttps://zuul.opendev.org/t/openstack/builds?result=TIMED_OUT&skip=0 does show quite a few timed_out unit test jobs20:57
corvusthere are no openstack-tox-py312 timeouts in openstack/kolla this year20:57
clarkbusing cacti we can rule out the other two regions but have not done so with the other two criteria20:57
clarkbgiven that I'm happy to update my change to select mirrors to only change things for dfw and then monitor. I'm also happy to spin up a new mirror in dfw under the assumption the old mirror is in a sad hardware state somehow and a new one is less likely to end up in the same situation20:59
fungian openstack-tox-py312 example i just pulled up for neutron ran in ord: https://zuul.opendev.org/t/openstack/build/3ae2224e82eb49a3b41584f719dfe1f320:59
fungirandom sample, maybe unrelated20:59
corvusfungi: yeah, good idea -- though i see basically everything except dfw in that list.  based on that, i'd say we need to look at bhs120:59
clarkbfungi: https://zuul.opendev.org/t/openstack/build/3ae2224e82eb49a3b41584f719dfe1f3/log/job-output.txt#344-345 it only took a few seconds to install packages though20:59
corvusso i'm leaning toward what clarkb said -- let's just presume this is a dfw problem based on the cacti graphs, and if someone can come up with a better way of identifying affected jobs so we can confirm the actual affected regions, that would be great.21:00
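One hedged way to identify the actually affected regions, as corvus asks for above, would be to tally providers across the timed-out builds: each build's provider is recorded in its `zuul-info/inventory.yaml` under the nodepool vars. A minimal sketch of just the tally step (the fetching/parsing of inventories is assumed to have happened already, and the input shape is simplified to a list of nodepool dicts):

```python
from collections import Counter

def tally_providers(inventories):
    """Count builds per nodepool provider.

    `inventories` is an iterable of dicts, each assumed to carry the
    nodepool 'provider' key pulled out of a build's
    zuul-info/inventory.yaml beforehand.
    """
    counts = Counter()
    for nodepool in inventories:
        counts[nodepool.get('provider', 'unknown')] += 1
    return counts

# Example with made-up data:
sample = [{'provider': 'rax-dfw'}, {'provider': 'rax-dfw'},
          {'provider': 'ovh-bhs1'}]
print(tally_providers(sample).most_common(1))  # [('rax-dfw', 2)]
```

If the dfw theory is right, `most_common()` over real timed-out builds should show rax-dfw dominating.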
clarkbhttps://zuul.opendev.org/t/openstack/build/3ae2224e82eb49a3b41584f719dfe1f3/log/job-output.txt#545-642 and there21:00
clarkboh wait the second link needs to have a longer selection21:01
clarkbbut it's on the order of half a minute total, not half a minute per package21:01
opendevreviewJames E. Blair proposed openstack/project-config master: Restore max-servers in rax-ord and rax-iad  https://review.opendev.org/c/openstack/project-config/+/94565421:02
corvusi think we should do that ^ and boot a new mirror server in dfw21:02
clarkb+2 on that change from me. I can start spinning up the mirror unless someone else wants to21:02
fungialready approved21:03
clarkbI have lots of recent practice with launching nodes21:03
fungithe second timed_out build i selected at random was in rax-dfw (functional testing for cinder): https://zuul.opendev.org/t/openstack/build/7cfc840183b5421a8db60a05c6dae8c121:04
clarkbhttps://zuul.opendev.org/t/openstack/build/7cfc840183b5421a8db60a05c6dae8c1/log/job-output.txt#371-372 note the time difference to the first example21:04
clarkbthat took almost 12.5 minutes21:04
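The comparison clarkb is making here is just the delta between the leading timestamps on two `job-output.txt` lines. A small sketch, assuming the usual Zuul console format where each line begins with a `YYYY-MM-DD HH:MM:SS.ffffff` timestamp (the example lines and values below are illustrative):

```python
from datetime import datetime

def elapsed(first_line, last_line):
    """Seconds between two job-output.txt lines, assuming each line
    starts with a 26-character 'YYYY-MM-DD HH:MM:SS.ffffff' stamp."""
    fmt = '%Y-%m-%d %H:%M:%S.%f'
    ts = lambda line: datetime.strptime(line[:26], fmt)
    return (ts(last_line) - ts(first_line)).total_seconds()

# About 12.5 minutes between start and finish of a package install:
print(elapsed('2025-03-26 21:04:01.000000 | Installing packages',
              '2025-03-26 21:16:28.000000 | Packages installed'))  # 747.0
```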
clarkbanyway I'm going to focus on a new mirror for a bit21:04
fungithanks!21:05
clarkbit will be noble because I may as well21:05
clarkbmaybe you can push up a change to reduce the ttl on the two cname records for the mirror?21:05
fungion it21:05
clarkband get that deployed while I'm building the new server. Thanks21:05
fungioh, actually do we need that?21:08
fungiyeah, i guess it'll still make the additional cnames take effect sooner21:08
fungior not, the systems we care about in that regard will have cold caches anyway21:09
fungiclarkb: ^ we're not altering the existing cname records, only adding new ones, right?21:09
clarkbno I was thinking about altering the existing ones to have smaller ttls21:09
clarkbyou're right that nodes would start with cold caches so it would only be a problem for ongoing lookups after initial job startup21:10
clarkbwhich maybe matters less21:10
fungihow short would you want them for this purpose? something super short like 30 seconds so a node will go back and forth between different mirrors?21:10
fungiwe're not going to see even distribution this way regardless, i don't think fiddling with the ttls will change that21:11
clarkbI usually reduce to 300 seconds21:12
clarkbI'm fine with not changing it if you think it is unnecessary21:13
opendevreviewMerged opendev/system-config master: Update the IRC accessbot to python3.12  https://review.opendev.org/c/opendev/system-config/+/94440521:13
fungii guess my question is would you want to keep both cnames at a low ttl indefinitely in this case?21:13
clarkbthe new server is booting and I've created its volume. Once booted I'll attach the volume and get it mounted properly then start pushing changes to enroll it21:13
clarkbno I was thinking it would just be for the change over to the new server21:13
clarkband the ttl could be increased to our hour long default later21:14
fungioh, you're going to replace the existing server as well as add a second new server?21:14
clarkbI'm only going to replace the server right now21:14
clarkbper the plan corvus proposed21:15
fungiso right now the cnames point to mirror02, but we're going to round-robin between new 03 and 04?21:15
* fungi re-reads the plan, had previously read the earlier round-robin suggestion into it21:15
clarkbmy intention was to boot a new server and update mirror to point at mirror03. then see if that server does better21:15
clarkbif it does then we're done. if it doesn't then we can round robin 02 and 0321:15
clarkbfrom 20:47:2021:16
corvusthat's the plan in my brain too21:16
fungiokay, that's definitely where i got confused. when corvus said "and boot a new mirror server in dfw" i thought he meant so we could round-robin requests between the existing one and the new one, not for replacing the existing one with a new one21:16
corvuspor que no los dos21:16
corvustwo plans for the price of one21:16
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Preemptively lower TTL for mirror*.dfw.rax CNAMEs  https://review.opendev.org/c/opendev/zone-opendev.org/+/94565521:16
funginow we're on the same page ;)21:17
corvus+321:18
fungilooks like i stuck the ttl at the wrong tabstop in one of the two records, but bind won't care and we're removing them again shortly regardless21:19
clarkbit's between the name and the IN, that's all that matters21:19
fungior no, one of them just has more tabs than the other21:19
fungiyeah, it's all whitespace21:19
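For context on the whitespace discussion above, a per-record TTL in a BIND zone file sits between the owner name and the class, and any run of tabs or spaces works as the separator. An illustrative fragment (owner names and target are examples, 300 matching the lowered TTL discussed earlier):

```
; owner           TTL  class  type   target
mirror.dfw.rax    300  IN     CNAME  mirror02.dfw.rax.opendev.org.
mirror01.dfw.rax  300  IN     CNAME  mirror02.dfw.rax.opendev.org.
```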
opendevreviewMerged openstack/project-config master: Restore max-servers in rax-ord and rax-iad  https://review.opendev.org/c/openstack/project-config/+/94565421:19
fungijust my ocd getting the better of me again21:20
fungithe accessbot change failed infra-prod-run-accessbot in deploy21:22
fungiTASK [Run accessbot] non-zero return code21:23
fungihttps://paste.opendev.org/show/bVee2HZdsSTODhvEse6U/21:24
clarkbthe server add volume command is terribly slow. Hasn't returned yet21:25
fungiAttributeError: module 'ssl' has no attribute 'wrap_socket'21:25
clarkbbah I guess we revert. I seem to remember fixing similar issues elsewhere. jeepyb maybe21:25
clarkbit's a solvable problem but does need some effort so a revert is probably best21:25
fungii'm trying to recall21:25
clarkblsblk shows the volume on the server. Just waiting for volume list to agree then I'll proceed with formatting it and all that fun stuff21:26
fungiin this case we're passing a wrapper to irc.connection.Factory()21:26
opendevreviewMerged opendev/zone-opendev.org master: Preemptively lower TTL for mirror*.dfw.rax CNAMEs  https://review.opendev.org/c/opendev/zone-opendev.org/+/94565521:28
fungilooks like we fixed it here: https://opendev.org/openstack/project-config/src/branch/master/tools/check_irc_access.py#L145-L15121:30
fungii'll propose a similar patch21:30
fungihttps://review.opendev.org/c/openstack/project-config/+/926825 was where you did it21:31
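The underlying issue is that the module-level `ssl.wrap_socket()` was removed in Python 3.12, so callers have to build an `ssl.SSLContext` and pass its bound `wrap_socket` method instead. A minimal sketch of that style of wrapper for `irc.connection.Factory` (the hostname is illustrative and this is not the exact accessbot patch):

```python
import functools
import ssl

def make_tls_wrapper(server_hostname):
    """Return a wrap_socket-style callable suitable for passing as
    irc.connection.Factory(wrapper=...), replacing the removed
    module-level ssl.wrap_socket()."""
    context = ssl.create_default_context()
    # Bind the hostname now so the factory can just call wrapper(sock).
    return functools.partial(context.wrap_socket,
                             server_hostname=server_hostname)

wrapper = make_tls_wrapper('irc.libera.chat')  # hostname is illustrative
print(callable(wrapper))  # True
```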
opendevreviewJeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12  https://review.opendev.org/c/opendev/system-config/+/94565721:38
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add new noble rax mirror to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94565821:38
opendevreviewClark Boylan proposed opendev/system-config master: Add a new noble mirror in rax  https://review.opendev.org/c/opendev/system-config/+/94565921:39
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Point mirror.dfw.rax at the new noble mirror  https://review.opendev.org/c/opendev/zone-opendev.org/+/94566021:40
clarkbI think those three changes should do it21:40
fungiall three lgtm, approved the first21:41
fungiand with that, i need to go cook dinner21:41
corvus+2 on all 321:44
opendevreviewMerged opendev/zone-opendev.org master: Add new noble rax mirror to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94565821:45
clarkbI'll approve the inventory update as soon as ^ deploys21:46
clarkband records resolve21:46
clarkbfungi: re https://review.opendev.org/c/opendev/system-config/+/945657/1/docker/accessbot/accessbot.py the comment has a thing about our testing before that was true but isn't true in this context21:50
clarkbdo you want to update the comment? or try another approach or maybe we just revert for now?21:50
clarkbrecords are resolving for me now I'm approving the inventory update21:52
clarkband I'll take the wait for that as an opportunity for a break21:53
fungihttps://github.com/jaraco/irc/pull/221/files seems to be the corresponding reference update22:02
opendevreviewJeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12  https://review.opendev.org/c/opendev/system-config/+/94565722:07
clarkbfungi: +2 from me on ^ do you think we should just go ahead and approve it?22:18
opendevreviewMerged opendev/system-config master: Add a new noble mirror in rax  https://review.opendev.org/c/opendev/system-config/+/94565922:41
clarkbit's going to be a little bit until that fully deploys the server (and new mirrors do an afs build too). But once apache is serving content as I expect to see, I can approve the next change to flip dns over22:44
clarkbI was just able to get 22.8MB/s copying from mirror02 to paste over the 10 network interface22:46
clarkbI'm now going to compare with mirror0322:47
clarkb87MB/s22:48
fungiclarkb: yeah, if 945657 doesn't run successfully (it should retrigger that deploy job), then i can revisit it again tomorrow. if it's broken for a few days it's not the end of the world22:54
clarkbok I approved it22:55
fungiit only ever runs if we change the script or the config for it, which happens maybe on a ~monthly cadence at most these days22:56
fungiif we have an urgent config update to land for it, then we always have the revert option anyway22:57
clarkbwe probably run it daily too?22:57
clarkbactually I think I see a bug22:57
fungioh22:57
clarkbreview updated22:58
fungid'oh!22:59
funginow that's almost as embarrassing as leaving a clearly commented test configuration when copying into the production script22:59
opendevreviewJeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12  https://review.opendev.org/c/opendev/system-config/+/94565723:00
fungithanks again!23:00
* fungi sighs23:00
clarkbif you like you can apply the same fix to the test you found the bad fix from23:01
clarkbthat would confirm this works before we approve it if you'd like to double check. Otherwise I can reapprove23:02
clarkb+2'd for now anyway23:02
fungioh, good idea23:03
clarkbalmost to the mirror deployment job now23:03
opendevreviewJeremy Stanley proposed openstack/project-config master: Update SSL use in our IRC access check script  https://review.opendev.org/c/openstack/project-config/+/94566223:07
opendevreviewJeremy Stanley proposed opendev/system-config master: Accessbot fix for running on Python 3.12  https://review.opendev.org/c/opendev/system-config/+/94565723:07
clarkbI believe the project-config change is self testing (that's how the fix came up: we wanted to default to noble for the job and it broke)23:09
clarkbopenafs is building on mirror0323:09
fungiyeah, that's why i added the depends-on23:10
fungi- project-config-irc-access https://zuul.opendev.org/t/openstack/build/def8d5ab69964d4ebb5a5e9af718925b : SUCCESS in 6m 30s23:18
clarkbI think you can approve that one. I +2'd23:18
fungii guess it worked23:18
clarkbthen approve the other23:19
clarkbthe mirror is no longer running gcc so should hopefully complete the deployment job soon23:19
fungii need to knock off for the evening, but will check in on the accessbot deploy job in the morning before my meeting23:20
clarkbhttps://mirror03.dfw.rax.opendev.org/ that has stuff now23:20
clarkbI'm going to approve the dns swap and that way periodic jobs at 02:00 should generate data for us23:21
corvusnice23:22
opendevreviewMerged opendev/zone-opendev.org master: Point mirror.dfw.rax at the new noble mirror  https://review.opendev.org/c/opendev/zone-opendev.org/+/94566023:25
opendevreviewMerged openstack/project-config master: Update SSL use in our IRC access check script  https://review.opendev.org/c/openstack/project-config/+/94566223:32
corvusi guess the mirror hosts are in cacti by their cnames, so we'll just see the graphs cutover nowish23:35
clarkbya new mirror is what I get out of dns now23:37
corvusnew data apparent in cacti (can see the discontinuity)23:54
clarkbcorvus: are we anywhere near the total bw from before? or is it too early to say if this is different?23:57
corvusit was pretty low anyway (due to time of day?  plus 623?).  also, i guess we would expect a slowly (2 hours?) rolling cutover anyway with tcp connections to the old server23:58
corvushttps://imgur.com/nutfRgA23:59

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!