Friday, 2021-06-04

corvusianw: no i'm trying to say it's an incompatible option in zuul00:03
corvusianw: zuul either uses the app auth or the webhook auth00:03
corvusianw: a scheduler restart will bring in significant zuul changes; i was planning one tomorrow; i would not recommend it unless you're prepared to monitor for fallout.  also, i would not recommend it because i don't think it'll fix the problem.00:05
corvusianw: i read as meaning it's not going to use the api token as a fallback00:11
ianwcorvus: yeah, i agree also as i was reworking that became clearer00:15
ianwi think if we're not installed for the project, we should fall back to api authentication00:17
ianwi feel like i'm doing this very inefficiently compared to someone like tobiash[m] who might have thought about this a lot more.  i might just update the story and give others a chance to weigh in00:18
opendevreviewSandeep Yadav proposed openstack/diskimage-builder master: [DNM] Lock NetworkManager in DIB
opendevreviewSlawek Kaplonski proposed opendev/irc-meetings master: Move neutron meetings to the openstack-neutron channel
fricklerslaweq_: do you want to have some time for neutron ppl to review ^^ or do you think it's enough that it has been discussed in the meeting? then I'd just merge it06:54
slaweq_frickler: yes, I want at least Liu and Brian to check it06:56
fricklerslaweq_: o.k., waiting for that, then06:58
slaweq_frickler: thx07:04
opendevreviewPierre Riteau proposed opendev/irc-meetings master: Move Blazar meeting to #openstack-blazar
opendevreviewMerged opendev/irc-meetings master: Move Blazar meeting to #openstack-blazar
*** lucasagomes has quit IRC12:38
corvusstarting that now13:33
corvusas expected, zk data size is growing significantly (we're caching config data there now)13:39
corvuslooks like node count increased from 12k -> 35k, and data size increased from 10mib -> 20mib13:42
corvusit didn't seem like startup took any more or less time, which is good.13:43
corvusjobs are running13:43
*** ysandeep|mtg is now known as ysandeep13:44
*** ysandeep is now known as ysandeep|ruck13:44
corvus#status log restarted zuul at commit 85e69c8eb04b2e059e4deaa4805978f6c0665c03 which caches unparsed config in zk. observed expected increase in zk usage after restart: 3x zk node count and 2x zk data size13:47
opendevstatuscorvus: finished logging13:47
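The "3x" and "2x" figures in the status log follow from the approximate numbers quoted just before it (12k -> 35k nodes, 10 MiB -> 20 MiB). A minimal sketch of that arithmetic; the helper name `growth_factor` is made up for illustration and is not part of Zuul or ZooKeeper:

```python
def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


# Approximate figures as reported in the channel (not re-measured).
node_factor = growth_factor(12_000, 35_000)  # ZooKeeper node count
data_factor = growth_factor(10, 20)          # ZooKeeper data size in MiB

print(f"node count grew {node_factor:.1f}x, data size grew {data_factor:.1f}x")
```

The node count ratio comes out slightly under 3, which matches the rounded "3x" in the log entry.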
corvuslooks like the final numbers may be a bit bigger as we're still adding in the operational baseline from before now that jobs are starting13:48
corvusso far none of the performance metrics look different13:49
corvuseverything still looks nominal14:53
corvusthere's like zero change in memory usage on them despite the 2x change in data size15:33
corvusa minuscule amount of additional cpu15:33
corvus(like, it's currently at 95% idle)15:34
corvusbut that could actually just be due to a change in connection distribution15:35
opendevreviewMerged opendev/base-jobs master: Set a fallback VERSION_ID in the mirror-info role
fungijrosser: ^ probably worth rechecking your bullseye addition change now16:00
fungiin theory it should work without your workaround at this point16:01
opendevreviewMonty Taylor proposed zuul/zuul-jobs master: Add a job for publishing a site to netlify
clarkbas a reminder I was planning on dropping out of freenode channels entirely next week. Any reason to not do that given the way the transition has gone?16:53
clarkbfungi: should we maybe roll the dice on topic updates first?16:53
fungii'm happy to stick around in there for months watching for stragglers, but sure maybe we update the topic for #opendev (as an example) to "Former channel for the OpenDev Collaboratory, see"16:59
clarkb++ I like that. Clearly points to docs people need without tripping any of the known keywords that are a problem17:00
mordredI have also dropped out of freenode fwiw17:05
mnaseri am also planning to continue to be there to send people over18:51
mnaserbut it is getting increasingly quiet :)18:51
fungiand yeah, it has no members yet19:47
fungislittle1: done!19:47
fungiany time!20:05
noonedeadpunkfungi: any known activity to centos8 repos? Like dropping them?20:08
noonedeadpunkas they started filing weirdly today at ~10am utc20:09
fungifiling what?20:28
funginoonedeadpunk: if you meant failing, a link to an example build result would be helpful20:43
johnsomI wonder if we have another ansible upgrade going on. Cloning is slow at the moment.20:49
fungijohnsom: i've checked resource utilization for all the backends and the load balancer, everything is fairly quiet... can you check the ssl cert for the backend you're hitting to see what hostnames it lists? i'll dive deeper on the one you're hitting20:54
johnsomIt's a stack (devstack) so let me try in another window. Just odd to see a clone horizon taking minutes20:55
fungiand it's the cloning phase specifically, not installing, which is going slowly?20:57
johnsom~400 KiB/s20:57
johnsomWhat is the trick to grab the TLS info? Do I need to tcpdump it?20:58
johnsomYeah, even a direct git clone is super slow.20:58
fungii do `echo|openssl s_client -connect|openssl x509 -text|grep CN` but there are lots of ways20:59
johnsomYeah, ok, a separate s_client.21:00
johnsom CN = gitea01.opendev.org21:00
fungiyou can also get away with just a simple `openssl s_client -connect` and then scroll up to the beginning of the verification chain info where it mentions the CN21:01
fungibut you end up with a lot of output21:01
johnsomYeah, I know s_client all too well. lol21:01
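The `grep CN` step in the pipeline above also matches the issuer line, so picking out the backend name still takes a glance at the output. A rough Python sketch of narrowing that to the Subject line only, assuming the certificate text has already been captured from `openssl x509 -text`; the function name and the sample issuer below are hypothetical, and only the `CN = gitea01.opendev.org` value comes from the channel:

```python
import re
from typing import List


def subject_common_names(x509_text: str) -> List[str]:
    """Extract CN values from the Subject line(s) of `openssl x509 -text` output."""
    names = []
    for line in x509_text.splitlines():
        line = line.strip()
        if not line.startswith("Subject:"):
            continue  # skip Issuer and every other field
        # Accept both "CN = host" and "CN=host" spellings.
        names.extend(re.findall(r"CN\s*=\s*([^,/\s]+)", line))
    return names


# Hypothetical excerpt; the issuer is invented for the example.
sample = """
        Issuer: C = US, O = Example CA, CN = Example Issuing CA
        Subject: CN = gitea01.opendev.org
"""
print(subject_common_names(sample))
```

This only reads the Subject CN. Any additional hostnames in the Subject Alternative Name extension would need separate handling.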
fungicurrently testing cloning nova from gitea01 with another server on the same network just to get a baseline21:06
fungiReceiving objects: 100% (595113/595113), 155.44 MiB | 14.64 MiB/s, done.21:06
johnsomThis is the IP it's hitting: 2604:e100:3:0:f816:3eff:fe6b:ad6221:07
fungiyep, that's an haproxy load balancer21:07
fungiyou can test cloning directly from to bypass it21:07
fungisee if you get similar speeds21:07
johnsomThat looks faster, 2.6MiB/s21:08
fungiso that suggests one of two things, either the lb is slowing things down or (more likely) ipv6 performance for you is worse than ipv4 at the moment21:09
fungimaybe try `git clone -4 to rule out the latter21:09
fungier, `git clone -4`21:09
johnsomYeah, I'm still getting 1gbps to Portland. Let21:09
johnsomme try the v421:10
fungii can confirm cloning via the load balancer's ipv6 address is very slow for me as well21:10
fungieven though i'm hitting a different backend entirely21:10
johnsomAbout the same for ipv421:11
johnsomWell, maybe it's just Friday afternoon people streaming stuff. lol21:11
fungiyep, i'm seeing even worse performance to the lb over ipv4 than over ipv6, yikes21:12
funginetwork traffic graph for it seems reasonable though:
johnsomYeah, must be something upstream.21:13
fungicpu is virtually idle, so it's not like it's handling an interrupt storm or anything21:13
fungihowever, given that ipv4 to the load balancer is slow but ipv6 to the backends isn't, even though they're in the same provider, suggests there could be something else going on21:14
fungier, ipv4 to both i mean21:14
fungiipv4 to backends is fast (they're ipv4-only in fact)21:15
johnsomWell, that isn't enough traffic to wake haproxy up from an afternoon nap.21:15
clarkbmtr can often point out locations with problems21:16
fungiseems like it might be a local network issue impacting the segment the lb is on but not the segment the backends are on21:16
clarkbthough the local router appears to be shared and the ip addresses are on the same /2521:19
clarkbah and according to the network interfaces that range is part of a larger /24 segment21:20
fungimtr --tcp is showing some pretty substantial packet loss to the lb21:24
fungibut not to the backends21:24
fungianyone else observing the same?21:24
fungiseems to come and go in bursts21:25
funginow i'm not seeing it21:25
clarkbI've got a couple `mtr --tcp` running now but no loss so far21:25
clarkbmaybe sad lacp link or similar21:26
johnsomfungi Are you bouncing through seattle with your mtr?21:26
fungione frame for you, one frame for the bit bucket, one frame for you, ...21:26
johnsomI'm seeing some congestion in Seattle that comes and goes21:27
fungilooks like my cable provider peers with zayo and then i go through atlanta to dallas to los angeles to san jose21:27
fungithough i wouldn't be surprised if the return route is asymmetric. lemme see if i can check the other direction21:28
clarkbI'm going over cogent via pdx and sfo21:28
johnsomYeah, I bounce through Seattle, get on zayo straight to San Jose21:28
fungibut i would be surprised if my routes to and from the lb differ significantly vs those for a backend server in the same cloud21:29
clarkbI have not seen any loss over my path21:30
fungican't even make it through one pass with mtr before it crashes on "address in use"21:30
fungibut a traditional traceroute shows the return path to me is actually cogent not zayo21:31
fungiso my connections are arriving at vexxhost via zayo but responses go over cogent (san jose straight to atlanta)21:31
fungianyway, for me the routes are the same to/from as well, yet i can clone from it far faster21:35
fungieven the last hop for me is the same in both traceroutes21:37
fungilet's see if the first hops in the other direction line up21:37
fungifirst hops on the return path are also the same for both servers, so maybe it's a layer 2 issue, host level even?21:39
johnsomSorry to derail the end of Friday. I did get my clones finished, so I'm good to go at this point.21:39
fungino, i appreciate the heads up, it's looking like we might want to let mnaser in on the fun21:39
fungimnaser: are you aware of any internet network disruptions in sjc1?21:41
fungier, i mean internal21:41
mnasernothing that i'm aware of21:42
mnaseri haven't been able to digest the messages though21:42
fungiwe're seeing very slow network performance from multiple locations communicating over tcp with (both ipv4 and ipv6) addresses for gitea-lb01.opendev.org21:42
mnaseripv6 is fast but ipv4 is not, or?21:42
fungiboth slow, v4 is actually slower for me than v6 even21:42
fungihowever other servers on the same network, like are fairly snappy21:43
fungiresource graphs for all look basically idle21:43
mnaseroh so going direct to gitea backends is ok, but the load balancer is not?21:43
fungimtr --tcp is showing me a lot of packet loss for as well and not for other hosts in that network21:43
fungier. meant to say resource graphs for all look basically idle21:44
fungiwondering if there could be something happening at layer 2 but only impacting, maybe at the host level?21:44
fungitraceroutes to/from both the lb and backends look identical21:45
fungithe server instance is showing no obvious signs of distress, not even breaking a sweat21:46
mnaserfungi: are you inside gitea's vm?21:46
fungithe haproxy lb vm is the one we're seeing weird network performance for, not the backend gitea servers21:47
fungi"" (a.k.a.
mnaserright, sorry, i meant gitea-lb01 :p21:48
mnasermy 'shortcutting' failed21:48
mnaser`curl -s | python3 -mjson.tool | grep uuid`21:48
fungimnaser: curl can't seem to reach that url from the server, but server show reports the instance uuid is e65dc9f4-b1d4-4e18-bf26-13af30dc3dd621:50
fungifor the record, the curl response is "curl: (7) Failed to connect to port 80: No route to host"21:51
fungiso we're probably missing a static route for that21:51
clarkbyou can get the instance uuid from the api `openstack server show`21:52
fungiyeah, that's where i got the one i pasted above21:52
* mnaser looks21:53
fungiperformance seems to at times rise as high as 1.5MiB/s and then fall as low as 400KiB/s according to git clone... same sort of cadence i see mtr --tcp report packet loss coming and going for it21:57
mnaserim seeing peaks of like21:57
mnaser400-500Mbps on the public interface21:57
mnaserbut i guess that's because it's using the same interface for in/out traffic21:57
fungithat doesn't match at all what we're seeing with snmp polls though:
fungibut we're aggregating at 5-minute samples, so maybe it's far more bursty21:58
fungias far as the traffic graphs we have are concerned though, our network utilization on that interface is basically what we always have there, but the network performance we're seeing is not typical21:59
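The clone rates git reports (KiB/s, MiB/s) and the interface rates quoted here (Mbps) are in different units, which makes them awkward to compare by eye. A small sketch of the conversion, assuming git's binary prefixes (1 MiB = 2**20 bytes) and decimal megabits (10**6 bits); the function names are illustrative only:

```python
def mib_per_s_to_mbps(mib_per_s: float) -> float:
    """Convert MiB/s (2**20 bytes per second) to Mbit/s (10**6 bits per second)."""
    return mib_per_s * 2**20 * 8 / 1e6


def clone_seconds(size_mib: float, rate_mib_per_s: float) -> float:
    """Estimate how long a transfer of `size_mib` takes at a steady rate."""
    return size_mib / rate_mib_per_s


# Rates and the nova pack size as quoted in the channel.
rates = {
    "slow clone (400 KiB/s)": 400 / 1024,
    "better clone (1.5 MiB/s)": 1.5,
    "same-network baseline (14.64 MiB/s)": 14.64,
}
for label, rate in rates.items():
    print(f"{label}: {mib_per_s_to_mbps(rate):.1f} Mbit/s, "
          f"nova (155.44 MiB) in about {clone_seconds(155.44, rate):.0f} s")
```

By this conversion a single clone at the slow rates accounts for only a few Mbit/s, well below the interface peaks being discussed.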
mnaserfungi: would it be possible to setup iftop on the lb node and see if you see anything odd22:01
*** tosky has quit IRC22:01
fungiinstalled, checking the manpage for it now22:01
clarkbfwiw I don't see the same issues that fungi sees22:02
fungiahh, reminds me of nftop and pftop on *bsd22:02
fungiclarkb: what speeds do you get cloning nova?22:02
mnaserfungi: p. much, nice little real time thing to see :)22:02
clarkbchecking now22:02
clarkbjust under 2MiB/s22:03
fungiand it's steady?22:03
clarkbwhich is typical for me iirc22:03
fungiwhat about to one of the backends?22:03
clarkbyup bounces between about 1.75 to 2.00 MiB/s but seems steady22:03
clarkbwill try a backend when this completes22:04
clarkb155.55 MiB | 1.74 MiB/s, done. <- was aggregate22:04
fungimnaser: the averaged rates are lower than i would expect but i do see the 2sec average occasionally around 150Mbps22:06
clarkb155.46 MiB | 1.70 MiB/s, done. <- to gitea01 all my data is via ipv4 as I don't have v6 here22:07
fungii just was 2sec average go a hair over 200Mbps22:07
fungier, just saw22:07
clarkbthat is pretty consistent with what i recall getting via gitea in the past22:07
fungiactually now i'm getting fairly poor performance directly to gitea04 so it's possible there is a backend issue22:11
fungimnaser: yeah this may not be as cut and dried as it seemed at first, and if clarkb's not seeing performance issues then it could be just impacting me and johnsom not everyone22:13
clarkb04 has plenty of memory available and cpu isn't spinning22:13
fungii'm going to do some clone tests to the other backends as well for comparison22:13
fungimy clone from the 04 backend averaged 643.00 KiB/s22:15
fungii'm getting much the same from the 01 backend now... i was seeing far better performance before. may need to test this from somewhere out on the 'net which doesn't share an uplink with lots of tourists watching netflix on a rainy friday evening22:22
fungiokay, so if there was a more general problem i'm not able to reproduce it now22:26
fungigetting 7.7 MiB/s from poland cloning via ipv6 at both the lb and directly from a backend22:27
fungi(in the ovh warsaw pop)22:27
fungithink i'm going to blame tourists and call it an afternoon, just as soon as i finish ironing out this nagging negative lookahead regex22:33
