Tuesday, 2021-09-21

Clark[m]	fungi: TheJulia: we're oversubscribed in that cloud and have the ability to control that to an extent	00:45
Clark[m]	I think the idea is we don't know what the right balance is so we have to fiddle with it	00:46
fungi	yep	00:47
yuriys	We did scale down a bit as of 10 hours ago. But yeah, performance over availability there.	00:47
yuriys	How can I get runtime data for the 809895 run?	00:48
Clark[m]	yuriys I can get links in a bit. Finishing dinner first	00:50
ianw	yuriys: i'd suspect it was https://zuul.opendev.org/t/openstack/build/626c1caaf4e34e91b4a1b961e3a2a21d/	00:51
fungi	yeah, i think that's likely it	00:53
fungi	it's the only voting job to be reported on change 809895 with a failure state in the most recent buildset	00:53
clarkb	ok just sent out tomorrow's meeting agenda. Sorry that was late	01:24
yuriys	We can probably scale down a bit more, maybe 32-36 limit. I still saw some ooms after we went down to 40. We had 1 launch error in the last 10ish hours, but I'm assuming that test runtime is outside of that.	01:36
yuriys	If test duration is a good show of performance, I'm interested in that as well, longterm.	01:37
clarkb	there are likely tests that are a decent indicator of that, but I'm not sure what those might be.	01:38
clarkb	and ya the test runtime should be orthogonal to launch failures. This would be after we have a VM: how quickly can it run the job content	01:38
fungi	but also, as we've observed, performance on one vm in isolation is often far better than performance on a vm when the same hosts are also running a bunch of other test instances flat out all at the same time	01:40
fungi	the "noisy neighbor" effect	01:41
clarkb	ya and we see that across clouds	01:42
yuriys	I want to say this 'expansion' created a lot of internal tickets for us, from how we approach overcommitting resources, to memory optimizations, to even ceph optimizations. Like one of the OSDs is caching at an insane rate, we're at 38GB for the 3 OSDs, which is hilarious since we set osd_memory_target at 4G...	01:42
yuriys	Also in all brutal honesty, for your use case, ceph is actually bad. We'd want to provision LVM on these NVMes, which we're testing in-house now.	01:43
fungi	yeah, donnyd had observed that local storage on the compute nodes performed far better	01:46
fungi	that's what he ended up doing in the fortnebula cloud	01:46
clarkb	he might have input on over subscription ratios too	01:46
yuriys	I'm calling it a night, if you guys see performance issues, feel free to scale down (Maybe to 32). We'll eventually find the right tenant maximum that doesn't impact any testing negatively, and that will help us later making correct calculations when scaling hardware. I'm not concerned over getting big numbers here, just finding the right numbers, and getting some experience what bad numbers end up doing to infra.	02:00
yuriys	Interestingly enough ended up watching Deploy Friday: E50 during this chat, which is heavily Zuul focused, many shoutouts to what you guys do there.	02:01
clarkb	https://www.youtube.com/watch?v=2c3qJ851QVI neat	02:04
fungi	woah	02:06
TheJulia	Is it bad I didn't remember which video that was until I saw myself and the background?	03:46
*** ysandeep|away is now known as ysandeep		05:38
*** rpittau|afk is now known as rpittau		07:24
*** jpena|off is now known as jpena		07:28
*** ykarel is now known as ykarel|away		07:34
opendevreview	daniel.pawlik proposed opendev/puppet-log_processor master: Add capability with python3; add log request cert verify https://review.opendev.org/c/opendev/puppet-log_processor/+/809424	07:38
*** ysandeep is now known as ysandeep|lunch		07:48
opendevreview	Balazs Gibizer proposed opendev/irc-meetings master: Add Sylvain as nova meeting chair https://review.opendev.org/c/opendev/irc-meetings/+/810165	08:23
opendevreview	Balazs Gibizer proposed opendev/irc-meetings master: Add Sylvain as nova meeting chair https://review.opendev.org/c/opendev/irc-meetings/+/810165	08:26
*** ysandeep|lunch is now known as ysandeep		09:09
*** ykarel|away is now known as ykarel		10:26
*** frenzy_friday is now known as anbanerj|ruck		10:36
*** jpena is now known as jpena|lunch		11:24
*** dviroel|out is now known as dviroel		11:31
*** ysandeep is now known as ysandeep|afk		11:47
opendevreview	Yuriy Shyyan proposed openstack/project-config master: Scaling down InMotion nodepool resource. https://review.opendev.org/c/openstack/project-config/+/810213	12:04
*** jpena|lunch is now known as jpena		12:22
*** ysandeep|afk is now known as ysandeep		12:39
opendevreview	Merged opendev/irc-meetings master: Add Sylvain as nova meeting chair https://review.opendev.org/c/opendev/irc-meetings/+/810165	12:53
*** tristanC_ is now known as tristanC		13:16
fungi	TheJulia: you did a good job in it, i watched it all the way through last night	13:21
TheJulia	\o/	13:22
yuriys	Nothing like waking up to a fire. fungi can you approve|c/r the scale down please.	13:22
fungi	yuriys: just did, sorry didn't spot it until moments before you asked	13:23
yuriys	Awesome ty. I already adjusted overcommit stuff on the cloud itself.	13:23
fungi	yuriys: also nodepool will pay attention to quotas, so if the cloud side scales down the ram, cpu or disk quota it will adjust its expectations accordingly	13:24
yuriys	Very cool, did not know. Will probably tackle via quotas as well then.	13:26
fungi	yeah, we have some public cloud providers who burst our activity by simply adjusting the ram quota on their side, so when they know they have extra capacity they ramp it up temporarily and then when they expect to be under additional load for other reasons they scale it way down, maybe even to 0	13:31
fungi	a more dynamic way to make adjustments, and faster than going through configuration management	13:32
opendevreview	Merged openstack/project-config master: Scaling down InMotion nodepool resource. https://review.opendev.org/c/openstack/project-config/+/810213	13:36
*** artom_ is now known as artom		13:41
opendevreview	Jeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste https://review.opendev.org/c/opendev/system-config/+/810253	14:21
opendevreview	Danni Shi proposed openstack/diskimage-builder master: Update keylime-agent and tpm-emulator elements Story: #2002713 Task: #41304 https://review.opendev.org/c/openstack/diskimage-builder/+/810254	14:23
opendevreview	Danni Shi proposed openstack/diskimage-builder master: Update keylime-agent and tpm-emulator elements https://review.opendev.org/c/openstack/diskimage-builder/+/810254	14:25
opendevreview	Merged opendev/system-config master: lodgeit: use logo from system-config assets https://review.opendev.org/c/opendev/system-config/+/809510	14:28
opendevreview	Merged opendev/system-config master: gerrit: copy theme plugin from plugins/ https://review.opendev.org/c/opendev/system-config/+/809511	15:13
clarkb	digging into the replication leaks a bit more: there is a current task to replicate tobiko to gitea02 from just over an hour ago. On gitea02 there is no receive pack process but there are processes for a couple of other replications that are happening	15:14
clarkb	Looking at netstat -np | grep 222 I see three ssh connections that correspond to the three receive packs that are present	15:15
clarkb	All that to say it really does seem like we aren't properly connecting to the remote end when this happens	15:16
yuriys	I noticed a couple instances are in Shut Down state. Is that normal? Is that the 'Available' state?	15:17
clarkb	yuriys: it is possible for test jobs to request a reboot. But typically I'm not sure that is normal	15:19
fungi	reboots also don't generally enter shutdown state, as they're just performed soft from within the guest and not via the nova api	15:21
clarkb	I think libvirt may detect that through the acpi stuff though	15:22
fungi	ahh, in any case they shouldn't remain in that state for more than a few seconds if so	15:22
clarkb	infra-root does anyone else want to look at gitea02 before I kill and restart the tobiko replication task?	15:22
*** dviroel is now known as dviroel|lunch		15:23
fungi	when you say "Looking at netstat..." do you mean on gitea02 or review?	15:23
*** ysandeep is now known as ysandeep|dinner		15:23
clarkb	gitea02	15:24
fungi	i guess the gitea side since netstat isn't installed on review	15:24
fungi	i went ahead and installed net-tools on review	15:26
clarkb	fungi: I think you are expected to use ss on newer systems like focal	15:26
clarkb	but net-tools shouldn't hurt either	15:27
fungi	never heard of ss, thanks	15:27
clarkb	fungi: ss is to netstat what ip is to ifconfig aiui	15:27
fungi	interestingly, there is an ssh socket to gitea02 which exists only on the review side and has no corresponding socket tracked on gitea02	15:29
fungi	199.204.45.33:32798 -> 38.108.68.23:222	15:29
opendevreview	Clark Boylan proposed opendev/system-config master: GC/pack gitea repos every other day https://review.opendev.org/c/opendev/system-config/+/810284	15:38
clarkb	I'm less confident ^ will help but it also shouldn't hurt	15:38
fungi	i have a feeling it's something network related, like a pmtud blackhole impacting one random router which only gets a small subset of the flow distribution or some stateful middlebox dropping a small percentage of tracked states at random	15:39
clarkb	fungi: ya the network timeout that we need to restart gerrit for is likely the best bet	15:40
clarkb	fungi: any objection to me stopping and restarting that tobiko replication task now?	15:42
fungi	clarkb: nah, go for it. i'm curious to see whether these connections clear up, and on which ends	15:45
*** ykarel is now known as ykarel|away		15:49
fungi	huh, a `git remote update` in zuul/zuul-jobs is taking forever on my workstation at "Fetching origin" (which should be one of the gitea servers)	15:49
fungi	fatal: unable to access 'https://opendev.org/zuul/zuul-jobs/': GnuTLS recv error (-110): The TLS connection was non-properly terminated.	15:49
clarkb	that would be to the haproxy not the backends ?	15:49
clarkb	but maybe something is busy (gitea02 was fairly quiet with a system load of 10	15:50
clarkb	er 1	15:50
fungi	maybe? we terminate ssl on the backends	15:50
fungi	haproxy is just a plani layer 4 proxy	15:50
fungi	plain	15:50
clarkb	http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all	15:50
fungi	at least in our config	15:50
clarkb	someone decided to update their openstack ansible install?	15:50
clarkb	:P	15:50
fungi	looks like it	15:50
fungi	supposedly osad now sets a unique user agent string when it pulls from git servers	15:51
clarkb	looks like 02 did suffer some of that	15:51
clarkb	small bumps on some other servers but it was largely 02	15:52
fungi	i'm currently being sent to CN = gitea06.opendev.org	15:52
clarkb	seems like it may be recovering now? possibly because haproxy told things to go away	15:52
clarkb	looks like it did OOM though	15:53
*** ysandeep|dinner is now known as ysandeep		15:56
fungi	looks like if it's openstack-ansible it'll have a ua of something along the lines of "git/unknown (osa/X.Y.Z/component)"	15:57
*** marios is now known as marios|out		15:59
clarkb	I don't see 'osa' in any of the UA strings during the period of time it appears to have been busy	16:00
fungi	[21/Sep/2021:15:58:03 +0000] "POST /openstack/zun-tempest-plugin/git-upload-pack HTTP/1.1" 200 224778 "-" "git/2.27.0 (osa/23.1.0.dev43/aio)"	16:00
fungi	as a sample	16:00
jrosser	that's a test / CI run because it has 'aio' in the string	16:01
fungi	thanks, i haven't tried to put any numbers together yet, just confirming whether i can find those	16:02
clarkb	jrosser: is there a reason the CI runs don't use the local git repos on the node?	16:02
jrosser	they try to as far as possible	16:02
clarkb	I think gitea06 may have become collateral damage in whatever this is. I can't reach it	16:03
fungi	so at least looking for osa ua strings, there's not a substantial spike in those on gitea02's access log around the time things started to go sideways	16:05
clarkb	06 does ping, but maybe OOMkiller hit it in an unrecoverable manner? I guess we can wait and see for a bit	16:06
fungi	filtering out any with /aio as the component, i can see some definite bursts but no clear correlation to spikes on the haproxy established connections graph	16:07
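A minimal sketch of the user agent filtering described above: pull the version and component out of an openstack-ansible style UA string so that CI (`aio`) traffic can be separated from other deployments. The regular expression and the names `parse_osa_ua` and `is_ci_run` are illustrative assumptions based only on the sample strings quoted in this log, not openstack-ansible's own definition of the format.

```python
import re

# Assumed shape, taken from the samples quoted in this log:
#   git/2.27.0 (osa/23.1.0.dev43/aio)
OSA_UA = re.compile(r"\(osa/(?P<version>[^/)]+)/(?P<component>[^)]+)\)")


def parse_osa_ua(user_agent):
    """Return (version, component) for an osa-style user agent, else None."""
    match = OSA_UA.search(user_agent)
    if match is None:
        return None
    return match.group("version"), match.group("component")


def is_ci_run(user_agent):
    """True when the component is 'aio', which the log says marks a test/CI run."""
    parsed = parse_osa_ua(user_agent)
    return parsed is not None and parsed[1] == "aio"
```

A plain `git/1.8.3.1` agent yields `None` here, so clients that predate the UA change cannot be told apart from ordinary git users this way.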
fungi	clarkb: unfortunately 06 is not so broken that haproxy has taken it out of the pool	16:08
fungi	i'm still getting my v6 connections balanced to 06	16:08
clarkb	fungi: are you sure? the haproxy log shows it as down	16:09
fungi	`echo | openssl s_client -connect opendev.org:https | openssl x509 -text | grep CN`	16:09
clarkb	oh then it flipped it back UP again weird	16:09
fungi	Subject: CN = gitea06.opendev.org	16:09
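The same check can be sketched with Python's standard `ssl` module: connect through the load balancer once and report the subject CN of whichever backend terminated TLS. This assumes, as the output above suggests, that each backend presents a certificate whose CN is its own hostname; the function names `common_name` and `backend_for` are hypothetical.

```python
import socket
import ssl


def common_name(cert):
    """Extract the subject commonName from the dict returned by getpeercert()."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def backend_for(host="opendev.org", port=443, timeout=10):
    """Connect once and return the CN of the backend that answered."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return common_name(tls.getpeercert())
```

Calling `backend_for()` repeatedly from one client would normally keep returning the same name if the balancer pins clients by source address, which matches fungi continuing to land on 06.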
clarkb	it flip flopped 05 and 06	16:10
opendevreview	Elod Illes proposed openstack/project-config master: Add stable-only tag to generated stable patches https://review.opendev.org/c/openstack/project-config/+/810287	16:10
clarkb	yesterday it properly detected the update to the images which do a rolling restart of the services (kind of cool to see that in the log as happening properly)	16:10
fungi	it's definitely hosed to the point where cacti can't poll snmpd though	16:10
clarkb	yup and no ssh either	16:10
fungi	but i guess apache is still semi-responsive, enough to complete tls handshakes	16:11
fungi	looks like gitea05 has also been knocked offline, yeah	16:12
fungi	i was able to ssh into 05 but it took a while to do the login	16:13
clarkb	same here	16:14
fungi	syslem load average is around 90	16:14
fungi	system	16:14
fungi	it's heavy into swap thrash	16:14
fungi	out of swap altogether in fact	16:14
clarkb	everytime this happens I seriously wonder if we shouldn't cgit again	16:14
clarkb	(it was far more resilient to connection count based load balancing rather than source)	16:15
fungi	or use apache to handle the git interactions and just rely on gitea as a browser	16:15
fungi	yeah, there's a gitea process using almost 12gb of virtual memory	16:16
clarkb	for 05 and 06 should we ask the cloud to reboot them? and/or manually remove them from the haproxy pool?	16:16
fungi	well, at the moment i expect they're acting as a tarpit for whatever's generating all this load	16:16
fungi	if we take them out of the pool, those connections will get balanced to another backend and knock it offline quickly as well	16:17
fungi	there was a sizeable spike in non-aio osa ua strings in requests to gitea05 at 14:33 and again at 14:44	16:19
clarkb	yup, but without access to the web server logs on those hosts it is hard to figure out what is causing this, but I'm looking at haproxy to see if there are any clues	16:19
fungi	i'll see if i can isolate those and possibly map them back to a client	16:19
clarkb	fungi: those happen well before the spike we see in cacti fwiw	16:20
clarkb	we are looking at a start of ~15:30 according to cacti	16:20
fungi	oh, yep, you're right. i'm looking at the wrong hour	16:20
fungi	/openstack/nova/info/refs?service=git-upload-pack is a clone, right?	16:22
fungi	or could be a fetch too	16:22
fungi	POST /openstack/nova/git-upload-pack	16:23
fungi	i think that's what i'm looking for actually	16:23
clarkb	ya that sounds right	16:24
fungi	yeah, a spike of 42 of those within the 15:35 minute on gitea05	16:24
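A rough sketch of that per-minute count, assuming the web server access log lines carry a timestamp and request in the shape of the sample fungi pasted earlier (`[21/Sep/2021:15:58:03 +0000] "POST /.../git-upload-pack HTTP/1.1" ...`). The regular expression and the function name `upload_packs_per_minute` are illustrative assumptions, not the commands actually used.

```python
import re
from collections import Counter

# Captures the timestamp down to the minute and the repository path of
# every POST to a git-upload-pack endpoint.
UPLOAD_PACK = re.compile(
    r'\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]*\] '
    r'"POST (?P<repo>/\S+)/git-upload-pack HTTP'
)


def upload_packs_per_minute(lines, repo="/openstack/nova"):
    """Count git-upload-pack POSTs for one repository, bucketed by minute."""
    counts = Counter()
    for line in lines:
        match = UPLOAD_PACK.search(line)
        if match and match.group("repo") == repo:
            counts[match.group("minute")] += 1
    return counts
```

Feeding it an open log file and calling `.most_common(5)` on the result would surface the busiest minutes first.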
clarkb	fwiw on the load balancer: `grep 'Sep 21 15:[345].* ' syslog | grep gitea06 | cut -d' ' -f 6 | sed -e 's/\(.*\):[0-9]\+/\1/' | sort | uniq -c | sort` shows a couple of interesting things	16:24
fungi	normally we see around 1-5 of them in a minute during the surrounding timeframe	16:25
clarkb	if I add 05 to that list then there is strong correlation between two IPs	16:25
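The idea behind that pipeline can be sketched in Python: take the sixth space-separated field of each haproxy syslog line as `address:port`, drop the port, and count connections per client for a given backend. The field position comes from the `cut -f 6` in the command above; the function names are hypothetical, and unlike the shell version this sketch skips lines whose sixth field has no port (such as the `[WARNING]` health-check messages).

```python
from collections import Counter


def client_ip(syslog_line):
    """Return the client address from the sixth space-separated field."""
    fields = syslog_line.split(" ")
    if len(fields) < 6:
        return None
    # Drop only the trailing :port so IPv6 addresses keep their colons.
    address, sep, _port = fields[5].rpartition(":")
    return address if sep else None


def connections_per_client(lines, backend):
    """Count connections per client address on lines mentioning backend."""
    counts = Counter()
    for line in lines:
        if backend not in line:
            continue
        ip = client_ip(line)
        if ip is not None:
            counts[ip] += 1
    return counts
```

Running it once per backend and comparing the top entries is one way to see whether the same addresses dominate both 05 and 06.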
fungi	i'm going to try to identify the client addresses associated with those nova clones at 15:35	16:25
*** jpena is now known as jpena|off		16:25
*** rpittau is now known as rpittau|afk		16:25
clarkb	fungi: I just PM'd you the IP I expect it to be based on the haproxy data	16:25
fungi	yep, thanks, i'll see if they like nova a lot	16:26
*** dviroel|lunch is now known as dviroel		16:28
fungi	these were the source ports on the haproxy side of the connection for those nova clones at 15:35:	16:29
fungi	60478 60452 60702 60652 60588 60692 60660 60442 60466 60610 60464 60566 60460 60650 60616 60958 60916 60400 60730 32862 60992 60802 60838 32808 60910 32818 60882 60834 60954 60852 60932 60972 60746 60694 60978 32848 60918 60962 60876 32776 60844 33314	16:29
fungi	to gitea05	16:29
clarkb	60478 maps to the IP I shared with you	16:30
clarkb	60852 does as well	16:31
clarkb	the correlation is starting to get stronger :)	16:31
clarkb	fungi: though it looks like that IP did end up stopping about half an hour ago	16:33
clarkb	fungi: maybe if we restart things we'll be ok	16:33
clarkb	?	16:33
clarkb	based on that correlation and the lack of that IP showing up for the last bit that is my suggestion	16:34
clarkb	I suspect what happened according to the log is that 6 went down then they were sent to 5. Then 6 came up and 5 went down and that happened back and forth	16:35
clarkb	and UP here isn't a very strong metric apparently :)	16:36
fungi	so 100% of those 42 nova clone operations logged during the 15:35 minute on gitea05 came from the same ip address you noted as having a surge in connections through the proxy	16:38
fungi	though they showed up as being during the 15:37 minute on the haproxy log	16:38
clarkb	fungi: I think that is because it takes a few minutes for haproxy to close the connections	16:39
clarkb	fungi: note they all have status of cD or sD in the log lines which is an exceptional state from haproxy aiui	16:39
clarkb	-- is normal	16:39
fungi	cacti is starting to be able to get through to gitea05 again	16:39
fungi	and gitea06 seems like it wants to finish logging me in... it did print the motd just no shell prompt yet	16:40
clarkb	progress!	16:40
fungi	spike in nova clones on gitea06 was logged at 15:32-15:33 but it was hit much harder too, and stopped really doing anything according to its logs after 15:34	16:43
clarkb	fungi: ya then I think it flipped over to 05 when 06 was noted as down	16:44
fungi	115 nova clones in that 120 second timeframe	16:44
clarkb	Sep 21 15:35:27 gitea-lb01 docker-haproxy[786]: [WARNING] (9) : Server balance_git_https/gitea06.opendev.org is DOWN, reason: Layer4 timeout	16:44
clarkb	Sep 21 15:40:14 gitea-lb01 docker-haproxy[786]: [WARNING] (9) : Server balance_git_https/gitea05.opendev.org is DOWN, reason: Layer4 timeout	16:44
clarkb	basically it went from 06 to 05	16:44
fungi	running the cross-log analysis with haproxy now	16:45
clarkb	and for some reason didn't continue on to 01 02 03 04 etc	16:45
clarkb	possibly because 06 went back up Sep 21 15:37:47 gitea-lb01 docker-haproxy[786]: [WARNING] (9) : Server balance_git_https/gitea06.opendev.org is UP, reason: Layer4 check passed and so it stuck to only 05 and 06	16:45
clarkb	load average on 05 doesn't seem to be getting better	16:49
clarkb	fungi: are you good with attempting to reboot 05 and 06 now?	16:50
fungi	yeah, let's	16:50
clarkb	I can't get to 06, if you are still on it did you want to try sudo rebooting both of them?	16:50
clarkb	then if that doesn't work we can ask the cloud to do it for us	16:50
fungi	i'm logged into both so yep, can do	16:54
fungi	and done	16:54
fungi	we'll see if they manage to shut themselves down cleanly and reboot	16:54
fungi	05 is back up again	16:56
clarkb	gitea05 closed my connection at least	16:56
fungi	06 is still booting i think	16:56
fungi	or might still be shutting down, but it closed my connection at least	16:56
clarkb	load is a bit high on 05	16:56
clarkb	but not as high as before	16:56
clarkb	is there a secondary dos?	16:56
fungi	jrosser: user agent on these nova clones we observed was just "git/1.8.3.1" so i have a feeling it still could be an osa site	16:57
jrosser	it could easily be	16:58
fungi	in a 3-minute span we saw >150 clones of nova from a single ip address, so likely behind a nat	16:58
jrosser	i think we backported that user agent stuff all the way back to T	16:58
jrosser	but it does require them to have moved to a new tag	16:58
clarkb	looks like load is falling back down again on 05 I guess it just had to catch up	16:59
clarkb	also gerrit replication shows retries enqueued for pushing to 05 (we want to see that so good to confirm it happens)	16:59
fungi	git 1.8.3.1 is fairly old... is that the default git on centos 7 maybe?	17:00
clarkb	fungi: I think it is	17:00
clarkb	05 looks normal now	17:01
clarkb	plenty of memory, reasonable system load etc	17:01
fungi	https://centos.pkgs.org/7/centos-x86_64/git-svn-1.8.3.1-23.el7_8.x86_64.rpm.html	17:01
fungi	yeah, centos 7 seems likely	17:01
fungi	i'm still getting a "no route to host" for gitea06	17:02
clarkb	I'm going to go find some breakfast since we are just waiting on 06 to restart and 05 showed a restart seems to make things happier and replication handles it properly	17:03
clarkb	I was hoping to do more zuul reviews this morning :) maybe I can do those this afternoon and the gerrit account emails can happen tomorrow	17:03
fungi	i was trying to start on another zuul-jobs change which is what caused me to notice things weren't working right	17:03
clarkb	ya I noticed because I was trying to look at gitea06 like I had looked at gitea02 to debug the replication slowness	17:04
clarkb	oh and at some point we really should restart gerrit to pick up the timeout change	17:05
*** ysandeep is now known as ysandeep|away		17:13
clarkb	fungi: looks like 06 has been up for about 5 minutes	17:13
fungi	yeah, i'm able to ssh into it now	17:13
fungi	git sidetracked trying to write docs	17:14
fungi	system load average is nice and low	17:14
clarkb	there are three older replication tasks to 06 one each for cinder ironic and neutron. If they don't complete soon they may need to be restarted as well	17:14
clarkb	There are many retry tasks for replication to 06 though so generally seems to have detected it needs to try again	17:15
clarkb	I only see a push for keystone and not the other three (implying they are in a similar state to the previous tobiko replication task. gitea knows nothing about them)	17:16
clarkb	I'll give them 5 more minutes then manually intervene	17:16
*** sshnaidm is now known as sshnaidm|off		17:22
clarkb	and done. Will check queue statuses in a bit but I expect we'll be recovering to more normal situation soon	17:23
clarkb	fungi: thinking out loud here maybe we can run iperf3 tests between rax dfw and gitea0X and compare to a similar run between review02 and gitea0X	17:26
clarkb	the giteas were never in the same cloud region but it seems that replication might be a fair bit slower now?	17:26
clarkb	mnaser: ^ fyi if that is something you might already know about	17:27
clarkb	don't have hard data on it but possibly also having started since we migrated the review server to the new dc?	17:27
fungi	also can't rule out that these hangs are related to activity spikes and oom events on the gitea side	17:28
mnaser	clarkb: that's strange, the hardware should be way quicker and should be faster ceph systems. i wonder if it has something to do with the kernel version in your vm vs host (as that is significantly newer)	17:28
clarkb	fungi: well I have checked some of the hosts and some don't have recent OOMs	17:29
clarkb	fungi: gitea03 for example is quite clean but also has had issues	17:29
clarkb	the other odd thing is it seems to happen between 14:00UTC and 18:00UTC	17:31
fungi	okay, so yes that does seem like occasional problems crossing the internet	17:31
clarkb	our cacti data shows the last few days during this period of time has been clean except for today	17:31
clarkb	note it is possible that is observer bias as I tend to check in the morning. It could be happening at other times but some timeout is finally occurring and cleaning them up so we don't see them queued the next morning	17:33
opendevreview	Jeremy Stanley proposed zuul/zuul-jobs master: Deprecate EOL Python releases and OS versions https://review.opendev.org/c/zuul/zuul-jobs/+/810299	17:33
clarkb	236edacc 17:05:46.304 [43642e93] push ssh://git@gitea03.opendev.org:222/openstack/releases.git	17:34
clarkb	That one appears to have "leaked"	17:34
clarkb	gitea03 has not OOM'd	17:34
fungi	could they be getting cleaned up after 2 hours? 3? what's the oldest you observed?	17:34
clarkb	fungi: I've observed ~14:00ish at ~18:00	17:34
fungi	okay, so at least 4 hours i guess	17:35
clarkb	do we want to see what happens with 236edacc ?	17:35
fungi	sure, if someone complains about an old ref there we can always abort the experiment and kill that task so it catches up	17:35
clarkb	ok	17:36
clarkb	f54021d3 and 1c9c079d appear to have leaked on 05	17:42
clarkb	those are both post reboot tasks so no OOM there either	17:42
*** ysandeep|away is now known as ysandeep		17:44
clarkb	two things I'll note. the giteas don't appear to have AAAA records but do have configured ipv6 addresses. This means gerrit is going to talk to them over ipv4 only	17:44
clarkb	pinging from gitea05 to review02 over ipv6 results in no route to host	17:44
fungi	yeah, i want to say the original kubernetes deployment design limited us to only using ipv4 addresses, but since the lb is proxying to them anyway it was irrelevant for end users	17:48
fungi	since we ended up not sticking with kubernetes there, we've got ipv6 addresses, just never added any aaaa records	17:49
clarkb	in this case it is a good thing because it seems the ipv6 cannot route	17:57
clarkb	I did a ping -c 100 from gitea05 to review02 and vice versa and both had a 2% loss	17:57
opendevreview	Danni Shi proposed openstack/diskimage-builder master: Update keylime-agent and tpm-emulator elements https://review.opendev.org/c/openstack/diskimage-builder/+/810254	18:06
clarkb	rerunning the ping -c 100 test to see how consistent that is	18:07
opendevreview	Merged openstack/project-config master: Update neutron-lib grafana dasboard https://review.opendev.org/c/openstack/project-config/+/806138	18:07
fungi	clarkb: ianw: not urgent, but related to recently approved changes and i'm looking for a suggestion as to the best way to tackle it: https://review.opendev.org/810253	18:15
clarkb	yuriys: we're noticing some connectivity issues to https://registry.yarnpkg.com/@patternfly/react-tokens/-/react-tokens-4.12.15.tgz from 173.231.255.74 and 173.231.255.246 in the inmotion cloud. Currently I can fetch that url with wget from the hosts that have those IPs assigned to them.	18:16
clarkb	yuriys: I guess I'm wondering if there are potential routing issues with those IPs or maybe the neutron routers/NAT might be struggling?	18:16
clarkb	oh there wouldn't be NAT	18:16
clarkb	just the neutron router I think	18:16
clarkb	no packet loss on second pass of ping -c between gitea05 and review02	18:17
clarkb	fungi: left a response to your question on that change. I'm not 100% sure of that but maybe 90% sure	18:19
opendevreview	Jeremy Stanley proposed opendev/system-config master: Switch IPv4 rejects from host-prohibit to admin https://review.opendev.org/c/opendev/system-config/+/810013	18:19
fungi	thanks!	18:19
opendevreview	Jeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste https://review.opendev.org/c/opendev/system-config/+/810253	18:24
fungi	also the screenshot for that in the run job was a huge help, made it quite obvious my naive first attempt was worthless	18:25
opendevreview	Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.15.3 https://review.opendev.org/c/opendev/system-config/+/803231	18:29
opendevreview	Clark Boylan proposed opendev/system-config master: DNM force gitea failure for interaction https://review.opendev.org/c/opendev/system-config/+/800516	18:29
opendevreview	Clark Boylan proposed opendev/system-config master: Upgrade gitea to 1.14.7 https://review.opendev.org/c/opendev/system-config/+/810303	18:29
clarkb	infra-root ^ I put a hold on the last change in that stack to verify 1.15.3. I think we should consider going ahead and landing the 1.14.7 upgrade to keep up to date there. Then for the 1.15.3 update I'd like to do that after the gerrit theme logo stuff that ianw has pushed is done	18:29
clarkb	and I'm beginning to think maybe we do a combo restart of gerrit for the theme update and the replication timeout config change	18:30
fungi	a reasonable choice	18:30
clarkb	Then after all that we can do the buster -> bullseye updates for those images (I have changes up for those as well)	18:31
*** ysandeep is now known as ysandeep|out		18:32
clarkb	I have +2'd the two gerrit changes at the end of the logo stack but didn't approve them as they are a bit more involved than the previous changes. I figure we can double check the above plan with ianw then proceed from there with approving those updates?	18:39
fungi	wfm	18:40
fungi	and hopefully 810253 will work as intended now	18:42
opendevreview	Jeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste https://review.opendev.org/c/opendev/system-config/+/810253	19:05
clarkb	infra-root https://review.opendev.org/c/opendev/system-config/+/810303 has been +1'd by zuul. I'm around all afternoon if we want to proceed with that. I do have to pick up kids from school though for a shortish gap in keyboard time	19:50
clarkb	that is the gitea 1.14.7 update	19:50
ianw	fungi: the paste not showing the logo in the screenshot is weird	19:52
ianw	especially when it seems like the wget returned it correctly	19:52
fungi	ianw: well, my test also failed and so there should be a held node for it now	19:53
fungi	the get returned a 5xx error	19:53
clarkb	fungi: have you ever seen anything like the error in https://zuul.opendev.org/t/openstack/build/e665dbc7368e44caa398e8c130c4151a ? seems apt had problems?	19:54
clarkb	maybe we fetched an incomplete file? but hash verification should catch that first?	19:54
ianw	fungi: oh indeed https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d26/810253/3/check/system-config-run-paste/d266360/bridge.openstack.org/test-results.html	19:55
fungi	clarkb: cannot copy extracted data for './usr/bin/dockerd' to '/usr/bin/dockerd.dpkg-new': unexpected end of file or stream	19:56
fungi	my guess is it was truncated, yeah	19:56
clarkb	I'll recheck that change once it reports I guess	19:57
ianw	"[pid: 14|app: -1|req: -1/2] 127.0.0.1 () {32 vars in 435 bytes} [Tue Sep 21 19:44:07 2021] GET /assets/opendev.svg" <- so the request made it to lodgeit, which it shouldn't have you'd think	19:57
fungi	clarkb: i'm betting the working run will show a larger file size for docker-ce than 21.2 MB	19:58
ianw	but also, it looks like mysql wasn't ready -> https://zuul.opendev.org/t/openstack/build/d266360944434e288db1880729d809dc/log/paste01.opendev.org/containers/docker-lodgeit.log#144	19:58
fungi	ianw: possible my location section in the vhost config isn't right. i can fiddle with it on the held node when i get a moment	19:58
fungi	according to the apache 2.4 docs, location /assets/ should cover /assets/opendev.svg	19:59
fungi	and therefore be excluded from the proxy	20:00
ianw	https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/lodgeit/templates/docker-compose.yaml.j2#L28 -> we sleep for 30 seconds for mariadb to be up	20:00
ianw	Sep 21 19:42:03 -> Sep 21 19:42:39 paste01 docker-mariadb[10998]: 2021-09-21 19:42:39 0 [Note] mysqld: ready for connections.	20:01
ianw	that's ... 36 seconds from start to ready?	20:01
fungi	that provider may still not be running at consistent performance levels to our others	20:07
ianw	yeah it was inmotion	20:09
ianw	it should use a proper polling wait; the sleep was just expedience but we could include the wait-for-it script	20:10
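A minimal sketch of what a polling wait could look like in place of a fixed sleep: retry a TCP connection until it succeeds or a deadline passes. This is roughly the check a wait-for-it style script performs; the function name `wait_for_port` and its defaults are assumptions, and an open port is only a proxy for the database being ready to serve queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For MariaDB on its default port this would be called as `wait_for_port("127.0.0.1", 3306)`; on a slow provider it simply waits longer instead of failing at the 30 second mark.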
fungi	huh, fun, my ssh to that held node seems to have just hung	20:15
fungi	nevermind, it resumed	20:16
fungi	bizarre, i tried treating opendev.svg exactly like robots.txt in the apache config on the held node, and it's still getting proxied to lodgeit	20:20
fungi	oh, duh	20:25
opendevreview	Jeremy Stanley proposed opendev/system-config master: Use Apache to serve a local OpenDev logo on paste https://review.opendev.org/c/opendev/system-config/+/810253	20:28
fungi	ianw: ^ turns out the mistake was a surprisingly simple one	20:28
fungi	i only added the logo to the http vhost, not the https one	20:29
* fungi sighs audibly		20:29
ianw	oh, doh. that's right we allowed the http for the old config file	20:32
clarkb	it's been just over 3 hours on those leaked replication tasks and they are still present	20:38
fungi	well, we guessed the timeout is at least 4 hours there	20:42
clarkb	ya or at least 4 hours	20:43
clarkb	just calling out the data doesn't contradict this yet	20:43
clarkb	following up on the nodepool zk data backups that appears to be working as expected	20:46
clarkb	gitea 1.15.3 continues to look good https://158.69.73.109:3081/opendev/system-config	20:56
mnaser	clarkb: i think you caught network in an odd time where we were flipping some bits, let me know if you continue to see some instability-ish	21:00
clarkb	mnaser: will do	21:00
clarkb	mnaser: fwiw the ipv6 ping from gitea05 to review02 still says Destination unreachable: No route	21:04
mnaser	clarkb: yes, that's still a 'working on fixing it' :(	21:04
clarkb	got it	21:04
*** dviroel is now known as dviroel|out		21:07
yuriys	Just caught up on chat. Okay, looks like we need to scale down once more to improve the CI performance, 36 sec to start MySQL is awful lol.	21:35
yuriys	Quick question on workload distribution, when a test is queued does it pull a 'worker' at random from a list of available instances, or does it pull an instance out of a serial list of available instances for testing?	21:38
yuriys	The reason I ask is that one of the nodes in the inmotion cloud was heavily under used while the other exploded, which is where I'm guessing ianw's test ran, but it looks like there may have been multiple instances used at the same time, which is fine, but looking to optimize for better/more responsive load distribution.	21:43
ianw	yuriys: umm, the nodes are up and in a "ready" state before they are assigned to run tests	21:47
yuriys	Yeah, what I've seen so far, is instance is created "Launch Attempt", then they are shut off and go to an "Available" state, then if they are selected they go to "In Use" state. This is the stuff that gets pushed to grafana so I'm going from that.	21:49
ianw	if you were just looking at the cloud side, you see vm's come up that are lightly used that will sit for an indeterminate amount of time, when under load not very long, before being assigned as workers when they start doing stuff	21:49
ianw	i need to look at why https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is not showing openstackapi stats	21:50
yuriys	No other providers have that info.	21:51
yuriys	I thought it was just 'permabroken' lol.	21:51
Guest490	ianw: i haven't looked into it but https://zuul-ci.org/docs/nodepool/releasenotes.html#relnotes-3-6-0-upgrade-notes may be relevant to missing stats	21:54
*** Guest490 is now known as corvus		21:55
*** corvus is now known as _corvus		21:56
*** _corvus is now known as corvus		21:56
ianw	i think it might be related to https://review.opendev.org/c/zuul/nodepool/+/786862	21:57
ianw	we're graphing : stats.timers.nodepool.task.$region.ComputePostServers.mean	22:07
ianw	i think that should be compute.POST.servers now	22:09
opendevreview	Yuriy Shyyan proposed openstack/project-config master: Improve CI performance and reduce infra load. https://review.opendev.org/c/openstack/project-config/+/810326	22:11
opendevreview	Ian Wienand proposed openstack/project-config master: grafana: fix openstack API stats for providers https://review.opendev.org/c/openstack/project-config/+/810329	22:25
fungi	okay, back now. dinner ended up slightly more involved than i anticipated	22:27
corvus	i think everything we were interested in getting into zuul has landed, so i'd like to start working on a restart now	22:32
fungi	i'm happy to help	22:32
corvus	fungi: you want to establish if now is okay time wrt openstack?	22:33
corvus	i'll run pull meanwhile	22:33
fungi	yeah, i'm checking in with the release team	22:33
fungi	our gerrit logo changes haven't been approved yet, so we can just do gerrit restart separately later	22:33
corvus	most recent promote succeeded, and i've pulled images, so we're up to date now	22:34
clarkb	fungi: I think that is fine. The two restarts are sufficiently quick compared to each other that we don't need to try and squash them together I don't think	22:34
opendevreview	Merged openstack/project-config master: Improve CI performance and reduce infra load. https://review.opendev.org/c/openstack/project-config/+/810326	22:34
fungi	i've let the openstack release team know we're restarting zuul, and there are no changes in any of their release-oriented zuul pipelines right now so should be non-impacting there	22:35
fungi	should be all clear to start	22:36
corvus	thanks, i'll save qs and run the restart playbook	22:36
corvus	starting up now	22:37
TheJulia	I was just about to ask....	22:37
fungi	see, no need to ask! ;)	22:38
TheJulia	lol	22:38
fungi	we promise to try to avoid unnecessary restarts next week when we expect things to get more frantic for openstack ;)	22:39
TheJulia	Well, I actually had ironic's last change before releasing in the check queue.... :)	22:39
fungi	(not that this restart is unnecessary, it fixes at least one somewhat nasty bug, and we don't like to release new versions of zuul without making sure opendev's happy on them)	22:40
TheJulia	I also found a find against /opt in our devstack plugin which I'm very promptly ripping out because that makes us bad opendev citizens	22:40
fungi	oof	22:40
TheJulia	fungi: that was my reaction when I spotted it	22:40
fungi	thank you for helping take out the trash ;)	22:41
corvus	our current zuul effort is in making it so no one notices downtime again... so every restart now is an infinite number of restarts avoided in the future :)	22:42
corvus	you can't argue with that math. ;)	22:42
fungi	that too!	22:42
clarkb	ya we've been doing a lot of incremental improvements to get closer to removing the spof	22:42
fungi	restartless zuul	22:42
fungi	it's nearing everpresence	22:42
clarkb	This is one of the things that has motivated me to do all this code review :)	22:43
corvus	much appreciated :)	22:43
clarkb	looks like it is done reloading configs?	22:49
corvus	i think it's reloading something (still? or again?)	22:50
corvus	re-enqueuing now	22:51
clarkb	changes and jobs are showing up as queued	22:51
fungi	lgtm	22:52
yuriys	hmmm how do you guys identify which provider gets selected for a particular task?	22:53
clarkb	yuriys: every job records an inventory and in that inventory are hostnames that indicate the cloud provider	22:54
clarkb	yuriys: the beginning of the job-output.txt also records a summary of that info (so you can see it in the live stream console)	22:54
yuriys	thank you, found localhost | Provider: xxxx in one of the logs	22:55
clarkb	yuriys: a single job will always have all of its nodes provided by the same provider too	22:56
yuriys	when change is successfully merged by zuul, what triggers a build?	22:56
clarkb	*a single build of a job	22:56
clarkb	yuriys: zuul's gerrit driver will see the merge event sent by gerrit then the pipeline configs in zuul can match that and then trigger their jobs	22:57
yuriys	> a single job will always have all of its nodes provided by the same provider	22:57
yuriys	This explains the explosions!!!	22:57
fungi	unless you're talking about speculative merges, rather than merging changes which have passed gating	22:57
clarkb	yuriys: basically zuul has an event stream open to gerrit and for every event that gerrit emits it evaluates against its pipelines	22:57
yuriys	so if you guys just restarted things	22:58
fungi	"merge" is used in multiple contexts, so it's good to be clear which scenario you're asking about	22:58
yuriys	what are the odds that stream got cut	22:58
yuriys	https://review.opendev.org/c/openstack/project-config/+/810326	22:58
yuriys	no build	22:58
clarkb	yuriys: the deploy job for that is currently running (zuul got restarted so we had to restore queues)	22:59
fungi	check zuul's status page for the openstack tenant, all the builds for that change got re-added to pipelines	22:59
clarkb	https://zuul.opendev.org/t/openstack/status and look for 810326	22:59
yuriys	Ah I saw stuff under [check] but deploy was empty for a bit	22:59
yuriys	i see it now	22:59
corvus	re-enqueue complete	23:00
clarkb	ya it isn't instantaneous as each one of those enqueue actions after a restart has to requery git repos	23:00
fungi	yeah, the re-enqueuing was scripted so it doesn't all show back up at once	23:00
corvus	#status log restarted all of zuul on commit 0c26b1570bdd3a4d4479fb8c88a8dca0e9e38b7f	23:00
opendevstatus	corvus: finished logging	23:00
fungi	thanks corvus!	23:00
clarkb	fungi: it's been almost 6 hours on those leaked replications. I guess maybe we wait ~8 hours and then manually clean them up or do we want to leave them until tomorrow?	23:01
clarkb	the mass of failures on some check changes seem to be legit (pip dep resolution problems)	23:03
fungi	clarkb: i think it's safe to assume the queue times you were observing for those replication tasks weren't particularly biased by the time you were checking them, so i'd be fine just cleaning them up at this point	23:04
clarkb	ya I'm beginning to suspect there is something interesting about the time period they show up in. Network instability during those periods of time for example	23:05
clarkb	rather than it being a side effect of some sort of long timeout	23:05
clarkb	I'll give them a bit longer. I don't have to make dinner for a bit	23:06
yuriys	build failed : (	23:06
clarkb	neat let me go see why	23:06
yuriys	is it waiting on logger?	23:07
yuriys	https://zuul.opendev.org/t/openstack/build/a9c7f49c293f4659befe7ae1e3353ca5/log/job-output.txt	23:07
clarkb	no that is the bit I was telling you about where we don't let zuul stream those logs out. We keep the logs on the bastion to avoid unexpected leakages of sensitive info	23:07
clarkb	looking at the log on the bastion it failed because nb01 and nb02 had some issue. I think your change is only needed on nl02 and so we should be good from the scale down perspective	23:08
clarkb	ya https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 reflects the change	23:08
fungi	that bit of log redaction is specific to our continuous deployment jobs, not typical of test jobs	23:08
yuriys	kk	23:08
fungi	we just want to make sure that the ansible which pushes production configs doesn't inadvertently log things like credentials if it breaks	23:09
clarkb	nb01 and nb02 failed to update project-config which goes in /opt because /opt is full	23:09
yuriys	yeah i got that part, hard to track what failed though lol	23:09
clarkb	I'll stop builders on them now and then work on cleaning them up	23:09
yuriys	Cool, well, hopefully this is the last one, might have to fiddle with placement distribution limits, our weakness here is just the quantity of nodes.	23:11
ianw	... we just had an earthquake!	23:17
yuriys	woah	23:18
fungi	everything okay there?	23:19
ianw	yep, well the internet is still working! :) but wow, that got the heartrate up	23:20
yuriys	easy calorie burn	23:20
ianw	i felt a few in bay area when i lived there, but this was bigger bumps than them	23:21
clarkb	wow	23:21
artom	"easy calorie burn" is it though? Feels like a lot of trouble for some cardio ;)	23:22
ianw	it wasn't knock things off shelves level. still, why not add something else to worry about in 2021 :)	23:25
yuriys	don't worry, 2021 not over yet	23:26
yuriys	sorry, correction, worry, 2021 not over yet	23:26
clarkb	any idea why we seem to have a ton of fedora-34 images?	23:27
clarkb	that seems to be at least part of the reason that nb01 and nb02 have filled their disks	23:27
clarkb	I have cleaned out their /opt/dib_tmp as well as stale intermediate vhd images and that helped a bit	23:27
opendevreview	Merged opendev/system-config master: Use Apache to serve a local OpenDev logo on paste https://review.opendev.org/c/opendev/system-config/+/810253	23:28
ianw	we should just have the normal amount (2). but it is the only thing using containerfile to build so might be a bug in there	23:28
clarkb	ianw: hrm I cross checked against focal as a sanity check and it has 2 for x86 and 2 ready + 1 deleting for arm64	23:28
clarkb	but fedora-34 has many many	23:28
clarkb	oh you know what	23:28
clarkb	one sec	23:28
clarkb	2021-09-21 23:29:03,556 ERROR nodepool.zk.ZooKeeper: Error loading json data from image build /nodepool/images/fedora-34/builds/0000007388	23:29
clarkb	I suspect that issue in the zk db is preventing nodepool from cleaning up the older images	23:29
clarkb	corvus: ^ is that something you think you'd like to look at or should we just rm the node or?	23:29
clarkb	fwiw I think I cleaned enough disk that we can look at this tomorrow	23:29
clarkb	but probably won't want to wait much longer than that	23:30
opendevreview	Merged openstack/project-config master: grafana: fix openstack API stats for providers https://review.opendev.org/c/openstack/project-config/+/810329	23:31
clarkb	/nodepool/images/fedora-34/builds/0000007388 is empty and is the oldest build	23:35
clarkb	I don't see the string 7388 in /opt/nodepool_dib on either builder	23:36
clarkb	I suspect some sort of half completed cleaning of the zk db and we should go ahead and rm that znode	23:36
clarkb	However, I'll let corvus confirm there isn't further debugging that wants to happen first	23:37
clarkb	I cleaned up the replication queue as it is getting close to dinner	23:39
clarkb	the replication queue is now empty even after I reenqueued the tasks	23:39
fungi	nice, thanks	23:39
corvus	clarkb: i don't feel a compelling need to debug that right now, so if you want to manually clean up that's great	23:41
fungi	if it's an actual persistent or intermittent issue, i'm sure we'll have more samples soon enough	23:41
clarkb	ok I will rm that single entry then I expect nodepool will clean up after itself from there	23:42
clarkb	oh wait it won't let me rm it because it has subnodes and shows ovh-gra1 has the image?	23:43
clarkb	let me check on what ovh-gra1 sees	23:43
clarkb	nodepool image list didn't show it, and its subnode for images/, where I think it records that, was empty, so I went ahead and cleaned up everything below 7388 as well as 7388	23:46
clarkb	the exception listing dib images is gone	23:47
ianw	https://earthquakes.ga.gov.au/event/ga2021sqogij	23:47
ianw	clarkb: i'll let you poke at it, take me longer to context switch in i imagine	23:48
clarkb	ianw: ya, I think this may be all that was necessary then the next time nodepool's cleanup routines run it will clean up the 460something extra records	23:48
clarkb	heh now /nodepool/images/fedora-34/builds/0000007392 is sad but we went from 467 to 460 :)	23:50
clarkb	that one is in the same situation so I'll give it the same treatment	23:51
corvus	wow M6 is not nothing	23:52
clarkb	/nodepool/images/fedora-34/builds/0000007405 now	23:56
