Tuesday, 2022-08-16

*** ysandeep\|holiday is now known as ysandeep		05:28
*** ysandeep is now known as ysandeep\|ruck		05:29
*** gibi_pto is now known as gibi		07:26
*** jpena\|off is now known as jpena		07:34
opendevreview	Martin Kopec proposed opendev/system-config master: refstack: trigger image upload https://review.opendev.org/c/opendev/system-config/+/853251	07:48
*** ysandeep\|ruck is now known as ysandeep\|ruck\|lunch		08:05
*** ysandeep\|ruck\|lunch is now known as ysandeep\|ruck		08:28
*** ysandeep\|ruck is now known as ysandeep\|ruck\|afk		10:46
*** ysandeep\|ruck\|afk is now known as ysandeep\|ruck		11:17
*** dviroel_ is now known as dviroel		11:38
*** dasm\|off is now known as dasm		13:55
*** enick_727 is now known as diablo_rojo		14:09
*** ysandeep\|ruck is now known as ysandeep\|dinner		15:11
*** marios is now known as marios\|out		15:29
frickler	fungi: could you add https://review.opendev.org/c/opendev/system-config/+/853189 to your review list pls? would help me make progress on the kolla side	15:36
fungi	frickler: lgtm. i haven't checked afs graphs to make sure there's sufficient space, but i doubt it will be a problem	15:39
clarkb	UCA tends to be small	15:45
clarkb	like less than 1GB	15:45
frickler	thx. 6.2G used for all existing releases, with 50G quota, seems fine	15:48
frickler	but looking at that page shows wheel releases are 7 days old, any known issue?	15:48
fungi	not known to me, but sounds like something has probably broken the update job for those	15:51
frickler	wheel-cache-centos-7-python3 \| changed: Non-fatal POSTIN scriptlet failure in rpm package dkms-openafs-1.8.8.1-1.el7.x86_64	15:54
frickler	https://zuul.opendev.org/t/openstack/build/1f4e9a0f03064ef68d63035151fd5b6d	15:54
*** ysandeep\|dinner is now known as ysandeep		15:54
frickler	might be a one off, though, earlier failures were the f-string issue it seems	15:56
clarkb	I think we might generate that package from the upstream packages? It is possible we need to update our package to accommodate some centos 7 change	15:56
clarkb	but ya maybe we let it run and see if the f string fix was the more persistent issue?	15:57
*** ysandeep is now known as ysandeep\|out		16:00
*** jpena is now known as jpena\|off		16:33
fungi	system-config-run-base failed in gate on the uca addition	16:50
fungi	https://zuul.opendev.org/t/openstack/build/36ede125be004d58bd5ee4dfd9455b6e	16:50
clarkb	I think that is a recheck. It was testinfra bootstrapping that failed	17:02
fungi	ahh	17:21
dviroel	fungi: i see more jobs in tripleo queue failing with "ssh_exchange_identification: Connection closed by remote host"	18:18
clarkb	dviroel: fungi isn't around today. Can you link to specific failures?	18:20
dviroel	clarkb: yes, i was talking about the build log that fungi posted above	18:22
dviroel	clarkb: same issue on tripleo gates: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_45f/852513/1/gate/tripleo-ci-centos-9-scenario000-multinode-oooq-container-updates/45f9024/job-output.txt	18:22
dviroel	"kex_exchange_identification: Connection closed by remote"	18:22
dviroel	i was checking on journal logs	18:22
dviroel	around same timing	18:23
clarkb	if you use the zuul links then you can link to specific lines	18:23
fungi	often that implies there are rogue virtual machines in one of our providers which nova has lost track of, but end up in arp fights with new server instances assigned to the same ip addresses	18:23
dviroel	clarkb: sure, one sec	18:23
fungi	or sshd is ending up hung on the test nodes for some reason	18:23
dviroel	clarkb: https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/job-output.txt#8814	18:24
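For reference, the line-anchored links used in this log follow a visible pattern: build URL, then `/log/`, then the log path, then `#line` or `#start-end`. A minimal sketch of a helper that assembles such a link is below. The function name `zuul_log_link` and its parameters are hypothetical, and the URL layout is inferred only from the links pasted in this channel, not from Zuul documentation.

```python
def zuul_log_link(build_id, log_path, start=None, end=None,
                  base="https://zuul.opendev.org/t/openstack"):
    """Build a Zuul build-log URL, optionally anchored to a line or line range.

    The layout (build/<id>/log/<path>#<line>) is an assumption taken from
    the links pasted in the channel.
    """
    url = f"{base}/build/{build_id}/log/{log_path.lstrip('/')}"
    if start is not None:
        url += f"#{start}"
        # A range is written as "#start-end", as in the journal.txt links.
        if end is not None and end != start:
            url += f"-{end}"
    return url


link = zuul_log_link("45f90248b2f84ce7aad9fd86fdd00130", "job-output.txt", 8814)
ranged = zuul_log_link(
    "45f90248b2f84ce7aad9fd86fdd00130",
    "logs/undercloud/var/log/extra/journal.txt",
    17160,
    17169,
)
```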
fungi	i guess it's more often we see openssh host key mismatches for the rogue virtual machine problem	18:24
dviroel	clarkb: fungi: around same timing: https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17174	18:25
clarkb	I guess in the earlier example testinfra is running ansible from the fake bridge to the fake servers. So those are susceptible to the same sort of arp poisoning	18:26
clarkb	however in dviroel's example the connection is to localhost	18:26
clarkb	I don't think the key exchange problem with 127.0.0.2 can be caused by the hosting cloud	18:26
fungi	Connection closed by remote host\r\nConnection closed by 127.0.0.2 port 22	18:27
fungi	yeah, that sounds more like something is killing the sshd	18:27
dviroel	fungi: clarkb: see https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17174	18:27
dviroel	multiple attempts to ssh from another op	18:28
dviroel	ip	18:28
fungi	sometimes that can happen if login can't spawn a pty, e.g. if /var is full and writes to wtmp fail	18:28
clarkb	dviroel: sure but your log line clearly says that localhost closed the connection	18:28
clarkb	I think what you are seeing in the journal is people port scanning and poking at sshd	18:28
fungi	which will happen, since the sshd on our job nodes is generally reachable from the entire internet, and that sort of background scanning happens continually	18:29
dviroel	yes, but is this a side effect? or coincidence?	18:30
clarkb	my hunch is coincidence	18:30
clarkb	ansible will return ssh errors when the disk is full too	18:30
dviroel	hum, because the time matches	18:30
clarkb	random users attempting to ssh to you from the internet shouldn't affect valid connections succeeding unless you use fail2ban or they successfully DoS you	18:31
clarkb	I would focus instead on what could cause ssh to localhost (127.0.0.2) to fail	18:32
fungi	i agree the timing is suspicious, but the sample size is small enough to still chalk the connection between the system-config and tripleo build results up to coincidence	18:33
clarkb	sshd crashed, disk is full, invalid user credentials, host key verification failure, etc. I don't think the "connection closed" message can be fully relied on due to how ansible handles ssh failures. It is a bit more generic than that	18:33
clarkb	https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17160-17169 is where you can see localhost connections happening	18:37
clarkb	I suspect that the ssh layer is actually succeeding but then something in early remote ansible bootstrapping is failing and causing ansible to return that failure	18:37
clarkb	you can see that the session just prior copies some l3 thing which seems to align with the task just prior in the job output log	18:39
clarkb	I'm reasonably confident the session I've highlighted in my last link is the one that failed	18:39
dviroel	this job also has the same symptoms: https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93anoth	18:40
dviroel	correct link https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93	18:41
clarkb	that last one is on a different provider too	18:43
clarkb	I just checked on ze04 for your original example to see if our cleanup task managed to catch disk usage. But the host was unreachable at that point. I think something has broken ssh for ansible.	18:45
clarkb	Note "ssh for ansible" is important there as according to the journal it seems that ssh is generally working. But ansible + ssh is not	18:45
clarkb	but log copying and other tasks continued to function	18:46
clarkb	which implies this isn't a catastrophic failure. Just something that happens enough to be noticed	18:47
clarkb	but also it affects localhost which implies something about the host itself is causing the problem not networking to the world.	18:48
clarkb	centos stream didn't update ssh recently did it?	18:49
clarkb	or maybe ya'll are changing host keys and host key verification is failing? It would have to be something isolated to the host and I don't know your jobs well enough to say if it could be a side effect from that	18:49
clarkb	fungi: you'll like this: my browser has cached the redirect from / to /postorius/lists for lists.opendev.org so now when I reset /etc/hosts I can't get mm2 lists :/	18:51
dviroel	clarkb: thanks for looking, I don't remember any recent changes related to this and there doesn't seem to be any centos compose update either	18:52
clarkb	I do see `[WARNING]: sftp transfer mechanism failed on [127.0.0.2]. Use ANSIBLE_DEBUG=1` in the job output log. Maybe increasing ansible verbosity will give us the info we need	18:53
dviroel	clarkb: ack, we will proceed with tripleo bug report and debug, if the issue continues to occur	18:56
dviroel	just in case you start to see this happening in non-tripleo jobs	18:57
clarkb	ansible updated 2 weeks ago.	18:57
clarkb	paramiko hasn't updated in months but ansible uses openssh by default anyway	18:58
rcastillo\|rover	we're seeing the error on c8s, latest openssh update there is from last month	18:59
clarkb	ya I'm just trying to see if there are any obvious things. It is ssh from 127.0.0.1 to 127.0.0.2 as the zuul user if I am reading logs correctly	18:59
clarkb	that means cloud networking doesn't matter	18:59
clarkb	but also journald records that ssh itself seems to have worked properly so likely some issue in how ansible uses ssh	19:00
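One way to separate "sshd itself is unhealthy" from "ansible on top of ssh is unhappy" is to probe the port directly and check whether the server sends its identification string. The sketch below is a hypothetical helper, not something the jobs run: `probe_sshd` and `classify_banner` are made-up names, and the default address 127.0.0.2 is just the one mentioned in this discussion. A server that closes the connection before sending anything is roughly the condition that a client reports as "kex_exchange_identification: Connection closed by remote host".

```python
import socket


def classify_banner(data: bytes) -> str:
    """Classify the first bytes received from a supposed SSH server."""
    if not data:
        # Peer closed the socket without sending an identification string.
        return "closed-before-banner"
    if data.startswith(b"SSH-"):
        return "banner-ok"
    # The SSH protocol allows a server to send other lines before the
    # version string, so this is not necessarily an error.
    return "unexpected-data"


def probe_sshd(host="127.0.0.2", port=22, timeout=5.0) -> str:
    """Open a TCP connection and report what the server sends first."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except OSError as exc:
        # Covers refused connections, resets, and timeouts.
        return f"connect-failed: {exc}"
    return classify_banner(data)
```

Calling `probe_sshd()` on the affected node around the time of a failure would show whether the banner still arrives. If it does, that points away from sshd and toward the ansible layer, which matches the reading of the journal above.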
clarkb	rcastillo\|rover: looks like 45f90248b2f84ce7aad9fd86fdd00130 was centos stream 9 not 8	19:04
clarkb	and 2419cb17169348c697f62773d098bc93 was both (not sure which had the ssh issues though)	19:04
dviroel	there is another centos9 here: https://zuul.opendev.org/t/openstack/build/f3aa91645999406d8e9221a61fdfa6ca/log/job-output.txt#4842	19:05
rcastillo\|rover	here's a c8 https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d0817	19:05
clarkb	rcastillo\|rover: that one seems to have the same behavior as the cs9 example: https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d0817/log/logs/undercloud/var/log/extra/journal.txt#4316-4325	19:20
clarkb	that looks like on the ssh side of things everything is working. But ansible is unhappy for some reason	19:20
clarkb	oh ya'll don't install ansible via pip. Is it possible that ansible updated recently via wherever it comes from?	19:33
dviroel	list of installed packages: https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93/log/logs/undercloud/var/log/extra/package-list-installed.txt#25	19:48
rcastillo\|rover	ansible-core-2.13.2-1.el9.x86_64.rpm 2022-07-18 19:45	19:50
*** dviroel is now known as dviroel\|brb		20:02
clarkb	"You must sign in to search for specific terms." <- wow gitlan	20:03
clarkb	*gitlab	20:03
clarkb	https://github.com/maxking/docker-mailman/issues/548 and https://github.com/maxking/docker-mailman/issues/549 filed for ALLOWED_HOSTS problems. I'll do the uid:gid bug after lunch. I also remembered that I think I found a bug with mailman3 as well using boolean types in json to set boolean values. I'll try to file that against mailman3 as well	20:05
clarkb	https://github.com/maxking/docker-mailman/issues/550 filed for the uid:gid thing	20:47
clarkb	now to reproduce that mailman3 rest api bug and file that against mailman3 proper	20:47
clarkb	ok I think the other issue may have been pebkac. Specifically False and True in ansible yaml must not evaluate through to json boolean values with ansible's uri module's body var	21:03
clarkb	trying to reproduce the issue I had with curl I can't do it. And I'm using raw json with json booleans there	21:04
ianw	maybe it has to be "false" and "true" in the yaml? (no caps)	21:05
clarkb	ya that might be. I'm sending it the string version of "false" and "true" now and that works through ansible	21:06
clarkb	which should be fine for us. I just thought there may have been a bug in mm3's api accepting boolean types but now realize it must be something ansible does to mangle the json data	21:07
clarkb	as raw json doc and curl can't reproduce	21:07
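The distinction being discussed is between a JSON boolean and a string that merely looks like one. The log does not establish what ansible actually sends, so the sketch below only illustrates the three payload shapes a REST API could end up receiving. The key name `flag` is hypothetical and is not a real mailman3 setting.

```python
import json

# A real JSON boolean, as sent in the raw JSON document with curl.
as_bool = json.dumps({"flag": False})

# What the payload looks like if a Python-style boolean is stringified
# somewhere before serialization (an assumption about the failure mode).
as_python_str = json.dumps({"flag": str(False)})

# The lowercase string form that was reported to work through ansible.
as_json_word = json.dumps({"flag": "false"})

print(as_bool)        # {"flag": false}
print(as_python_str)  # {"flag": "False"}
print(as_json_word)   # {"flag": "false"}
```

An API that accepts `false` and `"false"` but rejects `"False"` would produce exactly the symptoms described here: curl with raw JSON works, lowercase strings work, and capitalized values passed through another layer do not.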
*** dviroel\|brb is now known as dviroel		21:15
opendevreview	Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Zed for Jammy https://review.opendev.org/c/opendev/system-config/+/853189	21:15
*** dviroel is now known as dviroel\|afk		21:40
clarkb	the retry_limit on https://review.opendev.org/850676 is due to tripleo having some dependency on a github repo that deleted a sha. We've seen this when people make a branch to act as a pull request and then delete it when the pr is merged. It tends to go away on its own as it's a tight timing window for that	22:19
clarkb	https://github.com/ansible-collections/community.general/pull/5121 is the PR and it is merged so I suspect this will resolve itself with a recheck	22:21
*** dasm is now known as dasm\|off		22:26
