*** ysandeep|holiday is now known as ysandeep | 05:28 | |
*** ysandeep is now known as ysandeep|ruck | 05:29 | |
*** gibi_pto is now known as gibi | 07:26 | |
*** jpena|off is now known as jpena | 07:34 | |
opendevreview | Martin Kopec proposed opendev/system-config master: refstack: trigger image upload https://review.opendev.org/c/opendev/system-config/+/853251 | 07:48 |
*** ysandeep|ruck is now known as ysandeep|ruck|lunch | 08:05 | |
*** ysandeep|ruck|lunch is now known as ysandeep|ruck | 08:28 | |
*** ysandeep|ruck is now known as ysandeep|ruck|afk | 10:46 | |
*** ysandeep|ruck|afk is now known as ysandeep|ruck | 11:17 | |
*** dviroel_ is now known as dviroel | 11:38 | |
*** dasm|off is now known as dasm | 13:55 | |
*** enick_727 is now known as diablo_rojo | 14:09 | |
*** ysandeep|ruck is now known as ysandeep|dinner | 15:11 | |
*** marios is now known as marios|out | 15:29 | |
frickler | fungi: could you add https://review.opendev.org/c/opendev/system-config/+/853189 to your review list pls? would help me make progress on the kolla side | 15:36 |
fungi | frickler: lgtm. i haven't checked afs graphs to make sure there's sufficient space, but i doubt it will be a problem | 15:39 |
clarkb | UCA tends to be small | 15:45 |
clarkb | like less than 1GB | 15:45 |
frickler | thx. 6.2G used for all existing releases, with 50G quota, seems fine | 15:48 |
frickler | but looking at that page shows wheel releases are 7 days old, any known issue? | 15:48 |
fungi | not known to me, but sounds like something has probably broken the update job for those | 15:51 |
frickler | wheel-cache-centos-7-python3 | changed: Non-fatal POSTIN scriptlet failure in rpm package dkms-openafs-1.8.8.1-1.el7.x86_64 | 15:54 |
frickler | https://zuul.opendev.org/t/openstack/build/1f4e9a0f03064ef68d63035151fd5b6d | 15:54 |
*** ysandeep|dinner is now known as ysandeep | 15:54 | |
frickler | might be a one off, though, earlier failures were the f-string issue it seems | 15:56 |
clarkb | I think we might generate that package from the upstream packages? It is possible we need to update our package to accommodate some centos 7 change | 15:56 |
clarkb | but ya maybe we let it run and see if the f string fix was the more persistent issue? | 15:57 |
*** ysandeep is now known as ysandeep|out | 16:00 | |
*** jpena is now known as jpena|off | 16:33 | |
fungi | system-config-run-base failed in gate on the uca addition | 16:50 |
fungi | https://zuul.opendev.org/t/openstack/build/36ede125be004d58bd5ee4dfd9455b6e | 16:50 |
clarkb | I think that just needs a recheck. It was testinfra bootstrapping that failed | 17:02 |
fungi | ahh | 17:21 |
dviroel | fungi: i see more jobs in tripleo queue failing with "ssh_exchange_identification: Connection closed by remote host" | 18:18 |
clarkb | dviroel: fungi isn't around today. Can you link to specific failures? | 18:20 |
dviroel | clarkb: yes, i was talking about the build log that fungi posted above | 18:22 |
dviroel | clarkb: same issue on tripleo gates: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_45f/852513/1/gate/tripleo-ci-centos-9-scenario000-multinode-oooq-container-updates/45f9024/job-output.txt | 18:22 |
dviroel | "kex_exchange_identification: Connection closed by remote" | 18:22 |
dviroel | i was checking on journal logs | 18:22 |
dviroel | around same timing | 18:23 |
clarkb | if you use the zuul links then you can link to specific lines | 18:23 |
fungi | often that implies there are rogue virtual machines in one of our providers which nova has lost track of, but end up in arp fights with new server instances assigned to the same ip addresses | 18:23 |
dviroel | clarkb: sure, one sec | 18:23 |
fungi | or sshd is ending up hung on the test nodes for some reason | 18:23 |
dviroel | clarkb: https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/job-output.txt#8814 | 18:24 |
fungi | i guess it's more often we see openssh host key mismatches for the rogue virtual machine problem | 18:24 |
dviroel | clarkb: fungi: around same timing: https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17174 | 18:25 |
clarkb | I guess in the earlier example testinfra is running ansible from the fake bridge to the fake servers, so they are susceptible to the same sort of arp poisoning | 18:26 |
clarkb | however in dviroel's example the connection is to localhost | 18:26 |
clarkb | I don't think the key exchange problem with 127.0.0.2 can be caused by the hosting cloud | 18:26 |
fungi | Connection closed by remote host\r\nConnection closed by 127.0.0.2 port 22 | 18:27 |
fungi | yeah, that sounds more like something is killing the sshd | 18:27 |
dviroel | fungi: clarkb: see https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17174 | 18:27 |
dviroel | multiple attempts to ssh from another ip | 18:28 |
fungi | sometimes that can happen if login can't spawn a pty, e.g. if /var is full and writes to wtmp fail | 18:28 |
clarkb | dviroel: sure but your log line clearly says that localhost closed the connection | 18:28 |
clarkb | I think what you are seeing in the journal is people port scanning and poking at sshd | 18:28 |
fungi | which will happen, since the sshd on our job nodes is generally reachable from the entire internet, and that sort of background scanning happens continually | 18:29 |
dviroel | yes, but is this a side effect? or coincidence? | 18:30 |
clarkb | my hunch is coincidence | 18:30 |
clarkb | ansible will return ssh errors when the disk is full too | 18:30 |
dviroel | hum, because the time matches | 18:30 |
clarkb | random users attempting to ssh to you from the internet shouldn't affect valid connections succeeding unless you use fail2ban or they successfully DoS you | 18:31 |
clarkb | I would focus instead on what could cause ssh to localhost (127.0.0.2) to fail | 18:32 |
fungi | i agree the timing is suspicious, but the sample size is small enough to still chalk the connection between the system-config and tripleo build results up to coincidence | 18:33 |
clarkb | sshd crashed, disk is full, invalid user credentials, host key verification failure, etc. I don't think the "connection closed" message can be fully relied on due to how ansible handles ssh failures. It is a bit more generic than that | 18:33 |
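A quick way to triage the local causes listed above, as an illustrative sketch rather than anything from the jobs themselves; the paths and the "sshd" systemd unit name are assumptions for a CentOS Stream node:

```python
import shutil
import subprocess

# Check whether /var (wtmp, logs) is out of space, one of the failure modes
# that can make logins fail even though sshd accepts the connection.
usage = shutil.disk_usage("/var")
print(f"/var free: {usage.free / usage.total:.1%}")

# Check that sshd is still running; the unit name is an assumption here.
state = subprocess.run(
    ["systemctl", "is-active", "sshd"],
    capture_output=True, text=True,
).stdout.strip()
print(f"sshd: {state}")
```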
clarkb | https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17160-17169 is where you can see localhost connections happening | 18:37 |
clarkb | I suspect that the ssh layer is actually succeeding but then something in early remote ansible bootstrapping is failing and causing ansible to return that failure | 18:37 |
clarkb | you can see that the session just prior copies some l3 thing which seems to align with the task just prior in the job output log | 18:39 |
clarkb | I'm reasonably confident the session I've highlighted in my last link is the one that failed | 18:39 |
dviroel | this job also has the same symptoms: https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93 | 18:41 |
clarkb | that last one is on a different provider too | 18:43 |
clarkb | I just checked on ze04 for your original example to see if our cleanup task managed to catch disk usage. But the host was unreachable at that point. I think something has broken ssh for ansible. | 18:45 |
clarkb | Note "ssh for ansible" is important there as according to the journal it seems that ssh is generally working. But ansible + ssh is not | 18:45 |
clarkb | but log copying and other tasks continued to function | 18:46 |
clarkb | which implies this isn't a catastrophic failure. Just something that happens enough to be noticed | 18:47 |
clarkb | but also it affects localhost which implies something about the host itself is causing the problem not networking to the world. | 18:48 |
clarkb | centos stream didn't update ssh recently did it? | 18:49 |
clarkb | or maybe ya'll are changing host keys and host key verification is failing? It would have to be something isolated to the host and I don't know your jobs well enough to say if it could be a side effect from that | 18:49 |
clarkb | fungi: you'll like this: my browser has cached the redirect from / to /postorius/lists for lists.opendev.org so now when I reset /etc/hosts I can't get mm2 lists :/ | 18:51 |
dviroel | clarkb: thanks for looking, I don't remember any recent changes related to this and there doesn't seem to be any centos compose update either | 18:52 |
clarkb | I do see `[WARNING]: sftp transfer mechanism failed on [127.0.0.2]. Use ANSIBLE_DEBUG=1` in the job output log. Maybe increasing ansible verbosity will give us the info we need | 18:53 |
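For reference, one way to act on that warning when re-running a playbook by hand; a minimal sketch assuming ansible-playbook is on PATH, with placeholder playbook and inventory names:

```python
import os
import subprocess

# Re-run the failing playbook with Ansible's internal debug output and high
# verbosity enabled, as the warning in the job log suggests.
env = dict(os.environ, ANSIBLE_DEBUG="1")
subprocess.run(
    ["ansible-playbook", "-vvvv", "-i", "inventory", "playbook.yml"],  # placeholders
    env=env,
    check=False,
)
```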
dviroel | clarkb: ack, we will proceed with tripleo bug report and debug, if the issue continues to occur | 18:56 |
dviroel | just in case you start to see this happening in non-tripleo jobs | 18:57 |
clarkb | ansible updated 2 weeks ago. | 18:57 |
clarkb | paramiko hasn't updated in months but ansible uses openssh by default anyway | 18:58 |
rcastillo|rover | we're seeing the error on c8s, latest openssh update there is from last month | 18:59 |
clarkb | ya I'm just trying to see if there are any obvious things. It is ssh from 127.0.0.1 to 127.0.0.2 as the zuul user if I am reading logs correctly | 18:59 |
clarkb | that means cloud networking doesn't matter | 18:59 |
clarkb | but also journald records that ssh itself seems to have worked properly so likely some issue in how ansible uses ssh | 19:00 |
clarkb | rcastillo|rover: looks like 45f90248b2f84ce7aad9fd86fdd00130 was centos stream 9 not 8 | 19:04 |
clarkb | and 2419cb17169348c697f62773d098bc93 was both (not sure which had the ssh issues though) | 19:04 |
dviroel | there is another centos9 here: https://zuul.opendev.org/t/openstack/build/f3aa91645999406d8e9221a61fdfa6ca/log/job-output.txt#4842 | 19:05 |
rcastillo|rover | here's a c8 https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d0817 | 19:05 |
clarkb | rcastillo|rover: that one seems to have the same behavior as the cs9 example: https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d0817/log/logs/undercloud/var/log/extra/journal.txt#4316-4325 | 19:20 |
clarkb | that looks like on the ssh side of things everything is working. But ansible is unhappy for some reason | 19:20 |
clarkb | oh ya'll don't install ansible via pip. Is it possible that ansible updated recently via wherever it comes from? | 19:33 |
dviroel | list of installed packages: https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93/log/logs/undercloud/var/log/extra/package-list-installed.txt#25 | 19:48 |
rcastillo|rover | ansible-core-2.13.2-1.el9.x86_64.rpm 2022-07-18 19:45 | 19:50 |
*** dviroel is now known as dviroel|brb | 20:02 | |
clarkb | "You must sign in to search for specific terms." <- wow gitlan | 20:03 |
clarkb | *gitlab | 20:03 |
clarkb | https://github.com/maxking/docker-mailman/issues/548 and https://github.com/maxking/docker-mailman/issues/549 filed for ALLOWED_HOSTS problems. I'll do the uid:gid bug after lunch. I also remembered that I think I found a bug with mailman3 as well using boolean types in json to set boolean values. I'll try to file that against mailman3 as well | 20:05 |
clarkb | https://github.com/maxking/docker-mailman/issues/550 filed for the uid:gid thing | 20:47 |
clarkb | now to reproduce that mailman3 rest api bug and file that against mailman3 proper | 20:47 |
clarkb | ok I think the other issue may have been pebkac. Specifically False and True in ansible yaml must not evaluate through to json boolean values with ansible's uri module's body var | 21:03 |
clarkb | trying to reproduce the issue I had using curl, I can't do it. And I'm using raw json with json booleans there | 21:04 |
ianw | maybe it has to be "false" and "true" in the yaml? (no caps) | 21:05 |
clarkb | ya that might be. I'm sending it the string version of "false" and "true" now and that works through ansible | 21:06 |
clarkb | which should be fine for us. I just thought there may have been a bug in mm3's api accepting boolean types but now realize it must be something ansible does to mangle the json data | 21:07 |
clarkb | as raw json doc and curl can't reproduce | 21:07 |
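A sketch of the kind of reproduction described above, sending a real JSON boolean straight to the Mailman 3 core REST API; the URL, port, list id, attribute, and credentials are placeholders/assumptions rather than the actual deployment values:

```python
import requests

BASE = "http://localhost:8001/3.1"   # assumed default core REST port
AUTH = ("restadmin", "restpass")      # placeholder credentials

# json= serializes Python True as a JSON boolean, matching the raw curl test
# with a hand-written JSON document.
resp = requests.patch(
    f"{BASE}/lists/example.lists.opendev.org/config",
    auth=AUTH,
    json={"advertised": True},
)
print(resp.status_code, resp.text)
```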
*** dviroel|brb is now known as dviroel | 21:15 | |
opendevreview | Merged opendev/system-config master: reprepro: mirror Ubuntu UCA Zed for Jammy https://review.opendev.org/c/opendev/system-config/+/853189 | 21:15 |
*** dviroel is now known as dviroel|afk | 21:40 | |
clarkb | the retry_limit on https://review.opendev.org/850676 is due to tripleo having some dependency on a github repo that deleted a sha. We've seen this when people make a branch to act as a pull request and then delete it when the pr is merged. It tends to go away on its own as it's a tight timing window for that | 22:19 |
clarkb | https://github.com/ansible-collections/community.general/pull/5121 is the PR and it is merged so I suspect this will resolve itself with a recheck | 22:21 |
*** dasm is now known as dasm|off | 22:26 |