Tuesday, 2022-08-16

*** ysandeep|holiday is now known as ysandeep05:28
*** ysandeep is now known as ysandeep|ruck05:29
*** gibi_pto is now known as gibi07:26
*** jpena|off is now known as jpena07:34
opendevreviewMartin Kopec proposed opendev/system-config master: refstack: trigger image upload  https://review.opendev.org/c/opendev/system-config/+/85325107:48
*** ysandeep|ruck is now known as ysandeep|ruck|lunch08:05
*** ysandeep|ruck|lunch is now known as ysandeep|ruck08:28
*** ysandeep|ruck is now known as ysandeep|ruck|afk10:46
*** ysandeep|ruck|afk is now known as ysandeep|ruck11:17
*** dviroel_ is now known as dviroel11:38
*** dasm|off is now known as dasm13:55
*** enick_727 is now known as diablo_rojo14:09
*** ysandeep|ruck is now known as ysandeep|dinner15:11
*** marios is now known as marios|out15:29
fricklerfungi: could you add https://review.opendev.org/c/opendev/system-config/+/853189 to your review list pls? would help me make progress on the kolla side15:36
fungifrickler: lgtm. i haven't checked afs graphs to make sure there's sufficient space, but i doubt it will be a problem15:39
clarkbUCA tends to be small15:45
clarkblike less than 1GB15:45
fricklerthx. 6.2G used for all existing releases, with 50G quota, seems fine15:48
fricklerbut looking at that page shows wheel releases are 7 days old, any known issue?15:48
funginot known to me, but sounds like something has probably broken the update job for those15:51
fricklerwheel-cache-centos-7-python3 | changed: Non-fatal POSTIN scriptlet failure in rpm package dkms-openafs-
*** ysandeep|dinner is now known as ysandeep15:54
fricklermight be a one off, though, earlier failures were the f-string issue it seems15:56
clarkbI think we might generate that package from the upstream packages? It is possible we need to update our package to accommodate some centos 7 change15:56
clarkbbut ya maybe we let it run and see if the f string fix was the more persistent issue?15:57
*** ysandeep is now known as ysandeep|out16:00
*** jpena is now known as jpena|off16:33
fungisystem-config-run-base failed in gate on the uca addition16:50
clarkbI think that is a recheck. It was testinfra bootstrapping that failed17:02
dviroelfungi: i see more jobs in tripleo queue failing with "ssh_exchange_identification: Connection closed by remote host"18:18
clarkbdviroel: fungi isn't around today. Can you link to specific failures?18:20
dviroelclarkb: yes, i was talking about the build log that fungi posted above18:22
dviroelclarkb: same issue on tripleo gates: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_45f/852513/1/gate/tripleo-ci-centos-9-scenario000-multinode-oooq-container-updates/45f9024/job-output.txt18:22
dviroel"kex_exchange_identification: Connection closed by remote"18:22
dviroeli was checking on journal logs18:22
dviroelaround same timing18:23
clarkbif you use the zuul links then you can link to specific lines18:23
fungioften that implies there are rogue virtual machines in one of our providers which nova has lost track of, but end up in arp fights with new server instances assigned to the same ip addresses18:23
dviroelclarkb: sure, one sec18:23
fungior sshd is ending up hung on the test nodes for some reason18:23
dviroelclarkb: https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/job-output.txt#881418:24
fungii guess it's more often we see openssh host key mismatches for the rogue virtual machine problem18:24
dviroelclarkb: fungi: around same timing: https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#1717418:25
clarkbI guess in the earlier example testinfra is running ansible from the fake bridge to the fake servers, so those are susceptible to the same sort of arp poisoning18:26
clarkbhowever in dviroel's example the connection is to localhost18:26
clarkbI don't think the key exchange problem with can be caused by the hosting cloud18:26
fungiConnection closed by remote host\r\nConnection closed by port 2218:27
fungiyeah, that sounds more like something is killing the sshd18:27
dviroelfungi: clarkb: see https://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#1717418:27
dviroelmultiple attempts to ssh from another ip18:28
fungisometimes that can happen if login can't spawn a pty, e.g. if /var is full and writes to wtmp fail18:28
clarkbdviroel: sure but your log line clearly says that localhost closed the connection18:28
clarkbI think what you are seeing in the journal is people port scanning and poking at sshd18:28
fungiwhich will happen, since the sshd on our job nodes is generally reachable from the entire internet, and that sort of background scanning happens continually18:29
dviroelyes, but is this a side effect? or coincidence?18:30
clarkbmy hunch is coincidence18:30
clarkbansible will return ssh errors when the disk is full too18:30
dviroelhum, because the time matches18:30
clarkbrandom users attempting to ssh to you from the internet shouldn't affect valid connections succeeding unless you use fail2ban or they successfully DoS you18:31
clarkbI would focus instead on what could cause ssh to localhost ( to fail18:32
fungii agree the timing is suspicious, but the sample size is small enough to still chalk the connection between the system-config and tripleo build results up to coincidence18:33
clarkbsshd crashed, disk is full, invalid user credentials, host key verification failure, etc. I don't think the "connection closed" message can be fully relied on due to how ansible handles ssh failures. It is a bit more generic than that18:33
clarkbhttps://zuul.opendev.org/t/openstack/build/45f90248b2f84ce7aad9fd86fdd00130/log/logs/undercloud/var/log/extra/journal.txt#17160-17169 is where you can see localhost connections happening18:37
clarkbI suspect that the ssh layer is actually succeeding but then something in early remote ansible bootstrapping is failing and causing ansible to return that failure18:37
clarkbyou can see that the session just prior copies some l3 thing which seems to align with the task just prior in the job output log18:39
clarkbI'm reasonably confident the session I've highlighted in my last link is the one that failed18:39
dviroelthis job also has the same symptoms: https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93anoth18:40
dviroelcorrect link https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc9318:41
clarkbthat last one is on a different provider too18:43
clarkbI just checked on ze04 for your original example to see if our cleanup task managed to catch disk usage. But the host was unreachable at that point. I think something has broken ssh for ansible.18:45
clarkbNote "ssh for ansible" is important there as according to the journal it seems that ssh is generally working. But ansible + ssh is not18:45
clarkbbut log copying and other tasks continued to function18:46
clarkbwhich implies this isn't a catastrophic failure. Just something that happens enough to be noticed18:47
clarkbbut also it affects localhost which implies something about the host itself is causing the problem not networking to the world.18:48
clarkbcentos stream didn't update ssh recently did it?18:49
clarkbor maybe ya'll are changing host keys and host key verification is failing? It would have to be something isolated to the host and I don't know your jobs well enough to say if it could be a side effect from that18:49
clarkbfungi: you'll like this: my browser has cached the redirect from / to /postorius/lists for lists.opendev.org so now when I reset /etc/hosts I can't get mm2 lists :/18:51
dviroelclarkb: thanks for looking, I don't remember any recent changes related to this and there doesn't seem to be any centos compose update either18:52
clarkbI do see `[WARNING]: sftp transfer mechanism failed on []. Use ANSIBLE_DEBUG=1` in the job output log. Maybe increasing ansible verbosity will give us the info we need18:53
dviroelclarkb: ack, we will proceed with a tripleo bug report and debugging if the issue keeps occurring18:56
dviroeljust in case you start to see this happening in non-tripleo jobs18:57
clarkbansible updated 2 weeks ago.18:57
clarkbparamiko hasn't updated in months but ansible uses openssh by default anyway18:58
rcastillo|roverwe're seeing the error on c8s, latest openssh update there is from last month18:59
clarkbya I'm just trying to see if there are any obvious things. It is ssh from to as the zuul user if I am reading logs correctly18:59
clarkbthat means cloud networking doesn't matter18:59
clarkbbut also journald records that ssh itself seems to have worked properly so likely some issue in how ansible uses ssh19:00
clarkbrcastillo|rover: looks like 45f90248b2f84ce7aad9fd86fdd00130 was centos stream 9 not 819:04
clarkband 2419cb17169348c697f62773d098bc93 was both (not sure which had the ssh issues though)19:04
dviroelthere is another centos9 here: https://zuul.opendev.org/t/openstack/build/f3aa91645999406d8e9221a61fdfa6ca/log/job-output.txt#484219:05
rcastillo|roverhere's a c8 https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d081719:05
clarkbrcastillo|rover: that one seems to have the same behavior as the cs9 example: https://zuul.opendev.org/t/openstack/build/6a96f799c4c7459392d0cba3544d0817/log/logs/undercloud/var/log/extra/journal.txt#4316-432519:20
clarkbthat looks like on the ssh side of things everything is working. But ansible is unhappy for some reason19:20
clarkboh ya'll don't install ansible via pip. Is it possible that ansible updated recently via wherever it comes from?19:33
dviroellist of installed packages: https://zuul.opendev.org/t/openstack/build/2419cb17169348c697f62773d098bc93/log/logs/undercloud/var/log/extra/package-list-installed.txt#2519:48
rcastillo|roveransible-core-2.13.2-1.el9.x86_64.rpm 2022-07-18 19:45 19:50
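A minimal sketch of two of the host-local checks raised above (whether sshd answers at all on localhost, and whether a disk is full), assuming Python 3 is available on the test node. `read_ssh_banner` and `disk_usage_percent` are hypothetical helper names, not part of Zuul, Ansible, or the tripleo jobs. The banner check assumes the server's first line is its identification string; a peer that closes before sending it is roughly the situation the "kex_exchange_identification: Connection closed by remote host" message describes.

```python
import shutil
import socket


def read_ssh_banner(host="127.0.0.1", port=22, timeout=5.0):
    """Return the first line the server sends, or None if it closed first.

    Assumption: the first line is the SSH identification string
    (for example "SSH-2.0-OpenSSH_..."). Raises OSError if the TCP
    connection itself cannot be established.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        # Read until end of line, a closed connection, or 255 bytes.
        while not data.endswith(b"\n") and len(data) < 255:
            chunk = sock.recv(255 - len(data))
            if not chunk:
                break  # peer closed the connection
            data += chunk
    text = data.decode("ascii", errors="replace").strip()
    return text or None


def disk_usage_percent(path="/var"):
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

Calling `read_ssh_banner()` and `disk_usage_percent("/var")` on an affected node around the time of a failure would only narrow things down: a returned banner says sshd accepted the connection, which matches what the journal shows, and would point back at the ansible side.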
*** dviroel is now known as dviroel|brb20:02
clarkb"You must sign in to search for specific terms." <- wow gitlab20:03
clarkbhttps://github.com/maxking/docker-mailman/issues/548 and https://github.com/maxking/docker-mailman/issues/549 filed for ALLOWED_HOSTS problems. I'll do the uid:gid bug after lunch. I also remembered that I think I found a bug with mailman3 as well using boolean types in json to set boolean values. I'll try to file that against mailman3 as well20:05
clarkbhttps://github.com/maxking/docker-mailman/issues/550 filed for the uid:gid thing20:47
clarkbnow to reproduce that mailman3 rest api bug and file that against mailman3 proper20:47
clarkbok I think the other issue may have been pebkac. Specifically False and True in ansible yaml must not evaluate through to json boolean values with ansible's uri module's body var21:03
clarkbtrying to reproduce the issue I had with curl I can't do it. And I'm using raw json with json booleans there21:04
ianwmaybe it has to be "false" and "true" in the yaml? (no caps)21:05
clarkbya that might be. I'm sending it the string version of "false" and "true" now and that works through ansible21:06
clarkbwhich should be fine for us. I just thought there may have been a bug in mm3's api accepting boolean types but now realize it must be something ansible does to mangle the json data21:07
clarkbas raw json doc and curl can't reproduce21:07
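A small Python illustration of the distinction discussed above: a real boolean serialized as JSON, the same boolean rendered as text, and the string form that was reported to work. This is only a sketch of the data shapes, not a description of what ansible's uri module does internally; the key `example_flag` and the helper `stringify_bools` are hypothetical.

```python
import json


def stringify_bools(settings):
    """Replace Python bools with the strings "true"/"false"; keep other values."""
    return {
        key: ("true" if value else "false") if isinstance(value, bool) else value
        for key, value in settings.items()
    }


settings = {"example_flag": False}  # hypothetical setting name

# A real boolean serialized as JSON: lowercase literal, no quotes.
raw_json_body = json.dumps(settings)

# The same boolean rendered as text, as string templating would do:
# capitalized, and not a valid JSON literal.
templated = str(settings["example_flag"])

# The workaround from the discussion: send the strings "true"/"false".
string_body = json.dumps(stringify_bools(settings))

print(raw_json_body)
print(templated)
print(string_body)
```

If a request body ends up carrying the capitalized text form rather than a JSON literal, the receiving API sees something different from what the curl test sent, which would be consistent with curl not reproducing the problem.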
*** dviroel|brb is now known as dviroel21:15
opendevreviewMerged opendev/system-config master: reprepro: mirror Ubuntu UCA Zed for Jammy  https://review.opendev.org/c/opendev/system-config/+/85318921:15
*** dviroel is now known as dviroel|afk21:40
clarkbthe retry_limit on https://review.opendev.org/850676 is due to tripleo having some dependency on a github repo that deleted a sha. We've seen this when people make a branch to act as a pull request and then delete it when the pr is merged. It tends to go away on its own as it's a tight timing window for that22:19
clarkbhttps://github.com/ansible-collections/community.general/pull/5121 is the PR and it is merged so I suspect this will resolve itself with a recheck22:21
*** dasm is now known as dasm|off22:26
