*** ysandeep is now known as ysandeep|afk | 07:49 | |
jrosser | morning | 08:46 |
---|---|---|
noonedeadpunk | \o/ | 09:02 |
*** ysandeep|afk is now known as ysandeep | 09:43 | |
damiandabrowski | hey! | 09:49 |
jrosser | do we know what's happening with these MODULE FAILURES yet? | 10:13 |
jrosser | i can reproduce it easily | 10:13 |
jrosser | it is to do with the process that owns the control master socket exiting | 10:19 |
jrosser | i messed with the controlpath a bit | 10:37 |
jrosser | ANSIBLE_SSH_CONTROL_PATH=/root/.ansible/cp/ansible-ssh-%%h-%%p-%%r | 10:37 |
jrosser | then if i run this in another terminal it makes the MODULE_ERRORS stop `watch -n 0.2 ssh -O check -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" 172.29.236.100` | 10:38 |
jrosser | ^ that command is enough to keep the controlmaster alive, and the tasks keep running, shortly after ^C on that command the playbook fails | 10:38 |
jrosser | whilst this is going on you can strace the ssh controlmaster process, and this happens when the playbook fails https://paste.opendev.org/show/bLt55zAHbiRdDYa63cNY/ | 10:42 |
jrosser | it returns ansible_facts (line 50), receives $something back (line 52) then does what looks like a pretty clean tidy up and exit, finally deleting the controlmaster socket | 10:44 |
mgariepy | jrosser, so when pipelining is set to True, ansible does something else and does not keep the ssh alive? | 10:58 |
jrosser | doesn't pipelining just reduce the number of SSH connections that are required? | 10:59 |
jrosser | ssh <foo> <gigantic python command> rather than scp <foo> <python script> / ssh <foo> <run python script> | 11:00 |
mgariepy | it clearly does something wrong on my tests. since when i set it to false the issue doesn't reproduce. | 11:01 |
noonedeadpunk | yes, pipelining does exactly that: it transfers the python command instead of doing scp/ssh | 11:02 |
mgariepy | unless it's our ssh config that does mess the run ? like the session expiration ? | 11:05 |
jrosser | it would be nice to be able to get some verbose on the ssh command used at the ansible end | 11:05 |
jrosser | because this is a looong time ago but feels kind of familiar http://lists.mindrot.org/pipermail/openssh-bugs/2015-July/014957.html | 11:06 |
jrosser | from my strace we see that the last thing that the controlmaster does is unlink the controlpersist socket | 11:06 |
mgariepy | When set to True it does reproduce on my laptop quite often. | 11:08 |
jrosser | maybe there is a race there with the removal of the controlpersist socket when the timeout happens, but the client still thinks that the socket is good to use and doesn't handle retrying the ssh connection | 11:08 |
jrosser | what was the stuff with some retry decorator last week? i kind of missed that | 11:08 |
mgariepy | can it be the session logout ? | 11:09 |
mgariepy | https://github.com/ansible/ansible/blob/fbaea4c269b0a3c8112101754cee808d82bebbee/lib/ansible/plugins/connection/ssh.py#L1163 | 11:09 |
mgariepy | -13 is the exit code for ssh. | 11:10 |
mgariepy | i got to switch desk. i'll be back in an hour or so. | 11:13 |
jrosser | https://github.com/ansible/ansible/blob/fbaea4c269b0a3c8112101754cee808d82bebbee/lib/ansible/plugins/connection/ssh.py#L388 | 11:13 |
jrosser | like there is a whole class for handling this exact situation when the controlpersist socket closes | 11:13 |
mgariepy | bbl. | 11:15 |
anskiy | I do remember running into some issues with pipelining (with another basic playbook, not OSA) some time ago. AFAIR, I've found some confirmation that it could be buggy, and so I keep it disabled. | 11:16 |
jrosser | well - looks like we totally run our own ssh command directly with no retries https://github.com/openstack/openstack-ansible-plugins/blob/master/plugins/connection/ssh.py#L502 | 11:21 |
noonedeadpunk | well, we just overwrite original methods that should be covered with retries imo | 11:23 |
jrosser | right yes so that should take the method from the base class | 11:24 |
noonedeadpunk | and setting ansible.builtin.ssh issue would be the same | 11:24 |
noonedeadpunk | *with | 11:24 |
noonedeadpunk | so that likely doesn't matter much | 11:24 |
jrosser | is it the same without our plugin? | 11:24 |
noonedeadpunk | it was for me, yes | 11:24 |
jrosser | hmm ok | 11:25 |
noonedeadpunk | but you should try as well :) | 11:25 |
* jrosser goes to eat - bbl | 11:25 | |
*** dviroel_ is now known as dviroel | 11:43 | |
jrosser | ok so with ansible 2.12.7 it did this `fatal: [aio1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: muxclient: master hello exchange failed\r\nkex_exchange_identification: read: Connection reset by peer", "unreachable": true}` | 12:05 |
mgariepy | i'm back ! | 12:10 |
*** ysandeep is now known as ysandeep|afk | 12:13 | |
noonedeadpunk | I have never seen that ^ | 12:14 |
jrosser | i am having much more difficulty making it fail with ansible-core installed in a venv | 12:24 |
jrosser | the other thing i can see is that with regular ansible the PID of the controlmaster process is rolling continuously, so the insides of ansible are handling the socket being deleted / recreated OK | 12:32 |
noonedeadpunk | ansible-core was failing as long as I sourced openstack-ansible.rc with commented out connection plugin override. And it had nothing to do with dynamic-inventory either at very least | 12:49 |
noonedeadpunk | Damn, I really fail to understand ansible motivation to hack apt module that weirdly | 12:50 |
noonedeadpunk | (wrt https://github.com/ansible/ansible/pull/78327) | 12:50 |
noonedeadpunk | as it makes pinning non-persistent anyway: it will be applied only during runtime. And the next run for the next package can just update the recently installed one if it has it mentioned as a dependency | 12:51 |
noonedeadpunk | not talking about unattended upgrades or anything like that | 12:52 |
noonedeadpunk | it's so weird.... | 12:52 |
mgariepy | they wanted to be able to force a version from within the playbook. but they did fail to implement it correctly. | 12:52 |
mgariepy | unless you are only building an app container image in that case you don't really care | 12:53 |
mgariepy | also it looks like duplicate from apt directly imo. apt does manage all those case i think. | 12:55 |
noonedeadpunk | It really does | 12:55 |
noonedeadpunk | so what they could do is just add another module to manage preferences.d in a more convenient way | 12:56 |
noonedeadpunk | like our apt-package-pinning does | 12:56 |
noonedeadpunk | Huh, so do ppl build container images with ansible? :D | 12:56 |
mgariepy | yeah and use the apt python module and manage error instead of implementing the logic in this one. | 12:56 |
mgariepy | noonedeadpunk, maybe ? | 12:56 |
mgariepy | lol | 12:57 |
noonedeadpunk | well, we kind of do but it's quite different :D | 12:57 |
mgariepy | we do system-container. | 12:57 |
mgariepy | but you could also build app container if you really want i guess. | 12:57 |
noonedeadpunk | because even for VM image that won't really work | 12:58 |
noonedeadpunk | yeah, application maybe... I just thought that it's mostly smth like dockerfiles | 12:58 |
mgariepy | does tower make ansible more like puppet ? | 13:00 |
noonedeadpunk | meh, I tried to adopt AWX a couple of times and it never really took off, as it was abandoned quite fast | 13:01 |
mgariepy | also if you check the comments on the patch that totally break the apt module. it was not supposed to break preference.d config. | 13:01 |
noonedeadpunk | well, yes, and fix is basically to read that config and still provide version when it's not asked to do any of that | 13:02 |
noonedeadpunk | which is smth I really fail to understand | 13:02 |
mgariepy | isn't there a from apt import apt-just-like-you-would-in-cli ? | 13:03 |
noonedeadpunk | there is | 13:06 |
noonedeadpunk | but they do `pkg_list.append("'%s=%s'" % (name, version))` for some conditions I can hardly read | 13:07 |
cloudnull | yo! | 13:25 |
mgariepy | hey cloudnull it's been a while ! :D | 13:25 |
jrosser | noonedeadpunk: mgariepy i think the problem is here https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L1165 | 13:26 |
jrosser | because 255 != -13 | 13:26 |
jrosser | so it never retries | 13:26 |
* cloudnull just stood up an OSA cloud from master on Debian, using mostly baremetal, and it all worked perfectly. So I just wanted to say hi, thanks, and great work! | 13:26 | |
jrosser | \o/ awesome | 13:27 |
jrosser | noonedeadpunk: i hacked it to check for -13 instead of 255 https://paste.opendev.org/show/bOlt6eL38SZerFVuqSLV/ | 13:27 |
jrosser | and you see it retry - without that it crash/burn when it gets -13 | 13:27 |
cloudnull | how you been mgariepy jrosser? | 13:29 |
mgariepy | not too bad yourself? | 13:29 |
cloudnull | doing great. chillin :D | 13:30 |
jrosser | yeah doing ok | 13:30 |
noonedeadpunk | \o/ | 13:30 |
*** ysandeep|afk is now known as ysandeep | 13:30 | |
noonedeadpunk | well,that is intentional https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L179 | 13:32 |
jrosser | but wrong, no? | 13:33 |
jrosser | we kind of show that the controlpersist socket closing results in -13, not 255 | 13:33 |
noonedeadpunk | they don't consider negative though | 13:33 |
noonedeadpunk | yes, totally. I think it would be valid bug. | 13:33 |
noonedeadpunk | with super trivial fix... | 13:34 |
jrosser | i'm not really understanding why -13 though | 13:34 |
noonedeadpunk | or not considering tests | 13:34 |
noonedeadpunk | I actually tried to google for it but haven't found much | 13:35 |
jrosser | unless it should retry on any failure | 13:35 |
jrosser | it's not EPIPE | 13:35 |
jrosser | i'm still unhappy this doesn't reproduce so easily without our connection plugin | 13:35 |
noonedeadpunk | it feels like a race that depends on execution time as well, and our connection plugin slows things down from what I saw | 13:37 |
noonedeadpunk | and eventually our retry looks useless | 13:38 |
jrosser | yeah i think we should get rid of that | 13:38 |
jrosser | i've copied my debugging lines over to stock ansible venv and i see it retrying there as well now | 13:40 |
spatel | cloudnull \o/ | 13:42 |
mgariepy | if i add if p.returncode == 255 or p.returncode == -13: to the ssh.py file it does retry correctly | 13:44 |
mgariepy | https://paste.openstack.org/show/bwdhz61ViT1vT8imKkfO/ | 13:44 |
jrosser | i have a ton of debugging lines in and it seems our code takes quite a different path through _bare_run than if you run with stock ansible | 13:44 |
jrosser | and with stock ansible i see the -13 being returned but it is handled | 13:45 |
mgariepy | ssh exits with the exit status of the remote command or with 255 if an error occurred. | 13:45 |
mgariepy | it catches the @_ssh_retry ? | 13:46 |
jrosser | i'm not sure right now | 13:46 |
jrosser | most obviously, ansible 2.12.7 is mostly using this https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L913 | 13:48 |
jrosser | but our plugin makes it go here instead https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/connection/ssh.py#L926 | 13:49 |
mgariepy | because in_data is None. | 13:52 |
mgariepy | https://opendev.org/openstack/openstack-ansible-plugins/src/branch/master/plugins/connection/ssh.py#L422 | 13:53 |
jrosser | yeah, i just need to make sure i'm running the code i think i am here | 13:53 |
noonedeadpunk | jrosser: I'm not sure, but isn't pty.openpty more the path without pipelining? | 13:54 |
jrosser | it could be yes - it looks like there are many more iterations happening | 13:54 |
jrosser | where is that controlled? | 13:54 |
noonedeadpunk | https://github.com/ansible/ansible/blob/15750aec5265866ae46319cbfbb318e9eec0e083/lib/ansible/plugins/connection/ssh.py#L891-L894 | 13:56 |
noonedeadpunk | it's not where it's controlled, but it at least explains the path | 13:57 |
noonedeadpunk | but it's controlled with an env var | 13:57 |
noonedeadpunk | https://opendev.org/openstack/openstack-ansible/src/branch/master/scripts/openstack-ansible.rc#L51 | 13:57 |
jrosser | argh i have a meeting | 14:01 |
jrosser | right i reproduced it with ansible-core venv with pipelining | 14:15 |
mgariepy | if i don't set pipelining i can't reproduce it in osa. | 14:15 |
mgariepy | if i don't set pipelining to True i can't reproduce it in osa. | 14:16 |
jrosser | https://github.com/ansible/ansible/issues/78344 | 14:39 |
mgariepy | cool thanks jrosser | 14:53 |
mgariepy | it was broken on older releases as well. | 14:53 |
mgariepy | we had the issue on xena | 14:54 |
mgariepy | 2.11.something iirc | 14:54 |
jrosser | make a comment :) | 14:56 |
*** ysandeep is now known as ysandeep|dinner | 15:04 | |
mgariepy | done | 15:05 |
jrosser | noonedeadpunk: that ansible apt stuff - do you think it can deal with this https://github.com/openstack/openstack-ansible-openstack_hosts/blob/master/defaults/main.yml#L186-L189 | 15:06 |
*** dviroel is now known as dviroel|lunch | 15:08 | |
*** ysandeep|dinner is now known as ysandeep|out | 15:45 | |
opendevreview | Merged openstack/openstack-ansible-ops master: Update MNAIO to use Ansible Collections https://review.opendev.org/c/openstack/openstack-ansible-ops/+/850817 | 16:06 |
opendevreview | Merged openstack/openstack-ansible-os_mistral master: Add release notes for mistral_api_use_uwsgi https://review.opendev.org/c/openstack/openstack-ansible-os_mistral/+/849804 | 16:13 |
*** dviroel|lunch is now known as dviroel | 16:26 | |
mgariepy | c9s again .. https://zuul.opendev.org/t/openstack/build/897cce3b5405487aa670b4dce05e3051/log/logs/host/rsyslog.service.journal-14-32-29.log.txt#15-17 | 18:06 |
mgariepy | wtf https://zuul.opendev.org/t/openstack/build/897cce3b5405487aa670b4dce05e3051/log/logs/host/keystone-wsgi-public.service.journal-14-32-29.log.txt#14960 | 18:15 |
jrosser | it's not string vs actual bool is it? | 18:28 |
mgariepy | no idea | 18:29 |
jrosser | because maybe 'False' != False | 18:29 |
mgariepy | i'm just trying to find out why tempest fails on centos | 18:29 |
mgariepy | https://zuul.opendev.org/t/openstack/build/897cce3b5405487aa670b4dce05e3051/log/logs/host/haproxy.service.journal-14-32-29.log.txt#1784 | 18:30 |
mgariepy | this is also a bit troublesome. | 18:30 |
mgariepy | i wonder if we are not a bit too aggressive on the keystone threads uwsgi config | 18:30 |
opendevreview | Marc Gariépy proposed openstack/openstack-ansible master: Set the number of threads for processes to 2 https://review.opendev.org/c/openstack/openstack-ansible/+/850942 | 18:53 |
opendevreview | Marc Gariépy proposed openstack/openstack-ansible master: Set the number of threads for processes to 2 https://review.opendev.org/c/openstack/openstack-ansible/+/850942 | 18:54 |
opendevreview | Marc Gariépy proposed openstack/openstack-ansible master: Set the number of threads for processes to 2 https://review.opendev.org/c/openstack/openstack-ansible/+/850942 | 18:55 |
opendevreview | Merged openstack/openstack-ansible stable/victoria: Add mistra-extra repo https://review.opendev.org/c/openstack/openstack-ansible/+/849520 | 18:56 |
*** dviroel is now known as dviroel|afk | 20:17 | |
*** melwitt_ is now known as melwitt | 21:10 |
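
A note on the `-13` discussed above, as a sketch rather than a confirmed diagnosis: Python's `subprocess` reports a return code of `-N` when the child was terminated by signal `N`, and signal 13 is SIGPIPE on Linux, whereas ssh itself exits with 255 on a connection error. The helper names below (`RETRYABLE`, `describe`, `should_retry`, `run_child_killed_by_sigpipe`) are hypothetical and are not Ansible's code. They only mirror the idea in mgariepy's local change of treating both 255 and -13 as retryable.

```python
import signal
import subprocess
import sys

# Hypothetical set of return codes treated as "connection lost, try again".
# 255 is ssh's own error status; -SIGPIPE (-13 on Linux) is how Popen
# reports a child that was terminated by SIGPIPE.
RETRYABLE = {255, -signal.SIGPIPE}


def describe(returncode: int) -> str:
    """Explain a return code the way subprocess.Popen reports it on POSIX."""
    if returncode < 0:
        # Negative value -N means "terminated by signal N".
        return "killed by " + signal.Signals(-returncode).name
    return f"exited with status {returncode}"


def should_retry(returncode: int) -> bool:
    """Assumed retry policy: retry on ssh error 255 or death by SIGPIPE."""
    return returncode in RETRYABLE


def run_child_killed_by_sigpipe() -> int:
    """Spawn a child that dies from SIGPIPE and return its return code."""
    # Python ignores SIGPIPE at startup, so the child restores the default
    # action before signalling itself.
    child = (
        "import os, signal; "
        "signal.signal(signal.SIGPIPE, signal.SIG_DFL); "
        "os.kill(os.getpid(), signal.SIGPIPE)"
    )
    return subprocess.run([sys.executable, "-c", child]).returncode


if __name__ == "__main__":
    rc = run_child_killed_by_sigpipe()
    print(rc, describe(rc), "retry" if should_retry(rc) else "fail")
```

Whether the ssh client in the failing runs really dies from SIGPIPE is not established in the log; the sketch only shows where a negative return code can come from and why a check for `== 255` alone would not match it.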