*** rfolco|rover has joined #softwarefactory | 00:05 | |
*** apevec has quit IRC | 00:46 | |
*** rfolco|rover has quit IRC | 00:47 | |
*** rfolco|rover has joined #softwarefactory | 01:23 | |
*** harrymichal has joined #softwarefactory | 07:31 | |
*** harrymichal has quit IRC | 07:43 | |
*** jpena|off is now known as jpena | 07:57 | |
*** jpena is now known as jpena|lunch | 11:38 | |
*** jpena|lunch is now known as jpena | 12:38 | |
*** sshnaidm is now known as sshnaidm|mtg | 13:23 | |
*** rfolco|rover is now known as rfolco | 14:01 | |
*** harrymichal has joined #softwarefactory | 15:06 | |
*** harrymichal has quit IRC | 15:11 | |
*** sshnaidm|mtg is now known as sshnaidm|ruck | 15:41 | |
*** harrymichal has joined #softwarefactory | 15:53 | |
*** harrymichal has quit IRC | 16:03 | |
*** harrymichal has joined #softwarefactory | 16:03 | |
*** sshnaidm|ruck is now known as sshnaidm|off | 16:52 | |
*** jpena is now known as jpena|off | 17:21 | |
*** sduthil has joined #softwarefactory | 18:29 | |
sduthil | hi there! I'm running software factory for the wazo platform project http://wazo-platform.org. I'll describe my problem below: | 18:30 |
sduthil | I have 1 VM running zuul and nodepool and another VM hosting the runc containers started by the zuul VM | 18:30 |
sduthil | I sometimes have a job that gets stuck in the building state for the runc container | 18:31 |
sduthil | after investigating, the runc container is correctly started, but the task to check that the container is running never finishes (ssh <container> echo okay) | 18:32 |
sduthil | I have one such job right now that is stuck, and I'm digging to understand what's happening | 18:32 |
sduthil | I do see the ansible-playbook ...create.yml running, but I don't see the ssh echo okay client process | 18:33 |
sduthil | stracing the ansible-playbook ...create.yml gives me something like this, in a fast loop: | 18:34 |
sduthil | clock_gettime(CLOCK_MONOTONIC, {tv_sec=769509, tv_nsec=56756942}) = 0 | 18:34 |
sduthil | select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) | 18:34 |
sduthil | ... | 18:34 |
sduthil | do you have any idea what I could try to understand what is stuck? | 18:35 |
sduthil | if I ssh container echo okay in another shell, it works perfectly (did that in a loop 1000 times) | 18:35 |
sduthil | the jobs get stuck randomly, about 1 time out of 100; it's rather rare | 18:36 |
tristanC | sduthil: that's odd, perhaps it happens when the test is performed too early? | 18:38 |
sduthil | tristanC, yes, that's what I thought too. In that scenario, either the ssh client command would fail, return, and get re-run by ansible, or it would be stuck somewhere and I would see it in the running processes | 18:39 |
sduthil | but I don't see the ssh client command anywhere | 18:39 |
tristanC | sduthil: we are actually in the process of removing the runc driver... perhaps we could help you set up the kubernetes-based replacement? | 18:39 |
sduthil | ah ok, I didn't know runc was being removed :) | 18:40 |
nhicher | sduthil: it will be removed in sf-3.5. Are you on 3.4? | 18:41 |
tristanC | sduthil: the ssh client command should be running on the nodepool-launcher host, it is triggered by: https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/playbooks/create.yml@32 | 18:41 |
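(For orientation: the linked create.yml task is essentially a retried ssh probe. A conceptual Python equivalent is sketched below; the function name and the per-attempt timeout are assumptions, while retries=3 and delay=5 match Ansible's defaults for retried tasks, as tristanC notes later.)

    import subprocess
    import time

    def wait_for_ssh(host, port, user, retries=3, delay=5):
        # Same probe the playbook runs: "ssh <container> echo okay",
        # retried with a short delay between attempts.
        for _ in range(retries):
            try:
                subprocess.run(
                    ["ssh", "-p", str(port), "%s@%s" % (user, host),
                     "echo", "okay"],
                    check=True, timeout=30)
                return True
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                time.sleep(delay)
        return False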
sduthil | I haven't played with kubernetes yet, so I'll need to read up a bit on kube before making that kind of transition | 18:41 |
tristanC | sduthil: you don't have to, sf comes with a role to set up a small service that can be used in place of k8s | 18:42 |
sduthil | nhicher, I don't know how to check the version of software-factory, what's the command? | 18:43 |
tristanC | sduthil: cat /etc/sf-release | 18:43 |
sduthil | I'm running SF 3.3 | 18:43 |
sduthil | tristanC, I read that playbook thoroughly, and I totally agree that I should see an ssh client command on the nodepool-launcher host, but I don't | 18:45 |
sduthil | also, in an earlier investigation, I killed the ansible-playbook ...create.yml and saw "Wait for ssh access" in the nodepool log, then nothing. The symptoms now are exactly the same as in that earlier investigation, so it is most likely stuck at the same place as before. | 18:47 |
tristanC | sduthil: could it be the task before that is stuck? | 18:47 |
tristanC | sduthil: hum, if it said "Wait for ssh access", then no :) | 18:48 |
sduthil | tristanC, I doubt it, since the runc container is correctly running and usable, and the earlier investigation showed the log "Wait for ssh access" | 18:48 |
sduthil | yes, I agree | 18:48 |
* tristanC scratching head | 18:50 | |
sduthil | yep, same here :) | 18:51 |
sduthil | I have those two processes: | 18:51 |
sduthil | nodepool 9831 1.0 0.9 523032 38328 ? Sl 13:22 3:25 /opt/rh/rh-python35/root/usr/bin/python3 /opt/rh/rh-python35/root/usr/bin/ansible-playbook /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/runc/playbooks/create.yml -i /var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.inventory -e use_rootfs=False -e zuul_console_dir=/tmp -e hypervisor_info_file=/var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.json -e container_id=0000031672-runc-debian-buster-200-0000884177 -e container_spec=/var/opt/rh/rh-python35/lib/nodepool/runc/0000031672-runc-debian-buster-200-0000884177.config -e worker_username=zuul-worker -e worker_homedir=/home/zuul-worker -e host_addr=containers.zuul.wazo.community -e container_port=23604 | 18:51 |
sduthil | nodepool 10170 0.0 0.8 524936 34080 ? S 13:24 0:00 /opt/rh/rh-python35/root/usr/bin/python3 /opt/rh/rh-python35/root/usr/bin/ansible-playbook /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/runc/playbooks/create.yml -i /var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.inventory -e use_rootfs=False -e zuul_console_dir=/tmp -e hypervisor_info_file=/var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.json -e container_id=0000031672-runc-debian-buster-200-0000884177 -e container_spec=/var/opt/rh/rh-python35/lib/nodepool/runc/0000031672-runc-debian-buster-200-0000884177.config -e worker_username=zuul-worker -e worker_homedir=/home/zuul-worker -e host_addr=containers.zuul.wazo.community -e container_port=23604 | 18:51 |
sduthil | the first is the parent, the second is the child | 18:51 |
sduthil | the parent shows the strace output mentioned earlier | 18:51 |
sduthil | the child shows the following strace, when attaching: | 18:52 |
sduthil | strace: Process 10170 attached | 18:52 |
sduthil | futex(0x7f02facee930, FUTEX_WAIT_PRIVATE, 2, NULL | 18:52 |
sduthil | ... | 18:52 |
sduthil | (no loop here, it's just waiting) so not very helpful | 18:53 |
tristanC | sduthil: it looks like an ansible bug, the task should either have an ssh child process, or it should fail after 3 attempts (with a default 5 second delay) | 18:53 |
sduthil | would upgrading ansible be a good idea? I'm running ansible 2.6.9 | 18:53 |
tristanC | sduthil: that can be attempted, though upgrading doesn't always solve the issue at hand, and there is a risk it introduces more | 18:54 |
tristanC | sduthil: since that driver is no longer supported, may I suggest another band-aid solution? | 18:55 |
sduthil | are there known compatibility issues for software-factory with higher versions of ansible? | 18:55 |
tristanC | sduthil: let me check | 18:56 |
tristanC | sduthil: it should be fine, the main risk would be the zuul executor, but your version already has the `multi-ansible-version` feature where zuul uses custom venvs for ansible | 18:59 |
tristanC | sduthil: however, I think a better solution would be to add here: https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/provider.py@80 | 18:59 |
tristanC | sduthil: a ["timeout", "256s", "ansible-playbook"] in the begining of the argv list | 19:00 |
tristanC | sduthil: such a timeout would be a safeguard to keep the ansible process from staying stuck forever, and when it kicks in, it should properly propagate as a start failure and nodepool will retry creating the node | 19:01 |
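(A minimal sketch of that suggestion, assuming the provider builds the ansible-playbook command as a Python argv list before spawning it; the variable name argv is an assumption.)

    # Prefix the existing command with coreutils `timeout` so a hung
    # ansible-playbook run is killed after 256s; the non-zero exit then
    # surfaces to nodepool as a start failure and the node is retried.
    argv = ["timeout", "256s"] + argv
    # e.g. ["timeout", "256s", "ansible-playbook", ".../create.yml", "-i", ...]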
sduthil | ah thank you :) I was looking for a timeout feature in ansible, but didn't find any | 19:02 |
sduthil | would it behave differently from killing the ansible-playbook ...create.yml directly? I already tried that, and the container was never removed, so when nodepool tried to recreate the container, it failed saying "the container with id ... is already started" | 19:04 |
tristanC | hmm, adding such a timeout should have the same effect as killing the ansible-playbook process | 19:05 |
tristanC | e.g. it should propagate through https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/provider.py@199 | 19:06 |
tristanC | and then https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/handler.py@45 | 19:07 |
tristanC | which should result in a new attempt using a new hostid | 19:07 |
tristanC | however, you may have found another issue where the leftover runc process may result in a conflict... | 19:08 |
sduthil | yeah, I see that the hostid does not include the retry number, so it will be identical to the previous try | 19:09 |
tristanC | sduthil: thus, you may have to also add a 'try: self.handler.manager.cleanupNode(hostid) \n except: pass' in https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/handler.py@46 | 19:10 |
tristanC | yeah you are right, the hostid is not updated | 19:11 |
tristanC | sduthil: I think it's safe to attempt a delete when the create fails; just wrap it in another try/except to avoid escaping the retry loop too soon | 19:11 |
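(A minimal sketch of that handler change; the retry loop, attempt count, and create_container stand-in are assumptions, only manager.cleanupNode(hostid) is quoted from the suggestion above.)

    def launch_with_cleanup(manager, hostid, retries=3):
        for attempt in range(retries):
            try:
                create_container(hostid)  # stand-in for the real create call
                return
            except Exception:
                # Best-effort cleanup of the leftover container so the next
                # attempt does not hit "container with id ... is already
                # started"; the inner try/except keeps a cleanup error from
                # escaping the retry loop too soon.
                try:
                    manager.cleanupNode(hostid)
                except Exception:
                    pass
                if attempt == retries - 1:
                    raise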
sduthil | ok, I'll apply this patch then | 19:12 |
sduthil | thank you very much for your help | 19:12 |
tristanC | sduthil: you're welcome, please let us know how it went | 19:18 |
sduthil | sure, but it will take a few weeks to confirm that it doesn't happen again :) | 19:20 |
sduthil | I'd like to clone the source code to make the patch, but I can't find the files to be patched when I clone with git clone https://review.opendev.org/zuul/nodepool | 19:23 |
sduthil | I can't find the nodepool/driver/runc directory | 19:23 |
sduthil | how do I git clone the runc driver? | 19:24 |
tristanC | sduthil: that's because the driver is not merged. have a look at https://review.opendev.org/#/c/535556/28 | 19:24 |
tristanC | sduthil: on the top right, there is a `download` button that contains the git commands to fetch the driver locally | 19:24 |
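(For reference, Gerrit change refs follow refs/changes/&lt;last two digits&gt;/&lt;change&gt;/&lt;patchset&gt;, so for patchset 28 of change 535556 the download command should look something like `git fetch https://review.opendev.org/zuul/nodepool refs/changes/56/535556/28` followed by `git checkout FETCH_HEAD`; copying the exact command from the download button is safest.)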
sduthil | ah! thank you! | 19:26 |
sduthil | seems like a well-hidden bug, still unsolved: https://github.com/ansible/ansible/issues/30411 | 20:07 |
*** sduthil has quit IRC | 20:36 | |
*** rfolco has quit IRC | 20:52 | |
*** harrymichal has quit IRC | 21:33 | |
*** harrymichal has joined #softwarefactory | 21:35 | |
*** rfolco has joined #softwarefactory | 23:58 |