*** rfolco|rover has joined #softwarefactory | 00:05 | |
*** apevec has quit IRC | 00:46 | |
*** rfolco|rover has quit IRC | 00:47 | |
*** rfolco|rover has joined #softwarefactory | 01:23 | |
*** harrymichal has joined #softwarefactory | 07:31 | |
*** harrymichal has quit IRC | 07:43 | |
*** jpena|off is now known as jpena | 07:57 | |
*** jpena is now known as jpena|lunch | 11:38 | |
*** jpena|lunch is now known as jpena | 12:38 | |
*** sshnaidm is now known as sshnaidm|mtg | 13:23 | |
*** rfolco|rover is now known as rfolco | 14:01 | |
*** harrymichal has joined #softwarefactory | 15:06 | |
*** harrymichal has quit IRC | 15:11 | |
*** sshnaidm|mtg is now known as sshnaidm|ruck | 15:41 | |
*** harrymichal has joined #softwarefactory | 15:53 | |
*** harrymichal has quit IRC | 16:03 | |
*** harrymichal has joined #softwarefactory | 16:03 | |
*** sshnaidm|ruck is now known as sshnaidm|off | 16:52 | |
*** jpena is now known as jpena|off | 17:21 | |
*** sduthil has joined #softwarefactory | 18:29 | |
sduthil | hi there! I'm running software factory for the wazo platform project http://wazo-platform.org. I'll describe my problem below: | 18:30 |
sduthil | I have 1 VM running zuul and nodepool and another VM hosting the runc containers started by the zuul VM | 18:30 |
sduthil | I sometimes have a job that gets stuck in the building state for the runc container | 18:31 |
sduthil | after investigating, the runc container is correctly started, but the task to check that the container is running never finishes (ssh <container> echo okay) | 18:32 |
sduthil | I have one such job right now that is stuck, and I'm digging to understand what's happening | 18:32 |
sduthil | I do see the ansible-playbook ...create.yml running, but I don't see the ssh echo okay client process | 18:33 |
sduthil | stracing the ansible-playbook ...create.yml gives me something like this, in a fast loop: | 18:34 |
sduthil | clock_gettime(CLOCK_MONOTONIC, {tv_sec=769509, tv_nsec=56756942}) = 0 | 18:34 |
sduthil | select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=10000}) = 0 (Timeout) | 18:34 |
sduthil | ... | 18:34 |
sduthil | do you have any idea what I could try to understand what is stuck? | 18:35 |
sduthil | if I ssh container echo okay in another shell, it works perfectly (did that in a loop 1000 times) | 18:35 |
sduthil | the jobs get stuck randomly, about 1 time out of 100; it's rather rare | 18:36 |
tristanC | sduthil: that's odd, perhaps it happens when the test is performed too early? | 18:38 |
sduthil | tristanC, yes, that's what I thought too. In that scenario, either the ssh client command would fail, return, and get re-run by ansible, or it would be stuck somewhere and I would see it in the running processes | 18:39 |
sduthil | but I don't see the ssh client command anywhere | 18:39 |
tristanC | sduthil: we are actually in the process of removing the runc driver... perhaps we could help you set up the kubernetes-based replacement? | 18:39 |
sduthil | ah ok, I didn't know runc was being removed :) | 18:40 |
nhicher | sduthil: it will be removed in sf-3.5. Are you on 3.4? | 18:41 |
tristanC | sduthil: the ssh client command should be running on the nodepool-launcher host, it is triggered by: https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/playbooks/create.yml@32 | 18:41 |
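(For orientation: the linked create.yml task is essentially a retried ssh probe. A conceptual Python equivalent is sketched below; the function name and the per-attempt timeout are assumptions, while retries=3 and delay=5 match Ansible's defaults for retried tasks, as tristanC notes later.)

    import subprocess
    import time

    def wait_for_ssh(host, port, user, retries=3, delay=5):
        # Same probe the playbook runs: "ssh <container> echo okay",
        # retried with a short delay between attempts.
        for _ in range(retries):
            try:
                subprocess.run(
                    ["ssh", "-p", str(port), "%s@%s" % (user, host),
                     "echo", "okay"],
                    check=True, timeout=30)
                return True
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                time.sleep(delay)
        return False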
sduthil | I haven't played with kubernetes yet, so I'll need to read up a bit on kube before making that kind of transition | 18:41 |
tristanC | sduthil: you don't have to, sf comes with a role to set up a small service that can be used in place of k8s | 18:42 |
sduthil | nhicher, I don't know how to check the version of software-factory, what's the command? | 18:43 |
tristanC | sduthil: cat /etc/sf-release | 18:43 |
sduthil | I'm running SF 3.3 | 18:43 |
sduthil | tristanC, I read that playbook thoroughly, and I totally agree that I should see an ssh client command on the nodepool-launcher host, but I don't | 18:45 |
sduthil | also, in an earlier investigation, I killed the ansible-playbook ...create.yml and saw "Wait for ssh access" in the nodepool log, then nothing. The symptoms now are exactly the same as in that earlier investigation, so it is most likely stuck at the same place as before. | 18:47 |
tristanC | sduthil: could it be the task before that is stuck? | 18:47 |
tristanC | sduthil: hum, if it said "Wait for ssh access", then no :) | 18:48 |
sduthil | tristanC, I doubt it, since the runc container is correctly running and usable, and the earlier investigation showed the log "Wait for ssh access" | 18:48 |
sduthil | yes, I agree | 18:48 |
* tristanC scratching head | 18:50 | |
sduthil | yep, same here :) | 18:51 |
sduthil | I have those two processes: | 18:51 |
sduthil | nodepool 9831 1.0 0.9 523032 38328 ? Sl 13:22 3:25 /opt/rh/rh-python35/root/usr/bin/python3 /opt/rh/rh-python35/root/usr/bin/ansible-playbook /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/runc/playbooks/create.yml -i /var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.inventory -e use_rootfs=False -e zuul_console_dir=/tmp -e hypervisor_info_file=/var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.json -e container_id=0000031672-runc-debian-buster-200-0000884177 -e container_spec=/var/opt/rh/rh-python35/lib/nodepool/runc/0000031672-runc-debian-buster-200-0000884177.config -e worker_username=zuul-worker -e worker_homedir=/home/zuul-worker -e host_addr=containers.zuul.wazo.community -e container_port=23604 | 18:51 |
sduthil | nodepool 10170 0.0 0.8 524936 34080 ? S 13:24 0:00 /opt/rh/rh-python35/root/usr/bin/python3 /opt/rh/rh-python35/root/usr/bin/ansible-playbook /opt/rh/rh-python35/root/usr/lib/python3.5/site-packages/nodepool/driver/runc/playbooks/create.yml -i /var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.inventory -e use_rootfs=False -e zuul_console_dir=/tmp -e hypervisor_info_file=/var/opt/rh/rh-python35/lib/nodepool/runc/containers.zuul.wazo.community.json -e container_id=0000031672-runc-debian-buster-200-0000884177 -e container_spec=/var/opt/rh/rh-python35/lib/nodepool/runc/0000031672-runc-debian-buster-200-0000884177.config -e worker_username=zuul-worker -e worker_homedir=/home/zuul-worker -e host_addr=containers.zuul.wazo.community -e container_port=23604 | 18:51 |
sduthil | the first is the parent, the second is the child | 18:51 |
sduthil | the parent shows the strace output mentioned earlier | 18:51 |
sduthil | the child shows the following strace, when attaching: | 18:52 |
sduthil | strace: Process 10170 attached | 18:52 |
sduthil | futex(0x7f02facee930, FUTEX_WAIT_PRIVATE, 2, NULL | 18:52 |
sduthil | ... | 18:52 |
sduthil | (no loop here, it's just waiting) so not very helpful | 18:53 |
tristanC | sduthil: it looks like an ansible bug, the task should either have an ssh child process, or it should fail after 3 attempts (with a default 5 second delay) | 18:53 |
sduthil | would upgrading ansible be a good idea? I'm running ansible 2.6.9 | 18:53 |
tristanC | sduthil: that can be attempted, though upgrading doesn't always solve the issue at hand, and there is a risk it introduces more | 18:54 |
tristanC | sduthil: since that driver is no longer supported, may I suggest another band-aid solution? | 18:55 |
sduthil | are there known compatibility issues for software-factory with higher versions of ansible? | 18:55 |
tristanC | sduthil: let me check | 18:56 |
tristanC | sduthil: it should be fine, the main risk would be the zuul executor, but your version already has the `multi-ansible-version` feature where zuul uses custom venvs for ansible | 18:59 |
tristanC | sduthil: however, I think a better solution would be to add here: https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/provider.py@80 | 18:59 |
tristanC | sduthil: a ["timeout", "256s", "ansible-playbook"] in the begining of the argv list | 19:00 |
tristanC | sduthil: such a timeout would be a safeguard to keep the ansible process from staying stuck forever, and when it kicks in, it should properly propagate as a start failure and nodepool will retry creating the node | 19:01 |
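(A minimal sketch of that suggestion, assuming the provider builds the ansible-playbook command as a Python argv list before spawning it; the variable name argv is an assumption.)

    # Prefix the existing command with coreutils `timeout` so a hung
    # ansible-playbook run is killed after 256s; the non-zero exit then
    # surfaces to nodepool as a start failure and the node is retried.
    argv = ["timeout", "256s"] + argv
    # e.g. ["timeout", "256s", "ansible-playbook", ".../create.yml", "-i", ...]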
sduthil | ah thank you :) I was looking for a timeout feature in ansible, but didn't find any | 19:02 |
sduthil | would it behave differently from killing the ansible-playbook ...create.yml directly? I already tried that, and the container was never removed, so when nodepool tried to recreate the container, it failed saying "the container with id ... is already started" | 19:04 |
tristanC | hmm, adding such a timeout should have the same effect as killing the ansible-playbook process | 19:05 |
tristanC | e.g. it should propagate through https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/provider.py@199 | 19:06 |
tristanC | and then https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/handler.py@45 | 19:07 |
tristanC | which should result in a new attempt using a new hostid | 19:07 |
tristanC | however, you may have found another issue where the leftover runc process may result in a conflict... | 19:08 |
sduthil | yeah, I see that the hostid does not include the retry number, so it will be identical to the previous try | 19:09 |
tristanC | sduthil: thus, you may have to also add a 'try: self.handler.manager.cleanupNode(hostid) \n except: pass' in https://review.opendev.org/#/c/535556/28/nodepool/driver/runc/handler.py@46 | 19:10 |
tristanC | yeah you are right, the hostid is not updated | 19:11 |
tristanC | sduthil: I think it's safe to attempt a delete when the create fails; just wrap it in another try/except to avoid escaping the retry loop too soon | 19:11 |
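(A minimal sketch of that handler change; the retry loop, attempt count, and create_container stand-in are assumptions, only manager.cleanupNode(hostid) is quoted from the suggestion above.)

    def launch_with_cleanup(manager, hostid, retries=3):
        for attempt in range(retries):
            try:
                create_container(hostid)  # stand-in for the real create call
                return
            except Exception:
                # Best-effort cleanup of the leftover container so the next
                # attempt does not hit "container with id ... is already
                # started"; the inner try/except keeps a cleanup error from
                # escaping the retry loop too soon.
                try:
                    manager.cleanupNode(hostid)
                except Exception:
                    pass
                if attempt == retries - 1:
                    raise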
sduthil | ok, I'll apply this patch then | 19:12 |
sduthil | thank you very much for your help | 19:12 |
tristanC | sduthil: you're welcome, please let us know how it went | 19:18 |
sduthil | sure, but it will take a few weeks to confirm that it doesn't happen again :) | 19:20 |
sduthil | I'd like to clone the source code to make the patch, but I can't find the files to be patched when I clone with git clone https://review.opendev.org/zuul/nodepool | 19:23 |
sduthil | I can't find the nodepool/driver/runc directory | 19:23 |
sduthil | how do I git clone the runc driver? | 19:24 |
tristanC | sduthil: that's because the driver is not merged. have a look at https://review.opendev.org/#/c/535556/28 | 19:24 |
tristanC | sduthil: on the top right, there is a `download` button that contains the git commands to fetch the driver locally | 19:24 |
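(For reference, Gerrit change refs follow refs/changes/&lt;last two digits&gt;/&lt;change&gt;/&lt;patchset&gt;, so for patchset 28 of change 535556 the download command should look something like `git fetch https://review.opendev.org/zuul/nodepool refs/changes/56/535556/28` followed by `git checkout FETCH_HEAD`; copying the exact command from the download button is safest.)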
sduthil | ah! thank you! | 19:26 |
sduthil | seems like a well-hidden bug, still unsolved: https://github.com/ansible/ansible/issues/30411 | 20:07 |
*** sduthil has quit IRC | 20:36 | |
*** rfolco has quit IRC | 20:52 | |
*** harrymichal has quit IRC | 21:33 | |
*** harrymichal has joined #softwarefactory | 21:35 | |
*** rfolco has joined #softwarefactory | 23:58 |