Wednesday, 2021-02-24

*** fuentess has quit IRC02:46
*** sameo has joined #kata-dev06:18
kata-irc-bot<samuel.ortiz> @fidencio overlayfs means we need to use virtiofs to expose the container image, while devicemapper means virtio-blk.06:26
kata-irc-bot<samuel.ortiz> @fidencio06:26
*** dklyle has quit IRC07:16
*** sgarzare has joined #kata-dev08:01
kata-irc-bot<fidencio> I was missing that piece, makes sense. Thanks @samuel.ortiz.08:05
*** jodh has joined #kata-dev08:19
*** davidgiluk has joined #kata-dev09:01
*** pmores has joined #kata-dev09:06
pmoresHi! I'm looking into merging https://github.com/kata-containers/runtime/pull/3127 .  There are errors reported in CI logs (e.g. http://jenkins.katacontainers.io/job/kata-containers-runtime-ubuntu-18-04-virtiofs-PR/851/console). Some of them do look interesting (to me at least) however they either don't seem to cause the tests to fail or don't seem to be related to the PR in question.09:13
pmoresHow to proceed in such a case?09:13
kata-irc-bot<fidencio> Let's take a look at the error: ```18:45:15 docker: Error response from daemon: OCI runtime create failed: virtiofs daemon /usr/bin/virtiofsd returned with error: fork/exec /usr/bin/virtiofsd: no such file or directory: unknown.```09:21
kata-irc-bot<fidencio> Looking way up in the logs it says: ```18:42:05 INFO: Installing cached QEMU 18:42:06 [download progress: 0.1% ... 100.0%] 18:42:06 kata-static-qemu.tar.gz: OK```09:24
kata-irc-bot<fidencio> Then, taking a look at the tar.gz, from http://jenkins.katacontainers.io/job/qemu-nightly-x86_64/, we can see: ```fidencio@quino ~ $ tar tf ~/Descargas/kata-static-qemu.tar.gz |  grep virtiofsd usr/share/kata-qemu/qemu/vhost-user/50-qemu-virtiofsd.json usr/libexec/kata-qemu/virtiofsd```09:25
kata-irc-bot<fidencio> So, at least now we know the reason why it fails.09:25
kata-irc-bot<fidencio> Now, going back to your original question, "How to proceed in such a case?"  I'd recommend opening an issue against the tests repo, explicitly mentioning which test is failing, from which CI, and with the info provided above.09:26
kata-irc-bot<fidencio> Then wait till it's solved before merging #312709:27
kata-irc-bot<fidencio> Does this sound reasonable?09:27
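[Editor's note] The `tar tf ... | grep virtiofsd` check above can also be scripted. This is only a minimal sketch: the helper name `find_in_tarball` is made up for illustration, and the local path in the usage line is whatever you saved the nightly tarball as.

```python
import tarfile


def find_in_tarball(tar_path, needle):
    """Return the archive member names that contain `needle`.

    Mirrors `tar tf <tar_path> | grep <needle>`; "r:*" lets tarfile
    detect the compression (gzip here) on its own.
    """
    with tarfile.open(tar_path, "r:*") as archive:
        return [name for name in archive.getnames() if needle in name]
```

Usage would look like `find_in_tarball("kata-static-qemu.tar.gz", "virtiofsd")`; if no returned name ends in `usr/bin/virtiofsd`, the runtime is being pointed at a path the tarball never installs.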
kata-irc-bot<fidencio> Folks, if I can get a quick review on https://github.com/kata-containers/community/pull/199, I'd greatly appreciate it!09:54
fidenciojodh: ^09:55
kata-irc-bot<fidencio> @fupan ^^ :slightly_smiling_face:09:55
kata-irc-bot<fupan> ?09:56
kata-irc-bot<fidencio> If you can give me a quick review on https://github.com/kata-containers/community/pull/19909:57
kata-irc-bot<fidencio> Just to get the process going :slightly_smiling_face:09:57
fidenciojodh: thanks!09:57
pmoresfidencio: okay, so the policy is don't merge even if errors aren't related, right?09:59
fidenciopmores: the policy is to have the issues reported, at least.10:00
fidenciopmores: I'd prefer not merging, some may prefer to merge10:01
fidenciopmores: but the issue being reported is a must10:01
pmoresI mean, besides the one you mentioned, there's a lot of other errors related to e.g. tracing, TestCleanupContainer cannot remove rootfs dir etc.10:02
kata-irc-bot<fupan> Done10:03
fidenciopmores: we have to understand which errors cause the CI to abort, and which are just logged "harmless" content10:03
kata-irc-bot<fidencio> Lovely, thanks!10:03
pmores(Of course I'll report the error, no question about that.  Just trying to understand the overall process.)10:03
fidenciopmores: I guess it takes some time to navigate through the CI errors, and figure out what's causing the breakage10:04
fidenciopmores: it took me some time, at least10:04
fidenciopmores: that's one of the proposals that Cameron raised, about improving that for contributors to easily see what went wrong with the CI10:05
fidenciobut people didn't have cycles to start working on that, at least not yet10:05
pmoresYeah, got it.10:05
pmoresBTW how about the last reported error: ERROR: kata-log-parser: missing pid: {Count:0 TimeDelta:0 Filename:/tmp/jenkins/workspace/kata-containers-runtime-ubuntu-18-04-virtiofs-PR/go/src/github.com/kata-containers/tests/kata.log Line:1 Time:2021-02-23 17:43:33.932314193 +0000 UTC Pid:0 Level:warning Msg:Feature to allow containers to share PID namespace with the agent has been enabled. Please understand this has security implications10:06
pmoresand should only be used for debug purposes Source: Name: Container: Sandbox: Data:map[]}10:06
pmoresIsn't this the one that actually caused failure?10:06
fidenciono, that's just the teardown failing10:11
kata-irc-bot<fidencio> For instance, when we look at: ```18:45:15 docker: Error response from daemon: OCI runtime create failed: virtiofs daemon /usr/bin/virtiofsd returned with error: fork/exec /usr/bin/virtiofsd: no such file or directory: unknown. 18:45:15 Makefile:237: recipe for target 'conformance' failed 18:45:15 make: *** [conformance] Error 127 18:45:15 Failed at 121: sudo -E PATH="$PATH" bash -c "make conformance" 18:45:15 Build step 'Execute shell' marked build as failure```10:11
pmoresah okay, thanks10:11
kata-irc-bot<fidencio> We see this: ```18:45:15 Build step 'Execute shell' marked build as failure```10:11
kata-irc-bot<fidencio> This is important information, this is what marked the whole job as a failure10:12
kata-irc-bot<fidencio> Whatever happens after that, that's not that important (generally speaking)10:12
kata-irc-bot<fidencio> Makes sense?10:12
fidencio(and sorry for mixing IRC and Slack, some things are easier to type on IRC, some others on Slack)10:13
pmoresYeah, makes sense.10:13
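[Editor's note] The triage rule described above (find the line that marked the build as failed, read what comes just before it, and treat later output such as teardown noise as secondary) can be sketched as follows. The function name and the default `context` of 5 lines are illustrative choices, not part of any Kata or Jenkins tooling.

```python
def lines_leading_to_failure(log_text, marker="marked build as failure", context=5):
    """Return the first line containing `marker` plus up to `context` lines before it.

    Anything after the marker (for example teardown errors) is ignored on
    purpose. Returns an empty list when the marker never appears.
    """
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            start = max(0, index - context)
            return lines[start:index + 1]
    return []
```

Feeding it the saved console log of the job should surface the `virtiofsd: no such file or directory` line and the `make conformance` failure, without the later kata-log-parser error.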
pmoresfidencio: hm, I somehow can't find the failing test in kata-containers or tests repos ('main' branch in both cases)...10:27
fidenciopmores: yep, things are not so easy to find / navigate.  First thing we have to keep in mind, this test case comes from kata-containers/runtime, which means kata-containers/tests should be using `master` branch (yeah, confusing, I know)10:29
fidenciofrom the logs, seems that what's failing is `make conformance`10:30
fidencio`make conformance` is called from `kata-containers/tests/.ci/run.sh`10:31
fidenciofrom `master` branch10:31
pmoresalright, at least I can now see the Dockerfile that the test uses! :-)10:33
fidenciopmores: so, let's go a little bit deepr10:34
fidencio*deeper10:34
fidenciopmores: we know the failure comes from `make conformance`, if you open the `.ci/run.sh` script and search for that, you'll see it in several places10:34
fidencioyou also pointed out that the failing job is "virtiofs-PR", which leads us to the line 118, the "VIRTIOFS" case in that switch.10:35
fidencioand, as you already pointed out, from the Makefile you can get to the `conformance` folder10:36
fidenciowhere you can find the Dockerfile10:36
fidencioSo, my *guess* here, and it's just a *guess*10:36
pmoreshang on, I think I'm confused - is this the 2.0 kata-containers repo, or the 1.x tests repo?10:37
fidencio`master` branch, is 1.x10:38
fidenciopmores: what are you confused with?10:39
pmoresfidencio: you said I should see 'make conformance' several times in .ci/run.sh. I can see it only once in my run.sh. So I'm probably looking at a different one.10:41
fidenciopmores: https://github.com/kata-containers/tests/blob/master/.ci/run.sh10:42
pmoresI'm looking at 1.x tests/.ci/run.sh, branch 'master'.10:42
fidenciohow old is your clone?10:42
pmoresI've just pulled. I'm looking into it, give me a second...10:44
pmoresokay, I think I'm OK now (not quite sure what was wrong but apparently it was nothing that 'git reset --hard' and 'git pull' couldn't solve ;-))10:47
fidenciopmores: so, shall I proceed?10:48
pmoresyes please :-)10:48
fidenciopmores: basically, what I *think* is going wrong here, and that's just a *guess*, is that when we're building Kata Containers, we're not passing the correct value for virtiofsd location10:50
fidenciothis is handled here: https://github.com/kata-containers/tests/blob/master/.ci/install_runtime.sh10:51
pmoresRight. That's a good hint, I can take it from here and investigate.10:51
fidencioSee that it uses runtime_config_path="${SYSCONFDIR}/kata-containers/configuration.toml"10:51
fidencioI don't know what's the content of that file, or where that file comes from (most likely from `make `10:52
pmorescool, don't bother fixing it all for me, I'll have a look and only ask questions if I get stuck - thanks a lot!10:53
fidenciopmores: np. last tip: https://github.com/kata-containers/runtime/blob/59e227336903383fcb04e0075e0b55cbd98c42bb/cli/config/configuration-qemu-virtiofs.toml.in#L123-L12410:54
fidenciopmores: that's the part that contains the wrong information10:54
fidenciopmores: something sets DEFVIRTIOFSDAEMON as /usr/bin/virtiofsd, while we saw the binary is actually at /usr/libexec/kata-qemu/virtiofsd10:55
fidencioI'd recommend you check the whole chain, as it surprises me quite a lot that it's looking for the correct path for QEMU itself.10:55
fidenciopmores: have fun! :-)10:55
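[Editor's note] One way to check the chain described above is to read the daemon path out of the generated configuration.toml and see whether a binary actually exists there. This is a rough sketch: the key name `virtio_fs_daemon` is an assumption about what the template substitutes DEFVIRTIOFSDAEMON into, and both helper names are hypothetical.

```python
import os
import re

# Assumed key name; adjust if the generated configuration.toml uses another one.
_DAEMON_LINE = re.compile(r'^\s*virtio_fs_daemon\s*=\s*"([^"]*)"', re.MULTILINE)


def configured_virtiofsd(config_text):
    """Return the configured virtiofsd path, or None if the key is absent.

    Commented-out lines (starting with '#') are not matched.
    """
    match = _DAEMON_LINE.search(config_text)
    return match.group(1) if match else None


def daemon_is_installed(config_text):
    """True only if the configured path exists and is executable."""
    path = configured_virtiofsd(config_text)
    return path is not None and os.path.isfile(path) and os.access(path, os.X_OK)
```

On the failing CI machine one would expect `configured_virtiofsd(...)` to return `/usr/bin/virtiofsd` and `daemon_is_installed(...)` to return `False`, since the tarball ships the binary under `usr/libexec/kata-qemu/`.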
pmoresso you won't rest... ;-D10:56
fidenciopmores: by the way, I also think that the questions you are asking should have been answered by some sort of FAQ for the herders10:57
fidenciopmores: and I'd appreciate if you could take some time and prepare that10:57
*** devimc has joined #kata-dev13:04
davidgilukdevimc: Get anywhere with your seg?13:20
devimcdavidgiluk, nop :(, I can't reproduce it using `strace -vv -tt --ff virtiofsd` nor virtiofsd-debug (binary with debug symbols), maybe the problem is the optimization and/or a race condition .. ?13:31
devimcdavidgiluk, I'm still on that13:31
devimcdavidgiluk, https://paste.centos.org/view/453b003c13:54
*** fuentess has joined #kata-dev14:04
*** crobinso has joined #kata-dev14:21
davidgilukdevimc: Ah so you're running with the dax world, which is worth noting14:51
devimcdavidgiluk, no, dax is disabled14:52
davidgilukok14:52
devimcdavidgiluk, I'm still trying to get a coredump but centos is kidd(ll)ing me14:54
davidgilukdevimc: Explain the problem to centos in an assertive manner14:57
davidgilukdevimc: You could try using -o sandbox=chroot and attaching gdb to the daemon14:57
devimcdavidgiluk, really hard to do that, since this happens randomly14:58
davidgilukdevimc: Ah, I thought it was happening on every pod run14:59
devimcdavidgiluk, no :(15:00
davidgilukdevimc: Is your core problem where it lands or what?15:01
devimcdavidgiluk, yeah, I don't see it15:03
devimcdavidgiluk, I followed this guide https://pve.proxmox.com/wiki/Enable_Core_Dump_systemd15:03
devimcbut nothing - nada15:03
davidgilukdevimc: I'm an old school type, try setting /proc/sys/kernel/core_pattern to   /tmp/core_%p15:04
davidgilukdevimc: The problem is that may land you in the /tmp of the container or of the chroot or ...wth knows15:05
devimcdavidgiluk, # cat /proc/sys/kernel/core_pattern15:06
devimc/var/lib/systemd/coredump/core-%e-%s-%u-%g-%p-%t15:06
davidgilukdevimc: Are you sure you didn't miss a  '|' at the beginning of that?15:07
davidgilukoh hmm15:07
davidgilukthat's different from mine15:07
davidgilukdevimc: OK, the systemd way seems to be     |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h15:08
davidgilukdevimc: then I guess it pops out in coredumpctl15:08
devimcdavidgiluk, # coredumpctl15:09
devimcNo coredumps found.15:09
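[Editor's note] The distinction being worked out above is whether `core_pattern` names a file or, with a leading `|`, pipes the dump to a helper program. A small sketch of that classification follows; the function names are hypothetical, and `read_core_pattern` only works on Linux.

```python
def core_dump_destination(pattern):
    """Classify a kernel core_pattern value.

    Returns ("pipe", program) when the pattern starts with '|', meaning the
    kernel hands the dump to that program, otherwise ("file", template).
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        parts = pattern[1:].split()
        return ("pipe", parts[0] if parts else "")
    return ("file", pattern)


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

With devimc's value the result is a plain file template under /var/lib/systemd/coredump/, so the kernel writes the core directly and systemd-coredump is never invoked, which would be consistent with `coredumpctl` listing nothing.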
devimcdavidgiluk, # systemctl --failed15:14
devimc  UNIT          LOAD   ACTIVE SUB    DESCRIPTION15:14
devimc● kdump.service loaded failed failed Crash recovery kernel arming15:14
devimcis this the issue?15:14
devimcdo I need kdump ?15:14
davidgiluknot unless you manage to nuke your host15:15
davidgilukdevimc: Which centos version?15:16
devimcdavidgiluk, 715:16
devimcthe infamous 715:16
davidgilukhmm that's a bit crusty15:17
davidgilukdevimc: I've not tested on 7 at all; I normally test on fedora and colleagues depend on it on rhel 815:19
kata-irc-bot<eric.ernst> “Infamous 7” :D15:30
kata-irc-bot<eric.ernst> Updated kernel at least devimc?15:30
devimc@eric.ernst I will try that15:31
*** dklyle has joined #kata-dev15:46
pmoresWhere can I find the latest kata-static nightly tarball for 2.0?16:18
*** devimc has quit IRC16:24
*** devimc has joined #kata-dev16:24
*** sgarzare has quit IRC17:08
*** dklyle has quit IRC17:53
*** jodh has quit IRC18:05
*** dklyle has joined #kata-dev18:13
*** sameo has quit IRC19:04
devimcdavidgiluk, around?19:15
devimcfinally got a core19:16
devimc1.6 G19:16
*** sameo has joined #kata-dev19:27
davidgilukdevimc: Not bad!19:34
davidgilukdevimc: Now, can you get a backtrace?19:35
devimcdavidgiluk, https://paste.centos.org/view/1b1cb8c519:35
devimcdavidgiluk, I can compress it and send you19:36
devimcdavidgiluk, 491K compressed19:36
davidgilukthat's a surprising thing to fail isn't it19:37
devimcdavidgiluk, yeah, a segfault is too bad19:37
devimcdavidgiluk, this is the frame 119:37
devimchttps://paste.centos.org/view/807724f319:37
davidgilukdevimc: can you get the output of    t a a bt full19:38
devimcdavidgiluk, https://paste.centos.org/view/1649c93a19:40
davidgilukdevimc: OK, that makes a lot more sense19:40
devimcdavidgiluk, https://paste.centos.org/view/e541129e19:41
devimcthis looks better19:41
devimcwithout pager19:41
davidgilukdevimc: Is this happening just as the 2nd container in the pod starts up?19:42
devimcdavidgiluk, yes19:43
devimcincepthi2-fpnhp-deployment-6f749d84b4-x8zzm   1/2     CrashLoopBackOff   8          32m19:43
devimcdavidgiluk, the first container (workload/microservice) is ready19:43
davidgilukdevimc: OK, so I see what's happening - but I need to go and figure out whether we thought we had code in there already to stop it :-)19:43
davidgilukdevimc: What's happening is that in one thread you're processing a request to open a file, and then passing the result back on the virtqueue - boring normal19:43
davidgilukdevimc: On the other thread, the hypervisor has just told you that someone has hotplugged some memory and your memory map has changed19:44
davidgilukdevimc: that other thread is just in the process of remapping all the queues at the same time as the 1st thread is writing into it19:44
devimcouch!19:45
devimcdavidgiluk, it's good to know what's happening :)19:46
davidgilukdevimc: I suspect it's more likely to happen with CloudHypervisor because there was a new call (that I don't think CH got yet) that means 'here's an extra mapping' rather than 'hey here's all your mappings'19:46
devimcand that you know how to fix it :)19:46
devimcdavidgiluk, you are right, this is easier to reproduce with CLH19:47
davidgilukdevimc: Yeh, so I thought we had a lock somewhere that we took when we were nuking the queues; but perhaps I'm remembering thinking that we should have a lock....19:48
davidgilukdevimc: I'm going to create a gitlab issue for it for now and then I need to look at our locking19:51
devimcdavidgiluk, thanks19:52
davidgilukdevimc: https://gitlab.com/virtio-fs/qemu/-/issues/2519:52
devimcdavidgiluk, cool19:53
devimcthanks again19:53
davidgilukdevimc: the setmemtable call is a *horrible* interface; it tells you all of your mappings;  when you hotplug you get a new setmemtable call, but no indication of whether there's a correspondence between old/new19:54
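[Editor's note] The race described above (one thread writing a reply into mapped queue memory while another thread replaces every mapping after a hotplug) is the kind of thing a shared lock guards against. The following is a toy illustration only, not virtiofsd code: the class `QueueMappings`, its methods, and the use of `bytearray` as stand-in memory are all invented for the sketch.

```python
import threading


class QueueMappings:
    """Toy stand-in for a table of memory regions shared by two threads."""

    def __init__(self, regions):
        self._lock = threading.Lock()
        self._regions = dict(regions)

    def replace_all(self, regions):
        """Swap in a whole new table, as a 'here are all your mappings' message would."""
        with self._lock:
            self._regions = dict(regions)

    def write(self, region, data):
        """Write into a region; return False if it is gone or too small."""
        with self._lock:
            buf = self._regions.get(region)
            if buf is None or len(data) > len(buf):
                return False
            buf[:len(data)] = data
            return True

    def snapshot(self, region):
        """Return a copy of a region's bytes, or None if it is not mapped."""
        with self._lock:
            buf = self._regions.get(region)
            return bytes(buf) if buf is not None else None
```

Because `write` and `replace_all` take the same lock, a writer either sees the old table or the new one, never a half-replaced one; without the lock, the writer could touch a region that the other thread is in the middle of tearing down.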
*** davidgiluk has quit IRC20:23
kata-irc-bot<fidencio> @jose.carlos.venegas.m, @salvador.fuentes, have you ever seen those errors before? ```"error opening storage: /dev/sdb is already part of a volume group \"storage\": must remove this device from any volume group or provide a different device"```20:30
kata-irc-bot<fidencio> This is happening with CRI-O CI20:30
kata-irc-bot<fidencio> http://jenkins.katacontainers.io/job/kata-containers-crio-PR/16570/console20:31
kata-irc-bot<fidencio> Looks like, for some reason, we're not deleting the configuration of the storage20:33
*** th0din has quit IRC21:01
*** th0din has joined #kata-dev21:05
kata-irc-bot<jose.carlos.venegas.m> I think I have seen that in the past21:11
kata-irc-bot<fidencio> We noticed this now that we switched the CRI-O CI to use Ubuntu 18.0421:12
kata-irc-bot<jose.carlos.venegas.m> @salvador.fuentes isn't that an attached disk in the VM?21:12
kata-irc-bot<fidencio> It is!21:12
kata-irc-bot<jose.carlos.venegas.m> @fidencio oh before was fedora? or ... centos?21:13
kata-irc-bot<fidencio> ```export LVM_DEVICE=/dev/sdb```21:13
kata-irc-bot<jose.carlos.venegas.m> then it may be the VM config ?21:13
kata-irc-bot<fidencio> It was Ubuntu 16.0421:13
kata-irc-bot<fidencio> I just updated the image from 16.04 to 18.0421:14
kata-irc-bot<jose.carlos.venegas.m> @fidencio let me ping @salvador.fuentes or @gabriela.cervantes.te21:14
kata-irc-bot<gabriela.cervantes.te> no I have not seen that error before21:18
kata-irc-bot<jose.carlos.venegas.m> @fidencio see in http://jenkins.katacontainers.io/configureClouds/21:21
kata-irc-bot<fidencio> But that was a good guess. :slightly_smiling_face:21:26
kata-irc-bot<eric.ernst> devimc and davidgiluk - nice work!21:28
kata-irc-bot<fidencio> I will keep debugging this tomorrow, thanks Carlos, Chava, and Gaby!21:40
kata-irc-bot<jose.carlos.venegas.m> @fidencio chat with you tomorrow!21:41
*** devimc has quit IRC22:02
*** sameo has quit IRC22:05
*** crobinso has quit IRC22:11
*** pmores has quit IRC22:44
