jeblair | Sep 10 23:51:03 ze01 kernel: [2079277.750440] ansible-playboo[14535]: segfault at a9 ip 000000000050eaa4 sp 00007ffcc0c3ffc8 error 4 in python3.5[400000+3a7000] | 00:00 |
---|---|---|
pabelanger | nice | 00:01 |
jeblair | 2017-09-10 23:51:03,890 DEBUG zuul.AnsibleJob: [build: 57e588b84fd145d99b4049d974d270fe] Ansible output: b'ERROR! A worker was found in a dead state' | 00:01 |
jeblair | -rw------- 1 zuul zuul 94801920 Sep 10 23:51 /var/lib/zuul/builds/57e588b84fd145d99b4049d974d270fe/work/core | 00:02 |
pabelanger | yay | 00:02 |
jeblair | http://paste.ubuntu.com/25511193/ | 00:05 |
jeblair | #0 0x000000000050eaa4 in visit_decref () at ../Modules/gcmodule.c:373 | 00:06 |
jeblair | that's the same location as before | 00:06 |
pabelanger | ya, I'm wondering if something isn't set up properly | 00:06 |
jeblair | the stack above it is different; this time it's the powershell module that triggered it, not the ssh connection plugin | 00:09 |
jeblair | pabelanger: like what? | 00:10 |
pabelanger | jeblair: on friday I did run apt-get upgrade on zuul-executors because it said a new version of PPA for python3.5 needed to be installed. This was on ze02.o.o, but I also ran it on ze01.o.o, since apt was also complaining. However, we should have already been running PPA | 00:12 |
ianw | i think my irc is messed up, it looks like jeblair just said "powershell" was causing zuul segfaults :) | 00:12 |
pabelanger | so I wonder if upgrading to the PPA version caused an issue | 00:12 |
jeblair | we've had 31 segfaults on ze01 today, compared to 13 yesterday, 4 the day before, 10, 3,3, 8 before that | 00:12 |
pabelanger | but, cannot think why | 00:12 |
jeblair | ianw: ansible-playbook segfaults :) | 00:13 |
pabelanger | jeblair: how did we install the patched version of python a few weeks ago when you ran your series of rechecks | 00:13 |
jeblair | pabelanger: from the ppa that mordred created | 00:13 |
jeblair | cat /etc/apt/sources.list.d/openstack-ci-core-ubuntu-python-bpo-27945-backport-xenial.list | 00:14 |
jeblair | deb http://ppa.launchpad.net/openstack-ci-core/python-bpo-27945-backport/ubuntu xenial main | 00:14 |
* tobiash finally arrived at the hotel in denver | 00:15 | |
jeblair | pabelanger: i don't see anything in dpkg.log after sept 6 | 00:16 |
pabelanger | jeblair: ya, that would have been the day I did apt-get upgrade | 00:16 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Allow overriding the workspace directory in prepare-workspace https://review.openstack.org/500466 | 00:17 |
pabelanger | so wednesday | 00:17 |
jeblair | pabelanger: so you installed a new version of our patched python from the ppa. why was there a new version in our ppa? | 00:18 |
pabelanger | jeblair: right, I don't know. So, that makes me think we didn't have the ppa version installed. Because nowhere in puppet-zuul do we have ensure => latest for python3.5 | 00:19 |
pabelanger | So, I wonder if we somehow uninstalled the fix | 00:19 |
pabelanger | but, that would mean our PPA doesn't include the actual fix | 00:19 |
pabelanger | I'm also unsure why our PPA diff is largely different from xenial-proposed diff | 00:21 |
jeblair | pabelanger: the version on the system matches what's in the ppa 3.5.2-2ubuntu0~16.04.2~openstackinfra | 00:21 |
pabelanger | our ppa: https://launchpadlibrarian.net/333758013/python3.5_3.5.1-10_3.5.2-2ubuntu0~16.04.2~openstackinfra.diff.gz | 00:21 |
pabelanger | xenial-proposed: http://launchpadlibrarian.net/336180849/python3.5_3.5.2-2~16.04_3.5.2-2ubuntu0~16.04.2.diff.gz | 00:21 |
pabelanger | jeblair: have we discussed moving ansible-playbook back to the python2 interpreter? Or do we want to ensure everything is python3 for zuulv3 | 00:31 |
jeblair | pabelanger: that's not entirely straightforward. i don't think we can release a piece of software that runs partly under python2 and partly under python3 and expect anyone to take it seriously. so either we make python3 work, or *everything* goes back to python2, including zuul (which includes web streaming). | 00:37 |
jeblair | or rust | 00:37 |
jeblair | i could live with zuul in rust running ansible in python2 :) | 00:38 |
pabelanger | jeblair: sure, I just got the feeling from ansiblefest, python3 support and ansible was best effort right now. When I talked about the issue of dead worker, ansible testing hasn't seen it yet. So, we are likely in new territory with python3 | 00:39 |
jeblair | but even as a temporary measure, we'd have to find a way to invoke ansible under python2 from a zuul installation running in python3 | 00:39 |
pabelanger | ya | 00:40 |
jeblair | pabelanger: nope, ansible core is py3 ready. ansible modules are best effort. | 00:40 |
jeblair | pabelanger: this is not an ansible bug, this is a python bug. | 00:40 |
SpamapS | how's the hacking going folks? | 00:40 |
* SpamapS reading.. | 00:40 | |
SpamapS | more python problems? | 00:41 |
pabelanger | jeblair: ya, so I'm kinda surprised we are able to segfault python3.5 at all | 00:41 |
jeblair | SpamapS: yeah, segfaults in the same location | 00:41 |
SpamapS | jeblair: even w/ fix? :( | 00:41 |
SpamapS | jeblair: so maybe we found a separate class of segfault there | 00:41 |
jeblair | apparently so | 00:41 |
SpamapS | thar be dragons | 00:41 |
SpamapS | jeblair: maybe we could use jython | 00:42 |
* SpamapS waits for beers to be cleaned off screens | 00:42 | |
jeblair | i pastebinned the gdb backtrace above | 00:42 |
jeblair | i need to afk for a bit | 00:42 |
pabelanger | same, I'm about to step away for social but will return once finished | 00:43 |
SpamapS | kk.. I'm not super great at python debugging but I'll peek and see if it compares to the supposedly fixed bug | 00:43 |
ianw | very suspicious that it's in this same frozen import magic area as before -> http://paste.openstack.org/show/620835/ | 00:56 |
ianw | i mean, of all the dictionary creation magic that happens, it hits it around this area ... | 00:58 |
*** jkilpatr has joined #zuul | 01:27 | |
*** jkilpatr has quit IRC | 01:34 | |
SpamapS | also why powershell? | 03:49 |
SpamapS | I'm sure it's the mechanism importing, and not powershell itself.. but... why? | 03:49 |
ianw | SpamapS: http://paste.openstack.org/show/620839/ | 03:55 |
ianw | note that does *not* replicate a segfault ... just my attempts at it | 03:56 |
ianw | it's definitely coming around from when builtins.__build_class__ builds a ShellModule class | 03:56 |
SpamapS | whoa | 03:57 |
SpamapS | is that a dictionary key'd by a subclassed multiprocessing.Process? | 03:57 |
ianw | i was just trying to pass something back and forward and keep a global dict there | 03:57 |
SpamapS | kk, just trying to wrap head around it | 03:57 |
SpamapS | So the updated Python definitely has "fixed" the original reproducer effects. But it's entirely possible there are off-by-one's left that are triggered by deeper usage. | 03:58 |
ianw | just reading those ... i'm not sure how similar they are | 03:58 |
ianw | they have to do with basically modifying a dict while inside it | 03:58 |
ianw | there is sooooooo much going on around all this though ... that may be happening | 03:59 |
ianw | __build_class__ is just deep magic ... all i can find about it is https://mail.python.org/pipermail/python-3000/2007-March/006338.html | 04:00 |
SpamapS | IIRC the ansible plugin interface makes use of imp() | 04:02 |
SpamapS | which means Python devs throw their hands up and say NOPE | 04:02 |
ianw | SpamapS: yep, i was walking "upwards" from http://paste.openstack.org/show/620835/ to see where we might call in | 04:03 |
ianw | basically it multi-process forks() and then "connection = self._shared_loader_obj.connection_loader.get" | 04:04 |
ianw | but really, it's the alloc from line #54 in http://paste.ubuntu.com/25511193/ that's triggered a gc, and then presumably found some bad object | 04:07 |
SpamapS | I'm still going back to my old twitch in the back of my head when better python devs than I told me never ever fork without exec in python. | 04:09 |
ianw | hmm, i can walk the garbage collection ... i wonder if we can leverage that to figure out what it was touching ... | 04:36 |
ianw | jeblair / SpamapS : I started a story on this -> https://storyboard.openstack.org/#!/story/2001186 | 05:58 |
ianw | in short, it seems extremely suspicious that the object that seems to cause this problem is a code-generation object for https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/ansible/callback/zuul_stream.py?h=feature/zuulv3#n39 | 05:59 |
*** hashar has joined #zuul | 07:45 | |
*** xinliang has quit IRC | 08:25 | |
*** xinliang has joined #zuul | 08:25 | |
*** xinliang has quit IRC | 08:25 | |
*** xinliang has joined #zuul | 08:25 | |
*** electrofelix has joined #zuul | 09:04 | |
electrofelix | To reply to the email thread I started about the UI, is there a good place to host screenshots to help with the discussion? Would storyboard be an option? Just figured that attachments would get stripped | 09:20 |
*** jkilpatr has joined #zuul | 11:04 | |
*** jkilpatr has quit IRC | 11:11 | |
*** jkilpatr has joined #zuul | 11:23 | |
*** yolanda has joined #zuul | 11:50 | |
*** dkranz has joined #zuul | 12:27 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: WIP: web: add /{tenant}/status controller https://review.openstack.org/502453 | 12:30 |
*** fbo_ has joined #zuul | 12:49 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Decode stdout from ansible-playbook before logging https://review.openstack.org/502362 | 13:35 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Emit a message to the job log if ansible crashes https://review.openstack.org/502468 | 13:35 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Emit a message to the job log if ansible crashes https://review.openstack.org/502468 | 13:36 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Decode stdout from ansible-playbook before logging https://review.openstack.org/502362 | 13:36 |
fungi | electrofelix: closest thing we had was pholio.openstack.org but after the ux team dissolved and nobody had ever used the service we decommissioned it | 14:23 |
fungi | electrofelix: i guess gerrit can host images if you push them in throwaway changes, though that's likely suboptimal | 14:25 |
fungi | (it will display images in the diff view) | 14:25 |
clarkb | its also probably not great long term for a repo you care about | 14:26 |
clarkb | (could make a throwaway image repo though) | 14:26 |
electrofelix | fungi clarkb: would using https://snag.gy/ or https://screencloud.net/ be considered ok? 6 month lifespan on snag.gy | 14:54 |
tristanC | mordred: hum, should the public keys be exposed over gearman by the scheduler so that we could move the /keys endpoint to zuul-web too? | 15:05 |
jeblair | tristanC: sounds reasonable | 15:06 |
fungi | electrofelix: seems as good as anything. i usually just push stuff up to a temporary directory on a personal website | 15:09 |
tristanC | jeblair: would you mind checking this gearman worker implementation in the scheduler: https://review.openstack.org/#/c/502453/1/zuul/scheduler.py | 15:11 |
*** yolanda has quit IRC | 15:11 | |
* tristanC working on moving the webapp.py to zuul-web | 15:12 | |
*** yolanda has joined #zuul | 15:16 | |
*** yolanda has quit IRC | 15:16 | |
*** yolanda has joined #zuul | 15:16 | |
mordred | tristanC: ++ to keys | 15:18 |
Shrews | fyi, putting https://review.openstack.org/502137 through to fix websocket tests. survived several rechecks. someone ping me if they discover another ipv6 test failure. | 15:29 |
Shrews | there is still a race in that test. trying to hunt that down now | 15:30 |
jeblair | https://etherpad.openstack.org/p/zuulv3-ptg | 15:51 |
jeblair | that's the list of tasks we discussed | 15:51 |
tristanC | ianw: jeblair: what's the exact python version that segfaults? | 15:52 |
*** dmellado has joined #zuul | 15:54 | |
*** yolanda has quit IRC | 15:55 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Support IPv6 in the finger log streamer https://review.openstack.org/502137 | 15:55 |
tristanC | jeblair: would it be http://ppa.launchpad.net/openstack-ci-core/python-bpo-27945-backport/ubuntu/pool/main/p/python3.5/ ? | 15:58 |
jeblair | tristanC: that's the one | 15:59 |
jeblair | tristanC: that was the current version of py35 in ubuntu with the patch in python bug 27945 applied | 15:59 |
openstack | bug 33002 in gnome-session (Ubuntu) "duplicate for #27945 logout dialog UI objections" [Wishlist,Fix released] https://launchpad.net/bugs/33002 | 15:59 |
jeblair | that's not the bug :) | 15:59 |
jeblair | https://bugs.python.org/issue27945 | 15:59 |
jeblair | that bug | 15:59 |
Shrews | pabelanger: another thing to note, even though a node may be READY, it could already be allocated to another request (the allocated_to field in ZK) | 16:03 |
Shrews | just something to check if you see it again | 16:03 |
Shrews | also tobiash ^^^ | 16:04 |
SpamapS | https://bugs.launchpad.net/ubuntu/zesty/+source/python3.5/+bug/1711724 | 16:04 |
openstack | Launchpad bug 1711724 in python3.6 (Ubuntu Zesty) "Segfaults with dict" [High,Fix committed] - Assigned to Clint Byrum (clint-fewbar) | 16:04 |
SpamapS | is the Ubuntu bug | 16:04 |
pabelanger | Shrews: ya, next time I see it, I'll try to get into zk-shell | 16:05 |
SpamapS | jeblair: ianw makes a fair point about zuul_stream. I wonder if we can somehow disable streaming and see if the problem goes away. it would at least give us an idea of where to go to work around the issue. | 16:05 |
pabelanger | Shrews: or maybe add info to --detail output? | 16:05 |
jeblair | SpamapS: mordred is working on a patch to make linesplit not a generator in an attempt to avoid triggering the bug | 16:06 |
jeblair | meanwhile tristanC is looking into the bug itself | 16:07 |
Shrews | pabelanger: yeah, if it's not already there | 16:07 |
jeblair | SpamapS: (the hope being that we can eventually squash the bug, but in the mean time, maybe avoid triggering it while we try to do v3 cutover) | 16:07 |
fungi | mordred: are your topic:zuul-v3-migration changes intended to be topic:zuulv3 instead? (i just stumbled across them via a depends-on) | 16:11 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 16:11 |
mordred | fungi: yes they are - whoops | 16:12 |
fungi | mordred: no worries, i'll reset them | 16:12 |
jeblair | remote: https://review.openstack.org/502273 Add log processing roles | 16:12 |
*** yolanda has joined #zuul | 16:13 | |
jeblair | mordred, clarkb, dmsimard: ^ updated to address comments, plus i added a 'when' line for subunit (because we only want those in certain pipelines) | 16:13 |
*** haypo has joined #zuul | 16:13 | |
fungi | if this is a help to anyone, my zuulv3 review dashboard query string in gertty: "is:open AND (project:openstack-infra/openstack-zuul-jobs OR project:openstack-infra/openstack-zuul-roles OR project:openstack-infra/zuul-jobs OR (project:^openstack-infra/.* AND topic:zuulv3)) AND NOT label:Workflow=-1" | 16:14 |
haypo | hi. tristanC pointed me to https://storyboard.openstack.org/#!/story/2001186 -- you are using Python 3.5.2 which has https://bugs.python.org/issue26617 bug - i don't know at this point if your crash is related to this bug or not? | 16:14 |
jeblair | fungi: ++ that's pretty close to mine too :) | 16:14 |
fungi | 27 changes currently matching | 16:14 |
haypo | the best would be to have a simple reproducer and try python 3.5.4 | 16:14 |
tristanC | haypo: thanks for stepping in! | 16:14 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Add node.allocated_to to node detail output https://review.openstack.org/502495 | 16:16 |
Shrews | pabelanger: ^^^ | 16:16 |
jeblair | haypo, tristanC: so we should check whether https://hg.python.org/cpython/rev/c9b7272e2553 is in our version? | 16:16 |
pabelanger | Shrews: ty! | 16:16 |
haypo | jeblair: it's not in 3.5.2 : https://docs.python.org/3.5/whatsnew/changelog.html#changelog | 16:21 |
haypo | jeblair: it was fixed in 3.5.3 | 16:21 |
haypo | jeblair: the question is if it's the same bug or not | 16:21 |
mordred | jeblair: +2 from me | 16:21 |
jeblair | haypo: yes, and it doesn't look like the fix was backported in ubuntu | 16:21 |
haypo | jeblair: but i'm almost 100% that your python3 (3.5.2) has the https://bugs.python.org/issue26617 bug | 16:22 |
haypo | jeblair: ubuntu, haha | 16:22 |
haypo | jeblair: i sent a ping every 3 months to backport a fix, two people validated a test package which contained the fix | 16:22 |
haypo | jeblair: 1 year 1/2 later, the bug was still not fixed | 16:22 |
jeblair | haypo: we can probably apply that fix to a local package | 16:22 |
jeblair | haypo: oy :( | 16:22 |
haypo | jeblair: it was a critical segfault in... the garbage collector, again ;) | 16:22 |
haypo | jeblair: https://bugs.launchpad.net/ubuntu/+source/python3.4/+bug/1367907 opened since 2014-09-10, still not fixed | 16:23 |
openstack | Launchpad bug 1367907 in python3.4 (Ubuntu Trusty) "Segfault in gc with cyclic trash" [Undecided,In progress] | 16:23 |
haypo | jeblair: straightforward fix, with a very short script to reproduce the crash, it's not possible to work around the crash, and *easy* to get it using asyncio... | 16:23 |
haypo | jeblair: i never understood why ubuntu doesn't care about this bug | 16:24 |
mordred | jeblair, haypo that patch does not seem to be in our version - I'll prepare a version that has it | 16:24 |
jeblair | clarkb opened that bug 3 years and one day ago. | 16:24 |
openstack | bug 3 in Launchpad itself "Custom information for each translation team" [Low,Fix released] https://launchpad.net/bugs/3 | 16:24 |
haypo | i don't understand either why Linux distros don't upgrade to the latest 3.5.x instead of cherry-picking | 16:24 |
jeblair | anything would be better than this | 16:24 |
mordred | haypo: because insanity | 16:24 |
haypo | mordred: while i'm not sure that https://bugs.python.org/issue26617 is your bug, it shouldn't hurt to get the fix ;) | 16:25 |
haypo | mordred: "because insanity" ? can you elaborate? | 16:25 |
haypo | i should ask my colleagues who maintain python for RHEL, Fedora, Centos and SCL | 16:25 |
mordred | haypo: the reason they don't just upgrade to 3.5.x ... I was just being snarky | 16:25 |
mordred | haypo: the reason is that the policy for stable releases of distros is to not upgrade versions of software they have | 16:26 |
haypo | mordred: for python, we (python core developers) try to make sure that we don't change the behaviour in stable versions | 16:27 |
haypo | sorry i have to go, please ping me back if you are still stuck! | 16:30 |
jeblair | haypo: thanks! | 16:31 |
jeblair | remote: https://review.openstack.org/502499 Add log processing jobs to base-test | 16:33 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: DNM: tracking down test race https://review.openstack.org/502500 | 16:34 |
jeblair | fungi: https://review.openstack.org/502273 is good reading | 16:48 |
fungi | thanks | 16:48 |
fungi | was having trouble tracking that down for some reason | 16:48 |
fungi | aha, git grep and codesearch won't do string matches on file paths ;) | 16:49 |
jeblair | fungi: yeah... i wonder if we should put the names of our roles in our roles... :) | 16:52 |
*** harlowja has joined #zuul | 16:54 | |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Add node.allocated_to to node detail output https://review.openstack.org/502495 | 16:55 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 17:04 |
*** jkilpatr_ has joined #zuul | 17:12 | |
openstackgerrit | Merged openstack-infra/nodepool master: Add SSH Host Key Verifier Strategy https://review.openstack.org/483485 | 17:12 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix node list output https://review.openstack.org/502504 | 17:13 |
*** hashar is now known as hasharAway | 17:13 | |
*** jkilpatr has quit IRC | 17:14 | |
Shrews | pabelanger: care to +3 that? ^^^ | 17:14 |
fungi | Shrews: lgtm, +3 | 17:16 |
Shrews | fungi: thx | 17:16 |
tobiash | pabelanger, Shrews: now I found the (lengthy) discussion about the nodepool issues we discussed (node_failure when running into openstack quota and also the 'build instead of directly allocate' issue) | 17:16 |
tobiash | pabelanger, Shrews: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2017-07-05.log.html#t2017-07-05T20:18:49 | 17:16 |
pabelanger | tobiash: thanks, I'll read shortly | 17:20 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix node list output https://review.openstack.org/502504 | 17:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul-jobs master: DNM: test base-tests logstash stuff https://review.openstack.org/502515 | 17:28 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: web: add /{tenant}/status controller https://review.openstack.org/502453 | 17:37 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 17:40 |
mordred | dmsimard: ^^ there is a change to the functional test in that patch that I do not understand why it's needed :( | 17:43 |
electrofelix | I've noticed some problems with trying to run the zuul tox tests within Jenkins: because there is no tty, the tests are significantly slower. I can replicate this by running without a tty allocated | 17:44 |
electrofelix | has anyone seen that before and any suggestions on how to avoid the slowdown (keeps causing the tests to timeout) | 17:44 |
mordred | electrofelix: I have not seen that issue - but we haven't run anything in a jenkins in a while | 17:45 |
electrofelix | mordred: based on tests I can reproduce this whenever I disable a tty allocation to run the tests within a docker container and I get the same slowdown | 17:54 |
*** bhavik1 has joined #zuul | 17:55 | |
SpamapS | electrofelix: that's worth diagnosing. I'd guess it has something to do with forking and running ansible | 17:59 |
openstackgerrit | Merged openstack-infra/nodepool feature/zuulv3: Fix node list output https://review.openstack.org/502504 | 18:01 |
*** bhavik1 has quit IRC | 18:07 | |
*** yolanda has quit IRC | 18:07 | |
*** jkilpatr_ has quit IRC | 18:15 | |
*** jkilpatr has joined #zuul | 18:16 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: web: add /{tenant}/status route https://review.openstack.org/502453 | 18:38 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: web: add /{source}/{project}.pem route https://review.openstack.org/502530 | 18:38 |
clarkb | mordred: your devstack things appear to be failing due to missing passwords set in localrc | 18:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Override tox requirments with zuul git repos https://review.openstack.org/489719 | 18:55 |
*** xinliang has quit IRC | 18:57 | |
mordred | Shrews: so - it's about building on top of: https://review.openstack.org/#/c/491805/ | 19:01 |
SpamapS | anything I can do to help btw? | 19:02 |
* SpamapS may be able to review or code while y'all are socializing | 19:02 | |
clarkb | SpamapS: yes actually | 19:05 |
clarkb | SpamapS: there is possibility we need another SRU for python3 | 19:06 |
clarkb | SpamapS: tldr is the python3 gc problem in python3.4 on trusty that took forever to merge? well it affects python3.5 in xenial now too because ya | 19:06 |
SpamapS | OHHHH | 19:06 |
SpamapS | seriously? | 19:06 |
clarkb | SpamapS: we are currently checking if mordred's current ppa package for python3.5 with circular gc fix fixes zuul | 19:06 |
clarkb | SpamapS: so shortly we should have a good idea if that is the culprit | 19:06 |
SpamapS | That's wild.. yeah we can get it moving faster this time. | 19:07 |
SpamapS | especially if it has a reproducer | 19:07 |
clarkb | SpamapS: ya I think that bug lived so long that ubuntu managed to grab xenial 3.5 python without noticing that it too needed patching | 19:07 |
clarkb | when the bug was filed we did not have a xenial r python3.4 to worry about | 19:07 |
clarkb | but then things happened and ya | 19:07 |
clarkb | I think upstream python may have a reproducer for that bug too | 19:07 |
clarkb | SpamapS: https://bugs.launchpad.net/ubuntu/+source/python3.4/+bug/1367907 has links to upstream bug too | 19:08 |
openstack | Launchpad bug 1367907 in python3.4 (Ubuntu Trusty) "Segfault in gc with cyclic trash" [Undecided,In progress] | 19:08 |
mordred | Shrews: skip-if:all-files-match-any -> irrelevant-files - if the pattern matches job.old and project matches | 19:09 |
*** xinliang has joined #zuul | 19:10 | |
*** xinliang has quit IRC | 19:10 | |
*** xinliang has joined #zuul | 19:10 | |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 19:27 |
*** yolanda has joined #zuul | 19:28 | |
clarkb | SpamapS: looking like we still segfault though | 19:29 |
clarkb | so maybe more investigation needed | 19:29 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 19:40 |
*** yolanda has quit IRC | 19:43 | |
*** yolanda_ has joined #zuul | 19:49 | |
*** yolanda_ is now known as yolanda | 19:49 | |
*** jkilpatr has quit IRC | 19:56 | |
*** kmalloc_ has joined #zuul | 20:16 | |
*** ianw has quit IRC | 20:18 | |
*** mattclay_ has joined #zuul | 20:18 | |
*** bstinson_ has joined #zuul | 20:22 | |
*** kmalloc has quit IRC | 20:24 | |
*** mattclay has quit IRC | 20:24 | |
*** bstinson has quit IRC | 20:24 | |
*** kmalloc_ is now known as kmalloc | 20:24 | |
*** mattclay_ is now known as mattclay | 20:24 | |
*** _ari_ has quit IRC | 20:28 | |
*** bstinson_ is now known as bstinson | 20:34 | |
*** _ari_ has joined #zuul | 20:38 | |
*** openstackgerrit has quit IRC | 20:48 | |
*** ianw has joined #zuul | 21:01 | |
*** dkranz has quit IRC | 21:07 | |
Shrews | pabelanger: i have solved our delay issue | 21:12 |
Shrews | or, figured out a possible reason, at least | 21:12 |
Shrews | in the zuul repo, we define 3 ubuntu-xenial nodes for a functional test. we have min-ready set to 2 for that label, so it's likely that it will always need to build a new node for that | 21:13 |
Shrews | jeblair: mordred: ^^^ | 21:14 |
Shrews | oh, but i also see two xenial nodes that have been ready and unallocated for a couple of hours. seems something is likely wonky there | 21:17 |
*** hasharAway has quit IRC | 21:18 | |
Shrews | both rax-iad | 21:18 |
Shrews | oh, 3 in rax-iad | 21:20 |
*** jkilpatr has joined #zuul | 21:25 | |
Shrews | i think this is intentional, due to our node locality logic | 21:26 |
Shrews | an existing ready node will only be used for a request if it is from the same provider handling the request, launched by the same pool handler, and in the same AZ | 21:28 |
Shrews | i think that logic, and our min-ready definition, are slightly at odds | 21:31 |
clarkb | could we short circuit request handling by attempting to fulfill from any ready nodes first | 21:34 |
clarkb | then forward the request to the launchers if unable to? | 21:34 |
Shrews | clarkb: we always attempt to fulfill from ready nodes first | 21:36 |
clarkb | but on a provider specific basis | 21:36 |
Shrews | the launchers are the ones handling the request (1 thread per pool), and we can't control which one handles the request | 21:36 |
clarkb | I'm saying bypass providers entirely at first and scan the ready nodes and hand out what is available or fallback to the provider | 21:37 |
Shrews | we would need a thing in between | 21:37 |
clarkb | ya | 21:37 |
Shrews | yeah, that's a whole new architecture | 21:37 |
clarkb | basically two layers of requests | 21:37 |
clarkb | one that is what zuul talks to and the other an implementation detail to boot if necessary | 21:37 |
Shrews | zuul only talks to ZK | 21:38 |
clarkb | by submitting a request | 21:38 |
Shrews | correct | 21:38 |
clarkb | that request would submit some inner level boot request if ready cant fulfill for some reason | 21:38 |
Shrews | and the pool threads consume that request list | 21:38 |
clarkb | (I understand this is not how it works today, just suggesting it may be a good way to fix it) | 21:39 |
Shrews | well, that's not a simple change, i would think :) | 21:39 |
Shrews | but yeah, that's certainly a way | 21:40 |
dmsimard | you guys still over in vail ? | 21:44 |
clarkb | dmsimard: I left to go to tc session on onboarding new contributors | 21:55 |
*** yolanda has quit IRC | 21:57 | |
tristanC | clarkb: is there a way to test the crm114 filters manually? I'm trying to understand if it's possible to use it as a post job... | 21:58 |
clarkb | tristanC: yes, you can run crm114 locally if you install it | 21:58 |
clarkb | it's been a long time since I did it but it's basically an interpreter for programs and you feed it stdin | 21:58 |
tristanC | what does it expect in the argument directory? | 21:58 |
clarkb | tristanC: I believe that is the location to store its learned state content | 22:00 |
clarkb | tristanC: so you can give it a path in /tmp and it writes to it | 22:00 |
tristanC | oh I see, then how does it differentiate success from failure logs? | 22:00 |
clarkb | https://review.openstack.org/#/c/502492/5 could use reviews as a segfault workaround | 22:04 |
tobiash | tristanC: you tell it as an argument | 22:05 |
clarkb | tobiash: tristanC correct you tell it based on the result of the job | 22:06 |
clarkb | then it uses that information to learn | 22:06 |
SpamapS | dmsimard: FYI, running 'sfconfig' for the first time right now. | 22:06 |
SpamapS | dmsimard: it looks like I'm getting a zuulv2 ... want 3 | 22:06 |
dmsimard | SpamapS: oi, exciting | 22:06 |
clarkb | so if job fails you tell it the logs belong to failed job and if the job passed you tell it it is from a successful job | 22:06 |
dmsimard | SpamapS: I'm sure tristanC can help | 22:07 |
dmsimard | he's a pro | 22:07 |
SpamapS | tristanC: I assume I need to change arch.yaml a bit | 22:07 |
SpamapS | maybe zuul3-* ? | 22:07 |
dmsimard | SpamapS: maybe v3 isn't installed by default | 22:07 |
dmsimard | I dunno | 22:07 |
dmsimard | the v3 that sf2.6 runs is effectively a snapshot, it doesn't follow the feature/zuulv3 branch closely so just be aware of that | 22:08 |
SpamapS | oh I'm also getting a gerrit that I don't want, oops | 22:08 |
SpamapS | dmsimard: I was figuring I'd learn how to circumvent that with git ASAP :) | 22:08 |
tristanC | SpamapS: gerrit is still needed because it's the default place to store the config repo, we may make it optional in further release | 22:08 |
dmsimard | tristanC: ah I guess with v3 we could host it in github | 22:09 |
tristanC | SpamapS: yes, you need to add zuul3-scheduler, zuul3-executor and nodepool3-launcher in the arch | 22:09 |
SpamapS | tristanC: Can't I just wrestle control of zuul config to put it somewhere else? | 22:09 |
tristanC | SpamapS: the trick is that the config-update job expects to clone that repository from the local gerrit | 22:10 |
SpamapS | I dunno what that means | 22:10 |
tristanC | SpamapS: when a change is merged in that config repo, e.g. adding a new project to the zuulv3 main.yaml, then the config-update will apply that change to the zuul configuration and reload the service | 22:11 |
dmsimard | SpamapS: software factory has a 'config' repo which is akin to upstream project-config where the job, zuul and nodepool config lives | 22:12 |
dmsimard | it self hosts that repository inside gerrit | 22:12 |
dmsimard | but could be made to self host it on github with v3 later | 22:12 |
dmsimard | or something else | 22:12 |
SpamapS | k | 22:13 |
SpamapS | so now I tried editing arch.yaml and it's telling me there's a conflict because I asked for zuul and zuul3 but I only have zuul3 stuff now | 22:14 |
SpamapS | :_P | 22:14 |
tristanC | clarkb: tobiash: alright it makes more sense, so basically "cat success.log | ./classify-log.crm data/ SUCCESS"; and then use the FAILED argument to process failed log | 22:14 |
*** harlowja has quit IRC | 22:14 | |
tristanC | SpamapS: you should use the master version, in 2.6 zuul3 needed a specific node | 22:14 |
tristanC | SpamapS: install softwarefactory-project.io/repos/sf-release-master.rpm and yum update -y | 22:15 |
SpamapS | wee | 22:15 |
clarkb | tristanC: yup | 22:15 |
pabelanger | SpamapS: Oh, awesome | 22:16 |
tristanC | then to run crm114 as a post-job it will need a way to retrieve previous success logs | 22:17 |
pabelanger | SpamapS: oops, should have been Shrews | 22:17 |
SpamapS | tristanC: thanks, it's happening now | 22:20 |
SpamapS | I have dueling Zuul installs here | 22:20 |
SpamapS | SF on right, BonnyCI on left | 22:20 |
SpamapS | but I think SF is going to win out | 22:20 |
SpamapS | since Bonny still doesn't know how to find python3 on CentOS | 22:20 |
SpamapS | tristanC: looks like I still need jenkins? | 22:21 |
* SpamapS removed it thinking it could be eliminated | 22:21 | |
dmsimard | SpamapS: we don't know how to find py3 on centos either, tell us when you find it | 22:21 |
* dmsimard hides | 22:21 | |
SpamapS | :) | 22:22 |
*** yolanda has joined #zuul | 22:22 | |
tristanC | SpamapS: hum, after installing the master version, you should probably use "sfconfig --upgrade" to make sure any configuration changes are updated | 22:22 |
* SpamapS checks under the rock marked "Disable SELinux" where he usually finds hard centos problem solutions. | 22:23 | |
SpamapS | tristanC: k, it says managesf needs the jenkins user | 22:23 |
SpamapS | "TASK [sf-gerrit : Add jenkins user (for zuul-launcher and managesf)] ********************************************************************************************************************************************************************************************************* | 22:23 |
SpamapS | so I assume since i still have 'managesf' that it's still needed. | 22:24 |
tristanC | SpamapS: yes you could remove jenkins, but when doing so you have to set "zuul_jenkins_credencials: False" in /etc/software-factory/custom-vars.yaml | 22:24 |
clarkb | mordred: jeblair rereading things haypo seemed pretty convinced the bug is the one that I filed with ubuntu 3 years ago | 22:24 |
clarkb | mordred: jeblair so looking at ze01 3.5.2-2ubuntu0~16.04.3 is what we have installed. Does that contain mordred's fix? | 22:24 |
tristanC | SpamapS: that's because zuul-launcher uses jenkins_rsa key and "Jenkins CI" user in gerrit, but the service can be removed | 22:25 |
clarkb | I'm testing with the simple reproducer | 22:25 |
tristanC | SpamapS: well there is work in progress to run sf with selinux enforcing + new policies for zuul, this is type enforcement file I'm working on for zuul's domain: paste.openstack.org/show/620869/ | 22:26 |
SpamapS | tristanC: ok ... I'll deal with having a useless jenkins for now :) | 22:27 |
clarkb | 3.5.2-2ubuntu0~16.04.2~openstackinfra is what is installed on ze03 fwiw; note that the ~openstackinfra is gone on ze01. But neither fails on the crash.py script in the original bug | 22:27 |
SpamapS | hrm | 22:30 |
SpamapS | "+ ssh gerrit gerrit create-account jenkins -g '\"Non-Interactive' 'Users\"' --email jenkins@factory0.cloud.phx3.gdg --full-name '\"Jenkins' 'CI\"' --ssh-key -", "fatal: invalid email address"], | 22:30 |
haypo | clarkb: which bug? (url?) | 22:30 |
clarkb | haypo: the one you looked at earlier on storyboard and pointed to https://bugs.python.org/issue26617 | 22:31 |
tristanC | SpamapS: arg, that may be a gerrit thing where it's a bit sensitive with valid email address | 22:31 |
clarkb | haypo: that has a crash.py script unsure if expected to crash 100% though | 22:31 |
SpamapS | tristanC: doesn't like my pretend tld? ;-) | 22:31 |
SpamapS | tristanC: I'm trying your "zuul_jenkins_credentials: false" suggestion | 22:32 |
clarkb | haypo: mordred applied the patch from that to ubuntu python3.5 and we are still getting segfaults, though I guess no one has looked at a core dump from current segfaults and they could be unrelated? | 22:32 |
clarkb | jeblair: mordred ^ I've got to do docs thing now on docs retention | 22:32 |
haypo | clarkb: ah https://bugs.python.org/issue26617 -- well, use Python 3.5.4 :-) | 22:32 |
dmsimard | SpamapS: you're not at the ptg, are you ? | 22:32 |
clarkb | haypo: well I'm saying the reproducer doesn't reproduce | 22:32 |
haypo | clarkb: 3.5.4 fixes also a security issue ;-) | 22:32 |
clarkb | haypo: oh nice | 22:32 |
clarkb | mordred: jeblair ^ maybe we just need to build a 3.5.4? | 22:33 |
clarkb | but I doubt that gets SRU'd | 22:33 |
tristanC | SpamapS: it seems like your fqdn needs to match the "TLD by Apache Commons Validator v1.4.1", so probably use a more common tld indeed | 22:33 |
SpamapS | tristanC: fun | 22:33 |
haypo | clarkb: the test is a little bit different than the attached reproducer, https://github.com/python/cpython/pull/2695/files | 22:33 |
tristanC | SpamapS: that's a good reason to work on making it optional :-) | 22:34 |
SpamapS | my fqdn == local 'hostname -f' output .. so I'm wondering what it thinks is invalid | 22:34 |
haypo | clarkb: maybe you are testing a python3 which contains the fix? i didn't follow the discussion. it's getting late here, have to go ;-) | 22:34 |
clarkb | haypo: ya I tried to test the one with and one without. But I'll look at the one in the test and see if that behaves differently when I get a chance | 22:34 |
haypo | clarkb: someone said that he/she will include the fix in your python3 package | 22:35 |
clarkb | ya mordred did that, but its only on a subset of our nodes so I should be able to a/b test it | 22:35 |
haypo | clarkb: ok | 22:36 |
SpamapS | tristanC: looks like zuul_jenkins_credentials is not used to decide whether or not to do that part. | 22:36 |
clarkb | actually 26617 looks newer than the one I filed /me looks more closely | 22:36 |
clarkb | ya 21435 was my thing ok | 22:37 |
clarkb | (now I am slightly less confused on the situation) | 22:37 |
SpamapS | tristanC: also, credentials is the english spelling. Not sure what the standard is in software factory. ;) | 22:37 |
haypo | clarkb: https://bugs.python.org/issue21435 is https://github.com/python/cpython/commit/5fbc7b12f776109678dc34fdb49b420750a3e5ff and was fixed in Python 3.4.1 and 3.5.0 | 22:39 |
haypo | clarkb: we are talking about python 3.5.2 | 22:39 |
clarkb | haypo: thank you for clarifying that. Saw the link to my 3 year old launchpad bug and got confused | 22:39 |
clarkb | so that one should definitely be fixed in python3.5 on ubuntu | 22:40 |
clarkb | question is whether or not that other, 26617, is | 22:40 |
clarkb | haypo: was your reproducer 100% deterministic? | 22:40 |
haypo | clarkb: i don't recall, sorry. i don't know if you have the same bug. it can also be a bug in any third party C extensions | 22:40 |
haypo | clarkb: PYTHONMALLOC=debug of Python 3.6 may help here, if you get access to python 3.6 | 22:41 |
SpamapS | that's an option btw | 22:44 |
SpamapS | we could just try python3.6 | 22:44 |
SpamapS | backported from artful | 22:44 |
clarkb | if only to confirm bug is gone | 22:45 |
SpamapS | right | 22:45 |
clarkb | may not be terrible idea | 22:45 |
SpamapS | if you have a decently small reproducer you could even bisect | 22:45 |
SpamapS | python will build relatively quickly without a 'make clean' in between | 22:46 |
dmsimard | SpamapS: credencials is a typo and should be fixed ;) | 22:46 |
*** harlowja has joined #zuul | 22:50 | |
SpamapS | this is pretty annoying now I got it to stop making jenkins gerrit users | 22:52 |
SpamapS | but it still needs a zuul gerrit user | 22:52 |
*** harlowja has quit IRC | 23:06 | |
clarkb | I have successfully run haypos test case on ze01 without a segfault and run it on ze03 with a segfault. 01 has mordreds new python and 03 doesn't (I think) | 23:08 |
clarkb | so I think any current segfaults are either not that bug or we are not running the python version we expect in bubblewrap or somewhere | 23:09 |
clarkb | we don't fork python/ansible in such a way that it wouldn't load up the newer python do we? | 23:09 |
SpamapS | bubblewrap mounts /usr from the host | 23:10 |
SpamapS | so would be hard to get another version of python in the way | 23:11 |
clarkb | well I know we didn't restart the zuul executor | 23:11 |
clarkb | and maybe we misunderstand where the segfault is happening? I dunno | 23:11 |
SpamapS | that's entireyl possible | 23:11 |
SpamapS | as is spehlinge | 23:11 |
* clarkb makes a paste | 23:14 | |
SpamapS | tristanC: failing on SSO stuff :-P | 23:14 |
clarkb | http://paste.openstack.org/show/620893/ is derived from haypos test case and that does fail on 03 but not 01 when run under python3.4 | 23:14 |
clarkb | er 3.5 | 23:14 |
*** toabctl has quit IRC | 23:15 | |
*** jkilpatr has quit IRC | 23:16 | |
*** harlowja has joined #zuul | 23:17 | |
clarkb | I think it would be good if someone could grab a current core and if we restarted services just to be sure new python is in use everywhere | 23:25 |
ianw | haypo: i had a little more of a look in https://storyboard.openstack.org/#!/story/2001186 ; i'm not seeing obvious connections to weak references | 23:26 |
clarkb | ianw: ya I'm beginning to suspect it's a different segfault based on my testing | 23:26 |
clarkb | ianw: since haypos problem is fixed on ze01 with haypos thing patched in | 23:26 |
ianw | haypo: in summary, seems related to an invalid gi_frame in a generator object ... if that rings any bells | 23:27 |
*** jkilpatr has joined #zuul | 23:29 | |
clarkb | ianw: https://review.openstack.org/#/c/502492/5 avoids using a generator | 23:30 |
clarkb | which is another thing to try to see if problem goes away | 23:30 |
clarkb | I expect that generator runs quite a bit because its reading 4096 bytes at a time or so and our streamed logs can be many megabytes | 23:30 |
mordred | jeblair: https://review.openstack.org/#/c/502492/ | 23:32 |
mordred | clarkb: where are you? | 23:32 |
clarkb | mordred: in the docs retention session | 23:32 |
clarkb | dhellmann asked I be here for this months ago | 23:32 |
mordred | clarkb: kk | 23:33 |
mordred | clarkb: mostly just curious | 23:33 |
clarkb | atrium in steamboat is the room iirc | 23:33 |
*** openstackgerrit has joined #zuul | 23:39 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Rename success ansible variable to zuul_success https://review.openstack.org/502863 | 23:39 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 23:40 |
clarkb | mordred: ^ that failed due to merge conflict? | 23:41 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop https://review.openstack.org/502492 | 23:42 |
mordred | clarkb: yah | 23:47 |
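
The change discussed above (502492, "Shift zuul_stream socket reading to use a for loop") is described only as avoiding a generator while reading roughly 4096 bytes at a time. A minimal sketch of that idea follows, assuming nothing about the real zuul_stream code: the function names, `CHUNK_SIZE`, and the `demo` helper are all hypothetical, and the sketch does not reproduce the segfault. It only shows that a plain loop can return the same bytes as a generator-based reader.

```python
import socket
import threading

# Assumed from the "reading 4096 bytes at a time or so" remark in the log.
CHUNK_SIZE = 4096


def read_stream_with_generator(sock):
    """Generator-style reader: yield chunks until the peer closes the socket."""
    while True:
        chunk = sock.recv(CHUNK_SIZE)
        if not chunk:
            return
        yield chunk


def read_stream_with_loop(sock):
    """Loop-style reader: collect the same chunks without a generator object."""
    chunks = []
    while True:
        chunk = sock.recv(CHUNK_SIZE)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


def _send_and_close(sock, payload):
    """Write the payload, then close so the reader sees end of stream."""
    sock.sendall(payload)
    sock.close()


def demo(payload):
    """Push payload through a local socket pair and read it back both ways."""
    readers = (
        lambda s: b"".join(read_stream_with_generator(s)),
        read_stream_with_loop,
    )
    outputs = []
    for reader in readers:
        left, right = socket.socketpair()
        # Send from a thread so a large payload cannot block the reader.
        sender = threading.Thread(target=_send_and_close, args=(left, payload))
        sender.start()
        outputs.append(reader(right))
        sender.join()
        right.close()
    return outputs


if __name__ == "__main__":
    generator_bytes, loop_bytes = demo(b"log line\n" * 2000)
    print(generator_bytes == loop_bytes)
```

Both readers return identical data; the only difference is whether a generator frame stays alive between reads, which is the object ianw's analysis pointed at (an invalid gi_frame).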
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!