16:00:30 #startmeeting neutron_ci
16:00:32 hi
16:00:33 Meeting started Tue Jan 15 16:00:30 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:34 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:36 The meeting name has been set to 'neutron_ci'
16:00:40 hi
16:00:53 o/
16:01:02 o/
16:02:08 o/
16:02:20 o/
16:03:15 sorry, I had a phone call
16:03:18 ok, let's start
16:03:27 #topic Actions from previous meetings
16:03:44 first one is:
16:03:46 mlavalle will continue debugging trunk tests failures in multinode dvr env
16:03:59 I've been working on that
16:04:11 I took a little detour after you pinged me this past Friday
16:04:31 I think you read my comments in the ssh bug
16:05:02 mlavalle: You are talking about this one https://bugs.launchpad.net/neutron/+bug/1811515 right?
16:05:03 Launchpad bug 1811515 in neutron "SSH to FIP fails in CI jobs" [Critical,Confirmed]
16:05:09 yes
16:05:48 yes, I read it
16:06:06 so I decided to continue debugging the trunk test case further
16:06:24 but this bug is not related to the trunk test only
16:06:31 it happens in many tests currently
16:06:35 I know
16:07:36 I just assumed that you were going to continue debugging with a focus on the pyroute2 failure
16:07:47 yes, I'm trying that
16:07:53 and ralonsoh is helping me too
16:07:55 or do you want me to stop working on the trunk case and focus on this one?
16:08:49 I think we should first fix this issue with many tests failing
16:09:02 and then we can focus on the trunk port issue again
16:09:34 because currently I think that this issue with (probably) pyroute2 is blocking everything
16:09:42 ok, how can I help?
16:11:06 I don't know
16:11:38 if You have any ideas on how to debug what exactly happens in the pyroute2 lib, that would be great
16:11:59 I couldn't reproduce this issue on my local devstack
16:12:02 slaweq: ok, I'll try to take a stab
16:12:46 and debugging in the gate what's going on in a lib installed from PyPI is impossible
16:12:59 at least I don't know of any way to do it
16:13:49 so if anyone has any ideas about how to fix/debug this issue, that would be great
16:14:29 mriedem asked for an e-r query for that bug as it happens a lot in CI
16:14:39 so I will do it today or tomorrow
16:14:49 unless there is anyone else who wants to do it
16:14:51 :)
16:15:28 #action slaweq to make e-r query for bug 1811515
16:15:31 bug 1811515 in neutron "SSH to FIP fails in CI jobs" [Critical,Confirmed] https://launchpad.net/bugs/1811515
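For reference, elastic-recheck queries are small YAML files in the openstack-infra/elastic-recheck repository, one per bug and named after the bug number. Below is a minimal sketch of what a query for this bug could look like; the log-file tag is an assumption, and since the raw error message also shows up in some successful runs (as discussed later in the meeting), the query that actually gets proposed would likely need further narrowing:

    # queries/1811515.yaml -- hypothetical sketch, not the query that was actually proposed
    query: >-
      message:"OSError: [Errno 9] Bad file descriptor" AND
      tags:"screen-q-l3.txt"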
16:15:54 ok, so I think we can move to the next action then
16:16:03 slaweq to debug problems with ssh to vm in tempest tests
16:16:22 and this is basically related to the same bug which we already discussed
16:17:26 any questions/something to add?
16:17:31 or can we move on?
16:17:54 i'm trying to get a good e-r query on that bug,
16:18:07 but i can't find anything that gets 100% failed jobs
16:18:41 so those errors in the l3 agent log must also show up quite a bit in successful jobs
16:18:49 mriedem: the problem is that this error in the logs is not always causing failures
16:19:15 yes, in some cases this error appears e.g. during router removal or something like that and the tests are fine
16:19:52 but sometimes it happens while adding an interface or configuring some IP address in a namespace and then a test or some tests fail
16:20:22 could any post-failure logging be added to tempest to help identify the fault?
16:20:37 like, is there something about the router or floating ip resource that would indicate it's broken?
16:21:41 I don't know, basically if it's broken, connectivity to the FIP is not working
16:22:42 perhaps sometimes the floating ip status is ERROR? guess it depends on when the failure happens
16:23:03 haleyb: maybe, but I didn't check that
16:25:39 i also wonder if the router status is ACTIVE?
16:26:37 haleyb: IMHO it will depend on where the failure happens
16:26:56 there is no single strict pattern for that IMHO
16:28:15 slaweq: right. i know we made the l3-agent better at recovering and retrying things on failure, but status might not reflect that exactly
16:28:48 mriedem: isn't a query like http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22OSError%3A%20%5BErrno%209%5D%20Bad%20file%20descriptor%5C%22%20AND%20build_status%3A%5C%22FAILURE%5C%22 enough?
16:30:02 we normally avoid explicitly adding build_status in the queries
16:30:16 ahh, ok
16:30:20 so it might be hard
16:30:40 because if that error message shows up a lot in non-failure jobs, we could be saying failed jobs are a result of this bug, which they might not be
16:30:58 i.e. it's not good to have errors in the logs if there aren't real errors :)
16:31:18 mriedem: I know, but in this case it is a real error
16:31:25 which is why i was suggesting maybe adding some post-failure logging if possible to try and identify this case
16:31:31 it looks like it is some race condition
16:33:18 mriedem: I can take a look at some of those failures to see if there is a common pattern for all of them, but I'm afraid that it will be hard
16:35:04 anything else mriedem or can we move on?
16:35:15 has anyone identified when this started failing?
16:35:31 to see if new dependencies are being used which might cause a regression? or a change to neutron?
16:35:48 seems like it started around jan 9?
16:36:07 anyway, you can move on
16:36:28 https://github.com/openstack/neutron/commit/c6d358d4c6926638fe9d5194e3da112c2750c6a4
16:36:32 ^ seems like a good suspect
16:37:05 I don't think so, as it's failing in a completely different module
16:37:28 privsep is related in the failures i'm seeing though
16:37:54 yes, it is in a module which uses privsep, but a different one
16:39:04 https://github.com/openstack/requirements/commit/6b45c47e53b8820b68ff78eaec8062c4fdf05a56#diff-0bdd949ed8a7fdd4f95240bd951779c8
16:39:06 was jan 9
16:39:09 new privsep release
16:39:40 anyway, i'll dump notes in the bug report
16:39:53 mriedem: ok, thx
16:40:09 we could revert those suspicious commits and recheck a few times, which might help locate the error
16:40:12 mriedem: it might be that this new privsep version surfaced one more issue in our code
16:40:29 hongbin: yes, good point
16:40:36 I will do it just after the meeting
16:41:34 #action slaweq to check if oslo.privsep < 1.31.0 will help to workaround issue with SSH to FIP
16:41:47 ok, let's move on to the next topics then
16:41:50 do You agree?
16:42:13 mriedem: thanks for the input. We'll look at it
16:42:21 slaweq: yes, let's move on
16:42:36 yes, thx mriedem for all the help with this one
16:42:41 slaweq: ok, thx
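Since the gate installs libraries according to the upper-constraints file in openstack/requirements, the simplest way to test this workaround is a do-not-merge constraints change that a neutron DNM patch can pull in via Depends-On. A rough sketch of that edit, assuming the release immediately before 1.31.0 was 1.30.1:

    # hypothetical DNM edit to upper-constraints.txt in openstack/requirements,
    # referenced from a neutron test patch with a Depends-On footer
    -oslo.privsep===1.31.0
    +oslo.privsep===1.30.1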
16:42:47 #topic Python 3
16:42:59 Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:43:07 I don't have anything new here in fact
16:43:30 we have a few patches to convert jobs to python3 (and zuulv3) but all of them have been stuck in the gate for the last week
16:43:52 do You have anything to add here?
16:44:32 From a python3 perspective, since tempest has switched to python3 by default, do the remaining tempest jobs derive that by default?
16:45:09 njohnston: tempest switched to py3? Do You have a link to the patch?
16:46:33 Hmm, I thought I remembered seeing that on the ML while I was out on vacation, but perhaps I dreamed it because I can't find it in the ML archives now
16:46:40 so never mind :-)
16:46:45 I remember something similar too
16:46:47 LOL
16:46:47 I remember only the devstack switch
16:47:02 perhaps that is what I am thinking of
16:47:11 http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001356.html
16:47:24 ^^ is that what You were thinking of?
16:47:41 aaah yes, I think it is
16:47:46 (well at least for me)
16:48:39 yes that was it
16:48:48 ok, so it's about the change in devstack
16:49:02 we still need to configure our jobs to run on python3 :)
16:50:18 ok, let's move on then
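For the zuulv3-native devstack jobs, the python3 switch is mostly a matter of setting a job variable. A minimal sketch of such a conversion is below, with the job and parent names used purely as placeholders rather than the actual neutron job definitions:

    # hypothetical zuulv3 job variant running devstack/tempest under python3
    - job:
        name: neutron-tempest-example-py3
        parent: devstack-tempest
        vars:
          devstack_localrc:
            USE_PYTHON3: true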
16:50:32 #topic Grafana
16:50:41 #link: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:51:10 another quick py3 item for functional tests, I saw one green func/py3 result recently, but the string exception thingy still happens most of the time :/ so we are close but not there yet
16:51:47 bcafarel: yes, I wanted to take a look at it again but I didn't have time yet
16:52:14 I think we are at the point where it passes depending on the generated log and the moon phase :)
16:52:38 as in, a few additional lines trigger the string exception
16:52:39 bcafarel: yeah, moon phase might be key here ;)
16:53:14 speaking of grafana, it doesn't look good, especially in the tempest and scenario jobs, but this is related to our main current issue which was already discussed
16:53:39 so let's not spend time on this
16:53:45 other than that I think it's not bad for the fullstack/functional/other jobs
16:53:51 unless you want to highlight something else
16:53:55 no
16:54:07 in fact I don't have anything else for today's meeting
16:54:21 ok so before we leave
16:54:34 let's agree on a set of marching orders
16:54:34 we have 1 main issue which we need to solve and then we will be able to continue work on other things
16:54:59 1) slaweq to propose reverts
16:55:12 but it will be a DNM patch, at least for now
16:55:22 just to test if that helps or not
16:55:23 I know
16:55:26 :)
16:56:05 2) recheck reverts to see if we get evidence of fixing the problem
16:56:25 yep
16:56:49 do you agree with Matt's perception that it started showing up around Jan 9?
16:57:10 more or less
16:57:18 I see some failures on the 6th also
16:57:24 but then there weren't any hits
16:57:41 and I know that we reverted the change which bumped the privsep lib
16:57:50 are there jobs where you see the problem more frequently?
16:57:53 so it might be related and fit this timeline :)
16:58:39 I think that it happens most often in the neutron-tempest-dvr and neutron-tempest-linuxbridge jobs
16:58:58 the neutron-tempest-dvr one for sure
16:59:07 and IIRC it's running python 3 already
16:59:12 slaweq: would you post a few examples in the bug?
16:59:16 but I don't think it's the reason
16:59:30 a few examples of failures? yes, I can
16:59:34 yes
16:59:37 sure
16:59:40 Thanks
17:00:03 #action slaweq to post more examples of failures in bug 1811515
17:00:04 bug 1811515 in neutron "SSH to FIP fails in CI jobs" [Critical,Confirmed] https://launchpad.net/bugs/1811515
17:00:04 slaweq: before you leave for good tonight, please ping me so we can sync up
17:00:13 mlavalle: ok
17:00:14 I will
17:00:19 we need to finish now
17:00:22 thx for attending
17:00:23 o/
17:00:24 #endmeeting