16:00:30 <slaweq> #startmeeting neutron_ci
16:00:32 <slaweq> hi
16:00:33 <openstack> Meeting started Tue Jan 15 16:00:30 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:34 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:36 <openstack> The meeting name has been set to 'neutron_ci'
16:00:40 <haleyb> hi
16:00:53 <mlavalle> o/
16:01:02 <bcafarel> o/
16:02:08 <njohnston> o/
16:02:20 <hongbin> o/
16:03:15 <slaweq> sorry, I had a phone call
16:03:18 <slaweq> ok, let's start
16:03:27 <slaweq> #topic Actions from previous meetings
16:03:44 <slaweq> first one is:
16:03:46 <slaweq> mlavalle will continue debugging trunk test failures in the multinode DVR env
16:03:59 <mlavalle> I've been working on that
16:04:11 <mlavalle> I took a little detour after you pinged me this past Friday
16:04:31 <mlavalle> I think you read my comments in the ssh bug
16:05:02 <slaweq> mlavalle: You are talking about this one https://bugs.launchpad.net/neutron/+bug/1811515 right?
16:05:03 <openstack> Launchpad bug 1811515 in neutron "SSH to FIP fails in CI jobs" [Critical,Confirmed]
16:05:09 <mlavalle> yes
16:05:48 <slaweq> yes, I read it
16:06:06 <mlavalle> so I decided to continue debugging the trunk test case further
16:06:24 <slaweq> but this bug is not related to trunk test only
16:06:31 <slaweq> it happens in many tests currently
16:06:35 <mlavalle> I know
16:07:36 <mlavalle> I just assumed that you were going to continue debugging with a focus on the pyroute2 failure
16:07:47 <slaweq> yes, I'm trying that
16:07:53 <slaweq> and ralonsoh is helping me too
16:07:55 <mlavalle> or do you want me to stop working on the trunk case and focus on this one?
16:08:49 <slaweq> I think we should first fix this issue with many tests failing
16:09:02 <slaweq> and then we can focus on trunk port issue again
16:09:34 <slaweq> because currently I think that this issue with (probably) pyroute2 is blocking everything
16:09:42 <mlavalle> ok, how can I help?
16:11:06 <slaweq> I don't know
16:11:38 <slaweq> if You have any ideas how to debug what exactly happens in the pyroute2 lib, that would be great
16:11:59 <slaweq> I couldn't reproduce this issue on my local devstack
16:12:02 <mlavalle> slaweq: ok, I'll try to take a stab
16:12:46 <slaweq> and debugging in the gate what's going on in a lib installed from PyPI is impossible
16:12:59 <slaweq> at least I don't know of any way to do it
16:13:49 <slaweq> so if anyone has any ideas about how to fix/debug this issue, that would be great
16:14:29 <slaweq> mriedem asked for an e-r query for that bug as it happens a lot in CI
16:14:39 <slaweq> so I will do it today or tomorrow
16:14:49 <slaweq> unless there is anyone else who wants to do it
16:14:51 <slaweq> :)
16:15:28 <slaweq> #action slaweq to make e-r query for bug 1811515
16:15:31 <openstack> bug 1811515 in neutron "SSH to FIP fails in CI jobs" [Critical,Confirmed] https://launchpad.net/bugs/1811515
16:15:54 <slaweq> ok, so I think we can move to the next action then
16:16:03 <slaweq> slaweq to debug problems with ssh to vm in tempest tests
16:16:22 <slaweq> and this is basically related to the same bug which we already discussed
16:17:26 <slaweq> any questions/something to add?
16:17:31 <slaweq> or can we move on?
16:17:54 <mriedem> i'm trying to get a good e-r query on that bug,
16:18:07 <mriedem> but i can't find anything that gets 100% failed jobs
16:18:41 <mriedem> so those errors in the l3 agent log must also show up quite a bit in successful jobs
16:18:49 <slaweq> mriedem: the problem is that this error in the logs is not always causing failures
16:19:15 <slaweq> yes, in some cases this error appears e.g. during router removal or something like that and the tests are fine
16:19:52 <slaweq> but sometimes it happens when adding an interface or configuring some IP address in a namespace, and then a test or some tests fail
16:20:22 <mriedem> could any post-failure logging be added to tempest to help identify the fault?
16:20:37 <mriedem> like, is there something about the router or floating ip resource that would indicate it's broken?
16:21:41 <slaweq> I don't know, basically if it's broken, connectivity to the FIP is not working
16:22:42 <haleyb> perhaps sometimes the floating ip status is ERROR?  guess it depends on when the failure happens
16:23:03 <slaweq> haleyb: maybe but I didn't check that
16:25:39 <haleyb> i also wonder if router status is ACTIVE?
16:26:37 <slaweq> haleyb: IMHO it will depend on the place where the failure happens
16:26:56 <slaweq> there is no one strict pattern for that IMHO
16:28:15 <haleyb> slaweq: right.  i know we made the l3-agent better at recovering and retrying things on failure, but status might not reflect that exactly
16:28:48 <slaweq> mriedem: isn't query like http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22OSError%3A%20%5BErrno%209%5D%20Bad%20file%20descriptor%5C%22%20AND%20build_status%3A%5C%22FAILURE%5C%22 enough?
16:30:02 <mriedem> we normally avoid explicitly adding build_status in the queries
16:30:16 <slaweq> ahh, ok
16:30:20 <slaweq> so it might be hard
16:30:40 <mriedem> because if that error message shows up a lot in non-failure jobs, we could be saying failed jobs are a result of this bug, which they might not be
16:30:58 <mriedem> i.e. it's not good to have errors in the logs if there aren't real errors :)
16:31:18 <slaweq> mriedem: I know, but in this case it is a real error
16:31:25 <mriedem> which is why i was suggesting maybe adding some post-failure logging if possible to try and identify this case
16:31:31 <slaweq> it looks like it is some race condition
16:33:18 <slaweq> mriedem: I can take a look at some of these failures to see if there is some common pattern for all of them, but I'm afraid that it will be hard
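A minimal sketch of what the elastic-recheck query for bug 1811515 could look like, assuming the usual queries/<bug-id>.yaml layout of the elastic-recheck repo; per mriedem's advice it leaves build_status out and instead narrows the hit to the l3-agent log file (the exact logstash tag value for that file is an assumption here):

    # queries/1811515.yaml -- sketch only, not necessarily the query that was merged
    query: >-
      message:"OSError: [Errno 9] Bad file descriptor" AND
      tags:"screen-q-l3.txt"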
16:35:04 <slaweq> anything else mriedem or can we move on?
16:35:15 <mriedem> has anyone identified when this started failing?
16:35:31 <mriedem> to see if new dependencies are being used which might cause a regression? or change to neutron?
16:35:48 <mriedem> seems like it started around jan 9?
16:36:07 <mriedem> anyway, you can move on
16:36:28 <mriedem> https://github.com/openstack/neutron/commit/c6d358d4c6926638fe9d5194e3da112c2750c6a4
16:36:32 <mriedem> ^ seems like a good suspect
16:37:05 <slaweq> I don't think so, as it's failing in a completely different module
16:37:28 <mriedem> privsep is related in the failures i'm seeing though
16:37:54 <slaweq> yes, it is in a module which uses privsep, but a different one
16:39:04 <mriedem> https://github.com/openstack/requirements/commit/6b45c47e53b8820b68ff78eaec8062c4fdf05a56#diff-0bdd949ed8a7fdd4f95240bd951779c8
16:39:06 <mriedem> was jan 9
16:39:09 <mriedem> new privsep release
16:39:40 <mriedem> anyway, i'll dump notes in the bug report
16:39:53 <slaweq> mriedem: ok, thx
16:40:09 <hongbin> we could revert those suspicious commits and recheck a few times, which might help locate the error
16:40:12 <slaweq> mriedem: it might be that this new privsep version exposed one more issue in our code
16:40:29 <slaweq> hongbin: yes, good point
16:40:36 <slaweq> I will do it just after the meeting
16:41:34 <slaweq> #action slaweq to check if oslo.privsep < 1.31.0 will help to workaround issue with SSH to FIP
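A rough sketch of one way to run the DNM test described above, assuming the common pattern of a throwaway openstack/requirements change that lowers the upper-constraints pin, referenced via Depends-On from the neutron test patch; the pre-1.31.0 version number below is illustrative and would be taken from the requirements history:

    # upper-constraints.txt in a DNM openstack/requirements change (illustrative)
    -oslo.privsep===1.31.0
    +oslo.privsep===1.30.1

    # commit message footer of the DNM neutron patch, pointing at the change above
    Depends-On: <gerrit URL of the requirements change>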
16:41:47 <slaweq> ok, let's move on to the next topics then
16:41:50 <slaweq> do You agree?
16:42:13 <mlavalle> mriedem: thanks for the input. We'll look at it
16:42:21 <mlavalle> slaweq: yes, let's move on
16:42:36 <slaweq> yes, thx mriedem for all help with this one
16:42:41 <slaweq> slaweq: ok, thx
16:42:47 <slaweq> #topic Python 3
16:42:59 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:43:07 <slaweq> I don't have anything new here, in fact
16:43:30 <slaweq> we have a few patches to convert jobs to python3 (and zuulv3) but all of them have been stuck in the gate for the last week
16:43:52 <slaweq> do You have anything to add here?
16:44:32 <njohnston> From a python3 perspective, since tempest has switched to python3 by default, do the remaining tempest jobs derive that by default?
16:45:09 <slaweq> njohnston: tempest switched to py3? Do You have a link to the patch?
16:46:33 <njohnston> Hmm, I thought I remembered seeing that on the ML while I was out on vacation, but perhaps I dreamed it because I can't find it in the ML archives now
16:46:40 <njohnston> so never mind :-)
16:46:45 <bcafarel> I remember something similar too
16:46:47 <mlavalle> LOL
16:46:47 <slaweq> I remember only about devstack switch
16:47:02 <njohnston> perhaps that is what I am thinking of
16:47:11 <slaweq> http://lists.openstack.org/pipermail/openstack-discuss/2019-January/001356.html
16:47:24 <slaweq> ^^ is it what You were thinking of?
16:47:41 <bcafarel> aaah yes I think it is
16:47:46 <bcafarel> (well at least for me)
16:48:39 <njohnston> yes that was it
16:48:48 <slaweq> ok, so it's about change in devstack
16:49:02 <slaweq> we still need to configure our jobs to be run on python3 :)
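For reference, a rough sketch of the kind of zuulv3 job tweak slaweq is referring to; the job name and parent are placeholders, the relevant part is forcing python3 through the devstack settings:

    # .zuul.yaml sketch -- job name and parent are illustrative
    - job:
        name: neutron-tempest-example-py3
        parent: devstack-tempest
        vars:
          devstack_localrc:
            USE_PYTHON3: true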
16:50:18 <slaweq> ok, let's move on then
16:50:32 <slaweq> #topic Grafana
16:50:41 <slaweq> #link: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:51:10 <bcafarel> another quick py3 item for functional tests: I saw one green func/py3 result recently, but the string exception thingy still happens most of the time :/ so we are close but not there yet
16:51:47 <slaweq> bcafarel: yes, I wanted to take a look at it again but I didn't have time yet
16:52:14 <bcafarel> I think we are at the point where it passes depending on the generated log and moon phase :)
16:52:38 <bcafarel> as in a few additional lines trigger the string exception
16:52:39 <slaweq> bcafarel: yeah, moon phase might be key here ;)
16:53:14 <slaweq> speaking about grafana, it doesn't look good, especially in the tempest and scenario jobs, but this is related to our main current issue which was already discussed
16:53:39 <mlavalle> so let's not spend time on this
16:53:45 <slaweq> other than that I think it's not bad with fullstack/functional/other jobs
16:53:51 <mlavalle> unless you want to highlight something else
16:53:55 <slaweq> no
16:54:07 <slaweq> in fact I don't have anything else for today's meeting
16:54:21 <mlavalle> ok so before we leave
16:54:34 <mlavalle> let's agree on a set of marching orders
16:54:34 <slaweq> we have 1 main issue which we need to solve and then we will be able to continue work on other things
16:54:59 <mlavalle> 1) slaweq to propose reverts
16:55:12 <slaweq> but it will be a DNM patch, at least for now
16:55:22 <slaweq> just to test if that helps or not
16:55:23 <mlavalle> I know
16:55:26 <slaweq> :)
16:56:05 <mlavalle> 2) recheck reverts to see if we get evidence of fixing the problem
16:56:25 <slaweq> yep
16:56:49 <mlavalle> do you agree with Matt's perception that it started showing up around Jan 9?
16:57:10 <slaweq> more or less
16:57:18 <slaweq> I see some failures on the 6th also
16:57:24 <slaweq> but then there weren't any hits
16:57:41 <slaweq> and I know that we reverted the change which bumped the privsep lib
16:57:50 <mlavalle> are there jobs where you see the problem more frequently?
16:57:53 <slaweq> so it might be related and fit this timeline :)
16:58:39 <slaweq> I think that it's most often in neutron-tempest-dvr and neutron-tempest-linuxbridge jobs
16:58:58 <slaweq> this neutron-tempest-dvr for sure
16:59:07 <slaweq> and IIRC it's running python 3 already
16:59:12 <mlavalle> slaweq: would you post a few examples in the bug?
16:59:16 <slaweq> but I don't think it's the reason
16:59:30 <slaweq> a few examples of failures? yes, I can
16:59:34 <mlavalle> yes
16:59:37 <slaweq> sure
16:59:40 <mlavalle> Thanks
17:00:03 <slaweq> #action slaweq to post more examples of failures in bug 1811515
17:00:04 <openstack> bug 1811515 in neutron "SSH to FIP fails in CI jobs" [Critical,Confirmed] https://launchpad.net/bugs/1811515
17:00:04 <mlavalle> slaweq: before you leave for good tonight, please ping me so we synch up
17:00:13 <slaweq> mlavalle: ok
17:00:14 <slaweq> I will
17:00:19 <slaweq> we need to finish now
17:00:22 <slaweq> thx for attending
17:00:23 <mlavalle> o/
17:00:24 <slaweq> #endmeeting