16:01:19 <ihrachys> #startmeeting neutron_ci
16:01:20 <openstack> Meeting started Tue Nov 28 16:01:19 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:23 <openstack> The meeting name has been set to 'neutron_ci'
16:01:26 <mlavalle> o/
16:01:27 <jlibosva> o/
16:01:29 <haleyb> hi
16:01:46 <ihrachys> hello folks
16:01:48 <ihrachys> #topic Actions from prev meeting
16:02:04 <ihrachys> as usual, we start with actions from the last meeting
16:02:06 <ihrachys> first is "jlibosva to figure out why unstable_test didn't work for fullstack scenario case"
16:02:22 <ihrachys> though afaiu we figured it out in the meeting
16:02:24 <jlibosva> we figured that out in the last meeting itself
16:02:33 <jlibosva> right, it was because it failed in the setup phase
16:02:42 <ihrachys> and we decided to leave it as is
16:02:56 <mlavalle> yes
16:03:06 <ihrachys> + your patch that bumped the timeout to wait for agents in fullstack
16:03:17 <jlibosva> right - https://review.openstack.org/#/c/522872/
16:03:18 <ihrachys> I mean this one: https://review.openstack.org/#/c/522872/
16:03:20 <ihrachys> ok
16:03:41 <ihrachys> I guess we could backport it to reduce spurious failures in stable
16:03:49 <ihrachys> next was "jlibosva to investigate / report a bug for env deployment failure in fullstack because of port down"
16:03:57 <ihrachys> and that's basically the fix above
16:04:08 <jlibosva> correct - bug here https://bugs.launchpad.net/neutron/+bug/1734357
16:04:08 <openstack> Launchpad bug 1734357 in neutron "fullstack: Test runner doesn't wait enough time for env to come up" [High,Fix released] - Assigned to Jakub Libosvar (libosvar)
16:04:24 <jlibosva> and I requested the backport to pike right now - https://review.openstack.org/#/c/523453/
16:04:29 <ihrachys> jlibosva, so you observed up to 3 mins waiting on agents?
16:04:48 <jlibosva> no, I observed 67 seconds but I wanted to be safe so I tripled the value
16:05:04 <jlibosva> it's active polling so once the env is ready, the waiting loop stops
16:05:16 <ihrachys> I see. why is it so slow though? is it because of high parallelism?
16:05:19 <jlibosva> so in the worst case we wait three minutes, and only when there is a real issue
16:05:25 <ihrachys> or would it be the same in a single thread
16:05:41 <jlibosva> I didn't investigate that, I saw it took 30 seconds for the ovs-agent to start logging
16:05:55 <jlibosva> i.e. between the test runner spawning the process and the process actually doing something there was a 30-second gap
16:06:39 <jlibosva> so it sounds like a busy machine. but I haven't checked load and cpu usage stats
16:07:14 <ihrachys> ok, I guess if it's something serious it will resurface eventually
16:07:35 <ihrachys> the job already takes a lot of time. we won't be able to push the boundary indefinitely while adding new cases
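The "active polling" jlibosva describes is a wait_until_true-style helper: it re-checks a readiness predicate at a short interval and returns as soon as the environment is up, so the tripled timeout only costs time when something is genuinely broken. A minimal sketch of such a helper; this is illustrative, not the actual neutron code, and the 180-second default simply mirrors the tripled value discussed here:

    import time

    def wait_until_true(predicate, timeout=180, sleep=1):
        """Poll predicate() until it returns True or the timeout expires."""
        deadline = time.time() + timeout
        while not predicate():
            if time.time() > deadline:
                raise RuntimeError('timed out waiting for the condition')
            time.sleep(sleep)

    # hypothetical usage: wait for all fullstack agents to report alive
    # wait_until_true(lambda: all(agent.is_alive() for agent in env.agents))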
16:07:49 <ihrachys> ok, moving on
16:07:52 <ihrachys> "ihrachys to investigate latest https://bugs.launchpad.net/neutron/+bug/1673531 failures"
16:07:52 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:07:57 <ihrachys> I had a look at the logs
16:08:22 <ihrachys> so the test case fails polling the port of the second fake machine for ACTIVE
16:08:30 <ihrachys> here is the first attempt: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_56_469
16:08:45 <ihrachys> actually, that attempt is the only one, despite us using wait_until_true
16:09:02 <ihrachys> and this is because for some reason neutron-server hung in the middle of processing the request
16:09:14 <ihrachys> here is where the server eventually gives up: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_22_01_03_475
16:09:23 <ihrachys> probably triggered by the DELETE sent during cleanup of the test case
16:09:43 <ihrachys> note that those are the only two messages with that req-id in the server log
16:10:09 <ihrachys> not sure what to make of it yet
16:10:31 <ihrachys> usually we have some messages from when the request arrives
16:10:42 <ihrachys> it could be though that it didn't arrive for some reason
16:11:26 <ihrachys> and then we see it e.g. repeated in the background / tcp connectivity to the server finally recovered but it's too late
16:12:16 <jlibosva> so you're saying that wait_until_true doesn't actually poll?
16:13:15 <ihrachys> jlibosva, maybe it's blocked because it waits for a reply to its request
16:13:22 <ihrachys> we wait for 60 seconds there
16:13:28 <ihrachys> and after that it bails out
16:13:46 <ihrachys> and the client apparently also waits 60s+ for a reply
16:13:58 <ihrachys> so we never have a chance to actually retry with a new http request
16:14:15 <ihrachys> if e.g. the previous one is lost somehow by the server
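To illustrate the failure mode being described: if the predicate polled by wait_until_true issues a synchronous client call whose own timeout is roughly as long as the outer wait, the outer loop effectively gets a single attempt, because one lost request consumes the whole budget. A hypothetical sketch (timings and names are illustrative, not taken from the actual test):

    import time

    def show_port_blocking(port_id):
        # stand-in for a neutronclient call whose reply never arrives;
        # the client's own HTTP timeout is roughly 60 seconds
        time.sleep(60)
        raise TimeoutError('no response from neutron-server')

    def port_is_active(port_id):
        try:
            return show_port_blocking(port_id)['status'] == 'ACTIVE'
        except TimeoutError:
            return False

    # with a 60-second outer wait and a ~60-second blocking predicate,
    # the "polling" loop makes exactly one attempt before giving up
    deadline = time.time() + 60
    while time.time() < deadline:
        if port_is_active('fake-port-id'):
            break
        time.sleep(1)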
16:14:35 <jlibosva> I see
16:14:39 <ihrachys> also note that the message in the server log is 6s+ after the 60s
16:15:57 <ihrachys> oh and one more thing there
16:16:07 <ihrachys> there are two ports there (two fake machines)
16:16:20 <ihrachys> the first one is ACTIVE, and here is the message in the test log that gives the req-id: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_56_468
16:16:34 <ihrachys> but when you search for that id in the server log, you don't find it there at all
16:18:06 <ihrachys> but if you try to relate messages, the request is probably this: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_21_59_56_456
16:18:10 <ihrachys> note a different id
16:18:22 <jlibosva> I also noted http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_22_00_00_838
16:18:34 <jlibosva> which is 4 seconds after querying the API
16:18:43 <jlibosva> so maybe the ovs agent reported something in the meantime
16:19:51 <ihrachys> how is it possible that we have a different req-id on the server and the client?
16:20:15 <ihrachys> is there something in between proxying / overriding headers?
16:20:45 <ihrachys> anyway... I will dig more
16:20:57 <ihrachys> I probably should capture what we already have there
16:21:05 <ihrachys> next item was "slaweq to investigate / report a bug for test_dscp_marking_packets fullstack failure"
16:21:13 <ihrachys> I don't see slaweq around
16:21:40 <ihrachys> but I see this reported: https://bugs.launchpad.net/neutron/+bug/1733649
16:21:40 <openstack> Launchpad bug 1733649 in neutron "fullstack neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs.test_dscp_marking_packets(openflow-native) failure" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:22:07 <ihrachys> slaweq seems to be working on it
16:22:08 <jlibosva> back to the previous one - maybe when you don't provide a req-id it generates one?
16:22:19 <jlibosva> and what we see on the server comes from a previous call
16:23:03 <ihrachys> jlibosva, what do you mean? I believe the req-id logged in the test case log is generated, yes, but then it is still sent to the server
16:23:19 <jlibosva> I see curl used:
16:23:30 <jlibosva> http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_53_605
16:23:49 <jlibosva> 3 seconds before the neutronclient call with the req-id
16:24:24 <ihrachys> it still says "neutronclient.client" in the message
16:24:34 <ihrachys> so probably neutronclient uses curl under the hood?
16:24:50 <jlibosva> no, I think it uses a python library
16:25:03 <jlibosva> urllib or httplib or something
16:25:25 <ihrachys> ok, I will have a closer look
16:25:32 <ihrachys> I don't want to take all the time for this issue
16:25:35 <ihrachys> so let's move on
16:25:39 <jlibosva> sure
16:25:41 <jlibosva> sorry
16:25:46 <ihrachys> those were all the items from the previous meeting
16:25:55 <ihrachys> #topic Grafana
16:25:58 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:26:32 <ihrachys> I was actually hopeful that with the agent timeout fix the failure rate for fullstack would drop, but it seems like it didn't
16:26:54 <haleyb> no rainbows and unicorns :(
16:27:20 <ihrachys> it's at least not as flat as it was, but we have a long way to go
16:27:30 <ihrachys> we'll look at the latest failures later
16:28:16 <ihrachys> in the meantime, I will only note that nothing material changed for fullstack or scenarios.
16:28:24 <ihrachys> periodics seem ok too
16:28:31 <ihrachys> so let's dive into specifics
16:28:42 <ihrachys> #topic Fullstack
16:29:28 <ihrachys> for starters, we have Kuba's fix to backport; I'm working on the req-id / port not active issue; slaweq is looking at the qos dscp failure
16:29:40 <ihrachys> let's see if there is anything else that we don't know about in the latest logs
16:30:14 <ihrachys> I am looking at http://logs.openstack.org/72/522872/1/check/legacy-neutron-dsvm-fullstack/50dfd44/logs/testr_results.html.gz
16:30:48 <ihrachys> actually, the failure rate rebound for fullstack back to ~90% could be because of the dscp qos failure that slaweq suggested in LP is new
16:31:29 <jlibosva> is that with the timeout bump patch in?
16:31:42 <ihrachys> it is the results from your patch, so yes
16:31:47 <jlibosva> :(
16:31:55 <jlibosva> the env build still fails
16:31:57 <ihrachys> as we can see in the logs, a lot of failures are due to a timeout issue that is similar to what I am looking at
16:32:08 <ihrachys> where it fails waiting for a port to become ACTIVE
16:32:10 <jlibosva> oh, wait, no
16:32:38 <ihrachys> it could be either that the port genuinely is down, or an issue like mine where the server is not responsive
16:33:21 <ihrachys> also in this run, test_controller_timeout_does_not_break_connectivity_sigkill failed but with a slightly different error
16:33:24 <ihrachys> neutron.tests.common.machine_fixtures.FakeMachineException: No ICMP reply obtained from IP address 20.0.0.10
16:33:52 <ihrachys> which happens AFTER ports are validated to be ACTIVE
16:34:01 <jlibosva> shall we mark the test_connectivity tests as unstable with the bug that you're looking at?
16:34:10 <jlibosva> and qos with slaweq's bug
16:34:12 <ihrachys> so it must be a different issue
16:34:25 <ihrachys> jlibosva, probably. I will post a patch.
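For reference, marking a test as unstable is done with a decorator that turns a failure of the test body into a skip pointing at the tracking bug, which is also why it could not help earlier when the failure happened in the setup phase, before the decorated method runs. A rough sketch of how such a decorator works (not the exact neutron implementation; the usage example is hypothetical):

    import functools

    def unstable_test(reason):
        """Turn a failure of a known-flaky test into a skip citing the bug."""
        def decorator(f):
            @functools.wraps(f)
            def wrapper(self, *args, **kwargs):
                try:
                    return f(self, *args, **kwargs)
                except Exception as exc:
                    # failures in setUp happen before this wrapper runs,
                    # so only the test body itself is covered
                    self.skipTest('Marked unstable (%s), failed with: %s'
                                  % (reason, exc))
            return wrapper
        return decorator

    # hypothetical usage on a flaky connectivity test:
    # @unstable_test("bug 1673531")
    # def test_controller_timeout_does_not_break_connectivity_sigkill(self):
    #     ...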
16:34:49 <ihrachys> #action ihrachys to disable connectivity fullstack tests while we look for culprit
16:35:07 <ihrachys> #action ihrachys to disable dscp qos fullstack test while we look for culprit
16:35:43 <ihrachys> there are two more failures that don't fall into that set
16:35:56 <ihrachys> test_l2_agent_restart with AssertionError: False is not true in self.assertTrue(all([r.done() for r in restarts]))
16:36:05 <ihrachys> and test_securitygroup(linuxbridge-iptables) with RuntimeError: Process ['ncat', u'20.0.0.11', '3333', '-w', '20'] hasn't been spawned in 20 seconds
16:36:38 <jlibosva> I can have a look at the netcat issue, I hope it won't be related to the linuxbridge agent :)
16:37:16 <ihrachys> #action jlibosva to look at test_securitygroup(linuxbridge-iptables) failure in fullstack
16:37:49 <ihrachys> any candidates to look at test_l2_agent_restart ?
16:39:36 <jlibosva> I can also pick that one if nobody wants it
16:39:39 <ihrachys> :)
16:39:43 <ihrachys> sorry jlibosva
16:39:46 <ihrachys> we love you :)
16:39:51 <jlibosva> I have a feeling it will be related to slow agent starts
16:40:04 <ihrachys> #action jlibosva to look at test_l2_agent_restart fullstack failure
16:40:09 <jlibosva> as I observed it in the env build-up issue
16:40:29 * jlibosva loves being loved
16:40:38 <ihrachys> that restart test afair brutally restarts agents. if it's slow, we could hit it in one of the restart cycles
16:41:13 <ihrachys> though bumping the timeout to 60 sec for each iteration there is maybe not the best path
16:41:29 <ihrachys> ok, seems like we have work to do for fullstack
16:41:32 <ihrachys> #topic Scenarios
16:42:00 <ihrachys> we have old bugs
16:42:01 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1717302
16:42:01 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed]
16:42:05 <ihrachys> and https://bugs.launchpad.net/neutron/+bug/1719711
16:42:05 <openstack> Launchpad bug 1719711 in neutron "iptables failed to apply when binding a port with AGENT.debug_iptables_rules enabled" [High,In progress] - Assigned to Dr. Jens Harbott (j-harbott)
16:42:28 <ihrachys> for the latter, there seems to be a fix!
16:42:32 <ihrachys> here https://review.openstack.org/#/c/523319/
16:43:02 <ihrachys> everyone please review after the meeting :)
16:43:14 <ihrachys> as for the dvr fip issue, mlavalle haleyb no news I believe?
16:43:42 <haleyb> ihrachys: i will need to look, nothing from swami yet
16:43:44 <mlavalle> ihrachys: we haven't heard from swami
16:44:03 <mlavalle> let's make sure we talk to him this week
16:44:14 <mlavalle> I think we will have to reassign that issue
16:44:21 <ihrachys> yeah... maybe he is swamped and we need someone else to have a look instead
16:44:27 <mlavalle> I'll talk to him
16:44:31 <ihrachys> right, it's not moving forward. thanks!
16:45:03 <ihrachys> #topic Tempest plugin
16:45:15 <ihrachys> so the etherpad is https://etherpad.openstack.org/p/neutron-tempest-plugin-job-move
16:45:31 <ihrachys> and we have some items there still in progress
16:45:36 <ihrachys> actually quite a lot :)
16:45:48 <ihrachys> one thing that blocks us is that we still have legacy jobs in master
16:45:58 <ihrachys> so we can't e.g. remove tempest test classes from the neutron tree
16:46:04 <ihrachys> I believe mlavalle was looking at it
16:46:06 <ihrachys> mlavalle, any news?
16:46:22 <mlavalle> well, I am looking at it for stable branches
16:46:48 <mlavalle> I pushed this https://review.openstack.org/#/c/522931 over the weekend
16:46:51 <jlibosva> are we going to migrate stable branches too?
16:46:54 <jlibosva> I thought not
16:46:56 <ihrachys> mlavalle, so your order would be: move jobs to stable, then remove from infra?
16:47:16 <jlibosva> ah, sorry. it's about legacy jobs. ignore me :)
16:47:19 <ihrachys> jlibosva, projects move their jobs, including to stable. I saw others doing it, there is an infra request.
16:47:21 <mlavalle> yes
16:47:32 <ihrachys> jlibosva, after we move to stable, they are able to clean them up
16:47:35 <ihrachys> in their repos
16:47:40 <mlavalle> I can also move master
16:47:52 <mlavalle> it's only that I thought someone else was doing it
16:47:57 <jlibosva> I apologize, I thought you meant adopting stable branches for the neutron plugin :)
16:47:59 <ihrachys> mlavalle, I was thinking, we could also have a small patch that makes the legacy jobs execute on stable only while we move the jobs?
16:48:09 <ihrachys> mlavalle, that would be a simple fix and would unblock cleanup in the neutron tree
16:48:32 <ihrachys> mlavalle, master doesn't need a move, it needs removal since we already have new jobs there
16:48:46 <mlavalle> yeah, that's what I thought
16:48:55 <ihrachys> mlavalle, I would imagine we could do it with a regex against stable.* for the branch for the legacy jobs?
16:49:22 <mlavalle> yeah that sounds right
16:50:35 <ihrachys> mlavalle, do you want to cover the regex yourself or should I do it?
16:50:48 <mlavalle> please take care of that
16:51:09 <ihrachys> ok
16:51:19 <ihrachys> #action ihrachys to disable legacy jobs for neutron master
16:51:26 <mlavalle> I will continue with the patch for Pike ^^^^
16:51:35 <mlavalle> and also another one for Ocata
16:51:39 <ihrachys> great
16:51:49 <ihrachys> for ocata you may just backport once done with pike
16:51:52 <ihrachys> it should work
16:52:00 <mlavalle> yeap, that's my plan
16:52:21 <ihrachys> oh, so I looked at the patch for pike, and I see you move all tempest jobs, not just those for the tempest plugin?
16:52:37 <mlavalle> I moved all the legacy ones
16:52:44 <ihrachys> I was under the impression you were covering plugin jobs only
16:52:55 <ihrachys> if not, I don't think anyone is looking at a master patch to do the same
16:53:15 <mlavalle> ok, I can do a similar thing for master
16:53:26 <ihrachys> and for those, it could make sense to start in master
16:53:27 <ihrachys> yeah
16:53:30 <ihrachys> thanks!
16:54:18 <ihrachys> I also had a small fix to skip the new jobs for doc/* changes: https://review.openstack.org/#/c/523244/
16:54:25 <ihrachys> spotted them being executed on a doc-only change in the neutron repo
16:55:19 <mlavalle> pushed it in
16:55:34 * ihrachys bows
16:55:59 <ihrachys> and once the legacy jobs are gone in master, we can push https://review.openstack.org/#/c/506672/
16:56:25 <mlavalle> yeah
16:56:39 <ihrachys> ok, I think that mostly covers the next steps for the plugin repo
16:56:46 <ihrachys> #topic Open discussion
16:56:46 <mlavalle> ++
16:56:50 <ihrachys> anything to bring up?
16:57:26 <ihrachys> seems like no! well then, you have 3 mins back!
16:57:30 <ihrachys> enjoy!
16:57:33 <ihrachys> #endmeeting