16:01:19 <ihrachys> #startmeeting neutron_ci
16:01:20 <openstack> Meeting started Tue Nov 28 16:01:19 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:23 <openstack> The meeting name has been set to 'neutron_ci'
16:01:26 <mlavalle> o/
16:01:27 <jlibosva> o/
16:01:29 <haleyb> hi
16:01:46 <ihrachys> hello folks
16:01:48 <ihrachys> #topic Actions from prev meeting
16:02:04 <ihrachys> as usual, we start with actions from the last meeting
16:02:06 <ihrachys> first is "jlibosva to figure out why unstable_test didn't work for fullstack scenario case"
16:02:22 <ihrachys> though afaiu we figured it out in the meeting
16:02:24 <jlibosva> we figured that out in the last meeting itself
16:02:33 <jlibosva> right, it was because it failed in the setup phase
16:02:42 <ihrachys> and we decided to leave it as is
16:02:56 <mlavalle> yes
16:03:06 <ihrachys> + your patch that bumped the timeout to wait for agents in fullstack
16:03:17 <jlibosva> right - https://review.openstack.org/#/c/522872/
16:03:18 <ihrachys> I mean this one: https://review.openstack.org/#/c/522872/
16:03:20 <ihrachys> ok
16:03:41 <ihrachys> I guess we could backport it to reduce spurious failures in stable
16:03:49 <ihrachys> next was "jlibosva to investigate / report a bug for env deployment failure in fullstack because of port down"
16:03:57 <ihrachys> and that's basically the fix above
16:04:08 <jlibosva> correct - bug here https://bugs.launchpad.net/neutron/+bug/1734357
16:04:08 <openstack> Launchpad bug 1734357 in neutron "fullstack: Test runner doesn't wait enough time for env to come up" [High,Fix released] - Assigned to Jakub Libosvar (libosvar)
16:04:24 <jlibosva> and I requested the backport to pike right now - https://review.openstack.org/#/c/523453/
16:04:29 <ihrachys> jlibosva, so you observed up to 3 mins waiting on agents?
16:04:48 <jlibosva> no, I observed 67 seconds but I wanted to be safe so I tripled the value
16:05:04 <jlibosva> it's active polling so once the env is ready, the waiting loop stops
16:05:16 <ihrachys> I see. why is it so slow though? is it because of high parallelism?
16:05:19 <jlibosva> so in the worst case we wait three minutes, and only when there is a real issue
16:05:25 <ihrachys> or would it be the same in a single thread
16:05:41 <jlibosva> I didn't investigate that, I saw it took 30 seconds for the ovs-agent to start logging
16:05:55 <jlibosva> i.e. between the test runner spawning the process and the process actually doing something there was a 30-second gap
16:06:39 <jlibosva> so it sounds like a busy machine. but I haven't checked load and cpu usage stats
16:07:14 <ihrachys> ok, I guess if it's something serious it will resurface eventually
16:07:35 <ihrachys> the job already takes a lot of time. we won't be able to push the boundary indefinitely while adding new cases
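The "active polling" jlibosva describes is a wait_until_true-style helper: it re-checks a readiness predicate at a short interval and returns as soon as the environment is up, so the tripled timeout only costs time when something is genuinely broken. A minimal sketch of such a helper; this is illustrative, not the actual neutron code, and the 180-second default simply mirrors the tripled value discussed here:

    import time

    def wait_until_true(predicate, timeout=180, sleep=1):
        """Poll predicate() until it returns True or the timeout expires."""
        deadline = time.time() + timeout
        while not predicate():
            if time.time() > deadline:
                raise RuntimeError('timed out waiting for the condition')
            time.sleep(sleep)

    # hypothetical usage: wait for all fullstack agents to report alive
    # wait_until_true(lambda: all(agent.is_alive() for agent in env.agents))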
16:07:49 <ihrachys> ok, moving on
16:07:52 <ihrachys> "ihrachys to investigate latest https://bugs.launchpad.net/neutron/+bug/1673531 failures"
16:07:52 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:07:57 <ihrachys> I had a look at the logs
16:08:22 <ihrachys> so the test case fails polling the port of the second fake machine for ACTIVE
16:08:30 <ihrachys> here is the first attempt: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_56_469
16:08:45 <ihrachys> actually, that attempt is the only one, despite us using wait_until_true
16:09:02 <ihrachys> and this is because for some reason neutron-server hung in the middle of processing the request
16:09:14 <ihrachys> here is where the server eventually gives up: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_22_01_03_475
16:09:23 <ihrachys> probably triggered by the DELETE sent during cleanup of the test case
16:09:43 <ihrachys> note that those are the only two messages with that req-id in the server log
16:10:09 <ihrachys> not sure what to make of it yet
16:10:31 <ihrachys> usually we have some messages from when the request arrives
16:10:42 <ihrachys> it could be though that it didn't arrive for some reason
16:11:26 <ihrachys> and then we see it e.g. repeated in the background / tcp connectivity to the server finally recovered but it's too late
16:12:16 <jlibosva> so you're saying that wait_until_true doesn't actually poll?
16:13:15 <ihrachys> jlibosva, maybe it's blocked because it waits for a reply to its request
16:13:22 <ihrachys> we wait for 60 seconds there
16:13:28 <ihrachys> and after that it bails out
16:13:46 <ihrachys> and the client apparently also waits 60s+ for a reply
16:13:58 <ihrachys> so we never have a chance to actually retry with a new http request
16:14:15 <ihrachys> if e.g. the previous one is lost somehow by the server
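To illustrate the failure mode being described: if the predicate polled by wait_until_true issues a synchronous client call whose own timeout is roughly as long as the outer wait, the outer loop effectively gets a single attempt, because one lost request consumes the whole budget. A hypothetical sketch (timings and names are illustrative, not taken from the actual test):

    import time

    def show_port_blocking(port_id):
        # stand-in for a neutronclient call whose reply never arrives;
        # the client's own HTTP timeout is roughly 60 seconds
        time.sleep(60)
        raise TimeoutError('no response from neutron-server')

    def port_is_active(port_id):
        try:
            return show_port_blocking(port_id)['status'] == 'ACTIVE'
        except TimeoutError:
            return False

    # with a 60-second outer wait and a ~60-second blocking predicate,
    # the "polling" loop makes exactly one attempt before giving up
    deadline = time.time() + 60
    while time.time() < deadline:
        if port_is_active('fake-port-id'):
            break
        time.sleep(1)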
16:14:35 <jlibosva> I see
16:14:39 <ihrachys> also note that the message in the server log is 6s+ after the 60s
16:15:57 <ihrachys> oh and one more thing there
16:16:07 <ihrachys> there are two ports there (two fake machines)
16:16:20 <ihrachys> the first one is ACTIVE, and here is the message in the test log that gives the req-id: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_56_468
16:16:34 <ihrachys> but when you search for that id in the server log, you don't find it there at all
16:18:06 <ihrachys> but if you try to relate messages, the request is probably this: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_21_59_56_456
16:18:10 <ihrachys> note a different id
16:18:22 <jlibosva> I also noted http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_22_00_00_838
16:18:34 <jlibosva> which is 4 seconds after querying the API
16:18:43 <jlibosva> so maybe the ovs agent reported something in the meantime
16:19:51 <ihrachys> how is it possible that we have a different req-id on the server and the client?
16:20:15 <ihrachys> is there something in between proxying / overriding headers?
16:20:45 <ihrachys> anyway... I will dig more
16:20:57 <ihrachys> I probably should capture what we already have there
16:21:05 <ihrachys> next item was "slaweq to investigate / report a bug for test_dscp_marking_packets fullstack failure"
16:21:13 <ihrachys> I don't see slaweq around
16:21:40 <ihrachys> but I see this reported: https://bugs.launchpad.net/neutron/+bug/1733649
16:21:40 <openstack> Launchpad bug 1733649 in neutron "fullstack neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs.test_dscp_marking_packets(openflow-native) failure" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:22:07 <ihrachys> slaweq seems to be working on it
16:22:08 <jlibosva> back to the previous one - maybe when you don't provide a req-id it generates one?
16:22:19 <jlibosva> and what we see on the server comes from a previous call
16:23:03 <ihrachys> jlibosva, what do you mean? I believe the req-id logged in the test case log is generated, yes, but then it is still sent to the server
16:23:19 <jlibosva> I see curl used:
16:23:30 <jlibosva> http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_53_605
16:23:49 <jlibosva> 3 seconds before the neutronclient call with the req-id
16:24:24 <ihrachys> it still says "neutronclient.client" in the message
16:24:34 <ihrachys> so probably neutronclient uses curl under the hood?
16:24:50 <jlibosva> no, I think it uses a python library
16:25:03 <jlibosva> urllib or httplib or something
16:25:25 <ihrachys> ok, I will have a closer look
16:25:32 <ihrachys> I don't want to take all the time for this issue
16:25:35 <ihrachys> so let's move on
16:25:39 <jlibosva> sure
16:25:41 <jlibosva> sorry
16:25:46 <ihrachys> those were all the items from the previous meeting
16:25:55 <ihrachys> #topic Grafana
16:25:58 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:26:32 <ihrachys> I was actually hopeful that with the agent timeout fix the failure rate for fullstack would drop, but it seems like it didn't
16:26:54 <haleyb> no rainbows and unicorns :(
16:27:20 <ihrachys> it's at least not as flat as it was, but we have a long way to go
16:27:30 <ihrachys> we'll look at the latest failures later
16:28:16 <ihrachys> in the meantime, I will only note that nothing material changed for fullstack or scenarios.
16:28:24 <ihrachys> periodics seem ok too
16:28:31 <ihrachys> so let's dive into specifics
16:28:42 <ihrachys> #topic Fullstack
16:29:28 <ihrachys> for starters, we have Kuba's fix to backport; I'm working on the req-id / port not active issue; slaweq is looking at the qos dscp failure
16:29:40 <ihrachys> let's see if there is anything else that we don't know about in the latest logs
16:30:14 <ihrachys> I am looking at http://logs.openstack.org/72/522872/1/check/legacy-neutron-dsvm-fullstack/50dfd44/logs/testr_results.html.gz
16:30:48 <ihrachys> actually, the failure rate rebound for fullstack back to ~90% could be because of the dscp qos failure that slaweq suggested in LP is new
16:31:29 <jlibosva> is that with the timeout bump patch in?
16:31:42 <ihrachys> it is the results from your patch, so yes
16:31:47 <jlibosva> :(
16:31:55 <jlibosva> the env build still fails
16:31:57 <ihrachys> as we can see in the logs, a lot of failures are due to a timeout issue that is similar to what I am looking at
16:32:08 <ihrachys> where it fails waiting for a port to become ACTIVE
16:32:10 <jlibosva> oh, wait, no
16:32:38 <ihrachys> it could be either that the port genuinely is down, or an issue like mine where the server is not responsive
16:33:21 <ihrachys> also in this run, test_controller_timeout_does_not_break_connectivity_sigkill failed but with a slightly different error
16:33:24 <ihrachys> neutron.tests.common.machine_fixtures.FakeMachineException: No ICMP reply obtained from IP address 20.0.0.10
16:33:52 <ihrachys> which happens AFTER ports are validated to be ACTIVE
16:34:01 <jlibosva> shall we mark the test_connectivity tests as unstable with the bug that you're looking at?
16:34:10 <jlibosva> and qos with slaweq's bug
16:34:12 <ihrachys> so it must be a different issue
16:34:25 <ihrachys> jlibosva, probably. I will post a patch.
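For reference, marking a test as unstable is done with a decorator that turns a failure of the test body into a skip pointing at the tracking bug, which is also why it could not help earlier when the failure happened in the setup phase, before the decorated method runs. A rough sketch of how such a decorator works (not the exact neutron implementation; the usage example is hypothetical):

    import functools

    def unstable_test(reason):
        """Turn a failure of a known-flaky test into a skip citing the bug."""
        def decorator(f):
            @functools.wraps(f)
            def wrapper(self, *args, **kwargs):
                try:
                    return f(self, *args, **kwargs)
                except Exception as exc:
                    # failures in setUp happen before this wrapper runs,
                    # so only the test body itself is covered
                    self.skipTest('Marked unstable (%s), failed with: %s'
                                  % (reason, exc))
            return wrapper
        return decorator

    # hypothetical usage on a flaky connectivity test:
    # @unstable_test("bug 1673531")
    # def test_controller_timeout_does_not_break_connectivity_sigkill(self):
    #     ...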
16:34:49 <ihrachys> #action ihrachys to disable connectivity fullstack tests while we look for culprit
16:35:07 <ihrachys> #action ihrachys to disable dscp qos fullstack test while we look for culprit
16:35:43 <ihrachys> there are two more failures that don't fall into that set
16:35:56 <ihrachys> test_l2_agent_restart with AssertionError: False is not true in self.assertTrue(all([r.done() for r in restarts]))
16:36:05 <ihrachys> and test_securitygroup(linuxbridge-iptables) with RuntimeError: Process ['ncat', u'20.0.0.11', '3333', '-w', '20'] hasn't been spawned in 20 seconds
16:36:38 <jlibosva> I can have a look at the netcat issue, I hope it won't be related to the linuxbridge agent :)
16:37:16 <ihrachys> #action jlibosva to look at test_securitygroup(linuxbridge-iptables) failure in fullstack
16:37:49 <ihrachys> any candidates to look at test_l2_agent_restart ?
16:39:36 <jlibosva> I can also pick that one if nobody wants it
16:39:39 <ihrachys> :)
16:39:43 <ihrachys> sorry jlibosva
16:39:46 <ihrachys> we love you :)
16:39:51 <jlibosva> I have a feeling it will be related to slow agent starts
16:40:04 <ihrachys> #action jlibosva to look at test_l2_agent_restart fullstack failure
16:40:09 <jlibosva> as I observed it in the env build-up issue
16:40:29 * jlibosva loves being loved
16:40:38 <ihrachys> that restart test afair brutally restarts agents. if it's slow, we could hit it in one of the restart cycles
16:41:13 <ihrachys> though bumping the timeout to 60 sec for each iteration there is maybe not the best path
16:41:29 <ihrachys> ok, seems like we have work to do for fullstack
16:41:32 <ihrachys> #topic Scenarios
16:42:00 <ihrachys> we have old bugs
16:42:01 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1717302
16:42:01 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed]
16:42:05 <ihrachys> and https://bugs.launchpad.net/neutron/+bug/1719711
16:42:05 <openstack> Launchpad bug 1719711 in neutron "iptables failed to apply when binding a port with AGENT.debug_iptables_rules enabled" [High,In progress] - Assigned to Dr. Jens Harbott (j-harbott)
16:42:28 <ihrachys> for the latter, there seems to be a fix!
16:42:32 <ihrachys> here https://review.openstack.org/#/c/523319/
16:43:02 <ihrachys> everyone please review after the meeting :)
16:43:14 <ihrachys> as for the dvr fip issue, mlavalle haleyb no news I believe?
16:43:42 <haleyb> ihrachys: i will need to look, nothing from swami yet
16:43:44 <mlavalle> ihrachys: we haven't heard from swami
16:44:03 <mlavalle> let's make sure we talk to him this week
16:44:14 <mlavalle> I think we will have to reassign that issue
16:44:21 <ihrachys> yeah... maybe he is swamped and we need someone else to have a look instead
16:44:27 <mlavalle> I'll talk to him
16:44:31 <ihrachys> right, it's not moving forward. thanks!
16:45:03 <ihrachys> #topic Tempest plugin
16:45:15 <ihrachys> so the etherpad is https://etherpad.openstack.org/p/neutron-tempest-plugin-job-move
16:45:31 <ihrachys> and we have some items there still in progress
16:45:36 <ihrachys> actually quite a lot :)
16:45:48 <ihrachys> one thing that blocks us is that we still have legacy jobs in master
16:45:58 <ihrachys> so we can't e.g. remove tempest test classes from the neutron tree
16:46:04 <ihrachys> I believe mlavalle was looking at it
16:46:06 <ihrachys> mlavalle, any news?
16:46:22 <mlavalle> well, I am looking at it for stable branches
16:46:48 <mlavalle> I pushed this https://review.openstack.org/#/c/522931 over the weekend
16:46:51 <jlibosva> are we going to migrate stable branches too?
16:46:54 <jlibosva> I thought not
16:46:56 <ihrachys> mlavalle, so your order would be: move jobs to stable, then remove from infra?
16:47:16 <jlibosva> ah, sorry. it's about legacy jobs. ignore me :)
16:47:19 <ihrachys> jlibosva, projects move their jobs, including to stable. I saw others doing it, there is an infra request.
16:47:21 <mlavalle> yes
16:47:32 <ihrachys> jlibosva, after we move to stable, they are able to clean them up
16:47:35 <ihrachys> in their repos
16:47:40 <mlavalle> I can also move master
16:47:52 <mlavalle> it's only that I thought someone else was doing it
16:47:57 <jlibosva> I apologize, I thought you meant adopting stable branches for the neutron plugin :)
16:47:59 <ihrachys> mlavalle, I was thinking, we could also have a small patch that makes the legacy jobs execute on stable only while we move the jobs?
16:48:09 <ihrachys> mlavalle, that would be a simple fix and would unblock cleanup in the neutron tree
16:48:32 <ihrachys> mlavalle, master doesn't need a move, it needs removal since we already have new jobs there
16:48:46 <mlavalle> yeah, that's what I thought
16:48:55 <ihrachys> mlavalle, I would imagine we could do it with a regex against stable.* for the branch for the legacy jobs?
16:49:22 <mlavalle> yeah that sounds right
16:50:35 <ihrachys> mlavalle, do you want to cover the regex yourself or should I do it?
16:50:48 <mlavalle> please take care of that
16:51:09 <ihrachys> ok
16:51:19 <ihrachys> #action ihrachys to disable legacy jobs for neutron master
16:51:26 <mlavalle> I will continue with the patch for Pike ^^^^
16:51:35 <mlavalle> and also another one for Ocata
16:51:39 <ihrachys> great
16:51:49 <ihrachys> for ocata you may just backport once done with pike
16:51:52 <ihrachys> it should work
16:52:00 <mlavalle> yeap, that's my plan
16:52:21 <ihrachys> oh, so I looked at the patch for pike, and I see you move all tempest jobs, not just those for the tempest plugin?
16:52:37 <mlavalle> I moved all the legacy ones
16:52:44 <ihrachys> I was under the impression you were covering plugin jobs only
16:52:55 <ihrachys> if not, I don't think anyone is looking at a master patch to do the same
16:53:15 <mlavalle> ok, I can do a similar thing for master
16:53:26 <ihrachys> and for those, it could make sense to start in master
16:53:27 <ihrachys> yeah
16:53:30 <ihrachys> thanks!
16:54:18 <ihrachys> I also had a small fix to skip the new jobs for doc/* changes: https://review.openstack.org/#/c/523244/
16:54:25 <ihrachys> spotted them being executed on a doc-only change in the neutron repo
16:55:19 <mlavalle> pushed it in
16:55:34 * ihrachys bows
16:55:59 <ihrachys> and once the legacy jobs are gone in master, we can push https://review.openstack.org/#/c/506672/
16:56:25 <mlavalle> yeah
16:56:39 <ihrachys> ok, I think that mostly covers the next steps for the plugin repo
16:56:46 <ihrachys> #topic Open discussion
16:56:46 <mlavalle> ++
16:56:50 <ihrachys> anything to bring up?
16:57:26 <ihrachys> seems like no! well then, you have 3 mins back!
16:57:30 <ihrachys> enjoy!
16:57:33 <ihrachys> #endmeeting