16:01:19 #startmeeting neutron_ci
16:01:20 Meeting started Tue Nov 28 16:01:19 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:23 The meeting name has been set to 'neutron_ci'
16:01:26 o/
16:01:27 o/
16:01:29 hi
16:01:46 hello folks
16:01:48 #topic Actions from prev meeting
16:02:04 as usual, we start with actions from the last meeting
16:02:06 first is "jlibosva to figure out why unstable_test didn't work for fullstack scenario case"
16:02:22 though afaiu we figured it out in the meeting
16:02:24 we figured that out in the last meeting itself
16:02:33 right, it was because it failed in the setup phase
16:02:42 and we decided to leave it as is
16:02:56 yes
16:03:06 plus your patch that bumped the timeout for waiting on agents in fullstack
16:03:17 right - https://review.openstack.org/#/c/522872/
16:03:18 I mean this one: https://review.openstack.org/#/c/522872/
16:03:20 ok
16:03:41 I guess we could backport it to reduce spurious failures on stable
16:03:49 next was "jlibosva to investigate / report a bug for env deployment failure in fullstack because of port down"
16:03:57 and that's basically the fix above
16:04:08 correct - bug here https://bugs.launchpad.net/neutron/+bug/1734357
16:04:08 Launchpad bug 1734357 in neutron "fullstack: Test runner doesn't wait enough time for env to come up" [High,Fix released] - Assigned to Jakub Libosvar (libosvar)
16:04:24 and I requested the backport to pike just now - https://review.openstack.org/#/c/523453/
16:04:29 jlibosva, so you observed up to 3 mins waiting on agents?
16:04:48 no, I observed 67 seconds, but I wanted to be safe so I tripled the value
16:05:04 it's active polling, so once the env is ready, the waiting loop stops
16:05:16 I see. why is it so slow though? is it because of high parallelism?
16:05:19 so in the worst case we wait three minutes, and only when there is a real issue
16:05:25 or would it be the same in a single thread?
16:05:41 I didn't investigate that; I saw it took 30 seconds for the ovs-agent to start logging
16:05:55 i.e. there was a 30-second gap between the test runner spawning the process and the process actually doing something
16:06:39 so it sounds like a busy machine, but I haven't checked load and CPU usage stats
16:07:14 ok, I guess if it's something serious it will resurface eventually
16:07:35 the job already takes a lot of time. we won't be able to push the boundary indefinitely while adding new cases
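
(For context on the waiting mechanics discussed above, here is a minimal sketch of an active-polling wait, assuming a generic predicate; the helper and the commented-out usage are illustrative and are not neutron's actual wait_until_true implementation.)

    import time

    def wait_until_true(predicate, timeout=180, sleep=1):
        """Poll predicate() until it returns True or the timeout expires.

        Active polling: the loop exits as soon as the condition holds, so a
        generous timeout (e.g. the tripled ~3-minute value discussed above)
        is only fully spent when something is genuinely wrong.
        """
        deadline = time.time() + timeout
        while not predicate():
            if time.time() > deadline:
                raise RuntimeError('condition not met in %d seconds' % timeout)
            time.sleep(sleep)

    # Hypothetical usage: block until all fullstack agents report as up.
    # wait_until_true(lambda: env_is_ready(environment), timeout=180)
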
16:07:49 ok, moving on
16:07:52 "ihrachys to investigate latest https://bugs.launchpad.net/neutron/+bug/1673531 failures"
16:07:52 Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [High,Confirmed] - Assigned to Ihar Hrachyshka (ihar-hrachyshka)
16:07:57 I had a look at the logs
16:08:22 so the test case fails while polling the port of the second fake machine for ACTIVE
16:08:30 here is the first attempt: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_56_469
16:08:45 actually, that attempt is the only one, despite us using wait_until_true
16:09:02 and this is because for some reason neutron-server hung in the middle of processing the request
16:09:14 here is where the server eventually gives up: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_22_01_03_475
16:09:23 probably triggered by the DELETE sent during cleanup of the test case
16:09:43 note that those are the only two messages with that req-id in the server log
16:10:09 not sure what to make of it yet
16:10:31 usually we have some messages from when the request arrives
16:10:42 it could be, though, that it didn't arrive for some reason
16:11:26 and then we see it e.g. repeated in the background / TCP connectivity to the server finally recovered, but it's too late
16:12:16 so you're saying that wait_until_true doesn't actually poll?
16:13:15 jlibosva, maybe it's blocked because it waits for the reply to its request
16:13:22 we wait for 60 seconds there
16:13:28 and after that it bails out
16:13:46 and the client apparently also waits 60s+ for a reply
16:13:58 so we never get a chance to actually retry with a new HTTP request
16:14:15 if, for example, the previous one is somehow lost by the server
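
(To make the timing argument above concrete: if the polled predicate wraps a client call that itself blocks for roughly the same 60 seconds as the outer poll budget, the first attempt consumes the whole budget and no second HTTP request is ever sent. A sketch under that assumption, reusing the wait_until_true sketch earlier; client.show_port stands in for the real neutronclient call.)

    def port_is_active(client, port_id):
        # This call can block for the client's full HTTP timeout (~60s in the
        # scenario above) if neutron-server hangs mid-request.
        return client.show_port(port_id)['port']['status'] == 'ACTIVE'

    # With a 60s outer budget and a ~60s blocking first call, the outer poll
    # times out after a single HTTP request - there is never a retry:
    # wait_until_true(lambda: port_is_active(client, port_id), timeout=60)
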
16:14:35 I see
16:14:39 also note that the message in the server log comes 6+ seconds after the 60s mark
16:15:57 oh, and one more thing there
16:16:07 there are two ports there (two fake machines)
16:16:20 the first one is ACTIVE, and here is the message in the test log that gives the req-id: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_56_468
16:16:34 but when you search for that id in the server log, it's not there at all
16:18:06 but if you try to relate messages, the request is probably this: http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_21_59_56_456
16:18:10 note the different id
16:18:22 I also noted http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_/neutron-server--2017-11-20--21-57-14-411163.txt.gz#_2017-11-20_22_00_00_838
16:18:34 which is 4 seconds after querying the API
16:18:43 so maybe the ovs agent reported something in the meantime
16:19:51 how is it possible that we have different req-ids on the server and the client?
16:20:15 is there something in between proxying / overriding headers?
16:20:45 anyway... I will dig more
16:20:57 I should probably capture what we already have there
16:21:05 next item was "slaweq to investigate / report a bug for test_dscp_marking_packets fullstack failure"
16:21:13 I don't see slaweq around
16:21:40 but I see this reported: https://bugs.launchpad.net/neutron/+bug/1733649
16:21:40 Launchpad bug 1733649 in neutron "fullstack neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs.test_dscp_marking_packets(openflow-native) failure" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:22:07 slaweq seems to be working on it
16:22:08 back to the previous topic - maybe when you don't provide a req-id, the server generates one?
16:22:19 and what we see on the server comes from a previous call
16:23:03 jlibosva, what do you mean? I believe the req-id logged in the test case log is generated, yes, but then it is still sent to the server
16:23:19 I see curl used:
16:23:30 http://logs.openstack.org/71/520371/7/check/legacy-neutron-dsvm-fullstack/ad585a2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_.txt.gz#_2017-11-20_21_59_53_605
16:23:49 3 seconds before the neutronclient call with the req-id
16:24:24 it still says "neutronclient.client" in the message
16:24:34 so probably neutronclient uses curl under the hood?
16:24:50 no, I think it uses a python library
16:25:03 urllib or httplib or something
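
(The request-id hypothesis raised above - that the server assigns its own id when the client's id never arrives or isn't honoured - can be sketched roughly as below. This only illustrates the idea; it is not the actual oslo middleware or neutronclient code.)

    import uuid

    def server_side_request_id(inbound_req_id=None):
        # If the client-generated id never reaches the server (or is not
        # honoured), the server logs a freshly generated one, so the client
        # and server logs show different req- ids for the same call.
        return inbound_req_id or 'req-' + str(uuid.uuid4())
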
16:25:25 ok, I will have a closer look
16:25:32 I don't want to take all the meeting time with this issue
16:25:35 so let's move on
16:25:39 sure
16:25:41 sorry
16:25:46 those were all the items from the previous meeting
16:25:55 #topic Grafana
16:25:58 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:26:32 I was actually hopeful that with the agent timeout fix the failure rate for fullstack would drop, but it seems it didn't
16:26:54 no rainbows and unicorns :(
16:27:20 it's at least not as flat as it was, but we have a long way to go
16:27:30 we'll look at the latest failures later
16:28:16 in the meantime, I will only note that nothing material changed for fullstack or scenarios.
16:28:24 periodics seem ok too
16:28:31 so let's dive into specifics
16:28:42 #topic Fullstack
16:29:28 for starters, we have Kuba's fix to backport; I'm working on the req-id / port-not-ACTIVE issue, and slaweq is looking at the qos dscp failure
16:29:40 let's see if there is anything else we don't know about in the latest logs
16:30:14 I am looking at http://logs.openstack.org/72/522872/1/check/legacy-neutron-dsvm-fullstack/50dfd44/logs/testr_results.html.gz
16:30:48 actually, the fullstack failure rate rebounding back to ~90% could be because of the dscp qos failure, which slaweq suggested in LP is new
16:31:29 is that with the timeout bump patch in?
16:31:42 those are results from your patch, so yes
16:31:47 :(
16:31:55 the env build still fails
16:31:57 as we can see in the logs, a lot of failures are due to a timeout issue similar to the one I am looking at
16:32:08 where it fails waiting for a port to become ACTIVE
16:32:10 oh, wait, no
16:32:38 it could be that either the port is genuinely down, or it's an issue like mine where the server is not responsive
16:33:21 also in this run, test_controller_timeout_does_not_break_connectivity_sigkill failed, but with a slightly different error
16:33:24 neutron.tests.common.machine_fixtures.FakeMachineException: No ICMP reply obtained from IP address 20.0.0.10
16:33:52 which happens AFTER ports are validated to be ACTIVE
16:34:01 shall we mark the test_connectivity tests as unstable with the bug that you're looking at?
16:34:10 and qos with slaweq's bug
16:34:12 so it must be a different issue
16:34:25 jlibosva, probably. I will post a patch.
16:34:49 #action ihrachys to disable connectivity fullstack tests while we look for culprit
16:35:07 #action ihrachys to disable dscp qos fullstack test while we look for culprit
16:35:43 there are two more failures that don't fall into that set
16:35:56 test_l2_agent_restart with AssertionError: False is not true in self.assertTrue(all([r.done() for r in restarts]))
16:36:05 and test_securitygroup(linuxbridge-iptables) with RuntimeError: Process ['ncat', u'20.0.0.11', '3333', '-w', '20'] hasn't been spawned in 20 seconds
16:36:38 I can have a look at the netcat issue; I hope it won't be related to the linuxbridge agent :)
16:37:16 #action jlibosva to look at test_securitygroup(linuxbridge-iptables) failure in fullstack
16:37:49 any candidates to look at test_l2_agent_restart ?
16:39:36 I can also pick that one if nobody wants it
16:39:39 :)
16:39:43 sorry jlibosva
16:39:46 we love you :)
16:39:51 I have a feeling it will be related to slow agent starts
16:40:04 #action jlibosva to look at test_l2_agent_restart fullstack failure
16:40:09 as I observed that in the env build-up issue
16:40:29 * jlibosva loves being loved
16:40:38 that restart test, afair, brutally restarts the agents. if startup is slow, we could hit it in one of the restart cycles
16:41:13 though bumping the timeout to 60 sec for each iteration there is maybe not the best path
16:41:29 ok, seems like we have work to do for fullstack
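
(On the two #action items above: the usual way to disable a flaky test in the neutron tree is the unstable_test decorator mentioned at the start of the meeting, which turns a failure into a skip referencing the bug - with the caveat, also noted earlier, that it does not help when the failure happens in the setup phase. A rough sketch; the exact import path and test class here are assumptions.)

    from neutron.tests import base

    class TestOvsConnectivity(base.BaseTestCase):

        @base.unstable_test("bug 1673531")
        def test_controller_timeout_does_not_break_connectivity_sigkill(self):
            # On failure this is reported as a skip pointing at the bug,
            # keeping the gate usable while the root cause is investigated.
            pass
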
16:41:32 #topic Scenarios
16:42:00 we have old bugs
16:42:01 https://bugs.launchpad.net/neutron/+bug/1717302
16:42:01 Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed]
16:42:05 and https://bugs.launchpad.net/neutron/+bug/1719711
16:42:05 Launchpad bug 1719711 in neutron "iptables failed to apply when binding a port with AGENT.debug_iptables_rules enabled" [High,In progress] - Assigned to Dr. Jens Harbott (j-harbott)
16:42:28 for the latter, there seems to be a fix!
16:42:32 here https://review.openstack.org/#/c/523319/
16:43:02 everyone please review after the meeting :)
16:43:14 as for the dvr fip issue, mlavalle, haleyb, no news I believe?
16:43:42 ihrachys: i will need to look, nothing from swami yet
16:43:44 ihrachys: we haven't heard from swami
16:44:03 let's make sure we talk to him this week
16:44:14 I think we will have to reassign that issue
16:44:21 yeah... maybe he is swamped and we need someone else to have a look instead
16:44:27 I'll talk to him
16:44:31 right, it's not moving forward. thanks!
16:45:03 #topic Tempest plugin
16:45:15 so the etherpad is https://etherpad.openstack.org/p/neutron-tempest-plugin-job-move
16:45:31 and we have some items there still in progress
16:45:36 actually quite a lot :)
16:45:48 one thing that blocks us is that we still have legacy jobs in master
16:45:58 so we can't e.g. remove the tempest test classes from the neutron tree
16:46:04 I believe mlavalle was looking at it
16:46:06 mlavalle, any news?
16:46:22 well, I am looking at it for stable branches
16:46:48 I pushed this https://review.openstack.org/#/c/522931 over the weekend
16:46:51 are we going to migrate stable branches too?
16:46:54 I thought not
16:46:56 mlavalle, so your order would be: move jobs to stable, then remove them from infra?
16:47:16 ah, sorry. it's about legacy jobs. ignore me :)
16:47:19 jlibosva, projects are moving their jobs, including to stable. I saw others doing it; there is an infra request.
16:47:21 yes
16:47:32 jlibosva, after we move them to stable, they are able to clean them up
16:47:35 in their repos
16:47:40 I can also move master
16:47:52 it's only that I thought someone else was doing it
16:47:57 I apologize, I thought you meant adopting stable branches for the neutron plugin :)
16:47:59 mlavalle, I was thinking, we could also have a small patch that makes the legacy jobs run on stable only while we move the jobs?
16:48:09 mlavalle, that would be a simple fix and would unblock cleanup in the neutron tree
16:48:32 mlavalle, master doesn't need a move, it needs removal, since we already have the new jobs there
16:48:46 yeah, that's what I thought
16:48:55 mlavalle, I would imagine we could do it with a regex against stable.* for the branch for the legacy jobs?
16:49:22 yeah, that sounds right
16:50:35 mlavalle, do you want to cover the regex yourself or should I do it?
16:50:48 please take care of that
16:51:09 ok
16:51:19 #action ihrachys to disable legacy jobs for neutron master
16:51:26 I will continue with the patch for Pike ^^^^
16:51:35 and also another one for Ocata
16:51:39 great
16:51:49 for ocata you may just backport once done with pike
16:51:52 it should work
16:52:00 yep, that's my plan
16:52:21 oh, so I'm looking at the patch for pike, and I see you move all tempest jobs, not just those for the tempest plugin?
16:52:37 I moved all legacy jobs
16:52:44 I was under the impression you were covering plugin jobs only
16:52:55 if not, I don't think anyone is looking at a master patch to do the same
16:53:15 ok, I can do a similar thing for master
16:53:26 and for those, it could make sense to start in master
16:53:27 yeah
16:53:30 thanks!
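
(The "legacy jobs on stable only" idea boils down to a branch filter like the stable.* pattern mentioned above. In the job definitions it would be expressed as a branch matcher in the project config rather than Python; the snippet below only illustrates the pattern against example branch names.)

    import re

    # One plausible form of the stable.* branch pattern discussed above.
    LEGACY_JOB_BRANCHES = re.compile(r'^stable/.*')

    assert LEGACY_JOB_BRANCHES.match('stable/pike')
    assert LEGACY_JOB_BRANCHES.match('stable/ocata')
    assert not LEGACY_JOB_BRANCHES.match('master')
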
16:54:18 I also had a small fix to skip the new jobs for doc/* changes: https://review.openstack.org/#/c/523244/
16:54:25 I spotted them running on a doc-only change in the neutron repo
16:55:19 pushed it in
16:55:34 * ihrachys bows
16:55:59 and once the legacy jobs are gone in master, we can push https://review.openstack.org/#/c/506672/
16:56:25 yeah
16:56:39 ok, I think that mostly covers the next steps for the plugin repo
16:56:46 #topic Open discussion
16:56:46 ++
16:56:50 anything to bring up?
16:57:26 seems like no! well then, you have 3 mins back!
16:57:30 enjoy!
16:57:33 #endmeeting