16:01:18 #startmeeting neutron_ci
16:01:18 Meeting started Tue Jan 16 16:01:18 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:19 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:21 The meeting name has been set to 'neutron_ci'
16:01:24 o/
16:01:28 hi
16:01:58 heya
16:02:02 #topic Actions from prev meeting
16:02:03 hi
16:02:30 (I think the first one that was on me since it's overridden by another AI)
16:02:32 next is "mlavalle to follow up with stadium projects on switching to new tempest repo"
16:02:50 I did follow up with vpnaas and midonet
16:03:03 they are on board with the move
16:03:30 eh wait :)
16:03:36 are we talking about the same thing?
16:03:38 I think the patches from the gentleman leading this have to be updated
16:03:52 oh ok we ARE talking about the same thing
16:04:07 I figured the AI was worded ambiguously :)
16:04:17 LOL
16:04:18 o/
16:04:36 I made sure yamamoto and hoangcx keep an eye on the respective patches
16:04:38 for the record, we are talking about switching imports in other stadium repos to neutron-tempest-plugin, then killing the remaining in-tree tempest bits in neutron.tests.tempest
16:04:48 correct
16:04:53 mlavalle, keep an eye or follow up / update?
16:05:06 I am not sure chandankumar is going to follow up on those patches
16:05:39 ok, do they have to continue with the respective patches
16:05:41 ?
16:05:55 I was thinking that yes, they would take over / drive to merge
16:06:08 ok, I will talk to them again tonight
16:06:18 from what I understand, chandankumar spearheads several initiatives and projects and may not have equal time for all patches
16:06:30 pl great
16:06:33 *ok great
16:07:04 cool, since I saw that chandankumar was pushing the patches I thought he was going to continue with them
16:07:11 #action mlavalle to follow up with stadium projects on taking over / merging patches to switch imports to new tempest repo
16:07:35 oh if he was active recently that's another story. I haven't seen it, probably happened in the last week.
16:07:36 will do
16:07:52 ok next was "ihrachys to report bug for dvr scenario job timeouts and try concurrency increase"
16:08:08 nevermind. I'll have the stadium guys take over
16:08:24 we merged https://review.openstack.org/#/c/532261/ ; the bug report was https://launchpad.net/bugs/1742200
16:08:25 Launchpad bug 1742200 in neutron "dvr scenario job fails with timeouts" [High,Fix released]
16:08:42 we will revisit whether it helped in the grafana section
16:08:50 "ihrachys to report bug for mtu scenario not executed for linuxbridge job"
16:09:15 I reported https://bugs.launchpad.net/neutron/+bug/1742197
16:09:15 Launchpad bug 1742197 in neutron "linuxbridge scenario job doesn't execute NetworkMtuBaseTest" [Medium,Confirmed] - Assigned to Dongcan Ye (hellochosen)
16:09:38 and there is already a fix in review for that: https://review.openstack.org/#/c/532406/ though it relies on the plugin supporting the mtu-writable extension
16:10:06 I also tried https://review.openstack.org/#/c/532259/ but figured that the vlan type driver is not working in the gate
16:10:19 so for linuxbridge we are left with flat and vxlan to pick from
16:10:33 (well, I *assume* flat works, I haven't checked)
16:11:03 in light of this, I guess the mtu-writable approach is the best
16:11:34 ok, the last item was "slaweq to take over sec group failure in fullstack (report bug / triage / fix)"
16:11:46 that was in regards to the failure as in http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
16:11:55 I only reported it on launchpad: https://bugs.launchpad.net/neutron/+bug/1742401
16:11:56 Launchpad bug 1742401 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:12:02 I didn't have time to work on it
16:12:25 because I was fixing https://bugs.launchpad.net/neutron/+bug/1737892
16:12:27 Launchpad bug 1737892 in neutron "Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:12:36 I hope I will work on this issue with SG this week
16:12:45 great! thanks for taking it.
16:13:04 we'll have a look at fullstack patches after briefly checking grafana
16:13:07 #topic Grafana
16:13:12 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:54 reminder that we switched to non-legacy job names on the board where applicable: https://review.openstack.org/#/c/532632/
16:14:05 we need to check whether all boards still work as intended
16:14:42 there are several boards there right now (coverage, grenade) that don't have any datapoints
16:14:46 all for the gate queue
16:14:59 I suspect it's because no failure hits happened for those jobs
16:15:19 that's usually the reason
16:15:38 yeah, I think that's a reasonable explanation for coverage
16:15:52 for grenade, it wouldn't be the case but the update for the dashboard is quite fresh
16:16:02 I mean, tempest / grenade fail from time to time
16:16:17 so long term we should expect some failing data points
16:16:29 so I guess let's give it a week or two before paniching
16:16:35 *panicking
16:16:54 in the meantime, we have the check queue for grenade jobs so we can still spot issues
16:18:32 looking at other dashboards, fullstack is steady 80%, scenario dvr is 30% (I think we had it like that before)
16:18:38 linuxbridge scenario spiked to 100%
16:19:00 and I found out it's a recent logapi service plugin patch that broke it
16:19:16 because we now enable the log l2 extension for linuxbridge despite it not being compatible with the agent
16:19:20 I have a fix here: https://review.openstack.org/#/c/533799/
16:19:59 slaweq, it now has ci results so maybe you can bump your vote
16:20:09 ihrachys: I just +W
16:20:24 there is still more work for the bug, but I will let others fix that
16:20:38 I left the bug open
16:20:45 slaweq, thanks
16:21:06 I also noticed there is a spike in failure rate for functional
16:21:10 both check and gate queues
16:21:24 30% in gate, 40% in check
16:21:35 anyone aware of what that's about?
16:22:07 perhaps the AddBridge issue?
16:22:12 isn't it related to this issue with connections to ovs?
16:22:13 I saw a functional test stuck today again
16:22:23 but I haven't checked how often it happens
16:22:38 jlibosva, you mean https://bugs.launchpad.net/neutron/+bug/1741889 ?
16:22:39 Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [High,New]
16:22:43 yep
16:22:47 slaweq, issue with connections? what's that?
16:23:02 this one which you just posted
16:23:05 sorry
16:23:17 I promised today I'll ask otherwiseguy for help, I haven't got to it yet
16:23:30 slaweq, oh I see you referred to ovs log messages there with connection failures
16:23:36 yes
16:23:41 but it's this one, right?
16:23:48 or did I mix something up?
16:24:03 jlibosva, it would be nice to get otherwiseguy on it because it's really painful and we don't seem to have a path forward. there was another bug that he was going to look into.
16:24:10 https://bugs.launchpad.net/neutron/+bug/1687074
16:24:10 Launchpad bug 1687074 in neutron "Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"" [High,Confirmed]
16:24:30 I think we usually spot those stream errors in tests that fail with retries
16:24:35 so it can be related
16:24:42 slaweq, yeah I think it's this one
16:24:49 :)
16:24:54 thx for confirmation
16:25:05 I think there was a patch for the stream issue somewhere..
16:25:10 looking
16:25:36 ok it was not really a fix, just debug statements: https://review.openstack.org/#/c/525775/4/neutron/agent/common/ovs_lib.py
16:26:14 I think Terry triggered a failure with the patch and got ": RuntimeError: Port nonexistantport does not exist" from the ovs socket.
16:26:17 which is very weird
16:26:27 it's definitely a neutron error
16:26:42 so how is it possible it ends up in the ovsdb socket
16:26:52 maybe eventlet screwing up file descriptors somehow
16:28:21 well actually, maybe I assume too much saying it's a neutron issue
16:28:27 I mean, neutron error, sorry
16:29:15 but anyway, jlibosva is going to reach out to Terry. please mention both bugs since they are arguably related.
16:29:31 yep, will do
16:30:15 #action jlibosva to talk to otherwiseguy about making progress for functional ovsdb issues (timeouts, retries, stream parsing failures)
16:30:19 thanks!
16:30:46 jlibosva, you mentioned a functional job being stuck. I believe you mean http://logs.openstack.org/99/533799/1/check/neutron-functional/990c856/
16:30:57 do you think it can be related to ovsdb issues, or is there more to it?
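[Editor's note: the "eventlet screwing up file descriptors" theory above — that bytes from an unrelated writer end up in the ovsdb connection — can be illustrated with a toy sketch. This is not Neutron or ovsdbapp code; it just shows that when two writers share one socket without framing or locking, the peer sees an unparseable stream, which is the shape of the "error parsing stream" symptom.]

```python
import json
import socket

# Toy illustration: one socket, two logical "senders". Sender A's JSON
# frame is split in half and sender B's stray bytes land in the middle,
# roughly what mixed-up file descriptors would look like to a parser
# on the other end of the connection.
a, b = socket.socketpair()

msg_a = json.dumps({"method": "transact"}).encode()
msg_b = b"RuntimeError: Port nonexistantport does not exist"

a.sendall(msg_a[: len(msg_a) // 2])   # first half of a valid JSON frame
a.sendall(msg_b)                      # interleaved bytes from another writer
a.sendall(msg_a[len(msg_a) // 2:])    # rest of the JSON frame
a.close()

stream = b""
while chunk := b.recv(4096):
    stream += chunk

try:
    json.loads(stream)
    print("parsed ok")
except ValueError:
    print("error parsing stream")  # what the peer would report
```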
16:31:32 yep, that's it http://logs.openstack.org/99/533799/1/check/neutron-functional/990c856/job-output.txt.gz#_2018-01-15_23_30_49_840378
16:31:37 I mean, that's what I meant
16:32:21 so it's just a random kill
16:32:28 last job I checked, there was a 1.5GB large log file containing the output of one particular test and it was full of TRY_AGAIN messages
16:32:44 jlibosva, aha that's a good sign that it's the same issue
16:32:44 now I wonder why the per-test timeout wasn't triggered
16:32:54 right
16:33:02 maybe it didn't switch greenthreads
16:33:13 or we have an issue there
16:34:09 yeah. does anyone have cycles to check why the job timed out? or do we just leave it as is while we figure out the ovsdb issue
16:34:49 ok let's leave it for now and revisit if it happens more
16:34:59 #topic Fullstack
16:35:05 I can check it
16:35:12 well ok :)
16:35:19 I believe it will be the TRY_AGAIN :)
16:35:56 #action jlibosva to check why the functional job times out globally instead of triggering a local test case timeout for the TRY_AGAIN ovsdb issue
16:36:31 jlibosva, a lot of pointers to it maybe being a case of eventlet running amok
16:36:35 anyway, back to fullstack
16:36:49 slaweq has been doing an amazing job lately tackling the failures we have one by one
16:37:03 but slowly :)
16:37:13 the latest fix for the qos dscp bug is https://review.openstack.org/533318
16:37:24 and it's already +W, just needs to get merged
16:38:08 there was a concern during review that the l2 extension manager may bubble up exceptions from extensions
16:38:30 how do we tackle that? I guess at the least reporting a low priority bug?
16:38:41 anyone willing to chew on it beyond that?
16:38:48 I can report it
16:38:53 go for it!
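[Editor's note: the l2 extension manager concern above is that an exception raised by one extension's handler escapes the manager and aborts processing for every other extension, or the agent loop itself. A minimal sketch of the kind of guard being suggested follows — all class and method names here are hypothetical, not the actual Neutron extension manager API.]

```python
import logging

LOG = logging.getLogger(__name__)


class AgentExtensionsManager:
    """Toy stand-in for an l2 agent extension manager that isolates
    failures so one broken extension cannot take down the rest."""

    def __init__(self, extensions):
        self.extensions = extensions

    def handle_port(self, context, port):
        for ext in self.extensions:
            try:
                ext.handle_port(context, port)
            except Exception:
                # Log and continue: the remaining extensions still run
                # and the agent loop survives.
                LOG.exception("Extension %s failed to handle port %s",
                              ext.__class__.__name__, port)


class GoodExt:
    def __init__(self):
        self.seen = []

    def handle_port(self, context, port):
        self.seen.append(port)


class BrokenExt:
    def handle_port(self, context, port):
        raise RuntimeError("boom")


good = GoodExt()
mgr = AgentExtensionsManager([BrokenExt(), good])
mgr.handle_port(None, "port-1")  # BrokenExt raises, but is contained
print(good.seen)                 # prints ['port-1']
```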
16:39:00 but I don't know if I will have time to propose a fix soon
16:39:10 #action slaweq to report a bug about l2 ext manager not catching exceptions from extensions
16:39:12 ok, I will report it today
16:39:13 that's fine
16:39:33 just don't assign yourself if you don't plan to work on a fix, and someone should pick it up eventually
16:39:42 sure
16:40:04 ok. apart from this fix, do we have anything in review that would help the job?
16:40:09 I want to mention that the patch for https://bugs.launchpad.net/neutron/+bug/1733649 was also merged a few days ago
16:40:10 Launchpad bug 1733649 in neutron "fullstack neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs.test_dscp_marking_packets(openflow-native) failure" [High,Fix released] - Assigned to Gary Kotton (garyk)
16:40:20 so I hope it will be better now
16:40:50 I will check it for a few days and will remove the "unstable_test" decorator if all is fine
16:41:06 fullstack is still steady 80% so I guess other issues fail frequently enough to mask the fix's impact
16:41:06 and I will also propose this fix for stable branches then
16:41:34 slaweq, oh you mean the test that was ignored. ok.
16:41:44 ihrachys: yes
16:42:11 ok taking one of the fresh runs for fullstack: http://logs.openstack.org/99/533799/1/check/neutron-fullstack/f827430/logs/testr_results.html.gz
16:42:27 I noticed this agent restart test started failing too often in the job
16:42:31 often it's the only failure
16:42:42 so I suspect tackling this one should drop the failure rate significantly
16:42:57 I don't think we ever reported a bug for the failure
16:44:18 uhm, shall we mark it as unstable to get a better failure rate? :)
16:44:46 maybe. after reporting a bug at least
16:44:53 I checked ovs agent logs for the test
16:45:10 the only suspicious one is this agent (there are 4 logs total there): http://logs.openstack.org/99/533799/1/check/neutron-fullstack/f827430/logs/dsvm-fullstack-logs/TestUninterruptedConnectivityOnL2AgentRestart.test_l2_agent_restart_OVS,VLANs,openflow-native_/neutron-openvswitch-agent--2018-01-15--21-43-21-502569.txt.gz?level=WARNING
16:45:15 2018-01-15 21:45:49.869 28570 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) error Datapath Invalid 90269446452558
16:45:29 and then later
16:45:29 2018-01-15 21:45:49.942 28570 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-f3915075-576d-4155-86c3-c2272e53a1d8 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
16:47:11 there are some log messages in the ovsdb server log like: 2018-01-15T21:45:33.394Z|00012|reconnect|ERR|tcp:127.0.0.1:59162: no response to inactivity probe after 5 seconds, disconnecting
16:47:20 what's the probe?
16:47:33 is it ovsdb server talking to ryu?
16:47:44 like checking if the controller is still there?
16:47:58 I think that's ovsdb talking to the manager
16:48:22 or rather vice-versa
16:48:25 this ovsdb log is very similar to the one in the stuck functional tests IMHO
16:48:30 maybe it's related somehow?
16:48:33 it's more related to the native ovsdb interface than openflow
16:48:38 slaweq, maybe it's always like that :)
16:48:44 jlibosva, yeah sorry, mixed things up
16:48:47 maybe, I don't know
16:49:05 anyway... I will report a bug for that. I may dedicate some time to dig into it too.
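[Editor's note: on the ovsdb behavior seen in both the stuck functional tests and this fullstack failure — a retry loop for a transient "try again" condition that has no deadline can spin until the job-level timeout kills the whole run, instead of failing just the one test. A hedged sketch of the safer pattern follows; the names (`TryAgain`, `run_with_deadline`) are made up for illustration and are not ovsdbapp's actual API.]

```python
import time


class TryAgain(Exception):
    """Stand-in for a transient ovsdb 'try again' condition."""


def run_with_deadline(op, deadline=10.0, pause=0.1):
    """Retry a transient operation, but give up after `deadline` seconds
    so a wedged connection surfaces as a single test failure rather than
    spinning until the global job timeout kills the worker."""
    start = time.monotonic()
    while True:
        try:
            return op()
        except TryAgain:
            if time.monotonic() - start > deadline:
                raise TimeoutError("still TRY_AGAIN after deadline")
            time.sleep(pause)  # also yields control in an eventlet world


# Usage: an operation that succeeds on the third attempt.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TryAgain()
    return "ok"

print(run_with_deadline(flaky, deadline=5.0, pause=0.01))  # prints ok
```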
16:49:18 #action ihrachys to report bug for l2 agent restart fullstack failures and dig into it
16:50:16 #topic Scenarios
16:50:31 haleyb, I saw some activity on dvr bugs that linger in the scenario jobs. any updates?
16:51:33 or was it something else... I think I saw Swami posting something but now I don't.
16:51:58 I guess I dreamt it
16:52:18 anyway, if anything this is the bug that has affected us for a while: https://bugs.launchpad.net/neutron/+bug/1717302
16:52:19 Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:52:32 it's on Brian but there seem to be no fixes or updates since Dec
16:53:13 if haleyb is not around, I can bring it up in the L3 meeting this coming Thursday
16:53:42 yeah please do
16:53:47 ++
16:54:14 #action mlavalle to bring up floating ip failures in dvr scenario job to l3 team
16:54:30 I am looking at a fresh run
16:54:30 http://logs.openstack.org/60/532460/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/bf95db0/logs/testr_results.html.gz
16:54:36 see the trunk test failed
16:55:07 I don't think it's reported
16:55:19 anyone up for the job of reporting it or maybe even having a look?
16:55:53 that might be fixed by https://review.openstack.org/#/c/531414/1
16:56:12 ack!
16:56:21 #topic Rally
16:56:42 we had https://bugs.launchpad.net/neutron/+bug/1741954 raised a while ago
16:56:43 Launchpad bug 1741954 in neutron "create_and_list_trunk_subports rally scenario failed with timeouts" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:56:57 this is mostly to update folks that we have a fix for that here: https://review.openstack.org/#/c/532045/
16:57:07 (well, it's kinda a fix, it's just an optimization)
16:57:27 I think we started hitting more timeouts and perf issues since the meltdown patching in the infra / clouds we use
16:57:36 but the patch should help one of those instance
16:57:41 *instances
16:57:43 #topic Gate
16:58:15 mlavalle, I recollect you were going to follow up on the remaining legacy- jobs in the neutron gate that we figured out are still present and not converted to the new format
16:58:21 any progress?
16:58:27 (like tempest py3)
16:58:33 didn't have time this past week
16:58:48 I will do it over the next few days
16:58:54 #action mlavalle to transition remaining legacy- jobs to new format
16:59:00 gotcha, that's ok, not pressing!
16:59:07 :-)
16:59:16 ok we are out of time
16:59:27 it's always pretty packed this meeting riiiight? :)
16:59:32 yeah
16:59:36 that's good
16:59:39 ok thanks everyone
16:59:43 #endmeeting