16:01:18 <ihrachys> #startmeeting neutron_ci
16:01:18 <openstack> Meeting started Tue Jan 16 16:01:18 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:21 <openstack> The meeting name has been set to 'neutron_ci'
16:01:24 <mlavalle> o/
16:01:28 <slaweq> hi
16:01:58 <ihrachys> heya
16:02:02 <ihrachys> #topic Actions from prev meeting
16:02:03 <haleyb> hi
16:02:30 <ihrachys> (skipping the first one, which was on me, since it's superseded by another AI)
16:02:32 <ihrachys> next is "mlavalle to follow up with stadium projects on switching to new tempest repo"
16:02:50 <mlavalle> I did follow up with vpnaas and midonet
16:03:03 <mlavalle> they are on board with the move
16:03:30 <ihrachys> eh wait :)
16:03:36 <ihrachys> are we talking about the same thing?
16:03:38 <mlavalle> I think the patches from the gentleman leading this have to be updated
16:03:52 <ihrachys> oh ok, we ARE talking about the same thing
16:04:07 <ihrachys> I figured the AI was worded ambiguously :)
16:04:17 <mlavalle> LOL
16:04:18 <jlibosva> o/
16:04:36 <mlavalle> I made sure yamamoto and hoangcx keep an eye on the respective patches
16:04:38 <ihrachys> for the record, we are talking about switching imports in other stadium repos to neutron-tempest-plugin, then killing the remaining in-tree tempest bits in neutron.tests.tempest
16:04:48 <mlavalle> correct
16:04:53 <ihrachys> mlavalle, keep an eye or follow up / update?
16:05:06 <ihrachys> I am not sure chandankumar is going to follow up on those patches
16:05:39 <mlavalle> ok, do they have to continue with the respective patches?
16:05:55 <ihrachys> I was thinking that yes, they would take over / drive them to merge
16:06:08 <mlavalle> ok, I will talk to them again tonight
16:06:18 <ihrachys> from what I understand, chandankumar spearheads several initiatives and projects and may not have equal time for all patches
16:06:30 <ihrachys> ok great
16:07:04 <mlavalle> cool, since I saw that chandankumar was pushing the patches I thought he was going to continue with them
16:07:11 <ihrachys> #action mlavalle to follow up with stadium projects on taking over / merging patches to switch imports to new tempest repo
16:07:35 <ihrachys> oh, if he was active recently that's another story. I haven't seen it, probably happened in the last week.
16:07:36 <mlavalle> will do
16:07:52 <ihrachys> ok, next was "ihrachys to report bug for dvr scenario job timeouts and try concurrency increase"
16:08:08 <mlavalle> nevermind. I'll have the stadium guys take over
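[editor's note: for context on the import switch discussed above, the stadium patches boil down to pointing test modules at the standalone neutron-tempest-plugin package instead of the neutron tree. A minimal sketch, assuming a stadium scenario test module; the class and test names here are illustrative and not taken from the actual patches:

    # Before: tempest base classes imported from the neutron tree
    # from neutron.tests.tempest.scenario import base

    # After: the same base classes imported from the standalone plugin package
    from neutron_tempest_plugin.scenario import base


    class TestVpnConnectivity(base.BaseTempestTestCase):
        """Illustrative only; real stadium tests keep their existing bodies."""

        def test_site_to_site_connectivity(self):
            pass

Once all stadium repos import from neutron_tempest_plugin, the duplicated code under neutron.tests.tempest can be removed from the neutron tree, which is the end goal described above.]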
16:08:24 <ihrachys> we merged https://review.openstack.org/#/c/532261/ ; the bug report was https://launchpad.net/bugs/1742200
16:08:25 <openstack> Launchpad bug 1742200 in neutron "dvr scenario job fails with timeouts" [High,Fix released]
16:08:42 <ihrachys> we will revisit whether it helped in the grafana section
16:08:50 <ihrachys> next: "ihrachys to report bug for mtu scenario not executed for linuxbridge job"
16:09:15 <ihrachys> I reported https://bugs.launchpad.net/neutron/+bug/1742197
16:09:15 <openstack> Launchpad bug 1742197 in neutron "linuxbridge scenario job doesn't execute NetworkMtuBaseTest" [Medium,Confirmed] - Assigned to Dongcan Ye (hellochosen)
16:09:38 <ihrachys> and there is already a fix in review for that: https://review.openstack.org/#/c/532406/ though it relies on the plugin supporting the mtu-writable extension
16:10:06 <ihrachys> I also tried https://review.openstack.org/#/c/532259/ but figured out that the vlan type driver is not working in the gate
16:10:19 <ihrachys> so for linuxbridge we are left with flat and vxlan to pick from
16:10:33 <ihrachys> (well, I *assume* flat works, I haven't checked)
16:11:03 <ihrachys> in light of this, I guess the mtu-writable approach is the best
16:11:34 <ihrachys> ok, the last item was "slaweq to take over sec group failure in fullstack (report bug / triage / fix)"
16:11:46 <ihrachys> that was in regard to the failure seen in http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
16:11:55 <slaweq> I only reported it on launchpad: https://bugs.launchpad.net/neutron/+bug/1742401
16:11:56 <openstack> Launchpad bug 1742401 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:12:02 <slaweq> I didn't have time to work on it
16:12:25 <slaweq> because I was fixing https://bugs.launchpad.net/neutron/+bug/1737892
16:12:27 <openstack> Launchpad bug 1737892 in neutron "Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:12:36 <slaweq> I hope I will get to work on this SG issue this week
16:12:45 <ihrachys> great! thanks for taking it.
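[editor's note: the mtu-writable dependency mentioned above typically shows up in tempest as an extension guard, so the test only runs where the plugin advertises the net-mtu-writable API extension. A rough sketch of the pattern, not the actual code under review in https://review.openstack.org/#/c/532406/ :

    from tempest.common import utils

    from neutron_tempest_plugin.scenario import base


    class NetworkMtuWritableTest(base.BaseTempestTestCase):

        @classmethod
        def skip_checks(cls):
            super(NetworkMtuWritableTest, cls).skip_checks()
            # Only run where the backend lets tests set an explicit MTU, e.g.
            # when only flat/vxlan type drivers are usable in the gate.
            if not utils.is_extension_enabled('net-mtu-writable', 'network'):
                raise cls.skipException('net-mtu-writable extension not enabled')

        def test_connectivity_with_custom_mtu(self):
            # Illustrative placeholder; the real test creates networks with a
            # custom MTU and verifies instance connectivity across them.
            pass
]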
16:13:04 <ihrachys> we'll have a look at fullstack patches after briefly checking grafana
16:13:07 <ihrachys> #topic Grafana
16:13:12 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:54 <ihrachys> reminder that we switched to non-legacy job names on the board where applicable: https://review.openstack.org/#/c/532632/
16:14:05 <ihrachys> we need to check whether all boards still work as intended
16:14:42 <ihrachys> there are several boards there right now (coverage, grenade) that don't have any datapoints
16:14:46 <ihrachys> all for the gate queue
16:14:59 <ihrachys> I suspect it's because no failure hits happened for those jobs
16:15:19 <haleyb> that's usually the reason
16:15:38 <ihrachys> yeah, I think that's a reasonable explanation for coverage
16:15:52 <ihrachys> for grenade, that wouldn't be the case, but the dashboard update is quite fresh
16:16:02 <ihrachys> I mean, tempest / grenade fail from time to time
16:16:17 <ihrachys> so long term we should expect some failing data points
16:16:29 <ihrachys> so I guess let's give it a week or two before panicking
16:16:54 <ihrachys> in the meantime, we have the check queue for grenade jobs so we can still spot issues
16:18:32 <ihrachys> looking at other dashboards, fullstack is steady at 80%, scenario dvr is at 30% (I think we had it like that before)
16:18:38 <ihrachys> linuxbridge scenario spiked to 100%
16:19:00 <ihrachys> and I found out it's a recent logapi service plugin patch that broke it
16:19:16 <ihrachys> because we now enable the log l2 extension for linuxbridge despite it not being compatible with the agent
16:19:20 <ihrachys> I have a fix here: https://review.openstack.org/#/c/533799/
16:19:59 <ihrachys> slaweq, it now has ci results so maybe you can bump your vote
16:20:09 <slaweq> ihrachys: I just +W
16:20:24 <ihrachys> there is still more work for the bug, but I will let others fix that
16:20:38 <ihrachys> I left the bug open
16:20:45 <ihrachys> slaweq, thanks
16:21:06 <ihrachys> I also noticed there is a spike in the failure rate for functional
16:21:10 <ihrachys> in both check and gate queues
16:21:24 <ihrachys> 30% in the gate queue, 40% in check
16:21:35 <ihrachys> anyone aware of what that's about?
16:22:07 <jlibosva> perhaps the AddBridge issue?
16:22:12 <slaweq> isn't it related to this issue with connections to ovs?
16:22:13 <jlibosva> I saw a functional test stuck today again
16:22:23 <jlibosva> but I haven't checked how often it happens
16:22:38 <ihrachys> jlibosva, you mean https://bugs.launchpad.net/neutron/+bug/1741889 ?
16:22:39 <openstack> Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [High,New]
16:22:43 <jlibosva> yep
16:22:47 <ihrachys> slaweq, issue with connections? what's that?
16:23:02 <slaweq> this one which you just posted
16:23:05 <slaweq> sorry
16:23:17 <jlibosva> I promised today that I'll ask otherwiseguy for help, I haven't gotten to it yet
16:23:30 <ihrachys> slaweq, oh I see, you referred to the ovs log messages there with connection failures
16:23:36 <slaweq> yes
16:23:41 <slaweq> but it's this one, right?
16:23:48 <slaweq> or did I mix something up?
16:24:03 <ihrachys> jlibosva, it would be nice to get otherwiseguy on it because it's really painful and we don't seem to have a path forward. there was another bug that he was going to look into.
16:24:10 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687074
16:24:10 <openstack> Launchpad bug 1687074 in neutron "Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"" [High,Confirmed]
16:24:30 <ihrachys> I think we usually spot those stream errors in tests that fail with retries
16:24:35 <ihrachys> so it can be related
16:24:42 <ihrachys> slaweq, yeah I think it's this one
16:24:49 <slaweq> :)
16:24:54 <slaweq> thx for confirmation
16:25:05 <ihrachys> I think there was a patch for the stream issue somewhere..
16:25:10 <ihrachys> looking
16:25:36 <ihrachys> ok, it was not really a fix, just debug statements: https://review.openstack.org/#/c/525775/4/neutron/agent/common/ovs_lib.py
16:26:14 <ihrachys> I think Terry triggered a failure with the patch and got ": RuntimeError: Port nonexistantport does not exist" from the ovs socket.
16:26:17 <ihrachys> which is very weird
16:26:27 <ihrachys> it's definitely a neutron error
16:26:42 <ihrachys> so how is it possible that it ends up on the ovsdb socket
16:26:52 <ihrachys> maybe eventlet is screwing up file descriptors somehow
16:28:21 <ihrachys> well actually, maybe I assume too much saying it's a neutron issue
16:28:27 <ihrachys> I mean, neutron error, sorry
16:29:15 <ihrachys> but anyway, jlibosva is going to reach out to Terry. please mention both bugs since they are arguably related.
16:29:31 <jlibosva> yep, will do
16:30:15 <ihrachys> #action jlibosva to talk to otherwiseguy about making progress for functional ovsdb issues (timeouts, retries, stream parsing failures)
16:30:19 <ihrachys> thanks!
16:30:46 <ihrachys> jlibosva, you mentioned a functional job being stuck. I believe you mean http://logs.openstack.org/99/533799/1/check/neutron-functional/990c856/
16:30:57 <ihrachys> do you think it can be related to the ovsdb issues, or is there more to it?
16:31:32 <jlibosva> yep, that's it http://logs.openstack.org/99/533799/1/check/neutron-functional/990c856/job-output.txt.gz#_2018-01-15_23_30_49_840378
16:31:37 <jlibosva> I mean, that's what I meant
16:32:21 <ihrachys> so it's just a random kill
16:32:28 <jlibosva> last job I checked, there was a 1.5 GB log file containing the output of one particular test, and it was full of TRY_AGAIN messages
16:32:44 <ihrachys> jlibosva, aha, that's a good sign that it's the same issue
16:32:44 <jlibosva> now I wonder why the per-test timeout wasn't triggered
16:32:54 <ihrachys> right
16:33:02 <jlibosva> maybe it didn't switch greenthreads
16:33:13 <jlibosva> or we have an issue there
16:34:09 <ihrachys> yeah. does anyone have cycles to check why the job timed out? or do we just leave it be while we figure out the ovsdb issue
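[editor's note: the per-test timeout jlibosva mentions is normally enforced by a SIGALRM-based fixture set up in the test base class. A minimal sketch of that mechanism, assuming the usual OS_TEST_TIMEOUT wiring; the exact default and base-class plumbing in neutron may differ:

    import os

    import fixtures
    import testtools


    class ExampleFunctionalCase(testtools.TestCase):

        def setUp(self):
            super(ExampleFunctionalCase, self).setUp()
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', '180'))
            # gentle=True makes the fixture raise a TimeoutException in the test
            # via SIGALRM when the limit is hit. If the alarm never translates
            # into an exception in the test (the speculation above is that
            # eventlet did not switch greenthreads), the test keeps running and
            # only the job-level timeout eventually kills the whole run.
            self.useFixture(fixtures.Timeout(timeout, gentle=True))
]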
16:34:49 <ihrachys> ok, let's leave it for now and revisit if it happens more
16:34:59 <ihrachys> #topic Fullstack
16:35:05 <jlibosva> I can check it
16:35:12 <ihrachys> well, ok :)
16:35:19 <jlibosva> I believe it will be the TRY_AGAIN :)
16:35:56 <ihrachys> #action jlibosva to check why the functional job times out globally instead of triggering a local test case timeout for the TRY_AGAIN ovsdb issue
16:36:31 <ihrachys> jlibosva, a lot of pointers to it maybe being a case of eventlet running amok
16:36:35 <ihrachys> anyway, back to fullstack
16:36:49 <ihrachys> slaweq has been doing an amazing job lately tackling the failures we have one by one
16:37:03 <slaweq> but slowly :)
16:37:13 <ihrachys> the latest fix, for the qos dscp bug, is https://review.openstack.org/533318
16:37:24 <ihrachys> and it's already +W, we just need to get it merged
16:38:08 <ihrachys> there was a concern during review that the l2 extension manager may bubble up exceptions from extensions
16:38:30 <ihrachys> how do we tackle that? I guess at the least by reporting a low-priority bug?
16:38:41 <ihrachys> anyone willing to chew on it beyond that?
16:38:48 <slaweq> I can report it
16:38:53 <ihrachys> go for it!
16:39:00 <slaweq> but I don't know if I will have time to propose a fix soon
16:39:10 <ihrachys> #action slaweq to report a bug about l2 ext manager not catching exceptions from extensions
16:39:12 <slaweq> ok, I will report it today
16:39:13 <ihrachys> that's fine
16:39:33 <ihrachys> just don't assign yourself if you don't plan to work on a fix, and someone should pick it up eventually
16:39:42 <slaweq> sure
16:40:04 <ihrachys> ok. apart from this fix, do we have anything in review that would help the job?
16:40:09 <slaweq> I want to mention that the patch for https://bugs.launchpad.net/neutron/+bug/1733649 was also merged a few days ago
16:40:10 <openstack> Launchpad bug 1733649 in neutron "fullstack neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs.test_dscp_marking_packets(openflow-native) failure" [High,Fix released] - Assigned to Gary Kotton (garyk)
16:40:20 <slaweq> so I hope it will be better now
16:40:50 <slaweq> I will watch it for a few days and will remove the "unstable_test" decorator if all is fine
16:41:06 <ihrachys> fullstack is still steady at 80%, so I guess other issues fail frequently enough to mask the impact of the fix
16:41:06 <slaweq> and I will also propose this fix for stable branches then
16:41:34 <ihrachys> slaweq, oh, you mean the test that was ignored. ok.
16:41:44 <slaweq> ihrachys: yes
16:42:11 <ihrachys> ok, taking one of the fresh runs for fullstack: http://logs.openstack.org/99/533799/1/check/neutron-fullstack/f827430/logs/testr_results.html.gz
16:42:27 <ihrachys> I noticed this agent restart test started failing too often in the job
16:42:31 <ihrachys> often it's the only failure
16:42:42 <ihrachys> so I suspect tackling this one should drop the failure rate significantly
16:42:57 <ihrachys> I don't think we ever reported a bug for the failure
16:44:18 <jlibosva> uhm, shall we mark it as unstable to get a better failure rate? :)
16:44:46 <ihrachys> maybe. after reporting a bug at least
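[editor's note: "mark it as unstable" refers to neutron's unstable_test helper, the same decorator slaweq plans to remove from the dscp test once its fix proves stable. A rough sketch of how the decorator is applied, with a placeholder body; the real test lives in neutron.tests.fullstack.test_qos:

    from neutron.tests import base
    from neutron.tests.fullstack import base as fullstack_base


    class TestDscpMarkingQoSOvs(fullstack_base.BaseFullStackTestCase):

        @base.unstable_test("bug 1733649")
        def test_dscp_marking_packets(self):
            # Illustrative placeholder; the real test sends traffic and checks
            # the DSCP marks. The decorator turns a failure into a skip that
            # references the tracking bug, so a known intermittent failure does
            # not keep the job red while the underlying issue is investigated.
            pass
]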
16:44:53 <ihrachys> I checked the ovs agent logs for the test
16:45:10 <ihrachys> the only suspicious one is this agent (there are 4 logs total there): http://logs.openstack.org/99/533799/1/check/neutron-fullstack/f827430/logs/dsvm-fullstack-logs/TestUninterruptedConnectivityOnL2AgentRestart.test_l2_agent_restart_OVS,VLANs,openflow-native_/neutron-openvswitch-agent--2018-01-15--21-43-21-502569.txt.gz?level=WARNING
16:45:15 <ihrachys> 2018-01-15 21:45:49.869 28570 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) error Datapath Invalid 90269446452558
16:45:29 <ihrachys> and then later
16:45:29 <ihrachys> 2018-01-15 21:45:49.942 28570 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-f3915075-576d-4155-86c3-c2272e53a1d8 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
16:47:11 <ihrachys> there are some log messages in the ovsdb server log like: 2018-01-15T21:45:33.394Z|00012|reconnect|ERR|tcp:127.0.0.1:59162: no response to inactivity probe after 5 seconds, disconnecting
16:47:20 <ihrachys> what's the probe?
16:47:33 <ihrachys> is it the ovsdb server talking to ryu?
16:47:44 <ihrachys> like checking if the controller is still there?
16:47:58 <jlibosva> I think that's ovsdb talking to the manager
16:48:22 <jlibosva> or rather vice versa
16:48:25 <slaweq> this ovsdb log is very similar to the one in the stuck functional tests IMHO
16:48:30 <slaweq> maybe it's related somehow?
16:48:33 <jlibosva> it's more related to the native ovsdb interface than to openflow
16:48:38 <ihrachys> slaweq, maybe it's always like that :)
16:48:44 <ihrachys> jlibosva, yeah sorry, I mixed things up
16:48:47 <slaweq> maybe, I don't know
16:49:05 <ihrachys> anyway... I will report a bug for that. I may dedicate some time to dig into it too.
16:49:18 <ihrachys> #action ihrachys to report bug for l2 agent restart fullstack failures and dig into it
16:50:16 <ihrachys> #topic Scenarios
16:50:31 <ihrachys> haleyb, I saw some activity on the dvr bugs that have been lingering in scenarios. any updates?
16:51:33 <ihrachys> or was it something else... I think I saw Swami posting something, but now I don't see it.
16:51:58 <ihrachys> I guess I dreamt it
16:52:18 <ihrachys> anyway, if anything this is the bug that has affected us for a while: https://bugs.launchpad.net/neutron/+bug/1717302
16:52:19 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:52:32 <ihrachys> it's on Brian but there seem to be no fixes or updates since Dec
16:53:13 <mlavalle> if haleyb is not around, I can bring it up in the L3 meeting this coming Thursday
16:53:42 <ihrachys> yeah, please do
16:53:47 <mlavalle> ++
16:54:14 <ihrachys> #action mlavalle to bring up floating ip failures in dvr scenario job to l3 team
16:54:30 <ihrachys> I am looking at a fresh run
16:54:30 <ihrachys> http://logs.openstack.org/60/532460/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/bf95db0/logs/testr_results.html.gz
16:54:36 <ihrachys> I see the trunk test failed
16:55:07 <ihrachys> I don't think it's reported
16:55:19 <ihrachys> anyone up for reporting it, or maybe even having a look?
16:55:53 <jlibosva> that might be fixed by https://review.openstack.org/#/c/531414/1
16:56:12 <ihrachys> ack!
16:56:21 <ihrachys> #topic Rally
16:56:42 <ihrachys> we had https://bugs.launchpad.net/neutron/+bug/1741954 raised a while ago
16:56:43 <openstack> Launchpad bug 1741954 in neutron "create_and_list_trunk_subports rally scenario failed with timeouts" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:56:57 <ihrachys> this is mostly to update folks that we have a fix for that here: https://review.openstack.org/#/c/532045/
16:57:07 <ihrachys> (well, it's kind of a fix; it's just an optimization)
16:57:27 <ihrachys> I think we started hitting more timeouts and perf issues since the meltdown patching in the infra / clouds we use
16:57:36 <ihrachys> but the patch should help with one of those instances
16:57:43 <ihrachys> #topic Gate
16:58:15 <ihrachys> mlavalle, I recollect you were going to follow up on the remaining legacy- jobs in the neutron gate that we figured out are still present and not converted to the new format
16:58:21 <ihrachys> any progress?
16:58:27 <ihrachys> (like tempest py3)
16:58:33 <mlavalle> didn't have time this past week
16:58:48 <mlavalle> I will do it over the next few days
16:58:54 <ihrachys> #action mlavalle to transition remaining legacy- jobs to new format
16:59:00 <ihrachys> gotcha, that's ok, not pressing!
16:59:07 <mlavalle> :-)
16:59:16 <ihrachys> ok, we are out of time
16:59:27 <ihrachys> it's always pretty packed, this meeting, riiiight? :)
16:59:32 <mlavalle> yeah
16:59:36 <ihrachys> that's good
16:59:39 <ihrachys> ok, thanks everyone
16:59:43 <ihrachys> #endmeeting