16:01:18 <ihrachys> #startmeeting neutron_ci
16:01:18 <openstack> Meeting started Tue Jan 16 16:01:18 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:21 <openstack> The meeting name has been set to 'neutron_ci'
16:01:24 <mlavalle> o/
16:01:28 <slaweq> hi
16:01:58 <ihrachys> heya
16:02:02 <ihrachys> #topic Actions from prev meeting
16:02:03 <haleyb> hi
16:02:30 <ihrachys> (skipping the first one, which was on me, since it's superseded by another AI)
16:02:32 <ihrachys> next is "mlavalle to follow up with stadium projects on switching to new tempest repo"
16:02:50 <mlavalle> I did follow up with vpnaas and midonet
16:03:03 <mlavalle> they are on board with the move
16:03:30 <ihrachys> eh wait :)
16:03:36 <ihrachys> are we talking about the same thing?
16:03:38 <mlavalle> I think the patches from the gentleman leading this have to be updated
16:03:52 <ihrachys> oh ok, we ARE talking about the same thing
16:04:07 <ihrachys> I figured the AI was worded ambiguously :)
16:04:17 <mlavalle> LOL
16:04:18 <jlibosva> o/
16:04:36 <mlavalle> I made sure yamamoto and hoangcx keep an eye on the respective patches
16:04:38 <ihrachys> for the record, we are talking about switching imports in other stadium repos to neutron-tempest-plugin, then killing the remaining in-tree tempest bits in neutron.tests.tempest
16:04:48 <mlavalle> correct
16:04:53 <ihrachys> mlavalle, keep an eye or follow up / update?
16:05:06 <ihrachys> I am not sure chandankumar is going to follow up on those patches
16:05:39 <mlavalle> ok, do they have to continue with the respective patches?
16:05:55 <ihrachys> I was thinking that yes, they would take over / drive them to merge
16:06:08 <mlavalle> ok, I will talk to them again tonight
16:06:18 <ihrachys> from what I understand, chandankumar spearheads several initiatives and projects and may not have equal time for all patches
16:06:30 <ihrachys> ok great
16:07:04 <mlavalle> cool, since I saw that chandankumar was pushing the patches I thought he was going to continue with them
16:07:11 <ihrachys> #action mlavalle to follow up with stadium projects on taking over / merging patches to switch imports to new tempest repo
16:07:35 <ihrachys> oh, if he was active recently that's another story. I haven't seen it, probably happened in the last week.
16:07:36 <mlavalle> will do
16:07:52 <ihrachys> ok, next was "ihrachys to report bug for dvr scenario job timeouts and try concurrency increase"
16:08:08 <mlavalle> nevermind. I'll have the stadium guys take over
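[editor's note: for context on the import switch discussed above, the stadium patches boil down to pointing test modules at the standalone neutron-tempest-plugin package instead of the neutron tree. A minimal sketch, assuming a stadium scenario test module; the class and test names here are illustrative and not taken from the actual patches:

    # Before: tempest base classes imported from the neutron tree
    # from neutron.tests.tempest.scenario import base

    # After: the same base classes imported from the standalone plugin package
    from neutron_tempest_plugin.scenario import base


    class TestVpnConnectivity(base.BaseTempestTestCase):
        """Illustrative only; real stadium tests keep their existing bodies."""

        def test_site_to_site_connectivity(self):
            pass

Once all stadium repos import from neutron_tempest_plugin, the duplicated code under neutron.tests.tempest can be removed from the neutron tree, which is the end goal described above.]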
16:08:24 <ihrachys> we merged https://review.openstack.org/#/c/532261/ ; the bug report was https://launchpad.net/bugs/1742200
16:08:25 <openstack> Launchpad bug 1742200 in neutron "dvr scenario job fails with timeouts" [High,Fix released]
16:08:42 <ihrachys> we will revisit whether it helped in the grafana section
16:08:50 <ihrachys> next: "ihrachys to report bug for mtu scenario not executed for linuxbridge job"
16:09:15 <ihrachys> I reported https://bugs.launchpad.net/neutron/+bug/1742197
16:09:15 <openstack> Launchpad bug 1742197 in neutron "linuxbridge scenario job doesn't execute NetworkMtuBaseTest" [Medium,Confirmed] - Assigned to Dongcan Ye (hellochosen)
16:09:38 <ihrachys> and there is already a fix in review for that: https://review.openstack.org/#/c/532406/ though it relies on the plugin supporting the mtu-writable extension
16:10:06 <ihrachys> I also tried https://review.openstack.org/#/c/532259/ but figured out that the vlan type driver is not working in the gate
16:10:19 <ihrachys> so for linuxbridge we are left with flat and vxlan to pick from
16:10:33 <ihrachys> (well, I *assume* flat works, I haven't checked)
16:11:03 <ihrachys> in light of this, I guess the mtu-writable approach is the best
16:11:34 <ihrachys> ok, the last item was "slaweq to take over sec group failure in fullstack (report bug / triage / fix)"
16:11:46 <ihrachys> that was in regard to the failure seen in http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
16:11:55 <slaweq> I only reported it on launchpad: https://bugs.launchpad.net/neutron/+bug/1742401
16:11:56 <openstack> Launchpad bug 1742401 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:12:02 <slaweq> I didn't have time to work on it
16:12:25 <slaweq> because I was fixing https://bugs.launchpad.net/neutron/+bug/1737892
16:12:27 <openstack> Launchpad bug 1737892 in neutron "Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:12:36 <slaweq> I hope I will get to work on this SG issue this week
16:12:45 <ihrachys> great! thanks for taking it.
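[editor's note: the mtu-writable dependency mentioned above typically shows up in tempest as an extension guard, so the test only runs where the plugin advertises the net-mtu-writable API extension. A rough sketch of the pattern, not the actual code under review in https://review.openstack.org/#/c/532406/ :

    from tempest.common import utils

    from neutron_tempest_plugin.scenario import base


    class NetworkMtuWritableTest(base.BaseTempestTestCase):

        @classmethod
        def skip_checks(cls):
            super(NetworkMtuWritableTest, cls).skip_checks()
            # Only run where the backend lets tests set an explicit MTU, e.g.
            # when only flat/vxlan type drivers are usable in the gate.
            if not utils.is_extension_enabled('net-mtu-writable', 'network'):
                raise cls.skipException('net-mtu-writable extension not enabled')

        def test_connectivity_with_custom_mtu(self):
            # Illustrative placeholder; the real test creates networks with a
            # custom MTU and verifies instance connectivity across them.
            pass
]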
16:13:04 <ihrachys> we'll have a look at fullstack patches after briefly checking grafana
16:13:07 <ihrachys> #topic Grafana
16:13:12 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:54 <ihrachys> reminder that we switched to non-legacy job names on the board where applicable: https://review.openstack.org/#/c/532632/
16:14:05 <ihrachys> we need to check whether all boards still work as intended
16:14:42 <ihrachys> there are several boards there right now (coverage, grenade) that don't have any datapoints
16:14:46 <ihrachys> all for the gate queue
16:14:59 <ihrachys> I suspect it's because no failure hits happened for those jobs
16:15:19 <haleyb> that's usually the reason
16:15:38 <ihrachys> yeah, I think that's a reasonable explanation for coverage
16:15:52 <ihrachys> for grenade, that wouldn't be the case, but the dashboard update is quite fresh
16:16:02 <ihrachys> I mean, tempest / grenade fail from time to time
16:16:17 <ihrachys> so long term we should expect some failing data points
16:16:29 <ihrachys> so I guess let's give it a week or two before panicking
16:16:54 <ihrachys> in the meantime, we have the check queue for grenade jobs so we can still spot issues
16:18:32 <ihrachys> looking at other dashboards, fullstack is steady at 80%, scenario dvr is at 30% (I think we had it like that before)
16:18:38 <ihrachys> linuxbridge scenario spiked to 100%
16:19:00 <ihrachys> and I found out it's a recent logapi service plugin patch that broke it
16:19:16 <ihrachys> because we now enable the log l2 extension for linuxbridge despite it not being compatible with the agent
16:19:20 <ihrachys> I have a fix here: https://review.openstack.org/#/c/533799/
16:19:59 <ihrachys> slaweq, it now has ci results so maybe you can bump your vote
16:20:09 <slaweq> ihrachys: I just +W
16:20:24 <ihrachys> there is still more work for the bug, but I will let others fix that
16:20:38 <ihrachys> I left the bug open
16:20:45 <ihrachys> slaweq, thanks
16:21:06 <ihrachys> I also noticed there is a spike in the failure rate for functional
16:21:10 <ihrachys> in both check and gate queues
16:21:24 <ihrachys> 30% in the gate queue, 40% in check
16:21:35 <ihrachys> anyone aware of what that's about?
16:22:07 <jlibosva> perhaps the AddBridge issue?
16:22:12 <slaweq> isn't it related to this issue with connections to ovs?
16:22:13 <jlibosva> I saw a functional test stuck today again
16:22:23 <jlibosva> but I haven't checked how often it happens
16:22:38 <ihrachys> jlibosva, you mean https://bugs.launchpad.net/neutron/+bug/1741889 ?
16:22:39 <openstack> Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [High,New]
16:22:43 <jlibosva> yep
16:22:47 <ihrachys> slaweq, issue with connections? what's that?
16:23:02 <slaweq> this one which you just posted
16:23:05 <slaweq> sorry
16:23:17 <jlibosva> I promised today that I'll ask otherwiseguy for help, I haven't gotten to it yet
16:23:30 <ihrachys> slaweq, oh I see, you referred to the ovs log messages there with connection failures
16:23:36 <slaweq> yes
16:23:41 <slaweq> but it's this one, right?
16:23:48 <slaweq> or did I mix something up?
16:24:03 <ihrachys> jlibosva, it would be nice to get otherwiseguy on it because it's really painful and we don't seem to have a path forward. there was another bug that he was going to look into.
16:24:10 <ihrachys> https://bugs.launchpad.net/neutron/+bug/1687074
16:24:10 <openstack> Launchpad bug 1687074 in neutron "Sometimes ovsdb fails with "tcp:127.0.0.1:6640: error parsing stream"" [High,Confirmed]
16:24:30 <ihrachys> I think we usually spot those stream errors in tests that fail with retries
16:24:35 <ihrachys> so it can be related
16:24:42 <ihrachys> slaweq, yeah I think it's this one
16:24:49 <slaweq> :)
16:24:54 <slaweq> thx for confirmation
16:25:05 <ihrachys> I think there was a patch for the stream issue somewhere..
16:25:10 <ihrachys> looking
16:25:36 <ihrachys> ok, it was not really a fix, just debug statements: https://review.openstack.org/#/c/525775/4/neutron/agent/common/ovs_lib.py
16:26:14 <ihrachys> I think Terry triggered a failure with the patch and got ": RuntimeError: Port nonexistantport does not exist" from the ovs socket.
16:26:17 <ihrachys> which is very weird
16:26:27 <ihrachys> it's definitely a neutron error
16:26:42 <ihrachys> so how is it possible that it ends up on the ovsdb socket
16:26:52 <ihrachys> maybe eventlet is screwing up file descriptors somehow
16:28:21 <ihrachys> well actually, maybe I assume too much saying it's a neutron issue
16:28:27 <ihrachys> I mean, neutron error, sorry
16:29:15 <ihrachys> but anyway, jlibosva is going to reach out to Terry. please mention both bugs since they are arguably related.
16:29:31 <jlibosva> yep, will do
16:30:15 <ihrachys> #action jlibosva to talk to otherwiseguy about making progress for functional ovsdb issues (timeouts, retries, stream parsing failures)
16:30:19 <ihrachys> thanks!
16:30:46 <ihrachys> jlibosva, you mentioned a functional job being stuck. I believe you mean http://logs.openstack.org/99/533799/1/check/neutron-functional/990c856/
16:30:57 <ihrachys> do you think it can be related to the ovsdb issues, or is there more to it?
16:31:32 <jlibosva> yep, that's it http://logs.openstack.org/99/533799/1/check/neutron-functional/990c856/job-output.txt.gz#_2018-01-15_23_30_49_840378
16:31:37 <jlibosva> I mean, that's what I meant
16:32:21 <ihrachys> so it's just a random kill
16:32:28 <jlibosva> last job I checked, there was a 1.5 GB log file containing the output of one particular test, and it was full of TRY_AGAIN messages
16:32:44 <ihrachys> jlibosva, aha, that's a good sign that it's the same issue
16:32:44 <jlibosva> now I wonder why the per-test timeout wasn't triggered
16:32:54 <ihrachys> right
16:33:02 <jlibosva> maybe it didn't switch greenthreads
16:33:13 <jlibosva> or we have an issue there
16:34:09 <ihrachys> yeah. does anyone have cycles to check why the job timed out? or do we just leave it be while we figure out the ovsdb issue
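[editor's note: the per-test timeout jlibosva mentions is normally enforced by a SIGALRM-based fixture set up in the test base class. A minimal sketch of that mechanism, assuming the usual OS_TEST_TIMEOUT wiring; the exact default and base-class plumbing in neutron may differ:

    import os

    import fixtures
    import testtools


    class ExampleFunctionalCase(testtools.TestCase):

        def setUp(self):
            super(ExampleFunctionalCase, self).setUp()
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', '180'))
            # gentle=True makes the fixture raise a TimeoutException in the test
            # via SIGALRM when the limit is hit. If the alarm never translates
            # into an exception in the test (the speculation above is that
            # eventlet did not switch greenthreads), the test keeps running and
            # only the job-level timeout eventually kills the whole run.
            self.useFixture(fixtures.Timeout(timeout, gentle=True))
]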
16:34:49 <ihrachys> ok, let's leave it for now and revisit if it happens more
16:34:59 <ihrachys> #topic Fullstack
16:35:05 <jlibosva> I can check it
16:35:12 <ihrachys> well, ok :)
16:35:19 <jlibosva> I believe it will be the TRY_AGAIN :)
16:35:56 <ihrachys> #action jlibosva to check why the functional job times out globally instead of triggering a local test case timeout for the TRY_AGAIN ovsdb issue
16:36:31 <ihrachys> jlibosva, a lot of pointers to it maybe being a case of eventlet running amok
16:36:35 <ihrachys> anyway, back to fullstack
16:36:49 <ihrachys> slaweq has been doing an amazing job lately tackling the failures we have one by one
16:37:03 <slaweq> but slowly :)
16:37:13 <ihrachys> the latest fix, for the qos dscp bug, is https://review.openstack.org/533318
16:37:24 <ihrachys> and it's already +W, we just need to get it merged
16:38:08 <ihrachys> there was a concern during review that the l2 extension manager may bubble up exceptions from extensions
16:38:30 <ihrachys> how do we tackle that? I guess at the least by reporting a low-priority bug?
16:38:41 <ihrachys> anyone willing to chew on it beyond that?
16:38:48 <slaweq> I can report it
16:38:53 <ihrachys> go for it!
16:39:00 <slaweq> but I don't know if I will have time to propose a fix soon
16:39:10 <ihrachys> #action slaweq to report a bug about l2 ext manager not catching exceptions from extensions
16:39:12 <slaweq> ok, I will report it today
16:39:13 <ihrachys> that's fine
16:39:33 <ihrachys> just don't assign yourself if you don't plan to work on a fix, and someone should pick it up eventually
16:39:42 <slaweq> sure
16:40:04 <ihrachys> ok. apart from this fix, do we have anything in review that would help the job?
16:40:09 <slaweq> I want to mention that the patch for https://bugs.launchpad.net/neutron/+bug/1733649 was also merged a few days ago
16:40:10 <openstack> Launchpad bug 1733649 in neutron "fullstack neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs.test_dscp_marking_packets(openflow-native) failure" [High,Fix released] - Assigned to Gary Kotton (garyk)
16:40:20 <slaweq> so I hope it will be better now
16:40:50 <slaweq> I will watch it for a few days and will remove the "unstable_test" decorator if all is fine
16:41:06 <ihrachys> fullstack is still steady at 80%, so I guess other issues fail frequently enough to mask the impact of the fix
16:41:06 <slaweq> and I will also propose this fix for stable branches then
16:41:34 <ihrachys> slaweq, oh, you mean the test that was ignored. ok.
16:41:44 <slaweq> ihrachys: yes
16:42:11 <ihrachys> ok, taking one of the fresh runs for fullstack: http://logs.openstack.org/99/533799/1/check/neutron-fullstack/f827430/logs/testr_results.html.gz
16:42:27 <ihrachys> I noticed this agent restart test started failing too often in the job
16:42:31 <ihrachys> often it's the only failure
16:42:42 <ihrachys> so I suspect tackling this one should drop the failure rate significantly
16:42:57 <ihrachys> I don't think we ever reported a bug for the failure
16:44:18 <jlibosva> uhm, shall we mark it as unstable to get a better failure rate? :)
16:44:46 <ihrachys> maybe. after reporting a bug at least
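[editor's note: "mark it as unstable" refers to neutron's unstable_test helper, the same decorator slaweq plans to remove from the dscp test once its fix proves stable. A rough sketch of how the decorator is applied, with a placeholder body; the real test lives in neutron.tests.fullstack.test_qos:

    from neutron.tests import base
    from neutron.tests.fullstack import base as fullstack_base


    class TestDscpMarkingQoSOvs(fullstack_base.BaseFullStackTestCase):

        @base.unstable_test("bug 1733649")
        def test_dscp_marking_packets(self):
            # Illustrative placeholder; the real test sends traffic and checks
            # the DSCP marks. The decorator turns a failure into a skip that
            # references the tracking bug, so a known intermittent failure does
            # not keep the job red while the underlying issue is investigated.
            pass
]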
16:44:53 <ihrachys> I checked the ovs agent logs for the test
16:45:10 <ihrachys> the only suspicious one is this agent (there are 4 logs total there): http://logs.openstack.org/99/533799/1/check/neutron-fullstack/f827430/logs/dsvm-fullstack-logs/TestUninterruptedConnectivityOnL2AgentRestart.test_l2_agent_restart_OVS,VLANs,openflow-native_/neutron-openvswitch-agent--2018-01-15--21-43-21-502569.txt.gz?level=WARNING
16:45:15 <ihrachys> 2018-01-15 21:45:49.869 28570 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: ofctl request version=None,msg_type=None,msg_len=None,xid=None,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=23,type=1) error Datapath Invalid 90269446452558
16:45:29 <ihrachys> and then later
16:45:29 <ihrachys> 2018-01-15 21:45:49.942 28570 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-f3915075-576d-4155-86c3-c2272e53a1d8 - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
16:47:11 <ihrachys> there are some log messages in the ovsdb server log like: 2018-01-15T21:45:33.394Z|00012|reconnect|ERR|tcp:127.0.0.1:59162: no response to inactivity probe after 5 seconds, disconnecting
16:47:20 <ihrachys> what's the probe?
16:47:33 <ihrachys> is it the ovsdb server talking to ryu?
16:47:44 <ihrachys> like checking if the controller is still there?
16:47:58 <jlibosva> I think that's ovsdb talking to the manager
16:48:22 <jlibosva> or rather vice versa
16:48:25 <slaweq> this ovsdb log is very similar to the one in the stuck functional tests IMHO
16:48:30 <slaweq> maybe it's related somehow?
16:48:33 <jlibosva> it's more related to the native ovsdb interface than to openflow
16:48:38 <ihrachys> slaweq, maybe it's always like that :)
16:48:44 <ihrachys> jlibosva, yeah sorry, I mixed things up
16:48:47 <slaweq> maybe, I don't know
16:49:05 <ihrachys> anyway... I will report a bug for that. I may dedicate some time to dig into it too.
16:49:18 <ihrachys> #action ihrachys to report bug for l2 agent restart fullstack failures and dig into it
16:50:16 <ihrachys> #topic Scenarios
16:50:31 <ihrachys> haleyb, I saw some activity on the dvr bugs that have been lingering in scenarios. any updates?
16:51:33 <ihrachys> or was it something else... I think I saw Swami posting something, but now I don't see it.
16:51:58 <ihrachys> I guess I dreamt it
16:52:18 <ihrachys> anyway, if anything this is the bug that has affected us for a while: https://bugs.launchpad.net/neutron/+bug/1717302
16:52:19 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:52:32 <ihrachys> it's on Brian but there seem to be no fixes or updates since Dec
16:53:13 <mlavalle> if haleyb is not around, I can bring it up in the L3 meeting this coming Thursday
16:53:42 <ihrachys> yeah, please do
16:53:47 <mlavalle> ++
16:54:14 <ihrachys> #action mlavalle to bring up floating ip failures in dvr scenario job to l3 team
16:54:30 <ihrachys> I am looking at a fresh run
16:54:30 <ihrachys> http://logs.openstack.org/60/532460/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/bf95db0/logs/testr_results.html.gz
16:54:36 <ihrachys> I see the trunk test failed
16:55:07 <ihrachys> I don't think it's reported
16:55:19 <ihrachys> anyone up for reporting it, or maybe even having a look?
16:55:53 <jlibosva> that might be fixed by https://review.openstack.org/#/c/531414/1
16:56:12 <ihrachys> ack!
16:56:21 <ihrachys> #topic Rally
16:56:42 <ihrachys> we had https://bugs.launchpad.net/neutron/+bug/1741954 raised a while ago
16:56:43 <openstack> Launchpad bug 1741954 in neutron "create_and_list_trunk_subports rally scenario failed with timeouts" [High,In progress] - Assigned to Armando Migliaccio (armando-migliaccio)
16:56:57 <ihrachys> this is mostly to update folks that we have a fix for that here: https://review.openstack.org/#/c/532045/
16:57:07 <ihrachys> (well, it's kind of a fix; it's just an optimization)
16:57:27 <ihrachys> I think we started hitting more timeouts and perf issues since the meltdown patching in the infra / clouds we use
16:57:36 <ihrachys> but the patch should help with one of those instances
16:57:43 <ihrachys> #topic Gate
16:58:15 <ihrachys> mlavalle, I recollect you were going to follow up on the remaining legacy- jobs in the neutron gate that we figured out are still present and not converted to the new format
16:58:21 <ihrachys> any progress?
16:58:27 <ihrachys> (like tempest py3)
16:58:33 <mlavalle> didn't have time this past week
16:58:48 <mlavalle> I will do it over the next few days
16:58:54 <ihrachys> #action mlavalle to transition remaining legacy- jobs to new format
16:59:00 <ihrachys> gotcha, that's ok, not pressing!
16:59:07 <mlavalle> :-)
16:59:16 <ihrachys> ok, we are out of time
16:59:27 <ihrachys> it's always pretty packed, this meeting, riiiight? :)
16:59:32 <mlavalle> yeah
16:59:36 <ihrachys> that's good
16:59:39 <ihrachys> ok, thanks everyone
16:59:43 <ihrachys> #endmeeting