16:00:27 <ihrachys> #startmeeting neutron_ci
16:00:30 <mlavalle> o/
16:00:32 <openstack> Meeting started Tue Mar 28 16:00:27 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:35 * ihrachys waves at everyone
16:00:35 <openstack> The meeting name has been set to 'neutron_ci'
16:00:57 <manjeets_> o/
16:01:44 <jlibosva> o/
16:01:51 <ihrachys> let's review action items from the prev meeting
16:01:55 <ihrachys> aka shame on ihrachys
16:01:59 <jlibosva> and jlibosva
16:02:01 <ihrachys> "ihrachys fix e-r bot not reporting in irc channel"
16:02:29 <ihrachys> hasn't happened; I gotta track that in my trello I guess
16:02:33 <ihrachys> #action ihrachys fix e-r bot not reporting in irc channel
16:02:38 <ihrachys> "ihrachys to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv"
16:02:44 <ihrachys> nope, hasn't happened
16:02:54 <ihrachys> will follow up on it after the meeting, it should not take much time :-x
16:03:19 <ihrachys> actually, mlavalle maybe you could take that since you were to track the dvr failure rate?
16:03:44 <ihrachys> that's a matter of editing grafana/neutron.yaml in project-config, not a huge task
16:03:49 <mlavalle> ihrachys: sure. I don't know how, but I will find out
16:04:00 <ihrachys> that's a good learning opportunity then
16:04:05 <mlavalle> cool
16:04:08 <ihrachys> you can ask me for details in the neutron channel
16:04:11 <ihrachys> and thanks
16:04:13 <mlavalle> will do
16:04:19 <ihrachys> #action mlavalle to fix the grafana board to include gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv
16:04:43 <mlavalle> I spent time looking at this
16:05:11 <mlavalle> all the failures I can see are due to hosts not being available for the tests
16:05:31 <mlavalle> or losing connection with the hypervisor
16:05:55 <mlavalle> the other failures I see are due to the patchsets in the check queue
16:06:23 <mlavalle> as a next step I'll be glad to talk to the infra team about this
16:06:23 <ihrachys> ok I see
16:06:41 <ihrachys> we may revisit that once we have data (grafana) back
16:07:00 <ihrachys> next was "jlibosva to figure out the plan for py3 gate transition and report back"
16:07:17 <jlibosva> didn't sync yet. Although it's quite important, I won't be able to make a plan by the next meeting as I'll be off most of the time. So I'm targeting now+2 weeks :)
16:08:09 <clarkb> mlavalle: yes please do ping us in -infra after the meeting if you can (I've been trying to get things under control failure-wise and want to make sure we aren't missing something)
16:08:26 <mlavalle> clarkb: will do
16:08:29 <ihrachys> ok let's punt py3 for now till jlibosva is back
16:09:00 <ihrachys> unless someone wants to take a stab at writing a proposal for py3 coverage in the gate
16:10:58 <ihrachys> ok
16:11:05 <ihrachys> #topic State of the Gate
16:11:10 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:12:01 <ihrachys> gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial is the only gate job that seems to show a high failure rate
16:12:07 <ihrachys> it's 8% right now
16:12:17 <ihrachys> anyone aware of what's happening there?
16:12:39 * electrocucaracha is checking
16:12:59 <jlibosva> any chance it's still the echo from the spike before?
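For the grafana action item above: the dashboards live in project-config and are defined with grafyaml. A rough sketch of what adding the dvr multinode job to grafana/neutron.yaml could look like; the panel layout and statsd metric paths here are assumptions, and the existing entries in that file are the pattern to copy:

```yaml
# Sketch only - row/panel titles and metric paths are illustrative; mirror
# the shape of the entries already present in grafana/neutron.yaml.
dashboard:
  title: Neutron Failure Rate
  rows:
    - title: Check queue failure rates
      panels:
        - title: Failure Rates (check queue)
          type: graph
          targets:
            - target: alias(movingAverage(asPercent(transformNull(stats_counts.zuul.pipeline.check.job.gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv.FAILURE), transformNull(sum(stats_counts.zuul.pipeline.check.job.gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv.{SUCCESS,FAILURE}))), '12hours'), 'gate-tempest-dsvm-neutron-dvr-multinode-full-ubuntu-xenial-nv')
```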
16:13:19 <ihrachys> I see this example: http://logs.openstack.org/17/412017/5/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/c2fab50/console.html#_2017-03-24_18_31_35_087822
16:13:28 <ihrachys> jlibosva: looks rather flat from grafana
16:13:36 <ihrachys> timeout?
16:14:37 <ihrachys> I don't see too many patches merged lately, could be a one-off
16:15:59 <ihrachys> #topic Fullstack voting progress
16:16:16 <ihrachys> jlibosva: surely fullstack is still at 100% failure rate, but are we making progress?
16:16:26 <ihrachys> do we have a grasp of all the failures there?
16:17:09 <jlibosva> re linuxbridge - I found the latest failure: http://logs.openstack.org/71/450771/1/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/df0eaa2/console.html#_2017-03-28_14_05_34_665809
16:17:22 <jlibosva> ihrachys: there is still the patch for the iptables firewall
16:17:41 <ihrachys> https://review.openstack.org/441353 ?
16:17:56 <jlibosva> yes, that one
16:18:07 <jlibosva> still probably a WIP
16:18:38 <ihrachys> I see test_securitygroup failing with it as in http://logs.openstack.org/53/441353/8/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/fe9f205/testr_results.html.gz
16:18:47 <ihrachys> does it suggest it's not solving it?
16:19:05 <jlibosva> it's probably introducing another regression
16:19:11 <ihrachys> :-)
16:19:17 <jlibosva> as it solves the iptables driver but breaks iptables_hybrid
16:19:24 <ihrachys> whack-a-mole
16:19:27 <jlibosva> they are closely related and both use the conntrack manager
16:20:08 <ihrachys> kevinbenton: fyi, seems like we need the conntrack patch in to move forward with fullstack
16:20:21 <ihrachys> jlibosva: apart from this failure, anything pressing? or is it the last one?
16:20:28 <jlibosva> no, two others :)
16:20:31 <jlibosva> rather :(
16:20:46 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1673531 - introduced recently
16:20:46 <openstack> Launchpad bug 1673531 in neutron "fullstack test_controller_timeout_does_not_break_connectivity_sigkill(GRE and l2pop,openflow-native_ovsdb-cli) failure" [Undecided,New]
16:21:02 <jlibosva> by merging the tests for keeping data plane connectivity while the agent is restarted
16:21:47 <jlibosva> I also saw another failure in a trunk test where patch ports between tbr- and br-int are not cleaned up properly after the trunk is deleted.
16:22:01 <jlibosva> I haven't investigated that one and I don't think I reported a LP bug yet
16:22:39 <ihrachys> I should probably raise the test_controller_timeout_does_not_break_connectivity_sigkill one at the upgrades meeting since it's directly related to upgrade scenarios
16:22:59 <jlibosva> It's unclear to me if it's fullstack or the agent
16:23:53 <ihrachys> http://logs.openstack.org/98/446598/1/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/2e0f93e/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigkill_GRE-and-l2pop,openflow-native_ovsdb-cli_/neutron-openvswitch-agent--2017-03-16--16-06-05-730632.txt.gz?level=TRACE
16:24:08 <ihrachys> looks like multiple agents trying to add the same manager?
16:24:28 <ihrachys> since we don't isolate OVS and we run two hosts, maybe that's why
16:25:05 <ihrachys> gotta get otherwiseguy looking at it. the bug may be in the code that is now in ovsdbapp.
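A side note on the trace above: the exchange that follows explains why even a vsctl-based test ends up registering an OVSDB manager. A minimal sketch of that startup path, under the assumption that only neutron.agent.ovsdb.native.helpers.enable_connection_uri (which jlibosva points to below) is real; the wrapper and the connect callable are hypothetical, not the actual agent code:

```python
# Sketch of the chicken-and-egg fallback discussed below, not the real agent
# code: the native OVSDB connection needs a manager/listener to exist first,
# so on connection failure the agent shells out via ovs-vsctl to register one
# and then retries the native connection.
from neutron.agent.ovsdb.native import helpers


def connect_ovsdb(conn_uri, connect_native, retries=3):
    """connect_native is a hypothetical callable that raises on failure."""
    for _ in range(retries):
        try:
            return connect_native(conn_uri)
        except RuntimeError:
            # Register the manager via the CLI so the ptcp/punix endpoint
            # exists. With several agents sharing one OVS instance (as in
            # fullstack), these CLI calls can race with each other.
            helpers.enable_connection_uri(conn_uri)
    return connect_native(conn_uri)
```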
16:25:36 <jlibosva> that's weird
16:25:41 <jlibosva> it has vsctl ovsdb_interface
16:25:46 <jlibosva> I thought the manager is only needed for native
16:25:55 <ihrachys> it's chicken and egg
16:26:06 <ihrachys> you can't do native before you register the port
16:26:17 <ihrachys> so if the connection fails, we call the CLI to add the manager port
16:26:21 <ihrachys> and then repeat the native attempt
16:26:24 <jlibosva> but there is no native whatsoever
16:26:49 <jlibosva> it's a vsctl test
16:27:17 <ihrachys> oh
16:27:36 <ihrachys> a reasonable question then is, why do we open the port
16:27:38 <ihrachys> right?
16:27:50 <jlibosva> but anyways, if it tries to create a new manager and it's already there, it shouldn't influence the functionality, right?
16:28:11 <ihrachys> that depends on what the agent does with the failure.
16:28:23 <ihrachys> not sure if the failure happens on this iteration or somewhere later
16:30:21 <ihrachys> yeah, seems like the failure happens 30sec+ after that error
16:30:27 <ihrachys> probably not directly related
16:31:04 <jlibosva> I'm looking at the code right now and the ovsdb monitor calls native.helpers.enable_connection_uri
16:32:06 <jlibosva> https://review.openstack.org/#/c/441447/
16:32:26 <ihrachys> yeah, I was actually looking for this exact patch
16:32:40 <jlibosva> but by that time the fullstack test wasn't in the tree yet
16:34:44 <ihrachys> oh, so basically polling.py always passes cfg.CONF.OVS.ovsdb_connection
16:34:56 <ihrachys> and since it has a default value, it always triggers the command
16:35:15 <ihrachys> I think there are several issues here. one is that we don't need that at all for vsctl
16:35:25 <ihrachys> another being that multiple calls may race
16:35:40 <ihrachys> neither is directly related to the fullstack failure
16:36:20 <ihrachys> #action ihrachys to report bugs for fullstack race in ovs agent when calling to enable_connection_uri
16:36:34 <jlibosva> we could hack fullstack to filelock the call
16:36:42 <ihrachys> I don't think that's correct
16:36:45 <jlibosva> to avoid the races; it can't happen in the real world
16:37:11 <ihrachys> because we don't run multiple monitors?
16:37:20 <jlibosva> we don't run multiple ovs agents
16:38:32 <ihrachys> yeah, seems like the only place we call that code path is the ovs agent
16:39:53 <ihrachys> I would still prefer a code-level fix for that, but it would work if we lock too
16:40:06 <jlibosva> the only other place where it's used is vsphere, in some dvs_neutron_agent ... http://codesearch.openstack.org/?q=get_polling_manager&i=nope&files=&repos=
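jlibosva's filelock idea for the fullstack environment could look roughly like the sketch below, using oslo.concurrency's external (file-based) locks; the wrapper function, lock name and lock path are made up for illustration:

```python
# Sketch only: serialize the racy manager-registration call across the
# several agents a fullstack test runs against one shared OVS instance.
from oslo_concurrency import lockutils

from neutron.agent.ovsdb.native import helpers


def enable_connection_uri_locked(conn_uri):
    # external=True makes this an inter-process file lock, so all agent
    # processes started by the test serialize on the same lock file.
    with lockutils.lock('fullstack-ovsdb-set-manager', external=True,
                        lock_path='/tmp'):
        helpers.enable_connection_uri(conn_uri)
```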
16:40:12 <jlibosva> but dunno what that is
16:40:32 <ihrachys> this code looks like a copy-paste of the ovs agent :)
16:41:05 <ihrachys> but it doesn't seem like this code reimplements the agent
16:41:21 <ihrachys> the question would be whether the DVS agent can be used with the OVS agent on the same node
16:42:34 * mlavalle has to step out
16:42:38 <ihrachys> ok, let's move to the next topic
16:42:48 <ihrachys> #topic Functional job state
16:43:03 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:43:25 <ihrachys> there are still spikes up to 80% in the recent past
16:43:36 <ihrachys> not sure what that was, I suspect some general gate breakage
16:43:43 <ihrachys> now it's at a reasonable 10%
16:43:53 <ihrachys> (note it's the check queue, so there may be valid breakages)
16:44:24 <ihrachys> of all the patches, I am aware of this fix for func test stability: https://review.openstack.org/#/c/446991/
16:44:37 <ihrachys> jlibosva: maybe you can have a look
16:45:00 <jlibosva> I will
16:45:19 <jlibosva> also note that for almost the whole previous week the rate was around 20%, which is still not ideal
16:46:28 <ihrachys> yeah. sadly I am consumed this week by misc stuff so I won't be able to have a look.
16:48:25 <ihrachys> #topic Other gate failures
16:48:29 <ihrachys> https://bugs.launchpad.net/neutron/+bugs?field.tag=gate-failure
16:48:41 <jlibosva> we could monitor the trend for next week and we'll see
16:48:45 * ihrachys looks through the list to see if anything could benefit from review attention
16:49:10 <ihrachys> this patch may be interesting given the recent pecan switch: https://review.openstack.org/#/c/447781/
16:49:18 <ihrachys> but it needs dasanind to respin it with a test included
16:49:41 <ihrachys> I think the bug hits tempest sometimes.
16:50:46 <ihrachys> any bugs worth raising?
16:51:02 <ihrachys> oh, there is one from tmorin with a fix here: https://review.openstack.org/#/c/450865/
16:51:12 <ihrachys> I haven't checked the fix yet, so I am not sure if it's the right thing
16:51:57 <manjeets_> he just enabled quotas explicitly and it worked
16:52:11 <manjeets_> need to check how the quota OVO disrupted normal behavior
16:52:12 <ihrachys> I don't think the change we landed was intended to break subprojects ;)
16:52:19 <ihrachys> gotta find a fix on the neutron side
16:52:47 <manjeets_> yeah, that would be the right fix
16:53:10 <ihrachys> ok, let's move on
16:53:14 <ihrachys> #topic Open discussion
16:53:36 <ihrachys> https://review.openstack.org/#/c/439114/ from manjeets_ still waits for +W from infra
16:53:43 <ihrachys> I see Clark already +2'd it, nice
16:54:00 <manjeets_> I asked clark yesterday for a review
16:54:16 <manjeets_> maybe I need to post once more in -infra
16:54:35 <ihrachys> yeah, thanks for following up on it
16:54:51 <ihrachys> apart from that, anything CI-related worth mentioning here?
16:55:27 <jlibosva> I noticed that qos is skipped in the api job
16:55:42 <jlibosva> e.g. http://logs.openstack.org/91/446991/2/check/gate-neutron-dsvm-api-ubuntu-xenial/044a331/testr_results.html.gz
16:55:49 <jlibosva> test_qos
16:55:50 <manjeets_> one question - I was looking at the functional tests
16:55:57 <manjeets_> I don't see much for qos
16:56:43 <manjeets_> I see trunk is covered in functional but not qos
16:57:09 <ihrachys> jlibosva: I think we had a skip somewhere there
16:57:22 <ihrachys> http://logs.openstack.org/91/446991/2/check/gate-neutron-dsvm-api-ubuntu-xenial/044a331/console.html#_2017-03-24_11_12_55_029613
16:57:36 <ihrachys> apparently the driver (ovs?) doesn't support it
16:57:41 <jlibosva> I dug into it a bit and concluded that settings from local.conf are not propagated to tempest.conf - but on this patch I see it works ... maybe it
16:57:53 <ihrachys> jlibosva: oh, there was another thing related
16:57:55 <jlibosva> yeah, that's probably something other than what I saw - seems fixed by now
16:58:08 <ihrachys> https://review.openstack.org/#/c/449182/
16:58:20 <ihrachys> that should fix the issue with changes not being propagated from the hooks into tempest.conf
16:58:33 <ihrachys> so now we have only 2 skips, and they seem to be legit
16:58:37 <jlibosva> ihrachys: yeah, that's probably it :)
16:59:10 <jlibosva> it was weird for me though, as I actually saw crudini being called - anyway, it's solved. Thanks ihrachys :)
16:59:17 <ihrachys> np
16:59:31 <ihrachys> manjeets_: there are func tests for qos too
16:59:41 <ihrachys> manjeets_: I will give links in the neutron channel since we are at the top of the hour
16:59:45 <ihrachys> thanks everyone and keep it up!
16:59:47 <ihrachys> #endmeeting