15:01:24 <haleyb> #startmeeting neutron_dvr
15:01:26 <openstack> Meeting started Wed Aug 31 15:01:24 2016 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:29 <openstack> The meeting name has been set to 'neutron_dvr'
15:01:36 <haleyb> #chair Swami
15:01:37 <openstack> Current chairs: Swami haleyb
15:02:00 <haleyb> #topic Announcements
15:02:22 <haleyb> N-3 is closing this week
15:02:43 <haleyb> any new features that haven't merged will need to be granted a FFE
15:02:45 <Swami> almost there
15:03:11 <haleyb> We still will be able to fix bugs, especially if they help the check/gate job stability
15:03:14 <Swami> haleyb: what do you think about the Fast path exit, do we need a FFE or can we move it to the next release and backport it.
15:04:04 <haleyb> Swami: i know we don't usually backport new features, but there doesn't seem to be a config change required
15:04:21 <haleyb> how close do you think it is?
15:04:23 <Swami> it can still be considered as a bug
15:04:32 <Swami> It is pretty good.
15:04:52 <Swami> I have addressed a couple of review comments yesterday as well on the route rules.
15:06:09 <Swami> haleyb: if you get some time today you can review the rule patch for the fast path exit and we can keep it ready; if it works out we can merge.
15:06:37 <haleyb> I will have to look at the reviews again, then we can move forward once we're happy
15:06:56 <Swami> haleyb: ok, makes sense.
15:07:07 <Swami> at this point, we don't need to hurry.
15:07:07 <haleyb> #topic Bugs
15:07:19 <haleyb> right, i'd rather get it right
15:07:42 <Swami> haleyb: we don't have new bugs filed this week. So we pretty much have to address all the existing bugs.
15:08:20 * jschwarz is also here, but in a different meeting
15:08:25 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1612192
15:08:25 <openstack> Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:08:25 <haleyb> No new bugs?  must be a record
15:09:19 <Swami> haleyb: Did you get a chance to triage this.
15:09:51 <Swami> I think this is seen in the gate. Are we still seeing this in the gate after the gate stabilization.
15:09:58 <haleyb> Swami: no, been mostly looking at the dvr multinode job failures
15:10:17 <haleyb> i don't know if we're still seeing it, need to ask logstash
15:10:33 <Swami> Both the criticals that we have were seen in the gate.
15:11:09 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1612804
15:11:09 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:11:53 <Swami> This was also reported as seen in the gate.
15:12:52 <haleyb> regarding the first, there have been 8 occurrences in the past two weeks, all on one day
15:12:55 * jschwarz is back now
15:13:13 <Swami> haleyb: so would it be due to the gate issues or a bad patch?
15:13:16 <Swami> jschwarz: hi
15:13:30 <haleyb> Swami: actually 8 in past 30 days
15:13:57 <haleyb> all happened in 24 hours on 8/23-24
15:14:08 <Swami> haleyb: so let us monitor it for another week and see what happens.
15:14:11 <haleyb> i will look and close if necessary
15:15:22 <haleyb> the second happens more often, but i don't know if it's neutron, have not dug
15:15:43 <Swami> haleyb: ok thanks.
15:15:44 <haleyb> i'll guess and say it's an issue saving to storage
15:16:17 <Swami> haleyb: you may be right.
15:16:40 <Swami> haleyb: we have seen the shelve_instance fail earlier as well.
15:17:28 <haleyb> Swami: hmm, some show a failure to get DHCP
15:17:51 <Swami> haleyb: so are there two different symptoms for the same failure?
15:17:53 <haleyb> and that's in the non-dvr test
15:18:47 <haleyb> Swami: any time we can't ssh to an instance it's neutron's fault :(
15:19:01 <Swami> haleyb: yeah! you got it.
15:19:22 <haleyb> s/neutron/dvr
15:20:02 <Swami> The next set of bugs are related to DVR+HA+L3
15:20:14 <Swami> we have discussed these bugs earlier.
15:20:23 <jschwarz> haleyb, we're seeing a problem where in certain cases, with HA-only routers, instances lose connectivity
15:20:29 <jschwarz> haleyb, that's still DVR's fault ;-)
15:20:46 <haleyb> jschwarz: let me know when patches are up :)
15:20:53 <jschwarz> haleyb, :D
15:21:26 <haleyb> jschwarz: are there any HA patches that need attention?
15:21:27 <Swami> jschwarz: mostly the L3+HA+DVR combination is generating a lot more bugs.
15:21:59 <Swami> jschwarz: any update on the L3+HA+DVR bugs that you are working on.
15:22:05 <jschwarz> yes, sorry
15:22:12 <jschwarz> I came unprepared :)
15:22:17 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1597461
15:22:17 <openstack> Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Fix released] - Assigned to John Schwarz (jschwarz)
15:22:21 <Swami> jschwarz: no problem
15:22:39 <jschwarz> Swami, the patch for that merged a few days ago
15:22:54 <Swami> jschwarz: so can we close this bug, is it all fixed.
15:23:13 <jschwarz> Swami, yes, with a side note that a couple of other repos are complaining it broke some of their tests
15:23:30 <Swami> jschwarz: that's not good.
15:23:53 <jschwarz> Swami, one was already taken care of, the other one I'm not sure if it has (it was from networking-odl)
15:24:03 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1602320
15:24:03 <openstack> Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Undecided,In progress] - Assigned to Dongcan Ye (hellochosen)
15:24:40 <jschwarz> Swami, https://review.openstack.org/#/c/357458/9/neutron/services/l3_router/l3_router_plugin.py@77
15:25:38 <Swami> jschwarz: so are you planning to rollback or fix it with a different patch
15:25:57 <jschwarz> Swami, I see a bug fix here: https://review.openstack.org/#/c/363175/
15:26:10 <jschwarz> Swami, I think it's best to keep it separated and not rollback
15:26:21 <jschwarz> Swami, I will work with Isaku to provide a good fix for this ASAP
15:26:30 <Swami> jschwarz: ok thanks.
15:27:00 <Swami> jschwarz: do you have any input on the keepalived bug that I posted above.
15:27:19 <jschwarz> Swami, nope :<
15:27:40 <jschwarz> Swami, it looks like this patch has seen no action for 2 weeks now
15:27:51 <jschwarz> I'll ping Dongcan Ye to see if this can be moved forward
15:28:04 <Swami> jschwarz: thanks
15:28:36 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1602614
15:28:36 <openstack> Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:29:09 <jschwarz> Swami, the fix is https://review.openstack.org/#/c/255237/ - the long l2pop patch by anilvenkata
15:29:33 <jschwarz> it's being reviewed by Carl, Assaf and Kevin and since it's a complicated fix it's taking a while
15:29:34 <Swami> jschwarz: I think that is still under review.
15:29:43 <anilvenkata> yes
15:29:44 <jschwarz> hopefully this will get in N's RC
15:29:44 <Swami> jschwarz: ok makes sense.
15:29:56 <anilvenkata> getting different suggestions :)
15:30:03 <Swami> jschwarz: ok
15:30:19 <Swami> anilvenkata: thanks
15:30:32 <anilvenkata> thanks Swami jschwarz haleyb
15:31:27 <jschwarz> Swami, we also have https://bugs.launchpad.net/neutron/+bug/1607381
15:31:27 <openstack> Launchpad bug 1607381 in neutron "HA router in l3 dvr_snat/legacy agent has no ha_port" [Undecided,In progress] - Assigned to LIU Yulong (dragon889)
15:31:27 <carl_baldwin> I did not get to that review yesterday. I hope to get to it today.
15:31:37 <carl_baldwin> anilvenkata: jschwarz: ^
15:31:43 <jschwarz> carl_baldwin, ack, thanks :)
15:31:55 <anilvenkata> carl_baldwin, thanks Carl
15:31:57 <jschwarz> the bug I linked to is being dealt with in https://review.openstack.org/#/c/265672/
15:32:04 <Swami> anilvenkata: can we close this bug or is it still valid #link https://bugs.launchpad.net/neutron/+bug/1595043
15:32:04 <openstack> Launchpad bug 1595043 in neutron "Make DVR portbinding implementation useful for HA ports" [Medium,In progress] - Assigned to venkata anil (anil-venkata)
15:32:22 <anilvenkata> Swami, we will keep it
15:32:31 <anilvenkata> Swami, we have other issues with HA
15:32:37 <Swami> anilvenkata: ok.
15:32:44 <anilvenkata> Swami, thanks Swami
15:33:19 <jschwarz> going back, https://review.openstack.org/#/c/265672/ just needs reviews IMO - it looks to be ready
15:33:57 <Swami> jschwarz: thanks
15:34:07 <Swami> The next one is
15:34:12 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1593354
15:34:12 <openstack> Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New]
15:34:42 <jschwarz> I have not seen this one before (and it wasn't even triaged)
15:34:54 <Swami> jschwarz: yes
15:35:12 <Swami> haleyb: did you get a chance to triage this bug.
15:35:43 <jschwarz> anilvenkata, ^ is this something that your patch deals with?
15:35:54 <haleyb> Swami: no
15:36:10 <anilvenkata> jschwarz, no for bug 1593354
15:36:10 <openstack> bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New] https://launchpad.net/bugs/1593354
15:36:38 <Swami> haleyb: thanks,
15:36:45 <jschwarz> Swami, I'll have a pass through it tomorrow and triage this, unless haleyb beats me to it
15:38:06 <Swami> jschwarz: thanks
15:38:42 <Swami> That's all I had for the bugs this week.
15:39:11 <haleyb> any other bugs to talk about from anyone?
15:39:53 * jschwarz claps his hands to show they are empty
15:40:28 <haleyb> you need to clap again and make all the bugs go away :)
15:40:38 * jschwarz claps again
15:40:48 <Swami> jschwarz: thanks
15:40:50 * jschwarz magically fails to clap as his hands miss each other
15:41:08 <haleyb> #topic Gate failures
15:41:47 <haleyb> the dvr and dvr-multinode (and regular multinode) jobs still have higher failure rates
15:42:31 <haleyb> i am continuing to triage this when i have time, filing bugs as i go
15:42:31 <carl_baldwin> Any insight into the high multi-node failure rate?
15:43:36 <haleyb> carl_baldwin: DHCP is failing, but have not determined the timeline to see if it's the agent not starting
15:44:29 <carl_baldwin> haleyb: Are you all alone in looking into this?
15:44:33 <haleyb> kevinbenton and armax have been cleaning-up the dhcp issues i've been finding, but none actually fix the bug
15:44:51 <armax> haleyb: dvr is not happy still
15:45:32 <haleyb> armax: yes, i continue to look
15:46:55 <haleyb> and i do see the neutron-dvr job is about 12% failure in the gate today
15:47:10 <Swami> haleyb: is this for the single node
15:47:27 <haleyb> yes, http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
15:47:46 <haleyb> gate-tempest-dsvm-neutron-dvr gate job
15:48:12 <Swami> haleyb: thanks
15:48:41 <haleyb> but the failure rate is falling, don't know if it was related to the dhcp patch that merged last night
15:49:08 <Swami> haleyb: will this also fix the multinode issue with respect to dhcp.
15:49:40 <haleyb> Swami: don't think so, dhcp could fail for many reasons as we know...
15:50:30 <haleyb> carl_baldwin: i will look into the gate dvr failure, and yes i'm all alone doing it at the moment, but will reach out to people
15:50:31 <Swami> haleyb: yes.
15:51:02 <carl_baldwin> haleyb: let me know, I might be able to help some.
15:51:37 <haleyb> i need to find a patch failing first
15:52:12 <Swami> haleyb: is there a particular test that fails with respect to dhcp or is it random.
15:52:38 <Swami> Is it the instance that is not able to get the dhcp and so the ssh fails.
15:52:54 <haleyb> Swami: TestNetworkBasicOps is usually the one
15:53:44 <Swami> haleyb: thanks
15:53:51 <haleyb> right, console shows dhcp failing and no IP on eth0.  i started looking on the server and dhcp agent and found some issues, but it's a work in progress
15:55:10 <Swami> haleyb: thanks
15:55:18 <haleyb> #topic Open discussion
15:55:37 <haleyb> Swami: only note on wiki is live migration patch
15:56:01 <Swami> haleyb: what is that. I don't get it.
15:56:36 <haleyb> https://review.openstack.org/#/c/275073/
15:56:49 <haleyb> nova patch for live migration w/neutron
15:57:15 <Swami> haleyb: yes got it. We are still waiting for a +2 on it.
15:57:34 <haleyb> even the experimental job was happy
15:57:49 <Swami> I think john ran the experimental job, so let us wait and see.
15:58:18 <Swami> #link https://review.openstack.org/#/c/353788/
15:58:31 <Swami> haleyb: I need your blessing on this patch.
15:58:57 <haleyb> Swami: yes, saw that, will look
15:59:21 <Swami> haleyb: thanks
15:59:36 <haleyb> we're at end of hour, thanks for all the hard work everyone!
15:59:38 <Swami> There is one backport patch as well for stable/liberty
15:59:45 <Swami> we will take it offline.
15:59:47 <haleyb> #endmeeting