15:01:18 <haleyb> #startmeeting neutron_dvr
15:01:20 <openstack> Meeting started Wed Aug 24 15:01:18 2016 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:24 <openstack> The meeting name has been set to 'neutron_dvr'
15:01:28 <haleyb> #chair Swami
15:01:31 <openstack> Current chairs: Swami haleyb
15:01:46 <jschwarz> \o/
15:01:54 <haleyb> #topic Announcements
15:01:58 <Swami> jschwarz: hi
15:02:03 <jschwarz> Swami, hello :)
15:02:29 <haleyb> hi jschwarz, we'll make sure to give you time for HA :)
15:02:51 <jschwarz> haleyb, :)
15:03:14 * jschwarz is in a concurrent meeting but will try to stay attentive
15:03:27 <Swami> jschwarz: same here
15:04:00 <anilvenkata> hi
15:04:14 <Swami> anilvenkata: hi
15:04:15 <haleyb> So N-3 is almost over, and with the gate a mess we need to focus on getting just the important things in first, and helping with any triage
15:04:54 <Swami> When is the N-3 cutoff date?
15:05:18 <haleyb> the schedule shows next monday
15:05:33 <Swami> haleyb: thanks
15:05:49 <haleyb> Swami: or at least next week
15:05:54 <haleyb> https://releases.openstack.org/newton/schedule.html
15:06:47 <haleyb> #topic Bugs
15:07:06 <Swami> haleyb: thanks
15:07:19 <Swami> We have two critical bugs that are seen in the gate failures.
15:07:40 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1612192
15:07:40 <openstack> Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:08:02 <Swami> This seems to be seen along with the HA, so it might be related to HA+DVR.
15:08:23 <jschwarz> Swami, last I looked, that was a completely different issue form the HA one
15:08:28 <jschwarz> s/form/from/
15:08:54 <Swami> jschwarz: sorry my mistake, yes it says that it is different than the issue seen with the HA and specific to DVR.
15:09:21 <Swami> I have not looked into it yet.
15:09:35 <jschwarz> Swami, and since the HA part of the bug is dropped (and the HA bug is not gate-blocking) I've not looked into the DVR bug any further
15:09:51 <Swami> jschwarz: ok, thanks
15:10:03 <haleyb> Swami: i have looked at a number of failures this week, and most have a dbdeadlock trace somewhere in them
15:10:52 <Swami> haleyb: Yes I am not sure if this is still related to the fix that kevinbenton pushed in last week.
15:11:03 <haleyb> there is still a fix in-flight, trying to find it
15:11:57 <haleyb> https://review.openstack.org/#/c/356530/
15:12:02 <Swami> haleyb: thanks
15:12:16 <haleyb> at least i think that should help
15:12:34 <Swami> haleyb: hope so.
15:13:01 <Swami> the next critical bug is
15:13:03 <haleyb> i had been seeing strange asynchronous failures where dhcp was getting set up, but at the same time the network was deleted
15:13:06 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1612804
15:13:06 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:13:55 <Swami> haleyb: do we know what patch caused this regression?
15:14:56 <haleyb> Swami: not exactly.  there's been issues with the auto-allocation code, and carl was thinking the ipam cut-over, i just haven't confirmed, was waiting for that patch to land
15:15:35 <Swami> haleyb: ok thanks
15:16:02 <Swami> it seems that the test_shelve_instance failure is only seen once in the gate for the last 7 days.
15:17:01 <Swami> The next bug in the list is
15:17:05 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1403455
15:17:05 <openstack> Launchpad bug 1403455 in neutron "neutron-netns-cleanup doesn't clean up all L3 agent spawned processes" [High,Triaged] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
15:18:02 <haleyb> Swami: btw, that log from https://bugs.launchpad.net/neutron/+bug/1612804 showed a dhcp failure, so i can imagine it's all related
15:18:02 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:18:09 <Swami> This is not just related to DVR, but a generic bug that has a DVR tag.
15:18:13 <haleyb> sorry, i'm slow typing sometimes
15:18:38 <Swami> haleyb: interesting.
15:19:02 <Swami> haleyb: so now this is all pointing to the dhcp failure.
15:19:17 <haleyb> Swami: that netns cleanup bug is pretty old, don't know why it moved to High
15:19:52 <haleyb> Swami: it could be the dhcp issue, i need to look at the q-dhcp log
15:20:00 <Swami> haleyb: I just saw that it popped up in the dvr bug list.
15:20:26 <haleyb> Swami: seems armando bumped it up
15:20:29 <jschwarz> that probably includes l3-ha as well
15:20:42 <jschwarz> as l3-ha also doesn't always clean up keepalived processes
15:21:01 <Swami> jschwarz: yes I did see it includes all in L3 agent
15:21:13 * jschwarz didn't read the bug report
15:22:07 <Swami> The next one in the list is
15:22:10 <haleyb> Swami: yes, that log shows the same issue
15:22:11 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1609217
15:22:11 <openstack> Launchpad bug 1609217 in neutron "DVR: dvr router should not exist in not-binded network node" [Undecided,In progress] - Assigned to LIU Yulong (dragon889)
15:22:25 <liuyulong_> hi there
15:22:38 <Swami> haleyb: thanks
15:22:45 <Swami> liuyulong_: hi
15:23:15 <Swami> The above bug seems to be as designed; does anyone have any issues with this?
15:23:25 <Swami> liuyulong_: did you file the above bug?
15:24:04 <liuyulong_> I'm working on removing this line: https://github.com/openstack/neutron/blob/master/neutron/common/utils.py#L258
15:24:17 <liuyulong_> Swami, yes
15:25:04 <liuyulong_> Swami, for now, the test shows that VM could boot properly.
15:25:08 <Swami> I think the router namespace created on the network-node where dhcp agent is running is as per design.
15:26:00 <liuyulong_> Swami, that namespace is redundant.
15:26:08 <Swami> The reason for considering the dhcp ports as dvr service ports is for anyone trying to reach the routed subnet from the dhcp.
15:26:44 <Swami> This is not a new change and this has been there ever since we introduced dvr.
15:27:51 <Swami> liuyulong_: I think brian has already added a comment in there on why the router needs to be there when a service port is available.
15:28:00 <haleyb> Swami: yes, that router namespace on the NN is by design, it's needed for the dhcp port to reach all the private networks
15:28:03 <liuyulong_> Swami, I'm not familiar with `routed subnet`.
15:28:12 <haleyb> i've explained that many times :)
15:28:27 <liuyulong_> :)
15:29:10 <Swami> at this point, if it is not affecting the functionality we should change the behavior and we have other critical things to address.
15:29:22 <Swami> s/should change/should not change
15:29:28 <liuyulong_> this may be related to some DVR+HA issues, because that DVR router will stay in the L3 agent route_info cache.
15:29:46 <liuyulong_> Even if it's not scheduled to it.
15:30:25 <liuyulong_> done, please go forward. Thank you very much.
15:31:02 <Swami> liuyulong_: thanks for your input. If it affects the DVR+HA operation, we can file a separate bug describing exactly what it affects with DVR+HA and then go from there.
15:31:15 <Swami> liuyulong_: thanks for your time and feedback.
15:31:45 <Swami> These are the only new bugs this week.
15:32:00 <haleyb> ok, thanks Swami
15:32:09 <haleyb> any other bugs anyone wants to bring up?
15:32:26 <Swami> the remaining bugs that we discussed before are still there and have not been resolved.
15:32:28 <anilvenkata> haleyb, Swami https://bugs.launchpad.net/neutron/+bug/1602614
15:32:28 <openstack> Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:32:43 <anilvenkata> haleyb, Swami need reviewers for the patch
15:33:12 <anilvenkata> https://review.openstack.org/#/c/255237/
15:33:20 <Swami> haleyb: yes anilvenkata's patch should be given priority while we deal with the HA failure issues.
15:34:03 <haleyb> anilvenkata: i'll look today, but we'll need some ML2 cores to also approve
15:34:09 <anilvenkata> Both the bugs 1522980 and 1602614 got "High" priority and newton as milestone
15:34:09 <openstack> bug 1522980 in neutron "L3 HA integration with l2pop assumes control plane is operational for fail over" [High,In progress] https://launchpad.net/bugs/1522980 - Assigned to venkata anil (anil-venkata)
15:34:25 <anilvenkata> same patch for both the bugs
15:34:56 <anilvenkata> any idea whom to approach for review?
15:35:13 <Swami> jschwarz: have you tested or reviewed anilvenkata's patch with the HA?
15:35:31 <jschwarz> Swami, the DB part looks OK to me, but my l2pop skills are not as good
15:36:08 <Swami> jschwarz: can you check with amuller to see if he has any comments on this patch.
15:36:26 <jschwarz> Swami, I have a one-on-one with him in 24 minutes, so I can ask
15:36:58 <Swami> anilvenkata: can you also ping amuller to review the patch, since he might have more insight on the l2pop out of all of us here.
15:37:27 <anilvenkata> Swami, sure
15:37:33 <Swami> anilvenkata: thanks
15:37:40 <anilvenkata> Swami, thanks Swami
15:38:00 <anilvenkata> Swami, haleyb your review for this patch can also help
15:38:08 <Swami> anilvenkata: sure
15:38:15 <haleyb> any other bugs?
15:38:28 <haleyb> jschwarz: anything on the HA/DVR front to talk about?
15:38:33 <jschwarz> yes
15:38:56 <jschwarz> so during the midcycle me and Jakub discussed with armax about changing the gate a bit
15:39:03 <jschwarz> making fullstack voting, etc
15:39:15 <jschwarz> also, we want to have a DVR+HA gate
15:39:21 <jschwarz> so... that happened :)
15:39:36 <Swami> jschwarz: but don't we need an additional node for DVR+HA in the gate?
15:39:50 <jschwarz> Swami, the DVR multinode job already runs with 2 nodes, no?
15:39:55 <jschwarz> 2 nodes is enough for AH
15:39:56 <jschwarz> HA*
15:40:13 <jschwarz> (as long as the agents there are dvr_snat)
15:40:27 <Swami> jschwarz: But will it not be useful to have three nodes, two network and one compute alone, to segregate things?
15:41:00 <Swami> jschwarz: otherwise we will not be testing the 'dvr' agent type in the gate.
15:41:13 <jschwarz> Swami, I see your point
15:41:20 <Swami> everything will be tested against the 'dvr_snat' agent type.
15:41:27 <jschwarz> that is a good question to ask: can we change the multinode-dvr to have 3 nodes instead of 2?
15:42:00 <jschwarz> also, anilvenkata's work on l2pop is blocking this as well, as currently HA+l2pop is broken pretty bad and anilvenkata's patch seems like a magic pill full of happiness :)
15:42:14 <Swami> jschwarz: Also I am seeing another issue with respect to the DVR functional tests where we see random failures with respect to HA+DVR. This is related to some stale namespace that is being used.
15:42:17 <anilvenkata> jschwarz, thanks :)
15:42:24 <Swami> jschwarz: if you get a chance can you take a look at it.
15:42:36 <jschwarz> Swami, can you link me the bug please?
15:42:38 <jschwarz> or some logs?
15:43:03 <Swami> jschwarz: I have not filed a bug yet, but I have been seeing even in my local setup that it randomly fails.
15:43:14 <jschwarz> Swami, oh, that's always fun
15:43:24 <jschwarz> Swami, let's file a bug report with reproduction steps and I'll put some cycles on it
15:43:27 <Swami> You can reproduce it with a test function that I had that would remove the stale namespace.
15:43:44 <Swami> jschwarz: I will send you the test function that I had to reproduce it.
15:43:51 <jschwarz> Swami, sounds good to me
15:43:56 <Swami> jschwarz: thanks
15:44:03 <jschwarz> Swami, also, re: the current multinode-dvr job
15:44:12 <Swami> jschwarz: yes
15:44:19 <jschwarz> is it running one dvr_snat node and one dvr node?
15:44:33 <Swami> jschwarz: Yes one is dvr_snat and the other is dvr
15:44:43 <jschwarz> ok
15:45:02 <jschwarz> Swami, I'll ask around and see the feasibility of turning that gate into a 3-node
15:45:11 <Swami> jschwarz: yes that would help
15:45:13 <jschwarz> somehow, I imagine #openstack-infra to not like that idea at all :)
15:45:46 <jschwarz> if we already started jumping into HA territory
15:45:47 <haleyb> jschwarz: well, start with an experimental one, since we probably even want to disable compute on the snat nodes possibly
15:45:54 <Swami> jschwarz: that would be a tough job
15:46:15 <jschwarz> haleyb, ack
15:46:28 <jschwarz> anyway, re: HA, we have https://bugs.launchpad.net/neutron/+bug/1609738
15:46:28 <openstack> Launchpad bug 1609738 in neutron "l3-ha: a router can be stuck in the ALLOCATING state" [Undecided,In progress] - Assigned to John Schwarz (jschwarz)
15:46:29 <haleyb> jschwarz: yes, infra won't like it, but how else do we test this, and not introduce new bugs?
15:46:39 <jschwarz> which is being worked on in https://review.openstack.org/#/c/357965/2 and following patches
15:46:50 <jschwarz> haleyb, it's not something we test at all.
15:47:04 <jschwarz> haleyb, unless we count on functional only
15:47:11 <jschwarz> no actual scenario jobs :\
15:47:19 <haleyb> exactly, maybe you do internally, and maybe us...
15:47:34 <jschwarz> I don't like the "internally testing" idea
15:47:47 <jschwarz> if you guys or us benefit from it, we should also share the happiness
15:47:53 <jschwarz> isn't that the opensource way? :P
15:48:18 <haleyb> yes, we need a scenario test, even if it's experimental
15:48:27 <jschwarz> haleyb, so here's the scenario test: https://review.openstack.org/#/c/340383/
15:48:31 <Swami> haleyb: yes that would help.
15:48:35 * jschwarz has come prepared!
15:49:00 <haleyb> i even +1'd it already :)
15:49:06 <jschwarz> haleyb, Swami, basically it's an actual tempest scenario test that is supposed to check DVR
15:49:39 <jschwarz> it's blocked by allowing tempest jobs to run neutron-based scenario jobs, but other than that Genadi would sure like some love
15:49:59 <Swami> jschwarz: got it.
15:50:16 <jschwarz> the link for that is https://review.openstack.org/#/c/355344/
15:50:27 <haleyb> jschwarz: but i think you know what i mean - with a functional test for HA/DVR, when a change to HA is proposed having a 'check experimental' would be nice.  eventually it could be promoted. i'll get off my box now
15:50:51 <jschwarz> haleyb, that sounds like the first step in the right direction
15:51:03 <jschwarz> haleyb, probably #openstack-infra will like that one as well :)
15:51:37 <haleyb> and armando - having untested code in the tree is bad
15:51:40 <jschwarz> anyway, if those 2 patches go in we'll get some much-needed DVR coverage in the stable branches
15:52:02 <anilvenkata> haleyb, Swami there is a fullstack test for DVR functionality https://review.openstack.org/#/c/318307/
15:52:16 <jschwarz> I'll get off my podium now, I think that's all I had in mind for this meeting
15:52:44 <Swami> jschwarz: thanks for your time
15:52:47 <haleyb> jschwarz: so one more for you - https://bugs.launchpad.net/neutron/+bug/1597461 - i know there's other HA patches in-flight, but you mentioned in a review none would fix this
15:52:47 <openstack> Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:52:55 <Swami> anilvenkata: thanks will take a look at it.
15:53:01 <anilvenkata> Swami, thanks
15:53:02 <jschwarz> haleyb, yes
15:53:17 <jschwarz> haleyb, so Ann and your new best friend, anilvenkata, are working on that
15:53:37 <haleyb> jschwarz: give me something to +2 :)
15:53:49 <anilvenkata> :)
15:53:58 <jschwarz> haleyb, https://review.openstack.org/#/c/357458/ is WIP
15:54:02 <jschwarz> but I hope it will be ready soon
15:54:06 <haleyb> seriously, let me know when you have something, oh there it is
15:54:14 <jschwarz> XD
15:54:38 <anilvenkata> https://review.openstack.org/#/c/357458/ is ready for review
15:54:44 <anilvenkata> haleyb, Swami ^
15:54:57 <jschwarz> anilvenkata, so you guys should probably drop the 'WIP' tag from the patch
15:54:59 <haleyb> anilvenkata: thanks, i haven't gotten to email yet today to see it was updated
15:55:00 <anilvenkata> jschwarz, hope you will take back your -1
15:55:06 <Swami> anilvenkata: jschwarz: thanks for the links
15:55:29 <anilvenkata> that is taken back, now it is ready
15:55:37 <jschwarz> anilvenkata, I was sure I did, sorry
15:55:44 <anilvenkata> jschwarz, :)
15:56:21 <haleyb> i'm going to skip gate failures today as we're all scr*wed right now in that regard
15:56:29 <haleyb> #topic Open Discussion
15:56:52 <haleyb> anything else to discuss?  have ~4 minutes
15:57:02 <jschwarz> armax also mentioned something about possibly deprecating legacy routers in the midcycle
15:57:13 <Swami> haleyb: did you get a chance to take a look at the fast-path-exit patch?
15:57:19 <jschwarz> this got me wondering about migration of legacy routers to DVR, or lack thereof
15:57:38 <Swami> jschwarz: we already have a way to migrate the legacy to DVR.
15:57:52 <Swami> jschwarz: that is an admin-only option to migrate the legacy to DVR.
15:57:56 <haleyb> Swami: not yet, i've been trying to help with the gate issues, problem with ipv6 right now, but i promise to look before EOD
15:58:17 <jschwarz> Swami, https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L504
15:58:23 <jschwarz> this specifically disables it?
15:58:27 <Swami> haleyb: no problem
15:58:47 <haleyb> I need to clone myself
15:59:09 <haleyb> or work smarter, not harder, so 90's
15:59:24 <anilvenkata> work like John
15:59:26 <anilvenkata> :)
15:59:32 <anilvenkata> lot of multi tasking
15:59:41 <Swami> jschwarz: got it, if there are any services associated or HA configured for the legacy routers, probably it should be removed and then migrated.
16:00:00 <jschwarz> Swami, but what if I want to migrate a legacy to HA+DVR?
16:00:02 <haleyb> anilvenkata: i'm already on the phone in another meeting :)
16:00:06 <haleyb> #endmeeting