15:01:18 <haleyb> #startmeeting neutron_dvr
15:01:20 <openstack> Meeting started Wed Aug 24 15:01:18 2016 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:24 <openstack> The meeting name has been set to 'neutron_dvr'
15:01:28 <haleyb> #chair Swami
15:01:31 <openstack> Current chairs: Swami haleyb
15:01:46 <jschwarz> \o/
15:01:54 <haleyb> #topic Announcements
15:01:58 <Swami> jschwarz: hi
15:02:03 <jschwarz> Swami, hello :)
15:02:29 <haleyb> hi jschwarz, we'll make sure to give you time for HA :)
15:02:51 <jschwarz> haleyb, :)
15:03:14 * jschwarz is in a concurrent meeting but will try to stay tentative
15:03:27 <Swami> jschwarz: same here
15:04:00 <anilvenkata> hi
15:04:14 <Swami> anilvenkata: hi
15:04:15 <haleyb> So N-3 is almost over, and with the gate a mess we need to focus on getting just the important things in first, and helping with any triage
15:04:54 <Swami> When is the N-3 cut-off date?
15:05:18 <haleyb> the schedule shows next monday
15:05:33 <Swami> haleyb: thanks
15:05:49 <haleyb> Swami: or at least next week
15:05:54 <haleyb> https://releases.openstack.org/newton/schedule.html
15:06:47 <haleyb> #topic Bugs
15:07:06 <Swami> haleyb: thanks
15:07:19 <Swami> We have two critical bugs that are seen in the gate failures.
15:07:40 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1612192
15:07:40 <openstack> Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:08:02 <Swami> This seems to be seen along with the HA, so it might be related to HA+DVR.
15:08:23 <jschwarz> Swami, last I looked, that was a completely different issue form the HA one
15:08:28 <jschwarz> s/form/from/
15:08:54 <Swami> jschwarz: sorry my mistake, yes it says that it is different than the issue seen with the HA and specific to DVR.
15:09:21 <Swami> I have not looked into it yet.
15:09:35 <jschwarz> Swami, and since the HA part of the bug is dropped (and the HA bug is not gate-blocking) I've not looked into the DVR bug any further
15:09:51 <Swami> jschwarz: ok, thanks
15:10:03 <haleyb> Swami: i have looked at a number of failures this week, and most have a dbdeadlock trace somewhere in them
15:10:52 <Swami> haleyb: Yes I am not sure if this is still related to the fix that kevinbenton pushed in last week.
15:11:03 <haleyb> there is still a fix in-flight, trying to find it
15:11:57 <haleyb> https://review.openstack.org/#/c/356530/
15:12:02 <Swami> haleyb: thanks
15:12:16 <haleyb> at least i think that should help
15:12:34 <Swami> haleyb: hope so.
15:13:01 <Swami> the next critical bug is
15:13:03 <haleyb> i had been seeing strange asynchronous failures where dhcp was getting set up, but at the same time the network was deleted
15:13:06 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1612804
15:13:06 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:13:55 <Swami> haleyb: do we know what patch caused this regression?
15:14:56 <haleyb> Swami: not exactly. there's been issues with the auto-allocation code, and carl was thinking the ipam cut-over, i just haven't confirmed, was waiting for that patch to land
15:15:35 <Swami> haleyb: ok thanks
15:16:02 <Swami> it seems that the test_shelve_instance failure is only seen once in the gate for the last 7 days.
15:17:01 <Swami> The next bug in the list is
15:17:05 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1403455
15:17:05 <openstack> Launchpad bug 1403455 in neutron "neutron-netns-cleanup doesn't clean up all L3 agent spawned processes" [High,Triaged] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
15:18:02 <haleyb> Swami: btw, that log from https://bugs.launchpad.net/neutron/+bug/1612804 showed a dhcp failure, so i can imagine it's all related
15:18:02 <openstack> Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:18:09 <Swami> This is not just related to DVR, but a generic bug that has a DVR tag.
15:18:13 <haleyb> sorry, i'm slow typing sometimes
15:18:38 <Swami> haleyb: interesting.
15:19:02 <Swami> haleyb: so now this is all pointing to the dhcp failure.
15:19:17 <haleyb> Swami: that netns cleanup bug is pretty old, don't know why it moved to High
15:19:52 <haleyb> Swami: it could be the dhcp issue, i need to look at the q-dhcp log
15:20:00 <Swami> haleyb: I just saw that it popped up in dvr bug list.
15:20:26 <haleyb> Swami: seems armando bumped it up
15:20:29 <jschwarz> that probably includes l3-ha as well
15:20:42 <jschwarz> as l3-ha also doesn't always clean up keepalived processes
15:21:01 <Swami> jschwarz: yes I did see it includes all in L3 agent
15:21:13 * jschwarz didn't read the bug report
15:22:07 <Swami> The next one in the list is
15:22:10 <haleyb> Swami: yes, that log shows the same issue
15:22:11 <Swami> #link https://bugs.launchpad.net/neutron/+bug/1609217
15:22:11 <openstack> Launchpad bug 1609217 in neutron "DVR: dvr router should not exist in not-binded network node" [Undecided,In progress] - Assigned to LIU Yulong (dragon889)
15:22:25 <liuyulong_> hi there
15:22:38 <Swami> haleyb: thanks
15:22:45 <Swami> liuyulong_: hi
15:23:15 <Swami> This above bug seems to be as designed, does anyone have any issues with this?
15:23:25 <Swami> liuyulong_: did you file the above bug?
15:24:04 <liuyulong_> I'm working on removing this line: https://github.com/openstack/neutron/blob/master/neutron/common/utils.py#L258
15:24:17 <liuyulong_> Swami, yes
15:25:04 <liuyulong_> Swami, for now, the test shows that VM could boot properly.
15:25:08 <Swami> I think the router namespace created on the network-node where dhcp agent is running is as per design.
15:26:00 <liuyulong_> Swami, that namespace is redundant.
15:26:08 <Swami> The reason for considering the dhcp ports as the dvr service ports is for anyone trying to reach the routed subnet from the dhcp.
15:26:44 <Swami> This is not a new change and this had been there ever since we introduced dvr.
15:27:51 <Swami> liuyulong_: I think brian has already added a comment in there on why the router needs to be there when service port is available.
15:28:00 <haleyb> Swami: yes, that router namespace on the NN is by design, it's needed for the dhcp port to reach all the private networks
15:28:03 <liuyulong_> Swami, I'm not familiar with `routed subnet`.
15:28:12 <haleyb> i've explained that many times :)
15:28:27 <liuyulong_> :)
15:29:10 <Swami> at this point, if it is not affecting the functionality we should change the behavior and we have other critical things to address.
15:29:22 <Swami> s/should change/should not change
15:29:28 <liuyulong_> this may be related to some DVR+HA issues. Because that DVR router will stay in the L3 agent route_info cache.
15:29:46 <liuyulong_> Even if it's not scheduled to.
15:30:25 <liuyulong_> done, please go forward. Thank you very much.
15:31:02 <Swami> liuyulong_: thanks for your input. If it affects the DVR+HA operation, we can file a separate bug on exactly what it affects with the DVR+HA and then go from there.
15:31:15 <Swami> liuyulong_: thanks for your time and feedback.
15:31:45 <Swami> These are the only new bugs this week.
15:32:00 <haleyb> ok, thanks Swami
15:32:09 <haleyb> any other bugs anyone wants to bring up?
15:32:26 <Swami> the remaining bugs that we discussed before are still there and have not been resolved.
15:32:28 <anilvenkata> haleyb, Swami https://bugs.launchpad.net/neutron/+bug/1602614
15:32:28 <openstack> Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:32:43 <anilvenkata> haleyb, Swami need reviewers for the patch
15:33:12 <anilvenkata> https://review.openstack.org/#/c/255237/
15:33:20 <Swami> haleyb: yes anilvenkata's patch should be given priority while we deal with the HA failure issues.
15:34:03 <haleyb> anilvenkata: i'll look today, but we'll need some ML2 cores to also approve
15:34:09 <anilvenkata> Both the bugs 1522980 1602614 got "high" priority and newton as milestone
15:34:09 <openstack> bug 1522980 in neutron "L3 HA integration with l2pop assumes control plane is operational for fail over" [High,In progress] https://launchpad.net/bugs/1522980 - Assigned to venkata anil (anil-venkata)
15:34:25 <anilvenkata> same patch for both the bugs
15:34:56 <anilvenkata> any idea whom to approach for review?
15:35:13 <Swami> jschwarz: have you tested or reviewed anilvenkata's patch with the HA?
15:35:31 <jschwarz> Swami, the DB part looks OK to me, but my l2pop skills are not as good
15:36:08 <Swami> jschwarz: can you check with amuller to see if he has any comments on this patch.
15:36:26 <jschwarz> Swami, I have a one-on-one with him in 24 minutes, so I can ask
15:36:58 <Swami> anilvenkata: can you also ping amuller to review the patch, since he might have more insight on the l2pop out of all of us here.
15:37:27 <anilvenkata> Swami, sure
15:37:33 <Swami> anilvenkata: thanks
15:37:40 <anilvenkata> Swami, thanks Swami
15:38:00 <anilvenkata> Swami, haleyb your review for this patch can also help
15:38:08 <Swami> anilvenkata: sure
15:38:15 <haleyb> any other bugs?
15:38:28 <haleyb> jschwarz: anything on the HA/DVR front to talk about?
15:38:33 <jschwarz> yes
15:38:56 <jschwarz> so during the midcycle me and Jakub discussed with armax about changing the gate a bit
15:39:03 <jschwarz> making fullstack voting, etc
15:39:15 <jschwarz> also, we want to have a DVR+HA gate
15:39:21 <jschwarz> so... that happened :)
15:39:36 <Swami> jschwarz: but don't we need an additional node for DVR+HA in gate?
15:39:50 <jschwarz> Swami, the DVR multinode job already runs with 2 nodes, no?
15:39:55 <jschwarz> 2 nodes is enough for AH
15:39:56 <jschwarz> HA*
15:40:13 <jschwarz> (as long as the agents there are dvr_snat)
15:40:27 <Swami> jschwarz: But will it not be useful to have three nodes, two network and one compute alone to segregate things?
15:41:00 <Swami> jschwarz: otherwise we will not be testing the 'dvr' agent type in the gate.
15:41:13 <jschwarz> Swami, I see your point
15:41:20 <Swami> everything will be tested against the 'dvr_snat' agent type.
15:41:27 <jschwarz> that is a good question to make: can we change the multinode-dvr to have 3 nodes instead of 2
15:42:00 <jschwarz> also, anilvenkata's work on l2pop is blocking this as well, as currently HA+l2pop is broken pretty bad and anilvenkata's patch seems like a magic pill full of happiness :)
15:42:14 <Swami> jschwarz: Also I am seeing another issue with respect to the DVR functional tests where we see random failures with respect to HA+DVR. This is related to some stale namespace that is being used.
15:42:17 <anilvenkata> jschwarz, thanks :)
15:42:24 <Swami> jschwarz: if you get a chance can you take a look at it.
15:42:36 <jschwarz> Swami, can you link me the bug please?
15:42:38 <jschwarz> or some logs?
15:43:03 <Swami> jschwarz: I have not filed a bug yet, but I have been seeing even in my local setup that it randomly fails.
15:43:14 <jschwarz> Swami, oh, that's always fun
15:43:24 <jschwarz> Swami, lets file a bug report with reproduction steps and I'll put some cycles on it
15:43:27 <Swami> You can reproduce it with a test function that I had that would remove the stale namespace.
15:43:44 <Swami> jschwarz: I will send you the test function that I had to reproduce it.
15:43:51 <jschwarz> Swami, sounds good to me
15:43:56 <Swami> jschwarz: thanks
15:44:03 <jschwarz> Swami, also, re: the current multinode-dvr job
15:44:12 <Swami> jschwarz: yes
15:44:19 <jschwarz> is it running one dvr_snat node and one dvr node?
15:44:33 <Swami> jschwarz: Yes one is dvr_snat and the other is dvr
15:44:43 <jschwarz> ok
15:45:02 <jschwarz> Swami, I'll ask around and see the feasibility of turning that gate into a 3-node
15:45:11 <Swami> jschwarz: yes that would help
15:45:13 <jschwarz> somehow, I imagine #openstack-infra to not like that idea at all :)
15:45:46 <jschwarz> if we already started jumping into HA territory
15:45:47 <haleyb> jschwarz: well, start with an experimental one, since we probably even want to disable compute on the snat nodes possibly
15:45:54 <Swami> jschwarz: that would be a tough job
15:46:15 <jschwarz> haleyb, ack
15:46:28 <jschwarz> anyway, re: HA, we have https://bugs.launchpad.net/neutron/+bug/1609738
15:46:28 <openstack> Launchpad bug 1609738 in neutron "l3-ha: a router can be stuck in the ALLOCATING state" [Undecided,In progress] - Assigned to John Schwarz (jschwarz)
15:46:29 <haleyb> jschwarz: yes, infra won't like it, but how else do we test this, and not introduce new bugs?
15:46:39 <jschwarz> which is being worked on in https://review.openstack.org/#/c/357965/2 and following patches
15:46:50 <jschwarz> haleyb, it's not something we test at all.
15:47:04 <jschwarz> haleyb, unless we count on functional only
15:47:11 <jschwarz> no actual scenario jobs :\
15:47:19 <haleyb> exactly, maybe you do internally, and maybe us...
15:47:34 <jschwarz> I don't like the "internally testing" idea
15:47:47 <jschwarz> if you guys or us benefit of it, we should also share the happiness
15:47:53 <jschwarz> isn't that the opensource way? :P
15:48:18 <haleyb> yes, we need a scenario test, even if it's experimental
15:48:27 <jschwarz> haleyb, so here's the scenario test: https://review.openstack.org/#/c/340383/
15:48:31 <Swami> haleyb: yes that would help.
15:48:35 * jschwarz has come prepared!
15:49:00 <haleyb> i even +1'd it already :)
15:49:06 <jschwarz> haleyb, Swami, basically it's an actual tempest scenario test that is supposed to check DVR
15:49:39 <jschwarz> it's blocked by allowing tempest jobs to run neutron-based scenario jobs, but other than that Genadi would sure like some love
15:49:59 <Swami> jschwarz: got it.
15:50:16 <jschwarz> the link for that is https://review.openstack.org/#/c/355344/
15:50:27 <haleyb> jschwarz: but i think you know what i mean - with a functional test for HA/DVR, when a change to HA is proposed having a 'check experimental' would be nice. eventually it could be promoted. i'll get off my box now
15:50:51 <jschwarz> haleyb, that sounds like the first step in the right direction
15:51:03 <jschwarz> haleyb, probably #openstack-infra will like that one as well :)
15:51:37 <haleyb> and armando - having untested code in the tree is bad
15:51:40 <jschwarz> anyway, if those 2 patches go in we'll get some much-needed DVR coverage in the stable branches
15:52:02 <anilvenkata> haleyb, Swami there is a fullstack test for DVR functionality https://review.openstack.org/#/c/318307/
15:52:16 <jschwarz> I'll get off my podium now, I think that's all I had in mind for this meeting
15:52:44 <Swami> jschwarz: thanks for your time
15:52:47 <haleyb> jschwarz: so one more for you - https://bugs.launchpad.net/neutron/+bug/1597461 - i know there's other HA patches in-flight, but you mentioned in a review none would fix this
15:52:47 <openstack> Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:52:55 <Swami> anilvenkata: thanks will take a look at it.
15:53:01 <anilvenkata> Swami, thanks
15:53:02 <jschwarz> haleyb, yes
15:53:17 <jschwarz> haleyb, so Ann and your new best friend, anilvenkata, are working on that
15:53:37 <haleyb> jschwarz: give me something to +2 :)
15:53:49 <anilvenkata> :)
15:53:58 <jschwarz> haleyb, https://review.openstack.org/#/c/357458/ is WIP
15:54:02 <jschwarz> but I hope it will be ready soon
15:54:06 <haleyb> seriously, let me know when you have something, oh there it is
15:54:14 <jschwarz> XD
15:54:38 <anilvenkata> https://review.openstack.org/#/c/357458/ is ready for review
15:54:44 <anilvenkata> haleyb, Swami ^
15:54:57 <jschwarz> anilvenkata, so you guys should probably drop the 'WIP' tag from the patch
15:54:59 <haleyb> anilvenkata: thanks, i haven't gotten to email yet today to see it was updated
15:55:00 <anilvenkata> jschwarz, hope u will take back your -1
15:55:06 <Swami> anilvenkata: jschwarz: thanks for the links
15:55:29 <anilvenkata> that is taken back, now it is ready
15:55:37 <jschwarz> anilvenkata, I was sure I did, sorry
15:55:44 <anilvenkata> jschwarz, :)
15:56:21 <haleyb> i'm going to skip gate failures today as we're all scr*wed right now in that regard
15:56:29 <haleyb> #topic Open Discussion
15:56:52 <haleyb> anything else to discuss? have ~4 minutes
15:57:02 <jschwarz> armax also mentioned something about possibly deprecating legacy routers in the midcycle
15:57:13 <Swami> haleyb: did you get a chance to take a look at the fast-path-exit patch?
15:57:19 <jschwarz> this got me wondering about migration of legacy routers to DVR, or lack thereof
15:57:38 <Swami> jschwarz: we already have a way to migrate the legacy to DVR.
15:57:52 <Swami> jschwarz: that is admin only option to migrate the legacy to DVR.
15:57:56 <haleyb> Swami: not yet, i've been trying to help with the gate issues, problem with ipv6 right now, but i promise to look before EOD
15:58:17 <jschwarz> Swami, https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L504
15:58:23 <jschwarz> this specifically disables it?
15:58:27 <Swami> haleyb: no problem
15:58:47 <haleyb> I need to clone myself
15:59:09 <haleyb> or work smarter, not harder, so 90's
15:59:24 <anilvenkata> work like John
15:59:26 <anilvenkata> :)
15:59:32 <anilvenkata> lot of multi tasking
15:59:41 <Swami> jschwarz: got it, if there are any services associated or HA configured for the legacy routers, probably it should be removed and then migrated.
16:00:00 <jschwarz> Swami, but what if I want to migrate a legacy to HA+DVR?
16:00:02 <haleyb> anilvenkata: i'm already on the phone in another meeting :)
16:00:06 <haleyb> #endmeeting