15:01:18 #startmeeting neutron_dvr
15:01:20 Meeting started Wed Aug 24 15:01:18 2016 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:24 The meeting name has been set to 'neutron_dvr'
15:01:28 #chair Swami
15:01:31 Current chairs: Swami haleyb
15:01:46 \o/
15:01:54 #topic Announcements
15:01:58 jschwarz: hi
15:02:03 Swami, hello :)
15:02:29 hi jschwarz, we'll make sure to give you time for HA :)
15:02:51 haleyb, :)
15:03:14 * jschwarz is in a concurrent meeting but will try to stay tentative
15:03:27 jschwarz: same here
15:04:00 hi
15:04:14 anilvenkata: hi
15:04:15 So N-3 is almost over, and with the gate a mess we need to focus on getting just the important things in first, and helping with any triage
15:04:54 When is the N-3 cut off date.
15:05:18 the schedule shows next monday
15:05:33 haleyb: thanks
15:05:49 Swami: or at least next week
15:05:54 https://releases.openstack.org/newton/schedule.html
15:06:47 #topic Bugs
15:07:06 haleyb: thanks
15:07:19 We have two critical bugs that are seen in the gate failures.
15:07:40 #link https://bugs.launchpad.net/neutron/+bug/1612192
15:07:40 Launchpad bug 1612192 in neutron "L3 DVR: Unable to complete operation on subnet" [Critical,Confirmed]
15:08:02 This seems to be seen along with the HA, so it might be related with HA+DVR.
15:08:23 Swami, last I looked, that was a completely different issue form the HA one
15:08:28 s/form/from/
15:08:54 jschwarz: sorry my mistake, yes it says that it is different than the issue seen with the HA and specific to DVR.
15:09:21 I have not looked into it yet.
15:09:35 Swami, and since the HA part of the bug is dropped (and the HA bug is not gate-blocking) I've not looked into the DVR bug any further
15:09:51 jschwarz: ok, thanks
15:10:03 Swami: i have looked at a number of failures this week, and most have a dbdeadlock trace somewhere in them
15:10:52 haleyb: Yes I am not sure if this is still related to the fix that kevinbenton pushed in last week.
15:11:03 there is still a fix in-flight, trying to find it
15:11:57 https://review.openstack.org/#/c/356530/
15:12:02 haleyb: thanks
15:12:16 at least i think that should help
15:12:34 haleyb: hope so.
15:13:01 the next critical bug is
15:13:03 i had been seeing strange asynchronous failures where dhcp was getting set up, but at the same time the network was deleted
15:13:06 #link https://bugs.launchpad.net/neutron/+bug/1612804
15:13:06 Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:13:55 haleyb: do we know what patch caused this regression.
15:14:56 Swami: not exactly. there's been issues with the auto-allocation code, and carl was thinking the ipam cut-over, i just haven't confirmed, was waiting for that patch to land
15:15:35 haleyb: ok thanks
15:16:02 it seems that the test_shelve_instance failure is only seen once in the gate for the last 7 days.
15:17:01 The next bug in the list is
15:17:05 #link https://bugs.launchpad.net/neutron/+bug/1403455
15:17:05 Launchpad bug 1403455 in neutron "neutron-netns-cleanup doesn't clean up all L3 agent spawned processes" [High,Triaged] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
15:18:02 Swami: btw, that log from https://bugs.launchpad.net/neutron/+bug/1612804 showed a dhcp failure, so i can imagine it's all related
15:18:02 Launchpad bug 1612804 in neutron "test_shelve_instance fails with sshtimeout" [Critical,Confirmed]
15:18:09 This is not just related to DVR, but a generic bug that has a DVR tag.
15:18:13 sorry, i'm slow typing sometimes
15:18:38 haleyb: interesting.
15:19:02 haleyb: so now this is all pointing to the dhcp failure.
15:19:17 Swami: that netns cleanup bug is pretty old, don't know why it moved to High
15:19:52 Swami: it could be the dhcp issue, i need to look at the q-dhcp log
15:20:00 haleyb: I just saw that it popped up in dvr bug list.
15:20:26 Swami: seems armando bumped it up
15:20:29 that probably includes l3-ha as well
15:20:42 as l3-ha also doesn't always clean up keepalived processes
15:21:01 jschwarz: yes I did see it includes all in L3 agent
15:21:13 * jschwarz didn't read the bug report
15:22:07 The next one in the list is
15:22:10 Swami: yes, that log shows the same issue
15:22:11 #link https://bugs.launchpad.net/neutron/+bug/1609217
15:22:11 Launchpad bug 1609217 in neutron "DVR: dvr router should not exist in not-binded network node" [Undecided,In progress] - Assigned to LIU Yulong (dragon889)
15:22:25 hi there
15:22:38 haleyb: thanks
15:22:45 liuyulong_: hi
15:23:15 This above bug seems to be as designed, does any one have any issues with this
15:23:25 liuyulong_: did you file the above bug.
15:24:04 I'm working on removing this line: https://github.com/openstack/neutron/blob/master/neutron/common/utils.py#L258
15:24:17 Swami, yes
15:25:04 Swami, for now, the test shows that VM could boot properly.
15:25:08 I think the router namespace created on the network-node where dhcp agent is running is as per design.
15:26:00 Swami, that namespace is redundant.
15:26:08 The reason for considering the dhcp ports as the dvr service ports is for anyone trying to reach the routed subnet from the dhcp.
15:26:44 This is not a new change and this had been there ever since we introduced dvr.
15:27:51 liuyulong_: I think brian has already added a comment in there on why the router needs to be there when service port is available.
15:28:00 Swami: yes, that router namespace on the NN is by design, it's needed for the dhcp port to reach all the private networks
15:28:03 Swami, I'm not familiar with `routed subnet`.
15:28:12 i've explained that many times :)
15:28:27 :)
15:29:10 at this point, if it is not affecting the functionality we should change the behavior and we have other critical things to address.
15:29:22 s/should change/should not change
15:29:28 this may be related to some DVR+HA issues. Because that DVR router will stay in the L3 agent route_info cache.
15:29:46 Even if it's not scheduled to.
15:30:25 done, please go forward. Thank you very much.
15:31:02 liuyulong_: thanks for your input. If it affects the DVR+HA operation, we can file a separate bug on exactly what it affects with the DVR+HA and then go from there.
15:31:15 liuyulong_: thanks for your time and feedback.
15:31:45 These are the only new bugs this week.
15:32:00 ok, thanks Swami
15:32:09 any other bugs anyone wants to bring up?
15:32:26 the remaining bugs that we discussed before are still there and have not been resolved.
15:32:28 haleyb, Swami https://bugs.launchpad.net/neutron/+bug/1602614
15:32:28 Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:32:43 haleyb, Swami need reviewers for the patch
15:33:12 https://review.openstack.org/#/c/255237/
15:33:20 haleyb: yes anilvenkata's patch should be given priority while we deal with the HA failure issues.
15:34:03 anilvenkata: i'll look today, but we'll need some ML2 cores to also approve
15:34:09 Both the bugs 1522980 1602614 got "high" priority and newton as milestone
15:34:09 bug 1522980 in neutron "L3 HA integration with l2pop assumes control plane is operational for fail over" [High,In progress] https://launchpad.net/bugs/1522980 - Assigned to venkata anil (anil-venkata)
15:34:25 same patch for both the bugs
15:34:56 any idea whom to approach for review?
15:35:13 jschwarz: have you tested or reviewed anilvenkata's patch with the HA
15:35:31 Swami, the DB part looks OK to me, but my l2pop skills are not as good
15:36:08 jschwarz: can you check with amuller to see if he has any comments on this patch.
15:36:26 Swami, I have a one-on-one with him in 24 minutes, so I can ask
15:36:58 anilvenkata: can you also ping amuller to review the patch, since he might have more insight on the l2pop out of all of us here.
15:37:27 Swami, sure
15:37:33 anilvenkata: thanks
15:37:40 Swami, thanks Swami
15:38:00 Swami, haleyb your review for this patch can also help
15:38:08 anilvenkata: sure
15:38:15 any other bugs?
15:38:28 jschwarz: anything on the HA/DVR front to talk about?
15:38:33 yes
15:38:56 so during the midcycle me and Jakub discussed with armax about changing the gate a bit
15:39:03 making fullstack voting, etc
15:39:15 also, we want to have a DVR+HA gate
15:39:21 so... that happened :)
15:39:36 jschwarz: but don't we need an additional node for DVR+HA in gate.
15:39:50 Swami, the DVR multinode job already runs with 2 nodes, no?
15:39:55 2 nodes is enough for AH
15:39:56 HA*
15:40:13 (as long as the agents there are dvr_snat)
15:40:27 jschwarz: But will it not be useful to have three nodes, two network and one compute alone to segregate things.
15:41:00 jschwarz: otherwise we will not be testing the 'dvr' agent type in the gate.
15:41:13 Swami, I see your point
15:41:20 everything will be tested against the 'dvr_snat' agent type.
15:41:27 that is a good question to make: can we change the multinode-dvr to have 3 nodes instead of 2
15:42:00 also, anilvenkata's work on l2pop is blocking this as well, as currently HA+l2pop is broken pretty bad and anilvenkata's patch seems like a magic pill full of happiness :)
15:42:14 jschwarz: Also I am seeing another issue with respect to the DVR functional tests where we see random failures with respect to HA+DVR. This is related to some stale namespace that is being used.
15:42:17 jschwarz, thanks :)
15:42:24 jschwarz: if you get a chance can you take a look at it.
15:42:36 Swami, can you link me the bug please?
15:42:38 or some logs?
15:43:03 jschwarz: I have not filed a bug yet, but I have been seeing even in my local setup that it randomly fails.
15:43:14 Swami, oh, that's always fun
15:43:24 Swami, let's file a bug report with reproduction steps and I'll put some cycles on it
15:43:27 You can reproduce it with a test function that I had that would remove the stale namespace.
15:43:44 jschwarz: I will send you the test function that I had to reproduce it.
15:43:51 Swami, sounds good to me
15:43:56 jschwarz: thanks
15:44:03 Swami, also, re: the current multinode-dvr job
15:44:12 jschwarz: yes
15:44:19 is it running one dvr_snat node and one dvr node?
15:44:33 jschwarz: Yes one is dvr_snat and the other is dvr
15:44:43 ok
15:45:02 Swami, I'll ask around and see the feasibility of turning that gate into a 3-node
15:45:11 jschwarz: yes that would help
15:45:13 somehow, I imagine #openstack-infra to not like that idea at all :)
15:45:46 if we already started jumping into HA territory
15:45:47 jschwarz: well, start with an experimental one, since we probably even want to disable compute on the snat nodes possibly
15:45:54 jschwarz: that would be a tough job
15:46:15 haleyb, ack
15:46:28 anyway, re: HA, we have https://bugs.launchpad.net/neutron/+bug/1609738
15:46:28 Launchpad bug 1609738 in neutron "l3-ha: a router can be stuck in the ALLOCATING state" [Undecided,In progress] - Assigned to John Schwarz (jschwarz)
15:46:29 jschwarz: yes, infra won't like it, but how else do we test this, and not introduce new bugs?
15:46:39 which is being worked on in https://review.openstack.org/#/c/357965/2 and following patches
15:46:50 haleyb, it's not something we test at all.
15:47:04 haleyb, unless we count on functional only
15:47:11 no actual scenario jobs :\
15:47:19 exactly, maybe you do internally, and maybe us...
15:47:34 I don't like the "internally testing" idea
15:47:47 if you guys or us benefit of it, we should also share the happiness
15:47:53 isn't that the opensource way? :P
15:48:18 yes, we need a scenario test, even if it's experimental
15:48:27 haleyb, so here's the scenario test: https://review.openstack.org/#/c/340383/
15:48:31 haleyb: yes that would help.
15:48:35 * jschwarz has come prepared!
15:49:00 i even +1'd it already :)
15:49:06 haleyb, Swami, basically it's an actual tempest scenario test that is supposed to check DVR
15:49:39 it's blocked by allowing tempest jobs to run neutron-based scenario jobs, but other than that Genadi would sure like some love
15:49:59 jschwarz: got it.
15:50:16 the link for that is https://review.openstack.org/#/c/355344/
15:50:27 jschwarz: but i think you know what i mean - with a functional test for HA/DVR, when a change to HA is proposed having a 'check experimental' would be nice. eventually it could be promoted. i'll get off my box now
15:50:51 haleyb, that sounds like the first step in the right direction
15:51:03 haleyb, probably #openstack-infra will like that one as well :)
15:51:37 and armando - having untested code in the tree is bad
15:51:40 anyway, if those 2 patches go in we'll get some much-needed DVR coverage in the stable branches
15:52:02 haleyb, Swami there is a fullstack test for DVR functionality https://review.openstack.org/#/c/318307/
15:52:16 I'll get off my podium now, I think that's all I had in mind for this meeting
15:52:44 jschwarz: thanks for your time
15:52:47 jschwarz: so one more for you - https://bugs.launchpad.net/neutron/+bug/1597461 - i know there's other HA patches in-flight, but you mentioned in a review none would fix this
15:52:47 Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,In progress] - Assigned to venkata anil (anil-venkata)
15:52:55 anilvenkata: thanks will take a look at it.
15:53:01 Swami, thanks
15:53:02 haleyb, yes
15:53:17 haleyb, so Ann and your new best friend, anilvenkata, are working on that
15:53:37 jschwarz: give me something to +2 :)
15:53:49 :)
15:53:58 haleyb, https://review.openstack.org/#/c/357458/ is WIP
15:54:02 but I hope it will be ready soon
15:54:06 seriously, let me know when you have something, oh there it is
15:54:14 XD
15:54:38 https://review.openstack.org/#/c/357458/ is ready for review
15:54:44 haleyb, Swami ^
15:54:57 anilvenkata, so you guys should probably drop the 'WIP' tag from the patch
15:54:59 anilvenkata: thanks, i haven't gotten to email yet today to see it was updated
15:55:00 jschwarz, hope u will take back your -1
15:55:06 anilvenkata: jschwarz: thanks for the links
15:55:29 that is taken back, now it is ready
15:55:37 anilvenkata, I was sure I did, sorry
15:55:44 jschwarz, :)
15:56:21 i'm going to skip gate failures today as we're all scr*wed right now in that regard
15:56:29 #topic Open Discussion
15:56:52 anything else to discuss? have ~4 minutes
15:57:02 armax also mentioned something about possibly deprecating legacy routers in the midcycle
15:57:13 haleyb: did you get a chance to take a look at the fast-path-exit patch.
15:57:19 this got me wondering about migration of legacy routers to DVR, or lack thereof
15:57:38 jschwarz: we already have a way to migrate the legacy to DVR.
15:57:52 jschwarz: that is admin only option to migrate the legacy to DVR.
15:57:56 Swami: not yet, i've been trying to help with the gate issues, problem with ipv6 right now, but i promise to look before EOD
15:58:17 Swami, https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L504
15:58:23 this specifically disables it?
15:58:27 haleyb: no problem
15:58:47 I need to clone myself
15:59:09 or work smarter, not harder, so 90's
15:59:24 work like John
15:59:26 :)
15:59:32 lot of multi tasking
15:59:41 jschwarz: got it, if there are any services associated or HA configured for the legacy routers, probably it should be removed and then migrated.
16:00:00 Swami, but what if I want to migrate a legacy to HA+DVR?
16:00:02 anilvenkata: i'm already on the phone in another meeting :)
16:00:06 #endmeeting