15:00:43 #startmeeting neutron_dvr
15:00:47 Meeting started Wed Aug 3 15:00:43 2016 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:48 o/
15:00:49 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:51 The meeting name has been set to 'neutron_dvr'
15:00:52 #chair Swami
15:00:53 Current chairs: Swami haleyb
15:01:48 #topic Announcements
15:02:49 I wanted to talk a little about the namespace.exists() changes but got side-tracked. will update the agenda while swami talks about bugs
15:02:55 #topic Bugs
15:03:19 haleyb: thanks, will talk about it.
15:03:28 There were no new bugs this week.
15:03:56 there was 1609217, but i just triaged it :)
15:03:56 But there was one bug that was initially tagged with l3-ha and has now been tagged to l3-dvr-backlog.
15:04:12 https://bugs.launchpad.net/neutron/+bug/1602320
15:04:12 Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Undecided,In progress] - Assigned to Dongcan Ye (hellochosen)
15:05:13 Just reading the bug log.
15:06:33 Yes, I think bug 1609217 is not a bug.
15:06:33 bug 1609217 in neutron "DVR: dvr router should not exist in not-binded network node" [Undecided,New] https://launchpad.net/bugs/1609217
15:07:04 https://bugs.launchpad.net/neutron/+bug/1602794
15:07:04 Launchpad bug 1602794 in neutron "ItemAllocator class can throw a ValueError when file is corrupted" [High,In progress] - Assigned to Brian Haley (brian-haley)
15:07:23 We need another core to +2 that
15:07:30 This patch should be merged soon.
15:07:44 https://bugs.launchpad.net/neutron/+bug/1602614
15:07:44 Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [Undecided,In progress] - Assigned to venkata anil (anil-venkata)
15:08:02 This week I did not have any time to triage these bugs.
15:08:52 As I mentioned earlier, we have a bunch of L3/HA/DVR bugs that have to be triaged and fixed, and I am not sure if they all depend on anil-venkata's changes that he mentioned last week on the generic port-binding.
15:09:06 https://bugs.launchpad.net/neutron/+bug/1597461
15:09:06 Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Confirmed] - Assigned to Ann Taraday (akamyshnikova)
15:09:27 https://bugs.launchpad.net/neutron/+bug/1593354
15:09:27 Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New]
15:09:30 I was supposed to look at that but didn't get there
15:09:36 (on 1597461)
15:10:01 jschwarz: i can reproduce 1597461 without a reboot
15:10:12 and without killing the agent
15:10:16 jschwarz: fine, you can update the bug report once you have a solution.
15:10:24 haleyb, us too - we have guys who also got into that state a while back
15:11:00 jschwarz: i was going to continue looking into it, but will gladly defer to you
15:11:21 haleyb, I will do my best to allocate some time for this in the coming days
15:11:40 l3 ha scheduler races seem to have stabilized a bit so..
15:11:56 jschwarz: good
15:12:24 jschwarz: Has your patch for generalizing the auto-scheduler landed or not?
15:12:34 Swami, not yet
15:12:54 Swami, basically I was working on a CaS mechanism to lock the router's schedule() operations to ensure non-concurrency
15:13:03 jschwarz: does it have any impact on any of these l3 ha dvr snat ha issues that we are talking about?
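(For reference, a rough sketch of the compare-and-swap style locking described above: make the scheduling state change conditional on the value read earlier, so two concurrent schedule() calls cannot both win. The RouterBinding model and try_claim_router() helper are illustrative names only, not the actual patch.)

    # Illustrative sketch only -- assumes a SQLAlchemy session and a
    # hypothetical RouterBinding model with router_id/state columns.
    def try_claim_router(session, router_id, old_state, new_state):
        # Conditional UPDATE: the row is changed only if it still holds
        # old_state, so exactly one concurrent caller can succeed.
        rows = (session.query(RouterBinding)
                .filter_by(router_id=router_id, state=old_state)
                .update({'state': new_state}))
        return rows == 1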
15:13:15 it got stuck for a while but me and kevinbenton settled on a path in the middle
15:13:19 I will resume work on that soon
15:13:29 jschwarz: ok thanks
15:13:36 jschwarz: let me know if i can help, i'll leave my systems in place to test patches (on the dual-master bug)
15:13:54 Swami, I don't think it impacts, but I need to have a deeper look at those bug reports. can we sync after this meeting?
15:14:04 haleyb, excellent, thanks :)
15:14:34 jschwarz: I have a meeting in the next hour; if you can ping me later we can sync up, or else tomorrow, since you are in a different time zone.
15:14:54 Swami, ack
15:15:00 jschwarz: thanks
15:15:41 haleyb: I don't have any new bugs to discuss. Maybe you can jump on to the namespace.exists issue that you would like to discuss.
15:16:44 Swami: ok, just pushing an update, but will just paste here
15:17:12 haleyb: ok, will take a look at the diff
15:17:51 obondarev: i put a comment in https://review.openstack.org/#/c/348372 noting how it is similar to https://review.openstack.org/#/c/309050/ (and related to swami's https://review.openstack.org/#/c/326729/)
15:18:30 haleyb: yep, saw that
15:18:33 Can we discuss the first two and select one? They seem to do the same thing
15:18:52 haleyb: ok, no problem.
15:18:53 haleyb: sure
15:19:27 they are fixing the problem on different levels
15:19:34 I mean the first two patches
15:19:53 yes, let's just focus on the first two
15:20:13 not sure how I feel about returning an empty list from get_devices() if the ns does not exist
15:20:21 one fixes ip_lib, the other the dvr code
15:20:27 as it can mean two things
15:20:39 haleyb: yes.
15:20:40 no namespace or no devices
15:21:07 obondarev: right, but if we change all callers of get_devices() to check the namespace aren't we doing the same thing?
15:21:34 is it only get_devices that can fail?
15:21:50 I guess it’s ns.delete() as well
15:22:01 obondarev: get_devices is one of the failures that we might see.
15:22:09 right, we'd still need to put a check in namespaces.py
15:22:22 or i think it can actually deal, just log a warning
15:22:30 honestly I don’t have a strong opinion for now, as I just saw the alternate patch
15:23:07 but i think it makes sense to try to fix the issue at a higher layer
15:23:45 and https://review.openstack.org/#/c/309050/ is closer to it
15:23:50 obondarev: so are we saying we are aligning with the first patch rather than the second one?
15:24:42 i think we need both, just like we need the dvr_snat patch, but if we adopt 309050 then part of obondarev's patch goes away
15:25:44 haleyb: agree, I think we can iterate on https://review.openstack.org/#/c/309050/ first, then see if anything else is needed
15:25:51 I don't like the fact that the second patch offers ambiguous return values (an empty list could be actually empty, or an error)
15:25:58 raising an exception is the same behaviour as now
15:26:14 jschwarz: agree, that’s my concern as well
15:27:05 obondarev: jschwarz: +1
15:27:08 anyway this can be deferred until we make a decision, since it only triggers when something goes wrong in the server (some race condition, etc)
15:27:34 jschwarz: not only
15:28:24 obondarev, when else?
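(To make the trade-off above concrete, a minimal sketch of the ip_lib-level option: raise a distinct error when the namespace is gone instead of returning an empty list. The exception class and helper are illustrative, not the contents of 309050; only IPWrapper and its netns.exists()/get_devices() calls are existing ip_lib API.)

    # Illustrative sketch, not the actual 309050 change.
    from neutron.agent.linux import ip_lib


    class NamespaceNotFound(RuntimeError):
        """Lets callers tell 'namespace is gone' apart from 'no devices'."""


    def get_devices_or_raise(namespace):
        # Report a missing namespace explicitly rather than as an empty
        # device list, which a caller could misread as "no devices".
        if not ip_lib.IPWrapper().netns.exists(namespace):
            raise NamespaceNotFound(namespace)
        return ip_lib.IPWrapper(namespace=namespace).get_devices()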
15:28:29 jschwarz: I mean we shouldn’t think about the server, the agent should handle those cases anyway
15:29:07 obondarev, I agree - i meant that we shouldn't feel rushed to make a decision now since it doesn't appear unless there are problems in the server (which afaik we fixed most of)
15:29:33 jschwarz: no rush of course
15:29:59 I'm thinking perhaps 309050 could catch the RuntimeError but produce an actual informative exception
15:30:08 to me, no rush == no decision, and we have lots of bugs to fix :(
15:30:19 then that error can be caught on the calling side
15:30:22 haleyb: agreed
15:30:32 but that adds a lot of overhead code
15:31:21 jschwarz: well, we could add an arg to get_devices, or just add a docstring that it will return [] in certain cases and move forward
15:31:41 haleyb, but it's not only get_devices - delete() can also fail
15:31:50 and probably some other functions as well
15:32:17 309050 should deal with all of those ip_lib functions
15:32:19 IMO
15:33:33 but the bug is about get_devices
15:33:40 jschwarz: so put a check in _as_root()? that's the bottom but not practical i don't think
15:33:42 jschwarz: but that patch is specific to get_devices, do you think that same patch should address all ip_lib related failures?
15:34:12 i agree we should put a check in the namespace deletion path as well though
15:34:29 all I'm saying is that we might fix get_devices now, but somewhere along the way in a cycle or so, we're going to hate ourselves for not dealing with the other functions
15:35:25 we could... add a decorator that makes sure that the namespace exists before running the ip_lib function
15:35:36 and use that decorator on all the functions we think are relevant
15:36:52 jschwarz: that might be a long-term play, but will take a lot of auditing to get it right
15:37:32 we know we have an issue with get_devices() today so should fix it
15:38:15 we have the issue with delete as well
15:38:21 delete()
15:38:56 https://bugs.launchpad.net/neutron/+bug/1606844 will still be there if we fix only get_devices()
15:38:56 Launchpad bug 1606844 in neutron "L3 agent constantly resyncing deleted router" [Medium,In progress] - Assigned to Oleg Bondarev (obondarev)
15:38:57 Namespace.delete() or RouterNamespace.delete()? the second is calling get_devices()
15:40:09 obondarev: then i guess the answer is to fix this in the agent code, so your and swami's patches
15:40:39 haleyb: that will be faster I guess
15:40:49 haleyb: agreed
15:40:50 or we get rid of the namespace cleaner utility :)
15:40:51 I agree. it makes more sense that *Namespace.delete() will just pass if the namespace doesn't exist (get_devices() should fail if the namespace isn't there)
15:41:30 I’ll update the patch shortly then
15:41:45 ok, so we prioritize https://review.openstack.org/#/c/348372 and https://review.openstack.org/#/c/326729/ - they do share some code
15:42:11 haleyb: agreed
15:42:18 i feel like i can take the rest of the day off :)
15:42:24 both adding exists(self)
15:42:39 yes, so there's a small conflict
15:42:49 well, that should merge fine
15:42:56 rebase Swami's on obondarev's?
15:43:04 I mean if the code change is the same in the file
15:43:11 Oleg's is smaller so it should merge faster
15:43:15 I don’t think we should rebase
15:43:23 https://review.openstack.org/#/c/326729/ this is ready to merge, I think jenkins has also passed
15:43:35 just what goes first - the second will rebase on master
15:43:41 okies
15:43:42 :)
15:43:45 jschwarz: swami might disagree, being on PS32+
15:44:12 Any more bugs?
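(For reference, a rough sketch of the "check the namespace first" decorator idea floated above; the decorator name and the choice to return a default value instead of raising are illustrative, not an agreed design.)

    # Illustrative sketch of guarding ip_lib-backed methods.
    import functools

    from neutron.agent.linux import ip_lib


    def if_namespace_exists(default=None):
        """Skip the wrapped call, returning 'default', when the object's
        namespace no longer exists, instead of letting 'ip netns exec'
        fail with a RuntimeError."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, *args, **kwargs):
                # Assumes the wrapped object exposes 'self.namespace'.
                if (self.namespace and
                        not ip_lib.IPWrapper().netns.exists(self.namespace)):
                    return default
                return func(self, *args, **kwargs)
            return wrapper
        return decorator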
15:44:13 haleyb: jschwarz : yep!
15:44:16 a bit concerned with dvr multinode in https://review.openstack.org/#/c/326729/
15:44:38 is the failure rate the same as in other patches?
15:45:00 obondarev: i was going to get to check/gate failures, probably not that one patch
15:45:03 haleyb: you mentioned that nova was the cause for the multinode failure.
15:45:14 #topic Gate failures
15:45:25 So last week there was a nova change that broke multinode
15:45:30 obondarev, it's non-voting anyway ;-)
15:45:46 jschwarz: haha
15:46:00 patch merged - https://review.openstack.org/#/c/348186/
15:46:21 but the rate went from 100% down to ~25%
15:46:44 haleyb: that is good.
15:47:09 so there's clearly something still broken in the check queue. gate queue seems ok
15:48:23 haleyb: also I am seeing functional test failures on test_db_find_column_type_list(vsctl)
15:48:48 Swami: that might be yet another bug
15:49:20 obondarev: http://logs.openstack.org/29/326729/33/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a64ca3d/console.html is showing "Instance 3c2dc519-b630-40a7-9592-2df5e0b2659b could not be found."
15:49:50 that could be nova as well
15:50:17 haleyb: right
15:50:33 haleyb: I saw ssh timeouts several times as well on this patch
15:51:06 haleyb: Also the functional tests have some stale namespaces that are being re-used; that has to be fixed. I will raise a bug on this.
15:51:21 haleyb: It produces inconsistent results.
15:52:03 Swami: thanks
15:52:33 http://logs.openstack.org/29/326729/33/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a64ca3d/logs/subnode-2/screen-q-l3.txt.gz#_2016-08-03_12_17_22_640
15:53:29 ah, same in http://logs.openstack.org/72/348372/3/check/gate-tempest-dsvm-neutron-dvr-multinode-full/7e1e32d/logs/subnode-2/screen-q-l3.txt.gz#_2016-08-02_18_06_51_162
15:53:39 so it's unrelated I guess
15:55:02 does anyone have time to figure out if there is a common failure in the 25%?
15:55:56 haleyb: If I have time I will take a look at it
15:56:30 Swami: thanks, guess we need to recruit more people to work on dvr :)
15:56:44 haleyb: +
15:56:49 #topic Open Discussion
15:56:52 obondarev, haleyb, Swami, jschwarz: https://review.openstack.org/#/c/255237/ is the alternate approach I proposed for the DVR+HA+l2pop patches (which u all reviewed) very long back (i.e. February 2016), without any dependency on DVR. I will respin it today. Then I can abandon the DVR HA portbinding patches
15:56:53 haleyb: agreed
15:57:20 anilvenkata: thanks, will take a look at it.
15:57:28 Swami, thanks Swami
15:57:48 Swami: the nova patch for live migration got a -1 - https://review.openstack.org/#/c/275073 - i'll see about updating
15:58:06 haleyb: thanks
15:58:20 it looks like nits to me, but...
15:58:42 haleyb: yes, little nits, but there was a question about a negative test case.
15:59:08 anilvenkata, ping me when you respin, I'll have a look :)
15:59:24 Swami: what is the behaviour there?
15:59:29 jschwarz, sure, thanks John
15:59:31 * haleyb notes we have :30
16:00:14 thanks guys, good meeting, but we're out of time
16:00:20 #endmeeting