15:00:43 <haleyb> #startmeeting neutron_dvr
15:00:47 <openstack> Meeting started Wed Aug 3 15:00:43 2016 UTC and is due to finish in 60 minutes. The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:48 <obondarev> o/
15:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:51 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:52 <haleyb> #chair Swami
15:00:53 <openstack> Current chairs: Swami haleyb
15:01:48 <haleyb> #topic Announcements
15:02:49 <haleyb> I wanted to talk a little about the namespace.exists() changes but got side-tracked. will update the agenda while swami talks about bugs
15:02:55 <haleyb> #topic Bugs
15:03:19 <Swami> haleyb: thanks will talk about it.
15:03:28 <Swami> There were no new bugs this week.
15:03:56 <haleyb> there was 1609217, but i just triaged it :)
15:03:56 <Swami> But there was one bug that was initially tagged with l3-ha and now it has been tagged to l3-dvr-backlog.
15:04:12 <Swami> https://bugs.launchpad.net/neutron/+bug/1602320
15:04:12 <openstack> Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Undecided,In progress] - Assigned to Dongcan Ye (hellochosen)
15:05:13 <Swami> Just reading the bug log.
15:06:33 <Swami> Yes I think the bug 1609217 is not a bug.
15:06:33 <openstack> bug 1609217 in neutron "DVR: dvr router should not exist in not-binded network node" [Undecided,New] https://launchpad.net/bugs/1609217
15:07:04 <Swami> https://bugs.launchpad.net/neutron/+bug/1602794
15:07:04 <openstack> Launchpad bug 1602794 in neutron "ItemAllocator class can throw a ValueError when file is corrupted" [High,In progress] - Assigned to Brian Haley (brian-haley)
15:07:23 <haleyb> We need another core to +2 that
15:07:30 <Swami> This patch should be merged soon.
15:07:44 <Swami> https://bugs.launchpad.net/neutron/+bug/1602614
15:07:44 <openstack> Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [Undecided,In progress] - Assigned to venkata anil (anil-venkata)
15:08:02 <Swami> This week I did not have any time to triage these bugs.
15:08:52 <Swami> As I mentioned earlier we have a bunch of L3/HA/DVR bugs that have to be triaged and fixed and I am not sure if they all depend on anil-venkata's changes that he mentioned last week on the generic port-binding.
15:09:06 <Swami> https://bugs.launchpad.net/neutron/+bug/1597461
15:09:06 <openstack> Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Confirmed] - Assigned to Ann Taraday (akamyshnikova)
15:09:27 <Swami> https://bugs.launchpad.net/neutron/+bug/1593354
15:09:27 <openstack> Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New]
15:09:30 <jschwarz> I was supposed to look at that but didn't get there
15:09:36 <jschwarz> (on 1597461)
15:10:01 <haleyb> jschwarz: i can reproduce 1597461 without a reboot
15:10:12 <haleyb> and without killing the agent
15:10:16 <Swami> jschwarz: fine, you can update the bug report once you have a solution.
15:10:24 <jschwarz> haleyb, us too - we have guys who also got into that state a while back
15:11:00 <haleyb> jschwarz: i was going to continue looking into it, but will gladly defer to you
15:11:21 <jschwarz> haleyb, I will do my best to allocate some time for this in the coming days
15:11:40 <jschwarz> l3 ha scheduler races seem to have stabilized a bit so..
15:11:56 <Swami> jschwarz: good
15:12:24 <Swami> jschwarz: Has your patch for generalizing the auto-scheduler landed or not?
15:12:34 <jschwarz> Swami, not yet
15:12:54 <jschwarz> Swami, basically I was working on a CaS mechanism to lock the router's schedule() operations to ensure non-concurrency
15:13:03 <Swami> jschwarz: does it have any impact on any of the l3 ha dvr snat ha issues that we are talking about?
15:13:15 <jschwarz> it got stuck for a while but me and kevinbenton settled on a path in the middle
15:13:19 <jschwarz> I will resume work on that soon
15:13:29 <Swami> jschwarz: ok thanks
15:13:36 <haleyb> jschwarz: let me know if i can help, i'll leave my systems in place to test patches (on the dual-master bug)
15:13:54 <jschwarz> Swami, I don't think it impacts, but I need to have a deeper look at those bug reports. can we sync after this meeting?
15:14:04 <jschwarz> haleyb, excellent, thanks :)
15:14:34 <Swami> jschwarz: I have a meeting in the next hour, if you can ping me later we can sync up, or else tomorrow, since you are in a different time zone.
15:14:54 <jschwarz> Swami, ack
15:15:00 <Swami> jschwarz: thanks
15:15:41 <Swami> haleyb: I don't have any new bugs to discuss. Maybe you can jump to the namespace.exists issue that you would like to discuss.
15:16:44 <haleyb> Swami: ok, just pushing update, but will just paste here
15:17:12 <Swami> haleyb: ok will take a look at the diff
15:17:51 <haleyb> obondarev: i put a comment in https://review.openstack.org/#/c/348372 noting how it is similar to https://review.openstack.org/#/c/309050/ (and related to swami's https://review.openstack.org/#/c/326729/)
15:18:30 <obondarev> haleyb: yep, saw that
15:18:33 <haleyb> Can we discuss the first two and select one? They seem to do the same thing
15:18:52 <Swami> haleyb: ok, no problem.
15:18:53 <obondarev> haleyb: sure
15:19:27 <obondarev> they are fixing the problem on different levels
15:19:34 <obondarev> I mean the first two patches
15:19:53 <haleyb> yes, let's just focus on the first two
15:20:13 <obondarev> not sure how I feel about returning empty list from get_devices() if ns does not exist
15:20:21 <haleyb> one fixes ip_lib, the other the dvr code
15:20:27 <obondarev> as it can mean two things
15:20:39 <Swami> haleyb: yes.
15:20:40 <obondarev> no namespace or no devices
15:21:07 <haleyb> obondarev: right, but if we change all callers of get_devices() to check the namespace aren't we doing the same thing?
15:21:34 <obondarev> is it only get_devices that can fail?
15:21:50 <obondarev> I guess it’s ns.delete() as well
15:22:01 <Swami> obondarev: get_devices is one of the failures that we might see.
15:22:09 <haleyb> right, we'd still need to put a check in namespaces.py
15:22:22 <haleyb> or i think it can actually deal, just log a warning
15:22:30 <obondarev> honestly I don’t have a strong opinion for now, as just saw the alternate patch
15:23:07 <obondarev> but i think it makes sense to try to fix the issue at a higher layer
15:23:45 <obondarev> and https://review.openstack.org/#/c/309050/ is closer to it
15:23:50 <Swami> obondarev: so are we saying we are aligning with the first patch rather than the second one?
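[Editor's illustration] The ambiguity obondarev raises at 15:20 ("no namespace or no devices") can be sketched with an in-memory stand-in. This is a minimal sketch only and does not use neutron's ip_lib; NAMESPACES, NamespaceNotFound, get_devices_lenient and get_devices_strict are all hypothetical names invented for this note.

```python
class NamespaceNotFound(Exception):
    """Hypothetical error type for this sketch; not a neutron exception."""


# Hypothetical host state: namespace name -> device names in it.
NAMESPACES = {"qrouter-empty": [], "qrouter-busy": ["qr-1", "rfp-1"]}


def get_devices_lenient(namespace):
    # Returns [] for an empty namespace AND for a missing one,
    # so the caller cannot tell the two situations apart.
    return list(NAMESPACES.get(namespace, []))


def get_devices_strict(namespace):
    # Raises for a missing namespace, so [] always means
    # "the namespace exists and holds no devices".
    if namespace not in NAMESPACES:
        raise NamespaceNotFound(namespace)
    return list(NAMESPACES[namespace])
```

With the lenient variant, a caller asking about "qrouter-empty" and one asking about a namespace that was never created get the same answer; the strict variant keeps the two cases distinct at the cost of exception handling in every caller.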
15:24:42 <haleyb> i think we need both, just like we need the dvr_snat patch, but if we adopt 309050 then part of obondarev's patch goes away
15:25:44 <obondarev> haleyb: agree, I think we can iterate on https://review.openstack.org/#/c/309050/ first, then see if anything else needed
15:25:51 <jschwarz> I don't like the fact that the second patch offers ambiguous return values (empty list could be actually empty, or an error)
15:25:58 <jschwarz> raising an exception is the same behaviour as now
15:26:14 <obondarev> jschwarz: agree, that’s my concern as well
15:27:05 <Swami> obondarev: jschwarz: +1
15:27:08 <jschwarz> anyway this can be deferred until we make a decision, since it only triggers when something goes wrong in the server (some race condition, etc)
15:27:34 <obondarev> jschwarz: not only
15:28:24 <jschwarz> obondarev, when else?
15:28:29 <obondarev> jschwarz: I mean we shouldn’t think about server, agent should handle those cases anyway
15:29:07 <jschwarz> obondarev, I agree - i meant that we shouldn't feel rushed to make a decision now since it doesn't appear unless there are problems in the server (which afaik we fixed most)
15:29:33 <obondarev> jschwarz: no rush of course
15:29:59 <jschwarz> I'm thinking perhaps 309050 could catch the RuntimeError but produce an actual informative exception
15:30:08 <haleyb> to me, no rush == no decision, and we have lots of bugs to fix :(
15:30:19 <jschwarz> then that error can be caught on the calling side
15:30:22 <Swami> haleyb: agreed
15:30:32 <jschwarz> but that adds a lot of overhead code
15:31:21 <haleyb> jschwarz: well, we could add an arg to get_devices, or just add a docstring that it will return [] in certain cases and move forward
15:31:41 <jschwarz> haleyb, but it's not only get_devices - delete() can also fail
15:31:50 <jschwarz> and probably some other functions as well
15:32:17 <jschwarz> 309050 should deal with all of those ip_lib functions
15:32:19 <jschwarz> IMO
15:33:33 <obondarev> but the bug is about get_devices
15:33:40 <haleyb> jschwarz: so put a check in _as_root() ? that's the bottom but not practical i don't think
15:33:42 <Swami> jschwarz: but that patch is specific to get_devices, do you think that same patch should address all ip_lib related failures?
15:34:12 <haleyb> i agree we should put a check in the namespace deletion path as well though
15:34:29 <jschwarz> all I'm saying is that we might fix get_devices now, but somewhere along the way in a cycle or so, we're going to hate ourselves for not dealing with the other functions
15:35:25 <jschwarz> we could... add a decorator that makes sure that the namespace exists before running the ip_lib function
15:35:36 <jschwarz> and use that decorator on all the functions we think are relevant
15:36:52 <haleyb> jschwarz: that might be a long-term play, but will take a lot of auditing to get it right
15:37:32 <haleyb> we know we have an issue with get_devices() today so should fix it
15:38:15 <obondarev> we have the issue with delete as well
15:38:21 <obondarev> delete()
15:38:56 <obondarev> https://bugs.launchpad.net/neutron/+bug/1606844 will still be there if we fix only get_devices()
15:38:56 <openstack> Launchpad bug 1606844 in neutron "L3 agent constantly resyncing deleted router" [Medium,In progress] - Assigned to Oleg Bondarev (obondarev)
15:38:57 <haleyb> Namespace.delete() or RouterNamespace.delete() ? the second is calling get_devices()
15:40:09 <haleyb> obondarev: then i guess the answer is to fix this in the agent code, so your and swami's patches
15:40:39 <obondarev> haleyb: that will be faster I guess
15:40:49 <Swami> haleyb: agreed
15:40:50 <haleyb> or we get rid of the namespace cleaner utility :)
15:40:51 <jschwarz> I agree. it makes more sense that *Namespace.delete() will just pass if the namespace doesn't exist (get_devices() should fail if the namespace isn't there)
15:41:30 <obondarev> I’ll update the patch shortly then
15:41:45 <haleyb> ok, so we prioritize https://review.openstack.org/#/c/348372 and https://review.openstack.org/#/c/326729/ - they do share some code
15:42:11 <Swami> haleyb: agreed
15:42:18 <haleyb> i feel like i can take the rest of the day off :)
15:42:24 <obondarev> both adding exists(self)
15:42:39 <haleyb> yes, so there's a small conflict
15:42:49 <obondarev> well that should merge fine
15:42:56 <jschwarz> rebase Swami's on obondarev's?
15:43:04 <obondarev> I mean if code change is the same in the file
15:43:11 <jschwarz> Oleg's is smaller so it should merge faster
15:43:15 <obondarev> I don’t think we should rebase
15:43:23 <Swami> https://review.openstack.org/#/c/326729/ this is ready to merge, I think the jenkins has also passed
15:43:35 <obondarev> just what goes first - the second will rebase on master
15:43:41 <jschwarz> okies
15:43:42 <jschwarz> :)
15:43:45 <haleyb> jschwarz: swami might disagree being on PS32+
15:44:12 <haleyb> Any more bugs?
15:44:13 <Swami> haleyb: jschwarz : yep!
15:44:16 <obondarev> a bit concerned with dvr multinode in https://review.openstack.org/#/c/326729/
15:44:38 <obondarev> is the failure rate the same as in other patches?
15:45:00 <haleyb> obondarev: i was going to get to check/gate failures, probably not that one patch
15:45:03 <Swami> haleyb: you mentioned that nova was the cause for the multinode failure.
15:45:14 <haleyb> #topic Gate failures
15:45:25 <haleyb> So last week there was a nova change that broke multinode
15:45:30 <jschwarz> obondarev, it's non-voting anyway ;-)
15:45:46 <obondarev> jschwarz: haha
15:46:00 <haleyb> patch merged - https://review.openstack.org/#/c/348186/
15:46:21 <haleyb> but rate went from 100% down to ~25%
15:46:44 <Swami> haleyb: that is good.
15:47:09 <haleyb> so there's clearly something still broken in the check queue. gate queue seems ok
15:48:23 <Swami> haleyb: also I am seeing functional test failures on test_db_find_column_type_list(vsctl)
15:48:48 <haleyb> Swami: that might be yet another bug
15:49:20 <haleyb> obondarev: http://logs.openstack.org/29/326729/33/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a64ca3d/console.html is showing "Instance 3c2dc519-b630-40a7-9592-2df5e0b2659b could not be found."
15:49:50 <haleyb> that could be nova as well
15:50:17 <obondarev> haleyb: right
15:50:33 <obondarev> haleyb: I saw ssh timeout several times as well on this patch
15:51:06 <Swami> haleyb: Also the functional test has some stale namespaces that are being re-used, which has to be fixed. I will raise a bug on this.
15:51:21 <Swami> haleyb: It produces inconsistent results.
15:52:03 <haleyb> Swami: thanks
15:52:33 <obondarev> http://logs.openstack.org/29/326729/33/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a64ca3d/logs/subnode-2/screen-q-l3.txt.gz#_2016-08-03_12_17_22_640
15:53:29 <obondarev> ah, same in http://logs.openstack.org/72/348372/3/check/gate-tempest-dsvm-neutron-dvr-multinode-full/7e1e32d/logs/subnode-2/screen-q-l3.txt.gz#_2016-08-02_18_06_51_162
15:53:39 <obondarev> so it's unrelated I guess
15:55:02 <haleyb> does anyone have time to figure out if there is a common failure in the 25%?
15:55:56 <Swami> haleyb: If I have time I will take a look at it
15:56:30 <haleyb> Swami: thanks, guess we need to recruit more people to work on dvr :)
15:56:44 <obondarev> haleyb: +
15:56:49 <haleyb> #topic Open Discussion
15:56:52 <anilvenkata> obondarev, haleyb, Swami jschwarz https://review.openstack.org/#/c/255237/ is the alternate approach I proposed for DVR+HA+l2pop patches (which u all reviewed) very long back (i.e. February 2016) without any dependency on DVR, I will respin it today. Then I can abandon DVR HA portbinding patches
15:56:53 <Swami> haleyb: agreed
15:57:20 <Swami> anilvenkata: thanks, will take a look at it.
15:57:28 <anilvenkata> Swami, thanks Swami
15:57:48 <haleyb> Swami: the nova patch for live migration got a -1 - https://review.openstack.org/#/c/275073 - i'll see about updating
15:58:06 <Swami> haleyb: thanks
15:58:20 <haleyb> it looks like nits to me, but...
15:58:42 <Swami> haleyb: yes little nits, but there was a question about a negative test case.
15:59:08 <jschwarz> anilvenkata, ping me when you respin, I'll have a look :)
15:59:24 <haleyb> Swami: what is the behaviour there?
15:59:29 <anilvenkata> jschwarz, sure, thanks John
15:59:31 * haleyb notes we have :30
16:00:14 <haleyb> thanks guys, good meeting, but we're out of time
16:00:20 <haleyb> #endmeeting
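
[Editor's illustration] The outcome reached at 15:40-15:42 (delete() simply passes when the namespace is gone, get_devices() still fails, and both prioritized patches add an exists() method) can be combined with the decorator jschwarz floated at 15:35. The following is a minimal sketch under those assumptions; FakeNamespace and skip_if_namespace_missing are hypothetical names, the "host" is a plain dict, and none of this is the code in the reviews under discussion.

```python
import functools


def skip_if_namespace_missing(method):
    """Hypothetical decorator: make a method a no-op when self.exists() is false."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.exists():
            return False  # namespace already gone, nothing to do
        return method(self, *args, **kwargs)
    return wrapper


class FakeNamespace:
    """Hypothetical stand-in for an agent-side namespace wrapper."""

    def __init__(self, name, host):
        self.name = name
        self.host = host  # dict: namespace name -> list of device names

    def exists(self):
        return self.name in self.host

    def get_devices(self):
        # Deliberately undecorated: a missing namespace is an error
        # (a KeyError in this sketch).
        return list(self.host[self.name])

    @skip_if_namespace_missing
    def delete(self):
        del self.host[self.name]
        return True
```

Calling delete() twice on the same FakeNamespace returns True and then False without raising, while get_devices() on the deleted namespace still raises, which mirrors the split agreed on in the meeting.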