15:00:43 <haleyb> #startmeeting neutron_dvr
15:00:47 <openstack> Meeting started Wed Aug  3 15:00:43 2016 UTC and is due to finish in 60 minutes.  The chair is haleyb. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:48 <obondarev> o/
15:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:51 <openstack> The meeting name has been set to 'neutron_dvr'
15:00:52 <haleyb> #chair Swami
15:00:53 <openstack> Current chairs: Swami haleyb
15:01:48 <haleyb> #topic Announcements
15:02:49 <haleyb> I wanted to talk a little about the namespace.exists() changes but got side-tracked.  will update the agenda while swami talks about bugs
15:02:55 <haleyb> #topic Bugs
15:03:19 <Swami> haleyb: thanks will talk about it.
15:03:28 <Swami> There were no new bugs this week.
15:03:56 <haleyb> there was 1609217, but i just triaged it :)
15:03:56 <Swami> But there was one bug that was initially tagged with l3-ha and now it has been tagged to l3-dvr-backlog.
15:04:12 <Swami> https://bugs.launchpad.net/neutron/+bug/1602320
15:04:12 <openstack> Launchpad bug 1602320 in neutron "ha + distributed router: keepalived process kill vrrp child process" [Undecided,In progress] - Assigned to Dongcan Ye (hellochosen)
15:05:13 <Swami> Just reading the bug log.
15:06:33 <Swami> Yes I think the bug 1609217 is not a bug.
15:06:33 <openstack> bug 1609217 in neutron "DVR: dvr router should not exist in not-binded network node" [Undecided,New] https://launchpad.net/bugs/1609217
15:07:04 <Swami> https://bugs.launchpad.net/neutron/+bug/1602794
15:07:04 <openstack> Launchpad bug 1602794 in neutron "ItemAllocator class can throw a ValueError when file is corrupted" [High,In progress] - Assigned to Brian Haley (brian-haley)
15:07:23 <haleyb> We need another core to +2 that
15:07:30 <Swami> This patch should be merged soon.
15:07:44 <Swami> https://bugs.launchpad.net/neutron/+bug/1602614
15:07:44 <openstack> Launchpad bug 1602614 in neutron "DVR + L3 HA loss during failover is higher that it is expected" [Undecided,In progress] - Assigned to venkata anil (anil-venkata)
15:08:02 <Swami> This week I did not have any time to triage these bugs.
15:08:52 <Swami> As I mentioned earlier we have a bunch of L3/HA/DVR bugs that have to be triaged and fixed and I am not sure if they all depend on anil-venkata's changes that he mentioned last week on the generic port-binding.
15:09:06 <Swami> https://bugs.launchpad.net/neutron/+bug/1597461
15:09:06 <openstack> Launchpad bug 1597461 in neutron "L3 HA: 2 masters after reboot of controller" [High,Confirmed] - Assigned to Ann Taraday (akamyshnikova)
15:09:27 <Swami> https://bugs.launchpad.net/neutron/+bug/1593354
15:09:27 <openstack> Launchpad bug 1593354 in neutron "SNAT HA failed because of missing nat rule in snat namespace iptable" [Undecided,New]
15:09:30 <jschwarz> I was supposed to look at that but didn't get there
15:09:36 <jschwarz> (on 1597461)
15:10:01 <haleyb> jschwarz: i can reproduce 1597461 without a reboot
15:10:12 <haleyb> and without killing the agent
15:10:16 <Swami> jschwarz: fine, you can update the bug report once you have a solution.
15:10:24 <jschwarz> haleyb, us too - we have guys who also got into that state a while back
15:11:00 <haleyb> jschwarz: i was going to continue looking into it, but will gladly defer to you
15:11:21 <jschwarz> haleyb, I will do my best to allocate some time for this in the coming days
15:11:40 <jschwarz> l3 ha scheduler races seem to have stabilized a bit so..
15:11:56 <Swami> jschwarz: good
15:12:24 <Swami> jschwarz: Has your patch for generalizing the auto-scheduler landed or not?
15:12:34 <jschwarz> Swami, not yet
15:12:54 <jschwarz> Swami, basically I was working on a CaS mechanism to lock the router's schedule() operations to ensure non-concurrency
15:13:03 <Swami> jschwarz: does it have any impact on any of these l3 ha dvr snat ha issues that we are talking about?
15:13:15 <jschwarz> it got stuck for a while but kevinbenton and I settled on a path in the middle
15:13:19 <jschwarz> I will resume work on that soon
15:13:29 <Swami> jschwarz: ok thanks
15:13:36 <haleyb> jschwarz: let me know if i can help, i'll leave my systems in place to test patches (on the dual-master bug)
15:13:54 <jschwarz> Swami, I don't think it impacts, but I need to have a deeper look at those bug reports. can we sync after this meeting?
15:14:04 <jschwarz> haleyb, excellent, thanks :)
15:14:34 <Swami> jschwarz: I have a meeting in the next hour. If you can ping me later we can sync up, or else tomorrow, since you are in a different time zone.
15:14:54 <jschwarz> Swami, ack
15:15:00 <Swami> jschwarz: thanks
15:15:41 <Swami> haleyb: I don't have any new bugs to discuss. Maybe you can jump to the namespace.exists issue that you would like to discuss.
15:16:44 <haleyb> Swami: ok, just pushing update, but will just paste here
15:17:12 <Swami> haleyb: ok will take a look at the diff
15:17:51 <haleyb> obondarev: i put a comment in  https://review.openstack.org/#/c/348372 noting how it is similar to  https://review.openstack.org/#/c/309050/ (and related to swami's https://review.openstack.org/#/c/326729/)
15:18:30 <obondarev> haleyb: yep, saw that
15:18:33 <haleyb> Can we discuss the first two and select one?  They seem to do the same thing
15:18:52 <Swami> haleyb: ok, no problem.
15:18:53 <obondarev> haleyb: sure
15:19:27 <obondarev> they are fixing the problem on different levels
15:19:34 <obondarev> I mean the first two patches
15:19:53 <haleyb> yes, let's just focus on the first two
15:20:13 <obondarev> not sure how I feel about returning empty list from get_devices() if ns does not exist
15:20:21 <haleyb> one fixes ip_lib, the other the dvr code
15:20:27 <obondarev> as it can mean two things
15:20:39 <Swami> haleyb: yes.
15:20:40 <obondarev> no namespace or no devices
15:21:07 <haleyb> obondarev: right, but if we change all callers of get_devices() to check the namespace aren't we doing the same thing?
15:21:34 <obondarev> is it only get_devices that can fail?
15:21:50 <obondarev> I guess it’s ns.delete() as well
15:22:01 <Swami> obondarev: get_devices is one of the failures that we might see.
15:22:09 <haleyb> right, we'd still need to put a check in namespaces.py
15:22:22 <haleyb> or i think it can actually deal, just log a warning
15:22:30 <obondarev> honestly I don’t have a strong opinion for now, as just saw the alternate patch
15:23:07 <obondarev> but i think it makes sense to try fix the issue at a higher layer
15:23:45 <obondarev> and https://review.openstack.org/#/c/309050/ is closer to it
15:23:50 <Swami> obondarev: so are we saying we are aligning with the first patch rather than the second one.
15:24:42 <haleyb> i think we need both, just like we need the dvr_snat patch, but if we adopt 309050 then part of obondarev's patch goes away
15:25:44 <obondarev> haleyb: agree, I think we can iterate on https://review.openstack.org/#/c/309050/ first, then see if anything else needed
15:25:51 <jschwarz> I don't like the fact that the second patch offers ambiguous return values (empty list could be actually empty, or an error)
15:25:58 <jschwarz> raising an exception is the same behaviour as now
15:26:14 <obondarev> jschwarz: agree, that’s my concern as well
15:27:05 <Swami> obondarev: jschwarz: +1
15:27:08 <jschwarz> anyway this can be deferred until we make a decision, since it only triggers when something goes wrong in the server (some race condition, etc)
15:27:34 <obondarev> jschwarz: not only
15:28:24 <jschwarz> obondarev, when else?
15:28:29 <obondarev> jschwarz: I mean we shouldn’t think about server, agent should handle those cases anyway
15:29:07 <jschwarz> obondarev, I agree - i meant that we shouldn't feel rushed to make a decision now since it doesn't appear unless there are problems in the server (most of which afaik we fixed)
15:29:33 <obondarev> jschwarz: no rush of course
15:29:59 <jschwarz> I'm thinking perhaps 309050 could catch the RuntimeError but produce an actual informative exception
15:30:08 <haleyb> to me, no rush == no decision, and we have lots of bugs to fix :(
15:30:19 <jschwarz> then that error can be caught on the calling side
15:30:22 <Swami> haleyb: agreed
15:30:32 <jschwarz> but that adds a lot of overhead code
15:31:21 <haleyb> jschwarz: well, we could add an arg to get_devices, or just add a docstring that it will return [] in certain cases and move forward
15:31:41 <jschwarz> haleyb, but it's not only get_devices - delete() can also fail
15:31:50 <jschwarz> and probably some other functions as well
15:32:17 <jschwarz> 309050 should deal with all of those ip_lib functions
15:32:19 <jschwarz> IMO
15:33:33 <obondarev> but the bug is about get_devices
15:33:40 <haleyb> jschwarz: so put a check in _as_root() ?  that's the bottom but not practical i don't think
15:33:42 <Swami> jschwarz: but that patch is specific to get_devices, do you think that same patch should address all ip_lib related failures.
15:34:12 <haleyb> i agree we should put a check in the namespace deletion path as well though
15:34:29 <jschwarz> all I'm saying is that we might fix get_devices now, but somewhere along the way in a cycle or so, we're going to hate ourselves for not dealing with the other functions
15:35:25 <jschwarz> we could... add a decorator that makes sure that the namespace exists before running the ip_lib function
15:35:36 <jschwarz> and use that decorator on all the functions we think are relevant
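The decorator idea floated here can be sketched roughly as follows. This is a minimal illustration only: `EXISTING_NAMESPACES`, `NamespaceNotFound`, `requires_namespace` and `FakeIPWrapper` are all hypothetical names invented for the example and are not part of neutron's ip_lib; a real check would query the host rather than an in-memory set.

```python
import functools

# Hypothetical stand-in for the namespaces present on the host.
EXISTING_NAMESPACES = {"qrouter-1"}


class NamespaceNotFound(Exception):
    """Hypothetical error raised when the target namespace is missing."""


def requires_namespace(func):
    """Verify that self.namespace exists before running the wrapped method."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if self.namespace not in EXISTING_NAMESPACES:
            # An explicit error avoids the ambiguous "empty list" result.
            raise NamespaceNotFound(self.namespace)
        return func(self, *args, **kwargs)
    return wrapper


class FakeIPWrapper:
    """Toy class for illustration; not the real ip_lib wrapper."""

    def __init__(self, namespace):
        self.namespace = namespace

    @requires_namespace
    def get_devices(self):
        # A real implementation would list the devices in the namespace.
        return []
```

With this shape a caller can tell "namespace exists but has no devices" (an empty list) apart from "namespace is gone" (the exception), which is the ambiguity raised earlier in the discussion.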
15:36:52 <haleyb> jschwarz: that might be a long-term play, but will take a lot of auditing to get it right
15:37:32 <haleyb> we know we have an issue with get_devices() today so should fix it
15:38:15 <obondarev> we have the issue with delete as well
15:38:21 <obondarev> delete()
15:38:56 <obondarev> https://bugs.launchpad.net/neutron/+bug/1606844 will still be there if we fix only get_devices()
15:38:56 <openstack> Launchpad bug 1606844 in neutron "L3 agent constantly resyncing deleted router" [Medium,In progress] - Assigned to Oleg Bondarev (obondarev)
15:38:57 <haleyb> Namespace.delete() or RouterNamespace.delete() ?  the second is calling get_devices()
15:40:09 <haleyb> obondarev: then i guess the answer is to fix this in the agent code, so your and swami's patches
15:40:39 <obondarev> haleyb: that will be faster I guess
15:40:49 <Swami> haleyb: agreed
15:40:50 <haleyb> or we get rid of the namespace cleaner utility :)
15:40:51 <jschwarz> I agree. it makes more sense that *Namespace.delete() will just pass if the namespace doesn't exist (get_devices() should fail if the namespace isn't there)
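The behaviour agreed on here (delete() tolerates a missing namespace while get_devices() still fails loudly) might look like the sketch below. `FakeNamespace` and its in-memory `host_namespaces` dict are assumptions made for illustration; only the `exists()` method name comes from the patches under discussion, and none of this is the actual agent code.

```python
import logging

LOG = logging.getLogger(__name__)


class FakeNamespace:
    """Toy namespace wrapper; the dict maps namespace name -> device names."""

    def __init__(self, name, host_namespaces):
        self.name = name
        self._host = host_namespaces

    def exists(self):
        return self.name in self._host

    def get_devices(self):
        # Fail loudly: an empty list must only ever mean "no devices".
        if not self.exists():
            raise RuntimeError("namespace %s does not exist" % self.name)
        return list(self._host[self.name])

    def delete(self):
        # Deleting an already-missing namespace is treated as a no-op.
        if not self.exists():
            LOG.warning("Namespace %s is already gone, skipping", self.name)
            return
        for device in self.get_devices():
            LOG.debug("Unplugging %s from %s", device, self.name)
        del self._host[self.name]
```

Calling delete() twice on the same namespace then logs a warning the second time instead of raising, so a router whose namespace was removed out of band would not have to be resynced repeatedly.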
15:41:30 <obondarev> I’ll update the patch shortly then
15:41:45 <haleyb> ok, so we prioritize https://review.openstack.org/#/c/348372 and https://review.openstack.org/#/c/326729/ - they do share some code
15:42:11 <Swami> haleyb: agreed
15:42:18 <haleyb> i feel like i can take the rest of the day off :)
15:42:24 <obondarev> both adding exists(self)
15:42:39 <haleyb> yes, so there's a small conflict
15:42:49 <obondarev> well that should merge fine
15:42:56 <jschwarz> rebase Swami's on obondarev's?
15:43:04 <obondarev> I mean if code change is the same in the file
15:43:11 <jschwarz> Oleg's is smaller so it should merge faster
15:43:15 <obondarev> I don’t think we should rebase
15:43:23 <Swami> https://review.openstack.org/#/c/326729/ this is ready to merge, I think the jenkins has also passed
15:43:35 <obondarev> whichever goes first, the second will just rebase on master
15:43:41 <jschwarz> okies
15:43:42 <jschwarz> :)
15:43:45 <haleyb> jschwarz: swami might disagree being on PS32+
15:44:12 <haleyb> Any more bugs?
15:44:13 <Swami> haleyb: jschwarz : yep!
15:44:16 <obondarev> a bit concerned with dvr multinode in https://review.openstack.org/#/c/326729/
15:44:38 <obondarev> is the failure rate the same as in other patches?
15:45:00 <haleyb> obondarev: i was going to get to check/gate failures, probably not that one patch
15:45:03 <Swami> haleyb: you mentioned that nova was the cause for the multinode failure.
15:45:14 <haleyb> #topic Gate failures
15:45:25 <haleyb> So last week there was a nova change that broke multinode
15:45:30 <jschwarz> obondarev, it's non-voting anyway ;-)
15:45:46 <obondarev> jschwarz: haha
15:46:00 <haleyb> patch merged  - https://review.openstack.org/#/c/348186/
15:46:21 <haleyb> but rate went from 100% down to ~25%
15:46:44 <Swami> haleyb: that is good.
15:47:09 <haleyb> so there's clearly something still broken in the check queue.  gate queue seems ok
15:48:23 <Swami> haleyb: also I am seeing functional test failures on test_db_find_column_type_list(vsctl)
15:48:48 <haleyb> Swami: that might be yet another bug
15:49:20 <haleyb> obondarev: http://logs.openstack.org/29/326729/33/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a64ca3d/console.html is showing "Instance 3c2dc519-b630-40a7-9592-2df5e0b2659b could not be found."
15:49:50 <haleyb> that could be nova as well
15:50:17 <obondarev> haleyb: right
15:50:33 <obondarev> haleyb: I saw ssh timeout several times as well on this patch
15:51:06 <Swami> haleyb: Also the functional test has some stale namespaces that are being re-used, which has to be fixed. I will raise a bug on this.
15:51:21 <Swami> haleyb: It produces inconsistent results.
15:52:03 <haleyb> Swami: thanks
15:52:33 <obondarev> http://logs.openstack.org/29/326729/33/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a64ca3d/logs/subnode-2/screen-q-l3.txt.gz#_2016-08-03_12_17_22_640
15:53:29 <obondarev> ah, same in http://logs.openstack.org/72/348372/3/check/gate-tempest-dsvm-neutron-dvr-multinode-full/7e1e32d/logs/subnode-2/screen-q-l3.txt.gz#_2016-08-02_18_06_51_162
15:53:39 <obondarev> so it's unrelated I guess
15:55:02 <haleyb> does anyone have time to figure out if there is a common failure in the 25%?
15:55:56 <Swami> haleyb: If I have time I will take a look at it
15:56:30 <haleyb> Swami: thanks, guess we need to recruit more people to work on dvr :)
15:56:44 <obondarev> haleyb: +
15:56:49 <haleyb> #topic Open Discussion
15:56:52 <anilvenkata> obondarev, haleyb, Swami, jschwarz: https://review.openstack.org/#/c/255237/ is the alternate approach I proposed for the DVR+HA+l2pop patches (which you all reviewed) a long time ago (i.e. February 2016), without any dependency on DVR. I will respin it today. Then I can abandon the DVR HA portbinding patches
15:56:53 <Swami> haleyb: agreed
15:57:20 <Swami> anilvenkata: thanks, will take a look at it.
15:57:28 <anilvenkata> Swami, thanks Swami
15:57:48 <haleyb> Swami: the nova patch for live migration got a -1 - https://review.openstack.org/#/c/275073 - i'll see about updating
15:58:06 <Swami> haleyb: thanks
15:58:20 <haleyb> it looks like nits to me, but...
15:58:42 <Swami> haleyb: yes, little nits, but there was a question about a negative test case.
15:59:08 <jschwarz> anilvenkata, ping me when you respin, I'll have a look :)
15:59:24 <haleyb> Swami: what is the behaviour there?
15:59:29 <anilvenkata> jschwarz, sure, thanks John
15:59:31 * haleyb notes we have :30
16:00:14 <haleyb> thanks guys, good meeting, but we're out of time
16:00:20 <haleyb> #endmeeting