18:02:40 <SumitNaiksatam> #startmeeting networking_policy
18:02:41 <openstack> Meeting started Thu Dec 21 18:02:40 2017 UTC and is due to finish in 60 minutes.  The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:02:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:02:44 <openstack> The meeting name has been set to 'networking_policy'
18:03:04 <rkukura> hi annakk
18:03:06 <SumitNaiksatam> annakk: hi, you made it
18:03:07 <annakk> hi!
18:03:10 <SumitNaiksatam> thanks
18:03:25 <SumitNaiksatam> mostly just wanted to give you guys an update on the pike sync activity
18:04:15 <SumitNaiksatam> so one of the major issues i was seeing last time (rollback not working in some cases) was due to the use of (and switching to) an admin_context
18:05:36 <SumitNaiksatam> on account of the changes that came with the enginefacade (the reader/writer thing), to maintain the same transaction (or nest a transaction) it's important to keep using the same context, or, when you do create a new context, to copy over some of the state of the older context
18:06:09 <SumitNaiksatam> most of the time calling “elevated” on the context should work
18:06:15 <SumitNaiksatam> this is what neutron does all the time
18:06:38 <SumitNaiksatam> however in our case, there are places where we need to manufacture a context, because none is being passed
18:06:55 <SumitNaiksatam> earlier we were creating a new admin context, and trying to ensure that the session stays the same
18:07:01 <SumitNaiksatam> however that was not enough
18:07:22 <SumitNaiksatam> i have gotten around this by overriding the get_admin_context implementation
18:07:42 <annakk> oh sorry I need to drop off right away, got a call from school.. will catch up later
18:07:53 <annakk> happy holidays!
18:08:32 <SumitNaiksatam> annakk: np, happy holidays
18:08:43 <SumitNaiksatam> you will find that patching here:
18:08:49 <SumitNaiksatam> #link https://review.openstack.org/#/c/518183/26/gbpservice/neutron/plugins/ml2plus/patch_neutron.py
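(A minimal sketch of the approach just described, not the actual patch at the link; get_current_context() is a made-up stand-in for however the active request context is recovered. The point is that elevated() copies the existing context, preserving the transaction state enginefacade tracks on it, while a freshly manufactured admin context would not be part of any ongoing transaction.)

```python
from neutron_lib import context as n_context

_orig_get_admin_context = n_context.get_admin_context


def get_admin_context():
    # get_current_context() is a hypothetical helper that returns the
    # context currently in scope, if any.
    current = get_current_context()
    if current is None:
        # Nothing to preserve; fall back to a brand-new admin context.
        return _orig_get_admin_context()
    # elevated() returns a copy of the existing context with admin rights,
    # so enginefacade keeps seeing one and the same transaction.
    return current.elevated()
```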
18:09:20 <SumitNaiksatam> rkukura: after that, as i was updating you yesterday, there were a bunch of UT setup related issues
18:09:24 <SumitNaiksatam> i wont go into those now
18:09:31 <SumitNaiksatam> the next big one was this:
18:09:51 <SumitNaiksatam> when running tests like this: gbpservice.neutron.tests.unit.plugins.ml2plus.test_apic_aim.TestAimMapping.test_network_lifecycle
18:10:05 <SumitNaiksatam> i was seeing: InvalidRequestError: Can't attach instance <NetworkSegment at 0x7fa46cadc6d0>; another instance with key (<class 'neutron.db.models.segment.NetworkSegment'>, (u'58ff28c9-2eef-4612-b037-1515ce752ea8',)) is already present in this session.
18:11:02 <rkukura> SumitNaiksatam: Was that error from sqlalchemy?
18:11:05 <SumitNaiksatam> if i remove the “lazy=joined” over here:
18:11:06 <SumitNaiksatam> https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/drivers/apic_aim/db.py#L49
18:11:16 <SumitNaiksatam> the error goes away (at least in the UTs)
18:11:30 <SumitNaiksatam> yes that error is from sqlalchemy
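(The relationship under discussion has roughly this shape; this is a hedged reconstruction, not a verbatim copy of db.py. lazy='joined' makes SQLAlchemy eager-load the related Network in the same SELECT via a JOIN.)

```python
import sqlalchemy as sa
from sqlalchemy import orm

from neutron.db import models_v2
from neutron_lib.db import model_base


class NetworkMapping(model_base.BASEV2):
    __tablename__ = 'apic_aim_network_mappings'

    network_id = sa.Column(
        sa.String(36),
        sa.ForeignKey('networks.id', ondelete='CASCADE'),
        primary_key=True)
    # lazy='joined' is the eager-loading option whose removal makes the
    # InvalidRequestError above go away, at least in the UTs.
    network = orm.relationship(
        models_v2.Network, lazy='joined',
        backref=orm.backref('aim_mapping', lazy='joined', uselist=False))
```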
18:12:44 <rkukura> Removing that could impact performance of the apic_aim MD
18:13:03 <SumitNaiksatam> rkukura: is that due to eager loading?
18:13:13 <rkukura> that is the intent
18:13:54 <rkukura> we query for a NetworkMapping, and then follow the reference to the Network, and we wanted to avoid a separate DB query for that
18:13:55 <SumitNaiksatam> sorry, is the intent to eager load or lazy load the relationship?
18:14:06 <rkukura> eager in this case, in both directions
18:14:26 <SumitNaiksatam> okay
18:15:12 <rkukura> very curious why removing the lazy=‘joined’ solves the NetworkSegment issue
18:15:22 <SumitNaiksatam> right, just saying that
18:15:31 <SumitNaiksatam> it should work in all cases
18:16:08 <rkukura> right
18:16:16 <SumitNaiksatam> there is most likely some place where the transaction/session gets messed up causing an issue in sqlalchemy
18:16:35 <rkukura> maybe
18:16:41 <rkukura> is this during delete_network?
18:17:11 <SumitNaiksatam> however, in my previous round of testing, i did a lot of debugging, following the different references, but couldn't tell that the session or transaction was messed up anywhere; it seemed to be okay
18:17:17 <SumitNaiksatam> yes, during delete_network
18:17:58 <rkukura> Maybe somehow the cascading isn’t working. I think it can happen both in the DB and in sqlalchemy. Maybe we need it to happen in sqlalchemy.
18:18:46 <SumitNaiksatam> rkukura: yes, its the cascading issue
18:19:02 <SumitNaiksatam> the cascading is happening twice for the NetworkSegment resource
18:19:04 <SumitNaiksatam> i could see that
18:19:18 <rkukura> interesting
18:19:33 <SumitNaiksatam> but per your earlier comment, and my thinking too, sqlalchemy should be able to deal with it within the same transaction
18:20:09 <SumitNaiksatam> and this interestingly does not happen with an external_network
18:20:46 <rkukura> yes, but I think cascading can be used in sqlalchemy in 2 ways - where the DB does it on commit, and where sqlalchemy does it in the session’s cache before the commit
18:20:51 <SumitNaiksatam> and the reason is apparent if you look at the code paths in the apic_aim driver
18:21:02 <SumitNaiksatam> rkukura: okay
18:21:25 <SumitNaiksatam> rkukura: and maybe you are not supposed to do both?
18:22:06 <rkukura> I think the DB schema should enforce the cascade, but there are cases where the in-memory cascading needs to happen too, and maybe we are hitting that
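(A toy illustration of the two cascade styles, with made-up table names: DB-level cascading is declared on the foreign key and executed by the database, while ORM-level cascading is declared on the relationship and executed by SQLAlchemy against the session's in-memory objects.)

```python
import sqlalchemy as sa
from sqlalchemy import orm
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Parent(Base):
    __tablename__ = 'parent'
    id = sa.Column(sa.String(36), primary_key=True)
    # ORM-level cascade: SQLAlchemy itself deletes the children when the
    # parent is deleted, keeping the session's in-memory state consistent.
    children = orm.relationship('Child', cascade='all, delete-orphan')


class Child(Base):
    __tablename__ = 'child'
    id = sa.Column(sa.String(36), primary_key=True)
    # DB-level cascade: the database removes child rows when the parent row
    # is deleted, without SQLAlchemy knowing about it in memory.
    parent_id = sa.Column(
        sa.String(36), sa.ForeignKey('parent.id', ondelete='CASCADE'))
```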
18:22:30 <SumitNaiksatam> rkukura: could be
18:22:37 <SumitNaiksatam> so with the external_network, if you see here:
18:22:38 <SumitNaiksatam> https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/drivers/apic_aim/mechanism_driver.py#L446-L460
18:23:03 <SumitNaiksatam> it does not take the path of getting the network_mapping
18:23:27 <SumitNaiksatam> and that led me to the path of looking at the network_mapping schema
18:25:03 <rkukura> the external network stuff is kind of an anomaly at the moment
18:25:19 <rkukura> it's got that weird AIM lib support that hides all the details from the MD
18:25:57 <SumitNaiksatam> sure, perhaps from a consistent pattern perspective, but i am just trying to narrow down this particular issue
18:26:28 <SumitNaiksatam> so it provides a data point in that respect
18:26:34 <rkukura> you are right that external networks do not currently use the NetworkMapping table the same way that regular networks do - the BD, EPG and VRF are stored, but they are also obtained from AIM lib
18:27:21 <SumitNaiksatam> clearly loading the network reference at that point in the network_mapping is conflicting with the deletion of the network later
18:27:44 <rkukura> Does the exception occur in _get_network_mapping called on line 455?
18:27:45 <SumitNaiksatam> and the reason might very well be the delete cascade in DB versus sqlalchemy issue you pointed out
18:27:50 <SumitNaiksatam> rkukura: no
18:28:23 <SumitNaiksatam> the exception occurs when context.session.delete(network) is called in the neutron DB plugin
18:29:04 <SumitNaiksatam> neutron/db/db_base_plugin_v2.py(462)delete_network()
18:29:13 <rkukura> so that is after delete_network_mapping has completed?
18:29:20 <rkukura> delete_network_precommit, I mean
18:29:27 <SumitNaiksatam> rkukura: yes
18:31:11 <SumitNaiksatam> so there is a data structure in sqlalchemy that gets populated with the NetworkSegment reference before the context.session.delete(network) is called
18:31:52 <SumitNaiksatam> when context.session.delete(network) is called, a cascaded delete is attempted and sqlalchemy attempts to populate that same data structure with the same key (for the NetworkSegment object)
18:32:58 <rkukura> so sqlalchemy is doing the cascading in this case - trying to delete the NetworkSegment?
18:33:09 <rkukura> or just load the NetworkSegment?
18:33:59 <SumitNaiksatam> yes, i think it is definitely trying to do a cascaded delete the second time around (when context.session.delete(network) is called)
18:34:37 <SumitNaiksatam> the data structure is just named “dirty” or something (if i recall correctly), so it's difficult to tell what it's being used for
18:35:02 <rkukura> ok
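(A standalone toy reproduction of this class of error, assuming plain SQLAlchemy 1.4+ and a made-up model rather than the Neutron schema; the check that raises the message quoted at 18:10 lives in the session's identity map.)

```python
import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Segment(Base):
    __tablename__ = 'segment'
    id = sa.Column(sa.String(36), primary_key=True)


engine = sa.create_engine('sqlite://')
Base.metadata.create_all(engine)

session = Session(engine)
first = Segment(id='58ff28c9')
session.add(first)
session.commit()
session.expunge(first)  # detached, but still carries its identity key

# Load the same row again: a second instance now occupies that key in the
# session's identity map.
second = session.get(Segment, '58ff28c9')

# Re-attaching the detached first instance collides with `second`:
# InvalidRequestError: Can't attach instance <Segment ...>; another
# instance with key (...) is already present in this session.
session.add(first)
```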
18:35:19 <SumitNaiksatam> to make progress i am going to remove the “lazy=joined” with a big REVISIT explanation, and see how it behaves with the devstack testing
18:35:30 <SumitNaiksatam> we can circle back to this after we have resolved other issues
18:35:40 <rkukura> makes sense
18:36:17 <SumitNaiksatam> you will also get some more time to think on this in the background
18:36:36 <SumitNaiksatam> okay so the next one, when i run this:
18:36:37 <SumitNaiksatam> gbpservice.neutron.tests.unit.plugins.ml2plus.test_apic_aim.TestPortBinding.test_bind_vnic_direct_port
18:37:52 <SumitNaiksatam> if i put a: with db_api.context_manager.writer.using(context):
18:37:54 <SumitNaiksatam> at:
18:37:59 <SumitNaiksatam> https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/plugin.py#L360
18:38:09 <SumitNaiksatam> i get an error:
18:38:47 <SumitNaiksatam> Failed to bind
18:38:52 <SumitNaiksatam> several times
18:38:57 <SumitNaiksatam> and the update fails
18:40:02 <SumitNaiksatam> for now i have removed the “with”
18:40:09 <SumitNaiksatam> but it's something we need to investigate more
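(A hedged reconstruction of the experiment described above; the real method lives at the plugin.py link, and the real class has much more to it. Wrapping the whole method keeps port binding inside the transaction, which appears to be what produces the repeated bind failures.)

```python
from neutron.db import api as db_api  # Pike-era home of context_manager
from neutron.plugins.ml2 import plugin as ml2_plugin


class Ml2PlusPlugin(ml2_plugin.Ml2Plugin):

    def update_port(self, context, id, port):
        # The experimental wrapper that was added and then removed again:
        with db_api.context_manager.writer.using(context):
            return super(Ml2PlusPlugin, self).update_port(
                context, id, port)
```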
18:40:44 <rkukura> I don’t see the with that you are removing
18:41:22 <rkukura> were you saying that you had added it, it failed, so you removed it?
18:41:40 <SumitNaiksatam> right right
18:42:08 <SumitNaiksatam> i wanted to wrap all the methods
18:42:11 <rkukura> we definitely should not wrap update_port in a transaction if we can avoid it - port binding is definitely supposed to happen outside a transaction
18:42:21 <rkukura> why wrap any of them?
18:42:34 <rkukura> only the precommit part should normally be in a transaction
18:42:54 <SumitNaiksatam> rkukura: okay, so i am guessing there is code in there (on the neutron side) which might be creating a new transaction/session
18:43:05 <SumitNaiksatam> okay so i am not going to wrap it
18:43:25 <SumitNaiksatam> only problem is when this gets called from GBP, we will most likely see the same issue
18:43:32 <rkukura> update_port definitely can run multiple separate transactions
18:43:42 <SumitNaiksatam> since GBP will have its own transaction
18:43:55 <rkukura> does GBP call update_port to set binding:host_id?
18:44:08 <rkukura> that is usually what triggers [re]binding
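(For reference, a hedged sketch of the kind of call that triggers [re]binding; the function name and arguments are illustrative.)

```python
def set_port_host(core_plugin, admin_context, port_id, host):
    # Setting binding:host_id (normally what Nova does) triggers
    # [re]binding on the port; GBP is not expected to update it directly.
    return core_plugin.update_port(
        admin_context, port_id, {'port': {'binding:host_id': host}})
```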
18:44:30 <SumitNaiksatam> perhaps not directly, but on a create/update policy_target maybe?
18:45:12 <SumitNaiksatam> if this is only triggered from nova, then great
18:46:11 <rkukura> There are other attributes of port that can trigger re-binding if the host_id is already set, like vnic_type
18:46:23 <rkukura> but I would not expect GBP to be updating those
18:46:39 <SumitNaiksatam> okay nice, whew!
18:46:54 <SumitNaiksatam> i have this CLI patch:
18:46:55 <SumitNaiksatam> https://review.openstack.org/#/c/529001/4
18:47:24 <rkukura> had not noticed - will review
18:47:29 <SumitNaiksatam> thanks
18:47:48 <SumitNaiksatam> this is sort of getting validated from: https://review.openstack.org/#/c/518183
18:47:54 <SumitNaiksatam> the devstack jobs are using it
18:47:58 <rkukura> is it safe to merge before the neutron pike sync patch?
18:48:17 <SumitNaiksatam> it should be fine, worst case we can revert
18:48:32 <SumitNaiksatam> we do have a stable/ocata branch
18:48:54 <SumitNaiksatam> so no one is expected to be using the master directly (at least not that we know of)
18:48:59 <SumitNaiksatam> oh well wait
18:49:05 <SumitNaiksatam> i think its better to wait
18:49:18 <SumitNaiksatam> we have other patches in the queue in GBP
18:49:28 <SumitNaiksatam> which might need to merge before the pike sync patch
18:49:36 <rkukura> right, but I’ll review
18:49:40 <SumitNaiksatam> so i guess better to wait
18:49:41 <SumitNaiksatam> thanks
18:49:51 <SumitNaiksatam> the other issue is the time outs in the gate
18:50:08 <SumitNaiksatam> the py27 job again times out
18:50:16 <rkukura> I was just looking at some of the “with db_api.context_manager.writer.using(context):” code added to other plugin methods. Are any of these really needed?
18:50:31 <rkukura> Times out on master?
18:51:02 <SumitNaiksatam> yes
18:51:09 <rkukura> New UTs for validate/repair tool will add some time too
18:51:15 <SumitNaiksatam> bummer
18:51:33 <rkukura> have they cut back our cores on master, or the timeout period?
18:52:12 <SumitNaiksatam> i havent checked, will check
18:52:24 <SumitNaiksatam> but yeah, on those occasions it runs longer than 35 mins
18:52:38 <SumitNaiksatam> and most likely the reason is the cores/threads
18:52:41 <rkukura> maybe we need to start thinking about moving the AIM stuff to a separate repo or something
18:52:59 <SumitNaiksatam> hmmm, good idea, worth exploring
18:53:14 <SumitNaiksatam> or a separate job
18:53:23 <rkukura> maybe
18:54:08 <SumitNaiksatam> that's pretty much what i had
18:54:08 <SumitNaiksatam> oh on your earlier question
18:54:38 <SumitNaiksatam> i was trying to wrap as many places as i could in the “with”
18:55:08 <rkukura> in general, we still want postcommit methods for ML2[plus] to run outside transactions, don’t we?
18:55:23 <SumitNaiksatam> as opposed to before, where if you had a: with session.begin(subtransactions=True)
18:55:41 <SumitNaiksatam> you could expect that any subsequent call would result in a nested transaction
18:55:45 <rkukura> In fact, I’d like to look at eliminating ML2plus soon
18:55:46 <SumitNaiksatam> however now that is not the case
18:56:10 <rkukura> ?
18:56:59 <SumitNaiksatam> so unless you have a “with…writer()” at the top level, the subsequent calls to “with session.begin(subtransactions=True)” can result in different (non-nested) transactions
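(A hedged sketch of the point being made; the plugin call is illustrative. With enginefacade, nesting is tracked on the context, so an explicit outer writer block is what guarantees that legacy begin() calls reached from inside it join the same transaction.)

```python
from neutron.db import api as db_api  # Pike-era home of context_manager


def delete_with_mapping(plugin, context, network_id):
    # Outer writer transaction on `context`:
    with db_api.context_manager.writer.using(context):
        # Any code reached from here that still does
        #     with context.session.begin(subtransactions=True):
        # joins this transaction; without the enclosing writer block it
        # could end up in a separate, non-nested transaction.
        plugin.delete_network(context, network_id)
```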
18:57:23 <rkukura> can we just change GBP to use “with…writer()” for all its transactions?
18:58:31 <SumitNaiksatam> so on your earlier point about postcommits, for apic_aim, we don't do anything inside the postcommit, right?
18:58:59 <rkukura> right, but ideally apic_aim would be able to co-exist with other MDs that might use postcommit
18:59:23 <SumitNaiksatam> :-)
18:59:46 <rkukura> and apic_aim’s port binding can do remote calls to ESX I think, so that should not be inside a transaction
19:00:08 <SumitNaiksatam> okay so i am definitely removing from update_port
19:00:16 <SumitNaiksatam> i already did in the latest patchset
19:00:31 <SumitNaiksatam> i will remove from other places as well, once i get a clean run
19:00:57 <rkukura> ok, but I’m still concerned about the others - I think switching GBP to use the new API would be best
19:01:21 <SumitNaiksatam> GBP is already switched
19:01:41 <rkukura> oh, good
19:01:57 <rkukura> thanks
19:02:14 <rkukura> think we are out of time
19:02:14 <SumitNaiksatam> i think on one occasion i saw an issue with the outer transaction block not being present in ml2plus, hence i added it
19:02:28 <SumitNaiksatam> yeah
19:02:35 <SumitNaiksatam> thanks rkukura!
19:02:40 <SumitNaiksatam> happy holidays
19:02:42 <SumitNaiksatam> bye
19:02:43 <rkukura> you too!
19:02:47 <SumitNaiksatam> #endmeeting