18:02:40 <SumitNaiksatam> #startmeeting networking_policy
18:02:41 <openstack> Meeting started Thu Dec 21 18:02:40 2017 UTC and is due to finish in 60 minutes. The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:02:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:02:44 <openstack> The meeting name has been set to 'networking_policy'
18:03:04 <rkukura> hi annakk
18:03:06 <SumitNaiksatam> annakk: hi, you made it
18:03:07 <annakk> hi!
18:03:10 <SumitNaiksatam> thanks
18:03:25 <SumitNaiksatam> mostly just wanted to give you guys an update on the pike sync activity
18:04:15 <SumitNaiksatam> so one of the major issues i was seeing last time (rollback not working in some cases) was due to the use of (and switching to) an admin_context
18:05:36 <SumitNaiksatam> on account of the changes that have happened with using the enginefacade (the reader/writer thing), to maintain the same transaction (or nest a transaction) it's important to maintain the context, or when you do create a new context, copy over some of the state of the older context
18:06:09 <SumitNaiksatam> most of the time calling “elevated” on the context should work
18:06:15 <SumitNaiksatam> this is what neutron does all the time
18:06:38 <SumitNaiksatam> however in our case, there are places where we need to manufacture a context, because none is being passed
18:06:55 <SumitNaiksatam> earlier we were creating a new admin context, and trying to ensure that the session stays the same
18:07:01 <SumitNaiksatam> however that was not enough
18:07:22 <SumitNaiksatam> i have gotten around this by overriding the get_admin_context implementation
18:07:42 <annakk> oh sorry I need to drop off right away, got a call from school.. will catch up later
18:07:53 <annakk> happy holidays!
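The elevated-vs-fresh-admin-context distinction above can be sketched with a simplified model. These classes are illustrative stand-ins, not Neutron's actual `Context`/session implementation: the point is that `elevated()` copies the existing context and keeps its session (so enginefacade can nest the transaction), while manufacturing a brand-new admin context starts a fresh session and therefore a separate transaction.

```python
# Simplified model (NOT Neutron's actual classes) of the context issue
# discussed above.

class Session:
    """Stand-in for a SQLAlchemy session bound to one transaction."""
    _counter = 0

    def __init__(self):
        Session._counter += 1
        self.id = Session._counter


class Context:
    def __init__(self, is_admin=False, session=None):
        self.is_admin = is_admin
        self.session = session if session is not None else Session()

    def elevated(self):
        # Copy the context, flipping is_admin but KEEPING the session,
        # so subsequent DB work joins the same transaction.
        return Context(is_admin=True, session=self.session)


def get_admin_context():
    # The problematic pattern: a fresh context gets a fresh session,
    # so work done under it lands in a separate transaction.
    return Context(is_admin=True)


user_ctx = Context()
same_txn = user_ctx.elevated()
new_txn = get_admin_context()

assert same_txn.session is user_ctx.session      # transaction preserved
assert new_txn.session is not user_ctx.session   # separate transaction
```

This is why, when a context must be manufactured because none is passed in, the state of the caller's context (its session in particular) has to be carried over, as the overridden get_admin_context in the linked patch does.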
18:08:32 <SumitNaiksatam> annakk: np, happy holidays
18:08:43 <SumitNaiksatam> you will find that patching here:
18:08:49 <SumitNaiksatam> #link https://review.openstack.org/#/c/518183/26/gbpservice/neutron/plugins/ml2plus/patch_neutron.py
18:09:20 <SumitNaiksatam> rkukura: after that, as i was updating you yesterday, there were a bunch of UT setup related issues
18:09:24 <SumitNaiksatam> i won't go into those now
18:09:31 <SumitNaiksatam> the next big one was this:
18:09:51 <SumitNaiksatam> when running tests like this: gbpservice.neutron.tests.unit.plugins.ml2plus.test_apic_aim.TestAimMapping.test_network_lifecycle
18:10:05 <SumitNaiksatam> i was seeing: InvalidRequestError: Can't attach instance <NetworkSegment at 0x7fa46cadc6d0>; another instance with key (<class 'neutron.db.models.segment.NetworkSegment'>, (u'58ff28c9-2eef-4612-b037-1515ce752ea8',)) is already present in this session.
18:11:02 <rkukura> SumitNaiksatam: Was that error from sqlalchemy?
18:11:05 <SumitNaiksatam> if i remove the “lazy=joined” over here:
18:11:06 <SumitNaiksatam> https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/drivers/apic_aim/db.py#L49
18:11:16 <SumitNaiksatam> the error goes away (at least in the UTs)
18:11:30 <SumitNaiksatam> yes, that error is from sqlalchemy
18:12:44 <rkukura> Removing that could impact performance of the apic_aim MD
18:13:03 <SumitNaiksatam> rkukura: do you to eager loading?
18:13:10 <SumitNaiksatam> *due to
18:13:13 <rkukura> that is the intent
18:13:54 <rkukura> we query for a NetworkMapping, and then follow the reference to the Network, and we wanted to avoid a separate DB query for that
18:13:55 <SumitNaiksatam> sorry, is the intent to eager load or lazy load the relationship?
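The "another instance with key ... is already present in this session" error above comes from SQLAlchemy's identity map: a session tracks at most one mapped instance per (class, primary key). The toy classes below are a plain-Python conceptual model of that mechanism, not SQLAlchemy itself; they show how a joined-eager load that materializes a second object for a key the session already holds produces exactly this conflict.

```python
# Conceptual model (plain Python, NOT SQLAlchemy) of the identity map
# behind the InvalidRequestError quoted in the log above.

class InvalidRequestError(Exception):
    pass


class IdentityMapSession:
    def __init__(self):
        self._map = {}

    def attach(self, instance):
        key = (type(instance), instance.id)
        existing = self._map.get(key)
        if existing is not None and existing is not instance:
            # A *different* object for an already-tracked key conflicts.
            raise InvalidRequestError(
                "Can't attach instance %r; another instance with key %r "
                "is already present in this session." % (instance, key))
        self._map[key] = instance


class NetworkSegment:
    def __init__(self, id):
        self.id = id


session = IdentityMapSession()
session.attach(NetworkSegment("58ff28c9"))

conflict = False
try:
    # A second, distinct object for the same primary key, e.g. one
    # pulled in by an extra eager-loaded join, cannot be attached.
    session.attach(NetworkSegment("58ff28c9"))
except InvalidRequestError:
    conflict = True
assert conflict
```

Re-attaching the *same* instance is harmless; the error only fires when two distinct objects claim one key, which matches the observation that dropping lazy='joined' (so the duplicate is never materialized) makes the error disappear.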
18:14:06 <rkukura> eager in this case, in both directions
18:14:26 <SumitNaiksatam> okay
18:15:12 <rkukura> very curious why removing the lazy=‘joined’ solves the NetworkSegment issue
18:15:22 <SumitNaiksatam> right, just saying that
18:15:31 <SumitNaiksatam> it should work in all cases
18:16:08 <rkukura> right
18:16:16 <SumitNaiksatam> there is most likely some place where the transaction/session gets messed up, causing an issue in sqlalchemy
18:16:35 <rkukura> maybe
18:16:41 <rkukura> is this during delete_network?
18:17:11 <SumitNaiksatam> however, in my previous round of testing, i did a lot of debugging and following the different references, but couldn't tell that the session or transaction was messed up anywhere, it seemed to be okay
18:17:17 <SumitNaiksatam> yes, during delete_network
18:17:58 <rkukura> Maybe somehow the cascading isn’t working. I think it can happen both in the DB and in sqlalchemy. Maybe we need it to happen in sqlalchemy.
18:18:46 <SumitNaiksatam> rkukura: yes, it's the cascading issue
18:19:02 <SumitNaiksatam> the cascading is happening twice for the NetworkSegment resource
18:19:04 <SumitNaiksatam> i could see that
18:19:18 <rkukura> interesting
18:19:33 <SumitNaiksatam> but per your earlier comment, and my thinking too, sqlalchemy should be able to deal with it within the same transaction
18:20:09 <SumitNaiksatam> and this interestingly does not happen with an external_network
18:20:46 <rkukura> yes, but I think cascading can be used in sqlalchemy in 2 ways - where the DB does it on commit, and where sqlalchemy does it in the session’s cache before the commit
18:20:51 <SumitNaiksatam> and the reason is because, if you see the code paths in the apic_aim driver
18:21:02 <SumitNaiksatam> rkukura: okay
18:21:25 <SumitNaiksatam> rkukura: and maybe you are not supposed to do both?
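The DB-side flavor of the two cascade styles just mentioned can be shown runnably with stdlib sqlite3 (the table names here are illustrative, loosely echoing Neutron's schema). With ON DELETE CASCADE declared in the schema, the database itself removes the child row when the parent is deleted; an ORM session only knows about cascades it performs in memory, which is where a row can end up handled twice if both mechanisms fire.

```python
# DB-enforced cascade: the database deletes the child row itself, with
# no ORM/session involvement. Table names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite requires this opt-in
conn.execute("CREATE TABLE networks (id TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE networksegments (
        id TEXT PRIMARY KEY,
        network_id TEXT REFERENCES networks(id) ON DELETE CASCADE
    )""")
conn.execute("INSERT INTO networks VALUES ('net-1')")
conn.execute("INSERT INTO networksegments VALUES ('seg-1', 'net-1')")

# Deleting the parent is enough: the DB cascades to the segment row.
conn.execute("DELETE FROM networks WHERE id = 'net-1'")
remaining = conn.execute(
    "SELECT COUNT(*) FROM networksegments").fetchone()[0]
assert remaining == 0
```

The session-side flavor (e.g. SQLAlchemy's relationship cascade options) instead deletes children from the session's in-memory state before flush; running both over the same rows in one transaction is the "cascading twice" scenario suspected above.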
18:22:06 <rkukura> I think the DB schema should enforce the cascade, but there are cases where the in-memory cascading needs to happen too, and maybe we are hitting that
18:22:30 <SumitNaiksatam> rkukura: could be
18:22:37 <SumitNaiksatam> so with the external_network, if you see here:
18:22:38 <SumitNaiksatam> https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/drivers/apic_aim/mechanism_driver.py#L446-L460
18:23:03 <SumitNaiksatam> it does not take the path of getting the network_mapping
18:23:27 <SumitNaiksatam> and that led me to the path of looking at the network_mapping schema
18:25:03 <rkukura> the external network stuff is kind of an anomaly at the moment
18:25:19 <rkukura> it's got that weird AIM lib support that hides all the details from the MD
18:25:57 <SumitNaiksatam> sure, perhaps from a consistent pattern perspective, but i am just trying to narrow down this particular issue
18:26:28 <SumitNaiksatam> so it provides a data point in that respect
18:26:34 <rkukura> you are right that external networks do not currently use the NetworkMapping table the same way that regular networks do - the BD, EPG and VRF are stored, but they are also obtained from AIM lib
18:27:21 <SumitNaiksatam> clearly loading the network reference at that point in the network_mapping is conflicting with the deletion of the network later
18:27:44 <rkukura> Does the exception occur in _get_network_mapping called on line 455?
18:27:45 <SumitNaiksatam> and the reason might very well be the delete cascade in DB versus sqlalchemy issue you pointed out
18:27:50 <SumitNaiksatam> rkukura: no
18:28:23 <SumitNaiksatam> the exception occurs when context.session.delete(network) is called in the neutron DB plugin
18:29:04 <SumitNaiksatam> neutron/db/db_base_plugin_v2.py(462)delete_network()
18:29:13 <rkukura> so that is after delete_network_mapping has completed?
18:29:20 <rkukura> delete_network_precommit, I mean
18:29:27 <SumitNaiksatam> rkukura: yes
18:31:11 <SumitNaiksatam> so there is a data structure in sqlalchemy that gets populated with the NetworkSegment reference before the context.session.delete(network) is called
18:31:52 <SumitNaiksatam> when context.session.delete(network) is called, a cascaded delete is attempted and sqlalchemy attempts to populate that same data structure with the same key (for the NetworkSegment object)
18:32:58 <rkukura> so sqlalchemy is doing the cascading in this case - trying to delete the NetworkSegment?
18:33:09 <rkukura> or just load the NetworkSegment?
18:33:59 <SumitNaiksatam> yes, i think it is definitely trying to do a cascaded delete the second time around (when context.session.delete(network) is called)
18:34:37 <SumitNaiksatam> the data structure is just named “dirty” or something (if i recall correctly), so it's difficult to tell what it's being used for
18:35:02 <rkukura> ok
18:35:19 <SumitNaiksatam> to make progress i am going to remove the “lazy=join” with a big REVISIT explanation, and see how it behaves with the devstack testing
18:35:30 <SumitNaiksatam> we can circle back to this after we have resolved other issues
18:35:40 <rkukura> makes sense
18:36:17 <SumitNaiksatam> you will also get some more time to think on this in the background
18:36:36 <SumitNaiksatam> okay so the next one, when i run this:
18:36:37 <SumitNaiksatam> gbpservice.neutron.tests.unit.plugins.ml2plus.test_apic_aim.TestPortBinding.test_bind_vnic_direct_port
18:37:52 <SumitNaiksatam> if i put a: with db_api.context_manager.writer.using(context):
18:37:54 <SumitNaiksatam> at:
18:37:59 <SumitNaiksatam> https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/plugin.py#L360
18:38:09 <SumitNaiksatam> i get an error:
18:38:47 <SumitNaiksatam> Failed to bind
18:38:52 <SumitNaiksatam> several times
18:38:57 <SumitNaiksatam> and the update fails
18:40:02
<SumitNaiksatam> for now i have removed the “with”
18:40:09 <SumitNaiksatam> but something we need to investigate more
18:40:44 <rkukura> I don’t see the with that you are removing
18:41:22 <rkukura> were you saying that you had added it, it failed, so you removed it?
18:41:40 <SumitNaiksatam> right right
18:42:08 <SumitNaiksatam> i wanted to wrap all the methods
18:42:11 <rkukura> we definitely should not wrap update_port in a transaction if we can avoid it - port binding is definitely supposed to happen outside a transaction
18:42:21 <rkukura> why wrap any of them?
18:42:34 <rkukura> only the precommit part should normally be in a transaction
18:42:54 <SumitNaiksatam> rkukura: okay, so i am guessing there is code in there (on the neutron side) which might be creating a new transaction/session
18:43:05 <SumitNaiksatam> okay so i am not going to wrap it
18:43:25 <SumitNaiksatam> only problem is when this gets called from GBP, we will most likely see the same issue
18:43:32 <rkukura> update_port definitely can run multiple separate transactions
18:43:42 <SumitNaiksatam> since GBP will have its own transaction
18:43:55 <rkukura> does GBP call update_port to set binding:host_id?
18:44:08 <rkukura> that is usually what triggers [re]binding
18:44:30 <SumitNaiksatam> perhaps not directly, but on a create/update policy_target maybe?
18:45:12 <SumitNaiksatam> if this is only triggered from nova, then great
18:46:11 <rkukura> There are other attributes of port that can trigger re-binding if the host_id is already set, like vnic_type
18:46:23 <rkukura> but I would not expect GBP to be updating those
18:46:39 <SumitNaiksatam> okay nice, whew!
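The (re)binding triggers described just above can be sketched as a small predicate. This is an illustrative function of ours, not ML2's actual code: setting or changing binding:host_id triggers binding, and a change to another binding input such as binding:vnic_type re-triggers it only when a host is already bound.

```python
# Illustrative-only sketch (our names, NOT ML2's actual logic) of when
# a port update should trigger [re]binding, per the discussion above.

def needs_rebind(original, updated):
    old_host = original.get("binding:host_id")
    new_host = updated.get("binding:host_id")
    if old_host != new_host:
        return bool(new_host)   # host newly set or moved -> (re)bind
    if not old_host:
        return False            # nothing bound yet, nothing to redo
    # Host unchanged and present: other binding inputs can still force
    # a rebind, e.g. the vnic type.
    return (original.get("binding:vnic_type") !=
            updated.get("binding:vnic_type"))


# Nova setting the host triggers the initial binding.
assert needs_rebind({}, {"binding:host_id": "compute-1"})
# An unrelated update to a bound port does not rebind.
assert not needs_rebind({"binding:host_id": "compute-1"},
                        {"binding:host_id": "compute-1"})
# Changing vnic_type on an already-bound port re-triggers binding.
assert needs_rebind(
    {"binding:host_id": "compute-1", "binding:vnic_type": "normal"},
    {"binding:host_id": "compute-1", "binding:vnic_type": "direct"})
```

Since GBP is not expected to update these attributes, policy_target operations should not trip this path, which is the "whew" at the end of the exchange.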
18:46:54 <SumitNaiksatam> i have this CLI patch:
18:46:55 <SumitNaiksatam> https://review.openstack.org/#/c/529001/4
18:47:24 <rkukura> had not noticed - will review
18:47:29 <SumitNaiksatam> thanks
18:47:48 <SumitNaiksatam> this is sort of getting validated from: https://review.openstack.org/#/c/518183
18:47:54 <SumitNaiksatam> the devstack jobs are using it
18:47:58 <rkukura> is it safe to merge before the neutron pike sync patch?
18:48:17 <SumitNaiksatam> it should be fine, worst case we can revert
18:48:32 <SumitNaiksatam> we do have a stable/ocata branch
18:48:54 <SumitNaiksatam> so no one is expected to be using the master directly (at least not that we know of)
18:48:59 <SumitNaiksatam> oh well wait
18:49:05 <SumitNaiksatam> i think it's better to wait
18:49:18 <SumitNaiksatam> we have other patches in the queue in GBP
18:49:28 <SumitNaiksatam> which might need to merge before the pike sync patch
18:49:36 <rkukura> right, but I’ll review
18:49:40 <SumitNaiksatam> so i guess better to wait
18:49:41 <SumitNaiksatam> thanks
18:49:51 <SumitNaiksatam> the other issue is the timeouts in the gate
18:50:08 <SumitNaiksatam> the py27 job again times out
18:50:16 <rkukura> I was just looking at some of the “with db_api.context_manager.writer.using(context):” code added to other plugin methods. Are any of these really needed?
18:50:31 <rkukura> Times out on master?
18:51:02 <SumitNaiksatam> yes
18:51:09 <rkukura> New UTs for validate/repair tool will add some time too
18:51:15 <SumitNaiksatam> bummer
18:51:33 <rkukura> have they cut back our cores on master, or the timeout period?
18:52:12 <SumitNaiksatam> i haven't checked, will check
18:52:24 <SumitNaiksatam> but yeah, on those occasions it runs longer than 35 mins
18:52:38 <SumitNaiksatam> and most likely the reason is the cores/threads
18:52:41 <rkukura> maybe we need to start thinking about moving the AIM stuff to a separate repo or something
18:52:59 <SumitNaiksatam> hmmm, good idea, worth exploring
18:53:14 <SumitNaiksatam> or a separate job
18:53:23 <rkukura> maybe
18:53:57 <SumitNaiksatam> that's pretty much what i had
18:54:08 <SumitNaiksatam> oh, on your earlier question
18:54:38 <SumitNaiksatam> i was trying to wrap as many places as i could in the “with”
18:55:08 <rkukura> in general, we still want postcommit methods for ML2[plus] to run outside transactions, don’t we?
18:55:23 <SumitNaiksatam> as opposed to before, where if you had a: with session.begin(subtransactions=True)
18:55:41 <SumitNaiksatam> you could expect that any subsequent call would result in a nested transaction
18:55:45 <rkukura> In fact, I’d like to look at eliminating ML2plus soon
18:55:46 <SumitNaiksatam> however now that is not the case
18:56:10 <rkukura> ?
18:56:59 <SumitNaiksatam> so unless you have a “with…writer()” at the top level, the subsequent calls to “with session.begin(subtransactions=True)” can result in different (non-nested) transactions
18:57:23 <rkukura> can we just change GBP to use “with…writer()” for all its transactions?
18:58:31 <SumitNaiksatam> so on your earlier point about postcommits, for apic_aim, we don't do anything inside the postcommit, right?
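The nesting behaviour described above can be modeled with a toy facade. This is not oslo.db itself, just a sketch of the semantics under discussion: an inner block joins the enclosing transaction only when an outer writer context exists; without one, each block gets its own transaction, which is the "different (non-nested) transactions" failure mode.

```python
# Toy model (NOT oslo.db's enginefacade) of writer-context nesting.
import contextlib


class ToyFacade:
    def __init__(self):
        self._txn_counter = 0
        self._stack = []

    @contextlib.contextmanager
    def writer_using(self, context):
        if self._stack:
            txn = self._stack[-1]    # join the enclosing transaction
        else:
            self._txn_counter += 1   # no outer writer: start a new one
            txn = self._txn_counter
        self._stack.append(txn)
        try:
            yield txn
        finally:
            self._stack.pop()


facade = ToyFacade()
ctx = object()

# With an outer writer, the inner block nests in the same transaction.
with facade.writer_using(ctx) as outer:
    with facade.writer_using(ctx) as inner:
        assert inner == outer

# Without an outer writer, two sibling blocks are separate transactions.
with facade.writer_using(ctx) as first:
    pass
with facade.writer_using(ctx) as second:
    pass
assert first != second
```

This is the motivation for switching all of GBP's transactions to the writer-style API rather than relying on session.begin(subtransactions=True) to nest implicitly.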
18:58:59 <rkukura> right, but ideally apic_aim would be able to co-exist with other MDs that might use postcommit
18:59:23 <SumitNaiksatam> :-)
18:59:46 <rkukura> and apic_aim’s port binding can do remote calls to ESX I think, so that should not be inside a transaction
19:00:08 <SumitNaiksatam> okay so i am definitely removing from update_port
19:00:16 <SumitNaiksatam> i already did in the latest patchset
19:00:31 <SumitNaiksatam> i will remove from other places as well, once i get a clean run
19:00:57 <rkukura> ok, but I’m still concerned about the others - I think switching GBP to use the new API would be best
19:01:21 <SumitNaiksatam> GBP is already switched
19:01:41 <rkukura> oh, good
19:01:57 <rkukura> thanks
19:02:14 <rkukura> think we are out of time
19:02:24 <SumitNaiksatam> i think on one occasion i saw an issue with the outer transaction block not being present in ml2plus, hence added
19:02:28 <SumitNaiksatam> yeah
19:02:35 <SumitNaiksatam> thanks rkukura!
19:02:40 <SumitNaiksatam> happy holidays
19:02:42 <SumitNaiksatam> by
19:02:43 <rkukura> you too!
19:02:44 <SumitNaiksatam> e
19:02:47 <SumitNaiksatam> #endmeeting