18:02:40 #startmeeting networking_policy
18:02:41 Meeting started Thu Dec 21 18:02:40 2017 UTC and is due to finish in 60 minutes. The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:02:42 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:02:44 The meeting name has been set to 'networking_policy'
18:03:04 hi annakk
18:03:06 annakk: hi, you made it
18:03:07 hi!
18:03:10 thanks
18:03:25 mostly just wanted to give you guys an update on the pike sync activity
18:04:15 so one of the major issues i was seeing last time (rollback not working in some cases) was due to the use of (and switching to) an admin_context
18:05:36 on account of the changes that have happened with using the enginefacade (the reader/writer thing), to maintain the same transaction (or nest a transaction) it's important to maintain the context, or when you do create a new context, copy over some of the state of the older context
18:06:09 most of the time calling "elevated" on the context should work
18:06:15 this is what neutron does all the time
18:06:38 however in our case, there are places where we need to manufacture a context, because none is being passed
18:06:55 earlier we were creating a new admin context, and trying to ensure that the session stays the same
18:07:01 however that was not enough
18:07:22 i have gotten around this by overriding the get_admin_context implementation
18:07:42 oh sorry I need to drop off right away, got a call from school.. will catch up later
18:07:53 happy holidays!
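A toy sketch of the context issue discussed above. The class names and the `get_admin_context` signature here are illustrative, not the real Neutron API: the point is that `elevated()` carries the session (and with it the open transaction state that enginefacade keys off the context) along to the admin copy, while a brand-new admin context starts with none, which is why the patched `get_admin_context` copies state from an existing context when one is available.

```python
# Illustrative model only -- not the real Neutron context or the actual
# GBP patch; all names here are assumptions for the sketch.

class Context:
    def __init__(self, project_id=None, is_admin=False, session=None):
        self.project_id = project_id
        self.is_admin = is_admin
        self.session = session  # stands in for the enginefacade/session state

    def elevated(self):
        # Admin-privileged copy that carries the session state along,
        # so work done under it stays in the same transaction.
        return Context(self.project_id, is_admin=True, session=self.session)


def get_admin_context(existing=None):
    # Patched behaviour as described in the meeting: reuse the caller's
    # context state when we have it; only manufacture a bare admin
    # context (with no session, hence a new transaction) as a last resort.
    if existing is not None:
        return existing.elevated()
    return Context(is_admin=True)


user_ctx = Context('proj-1', session='open-transaction-state')
admin_ctx = get_admin_context(user_ctx)   # same session, admin rights
bare_ctx = get_admin_context()            # manufactured: no session
```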
18:08:32 annakk: np, happy holidays
18:08:43 you will find that patching here:
18:08:49 #link https://review.openstack.org/#/c/518183/26/gbpservice/neutron/plugins/ml2plus/patch_neutron.py
18:09:20 rkukura: after that, as i was updating you yesterday, there were a bunch of UT setup related issues
18:09:24 i won't go into those now
18:09:31 the next big one was this:
18:09:51 when running tests like this: gbpservice.neutron.tests.unit.plugins.ml2plus.test_apic_aim.TestAimMapping.test_network_lifecycle
18:10:05 i was seeing: InvalidRequestError: Can't attach instance ; another instance with key (, (u'58ff28c9-2eef-4612-b037-1515ce752ea8',)) is already present in this session.
18:11:02 SumitNaiksatam: Was that error from sqlalchemy?
18:11:05 if i remove the "lazy=joined" over here:
18:11:06 https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/drivers/apic_aim/db.py#L49
18:11:16 the error goes away (at least in the UTs)
18:11:30 yes, that error is from sqlalchemy
18:12:44 Removing that could impact performance of the apic_aim MD
18:13:03 rkukura: due to eager loading?
18:13:13 that is the intent
18:13:54 we query for a NetworkMapping, and then follow the reference to the Network, and we wanted to avoid a separate DB query for that
18:13:55 sorry, is the intent to eager load or lazy load the relationship?
18:14:06 eager in this case, in both directions
18:14:26 okay
18:15:12 very curious why removing the lazy='joined' solves the NetworkSegment issue
18:15:22 right, just saying that
18:15:31 it should work in all cases
18:16:08 right
18:16:16 there is most likely some place where the transaction/session gets messed up, causing an issue in sqlalchemy
18:16:35 maybe
18:16:41 is this during delete_network?
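For context, what `lazy='joined'` buys: SQLAlchemy emits a single JOINed SELECT so that following `mapping.network` needs no second round trip. A plain-SQL equivalent with stdlib sqlite3 (table and column names are illustrative, loosely following the NetworkMapping model, not copied from the GBP schema):

```python
# Sketch of eager (joined) loading vs. lazy loading, in plain SQL.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE networks (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE network_mappings (
    network_id TEXT PRIMARY KEY REFERENCES networks(id)
);
INSERT INTO networks VALUES ('net-1', 'web');
INSERT INTO network_mappings VALUES ('net-1');
""")

# With lazy='joined', one query loads the mapping *and* its network...
row = conn.execute("""
    SELECT m.network_id, n.name
    FROM network_mappings m JOIN networks n ON n.id = m.network_id
    WHERE m.network_id = 'net-1'
""").fetchone()
# ...instead of a follow-up SELECT per mapping (lazy loading), which is
# the extra DB query the apic_aim MD wanted to avoid.
```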
18:17:11 however, in my previous round of testing, i did a lot of debugging and following the different references, but couldn't tell that the session or transaction was messed up anywhere, it seemed to be okay
18:17:17 yes, during delete_network
18:17:58 Maybe somehow the cascading isn't working. I think it can happen both in the DB and in sqlalchemy. Maybe we need it to happen in sqlalchemy.
18:18:46 rkukura: yes, it's the cascading issue
18:19:02 the cascading is happening twice for the NetworkSegment resource
18:19:04 i could see that
18:19:18 interesting
18:19:33 but per your earlier comment, and my thinking too, sqlalchemy should be able to deal with it within the same transaction
18:20:09 and this interestingly does not happen with an external_network
18:20:46 yes, but I think cascading can be used in sqlalchemy in 2 ways - where the DB does it on commit, and where sqlalchemy does it in the session's cache before the commit
18:20:51 and the reason is because, if you see the code paths in the apic_aim driver
18:21:02 rkukura: okay
18:21:25 rkukura: and maybe you are not supposed to do both?
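To illustrate the two cascade styles being discussed: in the first, the database itself enforces `ON DELETE CASCADE`; in the second (not shown), SQLAlchemy's `cascade='all, delete'` relationship option deletes the child objects in the session's cache before the flush. Configuring both can mean the same child row gets processed twice, which is consistent with the double-cascade seen for NetworkSegment. A stdlib sqlite3 sketch of the DB-side style (schema is illustrative, not the real Neutron tables):

```python
# Style 1: the database enforces the cascade at delete time.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')  # sqlite disables FK enforcement by default

conn.executescript("""
CREATE TABLE networks (id TEXT PRIMARY KEY);
CREATE TABLE networksegments (
    id TEXT PRIMARY KEY,
    network_id TEXT NOT NULL
        REFERENCES networks(id) ON DELETE CASCADE
);
INSERT INTO networks VALUES ('net-1');
INSERT INTO networksegments VALUES ('seg-1', 'net-1');
""")

# Deleting the parent network lets the DB cascade to the segment row;
# no ORM-level bookkeeping is involved.
conn.execute("DELETE FROM networks WHERE id = 'net-1'")
remaining = conn.execute(
    "SELECT COUNT(*) FROM networksegments").fetchone()[0]
```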
18:22:06 I think the DB schema should enforce the cascade, but there are cases where the in-memory cascading needs to happen too, and maybe we are hitting that
18:22:30 rkukura: could be
18:22:37 so with the external_network, if you see here:
18:22:38 https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/drivers/apic_aim/mechanism_driver.py#L446-L460
18:23:03 it does not take the path of getting the network_mapping
18:23:27 and that led me to the path of looking at the network_mapping schema
18:25:03 the external network stuff is kind of an anomaly at the moment
18:25:19 it's got that weird AIM lib support that hides all the details from the MD
18:25:57 sure, perhaps from a consistent pattern perspective, but i am just trying to narrow down this particular issue
18:26:28 so it provides a data point in that respect
18:26:34 you are right that external networks do not currently use the NetworkMapping table the same way that regular networks do - the BD, EPG and VRF are stored, but they are also obtained from AIM lib
18:27:21 clearly loading the network reference at that point in the network_mapping is conflicting with the deletion of the network later
18:27:44 Does the exception occur in _get_network_mapping called on line 455?
18:27:45 and the reason might very well be the delete cascade in DB versus sqlalchemy issue you pointed out
18:27:50 rkukura: no
18:28:23 the exception occurs when context.session.delete(network) is called in the neutron DB plugin
18:29:04 neutron/db/db_base_plugin_v2.py(462)delete_network()
18:29:13 so that is after delete_network_mapping has completed?
18:29:20 delete_network_precommit, I mean
18:29:27 rkukura: yes
18:31:11 so there is a data structure in sqlalchemy that gets populated with the NetworkSegment reference before the context.session.delete(network) is called
18:31:52 when context.session.delete(network) is called, a cascaded delete is attempted and sqlalchemy attempts to populate that same data structure with the same key (for the NetworkSegment object)
18:32:58 so sqlalchemy is doing the cascading in this case - trying to delete the NetworkSegment?
18:33:09 or just load the NetworkSegment?
18:33:59 yes, i think it is definitely trying to do a cascaded delete the second time around (when context.session.delete(network) is called)
18:34:37 the data structure is just named "dirty" or something (if i recall correctly), so it's difficult to tell what it's being used for
18:35:02 ok
18:35:19 to make progress i am going to remove the "lazy=joined" with a big REVISIT explanation, and see how it behaves with the devstack testing
18:35:30 we can circle back to this after we have resolved other issues
18:35:40 makes sense
18:36:17 you will also get some more time to think on this in the background
18:36:36 okay so the next one, when i run this:
18:36:37 gbpservice.neutron.tests.unit.plugins.ml2plus.test_apic_aim.TestPortBinding.test_bind_vnic_direct_port
18:37:52 if i put a: with db_api.context_manager.writer.using(context):
18:37:54 at:
18:37:59 https://github.com/openstack/group-based-policy/blob/master/gbpservice/neutron/plugins/ml2plus/plugin.py#L360
18:38:09 i get an error:
18:38:47 Failed to bind
18:38:52 several times
18:38:57 and the update fails
18:40:02 for now i have removed the "with"
18:40:09 but something we need to investigate more
18:40:44 I don't see the with that you are removing
18:41:22 were you saying that you had added it, it failed, so you removed it?
18:41:40 right right
18:42:08 i wanted to wrap all the methods
18:42:11 we definitely should not wrap update_port in a transaction if we can avoid it - port binding is definitely supposed to happen outside a transaction
18:42:21 why wrap any of them?
18:42:34 only the precommit part should normally be in a transaction
18:42:54 rkukura: okay, so i am guessing there is code in there (on the neutron side) which might be creating a new transaction/session
18:43:05 okay so i am not going to wrap it
18:43:25 only problem is when this gets called from GBP, we will most likely see the same issue
18:43:32 update_port definitely can run multiple separate transactions
18:43:42 since GBP will have its own transaction
18:43:55 does GBP call update_port to set binding:host_id?
18:44:08 that is usually what triggers [re]binding
18:44:30 perhaps not directly, but on a create/update policy_target maybe?
18:45:12 if this is only triggered from nova, then great
18:46:11 There are other attributes of port that can trigger re-binding if the host_id is already set, like vnic_type
18:46:23 but I would not expect GBP to be updating those
18:46:39 okay nice, whew!
18:46:54 i have this CLI patch:
18:46:55 https://review.openstack.org/#/c/529001/4
18:47:24 had not noticed - will review
18:47:29 thanks
18:47:48 this is sort of getting validated from: https://review.openstack.org/#/c/518183
18:47:54 the devstack jobs are using it
18:47:58 is it safe to merge before the neutron pike sync patch?
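The rule rkukura states above ("only the precommit part should normally be in a transaction") can be sketched as a toy event trace. This is a simplified illustration, not the real ML2 plugin code: precommit driver hooks run inside the DB transaction, postcommit hooks run after it commits, and port [re]binding happens entirely outside any transaction so it can safely make remote calls.

```python
# Illustrative only -- names mirror the ML2 hook vocabulary but the
# implementation is a toy.
import contextlib

events = []

@contextlib.contextmanager
def transaction():
    events.append('begin')
    yield
    events.append('commit')

def update_port():
    with transaction():
        # DB work and the *_precommit driver hook belong inside the txn.
        events.append('update_port_precommit')
    # After commit: the *_postcommit hook, then [re]binding, both outside
    # the transaction (binding may make remote calls, e.g. to ESX).
    events.append('update_port_postcommit')
    events.append('bind_port')

update_port()
```

Wrapping the whole of `update_port` in an outer `writer.using(context)` would pull `bind_port` inside the transaction, which is the situation that produced the "Failed to bind" errors above.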
18:48:17 it should be fine, worst case we can revert
18:48:32 we do have a stable/ocata branch
18:48:54 so no one is expected to be using master directly (at least not that we know of)
18:48:59 oh well, wait
18:49:05 i think it's better to wait
18:49:18 we have other patches in the queue in GBP
18:49:28 which might need to merge before the pike sync patch
18:49:36 right, but I'll review
18:49:40 so i guess better to wait
18:49:41 thanks
18:49:51 the other issue is the timeouts in the gate
18:50:08 the py27 job again times out
18:50:16 I was just looking at some of the "with db_api.context_manager.writer.using(context):" code added to other plugin methods. Are any of these really needed?
18:50:31 Times out on master?
18:51:02 yes
18:51:09 New UTs for validate/repair tool will add some time too
18:51:15 bummer
18:51:33 have they cut back our cores on master, or the timeout period?
18:52:12 i haven't checked, will check
18:52:24 but yeah, on those occasions it runs longer than 35 mins
18:52:38 and most likely the reason is the cores/threads
18:52:41 maybe we need to start thinking about moving the AIM stuff to a separate repo or something
18:52:59 hmmm, good idea, worth exploring
18:53:14 or a separate job
18:53:23 maybe
18:53:57 that's pretty much what i had
18:54:08 oh, on your earlier question
18:54:38 i was trying to wrap as many places as i could in the "with"
18:55:08 in general, we still want postcommit methods for ML2[plus] to run outside transactions, don't we?
18:55:23 as opposed to before, where if you had a: with session.begin(subtransactions=True)
18:55:41 you could expect that any subsequent call would result in a nested transaction
18:55:45 In fact, I'd like to look at eliminating ML2plus soon
18:55:46 however now that is not the case
18:56:10 ?
18:56:59 so unless you have a "with…writer()" at the top level, the subsequent calls to "with session.begin(subtransactions=True)" can result in different (non-nested) transactions
18:57:23 can we just change GBP to use "with…writer()" for all its transactions?
18:58:31 so on your earlier point about postcommits, for apic_aim, we don't do anything inside the postcommit, right?
18:58:59 right, but ideally apic_aim would be able to co-exist with other MDs that might use postcommit
18:59:23 :-)
18:59:46 and apic_aim's port binding can do remote calls to ESX I think, so that should not be inside a transaction
19:00:08 okay so i am definitely removing it from update_port
19:00:16 i already did in the latest patchset
19:00:31 i will remove it from other places as well, once i get a clean run
19:00:57 ok, but I'm still concerned about the others - I think switching GBP to use the new API would be best
19:01:21 GBP is already switched
19:01:41 oh, good
19:01:57 thanks
19:02:14 think we are out of time
19:02:24 i think on one occasion i saw an issue with the outer transaction block not being present in ml2plus, hence added it
19:02:28 yeah
19:02:35 thanks rkukura!
19:02:40 happy holidays
19:02:42 bye
19:02:43 you too!
19:02:47 #endmeeting
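The transaction-nesting point made at the end of the meeting can be sketched with a toy model. This is not the real oslo.db enginefacade code; the sketch only captures the behaviour described: enginefacade keys the active transaction off the context object, so nesting happens only when the same context (with its state intact) is passed back in, and a manufactured context that did not inherit that state gets its own, separate transaction.

```python
# Toy stand-in for enginefacade's writer.using(context); illustrative only.
import contextlib

class FakeFacade:
    def __init__(self):
        self.transactions_started = 0

    @contextlib.contextmanager
    def writer_using(self, context):
        if context.txn is None:
            # No transaction on this context yet: start a fresh one.
            self.transactions_started += 1
            context.txn = object()
            try:
                yield context.txn
            finally:
                context.txn = None
        else:
            # Same context already inside a writer block: nest in it.
            yield context.txn

class Context:
    def __init__(self):
        self.txn = None

facade = FakeFacade()

ctx = Context()
with facade.writer_using(ctx):
    with facade.writer_using(ctx):   # reuses the outer transaction
        pass
after_nested = facade.transactions_started   # still one transaction

fresh = Context()                    # manufactured context without the
with facade.writer_using(fresh):     # caller's state: separate transaction
    pass
after_fresh = facade.transactions_started
```

This is why losing the context (or manufacturing a new one without copying its state) silently breaks nesting, whereas the old `session.begin(subtransactions=True)` pattern nested by virtue of sharing the session.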