18:01:26 #startmeeting networking_policy
18:01:27 Meeting started Thu Jun 29 18:01:26 2017 UTC and is due to finish in 60 minutes. The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:01:28 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:01:30 The meeting name has been set to 'networking_policy'
18:01:36 hi
18:02:08 #info agenda https://wiki.openstack.org/wiki/Meetings/GroupBasedPolicy#June_29th_2017
18:02:22 #topic Ocata sync
18:02:44 #link https://review.openstack.org/#/q/topic:ocata_sync
18:03:08 first up, annak thanks for getting all the UTs to pass :-)
18:03:17 +2
18:03:26 :)
18:03:45 also thanks for the detailed commit message, and outlining the gray areas in your email
18:04:06 in my follow-up patch I am trying to get the devstack jobs to pass
18:04:22 the initial issues were related to job setup
18:05:06 after a couple of days of wrangling with that, the resource_mapping driver and aim_mapping driver related devstack jobs are getting set up and run properly
18:05:26 with the resource_mapping driver the exercise tests pass
18:05:42 but a couple of tests are failing from the gbpfunc test suite (jishnu’s test suite)
18:05:48 I haven't followed up on those
18:05:58 the issue with the aim_mapping driver is more serious
18:06:22 and reinforces some of the questions that annak has asked in her email
18:06:40 most notably, neutron has switched to using the new engine facade
18:07:32 and our use of the old engine facade pattern -> with session.begin(subtransactions=True):
18:07:36 is not working
18:07:47 at least with the aim_mapping driver
18:08:15 I think it's not working when wrapped with the new facade (or the other way around)
18:08:18 already discussed this a little bit with annak, so this is mostly an update for you rkukura (and also for the record)
18:08:25 annak: correct
18:08:33 thanks - I’m paying attention!
18:08:50 so the place where it's failing is during l3p delete
18:09:04 within the big transaction we first delete the subnetpool
18:09:08 Do we know what we should be doing instead of session.begin() to start a (potentially nested) transaction?
18:09:12 then we try to delete the address_scope
18:09:18 rkukura: one sec
18:09:47 SumitNaiksatam: keep going on your explanation
18:10:17 when we try to delete the address_scope, the deleted subnetpool (which is not yet committed) does not show up as deleted in neutron’s db_plugin
18:10:24 hence address_scope delete fails
18:10:49 it's over here #link https://github.com/openstack/neutron/blame/master/neutron/db/db_base_plugin_v2.py#L1195
18:11:27 rkukura: so to your question (and annak’s comment earlier), when the code hits this point, there is a mix of the old pattern (in ml2plus) and the new pattern ^^^
18:11:50 Do we think we are therefore running two separate transactions?
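A minimal sketch of the two transaction patterns being contrasted above. The new-pattern helper names assume the Ocata-era neutron.db.api wrapper around oslo.db's enginefacade (see the blueprint linked later in the meeting); exact module paths may differ by release.

    # Old pattern (what ml2plus / GBP code still uses):
    def delete_old_style(context):
        with context.session.begin(subtransactions=True):
            # nested work shares the outer transaction via subtransactions
            pass

    # New pattern (what Neutron's db_base_plugin_v2 has moved to);
    # assumption: Ocata-era location of the context manager.
    from neutron.db import api as db_api

    def delete_new_style(context):
        with db_api.context_manager.writer.using(context):
            # enginefacade manages the session/transaction tied to the context
            pass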
18:12:07 apparently the deleted subnetpool object is not present in the same session/transaction cache when it hits the above code
18:12:20 rkukura: that is my suspicion
18:13:11 so, I did change the pattern in ml2plus to be consistent with the new pattern, though so far I only did it in one place, in delete_address_scope, and it did not help
18:13:24 wondering if it has to be changed everywhere
18:13:55 but before doing that I wanted to understand if this was the right thing to do (in order to maintain transactional consistency)
18:14:05 right
18:14:17 as in whether it's equivalent to using the old nested/subtransactions pattern
18:14:25 posed that question to annak yesterday
18:14:49 and meant to do my own reading, but haven't gotten a chance to read up
18:15:17 btw, reference material is here: #link https://blueprints.launchpad.net/neutron/+spec/enginefacade-switch
18:15:39 annak: you seem to think changing it consistently will fix the problem?
18:16:04 I have a deja vu about these types of errors, and was trying to find relevant commits, but didn't succeed so far
18:16:26 annak: okay, you also mentioned something about UTs, but I didn't understand that part
18:16:33 as in why it's not failing in the UTs
18:16:39 I think we should change all transactions around subnets/subnetpools
18:17:02 (my theory was that sqlite is masking the issue, assuming we have the right test coverage in the first place)
18:17:07 annak: okay
18:17:08 yes, and it would help to understand why UTs are passing - maybe this flow is not covered?
18:17:19 annak: :-)
18:17:24 :)
18:17:45 I would be really surprised if this flow is not getting covered in the UTs
18:18:00 it's l3 policy deletion, correct?
18:18:06 I mean we do have the lifecycle test
18:18:09 annak: right
18:18:18 it will only happen in the aim_mapping driver case though
18:18:35 yes, that's where the subnetpool tests are
18:18:43 yeah
18:18:50 I can check the coverage
18:19:26 rkukura: the other thing is that we noticed neutron has not switched to this pattern everywhere (yet)
18:19:29 annak: thanks
18:19:52 if we fix this specific issue, but break subtransactions, will this be revealed in testing?
18:20:01 annak: it should be
18:20:07 good
18:20:26 annak: we also have tests which check that rollback happens
18:21:04 SumitNaiksatam: Could be a problem if old and new patterns cannot be successfully nested, neutron has some flows using each, and GBP needs to wrap both kinds in neutron within the same top-level transaction
18:21:17 so basically an exception is raised in the innermost part of the code while creating each resource, and we check that all operations are rolled back correctly
18:21:35 rkukura: yeah
18:22:18 there is one thing though, neutron is incrementally switching to this pattern
18:22:44 so we were wondering how things don't break in the interim (like they are breaking for us now when there is a mix of the patterns)
18:23:18 and whether the places that have not been migrated are on the path to migration, or in certain cases the old pattern needs to be used
18:23:36 Neutron itself does not use large deeply nested transactions the way we are doing with AIM - remember the transaction guards that prevent nested calls?
18:23:50 rkukura: right, I was just typing that
18:24:01 so that might explain it
18:25:03 I guess we haven't done enough homework on understanding the new pattern
18:25:05 Is it likely that the issue mixing patterns is that different sessions get used?
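A hypothetical instrumentation snippet (not anything in the tree) for checking the question just raised, i.e. whether nesting the old and new patterns ends up on different SQLAlchemy sessions; names again assume the Ocata-era neutron.db.api.

    from neutron.db import api as db_api

    def check_session_identity(context):
        with context.session.begin(subtransactions=True):       # old pattern
            outer_session = context.session
            with db_api.context_manager.writer.using(context):  # new pattern
                inner_session = context.session
                # If these differ, the inner block runs in a separate
                # session/transaction and will not see uncommitted changes
                # (e.g. the deleted subnetpool) made in the outer one.
                print(outer_session is inner_session)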
18:25:26 rkukura: seems like
18:26:12 it wasn't easy to instrument the code to check this (with the new pattern)
18:26:38 that's what I was trying to figure out before this meeting
18:27:32 the second issue was related to the queued notifications
18:27:54 annak: I shared the stack trace with you earlier
18:30:48 I haven't had a chance to dig too much into it
18:31:21 coming from here: File "/opt/stack/new/neutron/neutron/plugins/ml2/ovo_rpc.py", line 61, in _is_session_semantic_violated
18:32:16 annak: anything you wanted to add to the above in terms of open issues?
18:32:28 yes, I saw it too, I wasn't sure if it's an issue or just a warning
18:32:56 I think the resource mapping failures might be syntax related
18:33:12 annak: ah okay good
18:33:59 annak: so you are not saying you saw the rpc notification issue with the resource_mapping driver as well, are you?
18:34:05 and regarding delayed notification - do you think there's a place for backporting the fixes?
18:34:54 annak: sorry, didn't understand - “do you think there's a place for backporting the fixes?”
18:35:53 there are some fixes in the notification area that are not related to ocata
18:36:36 (these issues happen when extra callbacks are added)
18:36:47 so I wasn't sure if it was relevant to backport
18:36:55 just for the record, I still believe our approach to delaying notifications is not ideal
18:37:42 rkukura: I agree, you have my complete support to change it :-)
18:38:08 annak: if you think the issue exists in neutron as well, sure we should backport the fix
18:38:46 I meant newton, not neutron
18:39:29 annak: is this something you outlined in your email, or something else?
18:39:50 yes
18:40:37 ok, will backport
18:40:37 For annak’s benefit, my argument has been that we should not delay the in-process callbacks, which are intended to be run at specific points during request processing (i.e. before, inside, or after transactions), but should instead find a way to delay sending the outgoing messages that they emit. What we’ve been doing may have worked, but changes in the callback methods could break it at any time.
18:41:22 rkukura: we are not delaying the in-process callbacks (or at least we don't intend to)
18:41:41 rkukura: thanks, Sumit explained to me the motivation and weak points
18:41:52 it seems indeed fragile
18:42:01 SumitNaiksatam: I must be really confused, but I thought we were delaying certain ones of them
18:42:14 not in-process
18:42:44 the callback method itself is in-process, and may send a message via AMQP to other processes
18:42:49 we aim to identify all the in-process callbacks, and send them immediately
18:43:13 so my understanding is we've tried to delay only the callbacks that send AMQP messages
18:43:20 according to the code, if in-process callbacks exist, the out-of-process ones will not be delayed either
18:43:21 rkukura: right
18:43:40 but the code that sends the messages may expect to be called at a certain point (i.e. inside or outside the transaction)
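A simplified sketch of the behavior described above, assuming a hypothetical queue object rather than the actual gbpservice local_api.py code: in-process registry callbacks fire immediately at their expected point in the request, while AMQP-sending notifications for the same event are queued and only sent once the top-level transaction commits.

    class NotificationQueue(object):
        def __init__(self):
            self._pending = []

        def handle_event(self, resource, event, trigger, in_process_callbacks,
                         out_of_process_notifiers, **kwargs):
            # In-process callbacks expect to run at a specific point in the
            # request (before/inside/after the transaction), so never delay them.
            for callback in in_process_callbacks:
                callback(resource, event, trigger, **kwargs)
            # Outgoing AMQP notifications are held until commit, since a
            # rollback would otherwise leave agents with stale information.
            for notifier in out_of_process_notifiers:
                self._pending.append((notifier, resource, event, trigger, kwargs))

        def on_commit(self):
            for notifier, resource, event, trigger, kwargs in self._pending:
                notifier(resource, event, trigger, **kwargs)
            self._pending = []

        def on_rollback(self):
            self._pending = []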
18:44:01 sounds like we are all on the same page here
18:44:16 rkukura: right, so there are certain exceptions to the out-of-process notifications as well
18:44:35 rkukura: there are some out-of-process notifications which we send immediately
18:44:44 SumitNaiksatam: Didn’t realize that
18:45:13 rkukura: for the reason that you mentioned, but with the downside that if the transaction fails, the notification sent was incorrect
18:45:36 rkukura: but in these cases it seems the agent is resilient
18:45:58 and we haven't seen issues, but we might not have tested it enough
18:46:00 Just to avoid confusion, I prefer the term “callback” for the in-process method calls (via the registry), and “notification” for the messages sent via AMQP. Some “callbacks” generate “notifications” and others don’t.
18:46:18 rkukura: sure, good characterization
18:46:37 annak: coming to your point
18:47:40 annak: if I recall correctly, we do discriminate based on the called entity in terms of whether we want to queue or not for a given event
18:48:25 it was not intended to process the event just one way or the other (meaning either delay or not)
18:48:58 yes
18:49:28 the in-process entities should get called immediately, and the entities which perform the out-of-process notification should get queued
18:49:34 for the same event
18:49:50 at least that was the goal
18:50:04 so I think the behavior is wrong then
18:51:25 annak: okay, perhaps you can point to the specifics
18:51:39 yes, just a sec
18:51:46 annak: the other thing was in the email you mentioned about events being eaten
18:51:59 annak: so would be curious to know where that happened as well
18:53:41 https://github.com/openstack/group-based-policy/blob/master/gbpservice/network/neutronv2/local_api.py#L170
18:56:18 oh sorry, I think I was wrong..
18:56:53 I'll need to revisit the fix
18:56:55 sorry :)
18:57:43 I did see the callback disappear, but my fix is not good
18:59:05 if in_process_callbacks exist, we'll never get to queue anything, right? because callbacks is set to in_process_callbacks
18:59:15 rkukura: annak: sorry, I lost network connectivity
18:59:27 so the out_of_process callback is lost
18:59:47 but I assumed this was the desired behavior, and fixed according to it
19:00:13 so I need to re-fix according to what SumitNaiksatam explained
19:00:22 annak: I will read the chat transcript
19:00:33 and I think you have details in your email as well, which I overlooked
19:00:51 we have hit the hour, so will end the meeting now
19:00:54 #endmeeting
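An illustrative sketch, hypothetical and not the code at the local_api.py line linked above, of the selection problem annak describes at 18:59:05 (only in-process callbacks get handled, so the out-of-process notifier is dropped rather than queued) versus the intended split described by SumitNaiksatam.

    def dispatch_buggy(event, in_process_callbacks, out_of_process_callbacks, queue):
        # If any in-process callbacks exist, only they run; nothing is queued,
        # so the out-of-process callbacks for the same event are lost.
        callbacks = in_process_callbacks or out_of_process_callbacks
        if callbacks is in_process_callbacks:
            for cb in callbacks:
                cb(event)
        else:
            queue.extend((cb, event) for cb in callbacks)

    def dispatch_intended(event, in_process_callbacks, out_of_process_callbacks, queue):
        # Handle both kinds for the same event: immediately for in-process,
        # queued (until commit) for out-of-process.
        for cb in in_process_callbacks:
            cb(event)
        queue.extend((cb, event) for cb in out_of_process_callbacks)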