18:01:26 <SumitNaiksatam> #startmeeting networking_policy
18:01:27 <openstack> Meeting started Thu Jun 29 18:01:26 2017 UTC and is due to finish in 60 minutes. The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:01:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:01:30 <openstack> The meeting name has been set to 'networking_policy'
18:01:36 <rkukura> hi
18:02:08 <SumitNaiksatam> #info agenda https://wiki.openstack.org/wiki/Meetings/GroupBasedPolicy#June_29th_2017
18:02:22 <SumitNaiksatam> #topic Ocata sync
18:02:44 <SumitNaiksatam> #link https://review.openstack.org/#/q/topic:ocata_sync
18:03:08 <SumitNaiksatam> first up, annak thanks for getting all the UTs to pass :-)
18:03:17 <rkukura> +2
18:03:26 <annak> :)
18:03:45 <SumitNaiksatam> also thanks for the detailed commit message, and for outlining the gray areas in your email
18:04:06 <SumitNaiksatam> in my follow up patch i am trying to get the devstack jobs to pass
18:04:22 <SumitNaiksatam> the initial issues were related to job setup
18:05:06 <SumitNaiksatam> after a couple of days of wrangling with that, the resource_mapping driver and aim_mapping driver related devstack jobs are getting set up and run properly
18:05:26 <SumitNaiksatam> with the resource_mapping driver the exercise tests pass
18:05:42 <SumitNaiksatam> but a couple of tests are failing from the gbpfunc test suite (jishnu's test suite)
18:05:48 <SumitNaiksatam> i haven't followed up on those
18:05:58 <SumitNaiksatam> the issue with the aim_mapping driver is more serious
18:06:22 <SumitNaiksatam> and reinforces some of the questions that annak has asked in her email
18:06:40 <SumitNaiksatam> most notably, neutron has switched to using the new engine facade
18:07:32 <SumitNaiksatam> and us using the old engine facade pattern -> with session.begin(subtransactions=True):
18:07:36 <SumitNaiksatam> is not working
18:07:47 <SumitNaiksatam> at least with the aim_mapping driver
18:08:15 <annak> I think it's not working when wrapped with the new facade (or the other way around)
18:08:18 <SumitNaiksatam> already discussed this a little bit with annak, so this is mostly an update for you rkukura (and also for the record)
18:08:25 <SumitNaiksatam> annak: correct
18:08:33 <rkukura> thanks - I'm paying attention!
18:08:50 <SumitNaiksatam> so the place where it's failing is during l3p delete
18:09:04 <SumitNaiksatam> within the big transaction we first delete the subnetpool
18:09:08 <rkukura> Do we know what we should be doing instead of session.begin() to start a (potentially nested) transaction?
18:09:12 <SumitNaiksatam> then we try to delete the address_scope
18:09:18 <SumitNaiksatam> rkukura: one sec
18:09:47 <rkukura> SumitNaiksatam: keep going on your explanation
18:10:17 <SumitNaiksatam> when we try to delete the address_scope, the deleted subnetpool (which is not yet committed) does not show up as deleted in neutron's db_plugin
18:10:24 <SumitNaiksatam> hence address_scope delete fails
18:10:49 <SumitNaiksatam> it's over here #link https://github.com/openstack/neutron/blame/master/neutron/db/db_base_plugin_v2.py#L1195
18:11:27 <SumitNaiksatam> rkukura: so to your question (and annak's comment earlier), when the code hits this point, there is a mix of the old pattern (in ml2plus) and the new pattern ^^^
18:11:50 <rkukura> Do we think we are therefore running two separate transactions?
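
To make the two transaction styles under discussion concrete, here is a minimal, self-contained sketch of the old session.begin(subtransactions=True) pattern versus the new oslo.db enginefacade pattern neutron has moved to. The context_manager and RequestContext names are illustrative stand-ins (neutron keeps its own shared instances); this is not GBP or neutron code.

    from oslo_db.sqlalchemy import enginefacade

    # Standalone transaction context for illustration only; real neutron
    # code shares one instance (e.g. via neutron.db.api).
    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection='sqlite://')


    @enginefacade.transaction_context_provider
    class RequestContext(object):
        """Stand-in for a request context; the decorator lets
        writer.using() attach a session to instances of this class."""


    def old_pattern(context):
        # Legacy pattern still used in ml2plus/GBP; assumes the context
        # exposes a legacy .session, as neutron's request context does.
        with context.session.begin(subtransactions=True):
            pass  # DB work on context.session


    def new_pattern(context):
        # New enginefacade pattern; nested writer.using() blocks that
        # receive the same context are expected to join one
        # session/transaction.
        with context_manager.writer.using(context):
            pass  # DB work on context.session


    new_pattern(RequestContext())
    # Mixing the two patterns in one call chain is exactly the situation
    # suspected of splitting work across different sessions/transactions.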
18:12:07 <SumitNaiksatam> apparently the deleted subnetpool object is not present in the same session/transaction cache when it hits the above code
18:12:20 <SumitNaiksatam> rkukura: that is my suspicion
18:13:11 <SumitNaiksatam> so, i did change the pattern in ml2plus to be consistent with the new pattern, though i have only done it in one place so far, in delete_address_scope, and it did not help
18:13:24 <SumitNaiksatam> wondering if it has to be changed everywhere
18:13:55 <SumitNaiksatam> but before doing that i wanted to understand if this was the right thing to do (in order to maintain transactional consistency)
18:14:05 <rkukura> right
18:14:17 <SumitNaiksatam> as in whether it's equivalent to using the old nested/subtransactions pattern
18:14:25 <SumitNaiksatam> posed that question to annak yesterday
18:14:49 <SumitNaiksatam> and meant to do my own reading, but haven't gotten a chance to read up
18:15:17 <SumitNaiksatam> btw, reference material is here: #link https://blueprints.launchpad.net/neutron/+spec/enginefacade-switch
18:15:39 <SumitNaiksatam> annak: you seem to think changing it consistently will fix the problem?
18:16:04 <annak> I have a deja vu about these types of errors, and was trying to find relevant commits, but didn't succeed so far
18:16:26 <SumitNaiksatam> annak: okay, you also mentioned something about UTs, but i didn't understand that part
18:16:33 <SumitNaiksatam> as in why it's not failing in the UTs
18:16:39 <annak> I think we should change all transactions around subnets/subnetpools
18:17:02 <SumitNaiksatam> (my theory was that sqlite is masking the issue, assuming we have the right test coverage in the first place)
18:17:07 <SumitNaiksatam> annak: okay
18:17:08 <annak> yes, and it would help to understand why UTs are passing - maybe this flow is not covered?
18:17:19 <SumitNaiksatam> annak: :-)
18:17:24 <annak> :)
18:17:45 <SumitNaiksatam> i would be really surprised if this flow is not getting covered in the UTs
18:18:00 <annak> it's l3 policy deletion, correct?
18:18:06 <SumitNaiksatam> i mean we do have the lifecycle test
18:18:09 <SumitNaiksatam> annak: right
18:18:18 <SumitNaiksatam> it will only happen in the aim_mapping driver case though
18:18:35 <annak> yes, that's where the subnetpool tests are
18:18:43 <SumitNaiksatam> yeah
18:18:50 <annak> I can check the coverage
18:19:26 <SumitNaiksatam> rkukura: the other thing is that we noticed neutron has not switched to this pattern everywhere (yet)
18:19:29 <SumitNaiksatam> annak: thanks
18:19:52 <annak> if we fix this specific issue, but break subtransactions, will this be revealed in testing?
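
On the question of whether switching ml2plus consistently to the new pattern preserves the old nested/subtransactions semantics, the sketch below (same illustrative context_manager/RequestContext setup as above, with simplified placeholder function names rather than the real GBP methods) shows the shape of the l3p delete flow, assuming nested writer.using() blocks on the same context join a single session/transaction; a step that lands on a different session would not see the pending subnetpool delete, which matches the observed failure.

    from oslo_db.sqlalchemy import enginefacade

    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection='sqlite://')


    @enginefacade.transaction_context_provider
    class RequestContext(object):
        """Illustrative stand-in for a request context."""


    def delete_l3_policy(context):
        # Outer "big transaction" opened around the whole l3p delete.
        with context_manager.writer.using(context):
            outer_session = context.session
            delete_subnetpool(context)
            delete_address_scope(context, outer_session)


    def delete_subnetpool(context):
        # Nested block with the same context: expected to join the outer
        # transaction, so its uncommitted delete stays visible later.
        with context_manager.writer.using(context):
            pass  # subnetpool delete would happen here


    def delete_address_scope(context, outer_session):
        with context_manager.writer.using(context):
            # Expected to hold when everything uses the new pattern; a
            # block that ends up on a different session (the suspected
            # old/new mix) would not see the pending subnetpool delete,
            # so the in-use check in db_base_plugin_v2 still "finds" it.
            assert context.session is outer_session


    delete_l3_policy(RequestContext())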
18:20:01 <SumitNaiksatam> annak: it should be
18:20:07 <annak> good
18:20:26 <SumitNaiksatam> annak: we also have tests which check that rollback happens
18:21:04 <rkukura> SumitNaiksatam: Could be a problem if old and new patterns cannot be successfully nested, neutron has some flows using each, and GBP needs to wrap both kinds in neutron within the same top-level transaction
18:21:17 <SumitNaiksatam> so basically an exception is raised in the innermost part of the code while creating each resource, and the tests check that all operations are rolled back correctly
18:21:35 <SumitNaiksatam> rkukura: yeah
18:22:18 <SumitNaiksatam> there is one thing though, neutron is incrementally switching to this pattern
18:22:44 <SumitNaiksatam> so we were wondering how things don't break in the interim (like they are breaking for us now when there is a mix of the patterns)
18:23:18 <SumitNaiksatam> and whether the places that have not been migrated are on the path to migration, or in certain cases the old pattern needs to be used
18:23:36 <rkukura> Neutron itself does not use large deeply nested transactions the way we are doing with AIM - remember the transaction guards that prevent nested calls?
18:23:50 <SumitNaiksatam> rkukura: right, i was just typing that
18:24:01 <SumitNaiksatam> so that might explain it
18:25:03 <SumitNaiksatam> i guess we haven't done enough homework on understanding the new pattern
18:25:05 <rkukura> Is it likely that the issue mixing patterns is that different sessions get used?
18:25:26 <SumitNaiksatam> rkukura: seems like it
18:26:12 <SumitNaiksatam> it wasn't easy to instrument the code to check this (with the new pattern)
18:26:38 <SumitNaiksatam> that's what i was trying to figure out before this meeting
18:27:32 <SumitNaiksatam> the second issue was related to the queued notifications
18:27:54 <SumitNaiksatam> annak: i shared the stack trace with you earlier
18:30:48 <SumitNaiksatam> i haven't had a chance to dig too much into it
18:31:21 <SumitNaiksatam> coming from here: File "/opt/stack/new/neutron/neutron/plugins/ml2/ovo_rpc.py", line 61, in _is_session_semantic_violated
18:32:16 <SumitNaiksatam> annak: anything you wanted to add to the above in terms of open issues?
18:32:28 <annak> yes, I saw it too, I wasn't sure if it's an issue or just a warning
18:32:56 <annak> I think the resource mapping failures might be syntax related
18:33:12 <SumitNaiksatam> annak: ah okay good
18:33:59 <SumitNaiksatam> annak: so you are not saying you saw the rpc notification issue with the resource_mapping driver as well, are you?
18:34:05 <annak> and regarding delayed notifications - do you think there's a place for backporting the fixes?
18:34:54 <SumitNaiksatam> annak: sorry didn't understand - "do you think there's a place for backporting the fixes?"
18:35:53 <annak> there are some fixes in the notification area that are not related to ocata
18:36:36 <annak> (these issues happen when extra callbacks are added)
18:36:47 <annak> so I wasn't sure if it was relevant to backport
18:36:55 <rkukura> just for the record, I still believe our approach to delaying notifications is not ideal
18:37:42 <SumitNaiksatam> rkukura: i agree, you have my complete support in changing it :-)
18:38:08 <SumitNaiksatam> annak: if you think the issue exists in newton as well, sure we should backport the fix
18:39:29 <SumitNaiksatam> annak: is this something you outlined in your email, or something else?
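
The rollback tests mentioned above follow the pattern sketched below. This is a self-contained illustration in plain SQLAlchemy against in-memory sqlite (not the actual GBP unit tests): fail the innermost step of a flow that runs inside one enclosing transaction and confirm that every earlier write is undone.

    import sqlalchemy as sa

    engine = sa.create_engine('sqlite://')
    metadata = sa.MetaData()
    resources = sa.Table(
        'resources', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('name', sa.String(32)))
    metadata.create_all(engine)


    def create_flow(conn):
        # Two writes followed by a simulated failure in the innermost step.
        conn.execute(resources.insert().values(name='subnetpool'))
        conn.execute(resources.insert().values(name='address_scope'))
        raise RuntimeError('innermost step failed')


    try:
        with engine.begin() as conn:  # one enclosing transaction
            create_flow(conn)
    except RuntimeError:
        pass

    with engine.connect() as conn:
        # Both inserts were rolled back along with the failed transaction.
        assert conn.execute(resources.select()).fetchall() == []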
18:39:50 <annak> yes
18:40:37 <annak> ok, will backport
18:40:37 <rkukura> For annak's benefit, my argument has been that we should not delay the in-process callbacks, which are intended to be run at specific points during request processing (i.e. before, inside, or after transactions), but should instead find a way to delay sending the outgoing messages that they emit. What we've been doing may have worked, but changes in the callback methods could break it at any time.
18:41:22 <SumitNaiksatam> rkukura: we are not delaying the in-process callbacks (or at least we don't intend to)
18:41:41 <annak> rkukura: thanks, Sumit explained to me the motivation and weak points
18:41:52 <annak> it seems indeed fragile
18:42:01 <rkukura> SumitNaiksatam: I must be really confused, but I thought we were delaying certain ones of them
18:42:14 <SumitNaiksatam> not in-process
18:42:44 <rkukura> the callback method itself is in-process, and may send a message via AMQP to other processes
18:42:49 <SumitNaiksatam> we aim to identify all the in-process callbacks, and call them immediately
18:43:13 <rkukura> so my understanding is we've tried to delay only the callbacks that send AMQP messages
18:43:20 <annak> according to the code, if in-process callbacks exist, the out-of-process ones will not be delayed either
18:43:21 <SumitNaiksatam> rkukura: right
18:43:40 <rkukura> but the code that sends the messages may expect to be called at a certain point (i.e. inside or outside the transaction)
18:44:01 <rkukura> sounds like we are all on the same page here
18:44:16 <SumitNaiksatam> rkukura: right, so there are certain exceptions to the out-of-process notifications as well
18:44:35 <SumitNaiksatam> rkukura: there are some out-of-process notifications which we send immediately
18:44:44 <rkukura> SumitNaiksatam: Didn't realize that
18:45:13 <SumitNaiksatam> rkukura: for the reason that you mentioned, but with the downside that if the transaction fails, the notification sent was incorrect
18:45:36 <SumitNaiksatam> rkukura: but in these cases it seems the agent is resilient
18:45:58 <SumitNaiksatam> and we haven't seen issues, but we might not have tested it enough
18:46:00 <rkukura> Just to avoid confusion, I prefer the term "callback" for the in-process method calls (via the registry), and "notification" for the messages sent via AMQP. Some "callbacks" generate "notifications" and others don't.
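
Putting rkukura's callback/notification distinction into code terms, the intended behavior described here could look roughly like the sketch below: in-process callbacks run immediately at their natural point in the request, while AMQP-emitting subscribers are queued and flushed only after the enclosing transaction commits (or discarded on rollback). All names are hypothetical; this is not the actual local_api.py implementation.

    class DeferredNotifier(object):

        def __init__(self, in_process_callbacks, out_of_process_notifiers):
            self._in_process = list(in_process_callbacks)
            self._out_of_process = list(out_of_process_notifiers)
            self._queue = []

        def notify(self, event, **payload):
            # In-process callbacks must see the event right away...
            for callback in self._in_process:
                callback(event, **payload)
            # ...while AMQP-emitting subscribers are only recorded.
            for notifier in self._out_of_process:
                self._queue.append((notifier, event, payload))

        def flush(self):
            # Called once the outer transaction has committed.
            queue, self._queue = self._queue, []
            for notifier, event, payload in queue:
                notifier(event, **payload)

        def discard(self):
            # Called if the transaction rolled back: the queued
            # notifications describe state that never became visible.
            self._queue = []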
18:46:18 <SumitNaiksatam> rkukura: sure, good characterization
18:46:37 <SumitNaiksatam> annak: coming to your point
18:47:40 <SumitNaiksatam> annak: if i recall correctly, we do discriminate based on the called entity in terms of whether we want to queue or not for a particular event
18:48:25 <SumitNaiksatam> it was not intended to process the event just one way or the other (meaning either delay or not)
18:48:58 <annak> yes
18:49:28 <SumitNaiksatam> the in-process entities should get called immediately, and the entities which perform the out-of-process notification should get queued
18:49:34 <SumitNaiksatam> for the same event
18:49:50 <SumitNaiksatam> at least that was the goal
18:50:04 <annak> so i think the behavior is wrong then
18:51:25 <SumitNaiksatam> annak: okay, perhaps you can point to the specifics
18:51:39 <annak> yes, just a sec
18:51:46 <SumitNaiksatam> annak: the other thing was that in your email you mentioned events being eaten
18:51:59 <SumitNaiksatam> annak: so would be curious to know where that happened as well
18:53:41 <annak> https://github.com/openstack/group-based-policy/blob/master/gbpservice/network/neutronv2/local_api.py#L170
18:56:18 <annak> oh sorry I think I was wrong..
18:56:53 <annak> I'll need to revisit the fix
18:56:55 <annak> sorry :)
18:57:43 <annak> I did see the callback disappear, but my fix is not good
18:59:05 <annak> if in_process_callbacks exist, we'll never get to queue anything, right? because callbacks is set to in_process_callbacks
18:59:15 <SumitNaiksatam> rkukura: annak: sorry i lost network connectivity
18:59:27 <annak> so the out_of_process callback is lost
18:59:47 <annak> but I assumed this was the desired behavior, and fixed accordingly
19:00:13 <annak> so I need to re-fix according to what SumitNaiksatam explained
19:00:22 <SumitNaiksatam> annak: i will read the chat transcript
19:00:33 <SumitNaiksatam> and i think you have details in your email as well, which i overlooked
19:00:51 <SumitNaiksatam> we have hit the hour, so will end the meeting now
19:00:54 <SumitNaiksatam> #endmeeting
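
To make annak's closing observation concrete, here is a hedged reconstruction of the suspected behavior around local_api.py#L170 versus the intended behavior SumitNaiksatam described. The function and parameter names are hypothetical; this is illustrative logic, not a quote of the GBP code.

    def dispatch_suspected(all_callbacks, in_process_callbacks, queue, event):
        # Suspected behavior: when any in-process callback exists for the
        # event, the callback list is narrowed to just those, so the
        # out-of-process callbacks are neither invoked nor queued, i.e.
        # they are lost.
        if in_process_callbacks:
            for cb in in_process_callbacks:
                cb(event)
        else:
            for cb in all_callbacks:
                queue.append((cb, event))


    def dispatch_intended(all_callbacks, in_process_callbacks, queue, event):
        # Intended behavior per the discussion: for the same event,
        # in-process callbacks fire immediately and every remaining
        # (out-of-process) callback is queued for delivery after commit.
        for cb in in_process_callbacks:
            cb(event)
        for cb in all_callbacks:
            if cb not in in_process_callbacks:
                queue.append((cb, event))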