18:01:26 <SumitNaiksatam> #startmeeting networking_policy
18:01:27 <openstack> Meeting started Thu Jun 29 18:01:26 2017 UTC and is due to finish in 60 minutes. The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:01:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:01:30 <openstack> The meeting name has been set to 'networking_policy'
18:01:36 <rkukura> hi
18:02:08 <SumitNaiksatam> #info agenda https://wiki.openstack.org/wiki/Meetings/GroupBasedPolicy#June_29th_2017
18:02:22 <SumitNaiksatam> #topic Ocata sync
18:02:44 <SumitNaiksatam> #link https://review.openstack.org/#/q/topic:ocata_sync
18:03:08 <SumitNaiksatam> first up, annak thanks for getting all the UTs to pass :-)
18:03:17 <rkukura> +2
18:03:26 <annak> :)
18:03:45 <SumitNaiksatam> also thanks for the detailed commit message, and for outlining the gray areas in your email
18:04:06 <SumitNaiksatam> in my follow up patch i am trying to get the devstack jobs to pass
18:04:22 <SumitNaiksatam> the initial issues were related to job setup
18:05:06 <SumitNaiksatam> after a couple of days of wrangling with that, the resource_mapping driver and aim_mapping driver related devstack jobs are getting set up and run properly
18:05:26 <SumitNaiksatam> with the resource_mapping driver the exercise tests pass
18:05:42 <SumitNaiksatam> but a couple of tests are failing from the gbpfunc test suite (jishnu's test suite)
18:05:48 <SumitNaiksatam> i haven't followed up on those
18:05:58 <SumitNaiksatam> the issue with the aim_mapping driver is more serious
18:06:22 <SumitNaiksatam> and reinforces some of the questions that annak has asked in her email
18:06:40 <SumitNaiksatam> most notably, neutron has switched to using the new engine facade
18:07:32 <SumitNaiksatam> and us using the old engine facade pattern -> with session.begin(subtransactions=True):
18:07:36 <SumitNaiksatam> is not working
18:07:47 <SumitNaiksatam> at least with the aim_mapping driver
18:08:15 <annak> I think it's not working when wrapped with the new facade (or the other way around)
18:08:18 <SumitNaiksatam> already discussed this a little bit with annak, so this is mostly an update for you rkukura (and also for the record)
18:08:25 <SumitNaiksatam> annak: correct
18:08:33 <rkukura> thanks - I'm paying attention!
18:08:50 <SumitNaiksatam> so the place where it's failing is during l3p delete
18:09:04 <SumitNaiksatam> within the big transaction we first delete the subnetpool
18:09:08 <rkukura> Do we know what we should be doing instead of session.begin() to start a (potentially nested) transaction?
18:09:12 <SumitNaiksatam> then we try to delete the address_scope
18:09:18 <SumitNaiksatam> rkukura: one sec
18:09:47 <rkukura> SumitNaiksatam: keep going on your explanation
18:10:17 <SumitNaiksatam> when we try to delete the address_scope, the deleted subnetpool (which is not yet committed) does not show up as deleted in neutron's db_plugin
18:10:24 <SumitNaiksatam> hence address_scope delete fails
18:10:49 <SumitNaiksatam> it's over here #link https://github.com/openstack/neutron/blame/master/neutron/db/db_base_plugin_v2.py#L1195
18:11:27 <SumitNaiksatam> rkukura: so to your question (and annak's comment earlier), when the code hits this point, there is a mix of the old pattern (in ml2plus) and the new pattern ^^^
18:11:50 <rkukura> Do we think we are therefore running two separate transactions?
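
To make the two transaction styles under discussion concrete, here is a minimal, self-contained sketch of the old session.begin(subtransactions=True) pattern versus the new oslo.db enginefacade pattern neutron has moved to. The context_manager and RequestContext names are illustrative stand-ins (neutron keeps its own shared instances); this is not GBP or neutron code.

    from oslo_db.sqlalchemy import enginefacade

    # Standalone transaction context for illustration only; real neutron
    # code shares one instance (e.g. via neutron.db.api).
    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection='sqlite://')


    @enginefacade.transaction_context_provider
    class RequestContext(object):
        """Stand-in for a request context; the decorator lets
        writer.using() attach a session to instances of this class."""


    def old_pattern(context):
        # Legacy pattern still used in ml2plus/GBP; assumes the context
        # exposes a legacy .session, as neutron's request context does.
        with context.session.begin(subtransactions=True):
            pass  # DB work on context.session


    def new_pattern(context):
        # New enginefacade pattern; nested writer.using() blocks that
        # receive the same context are expected to join one
        # session/transaction.
        with context_manager.writer.using(context):
            pass  # DB work on context.session


    new_pattern(RequestContext())
    # Mixing the two patterns in one call chain is exactly the situation
    # suspected of splitting work across different sessions/transactions.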
18:12:07 <SumitNaiksatam> apparently the deleted subnetpool object is not present in the same session/transaction cache when it hits the above code
18:12:20 <SumitNaiksatam> rkukura: that is my suspicion
18:13:11 <SumitNaiksatam> so, i did change the pattern in ml2plus to be consistent with the new pattern, though i have only done it in one place so far, in delete_address_scope, and it did not help
18:13:24 <SumitNaiksatam> wondering if it has to be changed everywhere
18:13:55 <SumitNaiksatam> but before doing that i wanted to understand if this was the right thing to do (in order to maintain transactional consistency)
18:14:05 <rkukura> right
18:14:17 <SumitNaiksatam> as in whether it's equivalent to using the old nested/subtransactions pattern
18:14:25 <SumitNaiksatam> posed that question to annak yesterday
18:14:49 <SumitNaiksatam> and meant to do my own reading, but haven't gotten a chance to read up
18:15:17 <SumitNaiksatam> btw, reference material is here: #link https://blueprints.launchpad.net/neutron/+spec/enginefacade-switch
18:15:39 <SumitNaiksatam> annak: you seem to think changing it consistently will fix the problem?
18:16:04 <annak> I have a deja vu about these types of errors, and was trying to find relevant commits, but didn't succeed so far
18:16:26 <SumitNaiksatam> annak: okay, you also mentioned something about UTs, but i didn't understand that part
18:16:33 <SumitNaiksatam> as in why it's not failing in the UTs
18:16:39 <annak> I think we should change all transactions around subnets/subnetpools
18:17:02 <SumitNaiksatam> (my theory was that sqlite is masking the issue, assuming we have the right test coverage in the first place)
18:17:07 <SumitNaiksatam> annak: okay
18:17:08 <annak> yes, and it would help to understand why UTs are passing - maybe this flow is not covered?
18:17:19 <SumitNaiksatam> annak: :-)
18:17:24 <annak> :)
18:17:45 <SumitNaiksatam> i would be really surprised if this flow is not getting covered in the UTs
18:18:00 <annak> it's l3 policy deletion, correct?
18:18:06 <SumitNaiksatam> i mean we do have the lifecycle test
18:18:09 <SumitNaiksatam> annak: right
18:18:18 <SumitNaiksatam> it will only happen in the aim_mapping driver case though
18:18:35 <annak> yes, that's where the subnetpool tests are
18:18:43 <SumitNaiksatam> yeah
18:18:50 <annak> I can check the coverage
18:19:26 <SumitNaiksatam> rkukura: the other thing is that we noticed neutron has not switched to this pattern everywhere (yet)
18:19:29 <SumitNaiksatam> annak: thanks
18:19:52 <annak> if we fix this specific issue, but break subtransactions, will this be revealed in testing?
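
On the question of whether switching ml2plus consistently to the new pattern preserves the old nested/subtransactions semantics, the sketch below (same illustrative context_manager/RequestContext setup as above, with simplified placeholder function names rather than the real GBP methods) shows the shape of the l3p delete flow, assuming nested writer.using() blocks on the same context join a single session/transaction; a step that lands on a different session would not see the pending subnetpool delete, which matches the observed failure.

    from oslo_db.sqlalchemy import enginefacade

    context_manager = enginefacade.transaction_context()
    context_manager.configure(connection='sqlite://')


    @enginefacade.transaction_context_provider
    class RequestContext(object):
        """Illustrative stand-in for a request context."""


    def delete_l3_policy(context):
        # Outer "big transaction" opened around the whole l3p delete.
        with context_manager.writer.using(context):
            outer_session = context.session
            delete_subnetpool(context)
            delete_address_scope(context, outer_session)


    def delete_subnetpool(context):
        # Nested block with the same context: expected to join the outer
        # transaction, so its uncommitted delete stays visible later.
        with context_manager.writer.using(context):
            pass  # subnetpool delete would happen here


    def delete_address_scope(context, outer_session):
        with context_manager.writer.using(context):
            # Expected to hold when everything uses the new pattern; a
            # block that ends up on a different session (the suspected
            # old/new mix) would not see the pending subnetpool delete,
            # so the in-use check in db_base_plugin_v2 still "finds" it.
            assert context.session is outer_session


    delete_l3_policy(RequestContext())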
18:20:01 <SumitNaiksatam> annak: it should be
18:20:07 <annak> good
18:20:26 <SumitNaiksatam> annak: we also have tests which check that rollback happens
18:21:04 <rkukura> SumitNaiksatam: Could be a problem if old and new patterns cannot be successfully nested, neutron has some flows using each, and GBP needs to wrap both kinds in neutron within the same top-level transaction
18:21:17 <SumitNaiksatam> so basically an exception is raised in the innermost part of the code while creating each resource, and the tests check that all operations are rolled back correctly
18:21:35 <SumitNaiksatam> rkukura: yeah
18:22:18 <SumitNaiksatam> there is one thing though, neutron is incrementally switching to this pattern
18:22:44 <SumitNaiksatam> so we were wondering how things don't break in the interim (like they are breaking for us now when there is a mix of the patterns)
18:23:18 <SumitNaiksatam> and whether the places that have not been migrated are on the path to migration, or in certain cases the old pattern needs to be used
18:23:36 <rkukura> Neutron itself does not use large deeply nested transactions the way we are doing with AIM - remember the transaction guards that prevent nested calls?
18:23:50 <SumitNaiksatam> rkukura: right, i was just typing that
18:24:01 <SumitNaiksatam> so that might explain it
18:25:03 <SumitNaiksatam> i guess we haven't done enough homework on understanding the new pattern
18:25:05 <rkukura> Is it likely that the issue mixing patterns is that different sessions get used?
18:25:26 <SumitNaiksatam> rkukura: seems like it
18:26:12 <SumitNaiksatam> it wasn't easy to instrument the code to check this (with the new pattern)
18:26:38 <SumitNaiksatam> that's what i was trying to figure out before this meeting
18:27:32 <SumitNaiksatam> the second issue was related to the queued notifications
18:27:54 <SumitNaiksatam> annak: i shared the stack trace with you earlier
18:30:48 <SumitNaiksatam> i haven't had a chance to dig too much into it
18:31:21 <SumitNaiksatam> coming from here: File "/opt/stack/new/neutron/neutron/plugins/ml2/ovo_rpc.py", line 61, in _is_session_semantic_violated
18:32:16 <SumitNaiksatam> annak: anything you wanted to add to the above in terms of open issues?
18:32:28 <annak> yes, I saw it too, I wasn't sure if it's an issue or just a warning
18:32:56 <annak> I think the resource mapping failures might be syntax related
18:33:12 <SumitNaiksatam> annak: ah okay good
18:33:59 <SumitNaiksatam> annak: so you are not saying you saw the rpc notification issue with the resource_mapping driver as well, are you?
18:34:05 <annak> and regarding delayed notifications - do you think there's a place for backporting the fixes?
18:34:54 <SumitNaiksatam> annak: sorry didn't understand - "do you think there's a place for backporting the fixes?"
18:35:53 <annak> there are some fixes in the notification area that are not related to ocata
18:36:36 <annak> (these issues happen when extra callbacks are added)
18:36:47 <annak> so I wasn't sure if it was relevant to backport
18:36:55 <rkukura> just for the record, I still believe our approach to delaying notifications is not ideal
18:37:42 <SumitNaiksatam> rkukura: i agree, you have my complete support in changing it :-)
18:38:08 <SumitNaiksatam> annak: if you think the issue exists in newton as well, sure we should backport the fix
18:39:29 <SumitNaiksatam> annak: is this something you outlined in your email, or something else?
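
The rollback tests mentioned above follow the pattern sketched below. This is a self-contained illustration in plain SQLAlchemy against in-memory sqlite (not the actual GBP unit tests): fail the innermost step of a flow that runs inside one enclosing transaction and confirm that every earlier write is undone.

    import sqlalchemy as sa

    engine = sa.create_engine('sqlite://')
    metadata = sa.MetaData()
    resources = sa.Table(
        'resources', metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column('name', sa.String(32)))
    metadata.create_all(engine)


    def create_flow(conn):
        # Two writes followed by a simulated failure in the innermost step.
        conn.execute(resources.insert().values(name='subnetpool'))
        conn.execute(resources.insert().values(name='address_scope'))
        raise RuntimeError('innermost step failed')


    try:
        with engine.begin() as conn:  # one enclosing transaction
            create_flow(conn)
    except RuntimeError:
        pass

    with engine.connect() as conn:
        # Both inserts were rolled back along with the failed transaction.
        assert conn.execute(resources.select()).fetchall() == []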
18:39:50 <annak> yes
18:40:37 <annak> ok, will backport
18:40:37 <rkukura> For annak's benefit, my argument has been that we should not delay the in-process callbacks, which are intended to be run at specific points during request processing (i.e. before, inside, or after transactions), but should instead find a way to delay sending the outgoing messages that they emit. What we've been doing may have worked, but changes in the callback methods could break it at any time.
18:41:22 <SumitNaiksatam> rkukura: we are not delaying the in-process callbacks (or at least we don't intend to)
18:41:41 <annak> rkukura: thanks, Sumit explained to me the motivation and weak points
18:41:52 <annak> it seems indeed fragile
18:42:01 <rkukura> SumitNaiksatam: I must be really confused, but I thought we were delaying certain ones of them
18:42:14 <SumitNaiksatam> not in-process
18:42:44 <rkukura> the callback method itself is in-process, and may send a message via AMQP to other processes
18:42:49 <SumitNaiksatam> we aim to identify all the in-process callbacks, and call them immediately
18:43:13 <rkukura> so my understanding is we've tried to delay only the callbacks that send AMQP messages
18:43:20 <annak> according to the code, if in-process callbacks exist, the out-of-process ones will not be delayed either
18:43:21 <SumitNaiksatam> rkukura: right
18:43:40 <rkukura> but the code that sends the messages may expect to be called at a certain point (i.e. inside or outside the transaction)
18:44:01 <rkukura> sounds like we are all on the same page here
18:44:16 <SumitNaiksatam> rkukura: right, so there are certain exceptions to the out-of-process notifications as well
18:44:35 <SumitNaiksatam> rkukura: there are some out-of-process notifications which we send immediately
18:44:44 <rkukura> SumitNaiksatam: Didn't realize that
18:45:13 <SumitNaiksatam> rkukura: for the reason that you mentioned, but with the downside that if the transaction fails, the notification sent was incorrect
18:45:36 <SumitNaiksatam> rkukura: but in these cases it seems the agent is resilient
18:45:58 <SumitNaiksatam> and we haven't seen issues, but we might not have tested it enough
18:46:00 <rkukura> Just to avoid confusion, I prefer the term "callback" for the in-process method calls (via the registry), and "notification" for the messages sent via AMQP. Some "callbacks" generate "notifications" and others don't.
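
Putting rkukura's callback/notification distinction into code terms, the intended behavior described here could look roughly like the sketch below: in-process callbacks run immediately at their natural point in the request, while AMQP-emitting subscribers are queued and flushed only after the enclosing transaction commits (or discarded on rollback). All names are hypothetical; this is not the actual local_api.py implementation.

    class DeferredNotifier(object):

        def __init__(self, in_process_callbacks, out_of_process_notifiers):
            self._in_process = list(in_process_callbacks)
            self._out_of_process = list(out_of_process_notifiers)
            self._queue = []

        def notify(self, event, **payload):
            # In-process callbacks must see the event right away...
            for callback in self._in_process:
                callback(event, **payload)
            # ...while AMQP-emitting subscribers are only recorded.
            for notifier in self._out_of_process:
                self._queue.append((notifier, event, payload))

        def flush(self):
            # Called once the outer transaction has committed.
            queue, self._queue = self._queue, []
            for notifier, event, payload in queue:
                notifier(event, **payload)

        def discard(self):
            # Called if the transaction rolled back: the queued
            # notifications describe state that never became visible.
            self._queue = []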
18:46:18 <SumitNaiksatam> rkukura: sure, good characterization
18:46:37 <SumitNaiksatam> annak: coming to your point
18:47:40 <SumitNaiksatam> annak: if i recall correctly, we do discriminate based on the called entity in terms of whether we want to queue or not for a particular event
18:48:25 <SumitNaiksatam> it was not intended to process the event just one way or the other (meaning either delay or not)
18:48:58 <annak> yes
18:49:28 <SumitNaiksatam> the in-process entities should get called immediately, and the entities which perform the out-of-process notification should get queued
18:49:34 <SumitNaiksatam> for the same event
18:49:50 <SumitNaiksatam> at least that was the goal
18:50:04 <annak> so i think the behavior is wrong then
18:51:25 <SumitNaiksatam> annak: okay, perhaps you can point to the specifics
18:51:39 <annak> yes, just a sec
18:51:46 <SumitNaiksatam> annak: the other thing was that in your email you mentioned events being eaten
18:51:59 <SumitNaiksatam> annak: so would be curious to know where that happened as well
18:53:41 <annak> https://github.com/openstack/group-based-policy/blob/master/gbpservice/network/neutronv2/local_api.py#L170
18:56:18 <annak> oh sorry I think I was wrong..
18:56:53 <annak> I'll need to revisit the fix
18:56:55 <annak> sorry :)
18:57:43 <annak> I did see the callback disappear, but my fix is not good
18:59:05 <annak> if in_process_callbacks exist, we'll never get to queue anything, right? because callbacks is set to in_process_callbacks
18:59:15 <SumitNaiksatam> rkukura: annak: sorry i lost network connectivity
18:59:27 <annak> so the out_of_process callback is lost
18:59:47 <annak> but I assumed this was the desired behavior, and fixed accordingly
19:00:13 <annak> so I need to re-fix according to what SumitNaiksatam explained
19:00:22 <SumitNaiksatam> annak: i will read the chat transcript
19:00:33 <SumitNaiksatam> and i think you have details in your email as well, which i overlooked
19:00:51 <SumitNaiksatam> we have hit the hour, so will end the meeting now
19:00:54 <SumitNaiksatam> #endmeeting
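
To make annak's closing observation concrete, here is a hedged reconstruction of the suspected behavior around local_api.py#L170 versus the intended behavior SumitNaiksatam described. The function and parameter names are hypothetical; this is illustrative logic, not a quote of the GBP code.

    def dispatch_suspected(all_callbacks, in_process_callbacks, queue, event):
        # Suspected behavior: when any in-process callback exists for the
        # event, the callback list is narrowed to just those, so the
        # out-of-process callbacks are neither invoked nor queued, i.e.
        # they are lost.
        if in_process_callbacks:
            for cb in in_process_callbacks:
                cb(event)
        else:
            for cb in all_callbacks:
                queue.append((cb, event))


    def dispatch_intended(all_callbacks, in_process_callbacks, queue, event):
        # Intended behavior per the discussion: for the same event,
        # in-process callbacks fire immediately and every remaining
        # (out-of-process) callback is queued for delivery after commit.
        for cb in in_process_callbacks:
            cb(event)
        for cb in all_callbacks:
            if cb not in in_process_callbacks:
                queue.append((cb, event))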