#openstack-meeting-3 log

18:02:04 <SumitNaiksatam> #startmeeting networking_policy
18:02:05 <openstack> Meeting started Thu Dec  1 18:02:04 2016 UTC and is due to finish in 60 minutes.  The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:02:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:02:09 <openstack> The meeting name has been set to 'networking_policy'
18:02:55 <SumitNaiksatam> #topic Remove expunge_all calls
18:03:03 <SumitNaiksatam> #link https://review.openstack.org/#/c/399772/
18:03:15 <SumitNaiksatam> as discussed before, tbachman had been working on this
18:03:20 <SumitNaiksatam> thanks tbachman!
18:03:25 <tbachman> SumitNaiksatam: np!
18:03:32 <tbachman> just wish I’d had a bit more time on it
18:03:33 <SumitNaiksatam> we noticed consistent gate failures in the integration tests
18:03:44 <SumitNaiksatam> these were in the nfp-nv job
18:04:03 <SumitNaiksatam> and we have been trying to see if the failures are on account of the changes in this patch or if they are not related
18:04:21 <SumitNaiksatam> we had reached out to osm and his team
18:04:27 <SumitNaiksatam> to get their take on this
18:04:35 <SumitNaiksatam> osm -> songole
18:04:47 <SumitNaiksatam> i just pinged him in case he can join
18:05:20 <SumitNaiksatam> i am just trying to pull the email which had their analysis on this
18:06:25 <SumitNaiksatam> they have pointed to this particular log: #link http://logs.openstack.org/72/399772/2/check/gate-group-based-policy-dsvm-nfp-nv/f3cb8b0/logs/q-svc.txt.gz?level=ERROR
18:07:30 <SumitNaiksatam> the above log is slightly different from the ones that we were observing earlier
18:07:55 <SumitNaiksatam> slightly -> the error is completely different
18:08:02 <SumitNaiksatam> songole: hi, thanks for joining
18:08:13 <songole> hi SumitNaiksatam
18:08:38 <SumitNaiksatam> songole: thanks to you and your team for investigating the issue with the “removal of expunge_all” patch
18:08:49 <SumitNaiksatam> we are just discussing how we should process from where we are at
18:09:38 <SumitNaiksatam> songole: this error which KCS pointed to: #link http://logs.openstack.org/72/399772/2/check/gate-group-based-policy-dsvm-nfp-nv/f3cb8b0/logs/q-svc.txt.gz?level=ERROR
18:09:59 <SumitNaiksatam> are you seeing it only in the “expunge_all” patch or are you seeing it anywhere else?
18:10:06 <SumitNaiksatam> even if sporadic
18:10:41 <songole> we are not seeing it anywhere else
18:10:49 <SumitNaiksatam> songole: ah bummer
18:11:19 <SumitNaiksatam> i think we all agree that we should get rid of the “expunge_all” calls (let me know if anyone disagrees)
18:11:36 * tbachman wonders if ivar_lazarro is around
18:11:38 <SumitNaiksatam> however we have not had a single clean run in the jobs after removing the calls
18:12:09 <SumitNaiksatam> so i would like check in the team if anyone has ideas on how we can make progress here
18:12:26 <tbachman> SumitNaiksatam: it feels to me like there is more than one context (thread kind of context) using the same session
18:12:26 <SumitNaiksatam> at this point its the nfp-nv gate job which is consistently failing
18:12:27 <SumitNaiksatam> not others
18:12:35 <songole> nfp tests are failing. but they are failing in resource-mapping driver. is that right?
18:12:54 <tbachman> which I don’t think is how the sessions are meant to be used
18:13:05 <tbachman> (i.e. they are supposed to be used as a session per thread)
18:13:26 <tbachman> I unfortunately haven’t managed to find the time to look into this more of late
18:13:27 <SumitNaiksatam> songole: yes, but my observation has been that has been more than one error that has been noticed
18:13:42 <SumitNaiksatam> tbachman: thats my understanding too
18:14:05 <SumitNaiksatam> tbachman: since you mention that, there is some new thread creation in the nfp code
18:14:26 <songole> tbachman: nfp uses multiple threads and processes.
18:14:34 <SumitNaiksatam> wonder if that has something to do with this, and was not getting triggered earlier since we were doing this expunge_all
18:14:43 <tbachman> songole: as long as they don’t reuse session objects between threads, they should be okay
18:15:04 <songole> tbachman: I will check how it is being used.
18:15:09 <SumitNaiksatam> tbachman: i dont think that is being intentionally done, but...
18:15:19 <SumitNaiksatam> i mean its difficult to figure out
18:15:27 <tbachman> Just some background on this
18:15:32 <SumitNaiksatam> where a session might be getting leaked in that way
18:15:40 <rkukura> tbachman: passing DB model objects between threads would also be sharing the session they came from, right?
18:15:41 <tbachman> During our integration with a different L3 plugin, we started seeing some failures
18:16:11 <tbachman> rkukura: I’d have to check how SQLAlchemy ties DB objects to sessions
18:16:18 <tbachman> but very possibly yes
18:16:59 <tbachman> continuing the explanation of the failures — we started seeing DetatchedInstanceError exceptions
18:17:19 <SumitNaiksatam> rkukura: hmmm, i think you would have to explicitly get a reference to the session from those objects
18:17:20 <tbachman> due to DB objects used outside of sessions accessing backrefs to other objects in the DB
18:17:26 <SumitNaiksatam> rkukura: and i am not sure that is being done
18:17:44 <tbachman> The weird thing was that we didn’t see these errors before
18:17:48 <SumitNaiksatam> tbachman: right
18:17:49 <tbachman> Implying that something changed
18:17:59 <SumitNaiksatam> tbachman: and it being used in many places in neutron
18:18:00 <tbachman> And the thought is that change was in some transitive dependency
18:18:14 <SumitNaiksatam> and it doesnt seem to fail anywhere else
18:18:38 <tbachman> SumitNaiksatam: ack
18:18:58 <SumitNaiksatam> i believe we are the only ones doing these expunge_all calls, and hence its likely to break that neutron usage pattern even in the future
18:19:03 <tbachman> During the process of debugging this failure, I found that the exceptions went away if I removed the cll to expunge
18:19:10 * tbachman nods
18:19:28 <tbachman> In any case, rkukura brought up the valid question as to why the expunges were needed in the first place
18:19:35 <rkukura> SumitNaiksatam: I think any attempt to update or delete a DB object, or to follow an association, can involve the session it game from
18:20:06 <tbachman> And the expunges have been in there for over a year (I think something like August of 2015?)
18:20:13 <rkukura> s/game/came/
18:20:30 <SumitNaiksatam> rkukura: i am not sure about that (in the way its being used in the places where tbachman describes it)
18:20:30 <tbachman> So, when I got back to this, my plan was to try to understand why we needed the expunges
18:21:42 <tbachman> My hope is that if we can find the reason for the expunges, we may be able to address the original issue (provided it still exists)
18:21:49 <SumitNaiksatam> rkukura: clarification, i dont think the places which tbachman is looking at, does not do an update/delete (tbachman correct me if i am wrong), it merely access a property of that db object
18:21:57 <SumitNaiksatam> *accesses
18:22:00 <tbachman> SumitNaiksatam: ack
18:22:16 <tbachman> though I’d have to check
18:22:23 <SumitNaiksatam> sorry, two negatives in that sentence
18:22:26 <tbachman> this “pattern” is used in that L3 plugin pretty pervasively
18:22:32 <songole> tbachman: we seem to share db session across threads. Are you saying each thread should have its own session?
18:22:32 * tbachman followed what SumitNaiksatam was saying ;)
18:22:40 <tbachman> songole: ack
18:22:43 <SumitNaiksatam> tbachman: :-)
18:22:45 <tbachman> let me pull up the SQLAlchemy ref
18:22:50 <rkukura> We’d also want to check the multithreaded NFP code
18:23:21 <tbachman> #info http://docs.sqlalchemy.org/en/latest/orm/session_basics.html
18:23:23 <tbachman> #link http://docs.sqlalchemy.org/en/latest/orm/session_basics.html
18:23:27 <tbachman> songole: ^^^
18:23:40 <tbachman> look for the section that says “Is the session thread-safe?”
18:23:50 <tbachman> “The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time."
18:24:17 <songole> got it.
18:24:24 <SumitNaiksatam> tbachman: thanks for pulling that up, very helpful!
18:24:33 <tbachman> songole: that said, Im not sure it’s just an issue for NFP :(
18:24:36 <tbachman> SumitNaiksatam: np!
18:25:13 <tbachman> I need to spend some time understanding the overall threading models in GBP
18:25:14 <SumitNaiksatam> i do think how its shared between the thread matters
18:25:27 <SumitNaiksatam> the -> it
18:26:06 <SumitNaiksatam> at the simplest level, if same session is used for servicing two different API calls concurrently, its definitely not a good thing
18:26:10 <SumitNaiksatam> right?
18:26:10 <tbachman> SumitNaiksatam: ack, but we don’t have any way of ensuring this isn’t used incorrectly (other than some form of policy/best practices)
18:26:15 <tbachman> SumitNaiksatam: ack
18:26:23 <tbachman> (or at least according to the text)
18:26:24 <songole> tbachman: now that we got some clues, we will investigate further
18:26:31 <SumitNaiksatam> tbachman: very true, its going to be really difficult to find that out
18:26:33 <tbachman> songole: cool. Thanks!
18:26:38 <SumitNaiksatam> songole: ah cool
18:26:47 <SumitNaiksatam> songole:  thanks for offerint to do that, i was going to ask
18:27:49 <SumitNaiksatam> i do think that if we have to include this change in the upcoming releases we have to be able to get it in fairly soon
18:28:21 <SumitNaiksatam> since, we need to do be able to run through several QA cycles (outside of the upstream gate)
18:28:49 <SumitNaiksatam> so this becomes time sensitive and critical in that sense
18:29:14 <SumitNaiksatam> i am inclined to merge this patch, if we see three gate runs on this patch without any failures
18:29:38 <rkukura> One question on the details of the patch…
18:29:49 <SumitNaiksatam> perhaps that is not going to happen without some more code changes (and we might get a better idea after songole’s investigation)
18:29:55 <SumitNaiksatam> rkukura: sure
18:30:27 <rkukura> It basically turns clean_session into a no-op contextmanaget. Should the patch remove all uses of clean_session, or should that be done later?
18:30:44 <rkukura> s/contextmanaget/contextmanager/
18:30:54 <tbachman> rkukura: that would be more comprehensive, and be best going forward.
18:31:03 <SumitNaiksatam> rkukura: yes
18:31:10 <tbachman> I can rework it for that, but I need some time to get back to it
18:31:14 <songole> could someone help me understand what this expunge is all about.. :)
18:31:18 <SumitNaiksatam> rkukura: it does not necessarily have to be this patch
18:31:36 <tbachman> songole: I can help there
18:31:40 <SumitNaiksatam> rkukura: the bigger change would be more mechanical once we know that removing the calls works
18:31:48 <rkukura> SumitNaiksatam: I agree phasing this might make sense, especially if we are not 100% certain we can live without the expunging
18:31:55 <SumitNaiksatam> rkukura: right
18:32:03 <SumitNaiksatam> otherwise we can put it back
18:32:10 <SumitNaiksatam> just one line :-)
18:32:15 <tbachman> objects that have representation in the DB can have different states
18:32:25 <tbachman> there is a session, which sort of acts like a cache of DB objects
18:32:38 <tbachman> so, if you’ve read a DB object, it keeps track of it in the session
18:33:02 <tbachman> if you do an expunge, you disassociate the object with the session
18:33:32 <tbachman> you can continue to use the object, but if you need any additional DB lookups (e.g. if the object has a backref to another DB object), then you’ll get an exception
18:33:40 <tbachman> b/c the object is no longer associated with a session
18:34:03 <tbachman> the expunge basically removes all DB object state associated with a given session
18:34:07 <tbachman> making it a “clean” session
18:34:13 <tbachman> to get new dB objects and their state
18:34:30 <songole> tbachman: thanks
18:34:36 <tbachman> songole: np!
18:34:43 <SumitNaiksatam> tbachman: yes, very nice summary
18:35:07 <SumitNaiksatam> also to keep in context, we did this since there was a UT which kept failing consistently
18:35:15 <SumitNaiksatam> and no one could figure out how to fix this
18:35:18 <songole> to access object it again, do i need to get a new session?
18:35:26 <SumitNaiksatam> ivar tried this approach and it worked
18:35:30 <rkukura> Once the session is expunged, the DB objects still exist in memory, but trying to use them in ways that involve the DB will cause exceptions, such as ObjectDeletedError: Instance '<ServicePolicyPTGIpAddressMapping at 0x7f5d9e726dd0>' has been deleted, or its row is otherwise not present.
18:35:34 <tbachman> songole: once you’ve expunged it, yes you need a new session
18:35:35 <SumitNaiksatam> its been with us since
18:36:13 <songole> ok. thanks
18:36:46 <SumitNaiksatam> there is also some eventlet related interplay here
18:36:50 <tbachman> here’s the original bug ID:
18:36:53 <tbachman> #link https://bugs.launchpad.net/group-based-policy/+bug/1404412
18:36:53 <openstack> Launchpad bug 1404412 in Group Based Policy "Session's persistent objects across transactions create inconsistency" [High,Fix released] - Assigned to Ivar Lazzaro (mmaleckk)
18:37:07 <tbachman> and here’s hte original patch:
18:37:10 <tbachman> #link https://github.com/openstack/group-based-policy/commit/2c8b401b8e0432bba4f7ae089883b36d760f733c
18:37:37 <SumitNaiksatam> i was pointing thomas (and songole) to a comment in the oslo db code, which recommended putting things always inside a transaction
18:37:39 <SumitNaiksatam> tbachman: thanks
18:38:15 <SumitNaiksatam> i dont think we would have the luxury to make that kind of a change in other projects
18:38:28 <tbachman> Here’s that same one in gerrit
18:38:29 <SumitNaiksatam> but if needed we can do that in our project?
18:38:30 <tbachman> #link https://review.openstack.org/#/c/213362/
18:38:49 <SumitNaiksatam> for instance in the places where its currently failing (the one which rkukura mentioned above)
18:39:35 * tbachman realizes he should put some context in the chat line when he uses a link
18:40:32 <songole> if a different node updates the object, would there be a problem with this caching mechanism?
18:40:56 <rkukura> Actually, the ObjectDeletedError above (from the log earlier this meeting) may not be directly due to the object having been expunged. I’ve seen other errors that clearly do mention the object being expunged, such as when trying to follow an association.
18:40:56 <tbachman> songole: you mean if a different context updates an object read from the DB?
18:41:08 <songole> yes
18:41:38 <tbachman> songole: in the expunged case
18:41:39 <tbachman> ?
18:41:48 <rkukura> In my mind, the only safe pattern is to start a transaction using a session, do everything inside that transaction, and do not access the DB objects after the transaction is committed or rolled back.
18:41:55 <tbachman> rkukura: ack
18:42:20 <songole> ok
18:42:21 <tbachman> it seems like that’s the intent. You *can* use it outside of a transaction, but that seems to open you up to other issues (e.g. like the kind we’re seeing)
18:42:44 <tbachman> that said — there’s *a lot* of code to change in that L3 plugin, if that’s the case :(
18:42:53 <tbachman> not saying that shouldn’t be the solution
18:42:58 <tbachman> just pointing it out ;)
18:43:12 <songole> even nfp as well
18:43:33 <rkukura> If there is a real need to use a DB object after the transaction ends, I think there is a way to disassociate the DB object from the session, making it just lake a new model class instance for which add() was never called.
18:43:36 <tbachman> note that using it inside of a transaction still doesn’t solve a threading issue with sessions
18:43:45 <tbachman> so, it seems like 2 things need fixing
18:45:07 <SumitNaiksatam> tbachman: right
18:45:41 <SumitNaiksatam> rkukura: wrapping in a transaction is the change i was earlier referring, but which do not have the luxury of changing outside our project
18:46:06 <tbachman> <cough, cough> monkey-patch <cough, cough>
18:46:09 <tbachman> what?
18:46:13 <SumitNaiksatam> and like tbachman pointed out, there are several places across neutron and its sub-projects where that pattern is used
18:47:51 <SumitNaiksatam> and breakage in those places is likely to unexpectedly manifest until we have these expunge_all calls
18:48:11 <SumitNaiksatam> songole had earlier run into the same issue with lbaas code
18:48:23 <tbachman> SumitNaiksatam: I think we just need to spend some time first investigating this some more
18:48:29 <tbachman> to get more info about the problem
18:48:42 <tbachman> I hope to be able to do that in the near-ish future
18:48:58 <tbachman> and if songole or others are able to have a look as well, hopefully we can discover the root-cause
18:49:08 <tbachman> (together)
18:49:18 <SumitNaiksatam> tbachman: the predicament is that we are sitting on a time-bomb while we have this expunge-all call in place
18:49:30 <tbachman> SumitNaiksatam: ack
18:49:55 <SumitNaiksatam> and if we agree that is the case, then removing it asap would make sense, so that we can test well after we remove it (repeating myself)
18:50:05 <SumitNaiksatam> and we dont have enought time left for that
18:50:17 <SumitNaiksatam> assuming we dont do this soon
18:51:15 <SumitNaiksatam> tbachman: your plan is reasonable and the only option we have i guess
18:51:38 <SumitNaiksatam> tbachman, songole: in terms of time frame, how should we budget this?
18:51:55 <tbachman> SumitNaiksatam: a week (i.e. next week’s meeting)?
18:52:13 <SumitNaiksatam> tbachman: right, that seems like the earliest
18:52:25 <songole> ack
18:52:47 <songole> looks like I was wrong on the sharing observation.. but, i will check more
18:52:50 <SumitNaiksatam> songole: is it possible to get any additional data points outside of the upstream gate tests by removing the expunge_all calls?
18:53:32 <songole> SumitNaiksatam: we will test the patch locally..
18:53:40 <SumitNaiksatam> songole: okay thanks
18:54:02 <SumitNaiksatam> another way to look at this is assume that the expunge_all call never existed
18:54:25 <SumitNaiksatam> and any error we run into (in our code) without it being present is an issue we need to fix
18:54:59 <SumitNaiksatam> the first thing that can be attempted in such places is to wrap things in a subtranscation (once the appropriate place is identified)
18:55:45 <SumitNaiksatam> if sharing session between threads is an issue, that would be much more difficult to diagnose, but i am hoping thats not the case
18:56:05 <SumitNaiksatam> tbachman: songole: rkukura: thanks for the input and discussion on this
18:56:18 <tbachman> SumitNaiksatam: np!
18:56:33 <SumitNaiksatam> tbachman definitely deserves a big pat of the bag for the several days he has put into this
18:56:40 <SumitNaiksatam> more than a pat on the back
18:56:44 * tbachman likes pat on the bags
18:56:46 <tbachman> :-o
18:57:16 <tbachman> #redact
18:57:20 <SumitNaiksatam> and its kind of a coincidence that songole hit the issue with the lbaas code just about at the same time that tbachman ran into this (a little before)
18:57:54 <SumitNaiksatam> bag -> back (typo, my bad)
18:57:57 <tbachman> lol
18:58:10 <SumitNaiksatam> #topic Open Discussion
18:58:15 <tbachman> SumitNaiksatam: glad to work on this. Just sorry I didn’t resolve it yet
18:58:31 <SumitNaiksatam> tbachman: no need to apologixe :-)
18:58:54 <SumitNaiksatam> we have still not been able to merge igordcard’s patch:
18:59:02 <SumitNaiksatam> #link https://review.openstack.org/#/c/301701
18:59:34 <SumitNaiksatam> its failing the nfp gate but i am not sure its relevant
18:59:46 <SumitNaiksatam> on that branch
18:59:56 <SumitNaiksatam> rkukura: shall we merge this patch now?
19:00:03 <SumitNaiksatam> oh we are out of time
19:00:12 <SumitNaiksatam> will take this off this channel
19:00:16 <SumitNaiksatam> thanks all, bye!
19:00:20 <SumitNaiksatam> #endmeeting