18:02:04 #startmeeting networking_policy 18:02:05 Meeting started Thu Dec 1 18:02:04 2016 UTC and is due to finish in 60 minutes. The chair is SumitNaiksatam. Information about MeetBot at http://wiki.debian.org/MeetBot. 18:02:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. 18:02:09 The meeting name has been set to 'networking_policy' 18:02:55 #topic Remove expunge_all calls 18:03:03 #link https://review.openstack.org/#/c/399772/ 18:03:15 as discussed before, tbachman had been working on this 18:03:20 thanks tbachman! 18:03:25 SumitNaiksatam: np! 18:03:32 just wish I’d had a bit more time on it 18:03:33 we noticed consistent gate failures in the integration tests 18:03:44 these were in the nfp-nv job 18:04:03 and we have been trying to see if the failures are on account of the changes in this patch or if they are not related 18:04:21 we had reached out to osm and his team 18:04:27 to get their take on this 18:04:35 osm -> songole 18:04:47 i just pinged him in case he can join 18:05:20 i am just trying to pull the email which had their analysis on this 18:06:25 they have pointed to this particular log: #link http://logs.openstack.org/72/399772/2/check/gate-group-based-policy-dsvm-nfp-nv/f3cb8b0/logs/q-svc.txt.gz?level=ERROR 18:07:30 the above log is slightly different from the ones that we were observing earlier 18:07:55 slightly -> the error is completely different 18:08:02 songole: hi, thanks for joining 18:08:13 hi SumitNaiksatam 18:08:38 songole: thanks to you and your team for investigating the issue with the “removal of expunge_all” patch 18:08:49 we are just discussing how we should process from where we are at 18:09:38 songole: this error which KCS pointed to: #link http://logs.openstack.org/72/399772/2/check/gate-group-based-policy-dsvm-nfp-nv/f3cb8b0/logs/q-svc.txt.gz?level=ERROR 18:09:59 are you seeing it only in the “expunge_all” patch or are you seeing it anywhere else? 18:10:06 even if sporadic 18:10:41 we are not seeing it anywhere else 18:10:49 songole: ah bummer 18:11:19 i think we all agree that we should get rid of the “expunge_all” calls (let me know if anyone disagrees) 18:11:36 * tbachman wonders if ivar_lazarro is around 18:11:38 however we have not had a single clean run in the jobs after removing the calls 18:12:09 so i would like check in the team if anyone has ideas on how we can make progress here 18:12:26 SumitNaiksatam: it feels to me like there is more than one context (thread kind of context) using the same session 18:12:26 at this point its the nfp-nv gate job which is consistently failing 18:12:27 not others 18:12:35 nfp tests are failing. but they are failing in resource-mapping driver. is that right? 18:12:54 which I don’t think is how the sessions are meant to be used 18:13:05 (i.e. they are supposed to be used as a session per thread) 18:13:26 I unfortunately haven’t managed to find the time to look into this more of late 18:13:27 songole: yes, but my observation has been that has been more than one error that has been noticed 18:13:42 tbachman: thats my understanding too 18:14:05 tbachman: since you mention that, there is some new thread creation in the nfp code 18:14:26 tbachman: nfp uses multiple threads and processes. 18:14:34 wonder if that has something to do with this, and was not getting triggered earlier since we were doing this expunge_all 18:14:43 songole: as long as they don’t reuse session objects between threads, they should be okay 18:15:04 tbachman: I will check how it is being used. 18:15:09 tbachman: i dont think that is being intentionally done, but... 18:15:19 i mean its difficult to figure out 18:15:27 Just some background on this 18:15:32 where a session might be getting leaked in that way 18:15:40 tbachman: passing DB model objects between threads would also be sharing the session they came from, right? 18:15:41 During our integration with a different L3 plugin, we started seeing some failures 18:16:11 rkukura: I’d have to check how SQLAlchemy ties DB objects to sessions 18:16:18 but very possibly yes 18:16:59 continuing the explanation of the failures — we started seeing DetatchedInstanceError exceptions 18:17:19 rkukura: hmmm, i think you would have to explicitly get a reference to the session from those objects 18:17:20 due to DB objects used outside of sessions accessing backrefs to other objects in the DB 18:17:26 rkukura: and i am not sure that is being done 18:17:44 The weird thing was that we didn’t see these errors before 18:17:48 tbachman: right 18:17:49 Implying that something changed 18:17:59 tbachman: and it being used in many places in neutron 18:18:00 And the thought is that change was in some transitive dependency 18:18:14 and it doesnt seem to fail anywhere else 18:18:38 SumitNaiksatam: ack 18:18:58 i believe we are the only ones doing these expunge_all calls, and hence its likely to break that neutron usage pattern even in the future 18:19:03 During the process of debugging this failure, I found that the exceptions went away if I removed the cll to expunge 18:19:10 * tbachman nods 18:19:28 In any case, rkukura brought up the valid question as to why the expunges were needed in the first place 18:19:35 SumitNaiksatam: I think any attempt to update or delete a DB object, or to follow an association, can involve the session it game from 18:20:06 And the expunges have been in there for over a year (I think something like August of 2015?) 18:20:13 s/game/came/ 18:20:30 rkukura: i am not sure about that (in the way its being used in the places where tbachman describes it) 18:20:30 So, when I got back to this, my plan was to try to understand why we needed the expunges 18:21:42 My hope is that if we can find the reason for the expunges, we may be able to address the original issue (provided it still exists) 18:21:49 rkukura: clarification, i dont think the places which tbachman is looking at, does not do an update/delete (tbachman correct me if i am wrong), it merely access a property of that db object 18:21:57 *accesses 18:22:00 SumitNaiksatam: ack 18:22:16 though I’d have to check 18:22:23 sorry, two negatives in that sentence 18:22:26 this “pattern” is used in that L3 plugin pretty pervasively 18:22:32 tbachman: we seem to share db session across threads. Are you saying each thread should have its own session? 18:22:32 * tbachman followed what SumitNaiksatam was saying ;) 18:22:40 songole: ack 18:22:43 tbachman: :-) 18:22:45 let me pull up the SQLAlchemy ref 18:22:50 We’d also want to check the multithreaded NFP code 18:23:21 #info http://docs.sqlalchemy.org/en/latest/orm/session_basics.html 18:23:23 #link http://docs.sqlalchemy.org/en/latest/orm/session_basics.html 18:23:27 songole: ^^^ 18:23:40 look for the section that says “Is the session thread-safe?” 18:23:50 “The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time." 18:24:17 got it. 18:24:24 tbachman: thanks for pulling that up, very helpful! 18:24:33 songole: that said, Im not sure it’s just an issue for NFP :( 18:24:36 SumitNaiksatam: np! 18:25:13 I need to spend some time understanding the overall threading models in GBP 18:25:14 i do think how its shared between the thread matters 18:25:27 the -> it 18:26:06 at the simplest level, if same session is used for servicing two different API calls concurrently, its definitely not a good thing 18:26:10 right? 18:26:10 SumitNaiksatam: ack, but we don’t have any way of ensuring this isn’t used incorrectly (other than some form of policy/best practices) 18:26:15 SumitNaiksatam: ack 18:26:23 (or at least according to the text) 18:26:24 tbachman: now that we got some clues, we will investigate further 18:26:31 tbachman: very true, its going to be really difficult to find that out 18:26:33 songole: cool. Thanks! 18:26:38 songole: ah cool 18:26:47 songole: thanks for offerint to do that, i was going to ask 18:27:49 i do think that if we have to include this change in the upcoming releases we have to be able to get it in fairly soon 18:28:21 since, we need to do be able to run through several QA cycles (outside of the upstream gate) 18:28:49 so this becomes time sensitive and critical in that sense 18:29:14 i am inclined to merge this patch, if we see three gate runs on this patch without any failures 18:29:38 One question on the details of the patch… 18:29:49 perhaps that is not going to happen without some more code changes (and we might get a better idea after songole’s investigation) 18:29:55 rkukura: sure 18:30:27 It basically turns clean_session into a no-op contextmanaget. Should the patch remove all uses of clean_session, or should that be done later? 18:30:44 s/contextmanaget/contextmanager/ 18:30:54 rkukura: that would be more comprehensive, and be best going forward. 18:31:03 rkukura: yes 18:31:10 I can rework it for that, but I need some time to get back to it 18:31:14 could someone help me understand what this expunge is all about.. :) 18:31:18 rkukura: it does not necessarily have to be this patch 18:31:36 songole: I can help there 18:31:40 rkukura: the bigger change would be more mechanical once we know that removing the calls works 18:31:48 SumitNaiksatam: I agree phasing this might make sense, especially if we are not 100% certain we can live without the expunging 18:31:55 rkukura: right 18:32:03 otherwise we can put it back 18:32:10 just one line :-) 18:32:15 objects that have representation in the DB can have different states 18:32:25 there is a session, which sort of acts like a cache of DB objects 18:32:38 so, if you’ve read a DB object, it keeps track of it in the session 18:33:02 if you do an expunge, you disassociate the object with the session 18:33:32 you can continue to use the object, but if you need any additional DB lookups (e.g. if the object has a backref to another DB object), then you’ll get an exception 18:33:40 b/c the object is no longer associated with a session 18:34:03 the expunge basically removes all DB object state associated with a given session 18:34:07 making it a “clean” session 18:34:13 to get new dB objects and their state 18:34:30 tbachman: thanks 18:34:36 songole: np! 18:34:43 tbachman: yes, very nice summary 18:35:07 also to keep in context, we did this since there was a UT which kept failing consistently 18:35:15 and no one could figure out how to fix this 18:35:18 to access object it again, do i need to get a new session? 18:35:26 ivar tried this approach and it worked 18:35:30 Once the session is expunged, the DB objects still exist in memory, but trying to use them in ways that involve the DB will cause exceptions, such as ObjectDeletedError: Instance '' has been deleted, or its row is otherwise not present. 18:35:34 songole: once you’ve expunged it, yes you need a new session 18:35:35 its been with us since 18:36:13 ok. thanks 18:36:46 there is also some eventlet related interplay here 18:36:50 here’s the original bug ID: 18:36:53 #link https://bugs.launchpad.net/group-based-policy/+bug/1404412 18:36:53 Launchpad bug 1404412 in Group Based Policy "Session's persistent objects across transactions create inconsistency" [High,Fix released] - Assigned to Ivar Lazzaro (mmaleckk) 18:37:07 and here’s hte original patch: 18:37:10 #link https://github.com/openstack/group-based-policy/commit/2c8b401b8e0432bba4f7ae089883b36d760f733c 18:37:37 i was pointing thomas (and songole) to a comment in the oslo db code, which recommended putting things always inside a transaction 18:37:39 tbachman: thanks 18:38:15 i dont think we would have the luxury to make that kind of a change in other projects 18:38:28 Here’s that same one in gerrit 18:38:29 but if needed we can do that in our project? 18:38:30 #link https://review.openstack.org/#/c/213362/ 18:38:49 for instance in the places where its currently failing (the one which rkukura mentioned above) 18:39:35 * tbachman realizes he should put some context in the chat line when he uses a link 18:40:32 if a different node updates the object, would there be a problem with this caching mechanism? 18:40:56 Actually, the ObjectDeletedError above (from the log earlier this meeting) may not be directly due to the object having been expunged. I’ve seen other errors that clearly do mention the object being expunged, such as when trying to follow an association. 18:40:56 songole: you mean if a different context updates an object read from the DB? 18:41:08 yes 18:41:38 songole: in the expunged case 18:41:39 ? 18:41:48 In my mind, the only safe pattern is to start a transaction using a session, do everything inside that transaction, and do not access the DB objects after the transaction is committed or rolled back. 18:41:55 rkukura: ack 18:42:20 ok 18:42:21 it seems like that’s the intent. You *can* use it outside of a transaction, but that seems to open you up to other issues (e.g. like the kind we’re seeing) 18:42:44 that said — there’s *a lot* of code to change in that L3 plugin, if that’s the case :( 18:42:53 not saying that shouldn’t be the solution 18:42:58 just pointing it out ;) 18:43:12 even nfp as well 18:43:33 If there is a real need to use a DB object after the transaction ends, I think there is a way to disassociate the DB object from the session, making it just lake a new model class instance for which add() was never called. 18:43:36 note that using it inside of a transaction still doesn’t solve a threading issue with sessions 18:43:45 so, it seems like 2 things need fixing 18:45:07 tbachman: right 18:45:41 rkukura: wrapping in a transaction is the change i was earlier referring, but which do not have the luxury of changing outside our project 18:46:06 monkey-patch 18:46:09 what? 18:46:13 and like tbachman pointed out, there are several places across neutron and its sub-projects where that pattern is used 18:47:51 and breakage in those places is likely to unexpectedly manifest until we have these expunge_all calls 18:48:11 songole had earlier run into the same issue with lbaas code 18:48:23 SumitNaiksatam: I think we just need to spend some time first investigating this some more 18:48:29 to get more info about the problem 18:48:42 I hope to be able to do that in the near-ish future 18:48:58 and if songole or others are able to have a look as well, hopefully we can discover the root-cause 18:49:08 (together) 18:49:18 tbachman: the predicament is that we are sitting on a time-bomb while we have this expunge-all call in place 18:49:30 SumitNaiksatam: ack 18:49:55 and if we agree that is the case, then removing it asap would make sense, so that we can test well after we remove it (repeating myself) 18:50:05 and we dont have enought time left for that 18:50:17 assuming we dont do this soon 18:51:15 tbachman: your plan is reasonable and the only option we have i guess 18:51:38 tbachman, songole: in terms of time frame, how should we budget this? 18:51:55 SumitNaiksatam: a week (i.e. next week’s meeting)? 18:52:13 tbachman: right, that seems like the earliest 18:52:25 ack 18:52:47 looks like I was wrong on the sharing observation.. but, i will check more 18:52:50 songole: is it possible to get any additional data points outside of the upstream gate tests by removing the expunge_all calls? 18:53:32 SumitNaiksatam: we will test the patch locally.. 18:53:40 songole: okay thanks 18:54:02 another way to look at this is assume that the expunge_all call never existed 18:54:25 and any error we run into (in our code) without it being present is an issue we need to fix 18:54:59 the first thing that can be attempted in such places is to wrap things in a subtranscation (once the appropriate place is identified) 18:55:45 if sharing session between threads is an issue, that would be much more difficult to diagnose, but i am hoping thats not the case 18:56:05 tbachman: songole: rkukura: thanks for the input and discussion on this 18:56:18 SumitNaiksatam: np! 18:56:33 tbachman definitely deserves a big pat of the bag for the several days he has put into this 18:56:40 more than a pat on the back 18:56:44 * tbachman likes pat on the bags 18:56:46 :-o 18:57:16 #redact 18:57:20 and its kind of a coincidence that songole hit the issue with the lbaas code just about at the same time that tbachman ran into this (a little before) 18:57:54 bag -> back (typo, my bad) 18:57:57 lol 18:58:10 #topic Open Discussion 18:58:15 SumitNaiksatam: glad to work on this. Just sorry I didn’t resolve it yet 18:58:31 tbachman: no need to apologixe :-) 18:58:54 we have still not been able to merge igordcard’s patch: 18:59:02 #link https://review.openstack.org/#/c/301701 18:59:34 its failing the nfp gate but i am not sure its relevant 18:59:46 on that branch 18:59:56 rkukura: shall we merge this patch now? 19:00:03 oh we are out of time 19:00:12 will take this off this channel 19:00:16 thanks all, bye! 19:00:20 #endmeeting