Monday, 2018-11-05

*** Nel1x has quit IRC04:58
*** e0ne has joined #openstack-placement06:17
*** tetsuro has joined #openstack-placement06:18
*** e0ne has quit IRC06:36
*** rubasov has joined #openstack-placement07:45
*** tssurya has joined #openstack-placement08:20
*** helenafm has joined #openstack-placement08:31
*** bauwser is now known as bauzas08:56
*** e0ne has joined #openstack-placement09:10
*** ttsiouts has joined #openstack-placement09:20
*** ttsiouts has quit IRC09:36
*** ttsiouts has joined #openstack-placement09:38
*** takashin has joined #openstack-placement10:22
*** cdent has joined #openstack-placement10:22
*** sean-k-mooney has joined #openstack-placement11:27
*** tetsuro has quit IRC12:20
*** jroll has quit IRC12:32
*** jroll has joined #openstack-placement12:34
*** e0ne_ has joined #openstack-placement12:37
*** e0ne has quit IRC12:40
*** ttsiouts has quit IRC13:05
openstackgerritChris Dent proposed openstack/placement master: Added alembic environment  https://review.openstack.org/61435013:11
openstackgerritChris Dent proposed openstack/placement master: Delete the old migrations  https://review.openstack.org/61144013:12
openstackgerritChris Dent proposed openstack/placement master: Add a document for creating DB revisions  https://review.openstack.org/61402413:12
openstackgerritChris Dent proposed openstack/placement master: Add a placement-manage CLI  https://review.openstack.org/60016113:12
openstackgerritChris Dent proposed openstack/placement master: Remove sqlalchemy-migrate from requirements.txt  https://review.openstack.org/61453913:12
openstackgerritChris Dent proposed openstack/placement master: WIP - Show an alembic migration  https://review.openstack.org/61402513:15
cdentedleafe: I redid the alembic tests ^ based on my experiments over the weekend and some conversation with zzzeek.13:16
cdentIt may finally work reliably13:16
cdentI hope13:16
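A minimal sketch of what "run the alembic migrations in a test" can look like; the script_location and helper names below are illustrative placeholders, not the actual layout of the placement repo or its test fixtures:

    # Hypothetical test helper -- not the real placement test fixture.
    from alembic import command
    from alembic.config import Config

    def upgrade_to_head(db_url,
                        script_location='placement/db/sqlalchemy/alembic'):
        """Apply every alembic revision to the database at db_url."""
        cfg = Config()  # build the config in memory; no alembic.ini needed
        cfg.set_main_option('script_location', script_location)
        cfg.set_main_option('sqlalchemy.url', db_url)
        command.upgrade(cfg, 'head')

    def test_walk_versions():
        # An in-memory SQLite URL keeps the sketch self-contained.
        upgrade_to_head('sqlite://')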
*** tetsuro has joined #openstack-placement13:19
* cdent waves at tetsuro 13:32
* tetsuro waves back to cdent13:42
*** ttsiouts has joined #openstack-placement13:43
*** leakypipes is now known as jaypipes13:53
cdentfried_rice: you about?14:00
*** mriedem has joined #openstack-placement14:11
*** fried_rice is now known as efried14:25
efriedcdent: DST fail, thanks for covering.14:26
cdentno prob14:26
efriedDST is so stupid.14:26
efriedThere, I might limit my semi-yearly rant to that.14:27
cdentI agree with you14:29
*** SteelyDan is now known as dansmith14:37
openstackgerritChris Dent proposed openstack/placement master: Added alembic environment  https://review.openstack.org/61435014:37
openstackgerritChris Dent proposed openstack/placement master: Delete the old migrations  https://review.openstack.org/61144014:37
openstackgerritChris Dent proposed openstack/placement master: Add a document for creating DB revisions  https://review.openstack.org/61402414:37
openstackgerritChris Dent proposed openstack/placement master: WIP - Show an alembic migration  https://review.openstack.org/61402514:37
openstackgerritChris Dent proposed openstack/placement master: Remove sqlalchemy-migrate from requirements.txt  https://review.openstack.org/61453914:37
openstackgerritChris Dent proposed openstack/placement master: Add a placement-manage CLI  https://review.openstack.org/60016114:38
cdentedleafe: 97th time is charm ^ ?14:38
cdentoff to appt, biab14:38
openstackgerritJohn Garbutt proposed openstack/nova-specs master: Add Unified Limits Spec  https://review.openstack.org/60220114:46
*** tetsuro has quit IRC14:55
*** takashin has left #openstack-placement15:04
openstackgerritJohn Garbutt proposed openstack/nova-specs master: Add Unified Limits Spec  https://review.openstack.org/60220115:13
*** ttsiouts has quit IRC15:22
*** ttsiouts has joined #openstack-placement15:25
*** lyarpwood is now known as lyarwood15:26
*** ttsiouts has quit IRC15:33
*** ttsiouts has joined #openstack-placement15:54
*** ttsiouts has quit IRC16:14
*** ttsiouts has joined #openstack-placement16:14
*** ttsiouts has quit IRC16:19
*** e0ne_ has quit IRC16:36
*** ttsiouts has joined #openstack-placement16:47
*** openstackgerrit has quit IRC16:48
*** helenafm has quit IRC17:01
*** openstackgerrit has joined #openstack-placement17:36
openstackgerritChris Dent proposed openstack/placement master: Added alembic environment  https://review.openstack.org/61435017:36
openstackgerritChris Dent proposed openstack/placement master: Delete the old migrations  https://review.openstack.org/61144017:36
openstackgerritChris Dent proposed openstack/placement master: Add a document for creating DB revisions  https://review.openstack.org/61402417:36
openstackgerritChris Dent proposed openstack/placement master: WIP - Show an alembic migration  https://review.openstack.org/61402517:36
openstackgerritChris Dent proposed openstack/placement master: Remove sqlalchemy-migrate from requirements.txt  https://review.openstack.org/61453917:37
openstackgerritChris Dent proposed openstack/placement master: Add a placement-manage CLI  https://review.openstack.org/60016117:37
*** ttsiouts has quit IRC17:50
*** ttsiouts has joined #openstack-placement17:51
*** ttsiouts has quit IRC17:55
*** e0ne has joined #openstack-placement17:57
cdentefried: are you caught up on the refresh email thread?18:26
efriedcdent: Yes.18:26
cdentdoes silence in this case mean an implicit "yeah, there's no crazy in there"?18:27
cdentor "I'm coming back later with the crazy hammer"?18:27
efriedcdent: I'm working on the code atm, but when I respond, I will gently quibble with your (non-)definition of "cache". Otherwise, there was nothing I felt the need to jump all over immediately, no.18:27
cdentrad. I was hoping to tickle your quibble on that18:27
efriedto me, it's a cache by definition because we're saving information so we don't have to call back to the source every time we want to use the information.18:28
cdentthere had been a longer version of a response in my head which distinguished what you just said from a placement-side cache (using memcached or similar) that had namespaced keys with explicit invalidation routines18:29
cdentbut I figured I would save that for another time as it would be premature18:29
cdentI don't think of it as a cache because there are only two things involved: 'my information' and 'your information' and no 'this third thing I check'18:30
cdentbut it is quibbling indeed18:30
cdentI was merely trying to establish my confusion over how it makes it seem more complicated than it is: as if there is an additional thing to manage18:31
efriedcdent: Maybe I'm just too uneducated, using a rudimentary collegiate definition of "cache".18:32
cdent"a collection of items of the same type stored in a hidden or inaccessible place."18:32
cdent:D18:32
efried"hidden or inaccessible"? This came from where?18:33
cdenthttps://www.google.co.uk/search?q=define%3Acache&oq=define%3Acache&aqs=chrome..69i57j69i58.1643j0j7&sourceid=chrome&ie=UTF-818:33
cdentand the french is 'cacher': 'to hide'18:33
cdentAfter all the holes I've been in with oslo_db and sqlalchemy today, I'm not sure I want to fall in this one18:34
efriedoh, I misread. It's not that the collection of items is the same type as what's stored in the hidden/inaccessible place (i.e. the backing store). It's the collection of items that's stored in a hidden or inaccessible place.18:34
efriedThis is a generic definition of cache like pirate treasure, not the computing term.18:34
efriedhttps://en.wikipedia.org/wiki/Cache_(computing)18:35
efriedFirst paragraph describes the provider tree cache perfectly.18:35
cdenthuh, not to me it doesn't18:35
cdentwe're not storing it for future requests18:36
cdentwe're storing it so we have it at all18:36
efriedum18:36
cdent"so that future requests for that data can be served faster"18:36
efriedwe're storing it for every "request" we make to update_provider_tree to say, "has this changed?"18:36
cdentwe get resources from the server, then we use those resources18:36
cdentit's just a data structure18:37
cdentbut please, let's just agree to disagree, the way things are going in the code is right, even if we disagree on the terms18:37
efrieda data structure that exists so we don't have to go to the backing store every time we need the data.18:37
efriedokay, but I'm going to be naming things in that code18:37
cdentin that sentence what is the "backing store"?18:37
efriedand the names are going to include "cache" in them.18:38
cdentis suspect the question I just asked is the root of the whole thing18:38
cdents/is/I/18:38
cdentif you're thinking the placement service is the backing store, that's probably where we're having an impedance problem18:39
efriedis a [...] software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be [...] a copy of data stored elsewhere [<== backing store]. [...] Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store [<== backing store]; thus, the more requests that can be served from the cache, the fast18:39
efriedyes, "placement" is the backing store. Totally.18:39
cdentno, it's not18:39
efried"the placement database", if you like.18:40
efriedthe service is just the means to access that database.18:40
cdentit's the HTTP service of resources18:40
cdentno18:40
cdentif you're writing code on the nova side thinking that placement is a front to a database you're going to have all sorts of difficulties getting it to work as well as it should18:40
efried"a database" or just "data"18:41
cdenteven that18:41
efriedI don't get why that's a bad thing.18:41
cdentthe idea of a resource-oriented HTTP API is that when you do a GET you have the state and it is _the_ state (from your point of view) until you try to write it back18:41
cdentthe outcome of the discussion on that thread is the expression of that ^18:42
cdentthe reason for having generations is to achieve that ^^18:43
efriedThat sentence is a perfect justification for the existence and form of the provider tree cache.18:43
efriedIt reflects what I GET'd and it's *the* state until I try to write it back (as the result of changing something via upt)18:44
cdents/ cache//18:44
efriedbut it's a cache because I don't go re-GET it every time I ask upt whether something has changed.18:44
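A rough sketch of the generation-guarded write-back being described here; the payload shapes are simplified from the placement API, and the bare-requests session handling is a stand-in (real nova goes through keystoneauth with versioned headers):

    import requests

    def put_vcpu_inventory(base_url, rp_uuid, total, session=None):
        """GET the provider, then write inventory back under its generation."""
        sess = session or requests.Session()
        rp = sess.get(f'{base_url}/resource_providers/{rp_uuid}').json()
        body = {
            # Sending the generation we read is our claim that nothing has
            # changed server-side since our GET.
            'resource_provider_generation': rp['generation'],
            'inventories': {'VCPU': {'total': total}},
        }
        resp = sess.put(
            f'{base_url}/resource_providers/{rp_uuid}/inventories', json=body)
        if resp.status_code == 409:
            # Someone else wrote first: our local state is no longer _the_
            # state, so the caller must re-GET before retrying.
            return False
        resp.raise_for_status()
        return True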
* cdent shrugs18:44
efriedIt seems like we're agreeing on the technology and just not agreeing on the use of the word 'cache'.18:45
cdentsomewhat18:45
cdentbut I think (although I'm not certain) that thinking of it as a cache is what got us to where we are now (prior to your illustrious fix), and if that's the case, then there's something that could be clarified, I'm not sure what.18:46
efriedOkay, so when I write this patch that clears the ProviderTree object as well as the association_refresh_time when you send SIGHUP to the compute process, I had planned to call it18:46
efriedSchedulerReportClient.clear_provider_cache()18:46
efriedWhat would you propose instead?18:46
cdentI've always felt like there's some kind of, but not sure what, disconnect between how ProviderTree is/was implemented and placement itself. And my continued discussion on these lines is an effort to make sure we make a thing that's happy and also simple.18:47
efriedIMO the problem here is not the existence of ProviderTree, but the fact that we are trying to refresh it every periodic.18:47
cdentI'd probably do something like clear_provider()18:48
efriedI'm not clearing any providers though :P18:48
cdentI'm not against the ProviderTree. I'm against that it is a layered data object that does things to itself rather than just being data.18:48
efriedIf we eliminated the ProviderTree entirely, we would be doing basically the same thing as if that conf option was set to the same interval as the periodic. I.e. going to placement to populate *some* data structure (a big bag o dicts?) before every call to update_provider_tree18:48
efriedokay18:49
efriedwhat do you mean, does things to itself?18:49
efriedBy just being data, you still have to be able to add stuff to it, etc.18:49
efriedCRUD ops on that data18:49
* cdent nods18:49
efriedThe methods on ProviderTree just give those operations sane names that reflect some meaning of the data being changed.18:50
efriedrather than relying on the caller to understand under what circumstances it's okay to say bagodicts[attr] = value18:50
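Not nova's actual ProviderTree, just a toy illustrating the point about named operations versus poking values straight into a bag of dicts:

    class TinyProviderTree:
        """A plain data structure whose mutations have meaningful names."""

        def __init__(self):
            self._providers = {}  # uuid -> provider record

        def new_root(self, name, uuid, generation=None):
            self._providers[uuid] = {
                'name': name, 'generation': generation,
                'inventory': {}, 'traits': set(), 'aggregates': set(),
            }

        def update_inventory(self, uuid, inventory):
            # The caller says *what* it is changing; it never needs to know
            # that this is bagodicts[uuid]['inventory'] = ... underneath.
            self._providers[uuid]['inventory'] = dict(inventory)

        def has_inventory_changed(self, uuid, inventory):
            return self._providers[uuid]['inventory'] != dict(inventory)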
cdentbut you assert that when you clear the ProviderTree you are not clearing providers? Are you not clearing your local representation of those providers? You are resetting your local state, yes?18:51
efriedsubtle difference between clearing the cache and "resetting local state".18:51
efriedby clearing the cache I'm saying, "next time I need this information, go get it from the source"18:51
efriedresetting local state would be more like, "I think it should look like *this*"18:52
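A sketch of the behaviour efried is proposing, using the names from this discussion (clear_provider_cache, _provider_tree, _association_refresh_time); the direct signal wiring is a simplification, since the real compute process handles SIGHUP through its service machinery:

    import signal

    class SchedulerReportClient:
        def __init__(self):
            self._provider_tree = None            # local view of our providers
            self._association_refresh_time = {}   # rp_uuid -> last refresh time

        def clear_provider_cache(self):
            # "Next time I need this information, go get it from the source."
            self._provider_tree = None
            self._association_refresh_time = {}

    def install_sighup_handler(report_client):
        signal.signal(
            signal.SIGHUP,
            lambda signum, frame: report_client.clear_provider_cache())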
cdentand the plan is to reset/clear on sighup and whenever there is a 409 when trying to write, is that right?18:53
efriedWhich I don't think we want to do, because then we're guaranteed that the next push to placement will bounce 409 (because we would have no idea what the generation should be) and we would have to re-GET...18:53
efriedno, definitely not.18:53
efriedWherever feasible, 409 should be "re-GET the offending object, redrive my edits, redo the call"18:54
cdentI think that's what both jay and I implied (with maybe some difference in degree) in the email thread18:54
efriedAh, I think I get where that's coming from.18:55
efriedWhen update_from_provider_tree encounters an error - any error - we have to go back to update_provider_tree with a freshly-acquired representation of what placement thinks the data looks like.18:55
*** tssurya has quit IRC18:56
efriedIn a very abstract theoretical world, we could just take whatever upt gave us and blat it down. But once you start picking that apart, it quickly breaks down.18:57
jaypipescdent: +118:58
efriedby design, upt needs to be empowered to operate on the tree as it really exists (in placement). If all we ever did was ask the driver what the tree should look like out of nowhere, we're not better off than we were with get_available_resource/get_inventory.18:58
cdentefried: i don't think anyone is disputing that part of things18:59
efriedOkay, then we need to give upt an accurate representation.18:59
cdentI really do think it is important that we restate (in English not Python) what it is we're expecting the RT to do18:59
cdentyes, that accurate representation is something you GET when things go wrong, but _only_ when things go wrong19:00
efried"allow the virt driver, via update_provider_tree, to edit the providers and associated artifacts belonging to this compute host"19:00
efried"If something goes wrong with flushing that edit back to placement, the RT needs to give the virt driver another chance."19:01
cdentjaypipes: I'm unclear which part you were +1 on, but thanks?19:03
efried"Due to the chunky nature of the objects/APIs - the fact that we don't have an atomic API allowing us to flush changes to multiple providers' inventories, aggregates, and traits, as well as adding/removing providers - if something goes wrong partway through the flushing, we need to refresh our view of *at least* the object on which an operation failed."19:03
cdentefried: so is the "no, definitely not" that you don't want to re-GET the entire tree?19:04
jaypipescdent: "I think that's what both jay and I implied"19:04
cdent(thanks)19:04
efriedcdent: It was "No, we definitely do not want, as a rule, to reset/clear any time there is a 409"19:04
efriedBut ufpt is a special case of ^ that.19:04
efriedwhere we *do* clear on 409 (and any other failures)19:04
efriedfor reasons I'm trying to explain...19:05
efriedAm I making sense?19:05
cdentsomewhat, but I'm not following why this isn't okay (and I'll use 'cache' here to try to keep in the same tune):19:05
cdenta) I start up, get from placement what it knows about "me" and put it in a ProviderTree19:06
cdentb) that tree, which may currently be nothing, is then shown to the virt driver who says "I've got some different stuff"19:07
cdentc) we write that different stuff to placement19:07
cdentd) we _cache_ that stuff in the ProviderTree19:07
cdente) at some point in the future we ask the virt driver if anything has changed19:08
cdentf) we compare that with the cache, sync back to placement19:08
efriedf) ...sync *iff we think anything has changed*19:09
efried(so far so good)19:09
cdentg) if there are any errors at all we flush the entire provider tree, get it again from placement, make what we got the new "cache"19:09
efried(signal when done)19:09
cdenth) either ask the virt driver to try again, or use the info we already had from it to do a sync back19:10
cdenti) if the write is successful we have a happy "cache" and we wait until the next time round the loop19:10
cdentj) sit back and enjoy that the code has a simple (albeit expensive, but not as expensive as what we have now) failure mode where the same thing always happens, but happens rarely because changes aren't all that common19:11
cdentEOF19:11
efriedTotally on board with all of that.19:11
efriedExactly where I'm trying to get to.19:12
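A condensed sketch of the a) through j) flow above, including the clear-everything-on-any-error behaviour from step g); update_provider_tree and update_from_provider_tree are the hooks discussed in this exchange, while the surrounding names and signatures are stand-ins:

    def periodic_sync(rt, virt_driver, compute_uuid):
        # a) Populate the local view from placement if we don't have one yet.
        if rt.tree is None:
            rt.tree = rt.client.get_provider_tree(compute_uuid)

        # b/e) Ask the virt driver what the tree should look like now.
        virt_driver.update_provider_tree(rt.tree, compute_uuid)

        try:
            # c/f) Push back only the bits that differ from what placement
            # last told us.
            rt.client.update_from_provider_tree(rt.tree)
        except Exception:
            # g) On *any* failure, throw the whole local view away; the next
            # run of this periodic re-GETs from placement and retries (h/i/j).
            rt.tree = None
            raise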
cdentbut that does have the reset/clear, so I'm confused: "No, we definitely do not want, as a rule, to reset/clear any time there is a 409""19:13
efriedThe major thing that's different from that today is that there's a separate, parallel code path that does:19:13
efrieda) If a timer has expired, get everything whether we think it's stale or not19:13
efriedEOF19:13
cdentyeah, it is great that a is going away19:13
efriedcdent: Right "as a rule". But in the next line I said that ufpt is an exception to that.19:13
cdentwhat's in the RT is out of ufpt where this comes in to play? (assuming a virt driver that supports it)19:14
efriedBecause of the contract we have with upt, we need to react to any error, including "we tried to update something but what we started with was stale, our bad but still an error condition", by purging and refreshing our view.19:14
efriedsorry, rephrase please?19:15
efriedufpt essentially pushes the same bits to ProviderTree and placement.19:15
efriedthereby making sure they're in sync.19:15
efriedexcept that it only pushes bits if bits have changed.19:15
efriedto avoid making unnecessary calls.19:16
cdentone sec19:16
efriedbecause we <3 CERN19:16
cdentSorry, I'm still confused. Point me to some code where we're going to send something in a ProviderTree to placement and we don't want to "clear the cache"?19:19
efriedcdent: "clear" vs "update".19:20
cdentpardon?19:20
efriedSo like, we GET the compute RP, and it says CPU: {'total': 4}. We stuff that in the reportclient._provider_tree.19:20
cdentI left out ...cache" on a 409?19:21
* cdent nods19:21
efriedah, okay.19:21
efriedabort current explanation.19:21
efriedWhat you're essentially asking is where, outside of ufpt, would we be pushing updates to placement.19:22
efriedand I assume we're still not talking about allocations. Because there.19:22
cdentyes, allocations are off the board at the moment19:22
efriedThe only example I can think of offhand is when the API process is syncing host AZs to placement aggregates.19:23
efriedOther than that, it may be true that all other non-allocation updates are being funneled through upt/ufpt.19:23
cdentI'm talking about _in the resource tracker/compute manager_.19:23
efriedokay, so ---^19:23
efriedI'm just sayin, in theory, one would mostly *like* to respond to a 409 by just re-GETting the thing that failed, redoing the edit to it, and re-pushing it.19:24
efriedwhich is actually what the existing incarnation of ufpt is doing.19:24
cdentso you are, then, okay with dumping the "cache" every time there is a 409 (and any other error) in ufpt, because it provides a small surface area of handling, yes?19:25
efriedby "clearing the cache" for *just* that object, and letting the next periodic do the re-GET and re-edit.19:25
efriedyes19:25
cdentgreat. glad to hear it :()19:25
cdentthat's some kind of open mouthed kiss apparently19:25
efriedbecause what ufpt is trying to do now (just redoing the failed object) is overly complicated and, as an optimization, probably not worth the aggravation.19:26
cdentyes19:26
efriedGetting rid of that, and clearing the whole cache, will allow me to get rid of a TODO and reinstate another optimization19:26
cdentespecially when we have turned off the periodic refresh, so much win19:26
efriedI think that bit is in the current incarnation, looking...19:26
efriedhttps://review.openstack.org/#/c/614886/2/nova/scheduler/client/report.py@a65219:27
cdentis that possibly causing wrongness if the list of children has changed? I would think that's an optimisation we don't want if that's the case19:31
cdenti.e. the cyborg and neutron cases?19:32
* cdent is not really sure about that19:32
cdentI'm getting pretty exhausted at this stage, this is a huge context from the fight I had with https://review.openstack.org/#/c/614350/ over the weekend19:32
cdentsigh19:33
cdentcontext *switch*19:34
efriedcdent: (was getting lunch) Importantly, we decided we don't need to deal with children created outside of the compute service.19:41
efriedIf the list of children has changed because upt told us so, we'll pick it up. If it has changed for any other reason, we don't care and don't need to know.19:42
cdentI understand that people discussed that, but I don't understand how one can think one has a ProviderTree if you don't have all the children, and if that block of code is the one optimization that that allows _and_ we expect the cache to not get wrong often, why bother?19:43
cdentor in other words: that optimization looks like a bug waiting to happen19:43
efriedcdent: Because it's the only way we avoid polling the backing store every periodic to make sure our view isn't stale. Lacking some kind of async notifier, anyway.19:45
efriedI intend to leave some NOTEs in there warning that this is the case.19:46
efriedBut the docstring for upt already warns implementors not to muck with providers it finds in there that it doesn't "own".19:46
cdentor cascading generations. Can we imagine there ever being a case where an alien provider is not a leaf?19:46
efriedWhich is tantamount to the same thing, it's just now you won't find any.19:46
efriedwell, or an intermediary node, none of whose descendants belong to the compute service.19:47
efriedI'm not sure. I would hope so, but I'm not sure.19:47
cdentthat's something we should keep in mind, I guess19:48
efriedYup. I'm gonna call YAGNI on this, though. Until we have a situation where the compute service needs to be aware of externally-created tree nodes, let's not solve for it. I don't think we've painted ourselves into a corner by doing this - cause we can always reinstate the ?in_tree= poll.19:49
cdentIf you're gonna call Y19:50
efried...or notifier, or cascading generations, or whatever.19:50
cdentAGNI on worrying about alien nodes, then I can call YAGNI on the optimisation19:50
efriedback atcha, we do NI, as shown by belmoreira's steeply-sloped graphs.19:50
cdentinstead we should not call _ensure_resource_provider19:51
efriedthat's gonna happen too.19:51
cdentthen we don't need to optimize _ensure_resource_provider if we aren't going to call it all the time, right?19:51
cdentI'm sorry if I seem like I'm being a dick about this, but I'd really like to see us scale back to something that does only what's necessary, without optimisations, and then layer them on19:53
cdentbrb19:54
efriedthe plan is actually to refactor _ensure_resource_provider so it checks whether the provider exists in the cache and/or is stale per the timer19:54
efriedif in the cache and not stale, no calls to placement.19:54
efriedbecause the callers *do* still need to make sure it's in the cache.19:55
cdentI thought the goal was to remove the timer?19:55
efriedmake the timer disable-able.19:55
cdentIf the provider is in the cache, assume it is correct until it fails on a write19:55
efriedyeah, or until the timer expires.19:56
efriedGoal to remove the timer <== long term yes19:56
efriednot sure if we're going to get away with doing that immediately.19:56
efriedcurrent poc adds the ability to set it to zero to "disable" refresh.19:56
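A small sketch of the staleness check being described; the interval parameter stands in for the conf option efried mentions (with 0 meaning "never refresh on the timer"), and the function name is illustrative:

    import time

    def associations_stale(last_refresh_time, refresh_interval, now=None):
        """Should we re-poll placement for this provider's associations?"""
        if not refresh_interval:
            # The proposed "set it to zero" case: timer-based refresh is
            # disabled, so the cache is only rebuilt on SIGHUP or after a
            # failed write.
            return False
        now = time.time() if now is None else now
        return (now - last_refresh_time) > refresh_interval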
cdentwhat do we get by keeping it? what changes?19:56
cdentbrb19:56
cdentnm, a call to go outside timed out19:57
efriedwe get the ability to move forward cautiously, convincing/allowing the likes of CERN and vexx to try out the totally-disabled-timer thing to make sure it really works in a large-scale live environment.19:58
cdentDo you have any guess on what would be different? It seems to me that the entire point of this change is so that we have a clearer understanding of how/why the providers and associations go back and forth between the nova-compute and placement.20:01
cdentand if we have that understanding then it, hey presto, it works!20:02
* cdent laughs at self20:02
efriedcdent: I don't see any way it would be different. My operating theory is that we will observe zero (non-operator-induced) drift of inventories, traits, and aggregates when the timer is disabled and all these calls are eliminated. But I'd kinda prefer to be right about that and have to sling another patch later to make it the default than make it the default now and be wrong about it.20:05
cdentsure, but are we targeting it being the default before the end of the cycle?20:06
efriedas you well know, being able to reason about a thing doesn't make it so. If that were true, there would be no bugs.20:06
efriedno idea.20:06
efriedbaby steps20:06
efriedI can write the code for it. And then we can merge it or not.20:06
cdent20:08
*** e0ne has quit IRC20:45
openstackgerritMatt Riedemann proposed openstack/nova-specs master: Add support for emulated virtual TPM  https://review.openstack.org/57111121:43
openstackgerritChris Dent proposed openstack/placement master: Add a placement-manage CLI  https://review.openstack.org/60016122:01
*** e0ne has joined #openstack-placement22:02
openstackgerritChris Dent proposed openstack/placement master: Add a placement-manage CLI  https://review.openstack.org/60016122:05
*** e0ne has quit IRC22:05
openstackgerritMatt Riedemann proposed openstack/placement master: Add bandwidth related standard resource classes  https://review.openstack.org/61568722:56
*** mriedem has quit IRC23:20

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!