*** Nel1x has quit IRC | 04:58 | |
*** e0ne has joined #openstack-placement | 06:17 | |
*** tetsuro has joined #openstack-placement | 06:18 | |
*** e0ne has quit IRC | 06:36 | |
*** rubasov has joined #openstack-placement | 07:45 | |
*** tssurya has joined #openstack-placement | 08:20 | |
*** helenafm has joined #openstack-placement | 08:31 | |
*** bauwser is now known as bauzas | 08:56 | |
*** e0ne has joined #openstack-placement | 09:10 | |
*** ttsiouts has joined #openstack-placement | 09:20 | |
*** ttsiouts has quit IRC | 09:36 | |
*** ttsiouts has joined #openstack-placement | 09:38 | |
*** takashin has joined #openstack-placement | 10:22 | |
*** cdent has joined #openstack-placement | 10:22 | |
*** sean-k-mooney has joined #openstack-placement | 11:27 | |
*** tetsuro has quit IRC | 12:20 | |
*** jroll has quit IRC | 12:32 | |
*** jroll has joined #openstack-placement | 12:34 | |
*** e0ne_ has joined #openstack-placement | 12:37 | |
*** e0ne has quit IRC | 12:40 | |
*** ttsiouts has quit IRC | 13:05 | |
openstackgerrit | Chris Dent proposed openstack/placement master: Added alembic environment https://review.openstack.org/614350 | 13:11 |
openstackgerrit | Chris Dent proposed openstack/placement master: Delete the old migrations https://review.openstack.org/611440 | 13:12 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a document for creating DB revisions https://review.openstack.org/614024 | 13:12 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a placement-manage CLI https://review.openstack.org/600161 | 13:12 |
openstackgerrit | Chris Dent proposed openstack/placement master: Remove sqlalchemy-migrate from requirements.txt https://review.openstack.org/614539 | 13:12 |
openstackgerrit | Chris Dent proposed openstack/placement master: WIP - Show an alembic migration https://review.openstack.org/614025 | 13:15 |
cdent | edleafe: I redid the alembic tests ^ based on my experiments over the weekend and some conversation with zzzeek. | 13:16 |
cdent | It may finally work reliably | 13:16 |
cdent | I hope | 13:16 |
*** tetsuro has joined #openstack-placement | 13:19 | |
* cdent waves at tetsuro | 13:32 | |
* tetsuro waves back to cdent | 13:42 | |
*** ttsiouts has joined #openstack-placement | 13:43 | |
*** leakypipes is now known as jaypipes | 13:53 | |
cdent | fried_rice: you about? | 14:00 |
*** mriedem has joined #openstack-placement | 14:11 | |
*** fried_rice is now known as efried | 14:25 | |
efried | cdent: DST fail, thanks for covering. | 14:26 |
cdent | no prob | 14:26 |
efried | DST is so stupid. | 14:26 |
efried | There, I might limit my semi-yearly rant to that. | 14:27 |
cdent | I agree with you | 14:29 |
*** SteelyDan is now known as dansmith | 14:37 | |
openstackgerrit | Chris Dent proposed openstack/placement master: Added alembic environment https://review.openstack.org/614350 | 14:37 |
openstackgerrit | Chris Dent proposed openstack/placement master: Delete the old migrations https://review.openstack.org/611440 | 14:37 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a document for creating DB revisions https://review.openstack.org/614024 | 14:37 |
openstackgerrit | Chris Dent proposed openstack/placement master: WIP - Show an alembic migration https://review.openstack.org/614025 | 14:37 |
openstackgerrit | Chris Dent proposed openstack/placement master: Remove sqlalchemy-migrate from requirements.txt https://review.openstack.org/614539 | 14:37 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a placement-manage CLI https://review.openstack.org/600161 | 14:38 |
cdent | edleafe: 97th time is charm ^ ? | 14:38 |
cdent | off to appt, biab | 14:38 |
openstackgerrit | John Garbutt proposed openstack/nova-specs master: Add Unified Limits Spec https://review.openstack.org/602201 | 14:46 |
*** tetsuro has quit IRC | 14:55 | |
*** takashin has left #openstack-placement | 15:04 | |
openstackgerrit | John Garbutt proposed openstack/nova-specs master: Add Unified Limits Spec https://review.openstack.org/602201 | 15:13 |
*** ttsiouts has quit IRC | 15:22 | |
*** ttsiouts has joined #openstack-placement | 15:25 | |
*** lyarpwood is now known as lyarwood | 15:26 | |
*** ttsiouts has quit IRC | 15:33 | |
*** ttsiouts has joined #openstack-placement | 15:54 | |
*** ttsiouts has quit IRC | 16:14 | |
*** ttsiouts has joined #openstack-placement | 16:14 | |
*** ttsiouts has quit IRC | 16:19 | |
*** e0ne_ has quit IRC | 16:36 | |
*** ttsiouts has joined #openstack-placement | 16:47 | |
*** openstackgerrit has quit IRC | 16:48 | |
*** helenafm has quit IRC | 17:01 | |
*** openstackgerrit has joined #openstack-placement | 17:36 | |
openstackgerrit | Chris Dent proposed openstack/placement master: Added alembic environment https://review.openstack.org/614350 | 17:36 |
openstackgerrit | Chris Dent proposed openstack/placement master: Delete the old migrations https://review.openstack.org/611440 | 17:36 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a document for creating DB revisions https://review.openstack.org/614024 | 17:36 |
openstackgerrit | Chris Dent proposed openstack/placement master: WIP - Show an alembic migration https://review.openstack.org/614025 | 17:36 |
openstackgerrit | Chris Dent proposed openstack/placement master: Remove sqlalchemy-migrate from requirements.txt https://review.openstack.org/614539 | 17:37 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a placement-manage CLI https://review.openstack.org/600161 | 17:37 |
*** ttsiouts has quit IRC | 17:50 | |
*** ttsiouts has joined #openstack-placement | 17:51 | |
*** ttsiouts has quit IRC | 17:55 | |
*** e0ne has joined #openstack-placement | 17:57 | |
cdent | efried: are you caught up on the refresh email thread? | 18:26 |
efried | cdent: Yes. | 18:26 |
cdent | does silence in this case mean implicit "yeah, there's no crazy in there"? | 18:27 |
cdent | or "I'm coming back later with the crazy hammer"? | 18:27 |
efried | cdent: I'm working on the code atm, but when I respond, I will gently quibble with your (non-)definition of "cache". Otherwise, there was nothing I felt the need to jump all over immediately, no. | 18:27 |
cdent | rad. I was hoping to tickle your quibble on that | 18:27 |
efried | to me, it's a cache by definition because we're saving information so we don't have to call back to the source every time we want to use the information. | 18:28 |
cdent | there had been a longer version of a response in my head which distinguished what you just said with a placement-side cache (using memcached or similar) that had namespaced keys with explicit invalidation routines | 18:29 |
cdent | but I figured I would save that for another time as it would be premature | 18:29 |
cdent | I don't think of it as a cache because there are only two things involved: 'my information' and 'your information' and no 'this third thing I check' | 18:30 |
cdent | but it is quibbling indeed | 18:30 |
cdent | I was merely trying to establish my confusion over how it makes it seem more complicated than it is: as if there is an additional thing to manage | 18:31 |
efried | cdent: Maybe I'm just too uneducated, using a rudimentary collegiate definition of "cache". | 18:32 |
cdent | "a collection of items of the same type stored in a hidden or inaccessible place." | 18:32 |
cdent | :D | 18:32 |
efried | "hidden or inaccessible"? This came from where? | 18:33 |
cdent | https://www.google.co.uk/search?q=define%3Acache&oq=define%3Acache&aqs=chrome..69i57j69i58.1643j0j7&sourceid=chrome&ie=UTF-8 | 18:33 |
cdent | and the french is 'cacher': 'to hide' | 18:33 |
cdent | After all the holes I've been in with oslo_db and sqlalchemy today, I'm not sure I want to fall in this one | 18:34 |
efried | oh, I misread. It's not that the collection of items is the same type as what's stored in the hidden/inaccessible place (i.e. the backing store). It's the collection of items that's stored in a hidden or inaccessible place. | 18:34 |
efried | This is a generic definition of cache like pirate treasure, not the computing term. | 18:34 |
efried | https://en.wikipedia.org/wiki/Cache_(computing) | 18:35 |
efried | First paragraph describes the provider tree cache perfectly. | 18:35 |
cdent | huh, not to me it doesn't | 18:35 |
cdent | we're not storing it for future requests | 18:36 |
cdent | we're storing it so we have it at all | 18:36 |
efried | um | 18:36 |
cdent | "so that future requests for that data can be served faster" | 18:36 |
efried | we're storing it for every "request" we make to update_provider_tree to say, "has this changed?" | 18:36 |
cdent | we get resources from the server, then we use those resources | 18:36 |
cdent | it's just a data structure | 18:37 |
cdent | but please, let's just agree to disagree, the way things are going in the code is right, even if we disagree on the terms | 18:37 |
efried | a data structure that exists so we don't have to go to the backing store every time we need the data. | 18:37 |
efried | okay, but I'm going to be naming things in that code | 18:37 |
cdent | in that sentence what is the "backing store"? | 18:37 |
efried | and the names are going to include "cache" in them. | 18:38 |
cdent | is suspect the question I just asked is the root of the whole thing | 18:38 |
cdent | s/is/I/ | 18:38 |
cdent | if you're thinking the placement service is the backing store, that's probably where we're having an impedance problem | 18:39 |
efried | is a [...] software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be [...] a copy of data stored elsewhere [<== backing store]. [...] Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store [<== backing store]; thus, the more requests that can be served from the cache, the fast[er the system performs] | 19:39 |
efried | yes, "placement" is the backing store. Totally. | 18:39 |
cdent | no, it's not | 18:39 |
efried | "the placement database", if you like. | 18:40 |
efried | the service is just the means to access that database. | 18:40 |
cdent | it's the HTTP service of resource | 18:40 |
cdent | no | 18:40 |
cdent | if you're writing code on the nova side thinking that placement is a front to a database you're going to have all sort of difficulties getting it to work as well as it should | 18:40 |
efried | "a database" or just "data" | 18:41 |
cdent | even that | 18:41 |
efried | I don't get why that's a bad thing. | 18:41 |
cdent | the idea of a resource oriented HTTP API is that when you do a GET you have the state and it is _the_ state (from your point of view) until you try to write it back | 18:41 |
cdent | the outcome of the discussion on that thread is the expression of that ^ | 18:42 |
cdent | the reason for having generations is to achieve that ^^ | 18:43 |
efried | That sentence is a perfect justification for the existence and form of the provider tree cache. | 18:43 |
efried | It reflects what I GET'd and it's *the* state until I try to write it back (as the result of changing something via upt) | 18:44 |
cdent | s/ cache// | 18:44 |
efried | but it's a cache because I don't go re-GET it every time I ask upt whether something has changed. | 18:44 |
* cdent shrugs | 18:44 | |
efried | It seems like we're agreeing on the technology and just not agreeing on the use of the word 'cache'. | 18:45 |
cdent | somewhat | 18:45 |
cdent | but I think (although I'm not certain) that thinking of it as a cache is what got us to where we are now (prior to your illustrious fix), and if that's the case, then there's something that could be clarified, I'm not sure what. | 18:46 |
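The cache-vs-plain-data quibble above comes down to one behaviour: fetch from the source only on a miss, and "clear" meaning "next time I need this information, go get it from the source" (as efried puts it below). A minimal sketch of that pattern, with hypothetical names rather than anything from nova's actual SchedulerReportClient:

```python
class ReadThroughCache:
    """Minimal read-through cache: call the backing source only on a miss.

    Editor's illustration of the pattern under discussion, not code from
    nova; `fetch` stands in for a GET against the placement service.
    """

    def __init__(self, fetch):
        self._fetch = fetch   # callable: key -> value (the "backing store")
        self._data = {}
        self.misses = 0       # exposed for illustration only

    def get(self, key):
        if key not in self._data:           # miss: go to the source
            self._data[key] = self._fetch(key)
            self.misses += 1
        return self._data[key]              # hit: served locally

    def clear(self):
        """Next access per key re-fetches from the source."""
        self._data.clear()
```

Whether one calls this a "cache" or "just a data structure holding my copy of the state", the observable behaviour is the same: repeated `get()` calls do not hit the source until `clear()` is invoked.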
efried | Okay, so when I write this patch that clears the ProviderTree object as well as the association_refresh_time when you send SIGHUP to the compute process, I had planned to call it | 18:46 |
efried | SchedulerReportClient.clear_provider_cache() | 18:46 |
efried | What would you propose instead? | 18:46 |
cdent | I've always felt like there's some kind of, but not sure what, disconnect between how ProviderTree is/was implemented and placement itself. And my continued discussion on these lines is an effort to make sure we make a thing that's happy and also simple. | 18:47 |
efried | IMO the problem here is not the existence of ProviderTree, but the fact that we are trying to refresh it every periodic. | 18:47 |
cdent | I'd probably do something like clear_provider() | 18:48 |
efried | I'm not clearing any providers though :P | 18:48 |
cdent | I'm not against the ProviderTree. I'm against that it is a layered data object that does things to itself rather than just being data. | 18:48 |
efried | If we eliminated the ProviderTree entirely, we would be doing basically the same thing as if that conf option was set to the same interval as the periodic. I.e. going to placement to populate *some* data structure (a big bag o dicts?) before every call to update_provider_tree | 18:48 |
efried | okay | 18:49 |
efried | what do you mean, does things to itself? | 18:49 |
efried | By just being data, you still have to be able to add stuff to it, etc. | 18:49 |
efried | CRUD ops on that data | 18:49 |
* cdent nods | 18:49 | |
efried | The methods on ProviderTree just give those operations sane names that reflect some meaning of the data being changed. | 18:50 |
efried | rather than relying on the caller to understand under what circumstances it's okay to say bagodicts[attr] = value | 18:50 |
cdent | but you assert that when you clear the ProviderTree you are not clearing providers? Are you not clearing your local representation of those providers? You are resetting your local state, yes? | 18:51 |
efried | subtle difference between clearing the cache and "resetting local state". | 18:51 |
efried | by clearing the cache I'm saying, "next time I need this information, go get it from the source" | 18:51 |
efried | resetting local state would be more like, "I think it should look like *this*" | 18:52 |
cdent | and the plan is to reset/clear on sighup and whenever there is a 409 when trying to write, is that right? | 18:53 |
efried | Which I don't think we want to do, because then we're guaranteed that the next push to placement will bounce 409 (because we would have no idea what the generation should be) and we would have to re-GET... | 18:53 |
efried | no, definitely not. | 18:53 |
efried | Wherever feasible, 409 should be "re-GET the offending object, redrive my edits, redo the call" | 18:54 |
cdent | I think that's what both jay and I implied (with maybe some difference in degree) in the email thread | 18:54 |
efried | Ah, I think I get where that's coming from. | 18:55 |
efried | When update_from_provider_tree encounters an error - any error - we have to go back to update_provider_tree with a freshly-acquired representation of what placement thinks the data looks like. | 18:55 |
*** tssurya has quit IRC | 18:56 | |
efried | In a very abstract theoretical world, we could just take whatever upt gave us and blat it down. But once you start picking that apart, it quickly breaks down. | 18:57 |
jaypipes | cdent: +1 | 18:58 |
efried | by design, upt needs to be empowered to operate on the tree as it really exists (in placement). If all we ever did was ask the driver what the tree should look like out of nowhere, we're not better off than we were with get_available_resource/get_inventory. | 18:58 |
cdent | efried: i don't think anyone is disputing that part of things | 18:59 |
efried | Okay, then we need to give upt an accurate representation. | 18:59 |
cdent | I really do think it is important that we restate (in English not Python) what it is we're expecting the RT to do | 18:59 |
cdent | yes, that accurate representation is something you GET when things go wrong, but _only_ when things go wrong | 19:00 |
efried | "allow the virt driver, via update_provider_tree, to edit the providers and associated artifacts belonging to this compute host" | 19:00 |
efried | "If something goes wrong with flushing that edit back to placement, the RT needs to give the virt driver another chance." | 19:01 |
cdent | jaypipes: I'm unclear which part you were +1 on, but thanks? | 19:03 |
efried | "Due to the chunky nature of the objects/APIs - the fact that we don't have an atomic API allowing us to flush changes to multiple providers' inventories, aggregates, and traits, as well as adding/removing providers - if something goes wrong partway through the flushing, we need to refresh our view of *at least* the object on which an operation failed." | 19:03 |
cdent | efried: so is the "no, definitely not" that you don't want to re-GET the entire tree? | 19:04 |
jaypipes | cdent: "I think that's what both jay and I implied" | 19:04 |
cdent | (thanks) | 19:04 |
efried | cdent: It was "No, we definitely do not want, as a rule, to reset/clear any time there is a 409" | 19:04 |
efried | But ufpt is a special case of ^ that. | 19:04 |
efried | where we *do* clear on 409 (and any other failures) | 19:04 |
efried | for reasons I'm trying to explain... | 19:05 |
efried | Am I making sense? | 19:05 |
cdent | somewhat, but I'm not following why this isn't okay (and I'll use 'cache' here to try to keep in the same tune): | 19:05 |
cdent | a) I start up, get from placement what it knows about "me" and put it in a ProviderTree | 19:06 |
cdent | b) that tree, which may currently be nothing, is then shown to the virt driver who says "I've got some different stuff" | 19:07 |
cdent | c) we write that different stuff to placement | 19:07 |
cdent | d) we _cache_ that stuff in the ProviderTree | 19:07 |
cdent | e) a some point in the future we ask the virtdriver if anything has changed | 19:08 |
cdent | f) we compare that with the cache, sync back to placement | 19:08 |
efried | f) ...sync *iff we think anything has changed* | 19:09 |
efried | (so far so good) | 19:09 |
cdent | g) if there are any errors at all we flush the entire provider tree, get it again from placement, make what we got the new "cache" | 19:09 |
efried | (signal when done) | 19:09 |
cdent | h) either ask the virt driver to try again, or use the info we already had from it to do a sync back | 19:10 |
cdent | i) if the write is successful we have a happy "cache" and we wait until the next time round the loop | 19:10 |
cdent | j) sit back and enjoy that the code has a simple (albeit expensive, but not as expensive as what we have now) failure mode where the same thing always happens, but happens rarely because changes aren't all that common | 19:11 |
cdent | EOF | 19:11 |
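cdent's (a)-(j) flow can be sketched as a single periodic pass; the `placement` and `virt_driver` interfaces here are illustrative stand-ins, not the real nova or placement APIs:

```python
class ConflictError(Exception):
    """Stand-in for a placement 409 (generation conflict)."""


def periodic_sync(placement, virt_driver, tree):
    """One pass of the (a)-(j) flow sketched above; hypothetical interfaces.

    `placement.get_tree()` / `placement.write(tree)` and
    `virt_driver.update_provider_tree(tree)` are editor's stand-ins.
    Returns the (possibly re-fetched) tree for the next pass.
    """
    changed = virt_driver.update_provider_tree(tree)  # (b/e) anything new?
    if not changed:
        return tree                                   # nothing to sync
    try:
        placement.write(tree)                         # (c/f) sync back
    except ConflictError:
        # (g) any error at all: flush the whole view, re-GET from placement;
        # (h) the next periodic shows the fresh tree to the virt driver again.
        tree = placement.get_tree()
    return tree                                       # (i) happy "cache"
```

The failure mode is simple and uniform, as point (j) argues: on any error, throw the local view away and re-GET, trusting that conflicts are rare enough that the extra GET is cheap in aggregate.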
efried | Totally on board with all of that. | 19:11 |
efried | Exactly where I'm trying to get to. | 19:12 |
cdent | but that does have the reset/clear, so I'm confused: "No, we definitely do not want, as a rule, to reset/clear any time there is a 409" | 19:13 |
efried | The major thing that's different from that today is that there's a separate, parallel code path that does: | 19:13 |
efried | a) If a timer has expired, get everything whether we think it's stale or not | 19:13 |
efried | EOF | 19:13 |
cdent | yeah, it is great that a is going away | 19:13 |
efried | cdent: Right "as a rule". But in the next line I said that ufpt is an exception to that. | 19:13 |
cdent | what's in the RT is out of ufpt where this comes in to play? (assuming a virt driver that supports it) | 19:14 |
efried | Because of the contract we have with upt, we need to react to any error, including "we tried to update something but what we started with was stale, our bad but still an error condition", by purging and refreshing our view. | 19:14 |
efried | sorry, rephrase please? | 19:15 |
efried | ufpt essentially pushes the same bits to ProviderTree and placement. | 19:15 |
efried | thereby making sure they're in sync. | 19:15 |
efried | except that it only pushes bits if bits have changed. | 19:15 |
efried | to avoid making unnecessary calls. | 19:16 |
cdent | one sec | 19:16 |
efried | because we <3 CERN | 19:16 |
cdent | Sorry, I'm still confused. Point me to some code where we're going to send something in a ProviderTree to placement and we don't want to "clear the cache"? | 19:19 |
efried | cdent: "clear" vs "update". | 19:20 |
cdent | pardon? | 19:20 |
efried | So like, we GET the compute RP, and it says CPU: {'total': 4}. We stuff that in the reportclient._provider_tree. | 19:20 |
cdent | I left out ...cache" on a 409? | 19:21 |
* cdent nods | 19:21 | |
efried | ah, okay. | 19:21 |
efried | abort current explanation. | 19:21 |
efried | What you're essentially asking is where, outside of ufpt, would we be pushing updates to placement. | 19:22 |
efried | and I assume we're still not talking about allocations. Because there. | 19:22 |
cdent | yes, allocations are off the board at the moment | 19:22 |
efried | The only example I can think of offhand is when the API process is syncing host AZs to placement aggregates. | 19:23 |
efried | Other than that, it may be true that all other non-allocation updates are being funneled through upt/ufpt. | 19:23 |
cdent | I'm talking about _in the resource tracker/compute manager_. | 19:23 |
efried | okay, so ---^ | 19:23 |
efried | I'm just sayin, in theory, one would mostly *like* to respond to a 409 by just re-GETting the thing that failed, redoing the edit to it, and re-pushing it. | 19:24 |
efried | which is actually what the existing incarnation of ufpt is doing. | 19:24 |
cdent | so you are, then, okay with dumping the "cache" every time there is a 409 (and any other error) in ufpt, because it provides a small surface area of handling, yes? | 19:25 |
efried | by "clearing the cache" for *just* that object, and letting the next periodic do the re-GET and re-edit. | 19:25 |
efried | yes | 19:25 |
cdent | great. glad to hear it :() | 19:25 |
cdent | that's some kind of open mouthed kiss apparently | 19:25 |
efried | because what ufpt is trying to do now (just redoing the failed object) is overly complicated and, as an optimization, probably not worth the aggravation. | 19:26 |
cdent | yes | 19:26 |
efried | Getting rid of that, and clearing the whole cache, will allow me to get rid of a TODO and reinstate another optimization | 19:26 |
cdent | especially when we have turned off the periodic refresh, so much win | 19:26 |
efried | I think that bit is in the current incarnation, looking... | 19:26 |
efried | https://review.openstack.org/#/c/614886/2/nova/scheduler/client/report.py@a652 | 19:27 |
cdent | is that possibly causing wrongness if the list of children has changed? I would think that's an optimisation we don't want if that's the case | 19:31 |
cdent | i.e. the cyborg and neutron cases? | 19:32 |
* cdent is not really sure about that | 19:32 | |
cdent | I'm getting pretty exhausted at this stage, this is a huge context from the fight I had with https://review.openstack.org/#/c/614350/ over the weekend | 19:32 |
cdent | sigh | 19:33 |
cdent | context *switch* | 19:34 |
efried | cdent: (was getting lunch) Importantly, we decided we don't need to deal with children created outside of the compute service. | 19:41 |
efried | If the list of children has changed because upt told us so, we'll pick it up. If it has changed for any other reason, we don't care and don't need to know. | 19:42 |
cdent | I understand that people discussed that, but I don't understand how one can think one has a ProviderTree if you don't have all the children, and if that block of code is the one optimization that that allows _and_ we expect the cache to not get wrong often, why bother? | 19:43 |
cdent | or in other words: that optimization looks like a bug waiting to happen | 19:43 |
efried | cdent: Because it's the only way we avoid polling the backing store every periodic to make sure our view isn't stale. Lacking some kind of async notifier, anyway. | 19:45 |
efried | I intend to leave some NOTEs in there warning that this is the case. | 19:46 |
efried | But the docstring for upt already warns implementors not to muck with providers it finds in there that it doesn't "own". | 19:46 |
cdent | or cascading generations. Can we imagine there ever being a case where an alien provider is not a leaf? | 19:46 |
efried | Which is tantamount to the same thing, it's just now you won't find any. | 19:46 |
efried | well, or an intermediary node, none of whose descendants belong to the compute service. | 19:47 |
efried | I'm not sure. I would hope so, but I'm not sure. | 19:47 |
cdent | that's something we should keep in mind, I guess | 19:48 |
efried | Yup. I'm gonna call YAGNI on this, though. Until we have a situation where the compute service needs to be aware of externally-created tree nodes, let's not solve for it. I don't think we've painted ourselves into a corner by doing this - cause we can always reinstate the ?in_tree= poll. | 19:49 |
cdent | If you're gonna call YAGNI on worrying about alien nodes, then I can call YAGNI on the optimisation | 19:50 |
efried | ...or notifier, or cascading generations, or whatever. | 19:50 |
efried | back atcha, we do NI, as shown by belmoreira's steeply-sloped graphs. | 19:50 |
cdent | instead we should not call _ensure_provider | 19:51 |
efried | that's gonna happen too. | 19:51 |
cdent | then we don't need to optimize _ensure_resource_provider if we aren't going to call it all the time, right? | 19:51 |
cdent | I'm sorry if I seem like I'm being a dick about this, but I'd really like to see us scale back to something that does only what's necessary, without optimisations, and then layer them on | 19:53 |
cdent | brb | 19:54 |
efried | the plan is actually to refactor _ensure_resource_provider so it checks whether the provider exists in the cache and/or is stale per the timer | 19:54 |
efried | if in the cache and not stale, no calls to placement. | 19:54 |
efried | because the callers *do* still need to make sure it's in the cache. | 19:55 |
cdent | I thought the goal was to remove the timer? | 19:55 |
efried | make the timer disable-able. | 19:55 |
cdent | If the provider is in the cache, assume it is correct until it fails on a write | 19:55 |
efried | yeah, or until the timer expires. | 19:56 |
efried | Goal to remove the timer <== long term yes | 19:56 |
efried | not sure if we're going to get away with doing that immediately. | 19:56 |
efried | current poc adds the ability to set it to zero to "disable" refresh. | 19:56 |
cdent | what do we get by keeping it? what changes? | 19:56 |
cdent | brb | 19:56 |
cdent | nm, a call to go outside timed out | 19:57 |
efried | we get the ability to move forward cautiously, convincing/allowing the likes of CERN and vexx to try out the totally-disabled-timer thing to make sure it really works in a large-scale live environment. | 19:58 |
cdent | Do you have any guess on what would be different? It seems to me that the entire point of this change is so that we have a clearer understanding of how/why the providers and associations go back and forth between the nova-compute and placement. | 20:01 |
cdent | and if we have that understanding then it, hey presto, it works! | 20:02 |
* cdent laughs at self | 20:02 | |
efried | cdent: I don't see any way it would be different. My operating theory is that we will observe zero (non-operator-induced) drift of inventories, traits, and aggregates when the timer is disabled and all these calls are eliminated. But I'd kinda prefer to be right about that and have to sling another patch later to make it the default than make it the default now and be wrong about it. | 20:05 |
cdent | sure, but are we targetting it being the default before the end of the cycle? | 20:06 |
efried | as you well know, being able to reason about a thing doesn't make it so. If that were true, there would be no bugs. | 20:06 |
efried | no idea. | 20:06 |
efried | baby steps | 20:06 |
efried | I can write the code for it. And then we can merge it or not. | 20:06 |
cdent | ✔ | 20:08 |
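The timer-gated refactor efried describes above (skip the call to placement when the cached provider is fresh, with an interval of 0 disabling refresh entirely so the cache is trusted until a write fails) might look roughly like this sketch; the names are hypothetical, not the actual SchedulerReportClient code:

```python
import time


class ProviderCache:
    """Editor's sketch of the proposed _ensure_resource_provider behaviour:
    call out to placement only when a provider is absent from the cache or
    its entry is older than the refresh interval. An interval of 0 means
    the timer is disabled and a cached entry is never considered stale.
    """

    def __init__(self, fetch, refresh_interval=300, clock=time.monotonic):
        self._fetch = fetch              # callable: uuid -> provider record
        self._refresh_interval = refresh_interval
        self._clock = clock              # injectable for testing
        self._providers = {}
        self._fetched_at = {}

    def _is_stale(self, uuid):
        if uuid not in self._providers:
            return True                  # never fetched: must GET
        if self._refresh_interval == 0:
            return False                 # timer disabled: trust the cache
        return self._clock() - self._fetched_at[uuid] > self._refresh_interval

    def ensure(self, uuid):
        """Return the provider, refreshing from placement only if needed."""
        if self._is_stale(uuid):
            self._providers[uuid] = self._fetch(uuid)
            self._fetched_at[uuid] = self._clock()
        return self._providers[uuid]
```

This matches the cautious rollout discussed above: the default interval keeps today's periodic refresh behaviour, while operators such as CERN could set it to 0 to try the no-refresh mode at scale.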
*** e0ne has quit IRC | 20:45 | |
openstackgerrit | Matt Riedemann proposed openstack/nova-specs master: Add support for emulated virtual TPM https://review.openstack.org/571111 | 21:43 |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a placement-manage CLI https://review.openstack.org/600161 | 22:01 |
*** e0ne has joined #openstack-placement | 22:02 | |
openstackgerrit | Chris Dent proposed openstack/placement master: Add a placement-manage CLI https://review.openstack.org/600161 | 22:05 |
*** e0ne has quit IRC | 22:05 | |
openstackgerrit | Matt Riedemann proposed openstack/placement master: Add bandwidth related standard resource classes https://review.openstack.org/615687 | 22:56 |
*** mriedem has quit IRC | 23:20 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!