*** tzn has joined #openstack-solar | 00:26 | |
*** tzn has quit IRC | 01:01 | |
*** tzn has joined #openstack-solar | 02:58 | |
*** tzn has quit IRC | 03:03 | |
*** tzn has joined #openstack-solar | 03:59 | |
*** tzn has quit IRC | 04:04 | |
*** tzn has joined #openstack-solar | 05:00 | |
*** tzn has quit IRC | 05:04 | |
*** tzn has joined #openstack-solar | 06:00 | |
*** tzn has quit IRC | 06:05 | |
*** tzn has joined #openstack-solar | 07:01 | |
*** tzn has quit IRC | 07:06 | |
*** dshulyak_ has joined #openstack-solar | 07:27 | |
*** tzn has joined #openstack-solar | 08:02 | |
*** tzn has quit IRC | 08:07 | |
*** salmon_ has joined #openstack-solar | 08:27 | |
pigmej | so | 08:45 |
pigmej | I don't know WTF but my riak on vagrant crashed on our examples dshulyak_ | 08:46 |
pigmej | BUT 2 other riaks survived | 08:46 |
dshulyak_ | hi | 08:46 |
dshulyak_ | maybe u are using riak from your example? | 08:46 |
pigmej | I'm going to dig into riak logs to check wtf, because it looks like something is wrong | 08:46 |
dshulyak_ | like there is some config file somewhere | 08:46 |
pigmej | dshulyak_: no, docker from our image | 08:46 |
pigmej | but I also noticed that sometimes ram usage on that machine is high | 08:47 |
pigmej | so /maybe/ riak behaves incorrectly / badly in some rare conditions | 08:47 |
pigmej | BUT I would say that we can ignore the errors I had, because 2 other riaks survived; adding yours + salmon_'s it's 4 vs 1 | 08:48 |
pigmej | so... ;) | 08:48 |
pigmej | dshulyak_: do you agree with that assumption? | 08:49 |
pigmej | because tbh, I have no other ideas | 08:49 |
pigmej | I also talked with one guy from basho, he also said that "it should work with n_val=1" | 08:49 |
dshulyak_ | yes, to me it looks like correct behaviour | 08:53 |
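For context, a minimal sketch of what pinning a bucket to n_val=1 looks like with the `riak` Python client (host, port and bucket name here are illustrative, not Solar's actual configuration):

```python
import riak

# Connect to a single-node Riak (connection details are illustrative).
client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

# With n_val=1 there is exactly one replica per key, so the
# "it should work with n_val=1" assumption means no divergence
# between replicas can occur for keys in this bucket.
bucket = client.bucket('solar_locks')        # hypothetical bucket name
bucket.set_properties({'n_val': 1})

print(bucket.get_properties()['n_val'])      # -> 1
```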
salmon_ | yo https://github.com/Mirantis/solar-resources/pull/3 | 08:54 |
*** tzn has joined #openstack-solar | 09:03 | |
pigmej | so dshulyak_ can you post PR with force=True ? | 09:06 |
dshulyak_ | i can, but it solves only 30% of the problems :) there is still no proper removal of the lock and the problem with the counter | 09:07 |
dshulyak_ | so maybe let's merge that patch with the concurrency limit | 09:07 |
dshulyak_ | https://review.openstack.org/#/c/267769/ | 09:07 |
pigmej | dshulyak_: removal? | 09:07 |
pigmej | that "B" thinks that A still locks it, because B saved that info ? | 09:08 |
dshulyak_ | yes, that case, when B reads lock, A deletes, B saves | 09:08 |
*** tzn has quit IRC | 09:08 | |
pigmej | k, | 09:08 |
pigmej | I will simplify my lock, because n_val works as needed it seems, then it should be fine... | 09:08 |
pigmej | so there is a chance that my code will not be removed ;D | 09:08 |
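A toy illustration of the interleaving dshulyak_ describes above ("B reads lock, A deletes, B saves"); a plain dict stands in for the lock bucket, no Riak involved:

```python
# Illustrative only: the stale write from B resurrects a lock that A deleted.
store = {'lock/graph-1': {'holder': 'A'}}

# B reads the lock and still sees A as the holder.
lock_seen_by_b = dict(store['lock/graph-1'])

# A finishes its work and deletes the lock.
del store['lock/graph-1']

# B writes back what it read (e.g. adding itself to a waiters list),
# so the store again claims A holds the lock even though A is long gone.
lock_seen_by_b.setdefault('waiters', []).append('B')
store['lock/graph-1'] = lock_seen_by_b

print(store['lock/graph-1'])   # {'holder': 'A', 'waiters': ['B']}
```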
dshulyak_ | i wonder how u are going to work with riak on your vagrant :) | 09:09 |
pigmej | I'm going to not do it | 09:09 |
pigmej | ;] | 09:09 |
dshulyak_ | sounds strange :) | 09:10 |
pigmej | I spawned the same docker image outside vagrant | 09:10 |
pigmej | and it works... | 09:10 |
pigmej | so... it's wtf, but ... it works... | 09:11 |
dshulyak_ | it really sounds like u have some weird config file, or your docker image is wrong | 09:11 |
dshulyak_ | on vagrant | 09:11 |
openstackgerrit | Merged openstack/solar: Set concurrency=1 for system log and scheduler queues https://review.openstack.org/267769 | 09:13 |
pigmej | dshulyak_: I checked it and it's our image | 09:14 |
pigmej | I would rather say some memory problems or sth | 09:14 |
pigmej | I just don't know | 09:14 |
pigmej | let's just ignore this fact | 09:15 |
*** tzn has joined #openstack-solar | 09:18 | |
salmon_ | pigmej: dshulyak_ https://review.openstack.org/#/c/267562 | 09:19 |
*** tzn has quit IRC | 09:20 | |
*** tzn has joined #openstack-solar | 09:21 | |
pigmej | salmon_: there you are :) | 09:21 |
pigmej | dshulyak_: https://review.openstack.org/#/c/267503/ :) | 09:22 |
openstackgerrit | Merged openstack/solar: Include ansible config when syncing repo https://review.openstack.org/267562 | 09:25 |
openstackgerrit | Merged openstack/solar: Conditional imports in locking (riak or peewee) https://review.openstack.org/267503 | 09:27 |
*** tzn has quit IRC | 09:50 | |
salmon_ | https://github.com/Mirantis/solar-resources/pull/4 | 10:01 |
salmon_ | pigmej: dshulyak_^ | 10:03 |
salmon_ | btw, it seems that the hook is working: https://readthedocs.org/projects/solar/builds/ | 10:04 |
pigmej | cool | 10:05 |
pigmej | dshulyak_: btw, why we exactly need this lock ? It's a lock for whole graph? | 10:22 |
dshulyak_ | yes, scheduling of a single graph shouldn't be concurrent; it is possible that some tasks won't be scheduled at all, or will be scheduled several times | 10:28 |
pigmej | k | 10:34 |
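To make the double-scheduling risk concrete, a toy sketch (not Solar code) of two scheduler passes racing over the same graph without a lock:

```python
# Two concurrent scheduling passes read the same PENDING task before either
# write lands, so the task is dispatched twice.
tasks = {'task-1': 'PENDING'}
dispatched = []

# Both passes snapshot the graph state "at the same time".
snapshot_a = dict(tasks)
snapshot_b = dict(tasks)

for worker, snapshot in (('scheduler-A', snapshot_a), ('scheduler-B', snapshot_b)):
    if snapshot['task-1'] == 'PENDING':
        dispatched.append((worker, 'task-1'))
        tasks['task-1'] = 'INPROGRESS'

print(dispatched)   # [('scheduler-A', 'task-1'), ('scheduler-B', 'task-1')]
```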
*** dshulyak_ has quit IRC | 10:57 | |
openstackgerrit | Lukasz Oles proposed openstack/solar: Add test for wordpress example https://review.openstack.org/266360 | 10:58 |
*** dshulyak_ has joined #openstack-solar | 10:58 | |
*** dshulyak_ has quit IRC | 11:01 | |
*** dshulyak_ has joined #openstack-solar | 11:02 | |
pigmej | ehs, I have bad ideas today :( | 11:18 |
*** tzn has joined #openstack-solar | 11:19 | |
salmon_ | dshulyak_: pigmej https://review.openstack.org/#/c/266360 plz :) | 11:36 |
pigmej | salmon_: I messed up with my env, I cannot check it :( | 11:36 |
salmon_ | just +1 :P | 11:36 |
pigmej | ehs | 11:36 |
pigmej | ;P | 11:37 |
pigmej | there you are salmon_ | 11:37 |
salmon_ | pigmej: and this https://review.openstack.org/#/c/267761/ ;) | 11:37 |
pigmej | done salmon_ | 11:38 |
pigmej | dshulyak_: https://review.openstack.org/#/c/267453/ can you +1 ? | 11:38 |
dshulyak_ | sure, but os-infra will merge it anyway | 11:39 |
pigmej | yeah I know | 11:39 |
pigmej | hmm dshulyak_ I think we have a problem, I may be wrong, BUT how is a graph task saved? | 12:23 |
pigmej | isn't it db.read(), modify, db.save() ? | 12:23 |
dshulyak_ | yes, something like that | 12:24 |
dshulyak_ | :) | 12:24 |
pigmej | so, on a normal full-sized riak, there is a chance that we will save an "old" value in place of the new one, isn't there? | 12:25 |
dshulyak_ | if we have a lock or execution is simply sequential then i don't see how it is possible | 12:27 |
dshulyak_ | but i guess there might be a problem if one of replicas will go down | 12:28 |
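A toy illustration of the lost update pigmej is worried about with the read / modify / save pattern (again a dict stands in for the bucket):

```python
# Two writers read the same graph task, modify different fields, and the
# second save() silently overwrites the first one's change.
db = {'task-1': {'status': 'PENDING', 'errmsg': None}}

copy_a = dict(db['task-1'])     # writer A: db.read()
copy_b = dict(db['task-1'])     # writer B: db.read()

copy_a['status'] = 'SUCCESS'    # writer A: modify
db['task-1'] = copy_a           # writer A: db.save()

copy_b['errmsg'] = 'retrying'   # writer B: modify (based on the stale read)
db['task-1'] = copy_b           # writer B: db.save() -> A's SUCCESS is lost

print(db['task-1'])             # {'status': 'PENDING', 'errmsg': 'retrying'}
```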
pigmej | yup, | 12:29 |
pigmej | also, when n_val != 1, and nodes != 1 there are some different stories | 12:29 |
pigmej | in theory it's easy to write a resolver for that, but I'm not sure if it's the right solution | 12:29 |
pigmej | because it's obvious that when INPROGRESS conflicts with SUCCESS or ERROR, the final state should be SUCCESS or ERROR | 12:30 |
dshulyak_ | yeah, i had same idea | 12:30 |
pigmej | and everything should work fine, because /something/ which set SUCCESS / ERROR is already aware of this situation | 12:31 |
pigmej | and it will already assume SUCCESS / ERROR there, not INPROGRESS | 12:31 |
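A minimal sketch of that resolver idea using the `riak` client's custom-resolver hook; the bucket name and state field are illustrative, the precedence is the one pigmej describes (SUCCESS/ERROR beats INPROGRESS):

```python
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('graph_tasks')            # hypothetical bucket name

# Final states always win over transient ones when siblings conflict.
STATE_RANK = {'PENDING': 0, 'INPROGRESS': 1, 'ERROR': 2, 'SUCCESS': 2}

def state_resolver(riak_object):
    # The client calls this when a fetch returns siblings; keep exactly one.
    riak_object.siblings = [max(
        riak_object.siblings,
        key=lambda sibling: STATE_RANK.get((sibling.data or {}).get('state'), 0))]

bucket.resolver = state_resolver
```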
dshulyak_ | but there might be a problem if we lose an INPROGRESS update and schedule one task several times | 12:32 |
dshulyak_ | i would like to test pr=n_val behaviour and disable sloppy quorum somehow | 12:32 |
pigmej | https://aphyr.com/posts/285-call-me-maybe-riak + "Strict Quorum" | 12:33 |
pigmej | (but it's a bit of an unfortunate example) | 12:34 |
pigmej | the more I play with this lock the more I dislike how we designed that part :( | 12:35 |
dshulyak_ | i remember that, but it is quite old already - 2013, maybe smth changed :) | 12:35 |
pigmej | not at this area | 12:35 |
pigmej | because it's how stuff works | 12:35 |
pigmej | :) | 12:35 |
pigmej | btw, scheduler always looks at full graph, right? | 12:36 |
dshulyak_ | what's the difference then between pr/pw and r/w? | 12:36 |
pigmej | primary vnode can have multiple fallback vnodes | 12:37 |
dshulyak_ | but when i do r/w i will still go to the primary first | 12:37 |
pigmej | before riak realizes this situation, it may succeed in writing to the primary vnode, and this primary vnode can mess up other values | 12:37 |
dshulyak_ | right now we look at the full graph, but it can be optimized a bit, like selecting the children of the updated tasks and all parents of those children | 12:38 |
dshulyak_ | and perform scheduling only for that part | 12:38 |
pigmej | some nodes may respond "gtfo PW not satisfied", but some may *not know* that yet. Therefore it's starting to get messy there | 12:38 |
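For reference, the primary-quorum knobs being discussed map to the `pr`/`pw` options on fetch and store in the `riak` Python client; this is only a sketch of the primary-quorum setting dshulyak_ wants to test, and whether it is enough is exactly what the Aphyr post questions:

```python
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('graph_tasks')    # hypothetical bucket name
N_VAL = 3

# pr/pw require the *primary* vnodes to answer; plain r/w can be satisfied by
# fallback vnodes (sloppy quorum) during a partition, which is the messy case
# pigmej describes above.
obj = bucket.get('task-1', pr=N_VAL)
obj.data = dict(obj.data or {}, state='INPROGRESS')
obj.store(pw=N_VAL)
```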
pigmej | dshulyak_: because from what I understand, the only real problem is a missing INPROGRESS | 12:39 |
pigmej | which *may* lead to duplication, BUT we could fix this a bit | 12:39 |
dshulyak_ | thats somehow related to lock? | 12:39 |
pigmej | it's probably the same problem as with lock delete | 12:39 |
pigmej | it's not directly related to lock I hope :) BUT it's the same problem | 12:40 |
dshulyak_ | because wo lock there will be more problems | 12:40 |
pigmej | from what I see | 12:40 |
pigmej | yeah sure, | 12:40 |
dshulyak_ | i had another idea which is an alternative to the lock - perform consistent routing based on the hash of graph | 12:41 |
dshulyak_ | hash of graph id | 12:41 |
pigmej | what do you mean by consistent routing ? | 12:41 |
dshulyak_ | i mean - all scheduling for particular graph will be done in one thread | 12:42 |
pigmej | ah | 12:42 |
pigmej | but then there is a problem | 12:43 |
dshulyak_ | e.g. we have pool of 100 threads, and based on hash of graph uid we will reroute all requests to some thread | 12:43 |
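A sketch of the placement function behind that idea (hash the graph uid onto a fixed worker pool so all scheduling for one graph lands in one thread); this is only the placement, not the rerouting machinery:

```python
import hashlib

POOL_SIZE = 100   # the "pool of 100 threads" mentioned above

def worker_for(graph_uid):
    # Stable placement: the same graph uid always maps to the same slot, so
    # all scheduling for one graph is serialized within that one worker.
    digest = hashlib.sha1(graph_uid.encode('utf-8')).hexdigest()
    return int(digest, 16) % POOL_SIZE

print(worker_for('graph-42'))   # same slot on every call and in every process
```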
pigmej | because we will introduce some in-memory state | 12:43 |
pigmej | also, what if we will have more than one process ? | 12:43 |
dshulyak_ | similar to riak ring placement probably | 12:43 |
dshulyak_ | yeah | 12:43 |
pigmej | or more machines doing scheduling etc | 12:43 |
pigmej | it then starts to get complicated as hell | 12:43 |
dshulyak_ | thats true :) | 12:43 |
pigmej | and we will implement then our own cluster placement thingy, with our own bugs :D | 12:44 |
dshulyak_ | what u dont like about lock? u are talking about n_val=1 or the general idea? | 12:44 |
pigmej | and because exactly-once delivery is impossible (some client-side hacks can achieve something close to exactly-once delivery), but still | 12:44 |
pigmej | dshulyak_: about general idea | 12:44 |
dshulyak_ | with ensemble it seems quite natural thing | 12:44 |
pigmej | dshulyak_: well, sure, the lock is let's say "ok" | 12:45 |
pigmej | except that we're hammering the DB with the sleep/retry approach, but that could be optimized too | 12:45 |
pigmej | BUT I think our real problem is graph update | 12:45 |
pigmej | which currently is solved by this lock, BUT it may be not enough | 12:45 |
pigmej | I'm starting to think that we should move LogItem.state out of LogItem and, let's say, put it into a separate k/v place with different logic | 12:47 |
pigmej | on n_val=1 it will not matter, but for a bigger cluster we could use a strongly consistent bucket for the state | 12:47 |
pigmej | THEN we will not need locks, at all | 12:47 |
pigmej | because each item will behave like a lock: "if inprogress exists" => you can't set another 'inprogress' | 12:48 |
pigmej | so we will /just/ need to add some pre-state, which will work kinda like lock.acquire | 12:49 |
pigmej | dshulyak_: If I'm saying bullshit feel free to say it :) | 12:49 |
dshulyak_ | what is pre-state? | 12:49 |
pigmej | 'pre-inprogress' | 12:50 |
*** openstackgerrit has quit IRC | 12:50 | |
pigmej | it would mean that something started this item, to prevent other switches from PENDING => INPROGRESS | 12:50 |
*** openstackgerrit has joined #openstack-solar | 12:50 | |
pigmej | maybe it could be even INPROGRESS directly, without 'pre-inprogress' stuff | 12:51 |
pigmej | but then we will *not* need the lock in the form we have now, will we? | 12:51 |
dshulyak_ | if we always see an error on write, and send tasks for execution only after a successful save - then afaiu we won't need locking | 12:54 |
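A hedged sketch of the "each state write behaves like a lock" idea: keep the state key in a strongly consistent bucket type and create it conditionally, so a second INPROGRESS write fails instead of silently winning. Bucket-type and bucket names are illustrative, and the exact error raised on conflict depends on the client and Riak versions:

```python
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

# Assumes a bucket type created with {"consistent": true} on the cluster side.
states = client.bucket_type('strongly_consistent').bucket('task_states')

def claim(task_id):
    obj = states.new(task_id, data={'state': 'INPROGRESS'})
    try:
        # if_none_match: only store if the key does not exist yet.
        obj.store(if_none_match=True)
        return True          # we own the PENDING -> INPROGRESS transition
    except riak.RiakError:
        return False         # someone else already claimed this task

if claim('task-1'):
    pass  # safe to send the task for execution
```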
pigmej | and according to our recent findings and tests we can assume that | 12:54 |
pigmej | because | 12:54 |
pigmej | n_val = 1 => it works | 12:54 |
pigmej | strong consistent bucket => it works | 12:54 |
pigmej | sql DB => same as strong consistent bucket | 12:54 |
pigmej | because any PK constraint will give us that. | 12:55 |
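The SQL side of the same claim: a primary-key (or unique) constraint on the state row turns the second INPROGRESS insert into an error. A minimal sketch with stdlib sqlite3 (Solar's SQL path goes through peewee, per the "riak or peewee" commit above, so this only shows the shape of the idea):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE task_state (task_id TEXT PRIMARY KEY, state TEXT)')

def claim(task_id):
    try:
        # The PK constraint makes this an atomic "create if absent".
        conn.execute('INSERT INTO task_state VALUES (?, ?)', (task_id, 'INPROGRESS'))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False    # somebody already moved this task to INPROGRESS

print(claim('task-1'))  # True
print(claim('task-1'))  # False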
dshulyak_ | yeah, but i noticed it is quite hard to handle updates with sql properly :) | 12:55 |
pigmej | dshulyak_: that "executions only after succesfull save - then afaiu we wont need | 12:55 |
pigmej | locking" is why I wanted to have this "pre-state" | 12:55 |
pigmej | yeah, it is. | 12:56 |
pigmej | but if you don't do updates... :) | 12:56 |
dshulyak_ | and for strong consistent bucket - we are using 2i in graph and history | 12:56 |
pigmej | yeah BUT | 12:56 |
dshulyak_ | so maybe we will have to split data somehow | 12:56 |
pigmej | i just meant to remove 'state' from this bucket :) | 12:57 |
dshulyak_ | ah ok | 12:57 |
pigmej | and to keep state in strong consistent bucket | 12:57 |
pigmej | (maybe with other things like child, etc) | 12:57 |
dshulyak_ | so it sounds like split to me | 12:57 |
pigmej | yeah | 12:58 |
pigmej | does it work for you dshulyak_ ? (at least in head) | 12:59 |
dshulyak_ | yes, it makes sense | 12:59 |
pigmej | we will get rid of current locking implementation | 13:00 |
dshulyak_ | actually we will get same behaviour as with lock | 13:00 |
pigmej | and we will introduce smaller kinda locks | 13:00 |
pigmej | yup | 13:00 |
pigmej | but not for whole graph | 13:00 |
pigmej | we will get better states though, and it should also be a bit faster, because 2 different paths will not fight for the same lock | 13:01 |
pigmej | though workers could fight for the same task | 13:02 |
pigmej | BUT it's then a matter of the scheduler to take care of this situation and not create it too often | 13:03 |
pigmej | but after this there is the A => C and B => C situation | 13:04 |
dshulyak_ | i think both A and B should be written in A and in B, then B will be able to notice collision and schedule C properly | 13:08 |
pigmej | Yeah, I think the same | 13:09 |
pigmej | or during graph building we can count how many children we need | 13:09 |
pigmej | and if reached => schedule C | 13:09 |
salmon_ | what I can say now is that all examples are working. I'm now running openstack, which is the only one left | 13:18 |
pigmej | salmon_: because we reverted locks, and gevent for celery | 13:20 |
pigmej | so it's no surprise :D | 13:20 |
salmon_ | pigmej: I know ;) | 13:20 |
salmon_ | just reporting | 13:20 |
pigmej | :) | 13:22 |
pigmej | dshulyak_: I have question about that Counter | 16:37 |
dshulyak_ | ok, i am around | 16:38 |
pigmej | dshulyak_: do we need it /just/ growing, or can it not contain gaps? | 16:40 |
pigmej | 1,2,3 or 1,3,150 ? | 16:40 |
dshulyak_ | just growing | 16:41 |
pigmej | k | 16:42 |
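Since the counter only has to grow and gaps are allowed, one possible (purely illustrative) approach is block allocation: reserve a whole block of values with a single shared update and hand them out locally, so a crash merely leaves an unused gap. The `fetch_and_add` hook standing in for the storage side is a hypothetical assumption here, not Solar's counter implementation:

```python
import itertools

BLOCK_SIZE = 100

class BlockCounter(object):
    """Gap-tolerant, growing counter: one shared update per BLOCK_SIZE ids."""

    def __init__(self, fetch_and_add):
        # fetch_and_add(n) atomically advances the shared counter by n and
        # returns the previous value (the storage-specific part is assumed).
        self._fetch_and_add = fetch_and_add
        self._local = iter(())

    def next(self):
        try:
            return next(self._local)
        except StopIteration:
            start = self._fetch_and_add(BLOCK_SIZE)
            self._local = iter(range(start, start + BLOCK_SIZE))
            return next(self._local)

# Toy in-memory stand-in for the shared storage side.
_shared = itertools.count(0, BLOCK_SIZE)
counter = BlockCounter(lambda n: next(_shared))
print(counter.next(), counter.next())   # 0 1
```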
*** tzn has quit IRC | 17:12 | |
*** dshulyak_ has quit IRC | 17:28 | |
*** dshulyak_ has joined #openstack-solar | 17:42 | |
*** dshulyak_ has quit IRC | 17:43 | |
*** dshulyak_ has joined #openstack-solar | 18:48 | |
*** dshulyak_ has quit IRC | 18:49 | |
*** dshulyak_ has joined #openstack-solar | 18:55 | |
*** dshulyak_ has quit IRC | 19:04 | |
*** tzn has joined #openstack-solar | 19:07 | |
*** tzn has quit IRC | 19:13 | |
*** tzn has joined #openstack-solar | 19:28 | |
*** dshulyak_ has joined #openstack-solar | 19:48 | |
*** tzn has quit IRC | 20:18 | |
*** dshulyak_ has quit IRC | 20:53 | |
*** dshulyak_ has joined #openstack-solar | 20:57 | |
openstackgerrit | Lukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now https://review.openstack.org/268331 | 21:03 |
*** dshulyak_ has quit IRC | 21:19 |