Friday, 2016-01-15

*** tzn has joined #openstack-solar00:26
*** tzn has quit IRC01:01
*** tzn has joined #openstack-solar02:58
*** tzn has quit IRC03:03
*** tzn has joined #openstack-solar03:59
*** tzn has quit IRC04:04
*** tzn has joined #openstack-solar05:00
*** tzn has quit IRC05:04
*** tzn has joined #openstack-solar06:00
*** tzn has quit IRC06:05
*** tzn has joined #openstack-solar07:01
*** tzn has quit IRC07:06
*** dshulyak_ has joined #openstack-solar07:27
*** tzn has joined #openstack-solar08:02
*** tzn has quit IRC08:07
*** salmon_ has joined #openstack-solar08:27
pigmejI don't know WTF but my riak on vagrant crashed on our examples dshulyak_08:46
pigmejBUT 2 other riaks survived08:46
dshulyak_maybe u are using riak from your example?08:46
pigmejI'm going to dig into riak logs to check wtf, because it looks like something is wrong08:46
dshulyak_like there is some config file somewhere08:46
pigmejdshulyak_: no docker from our image08:46
pigmejbut I also noticed that sometimes ram usage on that machine is high08:47
pigmejso /maybe/ riak behaves incorrectly / badly in some rare conditions08:47
pigmejBUT I woulda say that we can ignore errors that I had, because 2 other riaks survived, adding your + salmon it's 4 vs 108:48
pigmejso... ;)08:48
pigmejdshulyak_: do you agree with that assumption?08:49
pigmejbecause tbh, I have no other ideas08:49
pigmejI also talked with one guy from basho, he also said that "it should work with n_val=1"08:49
dshulyak_yes, to me it looks like correct behaviour08:53
*** tzn has joined #openstack-solar09:03
pigmejso dshulyak_ can you post PR with force=True ?09:06
dshulyak_i can, but it solves only 30% of problems :) there is still no proper removal of lock and problem with counter09:07
dshulyak_so maybe lets merge that patch with concurrency limit09:07
pigmejdshulyak_: removal?09:07
pigmejthat "B" thinks that A still locks it, because B saved that info ?09:08
dshulyak_yes, that case, when B reads lock, A deletes, B saves09:08
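
[Editor's sketch: the "B reads lock, A deletes, B saves" case above, reproduced against a plain dict. The store and the helper names (read_lock, save_lock, delete_lock) are hypothetical stand-ins, not Solar or Riak APIs; the only assumption is that writes are blind, last-writer-wins.]

```python
# Hypothetical in-memory stand-in for a key/value store without
# compare-and-set. None of these names come from Solar.
store = {}


def read_lock(key):
    return store.get(key)


def save_lock(key, value):
    store[key] = value  # blind write: last writer wins


def delete_lock(key):
    store.pop(key, None)


# A holds the lock.
save_lock("graph-1", {"owner": "A"})

# B reads the lock and keeps a local copy.
seen_by_b = read_lock("graph-1")

# A releases the lock.
delete_lock("graph-1")

# B saves what it read earlier, which resurrects A's lock
# even though A has already released it.
save_lock("graph-1", seen_by_b)

print(read_lock("graph-1"))
```

[After the last step the store still says A owns the lock, so B, and everyone else, keeps waiting for a holder that no longer exists.]
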
*** tzn has quit IRC09:08
pigmejI will simplify my lock, because n_val works as needed it seems, then it should be fine...09:08
pigmejso there is a chance that my code will not be removed ;D09:08
dshulyak_i wonder how u are going to work with riak on your vagrant :)09:09
pigmejI'm going to not do it09:09
dshulyak_sounds strange :)09:10
pigmejI spawned the same docker image outside vagrant09:10
pigmejand it works...09:10
pigmejso... it's wtf, but ... it works...09:11
dshulyak_it really sounds like u have some weird config file, or your docker image is wrong09:11
dshulyak_on vagrant09:11
openstackgerritMerged openstack/solar: Set concurrency=1 for system log and scheduler queues
pigmejdshulyak_: I checked it and it's our image09:14
pigmejI would rather say some memory problems or sth09:14
pigmejI just don't know09:14
pigmejlet's just ignore this fact09:15
*** tzn has joined #openstack-solar09:18
salmon_pigmej: dshulyak_
*** tzn has quit IRC09:20
*** tzn has joined #openstack-solar09:21
pigmejsalmon_: there you are :)09:21
pigmejdshulyak_: :)09:22
openstackgerritMerged openstack/solar: Include ansible config when syncing repo
openstackgerritMerged openstack/solar: Conditional imports in locking (riak or peewee)
*** tzn has quit IRC09:50
salmon_pigmej: dshulyak_^10:03
salmon_btw, it seems that  the hook is working:
pigmejdshulyak_: btw, why we exactly need this lock ? It's a lock for whole graph?10:22
dshulyak_yes, scheduling of single graph shouldnt be concurrent, it is possible that some tasks wont be scheduled at all, or will be scheduled several times10:28
*** dshulyak_ has quit IRC10:57
openstackgerritLukasz Oles proposed openstack/solar: Add test for wordpress example
*** dshulyak_ has joined #openstack-solar10:58
*** dshulyak_ has quit IRC11:01
*** dshulyak_ has joined #openstack-solar11:02
pigmejehs, I have bad ideas today :(11:18
*** tzn has joined #openstack-solar11:19
salmon_dshulyak_: pigmej plz :)11:36
pigmejsalmon_: I messed up with my env, I cannot check it :(11:36
salmon_just +1 :P11:36
pigmejthere you are salmon_11:37
salmon_pigmej: and this ;)11:37
pigmejdone salmon_11:38
pigmejdshulyak_: can you +1 ?11:38
dshulyak_sure, but os-infra will merge it anyway11:39
pigmejyeah I know11:39
pigmejhmm dshulyak_ I think we have problem, I may be wrong, BUT how is graph task saved ?12:23
pigmejisn't it, modify, ?12:23
dshulyak_yes, something like that12:24
pigmejso, on normal full sized riak, there is a chance that we will save "old" value in place of new, isn't it?12:25
dshulyak_if we have a lock or execution is simply sequential then i dont see how it is possible12:27
dshulyak_but i guess there might be a problem if one of replicas will go down12:28
pigmejalso, when n_val != 1, and nodes != 1 there are some different stories12:29
pigmejin theory it's easy to write resolver for that, but I'm not sure if it's right solution12:29
pigmejbecause it's obvious that when INPROGRESS conflicts with SUCCESS or ERROR final state should be SUCCESS or ERROR12:30
dshulyak_yeah, i had same idea12:30
pigmejand everything should work fine, because /something/ which set SUCCESS / ERROR is already aware of this situation12:31
pigmejand it will already assume SUCCESS / ERROR there not inprogress12:31
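
[Editor's sketch: the resolver idea above, where a final state beats INPROGRESS when sibling values conflict. The function name and the plain-string states are illustrative; the ERROR-over-SUCCESS tie-break is an assumption, since the chat does not cover that case. It also does nothing about an INPROGRESS write that is lost entirely.]

```python
FINAL_STATES = {"SUCCESS", "ERROR"}


def resolve_state(siblings):
    """Pick one state from a list of conflicting sibling values."""
    finals = [s for s in siblings if s in FINAL_STATES]
    if finals:
        # A final state always wins over INPROGRESS / PENDING.
        # Assumption: if SUCCESS and ERROR conflict, prefer ERROR.
        return "ERROR" if "ERROR" in finals else "SUCCESS"
    if "INPROGRESS" in siblings:
        return "INPROGRESS"
    return "PENDING"


print(resolve_state(["INPROGRESS", "SUCCESS"]))
print(resolve_state(["INPROGRESS", "ERROR"]))
```
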
dshulyak_but there might be a problem if we will lose INPROGRESS update, and schedule one task several times12:32
dshulyak_i would like to test pw=n_val behaviour and disable sloppy quorum somehow12:32
dshulyak_pr i mean12:32
pigmej + "Strict Quorum"12:33
pigmej(but it's a bit unfortunate example)12:34
pigmejthe more I play with this lock the more I dislike how we designed that part :(12:35
dshulyak_i remember that, but it is quite old already - 2013, maybe smth changed :)12:35
pigmejnot at this area12:35
pigmejbecause it's how stuff works12:35
pigmejbtw, scheduler always looks at full graph, right?12:36
dshulyak_what the difference then between pr/pw and r/w12:36
pigmejprimary vnode can have multiple fallback vnodes12:37
dshulyak_but when i do r/w i will still go to first at primary12:37
pigmejbefore riak will realize this situation, it may succeed to write to primary vnode, and this primary vnode can mess up with other values12:37
dshulyak_right now we look at full graph, but it can be optimized a bit, like select childs of updated and all parents of those childs12:38
dshulyak_and perform scheduling only for that part12:38
pigmejsome nodes may respond "gtfo PW not satisfied", but some may yet *not know* that. Therefore it starts to get messy there12:38
pigmejdshulyak_: because from what I understand, the only real problem is missing INPROGRESS12:39
pigmejwhich *may* lead to duplication, BUT, we could fix this a bit12:39
dshulyak_thats somehow related to lock?12:39
pigmejit's probably the same problem as with lock delete12:39
pigmejit's not directly related to lock I hope :) BUT it's the same problem12:40
dshulyak_because wo lock there will be more problems12:40
pigmejfrom what I see12:40
pigmejyeah sure,12:40
dshulyak_i had another idea which is alternative to lock - perform consistent routing based on the hash of graph12:41
dshulyak_hash of graph id12:41
pigmejwhat do you mean by consistent routing ?12:41
dshulyak_i mean - all scheduling for particular graph will be done in one thread12:42
pigmejbut then there is a problem12:43
dshulyak_e.g. we have pool of 100 threads, and based on hash of graph uid we will reroute all requests to some thread12:43
pigmejbecause we will introduce some inmemory state12:43
pigmejalso, what if we will have more than one process ?12:43
dshulyak_similar to riak ring placement probably12:43
pigmejor more machines doing scheduling etc12:43
pigmejit starts then to be complicated as hell12:43
dshulyak_thats true :)12:43
pigmejand we will implement then our own cluster placement thingy, with our own bugs :D12:44
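
[Editor's sketch: routing all scheduling for one graph to one thread by hashing the graph uid, as proposed above. GraphRouter is a hypothetical name. This is plain modulo hashing inside a single process, not a ring; it has exactly the limitation raised above, because the routing table lives in memory and says nothing about other processes or machines.]

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class GraphRouter:
    """Route work for a given graph uid to a fixed single-threaded slot."""

    def __init__(self, workers=4):
        # One single-threaded executor per slot, so work in a slot is serialized.
        self._slots = [ThreadPoolExecutor(max_workers=1) for _ in range(workers)]

    def slot_for(self, graph_uid):
        # hashlib instead of hash(): built-in str hashing is randomized per process.
        digest = hashlib.sha1(graph_uid.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._slots)

    def submit(self, graph_uid, fn, *args):
        return self._slots[self.slot_for(graph_uid)].submit(fn, *args)

    def shutdown(self):
        for slot in self._slots:
            slot.shutdown(wait=True)


router = GraphRouter(workers=4)
order = []
# Five scheduling calls for the same graph land in the same slot, in order.
futures = [router.submit("graph-42", order.append, i) for i in range(5)]
for future in futures:
    future.result()
router.shutdown()
print(order)
```
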
dshulyak_what u dont like about lock? u are talking about n_val=1 or the general idea?12:44
pigmejand because exactly-once-delivery is impossible (some client side hacks can achieve something close to exactly-once-delivery), but still12:44
pigmejdshulyak_: about general idea12:44
dshulyak_with ensemble it seems quite natural thing12:44
pigmejdshulyak_: well, sure, the lock is let's say "ok"12:45
pigmejexcept that we're hammering DB and sleep/retry approach but that could be optimized too12:45
pigmejBUT I think our real problem is graph update12:45
pigmejwhich currently is solved by this lock, BUT it may be not enough12:45
pigmejI'm starting to think that we should move LogItem.state from LogItem, and let's say put it into separate k/v place, with different logic12:47
pigmejon n_val=1 it will not matter, but for bigger cluster, we could use strong_consistent bucket for state12:47
pigmejTHEN we will not need locks, at all12:47
pigmejbecause each item will behave like lock, "if inprogress exists" => you can't set other 'inprogress'12:48
pigmejso we will /just/ need to add some pre-state, which will work kinda like lock.acquire12:49
pigmejdshulyak_: If I'm saying bullshit feel free to say it :)12:49
dshulyak_what is pre-state?12:49
*** openstackgerrit has quit IRC12:50
pigmejit would mean that something started this item, to prevent other switches from PENDING => INPROGRESS12:50
*** openstackgerrit has joined #openstack-solar12:50
pigmejmaybe it could be even INPROGRESS directly, without 'pre-inprogress' stuff12:51
pigmejbut then we will *not* need lock in form as we have now, isn't it?12:51
dshulyak_if we will always see error on write, and send tasks for executions only after successful save - then afaiu we wont need locking12:54
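
[Editor's sketch: the per-item state acting as its own lock, "if inprogress exists => you can't set other inprogress". StateStore, compare_and_set and try_start are hypothetical names; the in-process mutex only simulates the conditional write a strongly consistent bucket would have to provide. The task is sent for execution only when the state change succeeds.]

```python
import threading


class StateStore:
    """Hypothetical stand-in for a strongly consistent k/v bucket."""

    def __init__(self):
        self._guard = threading.Lock()
        self._states = {}

    def compare_and_set(self, key, expected, new):
        # Atomic "write only if the current value is what I expect".
        with self._guard:
            if self._states.get(key, "PENDING") != expected:
                return False
            self._states[key] = new
            return True


def try_start(store, task_id):
    """Return True only for the single caller that moves PENDING -> INPROGRESS."""
    return store.compare_and_set(task_id, "PENDING", "INPROGRESS")


store = StateStore()
print(try_start(store, "task-1"))  # first scheduler wins
print(try_start(store, "task-1"))  # second one sees INPROGRESS and backs off
```
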
pigmejand according to our recent findings and tests we can assume that12:54
pigmejn_val = 1 => it works12:54
pigmejstrong consistent bucket => it works12:54
pigmejsql DB => same as strong consistent bucket12:54
pigmejbecause any PK constraint will give us that.12:55
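
[Editor's sketch: how a primary key constraint gives the same guarantee in SQL, using sqlite3 from the standard library. The table and column names (task_claim, task_id, worker) are made up for illustration and are not Solar's schema. It only inserts and never updates.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_claim (task_id TEXT PRIMARY KEY, worker TEXT NOT NULL)"
)


def claim(conn, task_id, worker):
    """Try to claim a task; the PK constraint rejects a second claim."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO task_claim (task_id, worker) VALUES (?, ?)",
                (task_id, worker),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(claim(conn, "task-1", "A"))
print(claim(conn, "task-1", "B"))
```
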
dshulyak_yeah, but i noticed it is quite hard to handle updates with sql properly :)12:55
pigmejdshulyak_: that "executions only after successful save - then afaiu we wont need12:55
pigmej    locking" is why I wanted to have this "pre-state"12:55
pigmejyeah, it is.12:56
pigmejbut if you don't do updates... :)12:56
dshulyak_and for strong consistent bucket - we are using 2i in graph and history12:56
pigmejyeah BUT12:56
dshulyak_so maybe we will have to split data somehow12:56
pigmeji just meant to remove 'state" from this bucket :)12:57
dshulyak_ah ok12:57
pigmejand to keep state in strong consistent bucket12:57
pigmej(maybe with other things like child, etc)12:57
dshulyak_so it sounds like split to me12:57
pigmejdoes it work for you dshulyak_ ? (at least in head)12:59
dshulyak_yes, it makes sense12:59
pigmejwe will get rid of current locking implementation13:00
dshulyak_actually we will get same behaviour as with lock13:00
pigmejand we will introduce smaller kinda locks13:00
pigmejbut not for whole graph13:00
pigmejwe will get better states though, it should be also a bit faster, because 2 different paths will not fight for the same lock13:01
pigmejthough workers could fight for the same task13:02
pigmejBUT it's then matter of scheduler to take care about this situation and not create it too often13:03
pigmejbut after this there is A => C and B => C situation13:04
dshulyak_i think both A and B should be written in A and in B, then B will be able to notice collision and schedule C properly13:08
pigmejYeah, I think the same13:09
pigmejor during graph building we can count how many childs we need13:09
pigmejand if reached => schedule C13:09
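
[Editor's sketch: the "count how many childs we need, and if reached => schedule C" variant for the A => C and B => C case. Function names are hypothetical and the counters live in a plain dict; in a real store the decrement would itself have to be atomic, which this sketch does not show.]

```python
from collections import defaultdict


def build_parent_counts(edges):
    """From (parent, child) edges, count parents per child and list children per parent."""
    counts = defaultdict(int)
    children = defaultdict(list)
    for parent, child in edges:
        counts[child] += 1
        children[parent].append(child)
    return dict(counts), dict(children)


def on_task_done(task, remaining, children):
    """Decrement each child's counter; return the children that became ready."""
    ready = []
    for child in children.get(task, []):
        remaining[child] -= 1
        if remaining[child] == 0:
            ready.append(child)
    return ready


remaining, children = build_parent_counts([("A", "C"), ("B", "C")])
print(on_task_done("A", remaining, children))  # C still waits for B
print(on_task_done("B", remaining, children))  # now C can be scheduled
```
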
salmon_what I can say now is that all examples are working now. I'm now running openstack, which is the only one left13:18
pigmejsalmon_: because we reverted locks, and gevent for celery13:20
pigmejso it's not surprise :D13:20
salmon_pigmej: I know ;)13:20
salmon_just reporting13:20
pigmejdshulyak_: I have question about that Counter16:37
dshulyak_ok, i am around16:38
pigmejdshulyak_: we need it /just/ growing or it cannot contain gaps ?16:40
pigmej1,2,3 or 1,3,150 ?16:40
dshulyak_just growing16:41
*** tzn has quit IRC17:12
*** dshulyak_ has quit IRC17:28
*** dshulyak_ has joined #openstack-solar17:42
*** dshulyak_ has quit IRC17:43
*** dshulyak_ has joined #openstack-solar18:48
*** dshulyak_ has quit IRC18:49
*** dshulyak_ has joined #openstack-solar18:55
*** dshulyak_ has quit IRC19:04
*** tzn has joined #openstack-solar19:07
*** tzn has quit IRC19:13
*** tzn has joined #openstack-solar19:28
*** dshulyak_ has joined #openstack-solar19:48
*** tzn has quit IRC20:18
*** dshulyak_ has quit IRC20:53
*** dshulyak_ has joined #openstack-solar20:57
openstackgerritLukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now
*** dshulyak_ has quit IRC21:19
