*** tzn has joined #openstack-solar | 00:26 | |
*** tzn has quit IRC | 01:01 | |
*** tzn has joined #openstack-solar | 02:58 | |
*** tzn has quit IRC | 03:03 | |
*** tzn has joined #openstack-solar | 03:59 | |
*** tzn has quit IRC | 04:04 | |
*** tzn has joined #openstack-solar | 05:00 | |
*** tzn has quit IRC | 05:04 | |
*** tzn has joined #openstack-solar | 06:00 | |
*** tzn has quit IRC | 06:05 | |
*** tzn has joined #openstack-solar | 07:01 | |
*** tzn has quit IRC | 07:06 | |
*** dshulyak_ has joined #openstack-solar | 07:27 | |
*** tzn has joined #openstack-solar | 08:02 | |
*** tzn has quit IRC | 08:07 | |
*** salmon_ has joined #openstack-solar | 08:27 | |
pigmej | so | 08:45 |
pigmej | I don't know WTF but my riak on vagrant crashed on our examples dshulyak_ | 08:46 |
pigmej | BUT 2 other riaks survived | 08:46 |
dshulyak_ | hi | 08:46 |
dshulyak_ | maybe u are using riak from your example? | 08:46 |
pigmej | I'm going to dig into riak logs to check wtf, because it looks like something is wrong | 08:46 |
dshulyak_ | like there is some config file somewhere | 08:46 |
pigmej | dshulyak_: no, docker from our image | 08:46 |
pigmej | but I also noticed that sometimes ram usage on that machine is high | 08:47 |
pigmej | so /maybe/ riak behaves incorrectly / badly in some rare conditions | 08:47 |
pigmej | BUT I would say that we can ignore the errors I had, because 2 other riaks survived; adding yours + salmon_'s it's 4 vs 1 | 08:48 |
pigmej | so... ;) | 08:48 |
pigmej | dshulyak_: do you agree with that assumption? | 08:49 |
pigmej | because tbh, I have no other ideas | 08:49 |
pigmej | I also talked with one guy from basho, he also said that "it should work with n_val=1" | 08:49 |
dshulyak_ | yes, to me it looks like correct behaviour | 08:53 |
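For context, a minimal sketch of what pinning a bucket to n_val=1 looks like with the `riak` Python client (host, port and bucket name here are illustrative, not Solar's actual configuration):

```python
import riak

# Connect to a single-node Riak (connection details are illustrative).
client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

# With n_val=1 there is exactly one replica per key, so the
# "it should work with n_val=1" assumption means no divergence
# between replicas can occur for keys in this bucket.
bucket = client.bucket('solar_locks')        # hypothetical bucket name
bucket.set_properties({'n_val': 1})

print(bucket.get_properties()['n_val'])      # -> 1
```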
salmon_ | yo https://github.com/Mirantis/solar-resources/pull/3 | 08:54 |
*** tzn has joined #openstack-solar | 09:03 | |
pigmej | so dshulyak_ can you post PR with force=True ? | 09:06 |
dshulyak_ | i can, but it solves only 30% of the problems :) there is still no proper removal of the lock and the problem with the counter | 09:07 |
dshulyak_ | so maybe let's merge that patch with the concurrency limit | 09:07 |
dshulyak_ | https://review.openstack.org/#/c/267769/ | 09:07 |
pigmej | dshulyak_: removal? | 09:07 |
pigmej | that "B" thinks that A still locks it, because B saved that info ? | 09:08 |
dshulyak_ | yes, that case, when B reads lock, A deletes, B saves | 09:08 |
*** tzn has quit IRC | 09:08 | |
pigmej | k, | 09:08 |
pigmej | I will simplify my lock, because n_val works as needed it seems, then it should be fine... | 09:08 |
pigmej | so there is a chance that my code will not be removed ;D | 09:08 |
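A toy illustration of the interleaving dshulyak_ describes above ("B reads lock, A deletes, B saves"); a plain dict stands in for the lock bucket, no Riak involved:

```python
# Illustrative only: the stale write from B resurrects a lock that A deleted.
store = {'lock/graph-1': {'holder': 'A'}}

# B reads the lock and still sees A as the holder.
lock_seen_by_b = dict(store['lock/graph-1'])

# A finishes its work and deletes the lock.
del store['lock/graph-1']

# B writes back what it read (e.g. adding itself to a waiters list),
# so the store again claims A holds the lock even though A is long gone.
lock_seen_by_b.setdefault('waiters', []).append('B')
store['lock/graph-1'] = lock_seen_by_b

print(store['lock/graph-1'])   # {'holder': 'A', 'waiters': ['B']}
```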
dshulyak_ | i wonder how u are going to work with riak on your vagrant :) | 09:09 |
pigmej | I'm going to not do it | 09:09 |
pigmej | ;] | 09:09 |
dshulyak_ | sounds strange :) | 09:10 |
pigmej | I spawned the same docker image outside vagrant | 09:10 |
pigmej | and it works... | 09:10 |
pigmej | so... it's wtf, but ... it works... | 09:11 |
dshulyak_ | it really sounds like u have some weird config file, or your docker image is wrong | 09:11 |
dshulyak_ | on vagrant | 09:11 |
openstackgerrit | Merged openstack/solar: Set concurrency=1 for system log and scheduler queues https://review.openstack.org/267769 | 09:13 |
pigmej | dshulyak_: I checked it and it's our image | 09:14 |
pigmej | I would rather say some memory problems or sth | 09:14 |
pigmej | I just don't know | 09:14 |
pigmej | let's just ignore this fact | 09:15 |
*** tzn has joined #openstack-solar | 09:18 | |
salmon_ | pigmej: dshulyak_ https://review.openstack.org/#/c/267562 | 09:19 |
*** tzn has quit IRC | 09:20 | |
*** tzn has joined #openstack-solar | 09:21 | |
pigmej | salmon_: there you are :) | 09:21 |
pigmej | dshulyak_: https://review.openstack.org/#/c/267503/ :) | 09:22 |
openstackgerrit | Merged openstack/solar: Include ansible config when syncing repo https://review.openstack.org/267562 | 09:25 |
openstackgerrit | Merged openstack/solar: Conditional imports in locking (riak or peewee) https://review.openstack.org/267503 | 09:27 |
*** tzn has quit IRC | 09:50 | |
salmon_ | https://github.com/Mirantis/solar-resources/pull/4 | 10:01 |
salmon_ | pigmej: dshulyak_^ | 10:03 |
salmon_ | btw, it seems that the hook is working: https://readthedocs.org/projects/solar/builds/ | 10:04 |
pigmej | cool | 10:05 |
pigmej | dshulyak_: btw, why we exactly need this lock ? It's a lock for whole graph? | 10:22 |
dshulyak_ | yes, scheduling of a single graph shouldn't be concurrent; it is possible that some tasks won't be scheduled at all, or will be scheduled several times | 10:28 |
pigmej | k | 10:34 |
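To make the double-scheduling risk concrete, a toy sketch (not Solar code) of two scheduler passes racing over the same graph without a lock:

```python
# Two concurrent scheduling passes read the same PENDING task before either
# write lands, so the task is dispatched twice.
tasks = {'task-1': 'PENDING'}
dispatched = []

# Both passes snapshot the graph state "at the same time".
snapshot_a = dict(tasks)
snapshot_b = dict(tasks)

for worker, snapshot in (('scheduler-A', snapshot_a), ('scheduler-B', snapshot_b)):
    if snapshot['task-1'] == 'PENDING':
        dispatched.append((worker, 'task-1'))
        tasks['task-1'] = 'INPROGRESS'

print(dispatched)   # [('scheduler-A', 'task-1'), ('scheduler-B', 'task-1')]
```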
*** dshulyak_ has quit IRC | 10:57 | |
openstackgerrit | Lukasz Oles proposed openstack/solar: Add test for wordpress example https://review.openstack.org/266360 | 10:58 |
*** dshulyak_ has joined #openstack-solar | 10:58 | |
*** dshulyak_ has quit IRC | 11:01 | |
*** dshulyak_ has joined #openstack-solar | 11:02 | |
pigmej | ehs, I have bad ideas today :( | 11:18 |
*** tzn has joined #openstack-solar | 11:19 | |
salmon_ | dshulyak_: pigmej https://review.openstack.org/#/c/266360 plz :) | 11:36 |
pigmej | salmon_: I messed up with my env, I cannot check it :( | 11:36 |
salmon_ | just +1 :P | 11:36 |
pigmej | ehs | 11:36 |
pigmej | ;P | 11:37 |
pigmej | there you are salmon_ | 11:37 |
salmon_ | pigmej: and this https://review.openstack.org/#/c/267761/ ;) | 11:37 |
pigmej | done salmon_ | 11:38 |
pigmej | dshulyak_: https://review.openstack.org/#/c/267453/ can you +1 ? | 11:38 |
dshulyak_ | sure, but os-infra will merge it anyway | 11:39 |
pigmej | yeah I know | 11:39 |
pigmej | hmm dshulyak_ I think we have a problem, I may be wrong, BUT how is a graph task saved? | 12:23 |
pigmej | isn't it db.read(), modify, db.save() ? | 12:23 |
dshulyak_ | yes, something like that | 12:24 |
dshulyak_ | :) | 12:24 |
pigmej | so, on a normal full-sized riak, there is a chance that we will save an "old" value in place of the new one, isn't there? | 12:25 |
dshulyak_ | if we have a lock or execution is simply sequential then i don't see how it is possible | 12:27 |
dshulyak_ | but i guess there might be a problem if one of replicas will go down | 12:28 |
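A toy illustration of the lost update pigmej is worried about with the read / modify / save pattern (again a dict stands in for the bucket):

```python
# Two writers read the same graph task, modify different fields, and the
# second save() silently overwrites the first one's change.
db = {'task-1': {'status': 'PENDING', 'errmsg': None}}

copy_a = dict(db['task-1'])     # writer A: db.read()
copy_b = dict(db['task-1'])     # writer B: db.read()

copy_a['status'] = 'SUCCESS'    # writer A: modify
db['task-1'] = copy_a           # writer A: db.save()

copy_b['errmsg'] = 'retrying'   # writer B: modify (based on the stale read)
db['task-1'] = copy_b           # writer B: db.save() -> A's SUCCESS is lost

print(db['task-1'])             # {'status': 'PENDING', 'errmsg': 'retrying'}
```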
pigmej | yup, | 12:29 |
pigmej | also, when n_val != 1, and nodes != 1 there are some different stories | 12:29 |
pigmej | in theory it's easy to write a resolver for that, but I'm not sure if it's the right solution | 12:29 |
pigmej | because it's obvious that when INPROGRESS conflicts with SUCCESS or ERROR, the final state should be SUCCESS or ERROR | 12:30 |
dshulyak_ | yeah, i had same idea | 12:30 |
pigmej | and everything should work fine, because /something/ which set SUCCESS / ERROR is already aware of this situation | 12:31 |
pigmej | and it will already assume SUCCESS / ERROR there, not INPROGRESS | 12:31 |
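A minimal sketch of that resolver idea using the `riak` client's custom-resolver hook; the bucket name and state field are illustrative, the precedence is the one pigmej describes (SUCCESS/ERROR beats INPROGRESS):

```python
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('graph_tasks')            # hypothetical bucket name

# Final states always win over transient ones when siblings conflict.
STATE_RANK = {'PENDING': 0, 'INPROGRESS': 1, 'ERROR': 2, 'SUCCESS': 2}

def state_resolver(riak_object):
    # The client calls this when a fetch returns siblings; keep exactly one.
    riak_object.siblings = [max(
        riak_object.siblings,
        key=lambda sibling: STATE_RANK.get((sibling.data or {}).get('state'), 0))]

bucket.resolver = state_resolver
```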
dshulyak_ | but there might be a problem if we lose an INPROGRESS update and schedule one task several times | 12:32 |
dshulyak_ | i would like to test pr=n_val behaviour and disable sloppy quorum somehow | 12:32 |
pigmej | https://aphyr.com/posts/285-call-me-maybe-riak + "Strict Quorum" | 12:33 |
pigmej | (but it's a bit of an unfortunate example) | 12:34 |
pigmej | the more I play with this lock the more I dislike how we designed that part :( | 12:35 |
dshulyak_ | i remember that, but it is quite old already - 2013, maybe smth changed :) | 12:35 |
pigmej | not at this area | 12:35 |
pigmej | because it's how stuff works | 12:35 |
pigmej | :) | 12:35 |
pigmej | btw, scheduler always looks at full graph, right? | 12:36 |
dshulyak_ | what's the difference then between pr/pw and r/w? | 12:36 |
pigmej | primary vnode can have multiple fallback vnodes | 12:37 |
dshulyak_ | but when i do r/w i will still go to the primary first | 12:37 |
pigmej | before riak realizes this situation, it may succeed in writing to the primary vnode, and this primary vnode can mess up other values | 12:37 |
dshulyak_ | right now we look at the full graph, but it can be optimized a bit, like selecting the children of the updated tasks and all parents of those children | 12:38 |
dshulyak_ | and perform scheduling only for that part | 12:38 |
pigmej | some nodes may respond "gtfo PW not satisfied", but some may *not know* that yet. Therefore it's starting to get messy there | 12:38 |
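For reference, the primary-quorum knobs being discussed map to the `pr`/`pw` options on fetch and store in the `riak` Python client; this is only a sketch of the primary-quorum setting dshulyak_ wants to test, and whether it is enough is exactly what the Aphyr post questions:

```python
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('graph_tasks')    # hypothetical bucket name
N_VAL = 3

# pr/pw require the *primary* vnodes to answer; plain r/w can be satisfied by
# fallback vnodes (sloppy quorum) during a partition, which is the messy case
# pigmej describes above.
obj = bucket.get('task-1', pr=N_VAL)
obj.data = dict(obj.data or {}, state='INPROGRESS')
obj.store(pw=N_VAL)
```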
pigmej | dshulyak_: because from what I understand, the only real problem is a missing INPROGRESS | 12:39 |
pigmej | which *may* lead to duplication, BUT we could fix this a bit | 12:39 |
dshulyak_ | thats somehow related to lock? | 12:39 |
pigmej | it's probably the same problem as with lock delete | 12:39 |
pigmej | it's not directly related to lock I hope :) BUT it's the same problem | 12:40 |
dshulyak_ | because wo lock there will be more problems | 12:40 |
pigmej | from what I see | 12:40 |
pigmej | yeah sure, | 12:40 |
dshulyak_ | i had another idea which is an alternative to the lock - perform consistent routing based on the hash of graph | 12:41 |
dshulyak_ | hash of graph id | 12:41 |
pigmej | what do you mean by consistent routing ? | 12:41 |
dshulyak_ | i mean - all scheduling for particular graph will be done in one thread | 12:42 |
pigmej | ah | 12:42 |
pigmej | but then there is a problem | 12:43 |
dshulyak_ | e.g. we have pool of 100 threads, and based on hash of graph uid we will reroute all requests to some thread | 12:43 |
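A sketch of the placement function behind that idea (hash the graph uid onto a fixed worker pool so all scheduling for one graph lands in one thread); this is only the placement, not the rerouting machinery:

```python
import hashlib

POOL_SIZE = 100   # the "pool of 100 threads" mentioned above

def worker_for(graph_uid):
    # Stable placement: the same graph uid always maps to the same slot, so
    # all scheduling for one graph is serialized within that one worker.
    digest = hashlib.sha1(graph_uid.encode('utf-8')).hexdigest()
    return int(digest, 16) % POOL_SIZE

print(worker_for('graph-42'))   # same slot on every call and in every process
```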
pigmej | because we will introduce some in-memory state | 12:43 |
pigmej | also, what if we will have more than one process ? | 12:43 |
dshulyak_ | similar to riak ring placement probably | 12:43 |
dshulyak_ | yeah | 12:43 |
pigmej | or more machines doing scheduling etc | 12:43 |
pigmej | it then starts to get complicated as hell | 12:43 |
dshulyak_ | thats true :) | 12:43 |
pigmej | and we will implement then our own cluster placement thingy, with our own bugs :D | 12:44 |
dshulyak_ | what u dont like about lock? u are talking about n_val=1 or the general idea? | 12:44 |
pigmej | and because exactly-once delivery is impossible (some client-side hacks can achieve something close to exactly-once delivery), but still | 12:44 |
pigmej | dshulyak_: about general idea | 12:44 |
dshulyak_ | with ensemble it seems quite natural thing | 12:44 |
pigmej | dshulyak_: well, sure, the lock is let's say "ok" | 12:45 |
pigmej | except that we're hammering the DB with the sleep/retry approach, but that could be optimized too | 12:45 |
pigmej | BUT I think our real problem is graph update | 12:45 |
pigmej | which currently is solved by this lock, BUT it may be not enough | 12:45 |
pigmej | I'm starting to think that we should move LogItem.state out of LogItem and, let's say, put it into a separate k/v place with different logic | 12:47 |
pigmej | on n_val=1 it will not matter, but for a bigger cluster we could use a strongly consistent bucket for the state | 12:47 |
pigmej | THEN we will not need locks, at all | 12:47 |
pigmej | because each item will behave like a lock: "if inprogress exists" => you can't set another 'inprogress' | 12:48 |
pigmej | so we will /just/ need to add some pre-state, which will work kinda like lock.acquire | 12:49 |
pigmej | dshulyak_: If I'm saying bullshit feel free to say it :) | 12:49 |
dshulyak_ | what is pre-state? | 12:49 |
pigmej | 'pre-inprogress' | 12:50 |
*** openstackgerrit has quit IRC | 12:50 | |
pigmej | it would mean that something started this item, to prevent other switches from PENDING => INPROGRESS | 12:50 |
*** openstackgerrit has joined #openstack-solar | 12:50 | |
pigmej | maybe it could be even INPROGRESS directly, without 'pre-inprogress' stuff | 12:51 |
pigmej | but then we will *not* need the lock in the form we have now, will we? | 12:51 |
dshulyak_ | if we always see an error on write, and send tasks for execution only after a successful save - then afaiu we won't need locking | 12:54 |
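A hedged sketch of the "each state write behaves like a lock" idea: keep the state key in a strongly consistent bucket type and create it conditionally, so a second INPROGRESS write fails instead of silently winning. Bucket-type and bucket names are illustrative, and the exact error raised on conflict depends on the client and Riak versions:

```python
import riak

client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

# Assumes a bucket type created with {"consistent": true} on the cluster side.
states = client.bucket_type('strongly_consistent').bucket('task_states')

def claim(task_id):
    obj = states.new(task_id, data={'state': 'INPROGRESS'})
    try:
        # if_none_match: only store if the key does not exist yet.
        obj.store(if_none_match=True)
        return True          # we own the PENDING -> INPROGRESS transition
    except riak.RiakError:
        return False         # someone else already claimed this task

if claim('task-1'):
    pass  # safe to send the task for execution
```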
pigmej | and according to our recent findings and tests we can assume that | 12:54 |
pigmej | because | 12:54 |
pigmej | n_val = 1 => it works | 12:54 |
pigmej | strong consistent bucket => it works | 12:54 |
pigmej | sql DB => same as strong consistent bucket | 12:54 |
pigmej | because any PK constraint will give us that. | 12:55 |
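The SQL side of the same claim: a primary-key (or unique) constraint on the state row turns the second INPROGRESS insert into an error. A minimal sketch with stdlib sqlite3 (Solar's SQL path goes through peewee, per the "riak or peewee" commit above, so this only shows the shape of the idea):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE task_state (task_id TEXT PRIMARY KEY, state TEXT)')

def claim(task_id):
    try:
        # The PK constraint makes this an atomic "create if absent".
        conn.execute('INSERT INTO task_state VALUES (?, ?)', (task_id, 'INPROGRESS'))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False    # somebody already moved this task to INPROGRESS

print(claim('task-1'))  # True
print(claim('task-1'))  # False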
dshulyak_ | yeah, but i noticed it is quite hard to handle updates with sql properly :) | 12:55 |
pigmej | dshulyak_: that "executions only after succesfull save - then afaiu we wont need | 12:55 |
pigmej | locking" is why I wanted to have this "pre-state" | 12:55 |
pigmej | yeah, it is. | 12:56 |
pigmej | but if you don't do updates... :) | 12:56 |
dshulyak_ | and for strong consistent bucket - we are using 2i in graph and history | 12:56 |
pigmej | yeah BUT | 12:56 |
dshulyak_ | so maybe we will have to split data somehow | 12:56 |
pigmej | i just meant to remove 'state' from this bucket :) | 12:57 |
dshulyak_ | ah ok | 12:57 |
pigmej | and to keep state in strong consistent bucket | 12:57 |
pigmej | (maybe with other things like child, etc) | 12:57 |
dshulyak_ | so it sounds like split to me | 12:57 |
pigmej | yeah | 12:58 |
pigmej | does it work for you dshulyak_ ? (at least in head) | 12:59 |
dshulyak_ | yes, it makes sense | 12:59 |
pigmej | we will get rid of current locking implementation | 13:00 |
dshulyak_ | actually we will get same behaviour as with lock | 13:00 |
pigmej | and we will introduce smaller kinda locks | 13:00 |
pigmej | yup | 13:00 |
pigmej | but not for whole graph | 13:00 |
pigmej | we will get better states though, and it should also be a bit faster, because 2 different paths will not fight for the same lock | 13:01 |
pigmej | though workers could fight for the same task | 13:02 |
pigmej | BUT it's then a matter of the scheduler to take care of this situation and not create it too often | 13:03 |
pigmej | but after this there is the A => C and B => C situation | 13:04 |
dshulyak_ | i think both A and B should be written in A and in B, then B will be able to notice collision and schedule C properly | 13:08 |
pigmej | Yeah, I think the same | 13:09 |
pigmej | or during graph building we can count how many children we need | 13:09 |
pigmej | and if reached => schedule C | 13:09 |
salmon_ | what I can say now is that all examples are working. I'm now running openstack, which is the only one left | 13:18 |
pigmej | salmon_: because we reverted locks, and gevent for celery | 13:20 |
pigmej | so it's no surprise :D | 13:20 |
salmon_ | pigmej: I know ;) | 13:20 |
salmon_ | just reporting | 13:20 |
pigmej | :) | 13:22 |
pigmej | dshulyak_: I have question about that Counter | 16:37 |
dshulyak_ | ok, i am around | 16:38 |
pigmej | dshulyak_: do we need it /just/ growing, or can it not contain gaps? | 16:40 |
pigmej | 1,2,3 or 1,3,150 ? | 16:40 |
dshulyak_ | just growing | 16:41 |
pigmej | k | 16:42 |
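Since the counter only has to grow and gaps are allowed, one possible (purely illustrative) approach is block allocation: reserve a whole block of values with a single shared update and hand them out locally, so a crash merely leaves an unused gap. The `fetch_and_add` hook standing in for the storage side is a hypothetical assumption here, not Solar's counter implementation:

```python
import itertools

BLOCK_SIZE = 100

class BlockCounter(object):
    """Gap-tolerant, growing counter: one shared update per BLOCK_SIZE ids."""

    def __init__(self, fetch_and_add):
        # fetch_and_add(n) atomically advances the shared counter by n and
        # returns the previous value (the storage-specific part is assumed).
        self._fetch_and_add = fetch_and_add
        self._local = iter(())

    def next(self):
        try:
            return next(self._local)
        except StopIteration:
            start = self._fetch_and_add(BLOCK_SIZE)
            self._local = iter(range(start, start + BLOCK_SIZE))
            return next(self._local)

# Toy in-memory stand-in for the shared storage side.
_shared = itertools.count(0, BLOCK_SIZE)
counter = BlockCounter(lambda n: next(_shared))
print(counter.next(), counter.next())   # 0 1
```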
*** tzn has quit IRC | 17:12 | |
*** dshulyak_ has quit IRC | 17:28 | |
*** dshulyak_ has joined #openstack-solar | 17:42 | |
*** dshulyak_ has quit IRC | 17:43 | |
*** dshulyak_ has joined #openstack-solar | 18:48 | |
*** dshulyak_ has quit IRC | 18:49 | |
*** dshulyak_ has joined #openstack-solar | 18:55 | |
*** dshulyak_ has quit IRC | 19:04 | |
*** tzn has joined #openstack-solar | 19:07 | |
*** tzn has quit IRC | 19:13 | |
*** tzn has joined #openstack-solar | 19:28 | |
*** dshulyak_ has joined #openstack-solar | 19:48 | |
*** tzn has quit IRC | 20:18 | |
*** dshulyak_ has quit IRC | 20:53 | |
*** dshulyak_ has joined #openstack-solar | 20:57 | |
openstackgerrit | Lukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now https://review.openstack.org/268331 | 21:03 |
*** dshulyak_ has quit IRC | 21:19 |