*** dshulyak_ has joined #openstack-solar | 07:27 | |
*** salmon_ has joined #openstack-solar | 08:12 | |
pigmej | Hello | 08:37 |
---|---|---|
dshulyak_ | hi | 09:06 |
salmon_ | dshulyak_: pigmej https://github.com/Mirantis/solar-resources/pull/2 | 09:28 |
pigmej | wtf is 'echo 1' ? | 09:29 |
salmon_ | :) | 09:30 |
salmon_ | It can be empty if you want :) | 09:30 |
salmon_ | pigmej: updated | 09:35 |
pigmej | sounds good to me | 09:39 |
pigmej | dshulyak_: thanks for +1 :) I merged :) | 09:40 |
salmon_ | https://review.openstack.org/#/c/267193/ anyone? :) | 09:55 |
pigmej | I'm checking it right now | 09:56 |
pigmej | what is our merge policy right now? | 09:58 |
salmon_ | ? | 09:59 |
pigmej | Should I give you +2 and +1 Workflow there or we stick with +2 from one and +1 workflow from someone else ? | 10:00 |
salmon_ | I think, 2 reviews are better than one | 10:01 |
pigmej | yeah but speed-- | 10:01 |
salmon_ | quality++ | 10:01 |
pigmej | ~ | 10:02 |
dshulyak_ | https://github.com/openstack/fuel-plugin-contrail | 10:08 |
pigmej | dshulyak_: https://review.openstack.org/#/c/260082/ | 10:08 |
pigmej | random review from this project :D | 10:09 |
salmon_ | noop? Is it just Null test? | 10:09 |
pigmej | well the noop is wtf, but there is one extra gate | 10:09 |
pigmej | fuel-plugin.contrail.build | 10:09 |
salmon_ | I think noop is a way to go | 10:10 |
dshulyak_ | one more - https://review.openstack.org/#/c/265780/ | 10:11 |
dshulyak_ | only noop tests | 10:11 |
pigmej | ok this one is cool | 10:12 |
pigmej | ok, then I will try to create similar repo to it | 10:12 |
pigmej | obviously if someone knows how to do it properly... then feel free to take it;D | 10:14 |
salmon_ | dshulyak_: https://review.openstack.org/#/c/267193/ :) | 10:23 |
pigmej | salmon_: https://review.openstack.org/#/c/266255/ :P | 10:24 |
dshulyak_ | done | 10:25 |
salmon_ | thx | 10:25 |
openstackgerrit | Merged openstack/solar: Allow to modify computable inputs in Composer files https://review.openstack.org/267193 | 10:28 |
pigmej | https://review.openstack.org/267453 :) | 10:41 |
*** tzn has joined #openstack-solar | 10:54 | |
tzn | @salmon_ how is CI work going? | 10:59 |
salmon_ | tzn: we agreed with Sasha to do a test job where you can set which test to run. I'm working on a job efinition now | 11:00 |
salmon_ | *definition | 11:00 |
tzn | ok, cool | 11:01 |
tzn | any estimate? | 11:01 |
tzn | @pigmej do we have solar-resources on review? | 11:03 |
tzn | I mean - repo creation? | 11:03 |
pigmej | https://review.openstack.org/267453 | 11:04 |
salmon_ | tzn: they are very very very busy, so no. I will prepare review today but they need to check it | 11:04 |
dshulyak_ | should we split worker that runs tasks and scheduler? i split them, but i’m not sure if that's better | 11:11 |
dshulyak_ | because there always will be at least 2 processes | 11:11 |
pigmej | dshulyak_: we should | 11:12 |
pigmej | because then we will be able to create that "small" worker | 11:12 |
pigmej | isn't it? | 11:13 |
dshulyak_ | i dont see how splitting them will help in creation of small worker | 11:14 |
dshulyak_ | ok, it will be better if i will share my work and then we will talk | 11:14 |
dshulyak_ | tomorrow probably | 11:14 |
pigmej | yeah probably :) | 11:14 |
pigmej | k, because talking about something without code is... tricky :) | 11:14 |
pigmej | hmm, guys do we require gevent now? | 11:52 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Set ansible<2.0 in requirements (removed callbacks) https://review.openstack.org/267500 | 12:07 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Conditional imports in locking (riak or peewee) https://review.openstack.org/267503 | 12:18 |
pigmej | salmon_: `pip install solar` *almost* works | 12:19 |
salmon_ | yupi | 12:20 |
pigmej | you just need to have these 2 patches ;D | 12:21 |
pigmej | and obviously we need some password based examples | 12:22 |
pigmej | but that's other story :) | 12:22 |
pigmej | updated | 12:33 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Set ansible<2.0 in requirements https://review.openstack.org/267500 | 12:37 |
pigmej | salmon_: message extended :) | 12:37 |
salmon_ | +2ed | 12:39 |
pigmej | Ok, I was able to use solar without vagrant env :) | 12:39 |
pigmej | dshulyak_ tzn salmon_ :) | 12:39 |
pigmej | "achievement unlocked" | 12:40 |
pigmej | salmon_: it will also simplify fuel-devops (and it will add more speed to it, no docker magic required) | 12:40 |
openstackgerrit | Merged openstack/solar: Set ansible<2.0 in requirements https://review.openstack.org/267500 | 12:48 |
tzn | pigmej: +1 | 12:50 |
pigmej | tzn: is there any bot that reports launchpad bugs etc ? | 12:51 |
tzn | from IRC? | 12:51 |
tzn | or to IRC? | 12:51 |
pigmej | no, post bugs changes TO irc | 12:51 |
tzn | yes, there are plenty | 12:51 |
tzn | but I need some time to configure them | 12:52 |
tzn | I will talk to devops guys | 12:52 |
pigmej | k | 12:52 |
salmon_ | https://blueprints.launchpad.net/solar/+spec/cleanup-solar-resources for next release :) | 12:53 |
pigmej | salmon_: we should also undo versions probably | 12:54 |
salmon_ | why undo? | 12:54 |
pigmej | or maybe even now... because some resources are marked as 1.0 | 12:54 |
pigmej | which is ekhm... | 12:54 |
salmon_ | all are marked as 1.0.0 | 12:54 |
pigmej | not all :D | 12:55 |
pigmej | there are some 0.0.1 | 12:55 |
pigmej | salmon_: you should also add that it's mostly about Openstack resources | 12:56 |
pigmej | because, transport, ro_node etc are fine | 12:56 |
salmon_ | this is why I created this bp :) | 12:57 |
tzn | can you guys mark them as 0.1.0 | 12:59 |
pigmej | all versions ? | 13:04 |
salmon_ | dshulyak_: pigmej https://bpaste.net/show/f2b61ca779eb | 13:04 |
salmon_ | hosts example :( | 13:04 |
salmon_ | hosts_file2.run -> INPROGRESS | 13:04 |
salmon_ | hosts_file1.run -> SUCCESS | 13:04 |
pigmej | salmon_: hmm | 13:04 |
salmon_ | it hung | 13:05 |
pigmej | yeah because it crashed | 13:05 |
pigmej | dshulyak_: is it desired behaviour ? | 13:06 |
salmon_ | ah, yes https://bpaste.net/show/57bd70255dfb | 13:06 |
salmon_ | do we need retries here ? | 13:06 |
pigmej | salmon_: wait, what have you done? | 13:07 |
salmon_ | pigmej: hosts example | 13:07 |
pigmej | but how did you make object in conflict ? | 13:08 |
salmon_ | I just run the example... :P | 13:08 |
salmon_ | via fuel-devops | 13:08 |
salmon_ | clean env | 13:08 |
pigmej | hmm | 13:09 |
pigmej | you crashed history | 13:11 |
salmon_ | I did nothing! | 13:11 |
salmon_ | dshulyak_: pigmej I reproduced it again. Just run hosts example | 13:20 |
pigmej | dshulyak_: then it means that sadly riak lock is broken | 13:21 |
dshulyak_ | error from bpaste is not related to lock | 13:22 |
pigmej | the first is | 13:22 |
pigmej | the second is probably side effect | 13:22 |
salmon_ | full log https://bpaste.net/show/803b18d0e29f | 13:22 |
pigmej | the thing is, it works for me ;( | 13:23 |
pigmej | salmon_: can you wipe riak container and try again? | 13:23 |
salmon_ | wipe? | 13:23 |
pigmej | ah you spawn always on fresh env ? | 13:24 |
salmon_ | yup | 13:24 |
pigmej | salmon_: can you print siblings data there? | 13:24 |
salmon_ | command? | 13:25 |
pigmej | dshulyak_: maybe the reason is that counter ? | 13:25 |
pigmej | but hmm, you should reach resolver first... | 13:26 |
salmon_ | in the meantime you can +1 https://review.openstack.org/#/c/267558 | 13:29 |
pigmej | dshulyak_: you started to debug it ? | 13:31 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Include ansible config when syncing repo https://review.openstack.org/267562 | 13:32 |
dshulyak_ | not yet | 13:32 |
pigmej | ok, riak lock is broken | 13:35 |
pigmej | I'm able to crash it | 13:35 |
dshulyak_ | how? | 13:36 |
pigmej | use gevent worker | 13:36 |
pigmej | and create 10 hosts file example | 13:36 |
pigmej | then I switched to "ensemble" and it seems working | 13:37 |
pigmej | on sqlite it seems to be ok too | 13:37 |
pigmej | dshulyak_: with single riak and n_val=1, I have now broken history | 13:37 |
pigmej | https://bpaste.net/show/854fc91bbba9 | 13:37 |
pigmej | yup salmon_ I can reproduce | 13:39 |
pigmej | though I needed more hosts | 13:39 |
pigmej | but it's weird, because it looks like some things were done twice | 13:40 |
pigmej | salmon_: can you do solar o report last | 13:40 |
pigmej | ? | 13:40 |
dshulyak_ | i see, i think my collision resolution doesnt work properly | 13:41 |
pigmej | yeah something is wrong there | 13:41 |
pigmej | dshulyak_: ['{"status": "PENDING", "task_type": "solar_resource", "target": "314b40de7e918d2897b6b84fbe8b9baa", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node2.run", "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node1.run"], "execution": "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39", "errmsg": "", "name": "hosts_file1.run"}', '{"status": "PENDING", "task_t | 13:41 |
pigmej | "solar_resource", "target": "314b40de7e918d2897b6b84fbe8b9baa", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node2.run", "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39~node1.run"], "execution": "system_log:0d194126-2fa5-478b-88a2-a2dbf7638b39", "errmsg": "", "name": "hosts_file1.run"}'] | 13:41 |
dshulyak_ | after except SiblingsError: | 13:41 |
pigmej | oh crap | 13:41 |
pigmej | I have conflict where we have 2 different parents | 13:41 |
pigmej | or they are even the same... | 13:41 |
dshulyak_ | do you have similar line in your log - Race condition for lock with UID system_log:7f9e1785-c075-4fcb-a5c5-1a9e5d092fc8, among [u'140691354006064', u'140691354009424'] ? | 13:42 |
pigmej | vagrant@solar-dev:~$ grep -i 'race condition' /var/run/celery/celery1.log | 13:43 |
pigmej | vagrant@solar-dev:~$ | 13:43 |
pigmej | dshulyak_: these items are identical for me. so for me it looks like something started twice | 13:43 |
pigmej | salmon_: can you check it too please ? | 13:43 |
dshulyak_ | for example here it is not related to lock - https://bpaste.net/show/854fc91bbba9 | 13:44 |
*** tzn has quit IRC | 13:45 | |
pigmej | dshulyak_: well, I disagree probably | 13:45 |
pigmej | I have 2 exactly the same childs | 13:45 |
dshulyak_ | childs of what? | 13:46 |
pigmej | siblings | 13:46 |
dshulyak_ | which Model ? | 13:46 |
pigmej | https://bpaste.net/show/9d110bac7b72 | 13:46 |
pigmej | dshulyak_: history | 13:46 |
pigmej | or whatever we call it | 13:46 |
pigmej | dshulyak_: I never saw these errors before | 13:51 |
dshulyak_ | so how you reproduced it? just run 10 times hosts_file? | 13:54 |
pigmej | yeah try to do so, | 13:54 |
pigmej | now it crashed on standard example even | 13:54 |
pigmej | just like salmon_ has | 13:55 |
pigmej | dshulyak_: now I have exactly the same as salmon_ had, + race condition in logs | 13:57 |
dshulyak_ | pigmej: can you try with save(force=True) on L104 in locking ? | 13:58 |
pigmej | nothing will change | 14:00 |
pigmej | isn't it ? | 14:00 |
pigmej | ah | 14:00 |
pigmej | no, we raise error when nothing changes | 14:00 |
pigmej | hmm | 14:01 |
pigmej | dshulyak_: but look | 14:01 |
pigmej | there is "fuckup" | 14:02 |
pigmej | 2 siblings, A , B | 14:02 |
pigmej | B checks and notices A, B siblings, in the same time A checks and notices A, B | 14:02 |
dshulyak_ | i think it is the other way here | 14:03 |
pigmej | so, then you have this for loop, which will remove "me" from conflicts, right ? | 14:03 |
pigmej | so B removes B and A removes A | 14:03 |
dshulyak_ | not like this | 14:03 |
pigmej | and they save object with one sibling, but still conflicting | 14:03 |
dshulyak_ | they are not saving it :) | 14:03 |
dshulyak_ | there is no save(force=True) | 14:03 |
dshulyak_ | and i dont think that both A and B see a race | 14:03 |
dshulyak_ | only B | 14:03 |
dshulyak_ | and we can see it in log | 14:03 |
pigmej | well it would crash without force :) | 14:03 |
dshulyak_ | please try with force, i cant reproduce it | 14:04 |
pigmej | but force changes nothing there.... | 14:04 |
dshulyak_ | it changes | 14:04 |
dshulyak_ | it will be saved actually | 14:05 |
pigmej | https://github.com/openstack/solar/blob/master/solar/dblayer/model.py#L922 if no force is given then it would be exception there, isn't it ? | 14:05 |
dshulyak_ | ok, can you please try :) ? | 14:06 |
pigmej | yeah | 14:06 |
pigmej | doing | 14:06 |
dshulyak_ | there is clearly no loop in logs | 14:06 |
pigmej | the same | 14:07 |
dshulyak_ | https://bpaste.net/show/803b18d0e29f | 14:07 |
pigmej | or even worse, because for me history conflicted now | 14:07 |
pigmej | https://bpaste.net/show/ef135161332b | 14:07 |
pigmej | hmm, dshulyak_ I have question, | 14:08 |
pigmej | hmm, nvm | 14:08 |
pigmej | but I think that both workers thinks that they are only one | 14:08 |
pigmej | only one notices race condition though | 14:09 |
dshulyak_ | it looks to me that A acquired it, B sees this race and starts to wait, but when A released lock - B doesnt see it | 14:10 |
dshulyak_ | what is that ['{"count": -9}', '{"count": -9}'] ? | 14:10 |
pigmej | siblings content | 14:10 |
dshulyak_ | of what? | 14:10 |
pigmej | of conflicted object | 14:10 |
dshulyak_ | Counter ? | 14:11 |
pigmej | I added print just before raise | 14:11 |
pigmej | yup | 14:11 |
pigmej | dshulyak_: | 14:12 |
pigmej | https://bpaste.net/show/c870ad642962 | 14:12 |
dshulyak_ | for Lock thats normal, the problem is that B always thinks that lock acquired by A | 14:13 |
pigmej | how does A release it ? | 14:14 |
dshulyak_ | delete value in database | 14:14 |
pigmej | ah | 14:14 |
dshulyak_ | record | 14:14 |
pigmej | so if action is fast there will be conflict | 14:15 |
pigmej | maybe that's the case | 14:15 |
pigmej | because A will delete, but B will overwrite | 14:15 |
pigmej | I have idea how to improve it | 14:15 |
pigmej | crdt like structure | 14:15 |
pigmej | tuple with + or - identity | 14:15 |
pigmej | then in conflict resolution we can easily figure out wtf, and in lock too | 14:16 |
dshulyak_ | overwrite? | 14:16 |
dshulyak_ | release is here - [2016-01-14 15:06:59,173: DEBUG/MainProcess] Release lock system_log:581d20df-48ff-454f-8f67-0cdb920447b7 with 140359979031120 | 14:16 |
dshulyak_ | but then in B | 14:16 |
dshulyak_ | Found lock with UID system_log:581d20df-48ff-454f-8f67-0cdb920447b7, owned by 140359979031120, owner False | 14:16 |
pigmej | dshulyak_: and it's 30ms later than B saves object | 14:16 |
dshulyak_ | nope | 14:17 |
pigmej | it *could* be reordered in riak | 14:17 |
pigmej | [2016-01-14 15:06:59,120: DEBUG/MainProcess] Race condition for lock with UID system_log:581d20df-48ff-454f-8f67-0cdb920447b7, among [u'140359979031120', u'140359979031920'] | 14:17 |
pigmej | this is from B, right? | 14:17 |
pigmej | [2016-01-14 15:06:59,173: DEBUG/MainProcess] Release lock system_log:581d20df-48ff-454f-8f67-0cdb920447b7 with 140359979031120 | 14:17 |
pigmej | and this is from A | 14:17 |
dshulyak_ | ah, so it is possible that B saves object that was removed | 14:19 |
pigmej | yeah that's what I'm talking about | 14:19 |
pigmej | A removes lock, then B saves lock with A inside "because it was like that" | 14:19 |
pigmej | dshulyak_: I can try to fix it with crdt like thingy | 14:20 |
pigmej | then no delete, and we should be fine | 14:20 |
pigmej | works for you dshulyak_ ? | 14:21 |
pigmej | I mean can I ? :) | 14:21 |
dshulyak_ | sure) | 14:21 |
pigmej | with crdt like thingy, we will be safe, we may have slightly longer latency though | 14:21 |
pigmej | but we should be fine | 14:21 |
pigmej | I wonder why it worked for me before.... | 14:22 |
pigmej | salmon_: good finding :) | 14:23 |
pigmej | brb, I have to prepare chicken for lunch :( | 14:23 |
dshulyak_ | hm, but if object was deleted there should be sibling wo data | 14:24 |
dshulyak_ | pigmej: we should be able to see that collision, on second write, yes? | 14:33 |
*** tzn has joined #openstack-solar | 14:35 | |
pigmej | dshulyak_: BUT it was there before A deleted it | 14:37 |
pigmej | A & B reads, A saves, A deletes, B saves | 14:37 |
dshulyak_ | yes, but on B saves - there should be tombstone from A | 14:37 |
pigmej | which was resolved by our conflict resolution | 14:39 |
pigmej | :) | 14:39 |
dshulyak_ | yes seems so, but i cant reproduce :) i guess i have too slow environment for this | 14:40 |
pigmej | good that we have different cpus | 14:41 |
*** dshulyak_ has quit IRC | 15:00 | |
salmon_ | re | 15:14 |
salmon_ | pigmej: how can I help? | 15:14 |
pigmej | I'm improving lock | 15:15 |
pigmej | switch to sqlite :P | 15:15 |
pigmej | brb | 15:25 |
pigmej | lunch | 15:25 |
salmon_ | pigmej: yup, with sqlite it's ok, but seems to be slower | 16:01 |
pigmej | yeah sqlite is sometimes a bit slower than riak | 16:01 |
*** tzn has quit IRC | 16:39 | |
pigmej | ok new lock seems to be working... | 16:48 |
pigmej | I like when my room is full of paperwork :D | 16:48 |
*** dshulyak_ has joined #openstack-solar | 16:52 | |
pigmej | salmon_: https://review.openstack.org/#/c/266255/ | 17:06 |
pigmej | please review this | 17:06 |
*** tzn has joined #openstack-solar | 17:06 | |
*** tzn has quit IRC | 17:08 | |
salmon_ | pigmej: ok | 17:08 |
salmon_ | pigmej: in the meantime, new error: https://bpaste.net/show/89ef23df900f ;) | 17:08 |
pigmej | not to me :P | 17:08 |
salmon_ | dshulyak_: ^ ;) | 17:09 |
dshulyak_ | salmon_: do you see any errors in celery.log? | 17:14 |
pigmej | dshulyak_: I'm constantly getting conflicts on Counter object | 17:17 |
pigmej | no matter what I will do, I'm getting conflicts there | 17:17 |
pigmej | always on [2016-01-14 18:16:54,544: WARNING/MainProcess] ['{"count": -9}', '{"count": -9}'] | 17:17 |
pigmej | are you sure that everything is correct in that manner? | 17:17 |
salmon_ | dshulyak_: I deleted env already, recreating now | 17:18 |
openstackgerrit | Merged openstack/solar: Use stevedore for handlers https://review.openstack.org/266255 | 17:18 |
dshulyak_ | pigmej: well, it might be that gevent affected counter somehow, because that part wasnt concurrent previously | 17:20 |
pigmej | yeah it certainly is broken now | 17:22 |
dshulyak_ | looks like we need same logic for counter as for the lock, either resolve SiblingsError or use ensemble | 17:23 |
*** tzn has joined #openstack-solar | 17:25 | |
pigmej | dshulyak_: I described it already, you can't do counter in the same way | 17:26 |
dshulyak_ | pigmej: hm, why? | 17:27 |
pigmej | 1) with ensemble "it will work", with normal riak not at all | 17:27 |
dshulyak_ | isnt it just a matter of retry on error? | 17:27 |
dshulyak_ | on SiblingsError | 17:27 |
pigmej | nope, why would it be ? | 17:27 |
pigmej | ehs, you will be able to save the same object twice | 17:28 |
pigmej | and none of these could see conflict | 17:28 |
pigmej | I just checked and it seems that sadly I was right at very beginning, it's perfectly fine to save object twice, and none notices "siblings" | 17:28 |
dshulyak_ | but with n_val we will always see it | 17:28 |
pigmej | it seems not | 17:29 |
dshulyak_ | n_val=1 | 17:29 |
dshulyak_ | are u sure? | 17:29 |
pigmej | No, I need test more | 17:29 |
pigmej | :) | 17:29 |
pigmej | so there is still chance that we're not totally f** :) | 17:29 |
dshulyak_ | for me 2nd write is always able to see a conflict | 17:30 |
pigmej | for you lock also worked :) | 17:31 |
dshulyak_ | well, it still works :) | 17:31 |
pigmej | yeah... | 17:31 |
pigmej | that's why I want to run long tests to verify that :) | 17:31 |
pigmej | I mean that write behaviour ;) | 17:31 |
pigmej | if yes, then we need similar logic as for locks and we're safe | 17:32 |
pigmej | it will be still against all known good practices though :D | 17:32 |
dshulyak_ | u said once that we cannot use this types - http://docs.basho.com/riak/latest/dev/using/data-types/#Counters ? | 17:33 |
pigmej | yup | 17:33 |
pigmej | it's CRDT type | 17:33 |
dshulyak_ | increment operation looks much better | 17:33 |
pigmej | it's floating counter | 17:33 |
dshulyak_ | but maybe with n_val=1 :) | 17:33 |
pigmej | it still doesn't guarantee you that you will not see the same number twice | 17:33 |
pigmej | yeah, there is a chance IF n_val works as we want | 17:34 |
pigmej | then ... maybe :) | 17:34 |
pigmej | though it's CRDT... | 17:34 |
pigmej | I just found one case where lock is not working correctly ;/ | 17:35 |
dshulyak_ | when? | 17:39 |
dshulyak_ | or where?) | 17:39 |
pigmej | in my implementation :D | 17:39 |
pigmej | but I was unlucky | 17:40 |
pigmej | I hit the same `identity` after restart :D | 17:40 |
pigmej | hmm dshulyak_ how can I start celery in foreground now/ | 17:43 |
pigmej | ? | 17:43 |
dshulyak_ | celery worker -A …. | 17:43 |
pigmej | thx | 17:44 |
dshulyak_ | maybe wo pidfile | 17:44 |
pigmej | yeah and log :D | 17:44 |
pigmej | ehs | 17:49 |
pigmej | siblings | 17:49 |
pigmej | ['{"status": "PENDING", "task_type": "solar_resource", "target": "6053ea6868fb026b81af3637d4ec79e2", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node2.run", "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node1.run"], "execution": "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035", "errmsg": "", "name": "hosts_file1.run"}', '{"status": "PENDING", "task_type": | 17:49 |
pigmej | "solar_resource", "target": "6053ea6868fb026b81af3637d4ec79e2", "args": ["hosts_file1", "run"], "childs": [], "parents": ["system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node2.run", "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035~node1.run"], "execution": "system_log:f5d4820f-8f15-4e81-95d3-0cefd55da035", "errmsg": "", "name": "hosts_file1.run"}'] | 17:49 |
pigmej | and another one dshulyak_ | 17:53 |
pigmej | https://bpaste.net/show/77271b6156b6 | 17:53 |
pigmej | this one is bad, because it says SUCCESS vs INPROGRESS | 17:53 |
pigmej | I can obviously write conflict resolution functions for it, BUT it should not happen anyway | 17:54 |
pigmej | this conflict probably says all about n_val=1 | 17:54 |
pigmej | I'm even able to get [2016-01-14 18:56:50,243: WARNING/MainProcess] ['{"count": -1}', '{"count": -1}'] | 17:57 |
dshulyak_ | counter isnt protected by lock | 17:57 |
dshulyak_ | thats separate thing | 17:57 |
pigmej | I know | 17:57 |
dshulyak_ | but those two should be | 17:57 |
pigmej | what should be ? | 17:57 |
pigmej | it's the same identity process, isn't it? | 17:57 |
dshulyak_ | identity process? | 17:58 |
dshulyak_ | what do u mean? | 17:58 |
pigmej | that identity that you use to "check unique" thingy | 17:58 |
dshulyak_ | i mean that those 2 errors (tasks with multiple siblings) should be protected by lock | 17:58 |
pigmej | but it's the same task, just updated from inprogress to DONE | 17:59 |
pigmej | and there is a chance that it's done within the same worker, isn't it ? | 17:59 |
dshulyak_ | identity is not a process id | 18:00 |
dshulyak_ | i dont get your last point | 18:01 |
dshulyak_ | there is clearly an error somewhere, maybe lock was acquired by two threads because n_val=1 doesnt work like i wanted | 18:01 |
dshulyak_ | but what about worker identity? | 18:01 |
pigmej | nvm then, I thought identity == some worker gevent thingy | 18:01 |
pigmej | https://bpaste.net/show/ab6c28f54e49 | 18:02 |
dshulyak_ | it is id of gevent thread | 18:02 |
pigmej | (don't look at lines numbers, I tuned up stuff for performance) | 18:02 |
dshulyak_ | yeah, looks like n_val doesnt work like i want | 18:04 |
pigmej | then we're kinda fucked | 18:04 |
dshulyak_ | it is same code, or u changed something? | 18:04 |
pigmej | I make all tasks do nothing etc, so it's certainly not "the same code" | 18:05 |
dshulyak_ | well, we can enable old concurrency for scheduler | 18:05 |
dshulyak_ | with prefork=1 | 18:05 |
pigmej | but it's weird, it worked for me when I tested it | 18:05 |
pigmej | but now it's not | 18:06 |
pigmej | :( | 18:06 |
pigmej | dshulyak_: yeah for release it will be "maybe good idea" | 18:06 |
pigmej | but still, we need to solve it | 18:06 |
pigmej | maybe the right solution would be to drop n_val=1 support completely | 18:06 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Update path in tests https://review.openstack.org/267761 | 18:27 |
openstackgerrit | Dmitry Shulyak proposed openstack/solar: Set concurrency=1 for system log and scheduler queues https://review.openstack.org/267769 | 18:43 |
dshulyak_ | pigmej: oh wait, that log doesnt prove that n_val=1 doesnt work :) | 18:47 |
dshulyak_ | there is save without force | 18:47 |
pigmej | I changed it to force | 18:47 |
dshulyak_ | still the same? | 18:47 |
pigmej | as you wanted ;) | 18:47 |
pigmej | yup, I mean it happened with force | 18:47 |
pigmej | without force it didn't show those beautiful errors | 18:48 |
pigmej | BUT I'm tired I may make stupid mistakes now ;D | 18:48 |
pigmej | I'm checking again, crossing fingers that I was wrong :) | 18:48 |
dshulyak_ | :D | 18:48 |
dshulyak_ | i am a bit dissapointed in riak :) | 18:49 |
tzn | Guys, what about fault tolerance of state machine? | 18:49 |
tzn | When you start executing graph - what happens if something breaks | 18:50 |
pigmej | tzn: that's not a problem | 18:50 |
pigmej | dshulyak_: well, it works as desired | 18:50 |
tzn | ok | 18:50 |
tzn | any explanation? | 18:50 |
pigmej | state is saved in DB | 18:50 |
pigmej | each task state | 18:50 |
tzn | so also every step status? | 18:51 |
tzn | ok | 18:51 |
pigmej | yeah, though we have some problems about that part right now | 18:51 |
pigmej | ;D | 18:51 |
tzn | no in memory storing | 18:51 |
tzn | Yes, I figured that out ;) | 18:51 |
tzn | it just reminded me about this fault tolerance ;) | 18:51 |
salmon_ | tzn: as long db is working we can restore execution | 18:51 |
tzn | ok, so for fault tolerance we need riak | 18:52 |
pigmej | we could start it from any point | 18:52 |
tzn | at this stage | 18:52 |
pigmej | tzn: well, the easiest answer is "it depends" | 18:52 |
tzn | what if a task starts and has no chance to send confirmation/status to solar? | 18:52 |
pigmej | lukasz answer was correct, as long as DB has all needed info, everything is fine | 18:52 |
dshulyak_ | if something breaks in unexpected way - then we can miss a status update, and user will have to restart execution | 18:52 |
tzn | @pigmej as always in your case ;) | 18:52 |
salmon_ | pigmej: 'depends' is never the easiest answer :P | 18:53 |
pigmej | dshulyak_: but only from this broken task, | 18:53 |
tzn | yes, but assuming idempotency, that should be safe | 18:53 |
pigmej | salmon_: it's easiest for me :D | 18:53 |
salmon_ | pigmej: :D | 18:53 |
pigmej | tzn: i would keep salmon_ sentence "as long as DB has correct info we're safe" | 18:53 |
tzn | sure | 18:53 |
pigmej | it implies all backend features (riak vs sql) | 18:54 |
tzn | but this is not an answer from my perspective ;) | 18:54 |
pigmej | it also implies what dshulyak_ said :) | 18:54 |
salmon_ | assuming idempotency ") | 18:54 |
salmon_ | :) | 18:54 |
dshulyak_ | well anyways :) the truth is we can miss update if something breaks in unexpected way | 18:54 |
pigmej | dshulyak_: but assuming idempotency of tasks we're safe | 18:55 |
dshulyak_ | but not all tasks will be idempotent | 18:55 |
pigmej | and we can't miss `n-1` update | 18:55 |
pigmej | we can miss `n` update, but no `n-1` | 18:55 |
pigmej | (excluding backend fuckups) | 18:55 |
dshulyak_ | provisioning, removal of smth | 18:55 |
dshulyak_ | is not idempotent | 18:55 |
dshulyak_ | so it still may lead to error | 18:56 |
pigmej | removal is | 18:56 |
pigmej | if you want to remove something which is already removed then you just "pass" | 18:56 |
pigmej | dshulyak_: in theory we have 4 states | 18:56 |
dshulyak_ | well, what if you want to erase a node, but u cant ssh to the node? | 18:56 |
pigmej | PENDING, INPROGRESS, SUCCESS | ERROR | 18:56 |
dshulyak_ | how can u know if it is removed? | 18:56 |
tzn | well, it's about new ref architecture | 18:57 |
pigmej | if task is PENDING => it wasn't executed yet for sure. | 18:57 |
tzn | and orchestrating upgrades for example | 18:57 |
tzn | there is no way they will make tasks idempotent | 18:57 |
pigmej | and if it's success or error, it was for sure executed | 18:57 |
pigmej | INPROGRESS is tricky, but then you should probably check system state by hand if major problem was detected in the middle | 18:57 |
pigmej | dshulyak_: well, in that case I would lookup if machine card is registered on switch / port whatever, or if mac address / ip address mapping exists, if no I can assume it's removed | 18:58 |
dshulyak_ | hm, no :) | 18:59 |
dshulyak_ | it wont tell anything about state of machine | 18:59 |
pigmej | sure, | 19:00 |
dshulyak_ | i think the only way here is to mark such task an error | 19:00 |
pigmej | but keeping machine in detached state is not a big problem. It will not have network connectivity etc. So it will not mess with other systems | 19:00 |
pigmej | dshulyak_: it should be marked as "wtf" :) | 19:00 |
pigmej | as with all network connection problems | 19:01 |
dshulyak_ | so what about that n_val=1 - were you able to reproduce it again? i dont know wtf but on my env i am not able to reproduce even race | 19:02 |
pigmej | salmon can you change that save to save(force) ? | 19:02 |
pigmej | but i'm able easily to reproduce that error even with force.... | 19:02 |
salmon_ | ? | 19:03 |
salmon_ | what, where? | 19:03 |
pigmej | dshulyak_: give him line numbers :) | 19:03 |
dshulyak_ | salmon_: L104 dblayer/locking.py | 19:04 |
salmon_ | dshulyak_: lk.save(force-True) ? | 19:05 |
pigmej | yup | 19:06 |
salmon_ | force=True | 19:06 |
salmon_ | ? | 19:06 |
salmon_ | ok | 19:06 |
salmon_ | checking hosts.py example | 19:06 |
pigmej | dshulyak_: to me it looks now clearly that n_val solves nothing | 19:06 |
pigmej | it narrows the window, but still | 19:06 |
dshulyak_ | A vnode is the unit of concurrency, replication, and fault tolerance :) | 19:07 |
dshulyak_ | strange | 19:07 |
pigmej | trying to replicate with simple script | 19:07 |
pigmej | dshulyak_: but our riak has 8vnodes | 19:08 |
dshulyak_ | but n_val is a replication number, isnt it? | 19:09 |
pigmej | kinda | 19:10 |
pigmej | it's how many copies of SINGLE object are kept | 19:10 |
dshulyak_ | maybe pw=1 should be added | 19:10 |
pigmej | https://bpaste.net/show/415b3339388d | 19:10 |
pigmej | it's default afair | 19:10 |
dshulyak_ | i remember about sloppy quorum, but it is only when primary vnode is not available, right? | 19:11 |
pigmej | or maybe not, because they changed it | 19:11 |
pigmej | what is primary vnode for non existing key ? | 19:11 |
dshulyak_ | whatever it is - it should be the same one) for A and B, because the placement is consistent | 19:12 |
pigmej | but it can be adjusted | 19:13 |
pigmej | vnode isn't static | 19:13 |
pigmej | it's not like key % vnode_num => it's given vnode | 19:13 |
pigmej | salmon_: crashes for you? | 19:16 |
salmon_ | pigmej: worked ok now | 19:20 |
pigmej | ... | 19:20 |
salmon_ | running again | 19:20 |
pigmej | https://bpaste.net/show/5a8be828c95f | 19:20 |
pigmej | dshulyak_: | 19:23 |
pigmej | https://bpaste.net/show/5bd25a7d9609 | 19:23 |
pigmej | salmon_: you too | 19:23 |
pigmej | could you please guys to execute that ? | 19:23 |
pigmej | and what it prints to you? | 19:23 |
pigmej | 1 and -1 nothing more ? | 19:24 |
pigmej | It would be super cool if it prints to you both only one 1 and the rest have -1 | 19:24 |
pigmej | or maybe better change that range to something smaller | 19:25 |
salmon_ | pigmej: on second run it crashed in the same way as before | 19:25 |
pigmej | yeah | 19:25 |
pigmej | so n_val is not working as we need | 19:25 |
salmon_ | [1, -1, -1, -1, -1, -1, -1, -1, -1, 1] | 19:26 |
salmon_ | always one 1 | 19:26 |
pigmej | wait fixing one thing there | 19:26 |
pigmej | https://bpaste.net/show/0727dc69d7be | 19:27 |
pigmej | this one | 19:27 |
pigmej | i have sometimes two times 1... | 19:27 |
salmon_ | all are the same: 0 [1, -1] | 19:27 |
salmon_ | to 9 [1, -1] | 19:28 |
pigmej | then super cool | 19:28 |
pigmej | because then it would mean that n_val works as dshulyak_ expected... | 19:28 |
dshulyak_ | also made a test - http://paste.openstack.org/show/483918/ | 19:28 |
pigmej | BUT wtf I have [1, 1] sometimes | 19:28 |
pigmej | salmon_: so you have always 1,-1 and it still crashes "as before" ? | 19:29 |
salmon_ | yup | 19:29 |
pigmej | what is exception ? | 19:29 |
salmon_ | the same as before | 19:29 |
pigmej | we discussed like 10 exceptions there... ;) | 19:30 |
salmon_ | https://bpaste.net/show/2e13c965df2d | 19:30 |
salmon_ | going off now, see you tomorrow | 19:30 |
pigmej | counter | 19:30 |
pigmej | salmon_: counter is different exception :D | 19:30 |
pigmej | dshulyak_: so let's say my riak is stupid, and that it does something 'wrong' | 19:30 |
dshulyak_ | 0 [1, -1] | 19:31 |
dshulyak_ | 1 [1, -1] | 19:31 |
dshulyak_ | 2 [1, -1] | 19:31 |
dshulyak_ | 3 [1, -1] | 19:31 |
dshulyak_ | 4 [1, -1] | 19:31 |
dshulyak_ | 5 [1, -1] | 19:31 |
dshulyak_ | 6 [1, -1] | 19:31 |
dshulyak_ | 7 [1, -1] | 19:31 |
dshulyak_ | 8 [1, -1] | 19:31 |
dshulyak_ | 9 [1, -1] | 19:31 |
dshulyak_ | sorry :) | 19:31 |
pigmej | I will run tests for night, and we will see | 19:31 |
pigmej | dshulyak_: yeah, then wtf I have [1, 1] sometimes | 19:31 |
pigmej | like in <1% | 19:31 |
dshulyak_ | in my test there is always 1 with 2 siblings, and 1 with 1 | 19:31 |
dshulyak_ | let me try 3 | 19:31 |
pigmej | yeah, it's then the same as mine [1, -1] | 19:32 |
dshulyak_ | http://paste.openstack.org/show/483919/ | 19:32 |
dshulyak_ | it is [1,2,3] | 19:32 |
pigmej | so perfect | 19:33 |
pigmej | hmm | 19:33 |
pigmej | does your gevent support AttributeError: 'module' object has no attribute 'monkey' ? | 19:33 |
dshulyak_ | still i think it would be better to rollback init script to two workers, and then decide what we will do with n_val and counter for gevent | 19:33 |
dshulyak_ | yeah, it works | 19:34 |
pigmej | Hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm | 19:34 |
dshulyak_ | gevent==1.0.2 | 19:34 |
pigmej | same there | 19:35 |
pigmej | maybe that's message "It's late Jedrzej, go off" ? | 19:35 |
dshulyak_ | maybe u are using old container for riak? | 19:36 |
dshulyak_ | without actual n_val=1 ? | 19:36 |
dshulyak_ | its not that late :) | 19:37 |
pigmej | well, i'm there since 9:30 :D | 19:37 |
pigmej | so it's arount 11 hours ;P | 19:38 |
pigmej | around | 19:38 |
pigmej | I would need 2-3 more to switch to hardcore mode ;] | 19:38 |
pigmej | dshulyak_: your script fails for me... | 19:38 |
pigmej | I have [1,2,2] sometimes | 19:38 |
pigmej | not funny ;/ | 19:39 |
pigmej | but well, if n_val works like that, there is a chance that crdt counter will work as we need (it shares the same logic, imagine an object with list as siblings with negative and positive lists, and the value is just sum of these) | 19:41 |
pigmej | then we could use counters on riak and "autoincrement" on sqlite | 19:44 |
pigmej | ok, dshulyak_ thanks for the debugging session, I spawned 3 riak vms, your script is running, mine too | 19:44 |
pigmej | we will see :) | 19:45 |
pigmej | take care! | 19:45 |
dshulyak_ | yes, thanks, it was interesting debug session :) | 19:58 |
*** dshulyak_ has quit IRC | 20:08 | |
*** dshulyak_ has joined #openstack-solar | 20:24 | |
*** dshulyak_ has quit IRC | 20:41 | |
tzn | anyone still online | 20:45 |
salmon_ | tzn: what's up? | 20:51 |
*** mihgen has quit IRC | 21:28 | |
*** mihgen has joined #openstack-solar | 21:35 | |
*** tzn has quit IRC | 22:11 | |
*** salmon_ has quit IRC | 22:20 | |
*** 21WAASK71 has joined #openstack-solar | 22:34 | |
*** 21WAASK71 has quit IRC | 22:37 | |
*** tzn has joined #openstack-solar | 23:16 | |
*** tzn has quit IRC | 23:26 |
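
Editor's sketch of the lock race discussed around 14:15-14:40 ("A & B reads, A saves, A deletes, B saves") and of the "crdt like" fix pigmej proposed (a tuple with + or - identity, no delete). This is a minimal in-memory illustration, not Solar's `locking` code and not the Riak client API: the `LockState` class, the plain `dict` store, and the function names are all hypothetical. It only assumes that the backend hands back both sibling versions on a conflicting write so that they can be merged.

```python
from dataclasses import dataclass


def naive_scenario():
    """Release-by-delete: B's stale write resurrects A's lock."""
    store = {}
    store["lock"] = ["A"]                # A acquires the lock
    b_copy = list(store["lock"]) + ["B"] # B reads while A still holds it
    del store["lock"]                    # A releases by deleting the record
    b_copy.remove("B")                   # B backs off: removes itself...
    store["lock"] = b_copy               # ...and saves its stale copy
    return store["lock"]                 # A looks like the owner again


@dataclass(frozen=True)
class LockState:
    """Hypothetical two-phase set: identities are added (+) or removed (-)."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()

    def acquire(self, identity):
        return LockState(self.added | {identity}, self.removed)

    def release(self, identity):
        # No delete: a release is recorded, so a stale writer cannot undo it.
        return LockState(self.added, self.removed | {identity})

    def merge(self, other):
        # Sibling resolution is a plain union of both sides.
        return LockState(self.added | other.added,
                         self.removed | other.removed)

    def holders(self):
        return self.added - self.removed


def crdt_scenario():
    """Same interleaving as naive_scenario, with mergeable lock state."""
    a_state = LockState().acquire("A")           # A acquires
    b_state = a_state.acquire("B").release("B")  # B reads, then backs off
    a_state = a_state.release("A")               # A releases (no delete)
    return a_state.merge(b_state).holders()      # siblings merged on B's write


print(naive_scenario())  # the stale owner survives
print(crdt_scenario())   # nobody holds the lock
```

One consequence of this design, in line with the 17:40 remark about hitting the same `identity` after a restart: once an identity is in the removed set it can never hold the lock again, so every acquisition needs a fresh identity.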
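
Editor's sketch of the counter idea from 19:41 (positive and negative lists, value is the sum) together with the caveat from 17:33 that such a counter does not guarantee you never see the same number twice. This is a generic PN-counter written from scratch for illustration; the `PNCounter` class and the actor names are assumptions and this is not the Riak data-types API.

```python
class PNCounter:
    """Hypothetical PN-counter: per-actor increment and decrement totals."""

    def __init__(self, actor):
        self.actor = actor
        self.p = {}  # actor -> total increments made by that actor
        self.n = {}  # actor -> total decrements made by that actor

    def increment(self, by=1):
        self.p[self.actor] = self.p.get(self.actor, 0) + by
        return self.value()  # the value this replica observes locally

    def decrement(self, by=1):
        self.n[self.actor] = self.n.get(self.actor, 0) + by
        return self.value()

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        # Element-wise max makes merging commutative and idempotent.
        for actor, total in other.p.items():
            self.p[actor] = max(self.p.get(actor, 0), total)
        for actor, total in other.n.items():
            self.n[actor] = max(self.n.get(actor, 0), total)


# Two workers increment concurrently, before exchanging state.
a, b = PNCounter("worker-a"), PNCounter("worker-b")
seen_by_a = a.increment()
seen_by_b = b.increment()

a.merge(b)
b.merge(a)
print(seen_by_a, seen_by_b, a.value(), b.value())
```

Both replicas converge on the same total after merging, but each worker observed the same intermediate value, so a counter like this is a poor source of unique sequence numbers unless something else serializes the writers.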