*** dshulyak_ has joined #openstack-solar | 08:24 | |
*** salmon_ has joined #openstack-solar | 08:39 | |
pigmej | dshulyak_: I have some concerns about our test_lock.py. Have you ever seen them fail on the current implementation? | 10:11 |
dshulyak_ | pigmej: no, they passed on 3 backends, what is the failure? | 10:12 |
pigmej | I ran them during the night, and with fast acquire/release scenarios I can crash them easily | 10:13 |
pigmej | that's that delete + write case probably | 10:14 |
pigmej | but I don't get why it's not crashing for you. That's why I asked if you'd seen a failure there | 10:14 |
pigmej | and I again started to wonder if my env is OK, but our scripts from last week are working properly | 10:15 |
dshulyak_ | which test fails? | 10:18 |
pigmej | acquire_release_logic, lock_acquired_released | 10:18 |
pigmej | pretty "standard" tests | 10:19 |
pigmej | obviously, in release_logic it fails on the last assert, | 10:19 |
pigmej | and in acquired_released it fails because 11 != 12 | 10:19 |
dshulyak_ | it is with riak n_val=1 backend? | 10:22 |
dshulyak_ | or sqlite? | 10:22 |
pigmej | n_val=1 | 10:22 |
pigmej | it's probably that delete + 'write old state' scenario, but I cannot confirm it, because if I add additional debug, it always works | 10:23 |
pigmej | dshulyak_: I mostly needed confirmation of whether it sometimes fails for you or not at all | 10:25 |
dshulyak_ | i thought that the case with a write of old state is only possible in a concurrent env | 10:26 |
pigmej | yeah | 10:26 |
pigmej | me too | 10:26 |
pigmej | but maybe delete works in strange way | 10:27 |
pigmej | because we know that it deletes with some delay | 10:27 |
dshulyak_ | yeah, but you added conflict resolution for deleted items | 10:28 |
pigmej | yeah, anyway I will debug this somehow ;) | 10:30 |
pigmej | BUT coffee first :D | 10:30 |
dshulyak_ | but yeah, it looks like the old identity is returned for you, either in get or after a SiblingsError | 10:37 |
dshulyak_ | i will try to run those tests for some time | 10:38 |
pigmej | yeah kinda like that | 10:39 |
dshulyak_ | i thought that maybe the problem was that i am using 1 CPU for vagrant, but i switched to 2, and it's all the same | 10:42 |
pigmej | well, I added 3 debug prints to check it and then everything always worked | 10:42 |
pigmej | but it looks like that | 10:43 |
pigmej | DEBUG (locking.py::76)::Lock for 11 acquired by 11 | 10:43 |
pigmej | DEBUG (locking.py::86)::Release lock 11 with 11 | 10:43 |
pigmej | DEBUG (locking.py::106)::Found lock with UID 11, owned by 11, owner False | 10:43 |
pigmej | so it's clear that after release it *sometimes* still finds the old one | 10:44 |
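The debug trace above matches a delete that only takes effect after a delay (Riak tombstone lag): a get() issued right after release can still return the old lock record. A toy simulation of that behaviour — the store and key names are made up, this is not the solar code:

```python
import threading
import time

class LaggyStore:
    """Toy key-value store where delete lands asynchronously,
    mimicking an eventually-consistent backend with tombstone delay."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key, lag=0.01):
        # the deletion is applied after `lag` seconds, like a delayed tombstone
        threading.Timer(lag, self._data.pop, args=(key, None)).start()

store = LaggyStore()
store.put('lock:11', {'owner': '11'})   # "Lock for 11 acquired by 11"
store.delete('lock:11')                 # "Release lock 11 with 11"
stale = store.get('lock:11')            # may still see the old owner
time.sleep(0.1)
gone = store.get('lock:11')             # eventually the delete lands
```

Here `stale` still holds the released lock while `gone` is `None`, which is exactly the "Found lock with UID 11 ... owner False" symptom: the race only shows up when the read happens inside the lag window, so adding debug prints (which slow the loop down) makes it disappear.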
dshulyak_ | so it is even in get | 10:44 |
pigmej | yup | 10:45 |
pigmej | BUT I'm not sure if it's always like that | 10:45 |
dshulyak_ | pigmej: btw you are using 1cpu for solar-dev or 2-3 ? | 10:48 |
pigmej | 2 | 10:49 |
pigmej | + 2,5 G ram | 10:49 |
dshulyak_ | i tried to add more ram but it is still the same, 100000 iterations of acquire_release - no failure | 11:02 |
pigmej | cool.... | 11:03 |
pigmej | wtf is with my laptop then? | 11:03 |
pigmej | or with me :D | 11:05 |
dshulyak_ | salmon_: can you try to run test_lock with this change? http://paste.openstack.org/show/484121/ | 11:05 |
dshulyak_ | solar/test/test_lock.py | 11:05 |
salmon_ | dshulyak_: sure, one moment | 11:05 |
dshulyak_ | with riak | 11:05 |
salmon_ | dshulyak_: how to configure tests to use riak? | 11:06 |
dshulyak_ | cat /.solar_config_override | 11:07 |
dshulyak_ | solar_db: riak://10.0.0.2:8087 | 11:07 |
salmon_ | ok | 11:08 |
pigmej | salmon_: default vagrant env uses riak n_val1 | 11:08 |
salmon_ | dshulyak_: btw, is this `x` used somewhere? | 11:08 |
pigmej | so it's default config | 11:08 |
salmon_ | pigmej: ok | 11:08 |
pigmej | salmon_: dshulyak_ added it just for parametrization :) | 11:08 |
dshulyak_ | yeah, range loop, i think otherwise pytest will fail | 11:09 |
pigmej | yup | 11:10 |
pigmej | hmm, | 11:16 |
salmon_ | how long will it take? :) | 11:16 |
pigmej | 100000 * 0.05 | 11:17 |
dshulyak_ | i didn't notice, but it should be quite fast :) | 11:17 |
salmon_ | ... | 11:17 |
salmon_ | still running | 11:17 |
pigmej | salmon_: ;D | 11:17 |
pigmej | ok, I restarted env + laptop | 11:18 |
pigmej | and... it works for me too (this test) | 11:18 |
dshulyak_ | if it behaves sometimes this way on your laptop then it is also possible in production | 11:20 |
pigmej | sure | 11:20 |
pigmej | that's why I asked you for checking | 11:21 |
salmon_ | ..'.count('.') | 11:21 |
salmon_ | Out[2]: 5376 | 11:21 |
salmon_ | It will take hours.... | 11:21 |
dshulyak_ | hm | 11:21 |
dshulyak_ | maybe i executed 10000 :) | 11:21 |
pigmej | one of you have broken env then ;P | 11:21 |
dshulyak_ | let me recheck | 11:21 |
salmon_ | pigmej: how long did it take for you? | 11:21 |
pigmej | ~0.02 each test I think | 11:21 |
pigmej | I switched to my branch now | 11:22 |
pigmej | but wait I can check | 11:22 |
pigmej | 0.03 | 11:22 |
pigmej | so 0.03 * 100 000 | 11:22 |
salmon_ | hmm, "Killed" | 11:22 |
pigmej | means like 3000 seconds? | 11:22 |
pigmej | which is, hmm, like 50 minutes? | 11:23 |
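The back-of-the-envelope math being done here, written out (using the ~0.03 s per-test figure measured above):

```python
# 100 000 parametrized iterations at ~0.03 s (30 ms) per test
iterations = 100_000
per_test_ms = 30                          # rough per-test cost, milliseconds
total_s = iterations * per_test_ms / 1000  # total runtime in seconds
minutes = total_s / 60                     # 3000 s is 50 minutes
```

So a full 100k-iteration run is about 50 minutes — which is why a completed run would have been hard to miss, and why the earlier run was probably 10k iterations, not 100k.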
pigmej | dshulyak_: you didn't notice a test that ran for 50 minutes? :D | 11:23 |
salmon_ | after 5805 iteration I got message "Killed" | 11:23 |
pigmej | oom or pytest killed it? | 11:23 |
salmon_ | all tests passed though | 11:23 |
dshulyak_ | i think i run 10000, not 100000 | 11:23 |
salmon_ | rechecking | 11:23 |
dshulyak_ | sorry :D | 11:23 |
pigmej | dshulyak_: :D | 11:24 |
salmon_ | riak + the tests eat a lot of RAM during this test | 11:25 |
dshulyak_ | so what the conclusion - we wont rely on n_val=1? | 11:25 |
salmon_ | why does it take so much ram? | 11:27 |
pigmej | dshulyak_: I think something is wrong in my env | 11:27 |
pigmej | because I already had a stupid problem with that n_val | 11:27 |
pigmej | I think we can rely on it | 11:27 |
pigmej | BUT idk what's wrong with my stuff | 11:28 |
dshulyak_ | pigmej: you should buy yourself a macbook | 11:28 |
dshulyak_ | zero problems :D | 11:28 |
pigmej | You know what's worst about buying macbook ? | 11:28 |
pigmej | or maybe I shouldn't say this joke there.... | 11:29 |
salmon_ | you shouldn't :P | 11:29 |
dshulyak_ | what? | 11:29 |
pigmej | salmon_: yeah I stopped it in the middle :P | 11:30 |
pigmej | anyway, I ran this 10000 and it failed on the 117th try | 11:31 |
pigmej | before restart | 11:31 |
dshulyak_ | parametrize is a lot - with 10k it is about 600, but with 100k - 1200 | 11:32 |
salmon_ | 10004 passed in 1851.42 seconds | 11:59 |
pigmej | k, then I will blame my env... | 12:01 |
salmon_ | I may try with more RAM because it was slow, it was using swap | 12:03 |
pigmej | ok I have alternative lock approach | 12:12 |
pigmej | which seems to be working | 12:12 |
dshulyak_ | i remember that on friday we discussed locking based on state | 12:13 |
pigmej | yeah | 12:13 |
pigmej | I have some experiment about that too | 12:13 |
pigmej | :) | 12:13 |
pigmej | but it requires changes in workers etc, so I think we should wait with that reimplementation for the new worker, shouldn't we? | 12:14 |
pigmej | or we should somehow integrate it with model... | 12:14 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: CRDTish lock to avoid concurrent update/delete https://review.openstack.org/269018 | 12:26 |
salmon_ | CRDTish :) | 12:27 |
pigmej | dshulyak_ salmon_ https://review.openstack.org/#/c/269018/ | 12:27 |
pigmej | yup | 12:27 |
pigmej | it's kinda like an AW-OR set | 12:27 |
pigmej | we could use native set from riak too, but we need this implementation for SQLite for sure | 12:28 |
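For context, a loose sketch of what "kinda like an AW-OR set" (add-wins observed-remove set) means — this is the general CRDT idea, not the actual implementation in review 269018: each add carries a unique tag, a remove only cancels tags it has observed, and replicas merge by union, so a stale concurrent write can never resurrect state the way a plain delete + write can:

```python
import uuid

class AWORSet:
    """Add-wins observed-remove set: merge = union of add-tags minus
    union of observed remove-tags; a concurrent add wins over a remove."""

    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of tags observed at removal

    def add(self, elem):
        # every add gets a fresh tag, so re-adding after a remove
        # produces a tag the remove has never seen
        self.adds.setdefault(elem, set()).add(uuid.uuid4().hex)

    def remove(self, elem):
        # cancel only the tags observed right now
        self.removes.setdefault(elem, set()).update(self.adds.get(elem, set()))

    def __contains__(self, elem):
        live = self.adds.get(elem, set()) - self.removes.get(elem, set())
        return bool(live)

    def merge(self, other):
        # union both sides; no coordination needed, merge is commutative
        for e, tags in other.adds.items():
            self.adds.setdefault(e, set()).update(tags)
        for e, tags in other.removes.items():
            self.removes.setdefault(e, set()).update(tags)
```

Used as a lock table (lock holders as elements), release becomes an observed-remove instead of a backend delete, which sidesteps the tombstone-lag race discussed earlier in the log.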
pigmej | now time to tackle counters :) | 12:28 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now https://review.openstack.org/269035 | 13:12 |
salmon_ | oops | 13:12 |
pigmej | :> | 13:13 |
pigmej | wtf ? | 13:13 |
pigmej | isn't it https://review.openstack.org/#/c/268331/ ? | 13:13 |
salmon_ | I messed with topics :/ | 13:14 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now https://review.openstack.org/269035 | 13:14 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now https://review.openstack.org/269035 | 13:15 |
salmon_ | the last one is correct ^ | 13:16 |
pigmej | ... | 13:29 |
pigmej | why you have 3 changes there? | 13:30 |
pigmej | 2 | 13:30 |
salmon_ | Patch sets? | 13:30 |
pigmej | yup | 13:32 |
pigmej | no | 13:32 |
pigmej | ehs | 13:32 |
pigmej | salmon_: | 13:32 |
pigmej | Remove ansible.cfg, we use .ssh/config now / Change-Id: I382bfb6e2b969a4058b74d569972418c19ebc834 / Fix provision and image build after removing ansible.cfg / Change-Id: I257bd0c7050516746ff77b8ef09dc169b945deae | 13:32 |
pigmej | this is how your commit msg looks like | 13:32 |
salmon_ | ah | 13:33 |
salmon_ | git stash... | 13:33 |
pigmej | yeah yeah, excuses ;P | 13:33 |
salmon_ | afk, ~1h :) | 13:34 |
pigmej | ;] | 13:34 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: CRDTish lock to avoid concurrent update/delete https://review.openstack.org/269018 | 15:08 |
pigmej | dshulyak_: in fact, are duplicates of "counter" bad for us? | 15:24 |
pigmej | because I'm looking into the code, and so far I can't find a place where we require it to be unique | 15:25 |
dshulyak_ | in general no, but we need to be sure that the history is correct | 15:27 |
dshulyak_ | sorry not in general.. in some cases no | 15:27 |
pigmej | what do you mean by 'history is correct' ? | 15:28 |
dshulyak_ | e.g. if B was executed after A - they shouldn't be the same | 15:29 |
pigmej | ok but what if C and B were executed just after A ? | 15:30 |
pigmej | wouldn't it be valid if A would have counter 1, B would have 2 and C would have also 2 ? | 15:30 |
pigmej | I mean, do we really need a "numbered order" there, or can we have a non-numbered order like successors and predecessors? | 15:31 |
pigmej | Because I'm looking into the code, and I find that history.filter is used in only one place, which is the "history_last" method | 15:32 |
dshulyak_ | i think it is also used in solar ch history | 15:34 |
dshulyak_ | or should be | 15:34 |
dshulyak_ | C and B can be the same i guess | 15:34 |
pigmej | well solar ch history uses that composite | 15:36 |
pigmej | which uses log, resource and action | 15:36 |
pigmej | the drawback could be that we could have a bit "randomized" order | 15:40 |
pigmej | which is unwanted | 15:40 |
pigmej | because all tasks with the same counter value *could* then be presented in any order | 15:40 |
pigmej | BUT they are independent (because they were executed at the same time), so... | 15:41 |
pigmej | so it may be a problem but I'm not sure | 15:43 |
pigmej | dshulyak_: it turns out that we can use counters with n_val=1... I was mistaken about that. | 15:56 |
dshulyak_ | you mean crdt counters, right? | 15:57 |
pigmej | yeah | 15:57 |
pigmej | I just executed some testing things | 15:57 |
pigmej | and with one node with n_val 1 we're fine | 15:57 |
pigmej | and obviously the same for strongly consistent buckets | 15:58 |
dshulyak_ | btw can we use these data structures with strongly consistent buckets? | 15:59 |
pigmej | ? | 15:59 |
pigmej | we can use CRDT with strong consistent buckets, | 15:59 |
dshulyak_ | can we use counters and sets with ensemble buckets? | 15:59 |
pigmej | AND we can use it with n_val=1 | 16:00 |
pigmej | yeah we can | 16:00 |
pigmej | :( | 16:00 |
dshulyak_ | whats up? it started to fail :) ? | 16:01 |
pigmej | no | 16:01 |
pigmej | I'm sad, because I was wrong about n_val=1 ;P | 16:02 |
pigmej | the tests were executed for an hour | 16:02 |
pigmej | one of the counters is now at ~10 million | 16:02 |
pigmej | not a single value missed / duplicated | 16:02 |
pigmej | :) | 16:03 |
pigmej | though we will need client side stuff for sql :) | 16:05 |
pigmej | dshulyak_: any ideas how to do our counter in SQL ? | 16:07 |
dshulyak_ | client side? | 16:08 |
dshulyak_ | can we just insert empty rows? | 16:08 |
pigmej | and take the first non-empty one? | 16:08 |
dshulyak_ | ah | 16:08 |
pigmej | the problem is that we need to know value | 16:09 |
pigmej | I mean, `self.history = StrInt(next(NegativeCounter.get_or_create('history')))` | 16:09 |
dshulyak_ | but isnt it the same as with increment? | 16:09 |
pigmej | I know how to do it with riak with counter | 16:10 |
dshulyak_ | i mean we will know it for sure after write | 16:10 |
pigmej | ok, then we should make the counter value the pkey in sql | 16:10 |
openstackgerrit | Dmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules https://review.openstack.org/269166 | 16:10 |
pigmej | because otherwise we could still get the same value twice, couldn't we? | 16:11 |
dshulyak_ | i will still work on cleaning this, but here is an example how it works - https://review.openstack.org/#/c/269166/1/solar/test/functional/test_tasks_subscribers.py | 16:12 |
pigmej | I wonder if sqlite will properly handle 2 concurrent +1s to the same row | 16:12 |
dshulyak_ | yes, i thought that we will use pkey for sqlite | 16:13 |
pigmej | the thing is that then we will need to have nested transactions | 16:14 |
dshulyak_ | where? | 16:15 |
pigmej | because we will not know about counter conflict otherwise | 16:16 |
pigmej | dshulyak_: transaction begin, x = [sql update +1], A.a = x, transaction end | 16:18 |
pigmej | in a concurrent env, it will not work properly, will it? | 16:18 |
dshulyak_ | wont sqlite lock whole table? | 16:20 |
pigmej | Let's check, because I'm not sure | 16:22 |
pigmej | I mean, it's not even +1 in our case | 16:22 |
pigmej | because we will do x = (db.get() + 1).save().x | 16:22 |
pigmej | pseudocode obviously ^ | 16:22 |
pigmej | brb | 16:29 |
dshulyak_ | won't we use autoincrement? i think it should be ().save().pk | 16:38 |
-openstackstatus- NOTICE: Gerrit is restarting quickly as a workaround for performance degradation | 16:50 | |
pigmej | back | 16:51 |
pigmej | dshulyak_: yeah BUT, https://www.sqlite.org/autoinc.html | 16:51 |
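The pkey-based counter being discussed can be sketched like this for SQLite: each caller INSERTs a row and reads back its autoincrement primary key, so the counter bump is a single atomic statement and two writers can never observe the same value — no read-modify-write, no nested transactions. Table and function names here are hypothetical, not the solar schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# AUTOINCREMENT guarantees monotonically increasing ids that are never
# reused, at the cost of the caveats documented on sqlite.org/autoinc.html
conn.execute('CREATE TABLE history_counter '
             '(id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT)')

def next_counter(conn, note=''):
    # the INSERT is atomic; lastrowid is the pkey assigned to *this*
    # insert, so concurrent writers each see a distinct value
    cur = conn.execute('INSERT INTO history_counter (note) VALUES (?)',
                       (note,))
    conn.commit()
    return cur.lastrowid

first = next_counter(conn)
second = next_counter(conn)
```

Compared with `x = (db.get() + 1).save().x`, the value is only known *after* the write (as noted above), which is exactly what makes it safe: there is no window between reading the old value and writing the new one.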
pigmej | dshulyak_: for some reason | 16:58 |
pigmej | https://bpaste.net/show/08849a0c59e3 | 16:58 |
pigmej | works fine | 16:58 |
pigmej | so... for now I will treat this as a solution :) | 16:58 |
dshulyak_ | with sqlite? | 16:58 |
pigmej | yup | 16:59 |
pigmej | later we can adjust it for pk | 16:59 |
pigmej | but I will make counter based for riaks | 16:59 |
openstackgerrit | Dmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules https://review.openstack.org/269166 | 17:17 |
dshulyak_ | the good thing is that with new worker i can catch counter errors pretty easily | 17:18 |
pigmej | it shouldn't crash :) | 17:18 |
pigmej | dshulyak_: can you set workflow -1 for yourself there ? | 17:18 |
dshulyak_ | ok, usually i don't care about that, anyway the patch won't be merged accidentally | 17:20 |
pigmej | well, for now probably no, but later we may do it by accident | 17:20 |
pigmej | :) | 17:21 |
pigmej | but it's just my opinion :) | 17:21 |
pigmej | hmm, any ideas how should we proceed with riak data types & our vagrant env? | 17:24 |
openstackgerrit | Dmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules https://review.openstack.org/269166 | 18:27 |
openstackgerrit | Dmitry Shulyak proposed openstack/solar: Zerorpc worker for orchestration modules https://review.openstack.org/269166 | 18:29 |
salmon_ | dshulyak_: it looks really nice | 18:35 |
pigmej | yeah the separation is cool | 18:36 |
pigmej | salmon_: https://review.openstack.org/#/c/269166/4/solar/orchestration/executors/inproc.py this is my fav | 18:36 |
dshulyak_ | lets hope it will work :) | 18:36 |
pigmej | yeah :) | 18:36 |
salmon_ | pigmej: yeah, nice :) | 18:36 |
pigmej | dshulyak_: well, we're engineers, we don't hope, we know :D | 18:37 |
pigmej | except: https://scontent-frt3-1.xx.fbcdn.net/hphotos-xpl1/v/t1.0-9/12507573_1114134135263634_3266113277144032533_n.jpg?oh=bf3b89c35ef4991aa3600078c06b1866&oe=570128EF :P | 18:37 |
salmon_ | :D | 18:39 |
pigmej | ;] | 18:39 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter https://review.openstack.org/269238 | 18:42 |
pigmej | it's WIP | 18:43 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter https://review.openstack.org/269238 | 18:43 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Remove ansible.cfg, we use .ssh/config now https://review.openstack.org/269035 | 18:44 |
openstackgerrit | Lukasz Oles proposed openstack/solar: Hardcode ansible version. We are not ready for 2.0 https://review.openstack.org/269239 | 18:44 |
salmon_ | pigmej: what is riak.yaml ? | 18:45 |
pigmej | https://review.openstack.org/#/c/269238/2/bootstrap/playbooks/tasks/riak.yaml | 18:45 |
salmon_ | pigmej: can you move running it from the Vagrantfile to the solar yaml? You will break the devops tests | 18:47 |
pigmej | well, I'm not sure :) | 18:49 |
pigmej | I mean, what DB do you use in devops tests? | 18:49 |
pigmej | riak ? | 18:49 |
salmon_ | yup | 18:50 |
pigmej | hm | 18:50 |
pigmej | ok | 18:50 |
salmon_ | it's the same env as in vagrant | 18:50 |
salmon_ | and I'm using solar.yaml to bootstrap it | 18:50 |
pigmej | then fine, in fact I didn't want to break your stuff, so I created it as a separate thing | 18:50 |
salmon_ | good intentions ;) | 18:51 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter https://review.openstack.org/269238 | 18:52 |
pigmej | k | 18:52 |
pigmej | :) | 18:52 |
pigmej | dshulyak_: do you have any ETA for when your worker may be usable / testable? | 18:55 |
pigmej | anyway, I'm off for today | 18:55 |
dshulyak_ | it deploys some examples already, i think tomorrow it will be usable | 18:57 |
pigmej | dshulyak_: cool, I expect to have counter working tomorrow (it works already but I haven't tested setup), locks are working too ;) | 19:03 |
*** dshulyak_ has quit IRC | 19:12 | |
openstackgerrit | Lukasz Oles proposed openstack/solar: Hardcode ansible version. We are not ready for 2.0 https://review.openstack.org/269239 | 20:38 |
openstackgerrit | Jedrzej Nowak proposed openstack/solar: Fixing concurrency problems in history counter https://review.openstack.org/269238 | 20:57 |
*** salmon_ has quit IRC | 23:06 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!