14:01:31 #startmeeting Rally
14:01:32 Meeting started Mon Feb 15 14:01:31 2016 UTC and is due to finish in 60 minutes. The chair is rvasilets_. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:01:33 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:01:35 The meeting name has been set to 'rally'
14:01:48 Hi to all
14:03:07 hi!
14:03:09 o/
14:03:11 looks like we don't have a lot of topics for today
14:03:20 are you sure?)
14:03:26 Let's wait a bit
14:03:29 o/
14:03:38 Maybe someone else will join us
14:04:18 andreykurilin: around? you seemed to have a topic
14:05:00 yes, I'm here:) let's wait for other attendees a bit
14:05:23 i'd love to report 'bout my install_rally.sh refactoring (we want to install rally in a venv on gates) but i'm in the middle of testing
14:05:30 :)
14:05:47 ..so I won't take your time today))
14:06:02 let's start?
14:06:07 Okay
14:06:14 I have one topic
14:06:32 raised by saurabh__ in our main chat today
14:06:43 andreykurilin, let's start with you
14:06:48 ok ok
14:06:53 what is the topic?
14:07:15 keystone can kill rally
14:07:16 lol
14:07:26 sounds like a good topic :D
14:07:37 #topic keystone can kill rally
14:07:51 rvasilets, can you set a topic?
14:08:03 #topic keystone can kill rally
14:08:26 nice:)
14:08:39 this is my privilege)
14:09:36 In case of a "dead" keystone and a big number of parallel iterations, keystoneclient will open a lot of sockets
14:10:32 saurabh__ faced the issue where rally was unable to open a db file to write the results of a task
14:10:45 sqlite was used
14:11:03 Is that a problem of rally or of sqlite, for example?
14:11:43 This is a limitation of sqlite, not Rally
14:11:48 possibly
14:11:50 ?
14:11:57 it is a limitation of the system in general
14:12:31 the problem on the rally side - we don't handle such cases
14:13:35 Maybe we can check the limit before saving results and increase it if possible
14:13:38 Could we fix this somehow?
14:14:00 At least, we can catch the error and write a user-friendly error
14:14:06 limit of what?
14:14:31 limit of "open files"
14:14:36 andreykurilin: sorry, i kinda don't follow. how do lots of open sockets prevent us from writing to sqlite?
14:14:47 i see
14:15:26 ikhudoshyn: it depends on system settings
14:15:35 maybe just post a warning?
14:15:46 when?)
14:15:54 during parsing of the scenario?
14:15:57 the biggest thing that we could do is raise a user-friendly msg here
14:16:24 ikhudoshyn: each time? it will bother
14:16:41 we already have 2 warnings (from boto and from requests)
14:16:48 and I want to remove them:)
14:16:49 like 'dear user, you are about to run lots of iterations, you might need many open sockets, pls make sure you can'
14:17:27 can you increase limits at runtime, not being root?
14:17:45 ikhudoshyn: I suppose we can check the system limit before launching a task and print a warning
14:17:56 ikhudoshyn: I don't have such experience:)
14:18:26 andreykurilin: that's what i suggested, that did not seem to satisfy u
14:18:35 Did we file the bug?
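
A minimal sketch of the pre-flight check discussed above (checking the "open files" limit before launching a task and printing a warning). This is illustrative only: the function name, the reserve margin, and the warning text are assumptions, not something agreed in the meeting.

    import resource


    def warn_if_nofile_limit_too_low(concurrency, reserve=64):
        # Soft limit for open file descriptors; sockets, db files and log
        # files all count against it.
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft != resource.RLIM_INFINITY and soft < concurrency + reserve:
            print("WARNING: the soft 'open files' limit is %d, but a task "
                  "with %d parallel iterations may need roughly %d "
                  "descriptors (client sockets plus the results db). "
                  "Consider raising it, e.g. with `ulimit -n`, before "
                  "running the task." % (soft, concurrency, concurrency + reserve))

Such a check would be cheap enough to run once per task, which addresses the concern in the log that a per-iteration warning would "bother" the user.
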
14:18:51 ikhudoshyn: https://docs.python.org/2/library/resource.html#resource.setrlimit
14:18:56 I mean we parse the scenario, check limits, and if they are too low -- we warn
14:19:03 this is a really bad thing
14:19:11 maybe it is possible to change a limit
14:19:20 but we need to check
14:19:41 rvasilets_: no, we don't have a filed bug yet
14:19:57 well, I'm not sure this would be a good idea -- changing system settings quietly
14:20:11 agree
14:20:35 we should just show an error or warning
14:20:44 and steps how to fix it
14:21:15 andreykurilin: what d'you think?
14:22:20 ikhudoshyn: It would be nice to have the check proposed by you in case of the sqlite backend, and a user-friendly error in the db layer
14:23:08 why the db layer? I believe it's a somewhat wider issue
14:23:51 like we could e.g. run into an issue when we're unable to open sockets as well as files?
14:24:20 currently, we faced such an issue at the db layer:) http://paste.openstack.org/show/486988/
14:24:42 so it is not just 'we'll possibly be unable to store results in the db' but 'we'll possibly be unable to do any writes/reads'
14:25:17 yes
14:25:53 so the db layer does not look like the very best place)
14:27:02 the task layer already tries to catch all errors and wrap them with a user-friendly exception
14:27:56 and only the db layer is not wrapped with any try...except
14:29:11 it's not good) but it is not necessarily connected to system limits
14:29:30 we need a bigger count of reraising =)
14:30:49 ikhudoshyn: Are you talking about the paste posted above?
14:31:17 nope, i'm talking about the 'limits' issue in general
14:32:31 I know about only one limit - open files:) which relates to opening new files and new sockets and new threads:)
14:32:47 if we're sure that the 'pastebin' issue is related to 'limits' -- even then i don't think it is a good idea to catch the exception and print a warning like 'shit happens during operating with sqlite -- check limits'
14:33:18 andreykurilin: yes -- we are talking about THAT limit )
14:33:51 =)
14:35:43 so...
14:35:52 why don't you think that it is a good idea? imo, it would be nice to catch such errors, maybe execute "time.sleep()" and try again
14:36:19 where would we sleep() ?)
14:36:45 rvasilets_: at home)
14:36:54 by reraising the error we could lose the trace
14:37:12 and not find the exact occurrence of the error
14:37:23 I'm for a warning
14:37:36 andreykurilin: from what I could see in the paste -- nothing gives any hint that the issue is related to the limit of open files
14:37:56 ikhudoshyn: It was not a full log
14:37:57 lol
14:38:03 http://paste.openstack.org/show/486959/
14:38:05 look here
14:38:07 hm.. nice,)
14:38:10 L3
14:38:53 rvasilets_: we have log.exception to store an original trace
14:39:05 Failed to consume a task from the queue: Unable to establish connection to https://192.169.123.50:5000/v2.0
14:39:22 see? we got lots of issues here, not just db related
14:40:04 ikhudoshyn: i started this topic with the phrase "keystone can kill rally"
14:40:06 :)
14:40:13 so I strongly suggest printing a warning during scenario parsing/validation
14:40:23 ))
14:40:28 yea) +1 for warning)
14:40:41 so, keystone is dead -> keystoneclient continues to open new sockets -> rally fails to write the results
14:41:05 andreykurilin: it can indeed. but catching the exception at the db layer won't save us anyway))
14:41:56 we can print the results (json.dumps) to stdout in case of inability to save to the db
14:41:57 lol
14:42:05 lol
14:42:14 we don't want to 'sleep()' until all ks client connections close and release file handles, do we?
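
A rough sketch of the fallback floated at 14:41:56: wrap the db write, and if sqlite cannot open or write its file, at least dump the results to stdout so they are not lost, then reraise something readable. The names save_task_results and TaskResultSaveError are hypothetical, and in Rally the error may arrive wrapped by the db layer rather than as a bare sqlite3.OperationalError.

    import json
    import sqlite3


    class TaskResultSaveError(Exception):
        """Hypothetical user-facing wrapper for a failed results write."""


    def save_or_dump(save_task_results, results):
        try:
            save_task_results(results)
        except sqlite3.OperationalError as e:
            # Do not lose the data: print it so the user can recover it
            # manually, then raise a friendlier error than a raw traceback.
            print(json.dumps(results, indent=2))
            raise TaskResultSaveError(
                "Failed to store task results in the database (%s). The raw "
                "results were printed to stdout above." % e)
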
14:42:49 we can add one more try in several seconds
14:42:54 in can help
14:43:02 and it will not produce a big delay
14:43:14 *it can help
14:43:20 and what if we just run out of disk space))?
14:43:56 we still won't be able to store data, after 1 sec or after 100
14:44:18 yes, but it is another issue, which should be fixed separately
14:44:22 or not fixed:)
14:45:37 disagree. the issue is 'we can't write to the db'
14:46:08 the error would be sqlite3.OperationalError
14:46:28 the same as for many things)
14:46:39 what we could do is list all possible reasons to the user (which i don't believe to be a good idea) or describe the reasons in a runbook
14:46:43 ok, but this issue can appear in different cases. some of them can be processed and handled, others - not
14:46:55 the only difference is the trace) and possibly the msg
14:47:05 +1 for a runbook
14:47:23 ))
14:47:43 and getting back to the separate warning -- do we need it?
14:48:34 I prefer a warning after the event :)
14:49:21 which event? sqlite3.OperationalError
14:49:23 ?
14:50:07 catching the error, checking the number of used and available resources, and writing a warning - possibly we can, I don't see the evil here
14:50:14 so to be consistent you should say 'db shit happens -- pls check limits, free space, access rights.. what else?'
14:51:21 we could catch 'too many open files'
14:51:24 we can check limits and free space in the "catch" code
14:51:35 and write a proper message
14:51:35 and check keystone
14:51:37 here
14:52:11 BECAUSE A FAILED KEYSTONE UNDER LOAD IS A COMMON PROBLEM
14:52:15 sorry
14:52:29 ok i give up... you are about to create a whole recovery and diagnostic system for just one specific case
14:52:46 and all this stuff was caused by a failed keystone
14:53:05 )
14:53:12 rvasilets_: yesssss, but we're talking about file limits and not ks at all
14:53:19 we could use the simple rule
14:53:22 80/20
14:54:02 so what is 80/20 gonna tell you in a case of sqlite3.OperationalError
14:54:04 ?
14:54:14 ks sux?
14:54:35 :)
14:54:49 ikhudoshyn: btw, we already try to catch something - https://github.com/openstack/rally/blob/master/rally/cli/cliutils.py#L567-L572
14:54:57 if we see a gazillion of 'unable to establish connection to https://192.169.123.50:5000/v2.0'
14:55:26 we could say it is a ks issue, but if we can't write to the db -- why is ks here at all?
14:56:09 ikhudoshyn: Can we reserve one "open file" for db-stuff?
14:56:24 andreykurilin: that is a sample of a good warning. 'db issue -- pls check yr db'
14:56:57 but 'db issue -- pls check yr limits.. or check yr ks' -- that is a bad warning))
14:57:20 ikhudoshyn: ok, but the user will check the rights and disk space and will not find the reason for the issue
14:57:33 andreykurilin: I don't know. If we could -- it would be great
14:58:07 ikhudoshyn: free resources and disk space can be checked by us ;)
14:58:09 i was thinking about keeping it always open -- but it could be fragile
14:58:12 and even rights
14:58:57 We don't have much time left; do we have anything agreed?
14:58:59 )
14:59:09 andreykurilin: ^^ hm.. do you want to check everything in case of an issue?
14:59:14 maybe
14:59:18 btw we're out of time
14:59:19 why not?
14:59:34 Okay
14:59:37 let's move to our general chat
14:59:42 *main
14:59:45 let's continue in slack
14:59:47 #agree almost agreed)
15:00:01 See you next meeting
15:00:05 see you
15:00:09 #endmeeting
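
A sketch of the "check limits and free space in the catch code" idea the discussion converged on near the end: when the results write fails, inspect the process and filesystem state and return targeted hints instead of a bare sqlite3.OperationalError. Everything here is an assumption: the function name, the thresholds, the db directory default, and the use of /proc/self/fd (Linux-specific) to count open descriptors.

    import os
    import resource


    def diagnose_db_write_failure(db_dir="/tmp"):
        """Collect hints about why writing the results db may have failed."""
        hints = []

        # Hint 1: the process is near its open-files limit -- a dead keystone
        # can leak sockets until this is hit.
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        try:
            used = len(os.listdir("/proc/self/fd"))  # Linux-only
        except OSError:
            used = None
        if used is not None and soft != resource.RLIM_INFINITY and used >= soft - 5:
            hints.append("process is close to the open-files limit (%d of %d "
                         "descriptors in use)" % (used, soft))

        # Hint 2: the filesystem holding the db is (almost) full.
        stat = os.statvfs(db_dir)
        if stat.f_bavail * stat.f_frsize < 1024 * 1024:
            hints.append("less than 1 MB free on the filesystem holding the db")

        return hints or ["no obvious resource problem found; check db file "
                         "permissions and the db backend itself"]

This is only one way to make the eventual error message specific ("too many open files" vs "disk full" vs "check your db") rather than listing every possible reason to the user, which was the objection raised in the meeting.
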