15:00:05 #startmeeting Performance Team
15:00:06 Meeting started Tue Dec 1 15:00:05 2015 UTC and is due to finish in 60 minutes. The chair is DinaBelova. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:10 The meeting name has been set to 'performance_team'
15:00:19 hello everyone! :)
15:00:33 raise your hands who's around :)
15:00:41 o/
15:00:53 hi
15:00:55 klindgren, harlowja? :)
15:01:20 hehe, people sleeping :)
15:01:22 hi, it's Rob from Flex Ciii
15:01:28 mlgrneff - hey!
15:01:40 nice to see u here
15:01:51 Glad to finally make it.
15:02:01 so let's wait for a few more moments :) to allow people to appear
15:02:05 mlgrneff ;)
15:02:24 hi
15:03:02 ok, so let's probably start. As usual, with today's agenda
15:03:07 #link https://wiki.openstack.org/wiki/Meetings/Performance#Agenda_for_next_meeting
15:03:14 #topic Action Items
15:03:39 andreykurilin are you around?
15:03:46 we have an action item on you
15:03:49 sure
15:03:50 from last meeting
15:03:53 I'm here:)
15:03:54 hi
15:04:03 about when the load pattern tests runner will be available in Rally
15:04:24 I know there is already some change for it on review - any feeling when it'll be merged?
15:04:48 #link https://review.openstack.org/#/c/234195/
15:05:09 kun_huang or I may ask you as well - as a reviewer :)
15:05:10 o/
15:05:19 any bad/good feeling on this patch?
15:05:22 rvasilets___ o/
15:05:35 rvasilets is the owner of this patch )
15:05:51 he can give better estimates
15:05:52 well, his progress depends much on the reviewers :)
15:05:58 rvasilets___?
15:06:17 I remember this change was hot enough last time - any estimates on merging?
15:06:48 Patch is ready. Just waiting for review. It already works
15:07:08 ok, so I'm right and reviewing effort needs to be spent :)
15:07:11 General news: the patch by rvasilets adds a basic "stress runner" ability, but it doesn't decrease load on SLA failure
15:07:22 pboldin and kun have reviewed it. That is all
15:07:26 DinaBelova: I will update my review on the new patch set ;)
15:07:30 andreykurilin - well, we need to do the first steps first :)
15:07:39 kun_huang, ok, thanks sir :)
15:07:55 DinaBelova: I don't know anyone who works on the second step
15:08:06 so waiting for it with a great wish to see it merged as a first step - we can add a work item for the second one
15:08:10 andreykurilin ack
15:08:17 cool, so let's go further
15:08:33 so about the devstack-gate n-api-meta job
15:08:41 there is an email to openstack-operators
15:08:45 lemme check it
15:09:03 mriedem proposed one job to run metadata as well
15:10:03 hm, cannot find the email in the archives
15:10:07 I'll check it later
15:10:14 so anyway, the action item is done
15:10:18 neutron large ops
15:10:27 mriedem - yep, thank you sir
15:10:43 and about my action items - both done
15:11:08 really encouraging you guys to review https://review.openstack.org/#/c/251343/
15:11:12 and the whole related chain
15:11:27 andreykurilin, harlowja, kun_huang - please feel free to leave comments :)
15:11:58 ok, cool, so that's it for action items
15:12:01 got it
15:12:08 any questions regarding this topic?
15:12:16 my todo list is extended :) will review your patches a little bit later
15:12:21 andreykurilin ;)
15:12:26 cool, thanks andreykurilin kun_huang
15:12:31 let's move forward
15:12:32 #topic Nova-conductor performance issues
15:12:40 ok, so we have some news on that front :)
15:13:05 klindgren has sent me the results of the following patch
15:13:08 #link http://paste.openstack.org/show/480426/
15:13:20 it's dumping the conductor workers' info once a minute
15:13:37 visual results may be found here
15:13:46 #link https://drive.google.com/a/mirantis.com/file/d/0ByRtVrZu5ifzUkZTQVZzMERYQWc/view
15:14:04 and today dims and I had a chance to take a look
15:14:13 opening it...
15:14:18 frankly speaking it's not super obvious...
15:14:25 there are dumps of 2 workers there
15:14:40 i suspect that some of the reds are related to the bug we have observed in MOS as well - https://bugs.launchpad.net/mos/+bug/1380220 - this is related to the eventlet red svg boxes (that can be seen on the first picture, from the conductor worker with pid 8401) - we did not observe this bug in the community before, so it looks like we need to file it upstream as well (related to the heartbeats, probably)
15:14:40 Launchpad bug 1380220 in Mirantis OpenStack "OpenStack services excessively poll socket events when oslo.messaging is used" [Medium,Triaged] - Assigned to MOS QA Team (mos-qa)
15:15:04 we have seen the same behavior as the 8401 worker on some of the MOS installations
15:15:20 do we know what DB driver they are using for the nova-conductor?
15:15:55 johnthetubaguy - it's not listed in https://etherpad.openstack.org/p/remote-conductor-performance
15:15:56 hm...
15:16:01 I hope klindgren will appear :)
15:16:08 mysql-python i think
15:16:09 I know it's a bit early for him now
15:16:12 would be good to check if it's pymysql
15:16:17 b/c it's older oslo.db, pre-pymysql
15:16:23 johnthetubaguy: i don't think it is
15:16:30 mriedem: ah, that is what I was wondering
15:16:33 mriedem: they could still be running it with the older one, right?
15:16:36 it's just the default changed..
15:16:40 oslo.db==1.7.1, MySQL-python==1.2.3 (kilo reqs are: oslo.db<1.8.0,>=1.7.0)
15:16:41 ah, stop-stop
15:16:42 oslo.db==1.7.1, MySQL-python==1.2.3 (kilo reqs are: oslo.db<1.8.0,>=1.7.0)
15:16:45 johnthetubaguy ^^
15:16:50 johnthetubaguy: has anyone deployed pymysql yet?
15:16:54 johnthetubaguy: either way, does that impact the messaging performance?
15:16:55 mriedem, yep, thanks
15:17:20 dansmith: it might impact eventlet, which is probably why people are asking
15:17:29 since mysql-python doesn't support eventlet right?
15:17:34 dansmith: not sure what you mean by messaging performance
15:17:46 mriedem: yeah, each DB call locks up the whole thread
15:17:55 mriedem: it will affect eventlet, but the hotspots are banging hard on rabbit sockets, which doesn't seem like it would be related to the db driver
15:17:59 but the 8401 worker is still having the https://bugs.launchpad.net/mos/+bug/1380220 issue - but in fact it should not influence the %CPU used
15:17:59 Launchpad bug 1380220 in Mirantis OpenStack "OpenStack services excessively poll socket events when oslo.messaging is used" [Medium,Triaged] - Assigned to MOS QA Team (mos-qa)
15:18:04 and eventlet lets each one in turn do its DB call, before letting it process the response, I am told
15:18:10 the interesting moment is with 8402
15:18:37 lots of RabbitMQ-related timeouts
15:18:43 dansmith: ah, so my head is mush today, and I can't even open the files somehow
15:18:58 johnthetubaguy: right so what DinaBelova is talking about right now ... doesn't seem db driver related to me
15:19:05 dansmith - yep..
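(Editor's note: for readers following the driver discussion above — in Kilo-era oslo.db the MySQL driver is selected through the SQLAlchemy connection URL, so checking or switching a deployment to PyMySQL amounts to roughly the following. The hostname and credentials are placeholders, not values from klindgren's deployment.)

```ini
# /etc/nova/nova.conf — hypothetical sketch; host/credentials are placeholders
[database]
# A bare mysql:// URL resolves to the C MySQL-python driver, whose
# blocking calls lock up the whole eventlet-based process during queries:
# connection = mysql://nova:NOVA_DBPASS@controller/nova

# The pure-Python PyMySQL driver can be monkey-patched by eventlet,
# so a DB call yields to other green threads instead of blocking them:
connection = mysql+pymysql://nova:NOVA_DBPASS@controller/nova
```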
15:19:20 johnthetubaguy: and that's what I observed when looking at their profile traces a couple weeks ago.. something seems to be banging really hard on rabbit
15:19:27 and the only feeling I have now - to check that their CPU load is really related to RabbitMQ
15:19:31 johnthetubaguy: I thought maybe it was heartbeats or something, but they say that's disabled
15:19:39 so we need more dumps + top screens
15:19:39 DinaBelova: ++
15:19:52 if yes
15:19:56 one possible variant to fix it is a rabbitmq upgrade 0_o
15:20:00 or some tcp / whatever tuning...
15:20:11 as we simply see RabbitMQ waiting for reading things on the wire - and that's it
15:20:35 if all their CPU issues are about this - well, it'll be another (one more) RabbitMQ story
15:20:41 they have RabbitMQ 3.3.x
15:20:54 dansmith: ah, interesting, that does seem separate, unless eventlet is making it lie
15:21:10 johnthetubaguy - yep, sir
15:21:26 so I'll ask klindgren to make more dumps for more workers and for a longer time + include tops for the same moments
15:21:36 dansmith: I know belliott hit some issues with DB locking sending the elapsed times crazy, but I don't remember the details now
15:21:41 johnthetubaguy: yeah, eventlet could be making it lie for sure, I'm just not sure that the db driver could be making the numbers look like they're elsewhere
15:21:46 dansmith: didn't they see a gain when turning off ssl too?
15:21:52 mriedem: a small gain
15:22:02 mriedem: the first big gain was because they mistyped the config :/
15:22:07 #action DinaBelova klindgren more dumps for more workers and for a longer time + include tops for the same moments to ensure we see the same RabbitMQ related reds at the same time the conductors' CPU is going crazy
15:22:30 mriedem - without ssl it was just a little drop
15:22:31 dansmith: oh, wait, this does sound like what brian found, you want to reduce the number of eventlet works, and it stops thrashing the hub
15:22:50 johnthetubaguy: s/works/workers/ ?
15:23:11 yeah, sorry
15:23:14 workers
15:23:18 johnthetubaguy: can you explain more?
15:23:22 well, greenlet threads I mean
15:23:28 \o
15:23:36 so I think eventlet got very busy scheduling between lots of active threads
15:23:47 harlowja_at_home o/
15:23:50 so we turned down the number of workers (this was on the scheduler, rather than the conductor)
15:24:08 and we found it was better at pushing through DB queries, when using mysql-python
15:24:18 hmm
15:24:31 scheduler workers?
15:24:33 johnthetubaguy: that means more queuing in rabbit instead of queuing in memory on the conductor itself?
15:24:39 mriedem: greenlet workers
15:24:44 oh
15:24:46 johnthetubaguy - interesting, could you please write up your proposal in https://etherpad.openstack.org/p/remote-conductor-performance somewhere
15:24:59 well, more waiting to be restored, while the thread is busy doing DB stuff
15:25:13 johnthetubaguy: well, what I mean is,
15:25:16 as it lets all the DB stuff happen before letting the python code process the response, or something like that?
15:25:17 johnthetubaguy - in fact, according to the cProfile data, DB operations were not so busy
15:25:29 johnthetubaguy: we don't dequeue a thousand things from rabbit and then try to balance them even though we can only do one at a time
15:25:40 but who knows
15:25:41 DinaBelova: right, that's fine
15:25:48 DinaBelova: they wouldn't in this case johnthetubaguy is talking about
15:26:10 dansmith - yeah, i just understood it
15:26:16 dansmith thanks
15:26:21 it's more of a starvation issue, as I understood it
15:26:37 yeah, I guess I can see that
15:26:42 the thing is,
15:26:50 they don't have much if any real db traffic needing to be services
15:26:53 er, serviced
15:27:05 so I'm not sure why there would be a pile of requests needing a pile of threads
15:27:13 basically, just periodics and service checkins
15:27:19 their cloud is otherwise mostly idle
15:27:21 they have lots of nova metadata requests
15:27:31 isn't cells always syncing up too?
15:27:34 due to periodic puppet scripts running
15:27:44 mriedem: nova via conductor though
15:27:49 DinaBelova: that's true, forgot about those
15:28:00 johnthetubaguy: ok
15:28:00 and they sure have the cache as well, but metadata is periodically knocking the conductor
15:28:19 which is why we talked about turning on n-api-meta in one of the large ops jobs
15:28:19 so anyway, seems worth a try
15:28:20 honestly, those service updates are all blocking DB calls, but they should be quick-ish though
15:28:31 mriedem precisely
15:29:00 johnthetubaguy - could you please add your idea to https://etherpad.openstack.org/p/remote-conductor-performance just to have it written up?
15:29:02 ah... n-api-meta uses conductor... I never quite realised that
15:29:09 DinaBelova: will do
15:29:13 johnthetubaguy: yeah, so you can have a db-less compute node
15:29:15 johnthetubaguy - thank you sir
15:29:21 so yeah
15:29:25 we need more data!
15:29:35 will ping klindgren after the meeting :)
15:29:51 and I guess we may leave this topic for a while
15:29:56 let's move forward
15:29:57 dansmith: yeah, only just made that connection, oops
15:30:06 #topic Some hardware to reproduce the issues
15:30:14 kun_huang - your topic, sir
15:30:24 oh
15:30:43 since we are talking about performance issues every day
15:30:55 could you please explain what you mean here? do you have the HW or do you want to have some?
15:30:58 :)
15:31:14 I have some
15:31:23 and want to make good use of them
15:31:38 kun_huang wow, that will be simply perfect
15:31:55 and that will make debugging possible issues easier, etc
15:32:14 so my first question, who needs those first
15:32:28 I know intel&rackspace have a public lab in the U.S.
15:32:59 Has everyone used their resources?
15:33:27 kun_huang - they have a pretty big env, yes, but this lab has a competing schedule between people who want to use it afaik
15:34:22 we don't use it for now
15:34:35 it or any other big labs
15:34:42 DinaBelova: I'll apply for some resources from my company
15:34:53 at least, my boss supports this idea
15:35:03 kun_huang - that is very promising, thanks!
15:35:10 and I need to write some materials...
15:35:14 some paperwork
15:35:20 in case of success - could you please write up some instructions
15:35:22 oh yeah :)
15:35:31 kun_huang - nobody loves it :)
15:35:37 kun_huang thanks in advance!
15:36:16 kun_huang - the resources we're using inside mirantis sadly are for mirantis usage only... but we can extend the test plans based on people's opinions
15:36:29 I hope to solve the issue of placing these documents with the TC
15:36:39 and then we'll start feedback collection
15:36:43 from you and others
15:37:01 really hope to make this stuff clearer this week
15:37:10 kun_huang - once more - thanks for your effort
15:37:20 * regXboi wanders in late
15:37:26 regXboi o/
15:37:45 regXboi I PROMISE to send an email with the +1 hour to the meeting start time suggestion
15:37:55 I feel people are suffering
15:37:55 :)
15:37:56 okay, I'll keep this channel posted with any update
15:38:02 kun_huang thanks!
15:38:03 or if I need any help
15:38:12 sure, feel free to ping me
15:38:17 * harlowja_at_home is suffering from not enough coffee
15:38:25 harlowja_at_home :d
15:38:25 lol
15:38:27 #topic OSProfiler weekly update
15:38:35 good morning guys harlowja_at_home regXboi
15:38:37 k, so let's go to the profiler
15:38:46 * harlowja_at_home coffeeeeeee
15:38:57 DinaBelova: my problem is that I've got too many meetings stacking up on each other :(
15:39:05 * regXboi skims scrollback
15:39:07 regXboi, yes, sir :(
15:39:13 that's the issue
15:39:26 timeframes comfortable for both US and Europeans are overcrouded
15:39:32 crowded*
15:39:34 :(
15:39:44 ack
15:39:47 so going back to the osprofiler - harlowja_at_home - the chain https://review.openstack.org/#/c/251343/ is pretty much done
15:40:02 so I need reviews!
15:40:04 lol
15:40:05 cool beans, i'll check it out
15:40:15 and I need you to finish https://review.openstack.org/#/c/246116/ :)
15:40:18 it will be my mission for today :)
15:40:27 so I can play with ELK for 100% here :)
15:40:28 yes ma'am
15:40:30 harlowja_at_home ack
15:40:33 :)
15:40:37 u are playing with elk?
15:40:44 is that a thing people do in europe?
15:40:46 #action harlowja_at_home review https://review.openstack.org/#/c/251343/
15:40:50 harlowja_at_home :D
15:41:22 that's what tough Russian wives are doing in the meanwhile
15:41:26 lol
15:41:37 ;)
15:41:49 so speaking seriously - I want to continue working in this direction
15:41:58 dancing with elk (the new movie, based off dancing with wolves)
15:42:00 of adding more logging and analysing opportunities
15:42:08 :)
15:42:12 def
15:42:29 so I'm kindly asking you to polish your change
15:42:39 sure thing
15:42:46 and I'll be able to go with ELK here and make some experiments
15:42:49 (only if i get to dance with elk too)
15:42:50 harlowja_at_home thank you sir
15:42:52 :D
15:43:03 about spec news
15:43:07 DinaBelova: is there a plan to extend osprofiler deeper into what it tracks?:)
15:43:12 https://review.openstack.org/#/c/103825/ also btw, but dims had some questions there...
15:43:28 maybe boris can follow up on 103825 (or some other person?)
15:43:32 er s/plan/patch/
15:43:51 regXboi - not patch but plans :)
15:43:55 103825 is also somewhat ambiguous about whether it wants oslo to adopt osprofiler
15:44:03 it'd be nice if that was like stated (is that a goal?)
15:44:04 harlowja_at_home - indeed
15:44:14 harlowja_at_home yes, it is
15:44:27 Boris is currently communicating with dims about his concerns
15:44:30 k
15:44:37 as right now the development approach is a bit different
15:44:43 from what dims is asking about
15:45:05 the issue is that 2 years ago Boris got a -2 on his patch to oslo.messaging and oslo.db
15:45:08 ya
15:45:09 * regXboi wonders about decoration
15:45:12 with that profiling thing
15:45:22 DinaBelova, it's been boris's life goal to get that merged
15:45:29 before boris retires he might get it merged
15:45:29 regXboi - you can use decoration now everywhere already
15:45:37 harlowja_at_home :D
15:45:55 boris the old, lol
15:46:00 hopefully before then it will merge
15:46:01 DinaBelova: yes, but it's not mentioned that I could see in 103825
15:46:15 regXboi, hm, lemme check
15:46:19 it was there I believe
15:46:35 oh
15:46:37 * harlowja_at_home remembers boris trying decoration, people still complain about random crap (like oh decoration will add code... blah blah)
15:47:01 decoration can be a dependent add-on patch
15:47:02 anyways, let's work through these weird issues that people have, and finally get it in (i hope)
15:47:03 regXboi - it has disappeared
15:47:13 but folks are proposing decoration profiling in other projects
15:47:18 so it's silly not to have it here
15:47:28 regXboi - agreed
15:47:40 but - in order to get this merged
15:47:45 let's save that for a follow-on :)
15:47:56 #action boris-42 add information about ways of profiling to 103825
15:47:58 harlowja_at_home agreed
15:48:06 :D
15:48:08 keep it short and simple ( the OpenStack KISS :) )
15:48:16 :)
15:48:24 merging before boris retires would be superb too, lol
15:48:26 ok, so anything else here about osprofiler for now?
15:48:40 it needs a dancing with elk logo
15:48:43 :D
15:48:48 I like that
15:48:55 * harlowja_at_home can't draw though
15:48:57 and I'd want *that* patch :)
15:49:01 ok, so open discussion
15:49:02 #topic Open Discussion
15:49:11 and u can joke here :D
15:49:18 * harlowja_at_home i never joke
15:49:22 i'm always serious
15:49:27 harlowja_at_home I suspected that
15:49:28 :d
15:50:06 ok, so any topics to cover?
15:50:10 ideas to share?
15:50:12 https://en.wikipedia.org/wiki/Dances_with_Wolves (the dancing with wolves movie) btw
15:50:23 possible work items to add here? https://etherpad.openstack.org/p/perf-zoom-zoom
15:50:50 so about that rally upload results, public website thing
15:51:00 RobNeff - probably you can share something?
15:51:01 we discussed figuring out the ratio of controllers to compute nodes a few weeks ago, is that a topic worth discussing?
15:51:12 harlowja_at_home, yep, sir?
15:51:14 do people think we should try to do that, or save it for later...
15:51:18 manand - yes, it is
15:51:40 manand - we just have not collected much response yet
15:51:56 probably I'll need to refresh the discussion by pinging some of the operators directly
15:52:02 * harlowja_at_home would really like a way for people to get involved, uploading their useful results with some metadata, and periodically do this, so that we as a community can gather data about each other, and use it for trending, analysis...
15:52:08 harlowja_at_home - I think it's useful for sure
15:52:17 the only thing is that it's not only about rally
15:52:25 anything in fact may land there
15:52:26 Do you have a how-to on the Rally Upload yet?
15:52:46 `so about that rally upload results, public website thing` I like this idea
15:52:47 DinaBelova, sure, although kitty-kat pictures hopefully aren't uploaded
15:52:55 harlowja_at_home :D
15:53:09 RobNeff - we do not have this website yet
15:53:16 but we think it's a good idea
15:53:25 an important moment here
15:53:35 rally results mean nothing without the cloud topology shared
15:53:38 :(
15:53:51 so people need to be open enough to share some of the details
15:54:04 harlowja_at_home - do you think it'll be possible?
15:54:07 agreed the critical part is open-enough
15:54:24 i think we have to start by letting people upload what they can, and we can improve on uploading what is better
15:54:32 *with more metadata about their topology...
15:54:34 DinaBelova: did I forget to point you at https://etherpad.openstack.org/p/hyper-scale ?
15:54:42 regXboi - yep :)
15:54:43 and if I did - I'm sorry
15:54:44 will go through it
15:54:51 that's open to all to go look at and read
15:54:54 but initially i think we need to just get people to upload the basics, and as they get less 'scared' or whatever they can upload more info
15:55:00 #action DinaBelova go through https://etherpad.openstack.org/p/hyper-scale
15:55:02 the back half is neutron specific
15:55:13 regXboi thanks!
15:55:20 but I'm wondering if the front half would make sense as devref documentation *somewhere*
15:55:41 regXboi will take a look, if yes, it'll be really good to do that
15:55:56 harlowja_at_home - so about the website - are you the volunteer here? :)
15:56:07 DinaBelova: if you can suggest a *where*, I'm all ears
15:56:13 DinaBelova, ummm, errr
15:56:23 let me get back to u on that, ha
15:56:28 regXboi - well, we can grad docs team and shake it a bit to find that out
15:56:39 grab*
15:56:41 ack
15:56:42 :)
15:56:51 harlowja_at_home - ok
15:57:15 harlowja_at_home - simply I'm a bit busy with the profiler and conductor investigations now
15:57:26 (and elk dancing)
15:57:29 therefore right now personally I cannot work on that
15:57:31 yeah
15:57:31 np
15:57:42 so help is super appreciated
15:57:45 harlowja_at_home :)
15:57:49 understood
15:58:09 k, cool
15:58:12 anything else here?
15:58:33 thanks everyone for the hot and productive discussion!
15:58:34 klindgren dansmith mriedem DinaBelova: I have attempted to write up my ideas around executor_thread_pool_size in the etherpad: https://etherpad.openstack.org/p/remote-conductor-performance let me know if any of that is unclear.
15:58:46 johnthetubaguy thank you sir!
15:58:50 johnthetubaguy: cool
15:58:50 DinaBelova: I'm pinging somebody I know in the docs project right now
15:58:59 ok, cool
15:59:07 bye!
15:59:08 #endmeeting
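(Editor's note: the executor_thread_pool_size experiment johnthetubaguy wrote up in the etherpad boils down to a one-line config change on the conductor node. In Liberty-era oslo.messaging this option — previously named rpc_thread_pool_size — caps the pool of green threads servicing incoming RPC requests, which is the "turn down the number of workers" idea from the starvation discussion. The value below is illustrative only, not a recommendation from the meeting.)

```ini
# nova.conf on the conductor node — hypothetical sketch of the experiment
[DEFAULT]
# Size of the green-thread pool the oslo.messaging executor uses for
# incoming RPC calls (default 64 in this era). Lowering it reduces
# eventlet scheduling churn when each DB call blocks the whole process
# under the MySQL-python driver.
executor_thread_pool_size = 16
```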