15:00:02 #startmeeting Performance Team
15:00:03 Meeting started Tue Nov 17 15:00:02 2015 UTC and is due to finish in 60 minutes. The chair is DinaBelova. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:07 The meeting name has been set to 'performance_team'
15:00:17 hello folks!
15:00:22 o/
15:00:24 Hi!
15:00:26 o/
15:00:33 good evening :)
15:00:39 o/
15:00:42 kun_huang - good evening sir
15:00:49 so today's agenda
15:00:52 hi
15:00:56 #link https://wiki.openstack.org/wiki/Meetings/Performance#Agenda_for_next_meeting
15:01:16 there was a complaint last time that there was not enough time to fill it
15:01:27 although this time it looks not so big as well :)
15:01:35 so let's start with action items
15:01:41 #topic Action Items
15:01:50 last time we had two action items
15:02:18 #1 was about filling the etherpad https://etherpad.openstack.org/p/rally_scenarios_list with information about Rally scenarios used
15:02:24 in your companies :)
15:02:40 well, it looks like nothing has changed since the previous meeting
15:02:41 :(
15:02:57 I really hoped augiemena3, Kristian_, patrykw_ would fill it
15:03:07 although I do not see them here today
15:03:24 I know Kevin had shared a topic about rally and neutron's control plane benchmarking
15:03:36 kun_huang - oh, that's cool
15:03:43 do you have a link to that info?
15:04:04 a topic in tokyo, wait a minute
15:04:07 Dina - my bad, should have filled in with some info
15:04:18 * mriedem joins late
15:04:27 * dims waves hi
15:04:32 #link https://www.youtube.com/watch?v=a0qlsH1hoKs
15:04:37 #action everyone (who uses Rally for OpenStack testing inside your companies) fill etherpad https://etherpad.openstack.org/p/rally_scenarios_list with used scenarios
15:04:44 AugieMena - :)
15:04:55 please spend some time filling in this etherpad
15:05:11 if we want to create a standard it'll be useful to collect some preliminary info
15:05:15 mriedem, dims o/
15:05:25 @kun_huang thank you sir
15:05:31 lemme take a quick look
15:05:38 ah, that's a video
15:05:43 so after the meeting :)
15:05:58 #action DinaBelova go through the https://www.youtube.com/watch?v=a0qlsH1hoKs
15:05:58 no problem
15:06:05 ok, cool
15:06:18 so one more action item was on Kristian_
15:06:40 he promised to collect the information about Rally blanks inside ATT
15:06:51 it looks like he was not able to join us today
15:07:13 #action DinaBelova ping Kristian_ about internal ATT Rally feedback gathering
15:07:31 so it looks like we went through the action items :)
15:07:46 just one more time - please fill https://etherpad.openstack.org/p/rally_scenarios_list
15:08:05 that will be super useful for future recommendations / methodologies creation
15:08:24 I guess we may go to the next topic
15:08:29 #topic Nova-conductor performance issues
15:08:39 #link https://etherpad.openstack.org/p/remote-conductor-performance
15:08:51 DinaBelova: can we return back to the previous topic?
15:09:01 boris-42 heh :)
15:09:20 I dunno how to make that easy using the bot controls
15:09:27 #undo
15:09:33 thanks!
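For reference while filling in the scenarios etherpad: an entry is only really reusable when it includes the scenario name together with its arguments, runner and context (a point made again later in the discussion). A minimal sketch of one such entry in Rally's JSON task format; the scenario name is a real Rally scenario, while the flavor and image names are placeholders to adapt:

    {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    "flavor": {"name": "m1.tiny"},
                    "image": {"name": "cirros-0.3.4-x86_64"}
                },
                "runner": {"type": "constant", "times": 100, "concurrency": 10},
                "context": {"users": {"tenants": 2, "users_per_tenant": 2}},
                "sla": {"failure_rate": {"max": 0}}
            }
        ]
    }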
15:09:35 #undo
15:09:35 Removing item from minutes:
15:09:43 boris-42 - feel free
15:09:46 dansmith: nice
15:10:06 DinaBelova: so we (Rally team) recently started working on the certification task
15:10:20 #link https://github.com/openstack/rally/tree/master/certification/openstack
15:10:29 boris-42 - sadly I do not have much info about this initiative
15:10:34 lemme take a quick look
15:10:36 DinaBelova: which is a much better way to share your experience
15:10:37 * bauzas waves
15:11:00 DinaBelova: than just using etherpads
15:11:13 boris-42 - that may be cool
15:11:24 so it's some kind of task for cloud validation
15:11:25 DinaBelova: so basically it's a single task that accepts a few arguments about the cloud and should generate proper load and test everything that you specified
15:11:35 boris-42 - a-ha, cool
15:12:10 DinaBelova: so basically it's an executable etherpad
15:12:17 ok, so that may be very useful for this purpose
15:12:19 DinaBelova: that you are trying to collect
15:12:19 thank you sir
15:12:31 we may definitely use it
15:12:33 so rally as defcore?
15:12:42 boris-42: Has the mirantis team used this feature?
15:12:53 mriedem: so nope
15:12:57 mriedem: I'm trying to wrap my head around that :)
15:13:06 #info we may use https://github.com/openstack/rally/tree/master/certification/openstack to collect information about Rally scenarios used in various companies
15:13:28 mriedem: it's a pain in the neck to use rally to validate OpenStack
15:13:31 kun_huang - I did not hear about this frankly speaking
15:13:41 mriedem: because you need to create such a task and it usually takes 2-3 weeks
15:13:45 kun_huang - but as boris-42 said this initiative is fairly new
15:14:07 mriedem: so we decided to create it once and avoid duplication of effort
15:14:17 boris-42 - very useful, thank you sir
15:14:18 mriedem: our goal is not to say is it openstack or not*
15:14:37 kun_huang: so we just recently made it
15:14:56 kun_huang: I know about only 1 usage and there were a bunch of issues that I am going to address soon
15:15:02 ok, maybe the readme there needs more detail
15:15:16 ok, very cool. thanks boris-42! something else to mention here?
15:15:20 mriedem: what would you like to see there
15:15:25 mriedem: ?
15:15:28 what it is and what it's used for
15:15:31 boris-42 why is it called "certification" then? :)
15:15:33 note that i'm not a rally user
15:15:41 right, 'certification' makes me think defcore
15:15:44 :)
15:15:46 y
15:16:03 would someone provide a one-liner on what the purpose of it is?
15:16:32 I would like to say that it is some kind of task template
15:16:49 tasks template
15:16:51 AugieMena - a single task to check the whole OpenStack cloud. And you may fill it with all the scenarios you like
15:16:58 kun_huang - is that accurate?
15:16:59 AugieMena: just that it will put proper load and SLA on your cloud
15:17:07 I guess to run one big task against a cloud and to see measures of different resources
15:17:14 dims: nope not scenarios
15:17:19 DinaBelova: nope not scenarios
15:17:21 DinaBelova: my understanding
15:17:21 dims: sorry
15:17:56 boris-42 :)
15:17:57 so how will it help make it easier to gather info about what Rally scenarios various companies are using?
15:18:12 It's the single task that contains a bunch of subtasks that will test the specified services with proper load (based on size & quality of cloud) and proper SLA
15:18:36 AugieMena - boris-42 just proposed to create these lists in the form of these "certification" tasks to be able to run them
15:18:57 OK, I see
15:19:11 ack!
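For context on the certification task discussed above: it is meant to be run like any other Rally task against an already-registered deployment, with the cloud-specific parameters passed in as task arguments. A rough sketch of the invocation; the exact file names and the set of supported arguments live in the linked certification/openstack directory, so check there rather than trusting this line:

    rally task start certification/openstack/task.yaml \
        --task-args-file my-cloud-parameters.yaml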
15:19:12 AugieMena: a separate scenario doesn't mean anything
15:19:32 AugieMena: without its arguments, context, runner....
15:19:45 boris-42 - moving forward? :)
15:19:55 DinaBelova: there is still one question
15:20:07 boris-42 - go ahead :)
15:20:13 dims: so certification is picked because it's like "Rally certification of your cloud"
15:20:32 boris-42: DinaBelova pls make a note to describe rally's certification work, blogs or slides... I will help to understand
15:20:36 dims: it certifies the scalability & performance of everything..
15:20:51 * DinaBelova guesses boris-42 meant Dina
15:21:11 boris-42 - ok, understand the need to provide specifics about arguments used in the scenarios
15:21:17 #idea describe rally's certification work, blogs or slides - kun_huang can help with it
15:21:39 boris-42 : i understand, some link to the official certification activities would help evangelize this better. you will get this question asked again and again :)
15:22:10 dims: )
15:22:23 dims - yep, documentation is everything here :)
15:22:31 dims: honestly we can rename this directory to anything
15:22:47 dims: but personally I don't like the word validation because validation is what Tempest is doing
15:22:48 =)
15:23:04 )
15:23:17 or not doing)
15:23:37 rvasilets___ :)
15:23:40 ok, anything else here?
15:24:04 ok, moving forward
15:24:18 #topic Nova-conductor performance issues
15:24:29 ok, so some historical info
15:25:01 during the Tokyo summit several operators including GoDaddy (ping klindgren) mentioned issues observed around nova-conductor
15:25:08 #link https://etherpad.openstack.org/p/remote-conductor-performance
15:25:29 Rackspace mentioned it as well
15:25:46 * klindgren waves
15:25:56 so it was decided it'd be a cool idea to investigate this issue
15:26:10 currently all known info is collected in the etherpad ^^
15:26:29 SpamapS has started the investigation of the issue on a local lab
15:26:55 afaik he had to switch to something else yesterday, so not sure if anything new has happened
15:27:17 i'd be interested to know if moving to oslo.db >= 1.12 helps anything
15:27:30 why would it?
15:27:31 also, is everyone still using mysqldb-python in these tests?
15:27:38 dansmith: right
15:27:43 b/c oslo.db < 1.12
15:27:53 mriedem: is that a yes, or agreement with the question?
15:27:54 rpodolyaka1: oslo.db 1.12 switched to pymysql
15:27:57 oslo.db >= 1.12 does not mean they use pymysql
15:27:59 dansmith: that's agreement
15:28:00 and yes
15:28:04 it's only used in oslo.db tests
15:28:14 it's up to the operator to specify the connection string
15:28:17 right
15:28:20 ooo
15:28:21 you may use mysql-python as well
15:28:32 have we deprecated mysql-python?
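To make the MySQL-Python vs. PyMySQL point concrete: with oslo.db >= 1.12 the driver is chosen purely by the SQLAlchemy connection string in each service's configuration, so switching is an operator-side change along these lines (credentials, host and database name are placeholders):

    [database]
    # MySQL-Python / MySQLdb (the old default, a C driver that blocks eventlet)
    #connection = mysql://nova:secret@10.0.0.10/nova
    # PyMySQL (pure Python, cooperates with eventlet)
    connection = mysql+pymysql://nova:secret@10.0.0.10/nova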
15:28:51 and afaik Rackspace fixed this issue (or probably one looking like this) by moving back to MySQL-Python
15:29:02 mriedem: I think we actually run the unit tests for it in oslo.db
15:29:07 DinaBelova: I think you're conflating two things there
15:29:09 rax has an out of tree change (that's also a DNM in nova) for direct sql for some db APIs
15:29:24 DinaBelova: rackspace went back to an out of tree db api
15:29:31 this is what rax has https://review.openstack.org/#/c/243822/
15:29:35 dansmith - probably, I just remember a conversation at the Tokyo summit about an issue like that
15:29:41 essentially dropping sqlalchemy for some calls
15:29:54 alaski - a-ha, thank you sir
15:30:08 thanks dansmith, mriedem
15:30:29 it'd also be good to know what the conductor/compute ratios are
15:30:46 klindgren ^^
15:31:01 there is some info in the etherpad
15:31:16 mriedem: e.g. https://review.openstack.org/#/c/246198/ , there is a separate gate job for mysql-python
15:31:37 rpodolyaka1: so why isn't that deprecated? we want people to move to pymysql don't we?
15:31:39 that being said, we are using mysqldb
15:31:53 mriedem - yeah, conductor service with 20 workers per server (2 servers, 16 cores per server), 250 HV in the cell
15:32:00 Do you want to see if oslo.db >= 1.12 works better? Or if pymysql works better
15:32:08 klindgren: pymysql
15:32:10 right now 20 computes * 3 servers
15:32:20 mriedem: we let them decide which one they want to use
15:32:20 2 servers are 16 core boxes, one is an 8 core box
15:32:22 but that requires at least oslo.db >= 1.12 if i'm understanding the change history correctly
15:32:40 klindgren: so 2.5 conductor boxes for 20 computes?
15:32:40 20 conductors*
15:32:43 rpodolyaka1: yeah but mysql-python is not python 3 compliant and has known issues with eventlet right?
15:33:00 for 250 computes
15:33:07 klindgren: that's waaaay low
15:33:10 mriedem: right, but as rax experience shows, pymysql does not shine on busy clouds :(
15:33:27 rpodolyaka1: I don't think that's what their experience shows
15:33:33 anyway, are we sure that's a bottleneck?
15:33:35 rpodolyaka1: i think those are unrelated
15:33:36 \o
15:33:59 rpodolyaka1 - not yet, sir. Investigation in progress, we're just collecting ideas of where to look
15:34:04 rpodolyaka1: rax hasn't tried pymysql yet. it's on our backlog to test but we don't have any data on it
15:34:05 harlowja_at_home - morning sir!
15:34:06 rpodolyaka1: rax uses mysqldb b/c their direct-to-mysql change uses mysql-python
15:34:07 https://review.openstack.org/#/c/243822/
15:34:10 rpodolyaka1, I am getting Model server went away errors randomly from nova-computes
15:34:13 DinaBelova, hi! :)
15:35:07 alaski: mriedem: ah, I must have confused them with someone else then. I was pretty sure someone blamed pymysql for causing the load on nova-conductors. and that mysql-python was a solution
15:35:23 SpamapS wanted to check if switching to some other JSON lib will help, and I'm going to work on this issue as well (probably start tomorrow)
15:35:24 rpodolyaka1: I'm pretty sure not
15:35:27 dansmith, what would you recommend as the number of servers dedicated to nova-conductor to nova-compute ratio?
15:35:28 ok
15:35:50 DinaBelova: unless you're on python 2.6, i don't know that the json change in oslo.serialization will make a difference
15:35:55 rpodolyaka1: we blame sqlalchemy right now :) but are hopeful that pymysql will be better
15:36:01 haha
15:36:05 klindgren: it all depends on your environment and your load.. but I just want to clarify.. above you seemed to confuse a few things
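For reference, the per-host worker counts being discussed are set in nova.conf on each conductor box; klindgren's 20-workers-per-server layout corresponds roughly to the snippet below (note that 20 workers on an 8- or 16-core box is more worker processes than cores, which comes up in the discussion):

    [conductor]
    # number of nova-conductor worker processes on this host;
    # if unset it defaults to the number of CPUs
    workers = 20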
15:36:16 alaski lol
15:36:16 klindgren: 250 computes and how many physical conductor machines running how many workers?
15:36:26 mriedem - well, SpamapS is experimenting here, I'll probably start with some meaningful profiling
15:36:33 ++
15:36:33 if I will be able to reproduce it
15:36:44 3 physical boxes, one server has 8 cores, the others have 16
15:36:49 running 20 workers each
15:36:55 klindgren: so three total boxes for 250 computes, right?
15:37:00 yep
15:37:05 klindgren: right, so that's insanely low, IMHO
15:37:20 plus, you have $workers > ncpu on those conductor boxes,
15:37:22 klindgren: and the answer is: keep increasing conductor boxes until the load is manageable :)
15:37:36 that's a pretty shit answer
15:37:37 mriedem: well, with mysqldb you have to have that
15:37:40 imho
15:37:41 klindgren :D
15:37:41 hah
15:38:01 klindgren: so run some conductors on every compute if you want
15:38:18 dansmith: although local conductor is now deprecated
15:38:22 klindgren: the load is all the same, conductor just concentrates it on a much smaller number of boxes if you choose it to be small
15:38:31 dansmith - heh, afair conductors were created to avoid local conductoring?
15:38:31 mriedem: sure, but they can still run conductor on compute if they don't want upgrades to work
15:38:40 fyi this environment has always been remote conductor
15:38:46 and load only started being an issue
15:38:48 klindgren: can you run nova-conductor under cProfile on one of the nodes? We haven't seen anything like that on our 200-compute-node deployments
15:38:49 when we went to kilo
15:38:56 klindgren: are you still on kilo?
15:39:05 "still"
15:39:10 liberty *just* came out
15:39:21 klindgren: that's an important detail, maybe you're just experiencing the load of the flavor migrations
15:39:21 hmmm, flavor migrations in kilo maybe?
15:39:31 klindgren: that's a hugely important data point :)
15:39:54 we ran all the flavor migration commands after upgrade
15:40:03 klindgren: right, but there is still overhead
15:40:03 btw all of this is in the etherpad
15:40:16 klindgren: and it turned out to be higher than we expected even after the migrations were done
15:40:24 even after migrations we still saw overhead as well
15:40:24 dansmith, mriedem - yep, these details are in the etherpad as well :)
15:40:24 klindgren: but it's gone in liberty because the migration is complete
15:40:36 DinaBelova: I've read the etherpad and didn't get the impression this was just a kilo thing
15:40:49 dansmith, ok
15:41:20 DinaBelova: klindgren: i don't see anything about flavor migrations in the etherpad
15:41:36 mriedem - I meant a kilo-based cloud
15:41:41 DinaBelova: I see that they say they started getting alarms after kilo, but the rest of the text makes it sound like this has always been a problem and just now tipped over the edge
15:42:24 yeah, i just added the notes on the flavor migrations
15:42:25 klindgren: so I think you should add some more capacity for conductors until you move to liberty, at which time you'll probably be able to drop it back down
15:42:31 mriedem thanks!
15:42:44 fyi on the flavor migrations for kilo upgrade https://wiki.openstack.org/wiki/ReleaseNotes/Kilo#Upgrade_Notes_2
15:42:55 klindgren: going forward, we have some better machinery to help us avoid the continued overhead once everything is upgraded
15:43:04 ok, so any other points for investigators to look at (except flavor migrations and JSON libs)? // not mentioning some profiling to find the real bottleneck //
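Two of the suggestions above, spelled out as commands. Treat these as sketches: the cProfile run assumes a single conductor worker so the profile is not scattered across forked processes, and the flavor-data migration command is the one the Kilo upgrade notes linked above refer to (check nova-manage db --help for its exact options):

    # profile one conductor process (set [conductor] workers = 1 first)
    python -m cProfile -o conductor.pstats $(which nova-conductor) \
        --config-file /etc/nova/nova.conf

    # Kilo flavor-data migration, run after the upgrade
    nova-manage db migrate_flavor_data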
15:43:13 klindgren: and also, the flavor migration was about the largest migration we could have done, so it almost can't be worse in the future
15:43:23 DinaBelova: I don't think there is a bottleneck to find, it sounds like
15:43:31 DinaBelova: i'm always curious about rogue periodic tasks in the compute nodes hitting the db too often and pulling too many instances
15:43:35 DinaBelova: I think this is likely due to flavor migrations we were doing in kilo and nothing more
15:43:49 DinaBelova: conductor-specific bottlenecks I mean
15:43:54 but roge periodic tasks pulling too much data could also mean you need to purge your db
15:43:58 *rogue
15:44:44 dansmith: not conductor specific bottlenecks, but there are db bottlenecks which conductor amplifies
15:44:49 dansmith - that may be a very probable answer, I just want to reproduce the same situation klindgren is seeing, confirm that it's about flavor migrations, and check everything is ok on liberty
15:44:52 alaski: yes, totes
15:45:02 that is also an answer
15:45:23 not to mention that something interesting may be found in what alaski has mentioned
15:45:50 ok, cool.
15:45:56 I shouldn't have said "no bottleneck to find" I meant that I think the kilo-centric bit that is the immediate problem is flavor migrations
15:46:13 dansmith, yep, gotcha
15:46:23 I'm also amazed that they _were_ fine with 2.5 conductor boxes for 250 computes
15:46:51 is it possible to turn off flavor migrations under kilo to see if things get better?
15:47:02 klindgren: not really, no
15:47:11 not configurable, it happens in the code
15:47:16 klindgren, suffer :)
15:47:20 klindgren: we can have a back alley chat about some hacking you can do if you want
15:47:34 dansmith, can you provide what in your mind is an acceptable conductor -> compute ratio?
15:47:53 klindgren: and if I may say, the next time you hit some spike when you roll to a release, please come to the nova channel and raise it :)
15:48:14 dansmith - I think if klindgren is ok with trying some code hacking, I suppose this session will be very useful
15:48:30 klindgren: as I said, there is no magic number.. 1% is much lower than I would have expected would be reasonable for anyone, but you're proving it's doable, which also points to there being no magic number :)
15:49:18 #idea check if the issue GoDaddy is facing is related to the flavor migrations or just to the too low conductor/compute ratio
15:49:42 klindgren - are you interested in the hacking session dansmith has proposed?
15:49:45 I think it's also worth pointing out,
15:50:01 since my answer was "shit" about having enough boxes to handle the load,
15:50:07 maybe running a GMR ?
15:50:11 that conductor separate from computes is mostly an upgrade story
15:50:30 just out of curiosity since there isn't a magic number, has any bunch of companies shared their conductor ratios with the world, then we can derive a 'suggested' number from those shared values...?
15:50:37 technically 2 -> 250 was working as well. Adding another physical box didn't actually fix anything, it just resulted in burning cpu on that server as well.
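On the GMR suggestion above: nova services carry oslo Guru Meditation Report support, which dumps thread and greenlet stacks plus configuration to the service's stderr/log when the process receives the report signal, so it can show what busy conductor workers are actually doing. A hedged sketch; on the Kilo/Liberty releases being discussed the signal is SIGUSR1 (later releases moved it to SIGUSR2):

    # ask one nova-conductor worker for a Guru Meditation Report
    kill -USR1 $(pgrep -f nova-conductor | head -1)
    # on newer releases:
    # kill -USR2 <pid>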
15:50:39 if you don't care about that, you can run a few conductor workers on every compute and distribute the load everywhere
15:51:08 dansmith thanks for the note
15:51:11 I mean if local conductor is deprecated - and remote conductor is an upgrade story - people are going to need to know a conductor to compute ratio that is "safe"
15:51:16 harlowja_at_home - did not hear about that :(
15:51:33 klindgren: 100% is safe
15:51:41 otherwise people are going to be blowing up their cloud
15:51:45 :-/
15:51:54 klindgren probably we need to write an email to the operators email list
15:52:04 klindgren: let me ask you a question.. how many api nodes should everyone run?
15:52:08 and try to find what ratio other folks have
15:52:12 then un-deprecate local-conductor because obviously remote-conductor is planned out
15:52:22 is not well planned out*
15:52:41 DinaBelova, i'd like that
15:53:03 #action DinaBelova klindgren compose an email to the operators list and find out what conductors/computes ratio is used
15:53:23 can you do rolling upgrades with cells though? i thought not.
15:53:31 dansmith - well, I guess there is no right answer here :)
15:53:56 DinaBelova: right, that's what I'm trying to get at.. if I never create/destroy nodes, I can use one api worker for 250 computes :)
15:54:03 s/nodes/instances/
15:54:06 dansmith :D
15:54:16 it's almost always been possible in the past to run n-1 in cells
15:54:32 while we are on the subject of ratios, is this something we should look at across other components such as network node to compute ratio etc.,
15:54:37 klindgren: just so you know, we think that's crazy :)
15:54:49 klindgren: that has been by chance though. there's no code to ensure it works
15:54:49 manand - yep, great note
15:54:51 whether or not it works :)
15:55:14 ok, folks, we've spent much time on this item
15:55:15 reminds me of the rpc compat bug in the cells code i saw last week...
15:55:22 yeah
15:55:26 it losos like we'll return to it back after the meeting
15:55:34 looks*
15:55:54 so let's move forward, as we're running out of time
15:55:59 #topic OSProfiler weekly update
15:56:34 ok, so last time we agreed that if we want to use osprofiler for tracing/profiling needs we need to #1 fix it and #2 make it better
15:56:47 harlowja_at_home has created an etherpad
15:56:49 #link https://etherpad.openstack.org/p/perf-zoom-zoom
15:56:52 i put some code up for an idea of a different notifier that just uses files!! :-P
15:56:58 more zoom zoom
15:56:58 lol
15:57:05 harlowja_at_home - yep, saw it
15:57:30 and I left a comment - lemme create a change regarding https://github.com/openstack/osprofiler/blob/master/doc/specs/in-progress/multi_backend_support.rst first
15:57:43 not to have two drivers for backward compatibility
15:58:00 so in short - I was able to make osprofiler work ok with ceilometer events
15:58:06 cool
15:58:12 it's limited now and some ceilometer work needs to be done
15:58:21 one of the Ceilo devs will work on it
15:58:31 and I've moved to the https://github.com/openstack/osprofiler/blob/master/doc/specs/in-progress/multi_backend_support.rst task
15:58:47 harlowja_at_home - I'll ping you once I push the change to gerrit
15:58:51 kk
15:58:53 thx
15:58:56 so you'll be able to rebase your code
15:58:58 np
15:59:02 sounds good to me
15:59:23 boris-42 - did you have a chance to update the osprofiler -> oslo spec?
15:59:31 for mitaka?
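For background on the OSProfiler items above: a service opts in by initialising the profiler with an HMAC key and wrapping interesting code in trace points; the collected trace data then flows out through a pluggable notifier (messaging/Ceilometer today, with the file-based notifier idea and the multi-backend spec aimed at widening that). A rough sketch of the Python API as documented at the time:

    from osprofiler import profiler

    # the key must match the one the services/middleware are configured with
    profiler.init(hmac_key="SECRET_KEY")

    @profiler.trace("expensive-db-call", info={"db.statement": "SELECT ..."})
    def expensive_db_call():
        pass

    # or an explicit trace point around a block of code
    with profiler.Trace("another-step", info={"detail": "anything useful"}):
        expensive_db_call()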
15:59:56 a-ha, I see not yet
15:59:59 #action boris-42 update osprofiler spec to fit Mitaka cycle
16:00:07 ok, so we ran out of time
16:00:15 any last questions to mention?
16:00:29 thank you guys!
16:00:32 boris-42, where are u!
16:00:34 come in boris!
16:00:35 lol
16:00:39 :D
16:00:40 #endmeeting