16:00:51 <mlavalle> #startmeeting neutron_performance
16:00:52 <openstack> Meeting started Mon Jul 29 16:00:51 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:53 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:55 <openstack> The meeting name has been set to 'neutron_performance'
16:01:05 <rubasov> hi
16:01:30 <slaweq> hi
16:01:51 <haleyb> hi
16:01:54 <mlavalle> let's wait one minute and then we'll get going
16:02:16 <mlavalle> well, I see haleyb is already here and I know njohnston won't attend
16:02:22 <mlavalle> so let's get going
16:02:28 <mlavalle> #topic Updates
16:03:04 <mlavalle> rubasov merged our port binding scenario in rally-openstack: https://review.opendev.org/#/c/662781/
16:03:16 <mlavalle> I've been playing with it the past few days
16:03:24 <mlavalle> Thanks and great job!
16:03:34 <rubasov> I hope it's being useful
16:04:11 <mlavalle> it is being useful
16:05:21 <rubasov> then I'm happy
16:05:27 <mlavalle> and I have to point out that it is going to be even more useful for https://bugs.launchpad.net/neutron/+bug/1836834
16:05:28 <openstack> Launchpad bug 1836834 in neutron "[RFE] introduce distributed locks to ipam" [Wishlist,Confirmed] - Assigned to qinhaizhong (qinhaizhong)
16:06:05 <mlavalle> this is the RFE that we approved this past Friday. We need a Rally test to check how much progress we make with that RFE
16:06:23 <mlavalle> and I think your rally test is the perfect match for that
16:06:30 <mlavalle> what do you think slaweq?
16:07:08 <slaweq> I agree, with this rally scenario we can measure exactly the improvement (or not) that this ipam lock will give us
16:07:29 <slaweq> as in other "port_create" scenarios I think it's not allocating IP addresses, right?
16:07:35 <slaweq> or is it?
16:07:42 <mlavalle> exactly
16:07:55 <mlavalle> this is the rally test that allows us to exercise the IPAM
16:08:16 <mlavalle> so really useful rubasov
16:08:20 <mlavalle> \o/
16:08:50 <slaweq> thx rubasov for this :)
16:08:55 <rubasov> cool, when I was writing it I was thinking of adding some options to tune the IP contention
16:09:09 <rubasov> I mean how many IPs out of how large a pool
16:09:17 <rubasov> let me know if that's needed
16:09:25 <rubasov> it could be added easily
16:09:56 <mlavalle> let me think about it, now that I am playing with it
16:10:09 <rubasov> ok
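For context, Rally scenarios like this one are driven by a task definition with a constant-concurrency runner. The sketch below only illustrates how the iteration/concurrency numbers discussed later in this meeting map onto a task; the scenario name, its arguments, and the IP-contention knob are assumptions for illustration, not what the merged rally-openstack change actually defines.

```python
# Minimal sketch: build a Rally task definition in Python and dump it to JSON.
# The scenario name "NeutronNetworks.create_and_bind_ports" and its args are
# assumed for illustration; see https://review.opendev.org/#/c/662781/ for the
# real scenario. The "constant" runner, "users" and "quotas" contexts are
# standard Rally plugins.
import json


def make_task(times: int, concurrency: int) -> dict:
    return {
        "NeutronNetworks.create_and_bind_ports": [{
            "args": {
                "ports_per_network": 1,          # assumed knob
            },
            "runner": {
                "type": "constant",
                "times": times,                  # total iterations
                "concurrency": concurrency,      # parallel iterations
            },
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1},
                "quotas": {"neutron": {"network": -1, "subnet": -1, "port": -1}},
            },
        }]
    }


if __name__ == "__main__":
    # e.g. the 80-iteration / 20-concurrency run discussed below
    print(json.dumps(make_task(80, 20), indent=2))
```

Tuning IP contention, as rubasov offers, would amount to adding scenario arguments controlling how many ports are created out of how small a subnet allocation pool.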
16:10:10 <mlavalle> any other updates from you rubasov?
16:10:17 <rubasov> a bit
16:10:37 <rubasov> I know slaweq has already seen these changes
16:10:58 <rubasov> I have made some progress with osprofiling a vif plug
16:11:04 <rubasov> together with gibi
16:11:18 <rubasov> these two changes: https://review.opendev.org/666610 https://review.opendev.org/665715
16:11:49 <rubasov> I have some feedback from ralonsoh to fix
16:12:03 <rubasov> and also some ugly interdependent unit test failures to fix too
16:12:28 <rubasov> but the main point is that those two patches can now get a trace through ovs-agent
16:12:43 <mlavalle> I took a quick look at these patches Friday (I think)
16:12:44 <rubasov> from nova through ovs-agent and back to neutron-server
16:13:03 <ralonsoh> I still have some concerns about storing so much information in the bridge register
16:13:20 <rubasov> ralonsoh: sorry I did not have the time to answer yet
16:13:26 <ralonsoh> rubasov, no problem
16:13:30 <ralonsoh> no rush
16:13:33 <rubasov> but I think it should not be that much info there
16:13:45 <rubasov> it depends on the number of concurrent traces
16:13:54 <rubasov> which can only come from trusted users
16:14:10 <rubasov> that number does not scale together with the number of ports on an agent
16:14:19 <ralonsoh> no, for sure
16:14:23 <rubasov> but with the number of concurrent traces
16:14:29 <slaweq> ralonsoh: yes, IIUC rubasov's patch will remove any trace-id from br-int just after using it
16:14:43 <ralonsoh> but I also don't want to leave unneeded info in the bridge register
16:14:56 <ralonsoh> if so, that's ok for me
16:15:15 <slaweq> but rubasov please correct me if I'm wrong :)
16:15:25 <ralonsoh> and eventually, once Sean merges the patch in nova, we'll put everything in the port register
16:15:25 <rubasov> ralonsoh: by default ovs-agent removes the trace info as soon as it sees it
16:16:08 <rubasov> ralonsoh: on the other hand I fully agree that it's an ugly hack to put trace info there :-)
16:16:31 <rubasov> and I'd like to see Sean's patch merged
16:16:48 <rubasov> I'm just not sure I want to make it a dependency
16:17:01 <ralonsoh> not for now
16:17:43 <rubasov> ralonsoh: also I could add some extra cleanup
16:17:55 <ralonsoh> perfect
16:17:58 <rubasov> like deleting all trace info from the bridge on ovs-agent restart
16:18:24 <rubasov> would that be better?
16:18:30 <ralonsoh> for sure!
16:18:37 <rubasov> then I'll do that
16:19:11 <rubasov> let me know if you see other places in the code where we can do some meaningful cleanup
16:20:22 <rubasov> unless you have some questions about these patches that's all from me
16:20:50 <mlavalle> thanks for the update, and thanks to ralonsoh for helping with this :-)
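A rough sketch of the pattern being discussed, to make the "remove it as soon as it's seen" point concrete: the trace context is stashed on br-int, and the agent deletes it right after consuming it, so only in-flight traces ever sit on the bridge. The bridge helper names and the key format below are assumptions for illustration, not what the patches under review actually use; the osprofiler calls are the real library API.

```python
# Sketch only: stash an osprofiler trace context on br-int and pop it as soon
# as ovs-agent consumes it. bridge.set_external_id/get_external_id/
# delete_external_id are hypothetical helpers standing in for the usual OVSDB
# external_ids manipulation.
from osprofiler import profiler


def stash_trace(bridge, port_id):
    """Sender side: record the current trace context for this port."""
    prof = profiler.get()
    if prof is None:
        return  # tracing not enabled for this request
    value = "%s:%s" % (prof.get_base_id(), prof.get_id())
    bridge.set_external_id("neutron-osprofiler-%s" % port_id, value)


def pop_trace_and_continue(bridge, port_id, hmac_key):
    """ovs-agent side: consume the trace context, then delete it immediately."""
    key = "neutron-osprofiler-%s" % port_id
    value = bridge.get_external_id(key)
    if not value:
        return
    bridge.delete_external_id(key)  # nothing lingers in the bridge register
    base_id, parent_id = value.split(":", 1)
    profiler.init(hmac_key, base_id=base_id, parent_id=parent_id)
    with profiler.Trace("neutron.ovs-agent.vif-plug", info={"port_id": port_id}):
        pass  # the actual plugging work would be traced here
```

The extra cleanup rubasov proposes would then just be a startup pass that deletes every key matching the trace prefix.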
16:21:01 <rubasov> yep thanks
16:21:13 <ralonsoh> 0/
16:22:45 <mlavalle> On my side, as I indicated above, I've been running the create_bind_port scenario
16:23:09 <mlavalle> I created reports and posted them here: https://github.com/miguellavalle/neutron-perf/tree/july-29th/threadripper
16:24:01 <mlavalle> If you go to the reports folder, you will see the different runs, where I increase the number of iterations and the concurrency gradually
16:24:28 <mlavalle> I also created a quick overview of the results here: http://paste.openstack.org/show/755065/
16:26:21 <slaweq> so it looks to me that using osprofiler is slowing down everything a lot
16:26:26 <mlavalle> As you can see, we went from an average duration of 63.282 secs with 10 iterations, concurrency 5
16:26:34 <mlavalle> yes, it does
16:27:17 <mlavalle> but at least for the time being it is a good window into what is going on
16:27:35 <rubasov> but still the trends with osprofiler should be the same as without it, right?
16:27:35 <mlavalle> I don't think we should focus on the absolute durations
16:27:47 <mlavalle> we should focus on the trend
16:27:52 <rubasov> so we should still see how it scales
16:27:58 <mlavalle> exactly rubasov
16:28:05 <slaweq> I agree
16:28:18 <slaweq> I am just surprised that it has so much overhead
16:28:27 <mlavalle> it is a lot of overhead
16:28:51 <mlavalle> but if you think about it, it's not surprising
16:29:07 <mlavalle> we are logging each DB operation
16:29:23 <mlavalle> anyway...
16:29:25 <rubasov> a lot of I/O to send to the trace point db
16:30:55 <mlavalle> going back to my original comment, with 80 iterations, 20 concurrency, we go to an average duration of 147.887
16:33:28 <mlavalle> in that reports folder you will find the rally report (for example iterations-10-concurrency-5.html) with one or several corresponding osprofiler reports (for example iterations-10-concurrency-5-osprofiler-max.html)
16:34:15 <mlavalle> if the osprofiler report name contains the 'max' suffix, it means that it refers to the iteration with the largest duration
16:35:27 <rubasov> what looks good is that there's not much of a spread, the 95th percentile values are close to the median values
16:35:49 <rubasov> so neutron-server seems to be performing at a consistent speed
16:36:03 <slaweq> rubasov: slow but stable :D
16:36:56 <rubasov> and based on the first two we may need to divide the numbers by 5 (the slowdown because of osprofiler)
16:37:00 <mlavalle> would that be an indication that we haven't reached a limit? i.e. the performance knee where response time grows exponentially?
16:37:56 <mlavalle> I mean the fact that the 95th percentile is still close to the median?
16:38:47 <jrbalderrama> BTW any plan to perform more tests with more physical resources to compare the results with and w/o the profiler?
16:39:08 <rubasov> I don't think the spread in itself tells much about bottlenecks
16:39:15 <rubasov> or at least I don't see how
16:39:29 <mlavalle> fair point
16:39:45 <rubasov> other than we're likely hitting the same cause of slowness in each run
16:40:24 <rubasov> so if we can improve that we'll likely improve all runs
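The "spread" observation can be reproduced directly from the per-iteration durations of a run. A small sketch, with made-up example data standing in for the durations you would read out of a `rally task results` dump or the HTML report:

```python
# Minimal sketch: compare the median with the 95th percentile of a run's
# per-iteration durations to gauge spread. The durations list here is example
# data only, not taken from the reports discussed above.
import statistics


def percentile(values, pct):
    """Nearest-rank percentile; good enough for eyeballing spread."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100.0 * len(ordered)) - 1))
    return ordered[k]


durations = [61.9, 63.1, 63.3, 62.8, 64.0, 63.5, 65.2, 62.4, 63.9, 63.0]
print("median: %.3f  p95: %.3f"
      % (statistics.median(durations), percentile(durations, 95)))
```

A p95 close to the median means most iterations take about the same time, which is the "slow but stable" behaviour described above.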
16:41:06 <mlavalle> I haven't identified a big / single obvious bottleneck in the case of port creation and update, so it's going to require a very fine-grained analysis of the reports, which I will attempt as my next step
16:41:43 <mlavalle> But in the case of subnet creation I found what seems to be a tentative surprise
16:41:55 <rubasov> this is already very cool, thank you
16:42:23 <slaweq> mlavalle: one more question: what are those sql update queries in your summary?
16:42:38 <mlavalle> If you look at lines 75-78 and 99-100 in my summary....
16:43:09 <mlavalle> you will see SQL update statements that are executed during subnet creation
16:43:53 <mlavalle> in the 60 / 15 and 80 / 20 scenarios I can see that those SQL statements take around 8 to 9 seconds
16:44:11 <mlavalle> while the total subnet creation time is around 25-30 seconds
16:44:22 <mlavalle> am I being clear?
16:45:00 <slaweq> so those are the slowest queries in subnet creation requests, right?
16:45:15 <slaweq> in fact it's always the same query
16:45:22 <mlavalle> exactly
16:45:35 <mlavalle> let's treat this as tentative for now
16:45:44 <mlavalle> I just found this last night
16:45:50 <mlavalle> but it really stands out
16:46:06 <slaweq> ok
16:46:19 <rubasov> do you have multiple query lines because we're hitting a db_retry?
16:46:52 <mlavalle> yeah, I need to investigate and relate this to the code and the log files
16:47:03 <mlavalle> but retries are clearly a possibility
16:47:15 <rubasov> ok
16:48:08 <mlavalle> and I also need to compare a port creation / update in the 10 / 5 scenario with their counterpart in the 80 / 20 scenario
16:48:31 <mlavalle> because in that case there is not a single operation that stands out
16:48:56 <mlavalle> but clearly a lot of small operations are adding up to a big increase in time
16:49:57 <mlavalle> makes sense?
16:50:06 <slaweq> yes
16:50:08 <mlavalle> any more questions or observations
16:50:10 <mlavalle> ?
16:50:14 <rubasov> yes it does
16:50:27 <slaweq> I don't have anything else to add
16:51:04 <mlavalle> jrbalderrama: yes, that's the plan longer term
16:51:40 <mlavalle> but before getting there, I would like to have deeper knowledge of where the bottlenecks are
16:52:02 <mlavalle> so we can perform a very enlightening experiment
16:52:27 <mlavalle> and btw, thanks for attending
16:52:58 <mlavalle> and Viva Colombia! I bet all your French buddies are disappointed
16:53:06 <ralonsoh> hahahah
16:53:39 <slaweq> LOL
16:54:27 <jrbalderrama> hahaha we are still the champions in other sports ;) and the Tour is a French heritage!
16:54:54 <mlavalle> that's true
16:55:07 <mlavalle> it is a big victory for the Colombians, though
16:55:25 <jrbalderrama> back to the point. Sounds good to me. We are looking forward to launching some tests here
16:55:41 <mlavalle> ok, we'll keep you posted
16:55:59 <jrbalderrama> thank you all!
16:56:28 <mlavalle> no rush on the PR
16:56:47 <mlavalle> anything else we should discuss today?
16:57:36 <slaweq> not from me
16:58:04 <rubasov> neither from me
16:58:19 <mlavalle> Thanks for attending!
16:58:23 <mlavalle> #endmeeting
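As a follow-up to the db_retry question above: neutron wraps many database operations in oslo.db's retry helper, so a deadlock or transient failure re-runs the whole decorated operation and the same UPDATE statement appears again in the osprofiler trace. A minimal sketch of that mechanism; the function, table, and column names are made up for illustration, while wrap_db_retry and its parameters are the real oslo.db API. This is not the actual neutron code path behind the queries seen in the summary.

```python
# Sketch of the oslo.db retry pattern that could explain repeated UPDATE lines
# in an osprofiler trace: on a deadlock, wrap_db_retry re-runs the decorated
# function, so the same statement is executed (and traced) again.
from oslo_db import api as oslo_db_api
import sqlalchemy as sa


@oslo_db_api.wrap_db_retry(max_retries=10, retry_on_deadlock=True,
                           retry_interval=0.5, inc_retry_interval=True)
def bump_allocation_counter(engine, subnet_id):
    # Hypothetical table/column; a DBDeadlock raised here is caught by
    # wrap_db_retry and this whole function runs again.
    with engine.begin() as conn:
        conn.execute(
            sa.text("UPDATE ipam_counters SET allocated = allocated + 1 "
                    "WHERE subnet_id = :subnet_id"),
            {"subnet_id": subnet_id},
        )
```

Correlating the repeated UPDATE spans in the trace with DBDeadlock/retry messages in neutron-server's log would confirm or rule out this explanation.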