16:00:27 <mlavalle> #startmeeting neutron_performance
16:00:28 <openstack> Meeting started Mon Jul 1 16:00:27 2019 UTC and is due to finish in 60 minutes. The chair is mlavalle. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:32 <openstack> The meeting name has been set to 'neutron_performance'
16:00:56 <njohnston> o/
16:01:14 <slaweq> hi
16:01:37 <haleyb> hi
16:04:11 <mlavalle> ok, let's get going
16:05:48 <mlavalle> #topic Rally / osprofiler / EnOS
16:06:01 <mlavalle> I have progress updates on this topic
16:06:57 <mlavalle> after the message that the EnOS team sent us with guidelines, all the small hurdles that I was facing to start running tests were solved
16:08:41 <mlavalle> I was able to deploy the test that rubasov is implementing in rally-openstack: https://review.opendev.org/#/c/662781/
16:09:15 <mlavalle> I uploaded the rally report to dropbox:
16:09:19 <mlavalle> #link https://www.dropbox.com/s/ggk26tokkob4w8s/bind_port_rally_report.html?dl=0
16:09:26 <njohnston> coool
16:09:39 <mlavalle> so you can see the results of the task as it is defined by the patch
16:11:19 <mlavalle> I also uploaded for you an osprofiler report of an iteration representative of the average in the rally report:
16:12:44 <mlavalle> #link https://www.dropbox.com/s/ojigqb32swvdrgx/bind_port_avg.html?dl=0
16:13:25 <mlavalle> and an osprofiler report of the iteration with the highest elapsed time:
16:14:24 <mlavalle> #link https://www.dropbox.com/s/ojigqb32swvdrgx/bind_port_avg.html?dl=0
16:15:06 <slaweq> I wonder where on create_subnet it spends about 7 seconds in this example
16:15:18 <mlavalle> These are html files, so you need to download them and view them in your browser
16:15:23 <slaweq> creation of subnet is "just" a db operation, right?
16:15:30 <mlavalle> it is
16:16:02 <slaweq> mlavalle: and Your environment wasn't overloaded by other tasks during the time when You ran this test, right?
16:16:14 <mlavalle> well, it interacts with the dhcp agent
16:16:29 <mlavalle> but I don't think that's the problem
16:16:41 <mlavalle> and no, my test environment is not overloaded
16:17:15 <slaweq> mlavalle: to be sure that is not a problem, can You run the same test but with subnets with dhcp disabled?
16:17:50 <mlavalle> sure
16:17:56 <mlavalle> that's a good suggestion
16:19:05 <slaweq> thx
16:20:16 <mlavalle> one problem that I am finding with the osprofiler report is that we get a lot of lines of sqlalchemy tracing
16:20:48 <mlavalle> and they contain very detailed info on the DB operation
16:21:20 <mlavalle> the problem is that the context info (i.e. where in the code that DB operation was issued) is lacking
16:21:29 <slaweq> yes
16:21:36 <mlavalle> so it is difficult to relate the trace to the code
16:21:36 <slaweq> I also noticed that
16:21:53 <njohnsto_> yep, logging that is outside the scope of sqlalchemy IMHO
16:23:02 <mlavalle> so I am going to ping Andrey Kurilin and see if he has any suggestions
16:23:17 <mlavalle> slaweq: do you remember his email address?
16:23:57 <slaweq> mlavalle: no, but I will find it and send it to You, ok?
16:24:15 <mlavalle> slaweq: not needed. thanks. found it in gerrit
16:24:18 <slaweq> mlavalle: You can also try to catch him on the #openstack-rally channel, his nick is andreykurilin
16:24:48 <mlavalle> anyways. Next steps are:
16:25:21 <mlavalle> 1) Play with this scenario changing the load and concurrency parameters
16:25:49 <mlavalle> 2) Try to add context info to the osprofiler traces
16:26:07 <mlavalle> 3) Investigate the subnet creation anomaly as suggested by slaweq
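A minimal sketch of one possible direction for step 2, assuming osprofiler is already initialized for the service; the function and class names below are illustrative, not actual neutron code:

    # A minimal sketch, assuming osprofiler is already initialized for the
    # service; function and class names are illustrative, not neutron code.
    from osprofiler import profiler

    @profiler.trace("create_subnet_db", info={"module": "plugin"})
    def _create_subnet_db(context, subnet):
        # DB work done inside this function shows up as a named span in
        # the osprofiler report, instead of anonymous sqlalchemy lines
        pass

    # trace_cls wraps every public method of a class in one step
    @profiler.trace_cls("subnet-plugin")
    class SubnetPlugin(object):
        def create_subnet(self, context, subnet):
            pass

Wrapping the suspect code paths this way nests the sqlalchemy spans under named application spans, which is one way to recover the "where in the code" context the raw traces lack.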
16:27:03 <mlavalle> any other comments on this topic?
16:27:20 <slaweq> IMHO each operation there takes a very long time
16:27:32 <mlavalle> it does
16:27:36 <slaweq> port create is almost as long as the port update (binding) later
16:27:43 <slaweq> and port create is also only:
16:27:46 <slaweq> 1. db operations
16:27:55 <mlavalle> and it is a lot of DB operations
16:28:09 <slaweq> 2. ipam - but as each port gets a new subnet, ipam should be fast IMO because no retries should be there
16:28:11 <mlavalle> you can see that in the osprofiler reports
16:28:13 <slaweq> 3. dhcp agent update
16:29:04 <slaweq> so basically, before I looked into it, I was expecting that most of the time would be spent on port update (port binding)
16:29:18 <slaweq> and it's not - everything else is also slow :/
16:29:35 <slaweq> do You agree with me or am I missing something?
16:29:44 <mlavalle> agree with you
16:30:05 <mlavalle> and the fact is that it is almost all db operations
16:30:29 <mlavalle> if you look at the osprofiler reports
16:30:45 <mlavalle> that is why I want to be able to relate the reports with the code
16:31:05 <slaweq> yes
16:31:28 <mlavalle> going back to your question about how much load I have in my test environment
16:31:58 <mlavalle> I am continuously monitoring the system with top
16:32:20 <mlavalle> it has 32 threads
16:32:30 <mlavalle> all of which are almost idle
16:32:42 <mlavalle> unless I trigger the rally workload
16:32:51 <mlavalle> and even then it is not very loaded
16:33:02 <mlavalle> so it's not the test environment
16:33:46 <mlavalle> any other comments?
16:33:58 <slaweq> great work mlavalle, thx :)
16:34:16 <mlavalle> :-)
16:34:28 <mlavalle> ok, let's move on
16:34:39 <mlavalle> #topic Security groups and RBAC
16:35:01 <mlavalle> I have continued digging into https://bugs.launchpad.net/neutron/+bug/1830679
16:35:02 <openstack> Launchpad bug 1830679 in neutron "Security groups RBAC cause a major performance degradation" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:35:45 <mlavalle> Taking a suggestion from njohnsto_, I've been doing some profiling with cProfile
16:37:15 <mlavalle> I posted one cProfile report in the bug and explained where I am conducting the profiling
16:38:36 <mlavalle> you can see the details in the report
16:38:55 * njohnsto_ will check it out
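The cProfile pattern referred to above looks roughly like this; the SecurityGroup.get_objects call is a hypothetical stand-in for the actual call site documented in the bug:

    # Sketch of a targeted cProfile run; SecurityGroup.get_objects is a
    # hypothetical stand-in for the call site documented in the bug.
    import cProfile
    import io
    import pstats

    pr = cProfile.Profile()
    pr.enable()
    sgs = SecurityGroup.get_objects(context)  # code path under test
    pr.disable()

    out = io.StringIO()
    pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(25)
    print(out.getvalue())  # top 25 entries by cumulative time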
16:38:59 <mlavalle> but the conclusion is that querying 6000 security groups from the DB doesn't take much
16:39:38 <mlavalle> where we spend a lot of time is building the objects after retrieving them from the DB, here: https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L800
16:40:49 <slaweq> and it spends most of this time on calculating whether the SG is shared with the tenant or not, am I understanding correctly what You wrote in the bug?
16:41:04 <njohnsto_> should we call on zzzeek’s expert opinion on this?
16:41:26 <mlavalle> slaweq: I have some second thoughts about that comment
16:41:46 <mlavalle> the point is that we spend a lot of time building the objects from the DB
16:42:01 <mlavalle> the DB query is almost nothing
16:42:30 <mlavalle> njohnsto_: that might be a good step
16:42:53 <mlavalle> but before that, here's the strategy I intend to test later today:
16:43:25 <mlavalle> Instead of calling https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L617 blindly and letting it retrieve all the objects from the DB
16:44:23 <mlavalle> I am going to query the DB from here: https://github.com/openstack/neutron/blob/master/neutron/objects/base.py#L617
16:44:52 <njohnsto_> that is the same link
16:45:04 <mlavalle> and build a specific query instead of calling obj_db_api.get_objects
16:45:40 <mlavalle> in other words, instead of querying all the objects, I am going to query only the objects that the tenant has access to
16:45:50 <njohnsto_> good idea
16:46:11 <mlavalle> either because it owns the object or because there is a rbac entry that grants the access
16:46:35 <mlavalle> that way, the number of OVOs to build will be much lower
16:47:03 <mlavalle> and in the case where there are no rbac entries, that user / deployer doesn't pay any extra cost
16:47:11 <mlavalle> does it make sense?
16:47:41 <njohnsto_> +1
16:47:57 <mlavalle> cool
16:48:44 <mlavalle> the only other observation I want to make is that last night I was wondering how many times we call get_objects from the base class blindly and let it retrieve all the objects from the DB
16:48:48 <slaweq> +1
16:49:03 <mlavalle> and then we filter later
16:49:36 <njohnsto_> Yes, that is definitely a bad pattern for operations at scale
16:49:37 <mlavalle> maybe after fixing this specific bug I am going to dig into the code looking for cases like this
16:50:17 <mlavalle> because if we retrieve a lot of objects from the DB, build the OVOs and then filter
16:50:27 <mlavalle> we waste a lot of resources building the OVOs
16:50:51 <njohnsto_> +1
16:51:04 <mlavalle> ok, that's all I had for today
16:51:10 <mlavalle> any other comments?
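A rough sketch of the strategy just described, filtering at the SQL level so that only rows the tenant can access are fetched and turned into OVOs; the model and column names are illustrative, not the real neutron schema:

    # Rough sketch of filtering security groups at the SQL level; the
    # model and column names are illustrative, not the neutron schema.
    from sqlalchemy import or_

    def get_accessible_security_groups(context, session):
        # keep rows the tenant owns, plus rows shared with it (or with
        # everyone, '*') through an rbac entry; everything else is never
        # fetched, so no OVO is built for it
        return (
            session.query(SecurityGroup)
            .outerjoin(
                SecurityGroupRBAC,
                SecurityGroupRBAC.object_id == SecurityGroup.id,
            )
            .filter(
                or_(
                    SecurityGroup.project_id == context.project_id,
                    SecurityGroupRBAC.target_tenant.in_(
                        (context.project_id, "*")
                    ),
                )
            )
            .distinct()
            .all()
        )

When there are no rbac entries, the outer join matches nothing extra, so owner-only deployments pay no additional cost, which is the property mlavalle calls out above.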
16:51:26 <slaweq> I wanted to ask about one thing
16:51:48 <slaweq> in launchpad we have the tag "loadimpact" for bugs related to performance problems
16:51:55 <slaweq> here is the list of bugs with this tag:
16:51:56 <slaweq> https://bugs.launchpad.net/neutron/+bugs?field.searchtext=&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscr
16:51:58 <slaweq> iber=&field.structural_subscriber=&field.tag=loadimpact&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on&ord
16:52:00 <slaweq> erby=-importance&start=0
16:52:02 <slaweq> should we maybe:
16:52:17 <slaweq> 1. clean this list and close old/already fixed bugs?
16:52:36 <slaweq> 2. check this list during these meetings?
16:52:41 <slaweq> what do You think?
16:52:46 <mlavalle> excellent idea
16:53:08 <mlavalle> I'll try to start with the clean up
16:53:18 <slaweq> thx a lot :)
16:53:40 <slaweq> that's all from me
16:53:46 <mlavalle> cool
16:53:50 <mlavalle> Thanks for attending
16:53:55 <mlavalle> have a great week!
16:53:59 <mlavalle> #endmeeting