15:00:59 #startmeeting openstack_ansible_meeting
15:00:59 Meeting started Tue Nov 16 15:00:59 2021 UTC and is due to finish in 60 minutes. The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:59 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 The meeting name has been set to 'openstack_ansible_meeting'
15:01:05 #topic office hours
15:01:23 ah, I eventually skipped rollcall :(
15:01:27 \o/
15:01:43 hey!
15:04:38 Dmitriy Rabotyagov proposed openstack/openstack-ansible-rabbitmq_server master: Update rabbitmq version https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/817380
15:05:59 So, I pushed some patches to update the rabbitmq and galera versions before release
15:06:28 The frustrating thing is that for rabbitmq (erlang to be specific) the bullseye repo is still missing
15:06:43 and galera 10.6 fails while running mariadb-upgrade
15:06:50 well, actually, it times out
15:06:57 I didn't have time to dig into this though
15:08:42 I guess once we figure this out we can do a milestone release for testing
15:09:28 while still taking time before roles branching
15:09:39 also we have less than a month until the final release, just in case
15:10:46 like updating the maria pooling vars across roles if we get them agreed in time
15:16:29 regarding sqlalchemy's connection pooling, i've checked how many active connections we have on our production environments and i think we should stick to oslo.db's defaults (max_pool_size: 5, max_overflow: 50, pool_timeout: 30). I'll try to run some rally tests to make sure it won't degrade performance
15:20:00 hey.
15:20:17 \o/
15:21:10 I wonder if we should set max_pool_size: 10, but maybe we don't even need that much
15:21:53 5 feels somehow extreme
15:22:27 if you want i can take a couple of hours to dig into the issue this afternoon.
15:24:01 I won't refuse help - the more eyes we have on this the more balanced result we get I guess
15:24:14 hey
15:24:39 But I'm absolutely sure we should adjust these defaults, and I'd wait for the patches to land before release tbh
15:26:41 noonedeadpunk: regarding max_pool_size, none of the environments i checked needed more than a max_pool_size of 2 over the 12 hours (and with max_overflow set to 50, that's far more than enough). But i'll share some info after the rally tests
15:27:32 well, in case one controller fails I guess this number would increase?
15:27:35 spatel: there is a script i used to check how many active SQL connections I have for each openstack service. You may find it useful
15:27:35 https://paste.openstack.org/show/811029/
15:27:52 I think you meant mgariepy hehe
15:28:09 :)
15:28:10 ouh my bad, sorry!
15:28:40 andrewbonney: you had your head in this topic as well. wdyt?
15:30:23 nice damiandabrowski[m] thanks i'll take a look
15:33:33 damiandabrowski[m] i am trying to run it but it's throwing some error :)
15:33:51 maybe my box is missing some tools
15:34:06 sort: cannot read: ./data/keystone: No such file or directory
15:35:09 well, i found an issue with this script yesterday. Looks like it has to be executed from your $PWD. maybe that's the case?
15:36:29 line 13 should create ./data and line 16 should create service directories inside it
15:37:02 ouh and another thing: You need to run this script with the 'collect' argument to collect some data first, then You will be able to use the 'summary' argument ;)
15:39:08 It's a while since I've looked at this, but it wouldn't surprise me if existing defaults could be reduced. I just wouldn't have confidence doing so without checking larger deployments
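For reference, the oslo.db options under discussion end up in each service's [database] section. A minimal sketch of how one might check what a given deployment currently runs with, assuming the usual config path (in an OSA/LXC deployment the file sits inside the service container, and nova.conf is just an example):

    # Print the pool-related [database] options a service currently has set.
    awk '/^\[database\]/{in_db=1; next} /^\[/{in_db=0} in_db' /etc/nova/nova.conf \
      | grep -E 'max_pool_size|max_overflow|pool_timeout'

If nothing is printed, the service is running on the oslo.db defaults quoted at 15:16:29.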
15:39:53 would be interesting to see spatel's results actually
15:40:35 noonedeadpunk are you talking about the related SQL script?
15:40:39 yeah
15:40:52 fyi, i am trying to test it on wallaby, not the latest branch
15:41:52 damiandabrowski[m] - https://paste.opendev.org/show/811030/
15:42:06 may need some love to that script..
15:42:28 tbh i ran it with bash so it may have some issues with sh :/
15:44:03 trying with bash..
15:45:19 https://paste.opendev.org/show/811031/
15:45:32 all zero.. that is impossible
15:45:43 it's a sandbox?
15:45:59 lab with 5 compute nodes
15:46:07 and no activity?
15:46:11 1 controller
15:46:17 let me run it on a busy cloud
15:46:27 because what this script calculates, I believe, is the amount of active sql requests at a time
15:47:55 yep, if threads are sleeping it won't count them
15:47:57 assuming 'collect' continues collecting data in the background, right? ./mysql-active-conn.sh collect
15:48:06 i think it's very likely to be 0, but You can check the result of `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"`
15:48:19 yeah, i was running 'collect' in the background for 12 hours
15:48:29 zero.. result
15:48:52 hm, that's weird on a busy cloud indeed
15:49:00 https://paste.opendev.org/show/811032/
15:49:35 that is possible actually :)
15:50:22 but even a sleeping connection is still a connection
15:51:05 but if i understand it correctly, our main point of implementing pooling is to reduce the number of sleeping connections, as we don't need them ;)
15:51:23 i mean, with max_pool_size=5, we will have 5 sleeping/active connections per worker per service
15:52:05 yeah, the idea is that our current default setup is weird in terms of pooling
15:52:24 because of the huge amount of sleeping connections
15:52:49 fyi, i ran this script on the busiest cloud in my datacenter and the result is all zero. i am running collect for just 1 min..
15:53:34 James Denton proposed openstack/openstack-ansible-os_ironic master: Add [nova] section to ironic.conf https://review.opendev.org/c/openstack/openstack-ansible-os_ironic/+/818115
15:53:38 maybe i am running it on the stein release, that's why :)
15:53:54 nah I don't think it matters
15:54:00 and `cat ./data/nova` has only zeros? i was testing it on victoria, but i think it doesn't matter
15:54:03 it's pure mysql
15:54:16 k
15:55:00 damiandabrowski[m] yes, all zero for nova and even the other services too
15:55:25 only the root and system users have 1 and 5 connections
15:55:56 so IMO, openstack queries are parsed super fast so the script doesn't catch them
15:56:11 i mean, it just parses `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"` and saves the output to service files in ./data/*
15:57:01 in my case, the output was 0 in 99% of cases
15:59:33 noonedeadpunk this is my mytop command output - https://paste.opendev.org/show/811033/
15:59:55 the majority are sleeping connections
16:00:09 yeah, and we aim to reduce their number)
16:00:32 does sleep consume resources?
16:01:05 well, they do. it's not like a _huge_ problem, but it's unpleasant
16:01:25 eventually you have tons of hanging tcp connections, which also puts some load on the tcp stack
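The collection script itself lives in the paste above and is not reproduced here. As a rough re-sketch of the approach described in the discussion (a 'collect' mode that samples the number of non-Sleep connections per service user into ./data/<service> files, and a 'summary' mode that reads them back), something like the following could work; the service list, sampling interval, and file layout are assumptions, only the filename is taken from the chat:

    #!/usr/bin/env bash
    # Rough re-sketch of the sampling approach described above; not the paste script.
    # Usage: ./mysql-active-conn.sh {collect|summary}
    set -eu

    SERVICES="nova neutron keystone glance cinder placement"   # assumed list
    mkdir -p ./data

    case "${1:-}" in
      collect)
        while true; do
          for svc in $SERVICES; do
            # Count connections for this DB user that are not idle ("Sleep")
            mysql -sse "SELECT COUNT(*) FROM information_schema.processlist \
                        WHERE user='${svc}' AND command != 'Sleep';" >> "./data/${svc}"
          done
          sleep 5   # assumed sampling interval
        done
        ;;
      summary)
        for svc in $SERVICES; do
          [ -f "./data/${svc}" ] || continue
          echo "${svc}: max $(sort -n "./data/${svc}" | tail -1) active connections seen"
        done
        ;;
      *)
        echo "usage: $0 {collect|summary}" >&2; exit 1
        ;;
    esac

As pointed out above, this only sees statements that are executing at the exact sampling moment, which is why even busy clouds can report zeros; the sleeping connections that pooling is meant to reduce are filtered out on purpose.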
16:02:29 how about setting interactive_timeout = 300?
16:02:44 i believe the default is 8hrs
16:03:00 or wait_timeout?
16:03:35 interactive_timeout | 28800
16:03:43 wait_timeout | 28800
16:04:33 sleeping connections are the most problematic when galera nodes are going down.
16:04:33 Galera will keep them until the timeout, that's how galera can easily reach max_connections
16:05:14 yeah, actually my main point was to implement connection pooling and lower wait_timeout - but I'm open for Your ideas ;)
16:05:33 i had the same issue.. that is why i kept max_connections at 5000 or more... my upgrade failed last time because of the high max_connections limit
16:08:20 damiandabrowski[m], which service consumes the most mysql connections?
16:10:26 in my case: nova
16:10:48 yes, nova is always on top
16:11:58 #endmeeting
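As a reference for the timeout discussion above: the 28800-second values quoted are the MariaDB defaults (the 8 hours mentioned at 16:02:44). A hedged sketch of checking them, together with the connection headroom spatel raised, on a galera node with local root DB access; the 3600 value is purely illustrative, and a persistent change would normally go through the galera_server role's my.cnf templating rather than a live SET GLOBAL:

    # Current timeout values and connection headroom
    mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
              ('wait_timeout','interactive_timeout','max_connections');"
    mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"

    # Illustrative runtime experiment only: lowers the idle-connection lifetime to 1 hour.
    # Existing sessions keep their old value, and this does not survive a restart.
    # mysql -e "SET GLOBAL wait_timeout = 3600; SET GLOBAL interactive_timeout = 3600;"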