15:00:59 <noonedeadpunk> #startmeeting openstack_ansible_meeting
15:00:59 <opendevmeet> Meeting started Tue Nov 16 15:00:59 2021 UTC and is due to finish in 60 minutes.  The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:59 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 <opendevmeet> The meeting name has been set to 'openstack_ansible_meeting'
15:01:05 <noonedeadpunk> #topic office hours
15:01:23 <noonedeadpunk> ah, I accidentally skipped the rollcall :(
15:01:27 <noonedeadpunk> \o/
15:01:43 <damiandabrowski[m]> hey!
15:04:38 <opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible-rabbitmq_server master: Update rabbitmq version  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/817380
15:05:59 <noonedeadpunk> So, I pushed some patches to update the rabbitmq and galera versions before the release
15:06:28 <noonedeadpunk> The frustrating thing is that for rabbitmq (erlang, to be specific) the bullseye repo is still missing
15:06:43 <noonedeadpunk> and galera 10.6 fails while running mariadb-upgrade
15:06:50 <noonedeadpunk> well, actually, it times out
15:06:57 <noonedeadpunk> I didn't have time to dig into this though
15:08:42 <noonedeadpunk> I guess once we figure this out we can do a milestone release for testing
15:09:28 <noonedeadpunk> while still taking some time before branching the roles
15:09:39 <noonedeadpunk> also we have less than a month until the final release, just in case
15:10:46 <noonedeadpunk> like updating the mariadb pooling vars across roles if we get them agreed in time
15:16:29 <damiandabrowski[m]> regarding sqlalchemy's connection pooling, i've checked how many active connections we have on our production environments and i think we should stick to oslo.db's defaults (max_pool_size: 5, max_overflow: 50, pool_timeout: 30). I'll try to run some rally tests to make sure it won't degrade performance
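(For reference, a minimal sketch of what those oslo.db settings look like when written out explicitly in a service config such as nova.conf; the values below are simply the defaults quoted above, shown for illustration only.)

    # Hypothetical excerpt from a service config (e.g. nova.conf); values match
    # the oslo.db defaults mentioned above.
    [database]
    max_pool_size = 5    # connections kept open in the pool per worker
    max_overflow = 50    # extra connections allowed beyond the pool under load
    pool_timeout = 30    # seconds to wait for a free pooled connection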
15:20:00 <mgariepy> hey.
15:20:17 <noonedeadpunk> \o/
15:21:10 <noonedeadpunk> I wonder if we should set max_pool_size: 10, but maybe we don't even need that much
15:21:53 <noonedeadpunk> 5 feels somewhat extreme
15:22:27 <mgariepy> if you want i can take a couple of hours to dig into the issue this afternoon.
15:24:01 <noonedeadpunk> I won't refuse help - the more eyes we have on this, the more balanced a result we'll get I guess
15:24:14 <spatel> hey
15:24:39 <noonedeadpunk> But I'm absolutely sure we should adjust these results, and I'd wait for the patches to land before the release tbh
15:26:41 <damiandabrowski[m]> noonedeadpunk: regarding max_pool_size, none of the environments i checked needed more than 2 over the 12 hours (and with max_overflow set to 50, that is far more than enough). But i'll share some info after the rally tests
15:27:32 <noonedeadpunk> well, in case of a controller failure I guess this number would increase?
15:27:35 <damiandabrowski[m]> spatel: there is a script i used to check how many active SQL connections I have for each openstack service. You may find it useful
15:27:35 <damiandabrowski[m]> https://paste.openstack.org/show/811029/
15:27:52 <noonedeadpunk> I think you meant mgariepy hehe
15:28:09 <spatel> :)
15:28:10 <damiandabrowski[m]> ouh my bad, sorry!
15:28:40 <noonedeadpunk> andrewbonney: you had your head in this topic as well. wdyt?
15:30:23 <mgariepy> nice damiandabrowski[m] thanks i'll take a look
15:33:33 <spatel> damiandabrowski[m] i am trying to run it but it's throwing some error :)
15:33:51 <spatel> maybe my box is missing some tools
15:34:06 <spatel> sort: cannot read: ./data/keystone: No such file or directory
15:35:09 <damiandabrowski[m]> well, i found an issue with this script yesterday. Looks like it uses paths relative to your $PWD, so it has to be run from the same directory each time. maybe that's the issue?
15:36:29 <damiandabrowski[m]> line 13 should create ./data and line 16 should create service directories inside it
15:37:02 <damiandabrowski[m]> ouh and another thing: You need to run this script with the 'collect' argument to collect some data, then You will be able to use the 'summary' argument ;)
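(A rough sketch of the collect/summary approach described above; this is not the actual script from the paste, just an illustration of the idea. The one-second poll interval and the ./data layout are assumptions.)

    #!/usr/bin/env bash
    # Sketch only: poll non-sleeping MySQL connections per user and record them.
    set -eu
    mkdir -p ./data
    case "${1:-}" in
      collect)
        # Append the current count of non-sleeping connections to ./data/<user>.
        while true; do
          mysql -sse "SELECT user, COUNT(*) FROM information_schema.processlist
                      WHERE command != 'Sleep' GROUP BY user;" |
            while read -r user count; do
              echo "${count}" >> "./data/${user}"
            done
          sleep 1
        done
        ;;
      summary)
        # Print the highest count observed for each user/service.
        for f in ./data/*; do
          printf '%s max active: %s\n' "$(basename "$f")" "$(sort -n "$f" | tail -n1)"
        done
        ;;
      *)
        echo "usage: $0 {collect|summary}" >&2
        exit 1
        ;;
    esac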
15:39:08 <andrewbonney> It's a while since I've looked at this, but it wouldn't surprise me if existing defaults could be reduced. I just wouldn't have confidence doing so without checking larger deployments
15:39:53 <noonedeadpunk> would be interesting to see spatel's result actually
15:40:35 <spatel> noonedeadpunk are you talking about the SQL script?
15:40:39 <noonedeadpunk> yeah
15:40:52 <spatel> fyi, i am trying to test it on wallaby, not the latest branch
15:41:52 <spatel> damiandabrowski[m] - https://paste.opendev.org/show/811030/
15:42:06 <spatel> that script may need some love..
15:42:28 <damiandabrowski[m]> tbh i ran it with bash so it may have some issues with sh :/
15:44:03 <spatel> trying with bash..
15:45:19 <spatel> https://paste.opendev.org/show/811031/
15:45:32 <spatel> all zero.. that is impossible
15:45:43 <noonedeadpunk> it's sandbox?
15:45:59 <spatel> lab with 5 compute nodes
15:46:07 <noonedeadpunk> and no activity?
15:46:11 <spatel> 1 controller
15:46:17 <spatel> let me run on busy cloud
15:46:27 <noonedeadpunk> because what this script calculates, I believe, is the number of active sql queries at any given time
15:47:55 <mgariepy> yep, if threads are sleeping it won't count them
15:47:57 <spatel> assuming collect continues collecting data in the background, right?  ./mysql-active-conn.sh collect
15:48:06 <damiandabrowski[m]> i think it's very likely to be 0, but You can check the result of `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"`
15:48:19 <damiandabrowski[m]> yeah, i was running 'collect' in the background for 12 hours
15:48:29 <spatel> zero.. result
15:48:52 <noonedeadpunk> hm, that's weird on a busy cloud indeed
15:49:00 <spatel> https://paste.opendev.org/show/811032/
15:49:35 <noonedeadpunk> that is possible actually :)
15:50:22 <mgariepy> but even a sleeping connection is still a connection
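(To see those as well, the same processlist query can be grouped by command so sleeping and active connections are counted separately; an illustrative one-liner only, nothing OSA-specific.)

    # Count connections per user, split by state (Sleep vs. everything else).
    mysql -sse "SELECT user, command, COUNT(*)
                FROM information_schema.processlist
                GROUP BY user, command
                ORDER BY user;"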
15:51:05 <damiandabrowski[m]> but if i understand it correctly, our main point of implementing pooling is to reduce the number of sleeping connections as we don't need them ;)
15:51:23 <damiandabrowski[m]> i mean, with max_pool_size=5, we will have 5 sleeping/active connections per worker per service
15:52:05 <noonedeadpunk> yeah idea is that our current default setup is weird in terms of pooling
15:52:24 <noonedeadpunk> because of huge amount of sleeping connections
15:52:49 <spatel> fyi, i ran this script on the busiest cloud in my datacenter and the result is all zeros. i am running collect for just 1 min..
15:53:34 <opendevreview> James Denton proposed openstack/openstack-ansible-os_ironic master: Add [nova] section to ironic.conf  https://review.opendev.org/c/openstack/openstack-ansible-os_ironic/+/818115
15:53:38 <spatel> maybe i am running the stein release, that's why :)
15:53:54 <noonedeadpunk> nah I don't think it matters
15:54:00 <damiandabrowski[m]> and `cat ./data/nova` has only zeros? i was testing it on victoria, but i think it doesn't matter
15:54:03 <noonedeadpunk> it's pure mysql
15:54:16 <spatel> k
15:55:00 <spatel> damiandabrowski[m] yes, all zeros for nova and the other services too
15:55:25 <spatel> only the root and system users have 1 and 5 connections
15:55:56 <damiandabrowski[m]> so IMO the openstack queries complete super fast, so the script doesn't catch them
15:56:11 <damiandabrowski[m]> i mean, it just parses `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"` and saves the output to service files in ./data/*
15:57:01 <damiandabrowski[m]> in my case, the output was 0 99% of the time
15:59:33 <spatel> noonedeadpunk this is my mytop command output - https://paste.opendev.org/show/811033/
15:59:55 <spatel> the majority are sleeping connections
16:00:09 <noonedeadpunk> yeah and we aim to reduce their number)
16:00:32 <spatel> does sleep consume resources ?
16:01:05 <noonedeadpunk> well, they do. it's not like a _huge_ problem, but it's unpleasant
16:01:25 <noonedeadpunk> eventually you have tons of hanging tcp connections, which also put some load on the tcp stack
16:02:29 <spatel> how about setting interactive_timeout = 300
16:02:44 <spatel> i believe the default is 8hrs
16:03:00 <spatel> or wait_timeout?
16:03:35 <spatel> interactive_timeout                   | 28800
16:03:43 <spatel> wait_timeout                          | 28800
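(A sketch of how those could be inspected and lowered at runtime; 300s is just the value floated above, not an agreed default, and a permanent change would belong in the server config rather than SET GLOBAL.)

    # Check the current values (the defaults shown above are 28800s, i.e. 8 hours).
    mysql -sse "SHOW GLOBAL VARIABLES WHERE variable_name IN
                ('wait_timeout', 'interactive_timeout');"
    # Lower them at runtime; this only affects newly opened sessions and does
    # not survive a mariadb restart.
    mysql -e "SET GLOBAL wait_timeout = 300; SET GLOBAL interactive_timeout = 300;"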
16:04:33 <damiandabrowski[m]> sleeping connections are the most problematic when galera nodes are going down.
16:04:33 <damiandabrowski[m]> Galera will keep them until they time out; that's how galera can easily reach max_connections
16:05:14 <damiandabrowski[m]> yeah, actually my main point was to implement connection pooling and lower wait_timeout - but I'm open to Your ideas ;)
16:05:33 <spatel> i had the same issue.. that is why i set max_connections to 5000 and more... my upgrade failed last time because of the high max_connections limit
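(A quick way to see how close a node actually gets to that limit, using plain MariaDB status variables; illustration only.)

    # Current and peak connection counts vs. the configured limit.
    mysql -sse "SHOW GLOBAL STATUS WHERE variable_name IN
                ('Threads_connected', 'Max_used_connections');"
    mysql -sse "SHOW GLOBAL VARIABLES LIKE 'max_connections';"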
16:08:20 <mgariepy> damiandabrowski[m], which service consumes the most mysql connections?
16:10:26 <damiandabrowski[m]> in my case: nova
16:10:48 <spatel> yes nova is always on top
16:11:58 <noonedeadpunk> #endmeeting