noonedeadpunk | #startmeeting openstack_ansible_meeting | 15:00 |
opendevmeet | Meeting started Tue Nov 16 15:00:59 2021 UTC and is due to finish in 60 minutes. The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:00 |
opendevmeet | The meeting name has been set to 'openstack_ansible_meeting' | 15:00 |
noonedeadpunk | #topic office hours | 15:01 |
noonedeadpunk | ah, I eventually skipped rollcall :( | 15:01 |
noonedeadpunk | \o/ | 15:01 |
damiandabrowski[m] | hey! | 15:01 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-rabbitmq_server master: Update rabbitmq version https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/817380 | 15:04 |
noonedeadpunk | So, I pushed some patches to update rabbitmq and galera version before release | 15:05 |
noonedeadpunk | Frustrating thing is that for rabbitmq (erlang to be specific) bullseye repo is still missing | 15:06 |
noonedeadpunk | and galera 10.6 fails while running mariadb-upgrade | 15:06 |
noonedeadpunk | well, actually, it times out | 15:06 |
noonedeadpunk | I didn't have time to dig into this though | 15:06 |
noonedeadpunk | I guess once we figure this out we can do milestone release for testing | 15:08 |
noonedeadpunk | while still taking time before roles branching | 15:09 |
noonedeadpunk | also we have less than a month until the final release, just in case | 15:09 |
noonedeadpunk | like updating MariaDB pooling vars across roles, if we get them agreed in time | 15:10 |
damiandabrowski[m] | regarding SQLAlchemy's connection pooling, i've checked how many active connections we have on our production environments and i think we should stick to oslo.db's defaults (max_pool_size: 5, max_overflow: 50, pool_timeout: 30). I'll try to run some rally tests to make sure it won't degrade performance | 15:16 |
mgariepy | hey. | 15:20 |
noonedeadpunk | \o/ | 15:20 |
noonedeadpunk | I wonder if we should set max_pool_size: 10 but might be we don't need even this amount | 15:21 |
noonedeadpunk | 5 feels somewhat extreme | 15:21 |
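For reference, the oslo.db options being debated live in each service's `[database]` section; a sketch using the defaults quoted above (the file name is illustrative, and the values are oslo.db defaults, not a recommendation):

```ini
# Hypothetical [database] section, e.g. in nova.conf.
# Values are the oslo.db defaults quoted in the discussion.
[database]
max_pool_size = 5     # steady-state connections kept open per worker
max_overflow = 50     # extra connections allowed under burst load
pool_timeout = 30     # seconds to wait for a free connection before erroring
```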
mgariepy | if you want i can take a couple hours to dig into the issue this afternoon. | 15:22 |
noonedeadpunk | I won't refuse help - the more eyes we have on this the more balanced result we get I guess | 15:24 |
spatel | hey | 15:24 |
noonedeadpunk | But I'm absolutely sure we should adjust these results, and I'd wait for the patches to land before release tbh | 15:24 |
damiandabrowski[m] | noonedeadpunk: regarding max_pool_size, none of the environments i checked needed more than 2 for max_pool_size over the 12 hours (and with max_overflow set to 50, that makes it far more than enough). But i'll share some info after the rally tests | 15:26 |
noonedeadpunk | well, in case of 1 controller failure I guess this number would increase? | 15:27 |
damiandabrowski[m] | spatel: there is a script i used to check how many active SQL connections I have for each openstack service. You may find it useful | 15:27 |
damiandabrowski[m] | https://paste.openstack.org/show/811029/ | 15:27 |
noonedeadpunk | I think you meant mgariepy hehe | 15:27 |
spatel | :) | 15:28 |
damiandabrowski[m] | ouh my bad, sorry! | 15:28 |
noonedeadpunk | andrewbonney: you had your head in this topic as well. wdyt? | 15:28 |
mgariepy | nice damiandabrowski[m] thanks i'll take a look | 15:30 |
spatel | damiandabrowski[m] i am trying to run but throwing some error :) | 15:33 |
spatel | may be my box doesn't have some tools | 15:33 |
spatel | sort: cannot read: ./data/keystone: No such file or directory | 15:34 |
damiandabrowski[m] | well, i found an issue with this script yesterday. Looks like it uses paths relative to $PWD, so it has to be executed from its own directory. maybe that's the case? | 15:35 |
damiandabrowski[m] | line 13 should create ./data and line 16 should create service directories inside it | 15:36 |
damiandabrowski[m] | ouh and another thing: You need to run this script with 'collect' argument to collect some data, then You will be able to use 'summary' argument ;) | 15:37 |
andrewbonney | It's a while since I've looked at this, but it wouldn't surprise me if existing defaults could be reduced. I just wouldn't have confidence doing so without checking larger deployments | 15:39 |
noonedeadpunk | would be interesting to see spatel's result actually | 15:39 |
spatel | noonedeadpunk are you talking about related SQL script? | 15:40 |
noonedeadpunk | yeah | 15:40 |
spatel | fyi, i am trying to test it on wallaby not latest branch | 15:40 |
spatel | damiandabrowski[m] - https://paste.opendev.org/show/811030/ | 15:41 |
spatel | may need some love to that script.. | 15:42 |
damiandabrowski[m] | tbh i ran it with bash so it may have some issues with sh :/ | 15:42 |
spatel | trying with bash.. | 15:44 |
spatel | https://paste.opendev.org/show/811031/ | 15:45 |
spatel | all zero.. that is impossible | 15:45 |
noonedeadpunk | it's sandbox? | 15:45 |
spatel | lab with 5 compute nodes | 15:45 |
noonedeadpunk | and no activity? | 15:46 |
spatel | 1 controller | 15:46 |
spatel | let me run on busy cloud | 15:46 |
noonedeadpunk | because what this script calculates, I believe, is the amount of active sql requests at a given time | 15:46 |
mgariepy | yep, if threads are sleeping it won't count them | 15:47 |
spatel | assuming collect continue collecting data in background right? ./mysql-active-conn.sh collect | 15:47 |
damiandabrowski[m] | i think it's very likely to be 0, but You can check the result of `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"` | 15:48 |
damiandabrowski[m] | yeah, i was running 'collect' in the background for 12 hours | 15:48 |
spatel | zero.. result | 15:48 |
noonedeadpunk | hm, that's weird on busy cloud indeed | 15:48 |
spatel | https://paste.opendev.org/show/811032/ | 15:49 |
noonedeadpunk | that is possible actually :) | 15:49 |
mgariepy | but even sleeping connection are still a connection | 15:50 |
damiandabrowski[m] | but if i understand it correctly, our main point of implementing pooling is to reduce the number of sleeping connections as we don't need them ;) | 15:51 |
damiandabrowski[m] | i mean, with max_pool_size=5, we will have 5 sleeping/active connections per worker per service | 15:51 |
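The per-worker arithmetic above can be sketched quickly. Only the pool numbers (5 + 50) come from the discussion; the controller, service, and worker counts below are made-up illustrations:

```python
# Rough worst-case ceiling for SQL connections under oslo.db pooling:
# every worker of every service on every controller exhausts its pool
# plus its overflow. All counts except pool/overflow are illustrative.

def max_connections(controllers, services, workers_per_service,
                    max_pool_size, max_overflow):
    """Upper bound on simultaneous SQL connections."""
    return (controllers * services * workers_per_service
            * (max_pool_size + max_overflow))

# e.g. 3 controllers, 10 services, 8 workers each, oslo.db defaults 5+50
print(max_connections(3, 10, 8, 5, 50))  # 13200
```

The point of the sketch: the steady-state footprint (max_pool_size) is small, but max_overflow makes the theoretical ceiling large, which is why the max_connections server limit still matters.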
noonedeadpunk | yeah idea is that our current default setup is weird in terms of pooling | 15:52 |
noonedeadpunk | because of huge amount of sleeping connections | 15:52 |
spatel | fyi, i ran this script on the busiest cloud in my datacenter and the result is all zero. i am running collect for just 1 min.. | 15:52 |
opendevreview | James Denton proposed openstack/openstack-ansible-os_ironic master: Add [nova] section to ironic.conf https://review.opendev.org/c/openstack/openstack-ansible-os_ironic/+/818115 | 15:53 |
spatel | may be i am running on stein release thats why :) | 15:53 |
noonedeadpunk | nah I don't think it matters | 15:53 |
damiandabrowski[m] | and `cat ./data/nova` has only zeros? i was testing it on victoria, but i think it doesn't matter | 15:54 |
noonedeadpunk | it's pure mysql | 15:54 |
spatel | k | 15:54 |
spatel | damiandabrowski[m] yes all zero for nova and even other services also | 15:55 |
spatel | only root and system user has 1 and 5 connection | 15:55 |
damiandabrowski[m] | so IMO, openstack queries are processed super fast so the script doesn't catch them | 15:55 |
damiandabrowski[m] | i mean, it just parses `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"` and saves the output to service files in ./data/* | 15:56 |
damiandabrowski[m] | in my case, the output was 0 in 99% of cases | 15:57 |
spatel | noonedeadpunk this is my mytop command output - https://paste.opendev.org/show/811033/ | 15:59 |
spatel | majority are sleeping connection | 15:59 |
noonedeadpunk | yeah and we aim to reduce their number) | 16:00 |
spatel | does sleep consume resources ? | 16:00 |
noonedeadpunk | well, they do. it's not like _huge_ problem, but unpleasant | 16:01 |
noonedeadpunk | eventually you have tons of hanging tcp connections, which also put some load on the tcp stack | 16:01 |
spatel | how about setting up - interactive_timeout = 300 | 16:02 |
spatel | i believe default is 8hrs | 16:02 |
spatel | or wait_timeout ? | 16:03 |
spatel | interactive_timeout | 28800 | 16:03 |
spatel | wait_timeout | 28800 | 16:03 |
damiandabrowski[m] | sleeping connections are the most problematic when galera nodes are going down. | 16:04 |
damiandabrowski[m] | Galera will keep them until timeout, that's how galera can easily reach max_connections | 16:04 |
damiandabrowski[m] | yeah, actually my main point was to implement connection pooling and lower wait_timeout - but I'm open to your ideas ;) | 16:05 |
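Lowering the server-side idle timeouts floated above would be a my.cnf change along these lines (300 is the value spatel suggested; the defaults shown in the discussion are 28800, i.e. 8 hours):

```ini
# Sketch only: server-side idle-connection timeouts for MariaDB.
# 300s is the value floated in the discussion, not a tested recommendation.
[mysqld]
wait_timeout = 300          # non-interactive clients (what the services use)
interactive_timeout = 300   # interactive clients (mysql CLI etc.)
```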
spatel | i had the same issue.. that is why i kept max_connections at 5000 or more... my upgrade failed last time because of the high max_connections limit | 16:05 |
mgariepy | damiandabrowski[m], which service does consume the most mysql connections ? | 16:08 |
damiandabrowski[m] | in my case: nova | 16:10 |
spatel | yes nova is always on top | 16:10 |
noonedeadpunk | #endmeeting | 16:11 |
opendevmeet | Meeting ended Tue Nov 16 16:11:58 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:11 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.html | 16:11 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.txt | 16:11 |
opendevmeet | Log: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.log.html | 16:11 |
mgariepy | noonedeadpunk, which one of galera or rabbit do you want me to tackle? | 16:13 |
noonedeadpunk | would be great if you could take a look at galera | 16:14 |
noonedeadpunk | but I bet this is mariadb bug | 16:14 |
noonedeadpunk | since according to zuul logs mariadb-upgrade always fails on schema_redundant_indexes | 16:15 |
mgariepy | ok i'll take a look at it a bit later. | 16:15 |
noonedeadpunk | and this table has been implemented with 10.6 according to https://mariadb.com/docs/reference/mdb/sys/schema_redundant_indexes/ | 16:15 |
noonedeadpunk | also it's not easily reproducible... | 16:15 |
mgariepy | ok | 16:18 |
noonedeadpunk | I've also asked in #maria (on libera) but I guess they will just suggest filing a bug report and providing the collected stack trace | 16:19 |
mgariepy | hmm ok, it fails sometimes during upgrade? do you have pointers to the logs somewhere? | 16:21 |
noonedeadpunk | check out timeout tasks for https://review.opendev.org/c/openstack/openstack-ansible/+/817385 | 16:21 |
noonedeadpunk | that is always last task before timeout https://zuul.opendev.org/t/openstack/build/fed013733f9f415bb2d1118a26287bab/log/logs/host/mariadb.service.journal-17-59-27.log.txt#205 | 16:22 |
mgariepy | ok perfect. | 16:24 |
mgariepy | it's always a timeout on focal metal or it doesn't matter ? | 16:24 |
noonedeadpunk | oh well, I saw it indeed 2 times and only on focal right now | 16:25 |
mgariepy | ok | 16:25 |
mgariepy | perfect i'll dig into it after lunch. see if i can figure out what's going on. | 16:25 |
noonedeadpunk | thanks! | 16:27 |
noonedeadpunk | I think we should at least try to reproduce and collect debug information to post a bug report | 16:27 |
noonedeadpunk | ie https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#enabling-core-dumps | 16:29 |
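Following the linked KB page, enabling core dumps is roughly a my.cnf flag plus lifting the core-size limit via a systemd drop-in (unit name and paths below are assumptions for a typical install, and systemd needs a daemon-reload after adding the drop-in):

```ini
# my.cnf: ask mysqld/mariadbd to dump core on crash
[mysqld]
core_file

# systemd drop-in, e.g. /etc/systemd/system/mariadb.service.d/core.conf
# (assumed unit name; run `systemctl daemon-reload` afterwards)
[Service]
LimitCORE=infinity
```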
noonedeadpunk | as I don't think it's us doing something wrong | 16:30 |
mgariepy | yep ok | 16:30 |
mgariepy | i'll script launching a couple of vms in my cloud to reproduce it and enable core dumps to try to grab the trace. | 16:31 |
opendevreview | James Denton proposed openstack/openstack-ansible master: Deprecate OVN-related haproxy configuration https://review.opendev.org/c/openstack/openstack-ansible/+/813858 | 16:32 |
opendevreview | Merged openstack/ansible-config_template master: Fix repository URL in galaxy.yml https://review.opendev.org/c/openstack/ansible-config_template/+/817720 | 17:01 |
*** sshnaidm is now known as sshnaidm|afk | 17:33 | |
opendevreview | Merged openstack/openstack-ansible-os_nova master: Allow to provide mdev addresses as list https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/817738 | 17:51 |
*** tosky is now known as Guest6054 | 18:05 | |
*** tosky_ is now known as tosky | 18:05 | |
mgariepy | i'll try to reproduce on an AIO lxc install, this way i can reset only the galera container if it passes the mysql_upgrade | 18:21 |
*** sshnaidm|afk is now known as sshnaidm | 18:39 | |
opendevreview | Merged openstack/openstack-ansible master: Minor update of openstack collection https://review.opendev.org/c/openstack/openstack-ansible/+/817851 | 18:45 |
opendevreview | Merged openstack/openstack-ansible master: Remove note about metal/horizon compatability https://review.opendev.org/c/openstack/openstack-ansible/+/771573 | 18:45 |
*** tosky_ is now known as tosky | 18:48 | |
mgariepy | noonedeadpunk, everysingle time i did the run on my server it has the issue. | 19:53 |
mgariepy | looks like a race condition to me. | 19:53 |
mgariepy | the line in journalctl: sys.schema_redundant_indexes OK is printed on startup. | 20:03 |
opendevreview | Marc GariƩpy proposed openstack/openstack-ansible-galera_server master: Relaod deamon on overrides file creation https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/818138 | 20:47 |
mgariepy | maybe this would belong to systemd_services ? | 20:48 |
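The pattern in the patch above (reloading systemd when an override file is created) generically looks like this in Ansible; the task and handler names here are hypothetical, not the actual contents of review 818138:

```yaml
# Hypothetical sketch of the daemon-reload-on-override pattern.
- name: Create systemd override for mariadb
  ansible.builtin.copy:
    src: override.conf
    dest: /etc/systemd/system/mariadb.service.d/override.conf
  notify: Reload systemd

# handler
- name: Reload systemd
  ansible.builtin.systemd:
    daemon_reload: true
```

Moving this into a shared role like systemd_service, as suggested, would keep the reload logic in one place instead of per-service roles.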
mgariepy | i'll be back tomorrow ! | 20:54 |
*** tosky is now known as Guest6070 | 22:42 | |
*** tosky_ is now known as tosky | 22:42 | |
*** tosky is now known as Guest6073 | 23:07 | |
*** tosky_ is now known as tosky | 23:07 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!