rohit02 | hi team, while deploying OSA Wallaby, setup openstack failed at TASK [os_nova : Perform a cell_v2 discover] https://paste.opendev.org/show/810323/ | 07:19 |
---|---|---|
rohit02 | noonedeadpunk: any idea here while deploying OSA Wallaby setup openstack failed at TASK [os_nova : Perform a cell_v2 discover] https://paste.opendev.org/show/810323/ | 07:43 |
anskiy | rohit02: OperationalError: (1040, 'Too many connections'), you can check mysqladmin processlist and I think there are a bunch of processes in COMMIT state | 07:45 |
anskiy | which should lead to this: https://jira.mariadb.org/browse/MDEV-25368 | 07:46 |
rohit02 | anskiy: thanks.. any resolution to overcome this error? | 07:47 |
anskiy | I've ended up pinning mysql version to 10.5.6, like this: https://paste.opendev.org/show/810326/ | 07:48 |
anskiy | or you are just hitting that "too many connections" error normally; in that case you just need to increase galera_max_connections. It is set based on this formula: (100 x vCPUs), but takes into account ALL nodes in the cluster, so having one node with 2 cores (like the deployment host, which I use for logging) sets this to 200, which could be a little low, depending on the number of services you're trying to deploy. | 07:52 |
noonedeadpunk | rohit02: we here bumped max connections for galera | 08:26 |
noonedeadpunk | ie galera_max_connections: 1000 | 08:27 |
noonedeadpunk | This won't solve issue with connections in COMMIT | 08:27 |
noonedeadpunk | but if there are plenty of connections in wait/sleep state it will | 08:27 |
opendevreview | Dmitriy Rabotyagov proposed openstack/ansible-role-python_venv_build stable/ussuri: Set centos-7 jobs to non voting https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/816316 | 10:42 |
opendevreview | Dmitriy Rabotyagov proposed openstack/ansible-role-python_venv_build stable/ussuri: Workaround distro provided pip having old CA certs on centos-7 https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/816317 | 10:43 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/ussuri: Fetch upper constraints file with curl rather than allow pip to download it https://review.opendev.org/c/openstack/openstack-ansible/+/815632 | 10:51 |
jrosser_ | anskiy: does the connection limit really get calculated based on the deploy host cpus? that sounds surprising? | 10:52 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/ussuri: Bump OpenStack-Ansible Ussuri https://review.opendev.org/c/openstack/openstack-ansible/+/815589 | 10:52 |
anskiy | jrosser_: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/templates/my.cnf.j2#L56-L64 and that default is calculated at the beginning of the same file | 11:32 |
jrosser_ | but only for the galera hosts? https://github.com/openstack/openstack-ansible-galera_server/blob/master/templates/my.cnf.j2#L2 | 11:33 |
jrosser_ | i'm not sure that i can see how the deploy host having two vcpu would affect it? | 11:33 |
anskiy | jrosser_: hm, it's true. Maybe I should've dug more into this. But what I was seeing is that I got max_connections capped at 200, and the only node I had with that number of cores was the logging one. Or it could've been because that default value was being triggered somehow, instead of the proper value from the fact. | 11:39 |
jrosser_ | if theres a bug somewhere then it would be great to understand why it's coming out that way for you | 11:40 |
jrosser_ | i guess that the default(2) here https://github.com/openstack/openstack-ansible-galera_server/blob/master/templates/my.cnf.j2#L4 might have something to say about this | 11:41 |
jrosser_ | if the number of vcpu is somehow not available, but i'd have expected that to fail already at the previous line when it tries to retrieve the value into `vcpus` | 11:42 |
anskiy | jrosser_: yeah. Thanks for pointing this out, I'll try to dig more into this and file a bug if appropriate | 11:45 |
mgariepy | depending on how many compute nodes you have, even with a decent core count on the galera nodes you might end up not having enough connections. | 11:48 |
mgariepy | Also if your fact cache is stale for some nodes it can be set a bit too low (e.g. calculating from the vcpus of only 1 node out of the 3). | 11:49 |
jrosser_ | i'm surprised that the template doesn't blow up if the facts are not available | 11:49 |
mgariepy | there is always 1 there.. | 11:50 |
mgariepy | but yeah. indeed it's surprising that it didn't fail hard lol | 11:50 |
mgariepy | on a smallish deployment of 120 compute nodes i generally set it to: galera_max_connections: 4800 | 11:51 |
opendevreview | Merged openstack/openstack-ansible-os_tempest stable/stein: Fix tempest plugin versions https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/814535 | 11:53 |
strattao | We've also seen the issue with the max connections being set to 200 even though I have more than 2 vcpus available. Haven't pinned down the issue, but it does seem that there is a bug with the calculation that is being made. We had to manually bump up the galera_max_connections, but I don't recall having to do that before our Wallaby tests | 12:26 |
opendevreview | Merged openstack/openstack-ansible stable/stein: Remove tempest plugins CI overrides https://review.opendev.org/c/openstack/openstack-ansible/+/814558 | 12:31 |
opendevreview | Merged openstack/ansible-role-pki master: Slurp all server certs not just first one https://review.opendev.org/c/openstack/ansible-role-pki/+/815849 | 12:55 |
opendevreview | Merged openstack/openstack-ansible master: Switch services to track stable/xena https://review.opendev.org/c/openstack/openstack-ansible/+/815597 | 14:00 |
*** sshnaidm_ is now known as sshnaidm | 15:49 | |
jrosser_ | behaviour of this galera max connections template is bizarre https://paste.opendev.org/show/810346/ | 16:39 |
mgariepy | it appends strings? | 16:58 |
jrosser_ | apparently | 17:01 |
jrosser_ | i thought i'd just grab the bit of the template out into a test playbook and mess with it | 17:01 |
mgariepy | what version of ansible ? | 17:01 |
jrosser_ | ah good question | 17:01 |
jrosser_ | hmmm 2.9.11 | 17:02 |
jrosser_ | whatever i change the multiplier to, it puts the vcpu number in that many times | 17:03 |
mgariepy | '>' not supported between instances of 'str' and 'int'" | 17:07 |
mgariepy | if you swap the strings for int ? in the append statement? | 17:09 |
jrosser_ | thats why i've had to add the quotes around '2' for example | 17:09 |
jrosser_ | i have no idea how this works at all in the current code :( | 17:10 |
mgariepy | '2' and '9999' are actually strings. and with ansible 2.11.1 it does error out on the comparison. | 17:10 |
mgariepy | https://paste.opendev.org/show/810347/ | 17:10 |
jrosser_ | i was getting `'<' not supported between instances of 'int' and 'AnsibleUnsafeText'"}` | 17:14 |
jrosser_ | without the quotes around the '2' on 2.9.11 | 17:14 |
mgariepy | lol | 17:14 |
mgariepy | we got to love ansible :) | 17:14 |
mgariepy | very predictable. | 17:14 |
mgariepy | for stable features, that is. lol | 17:14 |
jrosser_ | ah right, and i'm just trying with ansible 4.8.0 and i get the same as you | 17:15 |
jrosser_ | but either way it still templates out N times the vcpu number | 17:16 |
mgariepy | https://paste.opendev.org/show/810348/ | 17:16 |
mgariepy | yep. | 17:16 |
jrosser_ | oh wait - how did you do that? | 17:17 |
mgariepy | https://paste.opendev.org/show/810349/ | 17:17 |
jrosser_ | oh right, so i guess this is python interpreter / jinja library version trouble then | 17:18 |
mgariepy | i wonder what the rationale was for calculating the number of connections from the galera core count. | 17:19 |
mgariepy | most apis (most of the time) will have X threads (based on core count) i guess, and if they all run on the same host, some number based on the core count of the host might make sense. but ignoring all the other nodes in the cluster is not a good idea imo. | 17:21 |
jrosser_ | as soon as you hit the connection limit the LB healthcheck starts to fail (as it can't connect either) and you end up with a catastrophic failure | 17:23 |
mgariepy | what i've seen is the first server stops responding, then the second one takes over.. | 17:24 |
jrosser_ | anyway, adding the quotes on the numerical values and putting this in `| min | int * 20) %}` makes the template behave here | 17:25 |
mgariepy | wouldn't be an issue if all the nodes were really masters.. proxysql to the rescue ! | 17:25 |
jrosser_ | and it does fail hard if the vcpus fact is not present | 17:32 |
jrosser_ | i am confused about how this is ending up getting stuck at 200 for people | 17:32 |
mgariepy | i've seen that with Rocky in the past. when upgrading the controllers. | 17:36 |
mgariepy | when the facts were expired for some nodes it didn't matter. | 17:37 |
mgariepy | can a bug somewhere set the fact to 0 ? | 17:43 |
mgariepy | for some reason | 17:43 |
jrosser_ | I wonder if the galera_nodes can be empty somehow | 17:45 |
mgariepy | if galera_cluster_members is overwritten by the user? | 17:52 |
mgariepy | besides that i don't see how it can be empty. | 17:53 |
*** ianw_pto is now known as ianw | 19:00 | |
strattao | spatel, for ovn deployments in wallaby, is there supposed to be a default driver_interface defined in the ml2.ovn section /etc/ansible/roles/os_neutron/vars/main.yml? | 19:45 |
spatel | strattao what do you mean? | 19:46 |
strattao | I was testing an ovn install off of the stable/wallaby branch and got an error that driver_interface was not defined. I looked in /vars/main.yml of os_neutron and saw that many of the other defined ml2 plugins specify a driver_interface in that file, but ml2.ovn does not. | 19:50 |
strattao | I didn't know if that was the place it would get defined, if it even needs it, if there is another reason why it would be complaining about the driver_interface for ovn... etc. | 19:50 |
spatel | I have never seen that error | 20:07 |
spatel | strattao if you post config/error/output etc.. that would be good | 20:07 |
spatel | Here's what i did - https://satishdotpatel.github.io/openstack-ansible-ovn-deployment-part1/ | 20:08 |
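
The template behaviour discussed between 16:39 and 17:25 (the multiplier repeating the vcpu number instead of scaling it, and the fix of casting with `int` before multiplying) can be illustrated outside Ansible. The following is only a minimal Python sketch of one plausible reading, assuming the vcpu facts reach the expression as text; the function names and the sample fact values are hypothetical and are not taken from the role or from the pastes.

```python
# Hypothetical stand-ins for the per-node vcpu facts of the galera hosts.
# Assumption: the values arrive as text, not as integers.

def max_connections_as_text(vcpu_facts, factor):
    """Reproduce the surprising result: the facts are left as strings."""
    smallest = min(vcpu_facts)   # min() on strings compares lexicographically
    return smallest * factor     # str * int repeats the text, it does not scale it


def max_connections_as_int(vcpu_facts, factor=100):
    """Cast each fact to int first, then take the minimum and multiply."""
    return min(int(v) for v in vcpu_facts) * factor


if __name__ == "__main__":
    facts = ["8", "8", "8"]
    print(max_connections_as_text(facts, 3))  # the text "8" repeated three times
    print(max_connections_as_int(facts))      # numeric result for 8 vcpus
```

With text inputs the minimum is also taken lexicographically, so `"16"` sorts before `"8"`; casting before `min` avoids both problems. The default factor of 100 follows the "(100 x vCPUs)" formula quoted earlier in the log, and whether the real template hits this exact path depends on the Ansible and Jinja versions, as the conversation notes.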