mrf3 | hi, deploying wallaby ansible in a lab for test, do you use default HAProxy settings for galera? | 06:16 |
---|---|---|
mrf3 | with the default balance params during deployment you can get timeouts | 06:16 |
jrosser | the defaults should be "sensible" | 07:05 |
jrosser | mrf3: can you give a specific example? do you mean something times out during the deployment and it fails? | 07:06 |
mrf3 | haproxy by default for galera is configured so that just node1 gets all traffic during deployment | 07:12 |
mrf3 | due to "massive" queries / connections during setup, in the 3-node lab galera got timeouts | 07:13 |
mrf3 | doesn't this happen in your tests? | 07:13 |
mrf3 | or do people just test using AIO? | 07:14 |
jrosser | the galera setup is deliberately active/standby/standby | 07:27 |
jrosser | are you exceeding the maximum number of connections? | 07:28 |
*** rpittau|afk is now known as rpittau | 07:28 | |
mrf3 | yes just deploying, 3 controllers, 3 networks, 1 compute | 07:32 |
mrf3 | yes we got " backend galera-back has no server available!" | 07:32 |
mrf3 | galera has 3 containers ofc | 07:32 |
jrosser | well, there is a kind of corner case, that if you exceed the maximum number of database connections, then the haproxy healthcheck itself cannot connect to the database | 07:40 |
jrosser | so the loadbalancer thinks that the db is down | 07:40 |
jrosser | so i think it's worth trying to understand why in your case the loadbalancer thinks there is no backend | 07:41 |
noonedeadpunk | mornings | 07:50 |
jrosser | morning | 07:54 |
mrf3 | im surprised that no one hit this corner case, or did they just solve it themselves without reporting? | 08:12 |
mrf3 | running out of mysql connections is not a "corner" case, it is a future production problem | 08:12 |
mrf3 | because haproxy by default only enables a single node for the full infrastructure | 08:12 |
noonedeadpunk | yeah and we have plans to move away from balancing with haproxy to proxysql | 08:13 |
noonedeadpunk | eventually right now we pushed max connection limit but it's not really a long term solution for sure | 08:14 |
noonedeadpunk | but we also need to reduce the amount of sleeping connections | 08:14 |
noonedeadpunk | we have etherpad with some ideas https://etherpad.opendev.org/p/db_pool_calculations but wasn't able to return back to it :( | 08:15 |
anskiy | I've seen this same issue actually in lab. Fixed it with this override: https://paste.opendev.org/show/809487/. Another problem of the way it works right now is that queries are not distributed across cluster (galera one) nodes, so all the load essentially hits the first node :( | 08:18 |
noonedeadpunk | well, eventually, galera can consume writes only on single node. But reads could be distributed across many | 08:22 |
noonedeadpunk | However haproxy can't really distinguish read/write here because it balances galera at l4 (TCP), while l7 is required for this kind of balancing | 08:23 |
mrf3 | anskiy +1 same happened to me | 08:23 |
mrf3 | i was having a nightmare thinking i had set up something wrong at network level, and then i saw the haproxy galera backend was down and that was why the playbooks went wrong | 08:24 |
jrosser | some specific tuning per deployment is probably needed | 08:32 |
jrosser | depending how many services you deploy / how many CPUs your controllers have / blah blah this all affects the number of database connections | 08:32 |
noonedeadpunk | we set smth like `galera_max_connections: 5000` in user_vars | 08:35 |
noonedeadpunk | because 90% of all connections are sleeping ones atm | 08:35 |
mrf3 | sleeping connections are the problem | 08:40 |
noonedeadpunk | well, it depends. Because it speeds up reaching maria by using already established connections | 09:10 |
noonedeadpunk | Also you shouldn't get more connections to DB now except from the already spawned pool | 09:10 |
jrosser | the etherpad explains it quite will tbh | 09:11 |
jrosser | *well | 09:11 |
mrf3 | galera_wait_timeout is the timeout for idle connections? | 09:59 |
jrosser | mrf3: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/templates/my.cnf.j2#L66 | 10:34 |
mrf3 | ty jrosser i set it to a timeout of 120 secs | 10:54 |
mrf3 | 3600 secs of idle for a sql connection is crazy | 10:55 |
jrosser | this is not an OSA thing btw | 10:55 |
jrosser | it is how oslo.db middleware works, and by extension all of the services | 10:55 |
jrosser | you can tune the settings on the db connection pool based on your use case | 10:56 |
spatel | jamesdenton morning | 13:14 |
*** arxcruz is now known as arxcruz|ruck | 13:14 | |
jamesdenton | good morning | 13:14 |
spatel | I found solution :) | 13:14 |
jamesdenton | mice :) | 13:15 |
jamesdenton | nice, too! | 13:15 |
spatel | did you? | 13:15 |
jamesdenton | tell me, was it obvious? | 13:15 |
spatel | i saw your post related that error | 13:15 |
jamesdenton | no, i've been messing with it | 13:15 |
spatel | solution is Disable -> HP Shared Memory Features | 13:15 |
spatel | In BIOS | 13:15 |
jamesdenton | hmm, ok | 13:16 |
spatel | First you have to upgrade the NIC firmware, it will give you an option in BIOS to disable/enable HP Shared Memory | 13:16 |
jamesdenton | ok, let me reboot this thing and see | 13:16 |
spatel | I am very surprised there is no simple answer in internet, every post just clueless... | 13:17 |
spatel | https://ibb.co/YW1HTdG | 13:17 |
jamesdenton | it all varies so much | 13:17 |
spatel | In Device level configuration > disable Shared memory and DPDK will work like charm | 13:17 |
spatel | Make sure you have the latest firmware for the HP NIC, otherwise you won't be able to find that option in BIOS | 13:18 |
jamesdenton | i think i do, but i'll double check | 13:18 |
jamesdenton | i need to compare this config to another G9 i have tested DPDK with. It's possible i disabled it there, i dunno | 13:19 |
jamesdenton | that was 2 yrs ago | 13:19 |
spatel | :) HP just making this harder for everyone.. | 13:19 |
spatel | now i am testing on different server to just check this is correct process and i would like to blog it out so other folks don't need to struggle | 13:20 |
spatel | learnt one more thing in blog never reference HP kb article because they started restricting some of them like redhat doing, only paid people allow to use | 13:21 |
jamesdenton | maybe a screenshot would help | 13:21 |
spatel | yep.. i was reading your blog post and same thing happened most of the links are broken :( | 13:22 |
jamesdenton | which post was that? | 13:22 |
spatel | let me find it.. you have very nice blog related this issue | 13:22 |
spatel | jamesdenton https://www.jimmdenton.com/proliant-intel-dpdk/ | 13:24 |
spatel | may be that solve your issue :) | 13:24 |
noonedeadpunk | haha lol | 13:25 |
jamesdenton | Well, i think the IOMMU stuff was <= G8, i'm not seeing the same thing on this G9 | 13:25 |
jamesdenton | but the shared memory stuff sound plausible | 13:25 |
spatel | I have Gen9 servers and i am seeing same issue | 13:25 |
spatel | May be HP later add that feature to disable enable shared memory to handle RMRR (using NIC firmware) | 13:27 |
spatel | jamesdenton i am planning to patch ovn to support dpdk. its broken at present and only supported in ml2.ovs plugin | 13:31 |
jamesdenton | by all means | 13:31 |
spatel | jamesdenton did you try AF_XDP ? | 13:33 |
spatel | look like its next big thing in OVS, if it workout then it will overtake DPDK | 13:34 |
jamesdenton | if it's easier to implement it just might do it | 13:35 |
spatel | https://docs.openvswitch.org/en/latest/intro/install/afxdp/ | 13:35 |
spatel | did you play with it? i am going to try out | 13:35 |
jamesdenton | i have not | 13:40 |
spatel | jamesdenton i am reading this - https://docs.openstack.org/networking-ovn/queens/admin/dpdk.html | 13:44 |
spatel | help me to understand, why they are saying just set datapath_type=netdev | 13:44 |
spatel | for br-int ? | 13:45 |
spatel | what about br-provider, we don't need it there? | 13:45 |
jamesdenton | yes | 13:45 |
jamesdenton | "and all other bridges if connected to the integration bridge via patch ports" | 13:45 |
spatel | br-provider is external network so that should be connect to dpdk port right? | 13:46 |
spatel | so we don't need command like this ? - ovs-vsctl add-port br-int dpdk-1 -- set Interface dpdk-1 type=dpdk options:dpdk-devargs=0000:06:00.1 | 13:47 |
jamesdenton | i would expect to need that | 13:49 |
jamesdenton | i don't think those instructions are complete | 13:49 |
spatel | that is what i thought something is missing | 13:51 |
spatel | let me ask openvswitch channel | 13:51 |
strattao | we need to change our cinder_service_password. I see warnings all over the documentation that say changing the password and running the playbooks will break the region. So... how would I go about changing the cinder_service_password? (or any other passwords that might need to be changed for that matter...) | 14:01 |
strattao | (sorry to hijack this fascinating conversation spatel & jamesdenton) | 14:02 |
spatel | strattao no worry, i mostly go with whatever secret password generated by OSA, why do you want to change that ? | 14:03 |
spatel | you are not going to use them in daily basis so why bother | 14:04 |
strattao | our password complexity requirements changed and the ones that were originally autogenerated no longer match the new requirements. | 14:04 |
spatel | but why just cinder_service_password only? | 14:05 |
noonedeadpunk | strattao: Actually I think you can just update the password and re-run the role. | 14:06 |
strattao | looks like that's the only one that doesn't meet the new requirements | 14:07 |
noonedeadpunk | The service will really be broken, but only until role execution ends | 14:07 |
noonedeadpunk | Because you need services to be restarted after password is changed, and it's changed somewhere in the middle of role execution | 14:07 |
noonedeadpunk | if you use manila - you should also re-run it | 14:08 |
strattao | hmmm, well I was testing this out in one of our dev regions and it failed re-running os-cinder. The error was with keystone, so I re-ran os-keystone, then os-cinder, to no avail. Now I have setup-everything running in the background now, but figured I'd ask here if anyone has ever had to update a password since all the documentation says that this is bad and I don't want to break in production. | 14:10 |
noonedeadpunk | And what was the issue with keystone? | 14:14 |
noonedeadpunk | Have you tried to comment out no_log there? | 14:14 |
noonedeadpunk | because we should have force to be set for quite some time | 14:14 |
noonedeadpunk | Which means that role will attempt to change password | 14:14 |
noonedeadpunk | Unless you've also changed keystone admin password, then things are really bad | 14:15 |
strattao | Yeah, you're right - the keystone error I was seeing was that the cinder-api is not responsive. So, I'll try to restart cinder services in the api container now that the password has been reset and then re-run the os-cinder playbook and ..... ??? profit?!? | 14:23 |
strattao | Nope, that didn't work :( | 14:26 |
strattao | Okay, well I'll keep plugging along, thx | 14:26 |
spotz | strattao: Depending on the error it might also be a good idea to ask in Cinder assuming it's not an OSA issue | 14:37 |
noonedeadpunk | I guess it's osa actually.... | 15:06 |
noonedeadpunk | strattao: but some output would be useful | 15:06 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Add serial execution to all playbooks https://review.opendev.org/c/openstack/openstack-ansible/+/805188 | 15:17 |
*** rpittau is now known as rpittau|afk | 15:45 | |
spatel | jamesdenton, i have tested my method on 3 servers and it just works after disabling the HP shared memory feature in BIOS | 15:57 |
jamesdenton | nice! so far, that has not worked for me | 16:15 |
*** mgoddard- is now known as mgoddard | 17:13 | |
-opendevstatus- NOTICE: Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page | 18:35 | |
mrf3 | mm guys how much memory is needed for a controller deployed with ansible? 16GB is not enough | 22:25 |
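
Editor's sketch, relating to the galera connection discussion earlier in the log (06:16 to 10:56): since haproxy sends all traffic to a single galera node, that node's `galera_max_connections` has to cover every service's connection pool at once. The Python below is a rough back-of-the-envelope estimate, not the calculation from the linked etherpad. It assumes each worker process keeps its own pool bounded by the oslo.db options `max_pool_size` and `max_overflow`; the `ServicePool` class, the `reserve` margin, and all service names and numbers are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ServicePool:
    """One service's database pool settings (illustrative values only)."""
    name: str
    processes: int       # worker processes across all controllers (assumed figure)
    max_pool_size: int   # oslo.db [database] max_pool_size
    max_overflow: int    # oslo.db [database] max_overflow

    def worst_case_connections(self) -> int:
        # Assumption: every worker process holds its own pool and can
        # open up to max_pool_size + max_overflow connections.
        return self.processes * (self.max_pool_size + self.max_overflow)


def total_worst_case(pools: Iterable[ServicePool]) -> int:
    """Sum the worst-case connection count over all services."""
    return sum(p.worst_case_connections() for p in pools)


def fits(pools: Iterable[ServicePool], galera_max_connections: int,
         reserve: int = 10) -> bool:
    """Check whether the active galera node can hold every pool at once.

    `reserve` is a hypothetical margin kept free so that things like the
    haproxy health check can still connect when the pools are full.
    """
    return total_worst_case(pools) + reserve <= galera_max_connections


if __name__ == "__main__":
    # Hypothetical 3-controller lab; none of these numbers come from the log.
    lab = [
        ServicePool("nova", processes=48, max_pool_size=5, max_overflow=50),
        ServicePool("neutron", processes=24, max_pool_size=5, max_overflow=50),
        ServicePool("cinder", processes=12, max_pool_size=5, max_overflow=50),
    ]
    print("worst case connections:", total_worst_case(lab))
    print("fits in galera_max_connections=5000:", fits(lab, 5000))
```

If the estimate does not fit, the options raised in the log are to raise `galera_max_connections`, shrink the per-service pools, or shorten the idle timeout so sleeping connections are dropped sooner.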