eandersson | sorrison, what version of rabbitmq are you running? | 00:08 |
---|---|---|
sorrison | we're on 3.6.5 erlang 19.3 | 00:09 |
sorrison | We tried 3.7.9 and erlang 20.3 but it crashed after about 5 mins of running, so we rolled back. (we tried this twice) | 00:10 |
eandersson | interesting | 00:10 |
eandersson | How many nodes is this with? does it happen with 1? or 100? | 00:10 |
eandersson | *compute nodes | 00:10 |
eandersson | basically does the load have an impact? | 00:11 |
sorrison | Yes I think load is a factor | 00:11 |
eandersson | So one obvious factor, especially with older versions of RabbitMQ could be that it just can't accept connections fast enough. | 00:12 |
sorrison | prod is around 1000 hosts and can't run latest version, test is 8 hosts and is fine (with new or old version of rabbit) | 00:12 |
eandersson | Would cause > {handshake_timeout,frame_header} | 00:12 |
sorrison | yes that is what we see! | 00:12 |
eandersson | > num_acceptors.ssl = 1 | 00:13 |
eandersson | So for SSL, these older versions can only support one SSL acceptor | 00:13 |
eandersson | If you upgrade to something like 3.6.14, or maybe even your version | 00:13 |
eandersson | you can bump that to something like 10 | 00:13 |
eandersson | https://www.rabbitmq.com/configure.html | 00:14 |
sorrison | the bulk of the load is neutron agents on this rabbit and so we offloaded ssl to our F5 but didn't help | 00:14 |
sorrison | Yeah ok, I'll have a look into that. We tried 3.6.14 but it crashed and burned when we put the load on it too | 00:14 |
sorrison | we think it's to do with the distributed mgmt interface | 00:14 |
eandersson | https://github.com/rabbitmq/rabbitmq-server/issues/1729 | 00:16 |
eandersson | Trying to find the exact bug | 00:16 |
eandersson | I think it might work on 3.6.5 as well, but haven't tested it. | 00:18 |
eandersson | Also consider raising the ssl_handshake_timeout | 00:25 |
sorrison | yeah ok, I'll try {num_ssl_acceptors, 10} and maybe doubling ssl_handshake_timeout | 00:27 |
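(For reference, a sketch of those two settings in the classic Erlang-term `rabbitmq.config` format used by 3.6.x; the values shown are the ones discussed above, not recommendations:)

```erlang
%% /etc/rabbitmq/rabbitmq.config -- classic Erlang-term format (RabbitMQ 3.6.x)
[
  {rabbit, [
    %% Number of TLS connection acceptors; 3.6.x defaults to 1.
    {num_ssl_acceptors, 10},
    %% TLS handshake timeout in milliseconds; default is 5000.
    {ssl_handshake_timeout, 10000}
  ]}
].
```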
sorrison | So from my reading of this bug, 10 acceptors is the default in 3.7.9? | 00:27 |
eandersson | Yea - unfortunately only 3.7.9 | 00:28 |
sorrison | looks like it might be referenced as num_acceptors.ssl too | 00:28 |
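(In the new-style sysctl-like `rabbitmq.conf` format introduced in 3.7, the equivalent key is indeed spelled with a dot; a sketch, assuming 3.7+:)

```ini
# rabbitmq.conf -- new-style format (RabbitMQ 3.7+)
# num_acceptors.ssl defaults to 10 in 3.7.x
num_acceptors.ssl = 10
ssl_handshake_timeout = 10000
```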
eandersson | I can't remember which version it was actually fixed in though | 00:28 |
eandersson | https://github.com/rabbitmq/rabbitmq-server/commit/d687bf0be3a23fdb63c1c0b36db967285a112c74 | 00:29 |
eandersson | Found it ^ | 00:29 |
sorrison | ah so looks like it's in 3.8.0-beta.1 | 00:30 |
eandersson | 3.7.13 | 00:30 |
eandersson | *3.6.13 | 00:30 |
eandersson | But depends on Ranch, not RabbitMQ | 00:31 |
eandersson | > due to a bug in Ranch 1.0 | 00:31 |
sorrison | we just use the debs provided by rabbitmq so I assume this is packaged in there somewhere? (no idea what Ranch is) | 00:32 |
eandersson | Yea, a library that RabbitMQ uses | 00:33 |
eandersson | Unfortunately no clue how to check | 00:33 |
eandersson | According to the bug, worst case it generates harmless error messages on shutdown if you have 1.0 | 00:33 |
sorrison | rabbitmqctl shows it | 00:34 |
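(The Ranch version appears in the `running_applications` list of `rabbitmqctl status`; a quick way to pull it out, shown here against a captured sample line rather than a live broker:)

```shell
# Extract the ranch application entry from rabbitmqctl status output.
# Against a live broker you would pipe: rabbitmqctl status | grep -o '{ranch,[^}]*}'
status_sample='{running_applications,[{ranch,"Socket acceptor pool for TCP protocols.","1.2.1"}]}'
echo "$status_sample" | grep -o '{ranch,[^}]*}'
# prints: {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"}
```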
eandersson | Ranch 1.0 is over 4 years old, so you are probably using a newer version. | 00:34 |
sorrison | we have {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"}, on our 3.6.5 erlang 19.3 cluster | 00:34 |
sorrison | our dev cluster which is 3.7.9 erlang 20.3 has {ranch,"Socket acceptor pool for TCP protocols.","1.6.2"}, | 00:36 |
sorrison | So now I just need to figure out why anything newer than 3.6.5 falls over in our case once I get all the neutron agents connecting to it again | 00:36 |
eandersson | btw how many RabbitMQ nodes are you running? 2 or 3? | 00:37 |
sorrison | 3 | 00:37 |
sorrison | physical hosts with 24 cores and 96G ram | 00:37 |
sorrison | Connections: 11134 | 00:39 |
sorrison | Channels: 10884 | 00:39 |
sorrison | Exchanges: 1262 | 00:39 |
sorrison | Queues: 13055 | 00:39 |
sorrison | Consumers: 18965 | 00:39 |
eandersson | Are you spreading the connections out between the nodes? | 00:39 |
sorrison | yeah, they're evenly spread | 00:40 |
eandersson | Is handshake_timeout the only obvious error you see in the RabbitMQ logs? | 00:44 |
sorrison | {handshake_error,starting,1, | 00:51 |
sorrison | {handshake_timeout,frame_header} | 00:51 |
sorrison | {handshake_timeout,handshake} | 00:51 |
sorrison | {inet_error,{tls_alert,"bad record mac"}} | 00:51 |
sorrison | {inet_error,{tls_alert,"unexpected message"}} | 00:51 |
sorrison | those are the things I see in the logs | 00:51 |
sorrison | {handshake_timeout,frame_header} is by far the most common one; the others we've only seen a couple of times | 00:51 |
sorrison | We also get "missed heartbeat from client" timeouts too | 00:52 |
eandersson | Are you using a LB in front of RabbitMQ, or hitting them directly? | 01:03 |
eandersson | unexpected message / bad record mac is a bit concerning | 01:05 |
sorrison | about 90% of our clients go through an LB which terminates the ssl, the rest hit directly using ssl from rabbit | 01:05 |
eandersson | bad record usually means that there is a client side race condition | 01:06 |
eandersson | maybe a threading issue | 01:07 |
sorrison | the unexpected message / bad record mac errors only appeared a couple of times, and that could've been when we were stopping and starting with the new version etc. | 01:07 |
eandersson | I see | 01:07 |
eandersson | It's worth keeping an eye out for those type of errors, as they would indicate a potential issue with kombu/oslo.messaging | 01:09 |
eandersson | *could | 01:09 |
sorrison | Well downgrading from pike oslo.messaging to ocata oslo.messaging fixes the timeouts for us | 01:16 |
sorrison | See last comment on https://bugs.launchpad.net/oslo.messaging/+bug/1800957/ about what works/doesn't work for us | 01:16 |
openstack | Launchpad bug 1800957 in oslo.messaging "Upgrading to pike version causes rabbit timeouts with ssl" [Undecided,Incomplete] - Assigned to Ken Giusti (kgiusti) | 01:16 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!