*** sshnaidm is now known as sshnaidm|afk | 10:09 | |
*** sshnaidm|afk is now known as sshnaidm | 11:39 | |
evrardjp | Shameless plug: If you know someone who wants to work on Ironic, feel free to give that person this link: https://citynetwork.uhigher.com/en/job-details?job=61508774-cc6f-4269-87c2-3e07162160f7 ... Or to contact me on irc ;) | 13:31 |
spatel | jamesdenton_alt alternative ID :) | 14:47 |
spatel | what happened here? | 14:47 |
*** jamesdenton_alt is now known as jamesdenton | 14:48 | |
jamesdenton | maybe an imposter! | 14:48 |
spatel | :) | 14:52 |
spatel | I have a very strange issue going on related to networking | 14:52 |
spatel | i thought maybe you could help guide or advise me | 14:53 |
spatel | we have a c7000 HP chassis with 16 gen9 blades | 14:53 |
spatel | all blades are configured with an Active-Standby LACP bundle for redundancy. | 14:53 |
spatel | yesterday i noticed one of the blades crashed, which turned out to be related to a memory failure. but that created a strange issue: the blade switch went wrong and stopped sending LACP PDUs to the upstream TOR switch, and the switch got isolated :( | 14:55 |
spatel | I have a wild theory that maybe the memory failure created a loop on the switch (not sure how) | 14:56 |
spatel | thinking of configuring PASSIVE LACP on the HP blade switch side so if anything happens to the server the switch will shut down the port. | 14:57 |
jamesdenton | hmm | 14:59 |
jamesdenton | was it active-standby or lacp? I think lacp aggregates all links? | 15:00 |
spatel | https://paste.opendev.org/show/810315/ | 15:01 |
spatel | This is what i have on Ubuntu server | 15:02 |
spatel | I thought LACP has a mode called active-standby | 15:02 |
jamesdenton | i just blame netplan | 15:02 |
spatel | what do you mean? | 15:03 |
jamesdenton | active-standby corresponds to mode 1 (not lacp), i think, while 802.3ad would be active-active (mode 4 lacp) | 15:04 |
spatel | you are saying in my case it's not an LACP bond, right? | 15:04 |
jamesdenton | right | 15:04 |
jamesdenton | yes | 15:04 |
jamesdenton | so the link must actually go NO-CARRIER, i think, for the failover to occur | 15:05 |
spatel | hmm | 15:05 |
spatel | This bond config doesn't detect my upstream uplink failure :( | 15:06 |
jamesdenton | it would not detect that | 15:06 |
spatel | I am thinking of adding arp_ip_target so ARP probes to the gateway can detect an upstream failure of the uplink | 15:06 |
jamesdenton | never used it myself, but give it a shot | 15:08 |
spatel | This issue is killing me.. whenever a server crashes or memory fails on these blades, it causes the blade switch to break the LACP bond with the TOR switch :( | 15:10 |
spatel | trying to understand the relation between a server crash and the TOR LACP bundle going down. | 15:10 |
spatel | I am seeing the HP 6120XG blade switch stop sending LACP PDUs to the TOR switch, which puts LACP in suspended mode. | 15:11 |
jamesdenton | and then the downlinks to the servers don't recognize that and appear offline? | 15:14 |
spatel | jamesdenton look at this diagram - https://ibb.co/FntGz01 | 15:16 |
spatel | from the server's side both HP 6120 switches are up, but the TOR switch is not getting any LACP PDU packets, so the TOR is putting this switch's LACP port in suspended | 15:17 |
spatel | This incident only happened to switch-A | 15:17 |
spatel | jamesdenton did you test bonding inside VM ? | 18:40 |
jamesdenton | eeeesh, if i did i don't recall | 19:43 |
jamesdenton | having issues? | 19:43 |
spatel | jamesdenton no worry let me dig and see | 21:09 |
spatel | what vm_memory_high_watermark setting do you guys use for rabbitMQ . | 21:09 |
spatel | / | 21:09 |
spatel | ? | 21:09 |
bjoernt | 0.2 | 21:11 |
bjoernt | it really depends on how much ram you have and how large the vm should become | 21:11 |
spatel | i have 128GB memory | 21:14 |
spatel | my rabbitMQ keeps dying :( | 21:14 |
spatel | i was getting the OOM killer when i had 64GB memory so i added a bunch more DIMMs and now i have 128GB | 21:20 |
spatel | my current setting is 0.2 so thinking of changing it to 0.4 | 21:20 |
bjoernt | depends how it dies. doubling the ram is effectively the same as 0.2 on the old one | 21:39 |
bjoernt | 0.4 i meant | 21:49 |
bjoernt | you don't want too large vms, otherwise the GC will take too long | 21:49 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org is being restarted quickly for some security updates, but should return to service momentarily | 22:09 |
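
Editor's sketch for the bonding exchange around 15:04: jamesdenton's point was that mode 1 is an active/standby bond with no LACP, while mode 4 (802.3ad) is the LACP mode. On Linux the bonding driver reports the mode in `/sys/class/net/<bond>/bonding/mode` as a name followed by a number, for example `active-backup 1`. The helper names and the default `bond0` below are assumptions for illustration, not something taken from spatel's paste.

```python
from pathlib import Path

# Linux bonding driver mode numbers and names.
BOND_MODES = {
    0: "balance-rr",
    1: "active-backup",   # active/standby, no LACP
    2: "balance-xor",
    3: "broadcast",
    4: "802.3ad",         # LACP
    5: "balance-tlb",
    6: "balance-alb",
}


def parse_bond_mode(text):
    """Parse sysfs mode text such as 'active-backup 1' into (name, number)."""
    name, number = text.split()
    return name, int(number)


def uses_lacp(text):
    """Return True only when the bond runs in 802.3ad (mode 4)."""
    _, number = parse_bond_mode(text)
    return number == 4


def read_bond_mode(bond="bond0"):
    """Read the mode of a bond from sysfs (the bond name is an assumption)."""
    return Path(f"/sys/class/net/{bond}/bonding/mode").read_text().strip()
```

On a host with a bond, `uses_lacp(read_bond_mode("bond0"))` would tell you whether the server side is negotiating LACP at all; with an active-backup bond it returns False.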
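
Editor's sketch for the RabbitMQ exchange around 21:39: bjoernt's remark is plain arithmetic. A relative `vm_memory_high_watermark` is a fraction of total RAM, so 0.2 on the new 128GB host gives the same threshold as 0.4 on the old 64GB host. The function name is hypothetical and the figures are the ones quoted in the chat.

```python
def watermark_gb(total_ram_gb, fraction):
    """Memory threshold implied by a relative vm_memory_high_watermark."""
    return total_ram_gb * fraction


old_host_at_04 = watermark_gb(64, 0.4)    # old host, proposed 0.4
new_host_at_02 = watermark_gb(128, 0.2)   # new host, current 0.2
new_host_at_04 = watermark_gb(128, 0.4)   # new host, proposed 0.4

print(old_host_at_04, new_host_at_02, new_host_at_04)
```

So moving to 0.4 on the 128GB host would let the Erlang VM grow to roughly twice the size it is allowed today, which is where the GC concern comes in.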