| Nick | Message | Time |
|---|---|---|
| opendevreview | Gregory Thiemonge proposed openstack/octavia stable/2025.1: Fix issues related to pkg_resources module https://review.opendev.org/c/openstack/octavia/+/977293 | 08:55 |
| opendevreview | Gregory Thiemonge proposed openstack/octavia stable/2024.2: Fix issues related to pkg_resources module https://review.opendev.org/c/openstack/octavia/+/977294 | 08:56 |
| | *** tkajinam_ is now known as tkajinam | 10:13 |
| opendevreview | Gregory Thiemonge proposed openstack/octavia stable/2025.1: Fix issues related to pkg_resources module https://review.opendev.org/c/openstack/octavia/+/977293 | 10:57 |
| gthiemonge | ^ fixes additional issues in 2025.1 :/ | 11:01 |
| | *** beagles__ is now known as beagles | 13:07 |
| servagem | Hi all. We had some Octavia LBs failover due to a temporary Nova issue. The octavia-failover-amphora-flow failed with ComputeBuildException, leaving the LBs in ERROR and requiring manual intervention. Because these LBs support critical workloads, we need automatic retries for failover when Nova has transient failures. I checked [compute] max_retries option, but it doesn't | 13:29 |
| servagem | seem to apply to amphora build/create (only some compute ops like delete). Is that intentional? Any recommended way to enable retries for amphora create during failover? | 13:29 |
| gthiemonge | servagem: hey, yes it is intentional, but the code may have changed a lot since it was added. Basically, the octavia-worker sends only one create-server request to nova, then waits for a certain amount of time until nova marks the VM as active; that wait is bounded by max_retries and the interval. So max_retries is not used to retry creating a VM, only to retry getting a positive | 13:49 |
| gthiemonge | status. | 13:49 |
| gthiemonge | there are some proposed patches that enable automatic failover of amphorae in ERROR: https://review.opendev.org/c/openstack/octavia/+/934638 but we never agreed on such a feature | 13:51 |
| servagem | That would be an excellent feature. These LBs support critical workloads, and the time to detect the issue, engage the ops team, and perform a manual failover isn't acceptable for us. How can we help move this feature forward? | 13:57 |
| gthiemonge | reviews or tests would be appreciated. I haven't reviewed the code yet. I think the main question was: could it do more harm than good? In case of a major outage, it may get stuck in a loop of VM recreation and put a lot of load on nova | 14:05 |
| servagem | It seems this patch doesn't use Tenacity for retries, unlike other parts of the Octavia code. | 14:06 |
| servagem | yes, we can help with testing. I think that concern could be addressed by using options like the ones in the [compute] retry settings (max_retries, retry_interval, retry_backoff, retry_max) | 14:10 |
| gthiemonge | servagem: I think it doesn't retry when the creation fails, but it allows the health-manager to trigger a new failover after the vm creation fails (which is currently blocked in Octavia) | 14:12 |
| servagem | yep, got it | 14:17 |
| | *** croeland1 is now known as croelandt | 14:18 |
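
The distinction discussed at 13:49 and 14:10 can be illustrated with a minimal Python sketch. It is not Octavia code: `wait_for_active`, `create_with_backoff`, and `ComputeBuildError` are hypothetical names, and the parameters only borrow the option names mentioned in the log. The first function polls the status of a VM that was requested once, which is how gthiemonge describes the current use of max_retries. The second re-issues the create request itself with a capped exponential backoff, roughly what servagem is asking about.

```python
class ComputeBuildError(Exception):
    """Stand-in for a failed VM build (hypothetical, not Octavia's class)."""


def wait_for_active(get_status, max_retries, interval, sleep):
    """Poll an already-requested VM until it is ACTIVE.

    The create request is never re-sent here; only the status check
    is repeated, at most max_retries times.
    """
    for _ in range(max_retries):
        status = get_status()
        if status == "ACTIVE":
            return True
        if status == "ERROR":
            raise ComputeBuildError("VM went to ERROR")
        sleep(interval)
    return False  # gave up waiting; the VM never became ACTIVE


def create_with_backoff(create, max_retries, retry_interval,
                        retry_backoff, retry_max, sleep):
    """Re-issue the create request on failure (assumed behaviour).

    The delay starts at retry_interval, is multiplied by retry_backoff
    after each failure, and is capped at retry_max. The number of
    attempts is bounded by max_retries.
    """
    delay = retry_interval
    for attempt in range(1, max_retries + 1):
        try:
            return create()
        except ComputeBuildError:
            if attempt == max_retries:
                raise  # budget exhausted; surface the last error
            sleep(delay)
            delay = min(delay * retry_backoff, retry_max)
```

Passing `sleep` in as an argument (for example `time.sleep`) keeps the sketch easy to test. Bounding both the attempt count and the delay is one possible answer to the concern raised at 14:05 about a loop of VM recreation loading nova during a major outage.
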