pyjou | Hi, I have a quick question. Has the octavia project migrated to Launchpad, or do we still need to work with StoryBoard? | 13:30 |
---|---|---|
tweining | hi. the channel topic says it :) | 13:30 |
tweining | launchpad | 13:30 |
pyjou | True. Sorry, I didn't see it. Thank you :) | 13:31 |
opendevreview | Pierre-Yves Jourel proposed openstack/octavia master: Add specs to resize a load balancer https://review.opendev.org/c/openstack/octavia/+/885490 | 13:39 |
opendevreview | Pierre-Yves Jourel proposed openstack/octavia master: Add specs to resize a load balancer https://review.opendev.org/c/openstack/octavia/+/885490 | 14:38 |
mnaser | gthiemonge: did you ever get to the bottom of https://storyboard.openstack.org/#!/story/2008226 ? | 15:26 |
* mnaser currently has a broken lb in my hands with this issue | 15:26 |
gthiemonge | mnaser: nope, it has happened randomly in the CI, I haven't seen it for a long time | 15:41 |
mnaser | gthiemonge: ah grr, I'm trying to understand why/how it's happening right now, it looks like the gunicorn process keeps respawning and then timing out, trying to strace my way out of this | 15:41 |
gthiemonge | I think that something might be really slow in the amphora, and the client times out; by the time the server replies, the client is gone | 15:42 |
mnaser | gthiemonge: yeah, except in this case it's respawning the worker non-stop as it times out | 15:42 |
mnaser | write(7, "[2023-06-07 15:41:45 +0000] [808] [CRITICAL] WORKER TIMEOUT (pid:19986)\n", 72) = 72 | 15:43 |
mnaser | and then it boots a new worker and goes back to being stuck, but I just got my first trace of it rebooting, so let's see | 15:43 |
gthiemonge | mnaser: what version do you have? what distro for the amphora? | 15:44 |
gthiemonge | mnaser: do you see timeouts in the Octavia worker logs? | 15:45 |
gthiemonge | maybe if we can isolate the API call that takes too much time, we could understand what's happening | 15:46 |
mnaser | gthiemonge: yeah, the octavia-worker was looping, trying to contact the amphora and timing out, which led me to start doing a simple 'curl' that was also timing out (unless the reason my call times out is that the amphora agent can only handle 1 request at a time) | 15:46 |
mnaser | let me see what version.. it is ancient in here, pushing to get it updated :) | 15:47 |
mnaser | a nice Ussuri... :( | 15:47 |
mnaser | also it looks like the image is 18.04 bionic | 15:47 |
mnaser | unfortunately strace is not helping much because of TLS, let me see | 15:48 |
johnsom | mnaser: We do limit gunicorn to 1 worker on purpose | 15:48 |
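For context: the amphora agent is served by gunicorn, and a single sync worker handles requests strictly one at a time, so one stuck request blocks everything else and eventually triggers the "[CRITICAL] WORKER TIMEOUT" respawn seen in the strace output above. A minimal sketch of such a setup (illustrative values, not Octavia's actual agent configuration):

```python
# gunicorn.conf.py -- single-worker setup similar in spirit to the amphora agent
workers = 1            # one sync worker: requests are serialized, one at a time
timeout = 30           # the arbiter logs "[CRITICAL] WORKER TIMEOUT" and respawns
                       # the worker if it doesn't heartbeat within this window
bind = "0.0.0.0:9443"  # the agent listens on port 9443
```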
mnaser | ok, so that means it's indeed something taking too long and timing out, so my `curl` test is useless | 15:48 |
johnsom | Back in that time frame there were some kernel issues with bringing up the interfaces. The ifup calls would take forever or fail occasionally. It was specific kernel versions in 18.04 | 15:50 |
mnaser | the thing is, this will be a perfectly happy amphora until it gets stuck in PENDING_UPDATE | 15:50 |
mnaser | so something starts this chain | 15:50 |
mnaser | only the worker talks to the amphora API, eh? | 15:51 |
johnsom | Right | 15:51 |
mnaser | 2023-06-07 15:51:15.022 20 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [req-241a87d1-af88-4a1a-84b2-728e64845018 - 4d5940f2bb1040ee9d6468a7d1e8bf64 - - -] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='10.8.15.55', port=9443): Read timed out. (read timeout=10.0) | 15:51 |
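The error above is the worker's REST driver giving up on the amphora agent after its 10-second read timeout. A rough stand-in for that probe (hedged: the URL path and options are illustrative; the real driver performs mutual TLS with client certificates):

```python
# Sketch: probe the amphora agent roughly the way the REST driver does.
import requests

resp = requests.get(
    "https://10.8.15.55:9443/",  # agent root, as queried by get_api_version
    timeout=(5.0, 10.0),         # (connect, read) -- read timeout=10.0 as in the log
    verify=False,                # illustrative only; the driver validates certs
)
print(resp.status_code, resp.text)
```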
gthiemonge | well, housekeeping and health-manager too (to the Amphora API) | 15:51 |
mnaser | it's indeed hammering away at it, except I think we have it set to *sigh* some 1500 retries, so | 15:51 |
mnaser | let me see what triggers these constant attempts | 15:52 |
johnsom | Yeah, worker timing out the connection due to no response in time from the amp agent. | 15:52 |
mnaser | nothing in the octavia-worker logs, but I think there was a batch update that it tried to do and couldn't talk to/reach the amphora | 15:53 |
mnaser | I wonder if I can use GMR to pull the stack | 15:53 |
johnsom | It probably failed over for some reason and then landed in this situation. We do recommend dropping the retry attempt numbers in production, otherwise people think it is "stuck" in a pending state, when really it is just endlessly banging its head retrying whatever is failing. | 15:54 |
johnsom | It is only one or two deployments with slow setups that led to those crazy retry numbers. | 15:54 |
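For reference, the retry behavior under discussion lives in the `[haproxy_amphora]` section of octavia.conf. A hedged example of dialing it down so a dead amphora fails fast instead of sitting in a PENDING_* state for hours (option names as found in recent releases; confirm against the configuration reference for your deployed version before applying):

```ini
[haproxy_amphora]
rest_request_conn_timeout = 10   # per-request connect timeout (seconds)
rest_request_read_timeout = 60   # per-request read timeout (seconds)
connection_max_retries = 120     # the deployment above had this at ~1500
connection_retry_interval = 5    # seconds between retries
```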
mnaser | right, that leaves two issues here -- faster failovers, but also figuring out why it is stuck here in the first place | 15:56 |
mnaser | I'm pulling GMR reports to try and see if I can get a stack trace of why/how it's stuck | 15:57 |
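GMR here is oslo.reports' Guru Meditation Report: services that wire it up dump their thread (and greenthread) stacks on demand, which is exactly what is needed to see where a stuck worker is spinning. A minimal sketch of how a service enables it (the `octavia.version` import path is an assumption for illustration; Octavia's console scripts do something equivalent):

```python
# Enable Guru Meditation Reports; once set up, sending SIGUSR2 to the
# process dumps stack traces to stderr/logs (the signal may vary by release).
from oslo_reports import guru_meditation_report as gmr

from octavia import version  # assumed module path, for illustration

gmr.TextGuruMeditation.setup_autorun(version)
```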
gthiemonge | there's another similar story: https://storyboard.openstack.org/#!/story/2010010 | 15:58 |
mnaser | there, i got the stack trace | 15:59 |
gthiemonge | guys, we have the weekly meeting in 1min, can we resume the discussion after it? | 15:59 |
mnaser | ah yes sorry | 15:59 |
mnaser | I'll toss this in first :) https://www.irccloud.com/pastebin/jG1sdOEr/ | 16:00 |
gthiemonge | #startmeeting Octavia | 16:00 |
opendevmeet | Meeting started Wed Jun 7 16:00:41 2023 UTC and is due to finish in 60 minutes. The chair is gthiemonge. Information about MeetBot at http://wiki.debian.org/MeetBot. | 16:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 16:00 |
opendevmeet | The meeting name has been set to 'octavia' | 16:00 |
gthiemonge | o/ | 16:00 |
johnsom | o/ | 16:00 |
pyjou | o/ | 16:00 |
jdacunha2 | o/ | 16:00 |
tweining_ | o/ | 16:00 |
QG | o/ | 16:01 |
gthiemonge | #topic Announcements | 16:01 |
gthiemonge | * OpenInfra Summit Vancouver 2023 | 16:01 |
gthiemonge | reminder: the openinfra summit is next week | 16:02 |
gthiemonge | johnsom will host the OpenStack Octavia Forum Session (Wed, June 14, 9:00am - 9:30am) | 16:02 |
johnsom | The Octavia Forum session is at 9am on Wednesday | 16:02 |
gthiemonge | (thank you johnsom BTW!) | 16:02 |
gthiemonge | johnsom: I think it's the same time as our weekly meeting, right? | 16:03 |
johnsom | Yes | 16:03 |
gthiemonge | ok | 16:03 |
gthiemonge | so I'm proposing that we cancel next week's meeting, are you fine with it? | 16:03 |
tweining_ | I noticed there is no Octavia PTG session in https://ptg.opendev.org/ptg.html yet | 16:04 |
gthiemonge | nope, we haven't scheduled a PTG session yet | 16:04 |
tweining_ | I think it's fine to cancel | 16:04 |
gthiemonge | ack | 16:05 |
gthiemonge | any other announcements? | 16:05 |
gthiemonge | nope, ok | 16:07 |
gthiemonge | #topic CI Status | 16:07 |
gthiemonge | I didn't find the time to investigate the failures in the FIPS jobs | 16:07 |
gthiemonge | https://zuul.openstack.org/builds?job_name=octavia-v2-dsvm-tls-barbican-fips&skip=0 | 16:07 |
gthiemonge | and I still need to open a launchpad issue for this error | 16:08 |
gthiemonge | I think that's it for CI Status | 16:09 |
gthiemonge | #topic Brief progress reports / bugs needing review | 16:09 |
johnsom | Looks like it can't ssh into the cirros image | 16:09 |
gthiemonge | yeah | 16:09 |
johnsom | So test setup fails | 16:09 |
pyjou | I have an old change to review | 16:10 |
gthiemonge | we need to add a testing patch that fetches the logs of the VMs | 16:10 |
pyjou | https://review.opendev.org/c/openstack/octavia-dashboard/+/866064 | 16:10 |
opendevreview | Julian DA CUNHA proposed openstack/octavia master: Add new spec Let's Encrypt support https://review.opendev.org/c/openstack/octavia/+/877281 | 16:11 |
gthiemonge | pyjou: nice feature, I'll review/test it | 16:12 |
pyjou | And I've added a spec to add load balancer resize as a new feature. | 16:12 |
pyjou | https://review.opendev.org/c/openstack/octavia/+/885490 | 16:12 |
gthiemonge | pyjou: nice too! | 16:12 |
tweining_ | yeah, that spec looks interesting | 16:14 |
johnsom | Cool, I will review that spec. At first glance it's aligned to our vision for that. | 16:14 |
pyjou | Great :) | 16:15 |
johnsom | You have to fail over, as nova resizing causes a reboot, which drops the encrypted RAM disk content | 16:15 |
gthiemonge | yeah this is what we discussed during the last PTG | 16:16 |
gthiemonge | I'm working on the Multi-Active BGP support | 16:18 |
pyjou | Yeah, that's exactly what I'm proposing. It's not a resize in the nova sense, it's just a failover with a new Octavia flavor passed as a parameter to the failover flow. | 16:18 |
gthiemonge | I need to propose a new spec, I hope I will upload it before the end of the week (maybe I'm being too optimistic) | 16:18 |
gthiemonge | #topic Open Discussion | 16:23 |
gthiemonge | anything else folks? | 16:24 |
jdacunha2 | could you please review the ACMEv2 RFE again? | 16:25 |
jdacunha2 | https://review.opendev.org/c/openstack/octavia/+/877281 | 16:26 |
gthiemonge | ack | 16:26 |
gthiemonge | ok i guess that's all for today! | 16:27 |
gthiemonge | thank you folks! | 16:27 |
gthiemonge | #endmeeting | 16:27 |
opendevmeet | Meeting ended Wed Jun 7 16:27:53 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:27 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/octavia/2023/octavia.2023-06-07-16.00.html | 16:27 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/octavia/2023/octavia.2023-06-07-16.00.txt | 16:27 |
opendevmeet | Log: https://meetings.opendev.org/meetings/octavia/2023/octavia.2023-06-07-16.00.log.html | 16:27 |
gthiemonge | back to you mnaser | 16:27 |
jdacunha2 | quit | 16:27 |
mnaser | gthiemonge: I think one thing I've identified is that the amphora was overloaded at some point or something; I see some traces in the kernel, and the RTC time was like 40 seconds behind the kernel time | 16:28 |
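On Linux, the kind of RTC drift mnaser describes can be checked by comparing the hardware clock against system time. A quick sketch (assumes a readable `/sys/class/rtc/rtc0` and an RTC kept in UTC):

```python
# Compare the hardware RTC against the kernel's system clock.
import time

with open("/sys/class/rtc/rtc0/since_epoch") as f:
    rtc_seconds = int(f.read())

drift = time.time() - rtc_seconds
print(f"system clock is {drift:+.0f}s ahead of the RTC")
```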
gthiemonge | mnaser: this is really weird, the API call initiated by get_api_version is the simplest API call we have | 16:28 |
gthiemonge | oh | 16:28 |
mnaser | so that tells me there was a bunch of hangs that got us into a super weird state | 16:28 |
mnaser | so I think for now we can take Octavia out of the equation till I figure this out | 16:29 |
gthiemonge | mnaser: ack, ping us if there's anything else we can do | 16:29 |
opendevreview | Merged openstack/octavia stable/2023.1: Send IP advertisements when plugging a new member subnet https://review.opendev.org/c/openstack/octavia/+/880723 | 17:16 |
opendevreview | Merged openstack/octavia stable/zed: Send IP advertisements when plugging a new member subnet https://review.opendev.org/c/openstack/octavia/+/880724 | 17:58 |
opendevreview | Merged openstack/octavia stable/wallaby: Send IP advertisements when plugging a new member subnet https://review.opendev.org/c/openstack/octavia/+/880727 | 17:58 |
opendevreview | Merged openstack/octavia stable/yoga: Send IP advertisements when plugging a new member subnet https://review.opendev.org/c/openstack/octavia/+/880725 | 17:58 |
opendevreview | Gregory Thiemonge proposed openstack/octavia-tempest-plugin master: DNM/WIP Testing server output for FIPS https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/877667 | 19:13 |
opendevreview | Gregory Thiemonge proposed openstack/octavia master: DNM/WIP Testing FIPS job https://review.opendev.org/c/openstack/octavia/+/885540 | 19:16 |
gthiemonge | it's not only FIPS that is failing, c9s job fails too: https://zuul.openstack.org/builds?job_name=octavia-v2-dsvm-scenario-centos-9-stream&branch=master&skip=0 | 19:48 |
gthiemonge | https://paste.opendev.org/show/b0WFpGOQQ7854jHEqGic/ | 20:19 |