zigo | Hi there! | 07:20 |
---|---|---|
zigo | Under trixie (aka: Debian 13) + epoxy, I get lots of stack dumps like this one in nova-conductor.log: | 07:20 |
zigo | https://paste.opendev.org/show/bECP9P2mCs2kelxQhbgK/ | 07:20 |
zigo | Has anyone seen something like this ? | 07:20 |
zigo | Looks like I'm also getting this when spawning a VM: | 07:22 |
zigo | https://paste.opendev.org/show/bOocYhL8mYOV0jf3m7ip/ | 07:22 |
zigo | Not sure if the 2 are related... | 07:22 |
sean-k-mooney | zigo: are you using galera or something like that? nova does not support active/active galera, nor does most of openstack for that matter | 07:28 |
zigo | Yes, but everything goes through *one* single node for writing. | 07:29 |
zigo | So it's more like active/passive... | 07:29 |
sean-k-mooney | ok that does work | 07:29 |
sean-k-mooney | it's what we use downstream too | 07:29 |
sean-k-mooney | um, the only other thing that comes to mind is maybe you have two compute services (conductors in this case) with the same CONF.host | 07:30 |
sean-k-mooney | i.e. you might have two conductor binaries trying to update the same service record | 07:30 |
zigo | In my setup, conductor nodes are not compute nodes. | 07:31 |
zigo | My host = directive isn't set. | 07:32 |
zigo | So probably it's trying to use the FQDN ? | 07:32 |
sean-k-mooney | it will use socket.gethostname() | 07:32 |
sean-k-mooney | by default, because by default openstack requires unique hostnames, not just unique FQDNs | 07:32 |
zigo | Right. | 07:33 |
sean-k-mooney | if you have two instances of the conductor running on the same host by mistake | 07:33 |
zigo | I don't. | 07:33 |
sean-k-mooney | then it could cause the conductor issue but i have not seen it other than that | 07:33 |
zigo | I'm not running in containers, so that would be impossible. | 07:33 |
sean-k-mooney | your compute issue looks different | 07:33 |
sean-k-mooney | without digging in too deeply it looks like perhaps you have an incompatible version of libs | 07:34 |
zigo | To me, it seems like I could ignore the nova-conductor reporting issue. It's ugly, but the service is still reported as alive, so that's ok. | 07:34 |
zigo | Of what lib?!? | 07:34 |
sean-k-mooney | perhaps an old neutron client? | 07:34 |
zigo | Unlikely, it does work under bookworm + epoxy, and that's the same version of the libs. | 07:35 |
zigo | ie: neutronclient 11.4.0-2 | 07:35 |
sean-k-mooney | well the error is coming from keystoneauth within neutronclient | 07:36 |
zigo | Right, though it's saying "no attribute 'endpoint_override'" which is weird. | 07:36 |
zigo | FYI, I did set endpoint_override in nova.conf for the URL of Neutron. | 07:36 |
sean-k-mooney | what is your keystoneauth version | 07:37 |
zigo | 5.10.0 | 07:37 |
zigo | So both keystoneauth1 and neutronclient are the released versions for Epoxy. | 07:38 |
sean-k-mooney | ya that should be new enough | 07:39 |
sean-k-mooney | so in neutronclient it's failing here https://github.com/openstack/python-neutronclient/blob/11.4.0/neutronclient/client.py#L348 and that is just delegating to keystoneauth so i guess i'll check there next | 07:39 |
sean-k-mooney | so it's calling https://github.com/openstack/keystoneauth/blob/5.10.0/keystoneauth1/adapter.py#L313-L334 | 07:40 |
zigo | I'll have a try when not setting endpoint_override in my config file. | 07:42 |
zigo | What's weird is that I didn't get this under Bookworm. | 07:42 |
zigo | So, this smells like a non-openstack-maintained lib is at play here. | 07:43 |
sean-k-mooney | that should not result in an attribute error | 07:44 |
sean-k-mooney | that's what is really odd | 07:44 |
sean-k-mooney | nova is still properly registering the relevant config options; if that was broken https://docs.openstack.org/nova/latest/configuration/config.html#neutron.endpoint_override would not render | 07:46 |
sean-k-mooney | so it should be valid to set that directly | 07:46 |
zigo | Removing endpoint_override from nova.conf doesn't fix anything. :/ | 07:47 |
zigo | I still get the same stack trace. | 07:47 |
sean-k-mooney | zigo: ya the stack trace looks like something is wrong in our packaging somehow | 07:52 |
sean-k-mooney | have you checked the content of /usr/lib/python3/dist-packages/keystoneauth1/adapter.py | 07:52 |
sean-k-mooney | and confirmed that the base adapter in the file on disk has endpoint_override | 07:53 |
sean-k-mooney | and tried to import that in a python terminal and access it? | 07:53 |
sean-k-mooney | or even one step back | 07:53 |
sean-k-mooney | if you use the neutron client cli does it have the same issue | 07:53 |
sean-k-mooney | i would expect a neutron port list to be broken in the same way on that host | 07:54 |
zigo | I'm not sure what / how you're asking me to check. | 07:57 |
zigo | :( | 07:59 |
sean-k-mooney | grep endpoint_override /usr/lib/python3/dist-packages/keystoneauth1/adapter.py | 08:01 |
sean-k-mooney | and also check if `neutron port-list` works | 08:01 |
zigo | openstack port list works, indeed. | 08:03 |
sean-k-mooney | the neutron cli supports clouds.yaml so you can just use it like osc | 08:03 |
sean-k-mooney | no not openstack port list | 08:03 |
sean-k-mooney | explicitly `neutron port-list` | 08:03 |
zigo | I don't think neutronclient is providing a /usr/bin/neutron anymore. | 08:03 |
sean-k-mooney | openstack port list uses the openstack sdk not the neutron client i think | 08:03 |
sean-k-mooney | they may have dropped it in epoxy | 08:03 |
zigo | https://paste.opendev.org/show/bWJcYW3nYUAkjMn71RN6/ | 08:05 |
sean-k-mooney | ya they did. your code on disk all looks correct to the 5.10.0 tag | 08:07 |
jkulik | https://bugs.launchpad.net/ubuntu/+source/nova/+bug/2103413 could that be related? looks like having the same stacktrace | 08:07 |
sean-k-mooney | oh perhaps | 08:08 |
sean-k-mooney | if eventlet is nuking part of the object by garbage collecting too early | 08:08 |
zigo | Oh, thanks ! :) | 08:08 |
sean-k-mooney | that does look like the same or a very similar trace | 08:08 |
zigo | I'm not surprised if eventlet is playing with me, as I'm on Python 3.13. | 08:09 |
sean-k-mooney | zigo: that is unfortunately not very good news for you since it means it's because eventlet does not support 3.13 yet | 08:09 |
sean-k-mooney | which is why openstack does not support 3.13 yet | 08:09 |
zigo | sean-k-mooney: I have no choice, 3.13 it is ... | 08:09 |
zigo | I was expecting things would go wrong. | 08:09 |
sean-k-mooney | so i believe there is an eventlet issue for this | 08:10 |
sean-k-mooney | we also know the thread id is broken | 08:10 |
zigo | Maybe I should try the latest eventlet release. | 08:10 |
sean-k-mooney | https://github.com/eventlet/eventlet/issues/1032 | 08:10 |
zigo | Yeah, which is probably what's breaking nova-conductor too. | 08:10 |
sean-k-mooney | zigo: it's not fixed yet | 08:10 |
gboutry | That was exactly the error I got zigo, this is python 3.13 and eventlet not playing nice together | 08:10 |
zigo | SHIT ! :( | 08:11 |
gboutry | the error would manifest with attributes in the object that was initialized correctly | 08:11 |
zigo | Die eventlet die ... | 08:11 |
sean-k-mooney | i have commented on both the 3.13 bugs to say we need to support 3.13 in eventlet for master this cycle per the project runtimes and that it's required to complete the eventlet removal | 08:16 |
sean-k-mooney | the real question is will anyone have time to actually work on that | 08:16 |
sean-k-mooney | this should be highlighted in the eventlet removal channel too i guess | 08:16 |
zigo | Well, for me, this means there won't be a working OpenStack for the Trixie release. That's really bad ... :( | 08:17 |
sean-k-mooney | canonical are in the same situation for ubuntu 25.04 | 08:17 |
zigo | Except it's not an LTS, so they don't really care. | 08:17 |
sean-k-mooney | it's the first time in openstack's history that i'm aware of that the latest point release of ubuntu has not supported the latest openstack release | 08:18 |
zigo | They mostly provide OpenStack on top of LTS. | 08:18 |
sean-k-mooney | they ship it in both | 08:18 |
zigo | Right. | 08:18 |
sean-k-mooney | even if most of their customers are not on the point releases | 08:18 |
sean-k-mooney | it's where they get their early qa | 08:18 |
sean-k-mooney | hopefully herve and co will have time to look at the reported bugs | 08:20 |
zigo | Epoxy is not a skippable release, so it *must* be fixed. | 08:20 |
zigo | Saying, it's ok, Flamingo will have the fix, is not an option. | 08:20 |
sean-k-mooney | it's not an openstack bug currently | 08:22 |
sean-k-mooney | it's an eventlet one so there isn't anything nova can do to enable this | 08:22 |
sean-k-mooney | as an aside, apparently there is a proposal to add "virtual threads" a la greenthreads to core python https://discuss.python.org/t/add-virtual-threads-to-python/91403 | 08:22 |
zigo | I'll still have a try with eventlet 0.40 and see how it goes. | 08:44 |
zigo | Same stuff with eventlet 0.40 ... :( | 08:48 |
sean-k-mooney | there is this work in progress hack https://github.com/eventlet/eventlet/pull/1031/files | 08:49 |
sean-k-mooney | but that's not complete | 08:49 |
sean-k-mooney | and it's technically for the thread issue rather than the one you're currently hitting with the GC | 08:50 |
sean-k-mooney | that may get you slightly closer however | 08:50 |
zigo | Will try the patch ! :) | 08:50 |
zigo | I was in fact looking into it. | 08:50 |
sean-k-mooney | it's just a guess but https://github.com/eventlet/eventlet/pull/1042 might be related to the gc issue | 08:52 |
sean-k-mooney | although it does not directly reference the existing issue so i'm just going off the cover letter | 08:53 |
zigo | Looks like I had a version of that patch already in my package. | 08:54 |
zigo | Not sure if it was the latest one, so trying again. | 08:54 |
sean-k-mooney | i just looked at the top few open prs so that's just a guess on my part | 08:55 |
zigo | Yeah... | 08:56 |
zigo | Still broken ... :( | 08:57 |
sean-k-mooney | i would reach out to herve and see if they have any ideas on a path forward | 08:58 |
zigo | sean-k-mooney: What's the progress on eventlet removal in nova-compute? | 08:58 |
zigo | Are there patches available already? | 08:58 |
sean-k-mooney | zigo: we dont expect to complete that until 2026.2 | 08:58 |
opendevreview | Stephen Finucane proposed openstack/nova master: tests: Replace keystoneclient with keystoneauth1 https://review.opendev.org/c/openstack/nova/+/951744 | 08:58 |
sean-k-mooney | we might get the initial version in 2026.1 | 08:58 |
sean-k-mooney | but we are working on the scheduler first this cycle | 08:58 |
sean-k-mooney | then maybe api and/or one of the other controller services | 08:59 |
sean-k-mooney | nova-compute is the hardest to move and will be the last service we move | 08:59 |
sean-k-mooney | our hope is that in 2026.1 we might be able to run all the nova agents in threaded mode but we don't know if we will get that far | 09:00 |
sean-k-mooney | making it the default and/or removing eventlet support is a 2026.2+ activity once we have at least 1 slurp release that supports threaded mode | 09:00 |
sean-k-mooney | that's why we are aiming to have 2026.1 be the first release that can run without eventlet (maybe) | 09:01 |
sean-k-mooney | zigo: gibi has been making some promising progress but it's a lot of work | 09:02 |
zigo | At the end of https://bugs.launchpad.net/ubuntu/+source/nova/+bug/2103413/comments/1 Guillaume Boutry says: | 09:03 |
zigo | using `gc.disable()` makes the issue disappear (yay! disable gc!) or actually holding a hardref to `admin_client.baseclient.httpclient` makes the method pass most of the time... | 09:03 |
zigo | Not sure where/how he's doing the garbage collector disabling. | 09:04 |
zigo | Is this a gc in eventlet ? | 09:04 |
sean-k-mooney | no that's the main python one | 09:05 |
sean-k-mooney | we would have to hack in explicit calls to the garbage collector somewhere and have hard references to stop it being nuked | 09:06 |
zigo | Maybe, disabling it would be in eventlet/patcher.py ? | 09:06 |
gboutry | gc.disable() to disable the GC, but that's really NOT a good idea | 09:07 |
zigo | That would mean memory leak right? | 09:07 |
sean-k-mooney | zigo: that will prevent objects created by nova from being deallocated by python automatically when we exit scope | 09:07 |
gboutry | yes, nothing would be freed anymore. | 09:08 |
sean-k-mooney | zigo: yes it would mean nova would never deallocate any python object | 09:08 |
zigo | At this point, I need to validate the release, so anything is better than completely broken. | 09:08 |
zigo | I'll try and see what happens. :) | 09:08 |
sean-k-mooney | well that will completely break nova | 09:08 |
sean-k-mooney | it will cause OOM kill events | 09:09 |
opendevreview | Ivan Anfimov proposed openstack/nova master: docs: update for services to https https://review.opendev.org/c/openstack/nova/+/938680 | 09:11 |
sean-k-mooney | zigo: looking at how they were tracing it https://pastebin.ubuntu.com/p/cj7tb3kmGV/ | 09:12 |
opendevreview | Ivan Anfimov proposed openstack/nova master: docs: update for services to https https://review.opendev.org/c/openstack/nova/+/938680 | 09:13 |
sean-k-mooney | i wonder if on line 37 we did something like base_client_ref = admin_client.base_client.httpclient | 09:13 |
sean-k-mooney | zigo: would that keep it from being gc'd | 09:13 |
sean-k-mooney | so that is https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L1217 | 09:16 |
sean-k-mooney | we could maybe modify our get_client function to return 3 refs: one to the actual client, one to the base client and one to the http client in the base. | 09:17 |
opendevreview | Ivan Anfimov proposed openstack/nova master: docs: update for services to https https://review.opendev.org/c/openstack/nova/+/938680 | 09:17 |
sean-k-mooney | that should keep them in scope until that function body exits | 09:17 |
sean-k-mooney | although we already have a reference to the base client in our wrapper | 09:19 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L184-L185 | 09:19 |
gboutry | But that wouldn't prevent the code from breaking somewhere else? | 09:20 |
sean-k-mooney | gboutry: it would at best mask the issue for neutronclient | 09:20 |
sean-k-mooney | it's not an actual fix. | 09:20 |
sean-k-mooney | i'm just trying to think: is there a way nova can keep the relevant object alive with a hardref of some kind | 09:21 |
sean-k-mooney | i'm not seeing anything obvious | 09:21 |
opendevreview | Ivan Anfimov proposed openstack/nova master: docs: update installation documentation https://review.opendev.org/c/openstack/nova/+/938680 | 09:21 |
sean-k-mooney | gboutry: i need to go look at something else but i'll quickly add a direct reference to the http client to our client wrapper and see if that changes anything | 09:24 |
opendevreview | Ivan Anfimov proposed openstack/nova master: docs: update installation documentation https://review.opendev.org/c/openstack/nova/+/938680 | 09:26 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 09:38 |
sean-k-mooney | zigo: gboutry no idea if ^ will work but maybe it will provide more data :shrug: | 09:39 |
sean-k-mooney | well that's an excellent start... | 09:40 |
*** mikal4 is now known as mikal | 09:42 | |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 09:42 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 09:44 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 09:44 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 09:45 |
sean-k-mooney | it helps to actually save all your changes before committing... | 09:45 |
sean-k-mooney | zigo: gboutry: that adds a direct reference to the http client but it also moves nova-next to try and use 3.13, assuming it's a thing on ubuntu 24.04, and it also adds a 3.13 functional job to see just how broken nova really is | 09:47 |
sean-k-mooney | i expect the answer to be very but ci should tell us soon assuming my hacks actually work. | 09:48 |
zigo | sean-k-mooney: I just tried, added gc.disable() at the end of eventlet/patcher.py's _green_existing_locks(), and I could spawn a VM ! | 10:38 |
sean-k-mooney | zigo: sure but you just added an out of memory issue. that's not a solution you can include in your packaging of nova | 10:51 |
zigo | I know, I just wanted to try what gboutry wrote. | 10:51 |
zigo | I can even see the memory leak in real time. | 10:51 |
sean-k-mooney | ack | 10:51 |
zigo | nova-compute used to take 0.9% of my VM's RAM, now it's up to 1.0. | 10:51 |
zigo | I guess it's going to never stop leaking ... | 10:52 |
sean-k-mooney | if you have debug logging enabled it will leak faster | 10:52 |
zigo | I do ! :) | 10:52 |
sean-k-mooney | it looks like python3.13 is not available in ubuntu 24.04; is it available in debian 12? | 10:55 |
sean-k-mooney | it's probably in universe in noble | 10:55 |
gboutry | python 3.13 is only available through the deadsnakes ppa on noble | 10:56 |
sean-k-mooney | ack, i'm trying to see if there is an easy way to hack 3.13 into one of our devstack jobs | 10:58 |
sean-k-mooney | i could enable that ppa in a pre playbook | 10:58 |
zigo | sean-k-mooney: No, only in Debian 13, though Trixie is in hard freeze, so it's a good moment to start using it. | 10:58 |
sean-k-mooney | zigo: well locally i'm using sid :) but ya it's more work than i have time for today to get trixie in ci | 10:59 |
sean-k-mooney | zigo: have you spoken to the infra team about supporting it when it is released yet? | 11:01 |
stephenfin | Uggla: fyi, looks like there's a bug here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L13370 The attribute on the class is called accessmode, not access_mode | 11:01 |
zigo | Nop, no time for that yet. | 11:01 |
zigo | sean-k-mooney: It doesn't work. | 11:08 |
zigo | I mean, your patch with self._http_client = base_client.httpclient in network/neutron.py | 11:09 |
zigo | Just tried ... | 11:09 |
sean-k-mooney | ack | 11:10 |
sean-k-mooney | it was a long shot | 11:10 |
zigo | Thanks for trying ! :) | 11:10 |
sean-k-mooney | i'm going to try one other thing quickly to provide 3.13 via pyenv the same way we do for functional tests | 11:11 |
zigo | I'm guessing there's going to be this kind of issue a bit everywhere anyways, so we need a better global eventlet fix. | 11:11 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 11:15 |
sean-k-mooney | that should compile 3.13 from source on all the nodes although it might fail if we don't have gcc etc. available, but if it does it will fail fast | 11:17 |
sean-k-mooney | actually i should add bindep first https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/tox-docs/pre.yaml#L3 | 11:18 |
opendevreview | sean mooney proposed openstack/nova master: [DNM] testing py3.13 eventlet bug workaround https://review.opendev.org/c/openstack/nova/+/951749 | 11:20 |
gibi | zigo: sean-k-mooney: I'm torn on prioritizing chasing down python3.13 eventlet bugs over focusing on removing eventlet from nova. Eventlet is on life support as we speak, so investing there is problematic. I know that there is a mismatch between what python version OpenStack supports and what the distros want to use. Still I firmly believe we are better off removing eventlet than patching. | 11:54 |
sean-k-mooney | fun, installing python 3.13 from source worked but https://paste.opendev.org/show/bcQL03mSHktFaQf6diQX/ mysql failed to install properly | 11:55 |
sean-k-mooney | gibi: well in epoxy 3.13 is experimental but it's in the mandatory testing runtime for 2025.2 | 11:55 |
sean-k-mooney | so there kind of need to be two efforts here | 11:56 |
sean-k-mooney | we either need to change the testing runtimes or make it work on 3.13 this cycle regardless of the eventlet removal | 11:56 |
sean-k-mooney | gibi: i think you should focus on the eventlet removal | 11:56 |
sean-k-mooney | but we need to work with the eventlet maintainers to make sure they have time to fix 3.13 | 11:57 |
sean-k-mooney | i think that's where zigo and others could help with that effort | 11:57 |
sean-k-mooney | if we manage to get master working with 3.13 we can assess if we need to backport stuff to epoxy | 11:57 |
sean-k-mooney | we do get clean unit and functional tests | 11:57 |
sean-k-mooney | so i don't think things are massively broken in nova | 11:58 |
sean-k-mooney | i.e. if the eventlet bugs were fixed openstack might "just work" without code changes on epoxy | 11:58 |
gibi | we are the eventlet maintainers | 12:00 |
sean-k-mooney | kind of | 12:01 |
gibi | you can talk to hberaud about it | 12:01 |
sean-k-mooney | to me 3.13 support is not really optional | 12:02 |
gibi | if zigo could help maintaining eventlet in py313 that is a win for sure | 12:02 |
sean-k-mooney | but ya it's partly a priority problem | 12:02 |
sean-k-mooney | gibi: there is currently a python discussion happening about adding virtual threads to core python started by the gevent folks | 12:03 |
gibi | I think the reality is that we have limited eventlet internal knowledge to maintain it even if we find the time | 12:03 |
zigo | Hervé is in PTO this week, so I can't talk to him until Monday. | 12:04 |
gibi | zigo: yepp I know | 12:04 |
*** ralonsoh_ is now known as ralonsoh | 12:36 | |
Uggla | stephenfin, thanks for finding this bug. I'll fix it asap. | 12:39 |
Uggla | gibi, fyi I have reviewed the first patch of your eventlet series and left comments/questions. | 14:23 |
gibi | Uggla: thanks a lot. I will reply you probably tomorrow | 14:27 |
Uggla | gibi, sure no hurry. As I mentioned it took me a while to review this first patch, I just hope I have not asked too many silly questions? | 14:29 |
gibi | all questions are useful :) | 14:31 |
gibi | so no worries | 14:31 |
opendevreview | Dan Smith proposed openstack/nova master: Remove contrib/clean-on-delete.py https://review.opendev.org/c/openstack/nova/+/950592 | 17:47 |
*** mikal8 is now known as mikal | 21:10 |
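
The on-disk check requested between 07:52 and 07:54 (confirm that the installed keystoneauth1 adapter accepts `endpoint_override`, then import it in a Python terminal) could look roughly like the sketch below. It is not taken from the log: `describe_init_param` is a hypothetical helper name, it only inspects the constructor signature, and it assumes the class is written in pure Python so that `inspect.signature` can read it. It returns `None` when the module is not installed for the running interpreter.

```python
import importlib
import inspect


def describe_init_param(module_name, class_name, param_name):
    """Report which file a class is loaded from and whether its
    __init__ accepts the given keyword (hypothetical helper)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # module is not installed for this interpreter
    cls = getattr(module, class_name)
    params = inspect.signature(cls.__init__).parameters
    return {
        # Path of the file actually imported, to compare with the packaged one.
        "file": inspect.getsourcefile(module),
        "accepts": param_name in params,
    }


if __name__ == "__main__":
    # Module, class and argument names are the ones mentioned in the discussion.
    print(describe_init_param("keystoneauth1.adapter", "Adapter", "endpoint_override"))
```

Run with the same interpreter that nova uses; a `file` value outside `/usr/lib/python3/dist-packages/` would suggest a second copy of the library shadowing the packaged one.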
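
For the `gc.disable()` experiment discussed from 09:03 onward, the sketch below illustrates what disabling the collector changes, assuming CPython. Objects that are not part of a reference cycle are still freed by reference counting, while cyclic garbage is no longer reclaimed and accumulates, which would be consistent with the slow memory growth reported at 10:51. `Node` and `survives_without_gc` are hypothetical names used only for illustration, and the sketch does not reproduce the eventlet / Python 3.13 bug itself.

```python
import gc
import weakref


class Node:
    """Tiny object that can point at itself to form a reference cycle."""

    def __init__(self):
        self.me = None


def survives_without_gc(make_cycle):
    """Return True if a Node is still alive after its last name is dropped
    while the cyclic garbage collector is disabled."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        node = Node()
        if make_cycle:
            node.me = node  # cycle: the reference count can never reach zero
        probe = weakref.ref(node)  # weak reference does not keep node alive
        del node
        return probe() is not None
    finally:
        # Restore the previous collector state and clean up any cycle we made.
        if was_enabled:
            gc.enable()
        gc.collect()


if __name__ == "__main__":
    print("acyclic object survives:", survives_without_gc(False))
    print("cyclic object survives:", survives_without_gc(True))
```

Under these assumptions a long-running service with the collector disabled would leak only as fast as it creates reference cycles, so the growth may be slow but never stops.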