Thursday, 2025-12-11

*** haleyb is now known as haleyb|out00:43
*** bauzas1 is now known as bauzas00:52
*** mhen_ is now known as mhen02:09
*** Jeff is now known as Guest3339806:41
opendevreviewyangjianfeng proposed openstack/nova master: Fix missing allocations due to unsuccessful evacuations  https://review.opendev.org/c/openstack/nova/+/97033506:42
opendevreviewhuanhongda proposed openstack/nova master: live migration: Complete cleanup if post_live_migration fails  https://review.opendev.org/c/openstack/nova/+/96765207:21
*** ykarel__ is now known as ykarel08:01
opendevreviewyangjianfeng proposed openstack/nova master: [WIP] Fix: resources leak due to failed live-migration  https://review.opendev.org/c/openstack/nova/+/97059710:32
noonedeadpunkhey folks! I've started looking into multi-cell docs (https://docs.openstack.org/nova/latest/admin/cells.html) and got kind of confused by the super conductor part. Or specifically - how to tell it that it's a superconductor?13:11
noonedeadpunkAs, I can imagine, conductor is able to fetch cells connection details from the api database13:11
noonedeadpunkbut then how to tell a cell conductor not to do that?13:11
noonedeadpunkand that it should operate only with own database, and never reach api database? Just don't define api_database section?13:12
noonedeadpunkAnd also - does it make sense to do superconductor setup at all?13:13
noonedeadpunkAs in RHOSP 16 docs I've seen that still it has limit of 500 hosts across all cells...13:13
noonedeadpunk(which I'd guess is due to network limit, but not sure)13:13
noonedeadpunkso my guess is that super conductor just has an [api_database] section, while regular conductors only [database]13:48
opendevreviewJulien LE JEUNE proposed openstack/nova stable/2024.2: Fix bug 2000069  https://review.opendev.org/c/openstack/nova/+/97063414:59
dansmithnoonedeadpunk: if you have multiple cells, you need a superconductor, or at least delegate one cell's conductors to be the superconductor function15:26
sean-k-mooneynoonedeadpunk: don't read the RHOSP docs for anything related to cells, if you do it will cause you pain and confusion15:28
sean-k-mooneynoonedeadpunk: tripleo did not implement cells well15:28
sean-k-mooneythe 500 limit in 16 was just what it was tested to15:28
sean-k-mooneythere is no actual limit at 500 computes in nova15:29
sean-k-mooneytripleo however had scaling issues with 500+ servers15:29
dansmithalso that ^ :)15:29
sean-k-mooneythat was related to how it used heat and the heat -> ansible -> puppet workflow tripleo used15:30
noonedeadpunkdansmith: but how you actually do that?15:32
noonedeadpunkas I have not found option or any config reference15:32
dansmithnoonedeadpunk: do what?15:32
noonedeadpunkdo say that this conductor is a superconductor15:33
dansmithnoonedeadpunk: having an api_database is critical, but the superconductor is the conductor that your API nodes talk to over MQ,15:33
dansmithso the [transport] of the API,scheduler, etc defines which conductors operate as the superconductor15:33
sean-k-mooneyeach conductor uses a separate rabbit cluster or at least a separate rabbit vhost. the super conductor is pointing at the cell0 rabbit cluster and has api access15:33
noonedeadpunkI guess I'm kind of missing property to do so15:34
noonedeadpunkI'd be assuming that superconductor gets connection details from the api_database?15:34
sean-k-mooneythe api db access enables the super conductor to connect to the rabbit message buses and dbs of the other cells if needed to coordinate cross cell migration etc.15:34
sean-k-mooneyyep the cell mapping table15:34
noonedeadpunkbut like - what if you add api_database connection to all cell conductors?15:35
noonedeadpunkbut then they will also be configured to use own cell db/mq15:35
sean-k-mooneythat wont break anything15:35
sean-k-mooneythe important thing is that each cell has its own rabbit or rabbit vhost15:35
noonedeadpunkso kinda this way all cell conductors are "superconductor"?15:35
sean-k-mooneyso the conductors and computes for a given cell use a specific rabbit15:36
dansmithI guess sean-k-mooney wants to answer this, so I guess I'll back off15:36
sean-k-mooneyno go ahead if you like15:36
noonedeadpunkeventually osa also lacks proper multicell, despite some our users have it15:37
sean-k-mooneyi just wanted to give a high-level overview. i thought i had a good link but i am not finding it in my history15:37
noonedeadpunkso I wanted to implement it, but I fail to get to the bottom of how everything is connected, and I did not find any super relevant options in nova.conf reference15:37
noonedeadpunkhttps://docs.openstack.org/nova/latest/admin/cells.html ?15:38
sean-k-mooneynoonedeadpunk: if you are not aware dansmith is nova cell expert. they did a good summit talk on how it works a few years ago15:38
sean-k-mooneyhttps://www.youtube.com/watch?v=o7jkqzORij8 is the link i was looking for15:39
noonedeadpunkheh, "few":)15:40
noonedeadpunkok, so eventually cells are super good for EDGE I assume15:41
sean-k-mooneyam it depends.15:42
sean-k-mooneyfor far edge you don't want to have 1 cell per store15:42
sean-k-mooneybut for near edge it can be a good topology15:43
noonedeadpunkok, so back to the confusion I have... what is the difference in config between conductor and super conductor? Just presence of api_database?15:44
sean-k-mooneythe rabbit transport url and the api_database section15:45
noonedeadpunkok, but where do I define cell transport then? or cell conductor has cell rabbitmq in transport_url and super conductor has cell0 transport_url?15:48
sean-k-mooneyyes that ^15:48
noonedeadpunkand then super conductor finds out about cells from api_database and connects there as well?15:48
sean-k-mooneywhen it needs to yes15:49
noonedeadpunkand then superconductor does not have [database] section at all I assume?15:49
sean-k-mooneyno it has the cell 0 db 15:50
noonedeadpunkbut isn't that api_database?15:50
sean-k-mooneyno15:50
noonedeadpunkok, I need to check on that then...15:51
sean-k-mooneya default "single cell" cellv2 deployment has 3 databases15:51
sean-k-mooneynova_api nova_cell0 and nova_cell115:51
sean-k-mooneythe cell dbs have the same schema15:51
sean-k-mooneythe api db is different15:51
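
The config difference described above can be sketched roughly as follows. This is only an illustration: the hostnames, credentials and the `conductor_conf` helper are hypothetical, and only the option names (`transport_url` in `[DEFAULT]`, `connection` in `[database]` and `[api_database]`) are real nova.conf options. It models the superconductor as pointing at the cell0 rabbit and the cell0 database with API database access, and a cell conductor as pointing only at its own cell's rabbit and database.

```python
import configparser


def conductor_conf(transport_url, cell_db, api_db=None):
    """Build a nova.conf-like config for one nova-conductor (hypothetical helper)."""
    conf = configparser.ConfigParser(interpolation=None)
    conf["DEFAULT"] = {"transport_url": transport_url}
    conf["database"] = {"connection": cell_db}
    # Only the superconductor is given access to the API database here.
    if api_db is not None:
        conf["api_database"] = {"connection": api_db}
    return conf


# All URLs below are made up for illustration.
superconductor = conductor_conf(
    "rabbit://nova:secret@mq-cell0.example:5672/",
    "mysql+pymysql://nova:secret@db-cell0.example/nova_cell0",
    api_db="mysql+pymysql://nova:secret@db-api.example/nova_api",
)
cell1_conductor = conductor_conf(
    "rabbit://nova:secret@mq-cell1.example:5672/",
    "mysql+pymysql://nova:secret@db-cell1.example/nova_cell1",
)

for name, conf in (("superconductor", superconductor), ("cell1", cell1_conductor)):
    print(name, conf.sections())
```

Per the discussion, giving a cell conductor an `[api_database]` section would not break anything, but it is not the recommended layout.
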
noonedeadpunkhm, I think we have 2....15:51
noonedeadpunkoh, right15:52
noonedeadpunkjust nova_cell0 is not really used in "single" setup15:52
sean-k-mooneyit's special, it will never have computes15:52
sean-k-mooneyit does hold things like service records etc.15:53
sean-k-mooneybut it will tend to have little data beyond instances that failed to schedule15:53
noonedeadpunkbut in "single cell" setup, there's actually nothing which could connect to cell0?15:54
sean-k-mooneyi would suggest looking at the video links in https://docs.openstack.org/nova/latest/admin/cells.html#references to understand how this is intended to work15:55
sean-k-mooneyin your single cell case the api can and does15:55
noonedeadpunkyeah, I was trying to think if that's possible to do without superconductor, by allowing all cell conductors to connect to api database/mq15:56
sean-k-mooneythe conductor can and does, by looking it up in the db if it needs to15:56
noonedeadpunkand avoid need to have a "special" superconductor15:56
sean-k-mooneyyes but that is not something you should do15:58
sean-k-mooneythe primary reason to have multiple cells is to scale nova by sharding its db and rabbit clusters15:58
noonedeadpunkI was thinking if that could solve trade-offs with upcalls15:58
sean-k-mooneyno it does not help15:59
noonedeadpunkyeah, so each cell has own db and mq, where main load is handled, but then all conductors would connect to this db, which might be an option if you have like couple of cells16:00
sean-k-mooneythis was one of the things tripleo got very wrong when they first added the ability to add an additional cell16:00
noonedeadpunkokay, gotcha16:00
noonedeadpunkI guess next part is - how that is working with AZs....16:06
sean-k-mooneycompletely unrelated concepts16:07
noonedeadpunkbut like... it would make sense to have cell per az?16:07
sean-k-mooneyhosts in a cell can be in different AZs and an AZ can reference hosts in multiple cells16:07
sean-k-mooneynoonedeadpunk: that is one approach but it depends on your usecase16:08
sean-k-mooneyit's not forced in any way16:08
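
To illustrate that cells and AZs are independent groupings, here is a small sketch with a made-up host inventory. The host, cell and AZ names are hypothetical; the point is only that each cell can span several AZs and each AZ can span several cells.

```python
from collections import defaultdict

# Hypothetical inventory: (host, cell, availability zone).
HOSTS = [
    ("compute-01", "cell1", "az-east"),
    ("compute-02", "cell1", "az-west"),
    ("compute-03", "cell2", "az-east"),
    ("compute-04", "cell2", "az-west"),
]


def group_hosts(column):
    """Group host names by the given tuple column (1 = cell, 2 = AZ)."""
    groups = defaultdict(set)
    for row in HOSTS:
        groups[row[column]].add(row[0])
    return dict(groups)


hosts_by_cell = group_hosts(1)
hosts_by_az = group_hosts(2)

# A single cell contains hosts from more than one AZ ...
azs_in_cell1 = {az for _, cell, az in HOSTS if cell == "cell1"}
# ... and a single AZ references hosts from more than one cell.
cells_in_az_east = {cell for _, cell, az in HOSTS if az == "az-east"}

print(sorted(azs_in_cell1), sorted(cells_in_az_east))
```

A one-cell-per-AZ layout is just the special case where every row pairs the same cell with the same AZ.
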
noonedeadpunkas given AZ is geographically spread, having mariadb/rabbitmq cluster isolated in it - makes sense16:08
sean-k-mooneyAZs in openstack have many uses, that is only one of them16:09
noonedeadpunkoh right, sure16:09
sean-k-mooneyremember that AWS availability zones are a better logical map to a keystone region than they are to a nova AZ16:09
noonedeadpunkright, though what we are trying to do, is to have stretched tenant networks between AZs16:10
sean-k-mooneyagain cells are for scaling. if your scaling unit happens to be a datacenter and each datacenter is physically spread out, then 1 cell per datacenter and one AZ per datacenter might make sense16:10
noonedeadpunkand have workloads stretched as well, which you can not do easily with different regions16:10
noonedeadpunkyup16:11
sean-k-mooneywell tenant networks are intended to be available across all AZs and cells, so unless you are using l3 routed networks that's the default16:11
noonedeadpunkI just bet there are plenty scheduling complications on top....16:11
sean-k-mooneyalso there is no cell concept on the neutron side so it's 1 neutron for all cells by default16:11
noonedeadpunkright, which is not the case with keystone regions16:11
tkajinamstephenfin, do you mind voting +A when you have time ? https://review.opendev.org/c/openstack/nova/+/943083 +A was postponed due to feature freeze period of 2025.2 but the freeze was lifted, apparently16:18
tkajinamsean-k-mooney, or if you have time16:19
opendevreviewTakashi Kajinami proposed openstack/nova master: Replace remaining reference to policy.json  https://review.opendev.org/c/openstack/nova/+/97064816:23
noonedeadpunkI also assume that for multi-cell environment, you pretty much never want to use template urls, as I'd guess it will make super conductor extremely confused?16:38
noonedeadpunkor force to use the same password for all cells?16:39
sean-k-mooneyyes either works.16:43
sean-k-mooneytemplate urls were built for tripleo to remove passwords from the db, but we intentionally do not use them in our new installer and revert to non-templated urls to simplify management of them over time16:44
sean-k-mooneythey are useful in theory but less so in practice16:44
noonedeadpunkwe adopted them quite some time ago as well, because of some bug in nova-manage - when a command with a different password was just creating another cell. So it was breaking nova during the password rotation process16:46
noonedeadpunk(I bet I reported it)16:46
noonedeadpunkas then you also need to properly time update of the cell with service restart....16:47
noonedeadpunkhttps://bugs.launchpad.net/nova/+bug/192389916:49
noonedeadpunkit's still in review :D16:49
sean-k-mooneyyou should not rerun the create cell command16:49
sean-k-mooneythat is not for updating the cell16:49
sean-k-mooneyit's for creating a new one16:50
sean-k-mooneyif you want to update use the update command https://docs.openstack.org/nova/latest/cli/nova-manage.html#cell-v2-update-cell16:50
sean-k-mooneythe name is for human use only, it's not the unique identifier16:51
sean-k-mooneythis is somewhere between user error and a missing unique constraint16:52
sean-k-mooneyinternally nova only uses the cell uuid for joins16:52
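
The create-versus-update distinction can be sketched with a toy model. The `CellMappings` class below is a hypothetical stand-in for the cell mappings table, not nova's implementation; it only assumes what is stated above, namely that rows are identified by UUID and that the name is not a unique key. All URLs are made up.

```python
import uuid


class CellMappings:
    """Toy stand-in for the cell mappings table, keyed by cell UUID."""

    def __init__(self):
        self.rows = {}

    def create_cell(self, name, transport_url, database_connection):
        # The name is not checked for uniqueness, so every call adds a row.
        cell_uuid = str(uuid.uuid4())
        self.rows[cell_uuid] = {
            "name": name,
            "transport_url": transport_url,
            "database_connection": database_connection,
        }
        return cell_uuid

    def update_cell(self, cell_uuid, **changes):
        # The existing row is changed in place, identified by its UUID.
        self.rows[cell_uuid].update(changes)


mappings = CellMappings()
cell1 = mappings.create_cell(
    "cell1",
    "rabbit://nova:old@mq-cell1.example/",
    "mysql+pymysql://nova:old@db-cell1.example/nova_cell1",
)

# Password rotation done by re-running "create": a second "cell1" appears.
duplicate = mappings.create_cell(
    "cell1",
    "rabbit://nova:new@mq-cell1.example/",
    "mysql+pymysql://nova:new@db-cell1.example/nova_cell1",
)
names_after_create = sorted(row["name"] for row in mappings.rows.values())

# Password rotation done by updating the existing row instead.
del mappings.rows[duplicate]
mappings.update_cell(
    cell1,
    transport_url="rabbit://nova:new@mq-cell1.example/",
    database_connection="mysql+pymysql://nova:new@db-cell1.example/nova_cell1",
)
print(names_after_create, len(mappings.rows))
```
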
noonedeadpunkI guess it's kinda a general expectation of name_or_uuid behavior as well...16:52
sean-k-mooneyright, that's not the documented behavior16:53
sean-k-mooneyand we don't support that for nova-manage normally16:53
noonedeadpunkas it's kinda hard in automated tooling to reference by uuid, but yeah, I get the point16:53
sean-k-mooneythat's an openstack client thing16:53
sean-k-mooneynot something that the api or project -manage commands typically support16:53
sean-k-mooneythe proposed patch is not actually a fix16:54
sean-k-mooneyit's a nova-status command to detect cells with the same name16:54
sean-k-mooneyit does not prevent it from happening or actually address the bug16:55
sean-k-mooneywell there are two https://review.opendev.org/c/openstack/nova/+/90181016:55
sean-k-mooneyis the nova-status check16:55
noonedeadpunkit could be helpful as well, but yeah...16:56
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/876940 changed the unique constraint but that has upgrade impacts16:56
* noonedeadpunk already regrets to start looking into cells16:57
noonedeadpunkhow the heck notifications are even supposed to work in multi-cell envs....16:59
noonedeadpunkor well. you set yet another rabbitmq cluster for notifications ,and all services can report to it....16:59
noonedeadpunkI guess that's the way....16:59
sean-k-mooneywell you are recommended to use a dedicated rabbit for notifications regardless of cells17:00
sean-k-mooneyso it does not really change that. you either use one rabbit for notifications for all cells or you need to have 1 per cell and listen to multiple notification buses in your applications17:00
noonedeadpunkand then they are not scaled with cells, simple as that I guess17:01
noonedeadpunkoh, right... or that17:01
sean-k-mooneyit depends. i believe ceilometer supports a list of notification transport urls17:01
sean-k-mooneybut it's more common to just use 1 dedicated rabbit for them17:01
sean-k-mooneybut again that's not really different than a normal deployment, where the recommendation has always been to separate rpc and notification traffic to different rabbit instances17:02
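
The "one dedicated rabbit for notifications" layout can be sketched like this. The hostnames, credentials and the `service_conf` helper are hypothetical; `transport_url` in `[DEFAULT]` and `driver` / `transport_url` in `[oslo_messaging_notifications]` are real oslo.messaging options. RPC traffic stays on each cell's own rabbit, while notifications from every cell go to one shared cluster.

```python
import configparser

# Hypothetical shared notification cluster used by all cells.
NOTIFICATION_URL = "rabbit://nova:secret@mq-notify.example:5672/"


def service_conf(rpc_url, notification_url=NOTIFICATION_URL):
    """Build the messaging part of a nova.conf-like config (hypothetical helper)."""
    conf = configparser.ConfigParser(interpolation=None)
    conf["DEFAULT"] = {"transport_url": rpc_url}
    conf["oslo_messaging_notifications"] = {
        "driver": "messagingv2",
        "transport_url": notification_url,
    }
    return conf


cell_confs = {
    cell: service_conf(f"rabbit://nova:secret@mq-{cell}.example:5672/")
    for cell in ("cell1", "cell2")
}

for cell, conf in cell_confs.items():
    print(cell,
          conf["DEFAULT"]["transport_url"],
          conf["oslo_messaging_notifications"]["transport_url"])
```

The per-cell alternative would pass a different `notification_url` for each cell, at the cost of consumers having to listen on several buses.
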
noonedeadpunkyes, true, I was just thinking of implications of cells, like firewalling, etc, where ceilometer should be also allowed to get connected to these cell-specific notification clusters or whatever...17:03
noonedeadpunkit somehow feels to have an enormous amount of various implications for deployments17:04
sean-k-mooneyno more so than conductor groups or shards in ironic17:21
sean-k-mooneybut cells are an architectural choice17:21
sean-k-mooneycells are very different than AZs. AZs are just names on an aggregate17:22
sean-k-mooneycells are about how you architect your cloud to scale and deserve the same thought as l3 routed networks for neutron17:23
melwittsean-k-mooney: I don't think those are duplicates fwiw. https://bugs.launchpad.net/nova/+bug/2132020 is debatable as a bug bc instances in ERROR state do still consume resources (from a quota perspective) until they are deleted. I don't have a 100% opinion on that without thinking about it more17:30
noonedeadpunkI am just frankly concerned that by the time you get up to the limit of rabbitmq/mariadb for nova, you will have a bigger bottleneck in some other place17:31
noonedeadpunk(like neutron db)17:31
JayFI mean, yeah. Ironic has been dealing with problems shaped like that for years, with Nova computes being the thing we outscale17:33
JayFand we figured it out (shards and conductor groups)17:33
noonedeadpunkand cell_mappings in nova_api is pretty much used only by superconductor, right? So when cell mapping is not templated, but you need to update password, you pretty much only care about the config change and restart of superconductor together with the cell mapping change?17:35
noonedeadpunkand it feels like with rabbitmq quorum queues and using streams for fanouts, rabbit becomes more horizontally scalable than it used to be... but yeah...17:39
*** haleyb is now known as haleyb|out17:43
sean-k-mooneymelwitt: they are not entirely, but i think they are related17:43
sean-k-mooneymelwitt: https://bugs.launchpad.net/nova/+bug/2132020 depends on the way the error happened.17:44
sean-k-mooneybasically if we have buried the instance in cell 0 because of a no valid host17:44
sean-k-mooneythen we should free any placement allocation on compute nodes for it17:44
sean-k-mooneythe same way we do when you shelve offload17:45
sean-k-mooneyif the instance is in error but instance.host is pointing at a real host17:45
melwittI'm saying it's not 100% clear whether that's consistent bc currently any instance that has an instance record consumes quota17:45
sean-k-mooneythen we need to keep the allocation17:45
sean-k-mooneybecause in theory you could rebuild or hard reboot the vm17:45
sean-k-mooneywe do the instance count check separately from the placement allocation17:46
sean-k-mooneyi think the original assumption is the quota violation is not for the instance quota17:46
melwittyeah, I dunno. it is just not 100% clear to me what to do to make sure things are consistent17:48
sean-k-mooneyi don't know if you have had time to look at the regression test https://review.opendev.org/c/openstack/nova/+/96925117:49
sean-k-mooneyi'm still trying to figure out if it is valid based on that, for what it's worth17:49
sean-k-mooneythe two just seem related to me and i wanted to make sure we were not trying to solve the same thing in two different ways17:50
melwittno I only skimmed17:50
sean-k-mooneythe reproducer has not convinced me it's an actual bug yet, for what it's worth17:51
sean-k-mooneyit looks like they actually might be using the logic from https://opendev.org/openstack/nova/src/branch/master/nova/tests/functional/regressions/test_bug_1806064.py#L9217:53
sean-k-mooneyto simulate this so maybe it is the num instance quota17:53
sean-k-mooneyto me if we exceed quota but are not scheduled to a host as a result of it, we should not be preventing other instances from using that space17:53
sean-k-mooneybut we obviously can't delete the allocation if there is any chance of this vm being recovered17:54
sean-k-mooneyi.e. with reset state or rebuild or whatever17:54
melwittyeah17:54
melwittI see the point about not preventing other instances from building. I was thinking about it focused on just the instance but thinking about it focused on the compute hosts I see the issue17:56
melwittsuperficially thinking about it I think each fix does have to be in a different place (conductor vs compute/api) because the allocation creation and volume reservation happen at different places17:58
melwittif you exceed quota in compute/api and a reserved volume was created, you have to clean it up there and you won't have any allocations yet17:59
melwittI'm actually not sure how quota is getting exceeded in conductor, if it is, then that would be the "quota recheck" step after the compute/api quota check passed18:00
melwittwhich is valid of course. just also highlighting how each bug is about a different point in the server create path18:01
opendevreviewJay Faulkner proposed openstack/nova master: [ironic] Use constants from Ironic, test w/ddt  https://review.opendev.org/c/openstack/nova/+/96932118:06
JayFhttps://review.opendev.org/c/openstack/nova/+/969321 is a follow-up to the Ironic bugfix, I did some changes in Ironic's side so we could just copy over the file that becomes ironic_states.py in our driver, to help make it easier to keep states synced. Also updated the tests (as requested on the previous patch) to use ddt. 18:19
JayFNo real urgency behind landing it, but I would like to get it in since I did the refactor on Ironic's side to make that constants file able to be copied around18:19
sean-k-mooneymelwitt: it can only be exceeded in the conductor if 2 api requests happen to different api instances at almost the same time18:55
sean-k-mooneyso it's rare but it's possible18:55
sean-k-mooneywell that or if an admin changes the quota while the request is in flight18:56
sean-k-mooneyi.e. after it has passed the api18:56
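
The race described above can be sketched as a toy simulation. Everything here is hypothetical (the limit, the function names, the status strings); it is not nova's quota code, just an illustration of how two requests can both pass the API-side check and one of them then fails the conductor-side recheck.

```python
LIMIT = 2  # hypothetical instance quota for the project

instances = ["existing"]  # one instance already counts against the quota
status = {}


def api_check(requested=1):
    """API-side check: would the request fit under the limit right now?"""
    return len(instances) + requested <= LIMIT


def conductor_recheck():
    """Conductor-side recheck, done after the instance record exists."""
    return len(instances) <= LIMIT


# Two requests reach different API workers at almost the same time.
# Neither has created its instance record yet, so both checks pass.
passed = {"a": api_check(), "b": api_check()}

# Each request then creates its record and is rechecked in the conductor.
for name in ("a", "b"):
    instances.append(name)
    status[name] = "BUILD" if conductor_recheck() else "ERROR"

print(passed, status, len(instances))
```

In this model the failed request's record still exists after the recheck, which mirrors the point made earlier that an instance in ERROR keeps counting against the instance quota until it is deleted.
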
melwittright18:59
opendevreviewMerged openstack/nova master: Drop direct dependency on iso8601  https://review.opendev.org/c/openstack/nova/+/94308320:51
