| *** haleyb is now known as haleyb|out | 00:43 | |
| *** bauzas1 is now known as bauzas | 00:52 | |
| *** mhen_ is now known as mhen | 02:09 | |
| *** Jeff is now known as Guest33398 | 06:41 | |
| opendevreview | yangjianfeng proposed openstack/nova master: Fix missing allocations due to unsuccessful evacuations https://review.opendev.org/c/openstack/nova/+/970335 | 06:42 |
|---|---|---|
| opendevreview | huanhongda proposed openstack/nova master: live migration: Complete cleanup if post_live_migration fails https://review.opendev.org/c/openstack/nova/+/967652 | 07:21 |
| *** ykarel__ is now known as ykarel | 08:01 | |
| opendevreview | yangjianfeng proposed openstack/nova master: [WIP] Fix: resources leak due to failed live-migration https://review.opendev.org/c/openstack/nova/+/970597 | 10:32 |
| noonedeadpunk | hey folks! I've started looking into multi-cell docs (https://docs.openstack.org/nova/latest/admin/cells.html) and got kind of confused by the super conductor part. Or specifically - how to tell it that it's a superconductor? | 13:11 |
| noonedeadpunk | As, I can imagine, conductor is able to fetch cells connection details from the api database | 13:11 |
| noonedeadpunk | but then how to tell a cell conductor not to do that? | 13:11 |
| noonedeadpunk | and that it should operate only with own database, and never reach api database? Just don't define api_database section? | 13:12 |
| noonedeadpunk | And also - does it make sense to do superconductor setup at all? | 13:13 |
| noonedeadpunk | As in RHOSP 16 docs I've seen that still it has limit of 500 hosts across all cells... | 13:13 |
| noonedeadpunk | (which I'd guess is due to network limit, but not sure) | 13:13 |
| noonedeadpunk | so my guess is that super conductor just has an [api_database] section, while regular conductors only [database] | 13:48 |
| opendevreview | Julien LE JEUNE proposed openstack/nova stable/2024.2: Fix bug 2000069 https://review.opendev.org/c/openstack/nova/+/970634 | 14:59 |
| dansmith | noonedeadpunk: if you have multiple cells, you need a superconductor, or at least delegate one cell's conductors to be the superconductor function | 15:26 |
| sean-k-mooney | noonedeadpunk: don't read the RHOSP docs for anything related to cells, if you do it will cause you pain and confusion | 15:28 |
| sean-k-mooney | noonedeadpunk: tripleo did not implement cells well | 15:28 |
| sean-k-mooney | the 500 limit in 16 was just what it was tested to | 15:28 |
| sean-k-mooney | there is actually no limit at 500 computes in nova | 15:29 |
| sean-k-mooney | tripleo however had scaling issues with 500+ servers | 15:29 |
| dansmith | also that ^ :) | 15:29 |
| sean-k-mooney | that was related to how it used heat and the heat -> ansible -> puppet workflow tripleo used | 15:30 |
| noonedeadpunk | dansmith: but how you actually do that? | 15:32 |
| noonedeadpunk | as I have not found option or any config reference | 15:32 |
| dansmith | noonedeadpunk: do what? | 15:32 |
| noonedeadpunk | do say that this conductor is a superconductor | 15:33 |
| dansmith | noonedeadpunk: having an api_database is critical, but the superconductor is the conductor that your API nodes talk to over MQ, | 15:33 |
| dansmith | so the [transport] of the API, scheduler, etc defines which conductors operate as the superconductor | 15:33 |
| sean-k-mooney | each conductor uses a separate rabbit cluster or at least a separate rabbit vhost. the super conductor is pointing at the cell0 rabbit cluster and has api access | 15:33 |
| noonedeadpunk | I guess I'm kind of missing property to do so | 15:34 |
| noonedeadpunk | I'd be assuming that superconductor gets connection details from the api_database? | 15:34 |
| sean-k-mooney | the api db access enables the super conductor to connect to the rabbit message buses and dbs for the other cells if needed to coordinate cross cell migration etc. | 15:34 |
| sean-k-mooney | yep the cell mapping table | 15:34 |
| noonedeadpunk | but like - what if you add api_database connection to all cell conductors? | 15:35 |
| noonedeadpunk | but then they will also be configured to use own cell db/mq | 15:35 |
| sean-k-mooney | that wont break anything | 15:35 |
| sean-k-mooney | the important thing is that each cell has its own rabbit or rabbit vhost | 15:35 |
| noonedeadpunk | so kinda this way all cell conductors are "superconductor"? | 15:35 |
| sean-k-mooney | so the conductors and computes for a given cell use a specific rabbit | 15:36 |
| dansmith | I guess sean-k-mooney wants to answer this, so I guess I'll back off | 15:36 |
| sean-k-mooney | no, go ahead if you like | 15:36 |
| noonedeadpunk | eventually osa also lacks proper multicell, despite some of our users having it | 15:37 |
| sean-k-mooney | i just wanted to give a high-level overview. i thought i had a good link but i am not finding it in my history | 15:37 |
| noonedeadpunk | so I wanted to implement it, but I fail to get to the bottom of how everything is connected, and I did not find any super relevant options in nova.conf reference | 15:37 |
| noonedeadpunk | https://docs.openstack.org/nova/latest/admin/cells.html ? | 15:38 |
| sean-k-mooney | noonedeadpunk: if you are not aware dansmith is nova cell expert. they did a good summit talk on how it works a few years ago | 15:38 |
| sean-k-mooney | https://www.youtube.com/watch?v=o7jkqzORij8 is the link i was looking for | 15:39 |
| noonedeadpunk | heh, "few":) | 15:40 |
| noonedeadpunk | ok, so eventually cells are super good for EDGE I assume | 15:41 |
| sean-k-mooney | am it depends. | 15:42 |
| sean-k-mooney | for far edge you dont want to have 1 cell per store | 15:42 |
| sean-k-mooney | but for near edge it can be a good topology | 15:43 |
| noonedeadpunk | ok, so back to the confusion I have... what is the difference in config between conductor and super conductor? Just presence of api_database? | 15:44 |
| sean-k-mooney | the rabbit transport url and api_database section | 15:45 |
| noonedeadpunk | ok, but where do I define cell transport then? or cell conductor has cell rabbitmq in transport_url and super conductor has cell0 transport_url? | 15:48 |
| sean-k-mooney | yes that ^ | 15:48 |
| noonedeadpunk | and then super conductor finds out about cells from api_database and connects there as well? | 15:48 |
| sean-k-mooney | when it needs to yes | 15:49 |
| noonedeadpunk | and then superconductor does not have [database] section at all I assume? | 15:49 |
| sean-k-mooney | no it has the cell 0 db | 15:50 |
| noonedeadpunk | but isn't that api_database? | 15:50 |
| sean-k-mooney | no | 15:50 |
| noonedeadpunk | ok, I need to check on that then... | 15:51 |
| sean-k-mooney | a default "single cell" cellv2 deployment has 3 databases | 15:51 |
| sean-k-mooney | nova_api nova_cell0 and nova_cell1 | 15:51 |
| sean-k-mooney | the cell dbs have the same schema | 15:51 |
| sean-k-mooney | the api db is differnt | 15:51 |
| noonedeadpunk | hm, I think we have 2.... | 15:51 |
| noonedeadpunk | oh, right | 15:52 |
| noonedeadpunk | just nova_cell0 is not really used in "single" setup | 15:52 |
| sean-k-mooney | it's special, it will never have computes | 15:52 |
| sean-k-mooney | it does hold things like service records etc. | 15:53 |
| sean-k-mooney | but it will tend to have little data beyond instances that failed to schedule | 15:53 |
| noonedeadpunk | but in "single cell" setup, there's actually nothing which could connect to cell0? | 15:54 |
| sean-k-mooney | i would suggest looking at the video links in https://docs.openstack.org/nova/latest/admin/cells.html#references to understand how this is intended to work | 15:55 |
| sean-k-mooney | in your single cell case the api can and does | 15:55 |
| noonedeadpunk | yeah, I was trying to think if that's possible to do without superconductor, by allowing all cell conductors to connect to api database/mq | 15:56 |
| sean-k-mooney | the conductor can and does, by looking it up in the db if it needs to | 15:56 |
| noonedeadpunk | and avoid need to have a "special" superconductor | 15:56 |
| sean-k-mooney | yes but that is not something you should do | 15:58 |
| sean-k-mooney | the primary reason to have multiple cells is to scale nova by sharding its db and rabbit clusters | 15:58 |
| noonedeadpunk | I was thinking if that could solve trade-offs with upcalls | 15:58 |
| sean-k-mooney | no it does not help | 15:59 |
| noonedeadpunk | yeah, so each cell has own db and mq, where main load is handled, but then all conductors would connect to this db, which might be an option if you have like couple of cells | 16:00 |
| sean-k-mooney | this was one of the things tripleo got very wrong when they first added the ability to add an additional cell | 16:00 |
| noonedeadpunk | okay, gotcha | 16:00 |
| noonedeadpunk | I guess next part is - how that is working with AZs.... | 16:06 |
| sean-k-mooney | completely unrelated concepts | 16:07 |
| noonedeadpunk | but like... it would make sense to have cell per az? | 16:07 |
| sean-k-mooney | hosts in a cell can be in different AZs and an AZ can reference hosts in multiple cells | 16:07 |
| sean-k-mooney | noonedeadpunk: that is one approach but it depends on your usecase | 16:08 |
| sean-k-mooney | it's not forced in any way | 16:08 |
| noonedeadpunk | as given AZ is geographically spread, having mariadb/rabbitmq cluster isolated in it - makes sense | 16:08 |
| sean-k-mooney | AZs in openstack have many uses, that is only one of them | 16:09 |
| noonedeadpunk | oh right, sure | 16:09 |
| sean-k-mooney | remember that AWS availability zones are a better logical map to a keystone region than they are to a nova AZ | 16:09 |
| noonedeadpunk | right, though what we are trying to do, is to have stretched tenant networks between AZs | 16:10 |
| sean-k-mooney | again cells are for scaling. if your scaling unit happens to be a datacenter and each datacenter is physically spread then 1 cell per datacenter and one AZ per datacenter might make sense | 16:10 |
| noonedeadpunk | and have workloads stretched as well, which you can not do easily with different regions | 16:10 |
| noonedeadpunk | yup | 16:11 |
| sean-k-mooney | well tenant networks are intended to be available across all AZs and cells so unless you're using l3 routed networks that's the default | 16:11 |
| noonedeadpunk | I just bet there are plenty scheduling complications on top.... | 16:11 |
| sean-k-mooney | also there is no cell concept on the neutron side so it's 1 neutron for all cells by default | 16:11 |
| noonedeadpunk | right, which is not the case with keystone regions | 16:11 |
| tkajinam | stephenfin, do you mind voting +A when you have time ? https://review.opendev.org/c/openstack/nova/+/943083 +A was postponed due to feature freeze period of 2025.2 but the freeze was lifted, apparently | 16:18 |
| tkajinam | sean-k-mooney, or if you have time | 16:19 |
| opendevreview | Takashi Kajinami proposed openstack/nova master: Replace remaining reference to policy.json https://review.opendev.org/c/openstack/nova/+/970648 | 16:23 |
| noonedeadpunk | I also assume that for multi-cell environment, you pretty much never want to use template urls, as I'd guess it will make super conductor extremely confused? | 16:38 |
| noonedeadpunk | or force to use the same password for all cells? | 16:39 |
| sean-k-mooney | yes either works. | 16:43 |
| sean-k-mooney | template urls were built for tripleo to remove passwords from the db but we intentionally do not use them in our new installer and revert to non-templated urls to simplify management of them over time | 16:44 |
| sean-k-mooney | they are useful in theory but less so in practice | 16:44 |
| noonedeadpunk | we adopted them quite some time ago as well, because of some bug in nova-manage - when a command with a different password was just creating another cell. So it was breaking nova on password rotation process | 16:46 |
| noonedeadpunk | (I bet I reported it) | 16:46 |
| noonedeadpunk | as then you also need to properly time update of the cell with service restart.... | 16:47 |
| noonedeadpunk | https://bugs.launchpad.net/nova/+bug/1923899 | 16:49 |
| noonedeadpunk | it's still in review :D | 16:49 |
| sean-k-mooney | you should not rerun the create cell command | 16:49 |
| sean-k-mooney | that is not for updating the cell | 16:49 |
| sean-k-mooney | it's for creating a new one | 16:50 |
| sean-k-mooney | if you want to update use the update command https://docs.openstack.org/nova/latest/cli/nova-manage.html#cell-v2-update-cell | 16:50 |
| sean-k-mooney | the name is for human use only, it's not the unique identifier | 16:51 |
| sean-k-mooney | this is somewhere between user error and a missing unique constraint | 16:52 |
| sean-k-mooney | internally nova only uses the cell uuid for joins | 16:52 |
| noonedeadpunk | I guess it's kinda general expectation of name_or_uuid behavior as well... | 16:52 |
| sean-k-mooney | right, that's not the documented behavior | 16:53 |
| sean-k-mooney | and we don't support that for nova-manage normally | 16:53 |
| noonedeadpunk | as it's kinda hard in automated tooling to reference by uuid, but yeah, I get the point | 16:53 |
| sean-k-mooney | that's an openstack client thing | 16:53 |
| sean-k-mooney | not something that the api or project -manage commands typically support | 16:53 |
| sean-k-mooney | the proposed patch is not actually a fix | 16:54 |
| sean-k-mooney | it's a nova-status command to detect cells with the same name | 16:54 |
| sean-k-mooney | it does not prevent it from happening or actually address the bug | 16:55 |
| sean-k-mooney | well there are two https://review.opendev.org/c/openstack/nova/+/901810 | 16:55 |
| sean-k-mooney | is the nova-status check | 16:55 |
| noonedeadpunk | it could be helpful as well, but yeah... | 16:56 |
| sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/876940 changed the unique constraint but that has upgrade impacts | 16:56 |
| * noonedeadpunk already regrets to start looking into cells | 16:57 | |
| noonedeadpunk | how the heck notifications are even supposed to work in multi-cell envs.... | 16:59 |
| noonedeadpunk | or well. you set up yet another rabbitmq cluster for notifications, and all services can report to it.... | 16:59 |
| noonedeadpunk | I guess that's the way.... | 16:59 |
| sean-k-mooney | well you are recommended to use a dedicated rabbit for notifications regardless of cells | 17:00 |
| sean-k-mooney | so it does not really change that. you either use one rabbit for notifications for all cells or you need to have 1 per cell and listen to multiple notification buses in your applications | 17:00 |
| noonedeadpunk | and then they are not scaled with cells, simple as that I guess | 17:01 |
| noonedeadpunk | oh, right... or that | 17:01 |
| sean-k-mooney | it depends. i believe ceilometer supports a list of notification transport urls | 17:01 |
| sean-k-mooney | but it's more common to just use 1 dedicated rabbit for them | 17:01 |
| sean-k-mooney | but again that's not really different than a normal deployment where the recommendation has always been to separate rpc and notification traffic onto different rabbit instances | 17:02 |
| noonedeadpunk | yes, true, I was just thinking of implications of cells, like firewalling, etc, where ceilometer should be also allowed to get connected to these cell-specific notification clusters or whatever... | 17:03 |
| noonedeadpunk | it somehow feels like it has an enormous amount of various implications for deployments | 17:04 |
| sean-k-mooney | no more so than conductor groups or shards in ironic | 17:21 |
| sean-k-mooney | but cells are an architectural choice | 17:21 |
| sean-k-mooney | cells are very different from AZs. AZs are just names on an aggregate | 17:22 |
| sean-k-mooney | cells are about how you architect your cloud to scale and deserve the same thought as l3 routed networks for neutron | 17:23 |
| melwitt | sean-k-mooney: I don't think those are duplicates fwiw. https://bugs.launchpad.net/nova/+bug/2132020 is debatable as a bug bc instances in ERROR state do still consume resources (from a quota perspective) until they are deleted. I don't have a 100% opinion on that without thinking about it more | 17:30 |
| noonedeadpunk | I am just frankly concerned that by the time you might hit the limit of rabbitmq/mariadb for nova, you will have a bigger bottleneck in some other place | 17:31 |
| noonedeadpunk | (like neutron db) | 17:31 |
| JayF | I mean, yeah. Ironic has been dealing with problems shaped like that for years, with Nova computes being the thing we outscale | 17:33 |
| JayF | and we figured it out (shards and conductor groups) | 17:33 |
| noonedeadpunk | and cell_mappings in nova_api is pretty much used only by superconductor, right? So when cell mapping is not templated, but you need to update password, you pretty much caring about config change and restart of superconductor with cell mapping change? | 17:35 |
| noonedeadpunk | and it feels like with rabbitmq quorum queues and using streams for fanouts, rabbit becomes more horizontally scalable than it used to be... but yeah... | 17:39 |
| *** haleyb is now known as haleyb|out | 17:43 | |
| sean-k-mooney | melwitt: they are not entirely, but i think they are related | 17:43 |
| sean-k-mooney | melwitt: https://bugs.launchpad.net/nova/+bug/2132020 depends on the way the error happened. | 17:44 |
| sean-k-mooney | basically if we have buried the instance in cell 0 because of a no valid host | 17:44 |
| sean-k-mooney | then we should free any placement allocation on compute nodes for it | 17:44 |
| sean-k-mooney | the same way we do when you shelve offload | 17:45 |
| sean-k-mooney | if the instance is in error but instance.host is pointing at a real host | 17:45 |
| melwitt | I'm saying it's not 100% clear whether that's consistent bc currently any instance that has an instance record consumes quota | 17:45 |
| sean-k-mooney | then we need to keep the allocation | 17:45 |
| sean-k-mooney | because in theory you could rebuild or hard reboot the vm | 17:45 |
| sean-k-mooney | we do the instance count check separately from the placement allocation | 17:46 |
| sean-k-mooney | i think the original assumption is the quota violation is not for the instance quota | 17:46 |
| melwitt | yeah, I dunno. it is just not 100% clear to me what to do to make sure things are consistent | 17:48 |
| sean-k-mooney | i dont know if you have had time to look at the regression test https://review.opendev.org/c/openstack/nova/+/969251 | 17:49 |
| sean-k-mooney | i'm still trying to figure out if it is valid based on that for what it's worth | 17:49 |
| sean-k-mooney | the two just seem related to me and i wanted to make sure we were not trying to solve the same thing in two different ways | 17:50 |
| melwitt | no I only skimmed | 17:50 |
| sean-k-mooney | the reproducer has not convinced me it's an actual bug yet for what it's worth | 17:51 |
| sean-k-mooney | it looks like they actually might be using the logic from https://opendev.org/openstack/nova/src/branch/master/nova/tests/functional/regressions/test_bug_1806064.py#L92 | 17:53 |
| sean-k-mooney | to simulate this so maybe it is the num instance quota | 17:53 |
| sean-k-mooney | to me if we exceed quota but are not scheduled to a host as a result of it we should not be preventing other instances from using that space | 17:53 |
| sean-k-mooney | but we obviously can't delete the allocation if there is any chance of this vm being recovered | 17:54 |
| sean-k-mooney | i.e. with reset state or rebuild or whatever | 17:54 |
| melwitt | yeah | 17:54 |
| melwitt | I see the point about not preventing other instances from building. I was thinking about it focused on just the instance but thinking about it focused on the compute hosts I see the issue | 17:56 |
| melwitt | superficially thinking about it I think each fix does have to be in a different place (conductor vs compute/api) because the allocation creation and volume reservation happen at different places | 17:58 |
| melwitt | if you exceed quota in compute/api and a reserved volume was created, you have to clean it up there and you won't have any allocations yet | 17:59 |
| melwitt | I'm actually not sure how quota is getting exceeded in conductor, if it is, then that would be the "quota recheck" step after the compute/api quota check passed | 18:00 |
| melwitt | which is valid of course. just also highlighting how each bug is about a different point in the server create path | 18:01 |
| opendevreview | Jay Faulkner proposed openstack/nova master: [ironic] Use constants from Ironic, test w/ddt https://review.opendev.org/c/openstack/nova/+/969321 | 18:06 |
| JayF | https://review.opendev.org/c/openstack/nova/+/969321 is a follow-up to the Ironic bugfix, I did some changes in Ironic's side so we could just copy over the file that becomes ironic_states.py in our driver, to help make it easier to keep states synced. Also updated the tests (as requested on the previous patch) to use ddt. | 18:19 |
| JayF | No real urgency behind landing it, but I would like to get it in since I did the refactor on Ironic's side to make that constants file able to be copied around | 18:19 |
| sean-k-mooney | melwitt: it can only be exceeded in the conductor if 2 api requests happen to different api instances at almost the same time | 18:55 |
| sean-k-mooney | so it's rare but it's possible | 18:55 |
| sean-k-mooney | well that or if an admin changes the quota while the request is in flight | 18:56 |
| sean-k-mooney | i.e. after it's passed the api | 18:56 |
| melwitt | right | 18:59 |
| opendevreview | Merged openstack/nova master: Drop direct dependency on iso8601 https://review.opendev.org/c/openstack/nova/+/943083 | 20:51 |
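
Editor's sketch of the conductor configuration discussed around 15:44-15:50. Going by what was said in the channel, a cell conductor has its own cell's rabbit in `transport_url` and its own cell db in `[database]`, while the superconductor points at the cell0 rabbit and the cell0 db and additionally has an `[api_database]` section. The helper name `render_conductor_conf` and all hostnames, credentials and vhosts below are hypothetical placeholders; this is only an illustration of the difference, not a complete or recommended nova.conf.

```python
import configparser
import io


def render_conductor_conf(transport_url, cell_db_url, api_db_url=None):
    """Render a minimal nova.conf fragment for a conductor (hypothetical helper).

    Passing api_db_url adds an [api_database] section, which per the
    discussion is what lets the superconductor look up other cells.
    """
    # interpolation=None so '%' characters in URLs are written verbatim.
    conf = configparser.ConfigParser(interpolation=None)
    conf["DEFAULT"] = {"transport_url": transport_url}
    conf["database"] = {"connection": cell_db_url}
    if api_db_url is not None:
        conf["api_database"] = {"connection": api_db_url}
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


# Cell conductor: only its own cell's message queue and database.
cell1_conf = render_conductor_conf(
    "rabbit://nova:secret@cell1-mq.example.org:5672/cell1",
    "mysql+pymysql://nova:secret@cell1-db.example.org/nova_cell1",
)

# Superconductor: cell0 message queue and database, plus the API database.
super_conf = render_conductor_conf(
    "rabbit://nova:secret@api-mq.example.org:5672/cell0",
    "mysql+pymysql://nova:secret@api-db.example.org/nova_cell0",
    api_db_url="mysql+pymysql://nova:secret@api-db.example.org/nova_api",
)

if __name__ == "__main__":
    print(cell1_conf)
    print(super_conf)
```

As noted at 15:35 and 15:58, adding `[api_database]` to every cell conductor reportedly would not break anything, but it was advised against because the point of cells is to shard the db and rabbit load.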
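
Editor's sketch for the duplicate cell name issue discussed around 16:49-16:56. The channel notes that the cell name is for humans only and that the uuid is the real identifier, so running the create command again with a changed password can leave two cells sharing one name. The function below is a hypothetical illustration of detecting that situation from a list of cell mapping records; it is not the code of the nova-status check under review, and the record layout (`uuid` and `name` keys) is an assumption.

```python
from collections import defaultdict


def find_duplicate_cell_names(cell_mappings):
    """Return {name: [uuid, ...]} for every cell name used more than once.

    cell_mappings is assumed to be an iterable of dicts with 'uuid' and
    'name' keys (a hypothetical layout chosen for this sketch).
    """
    uuids_by_name = defaultdict(list)
    for cell in cell_mappings:
        uuids_by_name[cell["name"]].append(cell["uuid"])
    return {
        name: uuids
        for name, uuids in uuids_by_name.items()
        if len(uuids) > 1
    }


# Hypothetical data: "cell1" was created twice, e.g. after a password change.
example_cells = [
    {"uuid": "11111111-aaaa-4aaa-8aaa-000000000001", "name": "cell0"},
    {"uuid": "11111111-aaaa-4aaa-8aaa-000000000002", "name": "cell1"},
    {"uuid": "11111111-aaaa-4aaa-8aaa-000000000003", "name": "cell1"},
]

if __name__ == "__main__":
    for name, uuids in find_duplicate_cell_names(example_cells).items():
        print(f"cell name {name!r} is shared by: {', '.join(uuids)}")
```

A check like this only reports the problem. As pointed out at 16:55, detection does not stop the duplicate from being created in the first place.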