*** mlavalle has quit IRC | 00:04 | |
*** martinkennelly has quit IRC | 00:05 | |
*** CeeMac has quit IRC | 00:24 | |
*** hamalq has quit IRC | 01:00 | |
openstackgerrit | norman shen proposed openstack/nova master: Saving security group to info_cache https://review.opendev.org/c/openstack/nova/+/786348 | 01:26 |
guilhermesp | melwitt: and I suppose the only workaround is rebuild the instance to get the vgpu reallocated, right? | 01:30 |
*** Anticimex has quit IRC | 01:56 | |
openstackgerrit | Qiu Fossen proposed openstack/nova-specs master: Support live migrate vtpm server https://review.opendev.org/c/openstack/nova-specs/+/785860 | 01:57 |
*** Anticimex has joined #openstack-nova | 01:59 | |
*** xinranwang has joined #openstack-nova | 02:02 | |
*** brinzhang_ is now known as brinzhang | 02:06 | |
openstackgerrit | Qiu Fossen proposed openstack/nova-specs master: Support fuzzy querying instance by tag https://review.opendev.org/c/openstack/nova-specs/+/768853 | 02:07 |
*** rcernin has quit IRC | 02:10 | |
*** rcernin has joined #openstack-nova | 02:36 | |
*** hemanth_n has joined #openstack-nova | 02:40 | |
*** sapd1 has joined #openstack-nova | 03:29 | |
*** rcernin has quit IRC | 03:34 | |
*** psachin has joined #openstack-nova | 03:39 | |
*** rcernin has joined #openstack-nova | 03:44 | |
openstackgerrit | Xinran WANG proposed openstack/nova-specs master: Repropose smartnic support spec https://review.opendev.org/c/openstack/nova-specs/+/783632 | 03:48 |
*** rcernin has quit IRC | 03:55 | |
*** rcernin has joined #openstack-nova | 03:55 | |
*** mkrai has joined #openstack-nova | 04:06 | |
*** rcernin has quit IRC | 04:13 | |
*** vishalmanchanda has joined #openstack-nova | 04:23 | |
*** ratailor has joined #openstack-nova | 04:30 | |
*** brinzhang_ has joined #openstack-nova | 04:45 | |
*** sapd1 has quit IRC | 04:48 | |
*** brinzhang has quit IRC | 04:48 | |
*** whoami-rajat has joined #openstack-nova | 04:50 | |
*** rcernin has joined #openstack-nova | 04:55 | |
*** ratailor_ has joined #openstack-nova | 05:10 | |
*** rcernin has quit IRC | 05:10 | |
*** ratailor has quit IRC | 05:13 | |
openstackgerrit | Jeffrey Zhang proposed openstack/nova master: Support inject-nmi action in watchdog https://review.opendev.org/c/openstack/nova/+/741072 | 05:20 |
*** rcernin has joined #openstack-nova | 05:49 | |
*** rcernin has quit IRC | 05:49 | |
*** rcernin has joined #openstack-nova | 05:49 | |
*** slaweq has quit IRC | 05:55 | |
*** slaweq_ has joined #openstack-nova | 05:55 | |
*** slaweq_ is now known as slaweq | 05:55 | |
*** xinranwang has quit IRC | 06:01 | |
*** waleedm__ has joined #openstack-nova | 06:07 | |
*** rcernin has quit IRC | 06:10 | |
*** ralonsoh has joined #openstack-nova | 06:10 | |
*** rcernin has joined #openstack-nova | 06:28 | |
*** rcernin has quit IRC | 06:29 | |
*** slaweq_ has joined #openstack-nova | 06:30 | |
*** rcernin has joined #openstack-nova | 06:30 | |
*** slaweq has quit IRC | 06:36 | |
*** slaweq_ is now known as slaweq | 06:36 | |
*** waleedm__ has quit IRC | 06:40 | |
*** gyee has quit IRC | 06:59 | |
melwitt | guilhermesp: shelve offload/unshelve might work? my thinking is shelve offload would deallocate placement resources and then unshelve would reallocate everything again | 07:00 |
*** dklyle has quit IRC | 07:11 | |
*** andrewbonney has joined #openstack-nova | 07:24 | |
*** sapd1 has joined #openstack-nova | 07:34 | |
*** luksky has joined #openstack-nova | 07:34 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Replace blind retry with libvirt event waiting in detach https://review.opendev.org/c/openstack/nova/+/770246 | 07:38 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Move the guest.get_disk test to test_guest https://review.opendev.org/c/openstack/nova/+/777151 | 07:38 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Enable mypy on libvirt/guest.py https://review.opendev.org/c/openstack/nova/+/777155 | 07:39 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Follow up type hints for a634103 https://review.opendev.org/c/openstack/nova/+/777159 | 07:39 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: libvirt: Remove dead error handling code https://review.opendev.org/c/openstack/nova/+/779704 | 07:39 |
*** rpittau|afk is now known as rpittau | 07:45 | |
*** tosky has joined #openstack-nova | 07:46 | |
openstackgerrit | Balazs Gibizer proposed openstack/osc-placement master: Mark microversion 1.37 supported https://review.opendev.org/c/openstack/osc-placement/+/784023 | 07:49 |
lyarwood | gibi: I'll start looking at the detach stuff this morning | 07:49 |
lyarwood | gibi: https://review.opendev.org/c/openstack/nova/+/784129/2 would you mind hitting this to unblock the actual fix on top? | 07:49 |
gibi | lyarwood: thanks. and sure I will look at the funct test | 07:50 |
lyarwood | cool thanks | 07:50 |
*** ociuhandu has joined #openstack-nova | 07:57 | |
*** dtantsur|afk is now known as dtantsur | 07:59 | |
*** lucasagomes has joined #openstack-nova | 08:09 | |
*** ociuhandu has quit IRC | 08:14 | |
*** martinkennelly has joined #openstack-nova | 08:25 | |
*** mkrai has quit IRC | 08:25 | |
*** ociuhandu has joined #openstack-nova | 08:27 | |
*** luksky has quit IRC | 08:31 | |
*** luksky has joined #openstack-nova | 08:31 | |
*** hemanth_n has quit IRC | 08:32 | |
*** mkrai has joined #openstack-nova | 08:34 | |
*** rcernin has quit IRC | 08:37 | |
*** derekh has joined #openstack-nova | 08:42 | |
*** luksky has quit IRC | 08:42 | |
lyarwood | does anyone know where I can find the actual schedule for ptg? https://www.openstack.org/ptg/ just shows a static image for Monday AFAICT. | 08:44 |
lyarwood | http://ptg.openstack.org/ptg.html ah | 08:44 |
*** slaweq has quit IRC | 08:51 | |
*** slaweq has joined #openstack-nova | 08:51 | |
*** luksky has joined #openstack-nova | 08:55 | |
gibi | lyarwood: also above the static image there is a link to a pdf | 08:59 |
*** ociuhandu has quit IRC | 09:10 | |
*** ociuhandu has joined #openstack-nova | 09:10 | |
*** ociuhandu has quit IRC | 09:15 | |
*** vishalmanchanda has quit IRC | 09:27 | |
*** k_mouza has joined #openstack-nova | 09:29 | |
*** ociuhandu has joined #openstack-nova | 09:35 | |
openstackgerrit | Merged openstack/nova master: Add regression test for bug #1922053 https://review.opendev.org/c/openstack/nova/+/784129 | 09:44 |
openstack | bug 1922053 in OpenStack Compute (nova) "Operators can force up compute services with `done` evacuation migration records still active against the host" [Medium,In progress] https://launchpad.net/bugs/1922053 - Assigned to Lee Yarwood (lyarwood) | 09:44 |
openstackgerrit | Merged openstack/nova master: api: Reject requests to force up computes when `done` evacuation records exist https://review.opendev.org/c/openstack/nova/+/784130 | 09:44 |
*** dpawlik9 has quit IRC | 09:46 | |
*** vishalmanchanda has joined #openstack-nova | 09:48 | |
*** tesseract has joined #openstack-nova | 09:49 | |
*** swp20 has joined #openstack-nova | 09:51 | |
gibi | fyi, there is a frequent live migration failure in tempest in the new trunk port test | 09:53 |
gibi | https://bugs.launchpad.net/tempest/+bug/1924258 | 09:54 |
openstack | Launchpad bug 1924258 in tempest "test_live_migration_with_trunk fails intermittently" [Undecided,New] | 09:54 |
gibi | I notified the author of the test case and he promised to check it | 09:55 |
*** dpawlik4 has joined #openstack-nova | 09:59 | |
*** ratailor__ has joined #openstack-nova | 10:08 | |
*** ratailor_ has quit IRC | 10:11 | |
*** ociuhandu_ has joined #openstack-nova | 10:30 | |
*** ociuhandu has quit IRC | 10:32 | |
*** ociuhandu_ has quit IRC | 10:34 | |
openstackgerrit | Balazs Gibizer proposed openstack/placement master: Add support for RP re-parenting and orphaning https://review.opendev.org/c/openstack/placement/+/784020 | 10:56 |
*** k_mouza has quit IRC | 10:58 | |
*** psachin has quit IRC | 11:00 | |
*** mkrai has quit IRC | 11:01 | |
noonedeadpunk | Hi! Can you kindly help me a bit. I'm just trying to understand why we might be doing certain thing and it feels for me that it's not needed nowadays. But decided to ask before changing behaviour | 11:14 |
*** k_mouza has joined #openstack-nova | 11:15 | |
noonedeadpunk | So we're running `nova-manage cell_v2 map_instances` after creating a cell. I think that might be the valid case in old releases, when cellsv2 were just introduced? | 11:15 |
*** k_mouza has quit IRC | 11:15 | |
noonedeadpunk | or, when you create new cell and aim to move instances there? | 11:16 |
*** k_mouza has joined #openstack-nova | 11:16 | |
*** swp20 has quit IRC | 11:24 | |
*** psachin has joined #openstack-nova | 11:29 | |
*** mkrai has joined #openstack-nova | 11:30 | |
sean-k-mooney | you should not need to run map_instances on every upgrade | 11:30 |
sean-k-mooney | just if you are going from non-cellsv2 to cellsv2, i believe | 11:31 |
noonedeadpunk | yeah, that's what I thought... And also it runs in batches of 50 instances anyway | 11:31 |
noonedeadpunk | (so would need to run in a while cycle or smth like that) | 11:31 |
noonedeadpunk | sean-k-mooney: thanks for confirming my concerns | 11:31 |
sean-k-mooney | the discover_hosts command also only needs to be run if you add/remove hosts | 11:33 |
sean-k-mooney | or i guess move them | 11:33 |
noonedeadpunk | I think discover_hosts shouldn't actually hurt? As I'm not sure about how to distinguish if we add host atm or not... | 11:34 |
sean-k-mooney | although you normally don't move hosts between cells. you can, but it's not common and it's probably advisable to not have VMs on the host if you do | 11:34 |
sean-k-mooney | noonedeadpunk: ya its pretty cheap | 11:34 |
sean-k-mooney | noonedeadpunk: ooo i think just always runs it | 11:34 |
sean-k-mooney | noonedeadpunk: we even have the option to do it as a periodic task if you really want to | 11:35 |
noonedeadpunk | Yeah, I know. It's just smth like 20mins or so iirc | 11:35 |
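
Editor's note: the "run it in a while cycle" idea mentioned above could look roughly like the sketch below. It assumes, as I understand the nova-manage documentation, that `nova-manage cell_v2 map_instances` exits with 0 once every instance is mapped and with 1 while instances remain. The helper name `map_all_instances` and the injectable `run` callable are hypothetical, added only so the loop can be exercised without a real deployment.

```python
import subprocess
from typing import Callable, List, Optional


def map_all_instances(cell_uuid: str,
                      max_count: int = 50,
                      run: Optional[Callable[[List[str]], int]] = None,
                      max_rounds: int = 10000) -> int:
    """Re-run map_instances until it reports nothing is left.

    Returns the number of invocations that were needed.
    """
    if run is None:
        # Default runner: execute the real command and return its exit code.
        run = lambda cmd: subprocess.run(cmd).returncode
    cmd = ["nova-manage", "cell_v2", "map_instances",
           "--cell_uuid", cell_uuid, "--max-count", str(max_count)]
    for rounds in range(1, max_rounds + 1):
        rc = run(cmd)
        if rc == 0:      # assumed meaning: all instances are mapped
            return rounds
        if rc != 1:      # assumed meaning of 1: more instances remain
            raise RuntimeError(f"map_instances failed with exit code {rc}")
    raise RuntimeError("gave up: instances still unmapped after max_rounds")
```

As the discussion notes, this should only be needed when moving from a non-cellsv2 to a cellsv2 deployment, not on every upgrade.
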
*** sapd1 has quit IRC | 11:41 | |
*** k_mouza has quit IRC | 11:43 | |
*** k_mouza has joined #openstack-nova | 11:44 | |
*** CeeMac has joined #openstack-nova | 12:05 | |
lyarwood | gibi: re https://review.opendev.org/c/openstack/nova/+/770246 I forgot to note that I wanted to land https://review.opendev.org/c/openstack/nova/+/785682 first if possible, I think it's valid to also cover that corner case in the event based flow as well. | 12:07 |
*** mkrai has quit IRC | 12:12 | |
gibi | lyarwood: looking | 12:24 |
gibi | lyarwood: I will do the rebase and the adaptation to your fix either today or tomorrow | 12:26 |
lyarwood | gibi: ack thanks, I'll go over the rest of the series later in more detail but clicking through it LGTM at the moment | 12:27 |
gibi | cool, I will +A your fix soon | 12:27 |
*** ociuhandu has joined #openstack-nova | 12:47 | |
*** ratailor__ has quit IRC | 12:47 | |
mnaser | bauzas: https://bugs.launchpad.net/nova/+bug/1900800 have you thought about how this can be resolved? We are running into this often :( | 13:12 |
openstack | Launchpad bug 1900800 in OpenStack Compute (nova) "VGPUs is not recreated on host reboot" [Low,Confirmed] - Assigned to Sylvain Bauza (sylvain-bauza) | 13:12 |
*** hoonetorg has quit IRC | 13:20 | |
*** ociuhandu has quit IRC | 13:24 | |
*** ociuhandu has joined #openstack-nova | 13:24 | |
*** hoonetorg has joined #openstack-nova | 13:34 | |
*** sapd1 has joined #openstack-nova | 13:37 | |
bauzas | mnaser: sorry I was on a meeting | 13:41 |
bauzas | mnaser: well, maybe we would need to ask the operator to create the mdevs after rebooting | 13:41 |
*** rmart04 has joined #openstack-nova | 13:44 | |
*** ociuhandu has quit IRC | 13:47 | |
*** ociuhandu has joined #openstack-nova | 13:47 | |
openstackgerrit | Qiu Fossen proposed openstack/nova-specs master: Allow migrating PMEM's data https://review.opendev.org/c/openstack/nova-specs/+/785563 | 13:49 |
sean-k-mooney | bauzas: the mdevs should be created in init_host today | 13:51 |
sean-k-mooney | oh | 13:51 |
bauzas | sean-k-mooney: the problem is that we don't know which mdev type they use | 13:52 |
sean-k-mooney | well we should no? | 13:52 |
sean-k-mooney | the mdevs should still be in the domains | 13:53 |
sean-k-mooney | this should be running before we start the vms | 13:53 |
sean-k-mooney | oh the type is not recorded | 13:53 |
sean-k-mooney | so if you have multiple devices that would be an issue | 13:53 |
sean-k-mooney | although we would be able to look at the vgpu request in the flavor if you had a traits request | 13:54 |
sean-k-mooney | or better yet the allocation summaries | 13:54 |
sean-k-mooney | we can look up the RP from which the allocation came and then identify the parent device and use that to look up the mdev type in the config | 13:55 |
sean-k-mooney | bauzas: that should work right ^ | 13:55 |
bauzas | sean-k-mooney: the problem is that the traits are optional | 13:55 |
sean-k-mooney | we dont need the traits | 13:55 |
sean-k-mooney | use the allocation to figure out the parent device, then use the parent device to look up the mdev type in nova.conf | 13:56 |
sean-k-mooney | then recreate it | 13:56 |
bauzas | sean-k-mooney: https://github.com/openstack/nova/blob/450213f/nova/virt/libvirt/driver.py#L816 | 13:56 |
bauzas | here, we would then need to call placement | 13:57 |
sean-k-mooney | yes | 13:57 |
sean-k-mooney | unless we have the allocation summaries saved somewhere | 13:57 |
sean-k-mooney | i.e. in the nova db but we dont as far as i know | 13:58 |
bauzas | ok, but then we would see VGPU allocations | 13:58 |
bauzas | for RP | 13:58 |
bauzas | for a RP | 13:58 |
bauzas | which is a pGPU | 13:58 |
bauzas | so then we would need to look at the conf option to know which type it uses | 13:58 |
sean-k-mooney | yep | 13:58 |
bauzas | that *could* work | 13:59 |
bauzas | but that's a long change I think | 13:59 |
sean-k-mooney | you mean complex to write/test | 13:59 |
sean-k-mooney | i think it's what is required though, unless we start storing the info in the nova db, in the resources table for example | 14:00 |
sean-k-mooney | those are our two options: calculate it from placement, or record mdevs in the db like pmem or pci devices | 14:00 |
sean-k-mooney | so that we can just look it up | 14:01 |
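
Editor's note: the placement-based lookup described above (allocation, then resource provider, then parent device, then mdev type from the config) might be sketched as follows. This is not nova's code. It assumes the allocations dict has the shape of the `allocations` member of a placement `GET /allocations/{consumer_uuid}` response, that the pGPU resource provider name ends with a libvirt-style device name such as `pci_0000_84_00_0`, and that the operator configuration maps each enabled type to PCI addresses. All function and parameter names are hypothetical.

```python
from typing import Dict, List, Tuple


def pci_address_to_nodedev(address: str) -> str:
    """Turn '0000:84:00.0' into the libvirt-style name 'pci_0000_84_00_0'."""
    return "pci_" + address.replace(":", "_").replace(".", "_")


def vgpu_types_for_consumer(
        allocations: Dict[str, dict],
        rp_names: Dict[str, str],
        type_to_addresses: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Return (rp_uuid, parent_device, mdev_type) for each VGPU allocation."""
    # Invert the (assumed) config mapping: parent device name -> mdev type.
    device_to_type = {
        pci_address_to_nodedev(addr): mdev_type
        for mdev_type, addrs in type_to_addresses.items()
        for addr in addrs
    }
    found = []
    for rp_uuid, alloc in allocations.items():
        if alloc.get("resources", {}).get("VGPU", 0) < 1:
            continue  # this provider holds no VGPU for the instance
        name = rp_names[rp_uuid]
        for device, mdev_type in device_to_type.items():
            if name.endswith(device):
                found.append((rp_uuid, device, mdev_type))
                break
        else:
            raise LookupError(f"no configured mdev type for provider {name}")
    return found
```

With a single mdev type per host the mapping collapses to one entry, which is why that case is the easy one.
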
bauzas | I can try to help | 14:03 |
*** ociuhandu has quit IRC | 14:03 | |
*** ociuhandu has joined #openstack-nova | 14:04 | |
sean-k-mooney | mnaser: it would be a bit of a hack but you could probably fix this temporarily with a bash script executed by a systemd service file | 14:05 |
sean-k-mooney | basically implementing the same logic | 14:05 |
*** k_mouza has quit IRC | 14:05 | |
*** k_mouza_ has joined #openstack-nova | 14:06 | |
sean-k-mooney | loop over the domains and for each with an mdev look up the placement allocation and get the rp with the vgpu resources | 14:06 |
sean-k-mooney | then get the mdev type and create it with the same mdev uuid as the xml currently has | 14:07 |
sean-k-mooney | you could use systemd's "before" and "after" requirements to ensure it runs before nova-compute and after libvirt starts | 14:08 |
sean-k-mooney | really nova should do that, but that's the bug you're hitting i guess. | 14:08 |
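
Editor's note: the first step of that workaround, finding which mdev UUIDs a domain expects, can be sketched by parsing the domain XML. The sketch assumes the usual libvirt layout in which a mediated device appears as a `<hostdev type='mdev'>` element with a `<source><address uuid='...'/></source>` child. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def mdev_uuids_from_domain_xml(xml_text: str) -> List[str]:
    """Return the UUIDs of all mdev hostdevs referenced by a domain XML."""
    root = ET.fromstring(xml_text)  # root element is <domain>
    uuids = []
    for hostdev in root.findall("./devices/hostdev[@type='mdev']"):
        address = hostdev.find("./source/address")
        if address is not None and address.get("uuid"):
            uuids.append(address.get("uuid"))
    return uuids
```

The XML could come from `virsh dumpxml <domain>` for each domain on the host. Domains without mdevs simply yield an empty list.
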
*** ociuhandu has quit IRC | 14:09 | |
*** ociuhandu has joined #openstack-nova | 14:11 | |
mnaser | sean-k-mooney / bauzas: i guess if i'm understanding correctly, the instance <=> mdev mapping is not stored inside nova anywhere so we rely on the state in the libvirt domain | 14:16 |
bauzas | that's right | 14:17 |
bauzas | there is an existing tool tho | 14:17 |
bauzas | mnaser: https://github.com/mdevctl/mdevctl | 14:17 |
*** ociuhandu has quit IRC | 14:17 | |
bauzas | mnaser: you could use it for precreating the mdevs and persist them (using systemctl) | 14:17 |
mnaser | "When a known parent device add udev event occurs (or, for more recent kernels, change events with MDEV_STATE values), mdevctl is called by a udev rule to create defined devices with "start": "auto" configured." interesting | 14:18 |
bauzas | (whoops, systemd) | 14:18 |
sean-k-mooney | you could but long term we dont want people to do that | 14:18 |
sean-k-mooney | bauzas: if we were to go down that route we should remove the code for nova to do it | 14:18 |
bauzas | sean-k-mooney: we said this before | 14:18 |
sean-k-mooney | bauzas: and track the mdevs in the pci_devices table or similar | 14:18 |
bauzas | sean-k-mooney: honestly, mdevs are like VFs | 14:19 |
sean-k-mooney | yep | 14:19 |
bauzas | and we don't persist the latter | 14:19 |
sean-k-mooney | we do | 14:19 |
bauzas | in nova ? | 14:19 |
sean-k-mooney | in the pci_devices table in nova | 14:19 |
bauzas | but you need to precreate them, right? | 14:19 |
sean-k-mooney | thats what the pci_tracker does | 14:19 |
sean-k-mooney | bauzas: oh yes you do | 14:19 |
bauzas | the pci trackers tracks the VFs | 14:19 |
bauzas | but it doesn't create them, right? | 14:20 |
sean-k-mooney | yep so for vf the operator has to precreate them | 14:20 |
bauzas | that's my point | 14:20 |
mnaser | would it make sense to have something like if len(mdev) == 0: <check with placement if vm has vgpu>; if <system-has-vgpu>: find_an_unused_mdev_or_create_a_new_one(); | 14:20 |
sean-k-mooney | we chose not to do that for mdevs for some reason | 14:20 |
bauzas | mnaser: what I *could* do is to work on what sean-k-mooney and I said | 14:20 |
sean-k-mooney | but since we chose to create them it means we should always do it | 14:20 |
bauzas | mnaser: ie. looking up the placement DB | 14:20 |
bauzas | and magically recreating them | 14:21 |
bauzas | mnaser: that's why I left the bug open | 14:21 |
mnaser | and that would pretty much get rid of the statefulness of libvirt domain xml again | 14:21 |
bauzas | mnaser: but the fact is, maybe eventually we would remove this whole recreate method | 14:21 |
sean-k-mooney | bauzas: i would be ok using mdevctl if we moved mdev to the pci tracker or resources table | 14:22 |
bauzas | mnaser: if the libvirt domain information would persist the mdev type, that'd be awesome | 14:22 |
sean-k-mooney | and then just getting rid of this code and not needing the domain | 14:22 |
bauzas | mnaser: but it doesn't | 14:22 |
bauzas | sean-k-mooney: IIRC, aw (the mdevctl developer) was against using it for upper tooling | 14:23 |
sean-k-mooney | bauzas: if we really needed to we could store it in the metadata section of the xml | 14:23 |
bauzas | sean-k-mooney: that's actually a great point | 14:23 |
mnaser | is there anything else we store in the xml as state? | 14:23 |
sean-k-mooney | mnaser: no | 14:23 |
sean-k-mooney | long term we want to get rid of persistent domains | 14:24 |
mnaser | only thing with this is if something goes wrong with libvirt or anything, you would lose all your gpus | 14:24 |
sean-k-mooney | e.g. the domain xml on disk | 14:24 |
bauzas | sean-k-mooney: the only problem with metadata is that we won't recreate it on move operations | 14:24 |
mnaser | thats how this bit us, nova wouldn't start, so we tried to undefine the domain to let nova recreate it, and here we are with no vgpus | 14:24 |
bauzas | mnaser: nova just binds mdevs | 14:24 |
mnaser | but if libvirt domain is gone, it doesnt know which mdevs were assigned to that vm, even on a hard reboot | 14:25 |
sean-k-mooney | so fundamentally i think we need to revisit using the xml for state storage | 14:25 |
bauzas | mnaser: sure, but why would you undefine the domain ? | 14:25 |
sean-k-mooney | and just store the info in the nova db eventually | 14:25 |
bauzas | sean-k-mooney: eeeek | 14:25 |
mnaser | bauzas: we had other issues why the domain would not start, because a call to libvirt was failing because the mdev was missing | 14:26 |
sean-k-mooney | bauzas: i didnt like using the xml for this in the first place | 14:26 |
mnaser | so mdev was missing so nova couldnt start | 14:26 |
sean-k-mooney | bauzas: this problem is just another reason to not do it this way | 14:26 |
bauzas | mnaser: again, I can try to fix the logic by looking up placement | 14:26 |
bauzas | sean-k-mooney: you know what ? I'll start filling a spec for drafting mdev management in nova | 14:27 |
bauzas | and exposing them as raw resources | 14:27 |
mnaser | so pretty much regenerate state from placement | 14:27 |
sean-k-mooney | bauzas: i think if we want to do the stateless mdev work it would make sense to do that anyway | 14:27 |
bauzas | sean-k-mooney: we could discuss the opportunity of persisting them in the spec | 14:27 |
sean-k-mooney | we could keep them separate or combine them | 14:27 |
mnaser | i feel like that would be inline with VFs since you have to create them beforehand | 14:27 |
sean-k-mooney | ya i think there are two thing we should do | 14:28 |
bauzas | mnaser: yeah and honestly we regressed on this, so I feel responsible for closing the bug | 14:28 |
sean-k-mooney | 1 try and come up with a backportable thing to address the bug | 14:28 |
sean-k-mooney | and 2 figure out how to do it better longterm in the spec | 14:29 |
sean-k-mooney | while also discussing generic stateless mdevs for non gpu usecases | 14:29 |
*** links has joined #openstack-nova | 14:29 | |
sean-k-mooney | the placement way can work for the backportable solution | 14:29 |
*** ociuhandu has joined #openstack-nova | 14:29 | |
sean-k-mooney | im not sure we want to do that long term since i dont know what the performance of that will be like | 14:30 |
sean-k-mooney | i assume worse then a straight db lookup | 14:30 |
mnaser | yeah but only hitting on a hard_reboot() that involves regenerating xml | 14:30 |
sean-k-mooney | mnaser: actully only on init host | 14:31 |
sean-k-mooney | we dont need to hit placement on hard reboot necessarily | 14:32 |
mnaser | ah right yes, unless someone is undefining domains while nova is running | 14:32 |
mnaser | in that case, that's on them =P | 14:32 |
bauzas | mnaser: that call to placement is made at service restart | 14:33 |
bauzas | not during hard reboots | 14:33 |
bauzas | for reboots, we just recreate the XML as we *already* have the allocations | 14:33 |
mnaser | bauzas: i do have another fun thing to add to it though | 14:34 |
bauzas | shoot (/me hides) | 14:34 |
mnaser | http://paste.openstack.org/show/804516/ | 14:34 |
mnaser | that was actually why we had to undefine the domain to let nova start | 14:34 |
sean-k-mooney | well | 14:35 |
mnaser | so mdev is gone but on init_host we try to look it up | 14:35 |
sean-k-mooney | that is just because of the way we try to recreate the mdev today | 14:35 |
sean-k-mooney | mdev is still listed in the domain xml | 14:35 |
sean-k-mooney | but nothing has created it in sysfs yet | 14:35 |
sean-k-mooney | which is why nodeDeviceLookupByName fails | 14:36 |
sean-k-mooney | mnaser: the current code only accounts for nova-compute restarts | 14:36 |
*** ociuhandu has quit IRC | 14:36 | |
sean-k-mooney | it does not properly handle host reboots | 14:36 |
mnaser | yeah but unfortunately it fully blocks nova from going back up, so that was a little part of how we ended up with no libvirt domain xml | 14:36 |
sean-k-mooney | yep understandable | 14:37 |
sean-k-mooney | mnaser: do you have multiple mdev types per host | 14:37 |
sean-k-mooney | or just one | 14:37 |
mnaser | in my case no, just one | 14:37 |
mnaser | by 'handle host reboots' == 'handle mdev devices disappearing' | 14:37 |
sean-k-mooney | ok then the workaround for you is simple | 14:37 |
bauzas | mnaser: the paste you showed is just the bug you hit | 14:37 |
mnaser | create mdevs manually with the uuids from libvirt, i guess | 14:38 |
sean-k-mooney | mnaser: do you have multiple pGPUs per host | 14:38 |
mnaser | nope, single pGPU, single vGPU type | 14:38 |
sean-k-mooney | mnaser: yep that is the workaround | 14:38 |
bauzas | I stupidly wrote something like "if the mdev doesn't exist; get its name from the xml and then lookup the non-existing mdev" | 14:38 |
sean-k-mooney | perfect | 14:38 |
sean-k-mooney | so for your case then loop over the domains and grep all the mdevs | 14:38 |
bauzas | right | 14:38 |
sean-k-mooney | then just create them with the hardcoded mdev type and parent | 14:39 |
bauzas | if you only use one type, that's trivial | 14:39 |
sean-k-mooney | by echoing into /sys | 14:39 |
bauzas | just make sure to recreate the mdev with the right uuid | 14:39 |
bauzas | echo "myuuid" > /sys/bus/mdev_bus/<mdev_type>/create | 14:39 |
sean-k-mooney | yep that | 14:39 |
bauzas | (can't remember the exact sysfs path) | 14:39 |
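
Editor's note: a hedged sketch of the "echo the uuid into sysfs" step for the single-pGPU, single-type case follows. The path layout used here, `<sysfs_root>/<parent>/mdev_supported_types/<mdev_type>/create` with `/sys/class/mdev_bus` as the root, is my reading of the kernel's mediated-device documentation rather than something stated in this log, so verify it on the host before relying on it. `create_mdev` and its parameters are hypothetical names.

```python
import uuid
from pathlib import Path


def create_mdev(mdev_uuid: str, parent: str, mdev_type: str,
                sysfs_root: str = "/sys/class/mdev_bus") -> bool:
    """Recreate one mdev with a fixed UUID; return False if it already exists."""
    mdev_uuid = str(uuid.UUID(mdev_uuid))  # raises ValueError on a malformed UUID
    parent_dir = Path(sysfs_root) / parent
    # Assumption: an existing mdev shows up as a UUID-named entry under its parent.
    if (parent_dir / mdev_uuid).exists():
        return False
    create_node = parent_dir / "mdev_supported_types" / mdev_type / "create"
    if not create_node.exists():
        raise FileNotFoundError(f"{create_node} not found; check parent and type")
    create_node.write_text(mdev_uuid)  # same effect as: echo <uuid> > .../create
    return True
```

The UUIDs passed in must be the ones already recorded in the libvirt domain XML, so that the recreated devices match what the guests expect.
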
mnaser | got it, so that's our workaround until something placement-y lands | 14:39 |
sean-k-mooney | yep | 14:40 |
bauzas | yup | 14:40 |
bauzas | YUUUUUUP even | 14:40 |
sean-k-mooney | :) | 14:40 |
* sean-k-mooney thinks of the land before time | 14:40 | |
bauzas | mnaser: sorry for the nasty bug, I should have written some docs describing the workaround | 14:41 |
sean-k-mooney | https://www.youtube.com/watch?v=cAEVzJnHv2c | 14:41 |
bauzas | I'll at least amend the bug report | 14:41 |
mnaser | bauzas: hey no worries at all :D | 14:42 |
mnaser | bauzas: i amended an eavesdrop link pointing to our conversation | 14:42 |
mnaser | bauzas: i can throw something in an etherpad about a potential workaround with using single pgpu with single vgpu type | 14:43 |
mnaser | i need to write up something for my team anyways :) | 14:43 |
*** dklyle has joined #openstack-nova | 14:47 | |
mnaser | working here https://etherpad.opendev.org/p/nova-vgpu-sys-reboot and ill post it as a comment after i run it by y'all :) | 14:47 |
bauzas | mnaser: https://bugs.launchpad.net/nova/+bug/1900800/comments/4 | 14:48 |
openstack | Launchpad bug 1900800 in OpenStack Compute (nova) "VGPUs is not recreated on host reboot" [Low,Confirmed] - Assigned to Sylvain Bauza (sylvain-bauza) | 14:48 |
mnaser | oh heck, even better, ill reference that :) | 14:48 |
bauzas | you can even ask libvirt to give you all the domains that have mdevs | 14:49 |
bauzas | no need to lookup all your instances, just the ones that have mdevs | 14:49 |
*** ociuhandu has joined #openstack-nova | 14:50 | |
sean-k-mooney | bauzas: can you? i didnt know that | 14:52 |
bauzas | oh, nevermind, call me stupid | 14:52 |
sean-k-mooney | bauzas: can you do that with virsh? | 14:52 |
bauzas | that's the PCI devices you can get | 14:52 |
bauzas | the ones supporting mdev caps | 14:52 |
bauzas | in nova, I'm just blindly iterating over all instances | 14:53 |
bauzas | ... :( | 14:53 |
sean-k-mooney | ya i think that is what you have to do | 14:53 |
sean-k-mooney | im not aware of an api that allows you to filter domains by the content of their xml | 14:53 |
*** ociuhandu has quit IRC | 14:57 | |
*** belmoreira has joined #openstack-nova | 14:57 | |
bauzas | sean-k-mooney: I guess looking up the instances by their flavors would be then better for mnaser ;) | 14:59 |
bauzas | using the nova api | 15:00 |
bauzas | and then getting the instance name | 15:00 |
sean-k-mooney | not really | 15:01 |
sean-k-mooney | that would be much more expensive | 15:01 |
sean-k-mooney | mnaser should know which hosts have gpus so its really not that hard to loop over the domains on those hosts in a script | 15:02 |
sean-k-mooney | doing that will be much faster than hitting keystone to get a token, then listing the instances on the host and filtering by flavor, then looking each one up in libvirt to get the mdev | 15:03 |
mnaser | yeah i think we'll have a bash script because its single pgpu per system so if the libvirt domains werent destroyed yet | 15:15 |
mnaser | it should be easy to rebuild | 15:15 |
sean-k-mooney | mnaser: a hard reboot might fix your instances that dont have gpus | 15:16 |
sean-k-mooney | or a cold migrate if that is not enough | 15:16 |
mnaser | sean-k-mooney: yeah those are fine but init_host() failing means we have to fix them all first | 15:16 |
mnaser | cause nova wont go up | 15:16 |
sean-k-mooney | yep i meant for the ones you undefined | 15:16 |
openstackgerrit | Rodrigo Barbieri proposed openstack/nova master: Error anti-affinity violation on migrations https://review.opendev.org/c/openstack/nova/+/784166 | 15:21 |
*** mlavalle has joined #openstack-nova | 15:36 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Move instance power state check to _detach_with_retry https://review.opendev.org/c/openstack/nova/+/778918 | 15:42 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Consolidate device detach error handling https://review.opendev.org/c/openstack/nova/+/778978 | 15:44 |
*** sapd1 has quit IRC | 15:45 | |
gibi | nova meeting starts in 10 minutes in #openstack-meeting-3 | 15:49 |
*** gyee has joined #openstack-nova | 15:57 | |
*** hamalq has joined #openstack-nova | 16:00 | |
*** ociuhandu has joined #openstack-nova | 16:02 | |
*** hamalq has quit IRC | 16:02 | |
*** lucasagomes has quit IRC | 16:02 | |
*** hamalq has joined #openstack-nova | 16:02 | |
*** rpittau is now known as rpittau|afk | 16:09 | |
*** ociuhandu has quit IRC | 16:17 | |
*** tesseract has quit IRC | 16:18 | |
*** mlavalle has quit IRC | 16:23 | |
*** _mlavalle_1 has joined #openstack-nova | 16:23 | |
*** ociuhandu has joined #openstack-nova | 16:32 | |
*** rmart04 has quit IRC | 16:34 | |
*** ociuhandu has quit IRC | 16:36 | |
*** dtantsur is now known as dtantsur|afk | 16:36 | |
*** links has quit IRC | 16:40 | |
*** _mlavalle_1 has quit IRC | 16:42 | |
*** k_mouza_ has quit IRC | 16:43 | |
*** k_mouza has joined #openstack-nova | 16:44 | |
*** k_mouza has quit IRC | 16:51 | |
*** derekh has quit IRC | 17:03 | |
*** mlavalle has joined #openstack-nova | 17:05 | |
*** ralonsoh has quit IRC | 17:12 | |
*** ociuhandu has joined #openstack-nova | 17:13 | |
*** ociuhandu has quit IRC | 17:18 | |
*** hemna has quit IRC | 17:19 | |
*** hemna has joined #openstack-nova | 17:25 | |
*** zul has joined #openstack-nova | 17:26 | |
*** tbachman has joined #openstack-nova | 17:29 | |
*** belmoreira has quit IRC | 17:42 | |
*** andrewbonney has quit IRC | 17:50 | |
*** bbowen_ has left #openstack-nova | 18:05 | |
openstackgerrit | Merged openstack/nova master: libvirt: Ignore device already in the process of unplug errors https://review.opendev.org/c/openstack/nova/+/785682 | 18:27 |
*** vishalmanchanda has quit IRC | 18:27 | |
*** ociuhandu has joined #openstack-nova | 18:31 | |
*** ociuhandu has quit IRC | 18:31 | |
*** ociuhandu has joined #openstack-nova | 18:31 | |
*** belmoreira has joined #openstack-nova | 18:34 | |
*** ociuhandu has quit IRC | 18:37 | |
*** ociuhandu has joined #openstack-nova | 18:51 | |
openstackgerrit | Lee Yarwood proposed openstack/nova stable/wallaby: libvirt: Ignore device already in the process of unplug errors https://review.opendev.org/c/openstack/nova/+/786483 | 18:56 |
*** ociuhandu has quit IRC | 18:58 | |
*** ociuhandu has joined #openstack-nova | 19:17 | |
*** hamalq has quit IRC | 19:17 | |
belmoreira | bauzas thanks for the ping. | 19:23 |
belmoreira | lassimus definitely I'm interested in what you are proposing (emulate other architectures). I think it's an interesting topic to be discussed at the PTG. | 19:23 |
belmoreira | Related bugs: https://bugs.launchpad.net/nova/+bug/1902203 https://bugs.launchpad.net/nova/+bug/1902216 | 19:24 |
openstack | Launchpad bug 1902203 in OpenStack Compute (nova) "Instance architecture should be reflected in the instance domain" [Wishlist,Confirmed] | 19:24 |
openstack | Launchpad bug 1902216 in OpenStack Compute (nova) "Can't define a cpu_model from a different architecture" [Wishlist,Confirmed] - Assigned to Belmiro Moreira (moreira-belmiro-email-lists) | 19:24 |
*** ociuhandu has quit IRC | 19:27 | |
*** hamalq has joined #openstack-nova | 19:31 | |
*** dave-mccowan has quit IRC | 19:43 | |
*** dave-mccowan has joined #openstack-nova | 19:46 | |
*** whoami-rajat has quit IRC | 19:47 | |
*** macz_ has joined #openstack-nova | 19:55 | |
openstackgerrit | Merged openstack/nova master: Placeholders for DB migration backports to Wallaby https://review.opendev.org/c/openstack/nova/+/778923 | 20:22 |
*** k_mouza has joined #openstack-nova | 20:45 | |
*** k_mouza has quit IRC | 20:49 | |
*** ociuhandu has joined #openstack-nova | 20:57 | |
*** ociuhandu has quit IRC | 21:13 | |
*** ociuhandu has joined #openstack-nova | 21:13 | |
*** ociuhandu has quit IRC | 21:18 | |
*** belmoreira has quit IRC | 21:23 | |
*** ociuhandu has joined #openstack-nova | 21:44 | |
*** ociuhandu has quit IRC | 21:53 | |
*** tosky has quit IRC | 22:34 | |
*** rcernin has joined #openstack-nova | 22:51 | |
*** macz_ has quit IRC | 23:16 | |
*** macz_ has joined #openstack-nova | 23:20 | |
*** macz_ has quit IRC | 23:25 | |
*** luksky has quit IRC | 23:58 |