opendevreview | yangzhipeng proposed openstack/nova master: add delete Signed-off-by: yangzhipeng <yangzhipeng@cmss.chinamobile.com> https://review.opendev.org/c/openstack/nova/+/865362 | 02:31 |
---|---|---|
opendevreview | yangzhipeng proposed openstack/nova master: Remove all tag if instance has beed hard deleted. Signed-off-by: yangzhipeng <yangzhipeng@cmss.chinamobile.com> https://review.opendev.org/c/openstack/nova/+/865362 | 02:33 |
opendevreview | yangzhipeng proposed openstack/nova master: Remove all tag if instance has beed hard deleted. https://review.opendev.org/c/openstack/nova/+/865362 | 02:42 |
*** ministry is now known as __ministry | 03:31 | |
gmann | dansmith: gibi: bauzas: need one more review on the placement RBAC spec, please check https://review.opendev.org/c/openstack/placement/+/864385 | 03:31 |
*** akekane is now known as abhishekk | 05:16 | |
gokhani | Good Morning Folks, when I try to reboot my instance, it destroys my instance and I am getting the error "glanceclient.exc.HTTPNotFound: HTTP 404 Not Found: No image found with ID xxxx..". After the reboot action I didn't find my instance with "virsh list --all". I don't understand why nova tries to find the glance image and why nova destroyed my instance. Logs are in https://paste.openstack.org/show/by8oCDVLo29Wt612Z1tT/. What causes this situation? | 06:59 |
sean-k-mooney | gokhani: when you reboot an instance we undefine the domain and recreate it, however we will not look up the glance image as part of a normal hard or soft reboot | 09:02 |
sean-k-mooney | so that implies that the instance was in an inconsistent state before the reboot | 09:03 |
sean-k-mooney | File "/openstack/venvs/nova-22.1.0/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 9930, in _create_images_and_backing | 09:04 |
sean-k-mooney | so this should only be invoked if the vm is started due to the resume_state_on_host_boot config option after a host reboot | 09:07 |
sean-k-mooney | https://github.com/openstack/nova/blob/stable/victoria/nova/virt/libvirt/driver.py#L3368-L3377 | 09:07 |
sean-k-mooney | gokhani: based on the traceback the instance disk does not exist anymore | 09:08 |
sean-k-mooney | since it took this branch https://github.com/openstack/nova/blob/3224ceb3fffc57d2375e5163d8ffbbb77529bc38/nova/virt/libvirt/driver.py#L9949-L9951 | 09:08 |
sean-k-mooney | actually no, sorry, https://github.com/openstack/nova/blob/3224ceb3fffc57d2375e5163d8ffbbb77529bc38/nova/virt/libvirt/driver.py#L9980-L9986 is the path it's taking | 09:10 |
sean-k-mooney | so info['backing_file'] is equivalent to true | 09:11 |
sean-k-mooney | and to take the else branch that means that this is not the swap or ephemeral disk | 09:11 |
sean-k-mooney | anyway as the comment suggests https://github.com/openstack/nova/blob/stable/victoria/nova/virt/libvirt/driver.py#L3368-L3377 | 09:16 |
sean-k-mooney | that code path is there to ensure that any backing files that are shared between the vms are still present on the host | 09:17 |
sean-k-mooney | so this implies that the backing file is not present on the host and it has been deleted from glance | 09:17 |
sean-k-mooney | in a normal hard reboot triggered by a human that code path should not be taken and we will just redefine the domain | 09:18 |
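[editor's note: a minimal sketch of the branch logic discussed above. The function and dict keys are hypothetical illustrations, not nova's actual code; the real logic lives in `_create_images_and_backing` in nova/virt/libvirt/driver.py.]

```python
# Sketch: a disk is only considered for a backing-file re-fetch when it
# reports a backing file and is not the swap or ephemeral disk.
def needs_backing_fetch(disk_info):
    """disk_info: dict in the rough shape of nova's disk info, e.g.
    {'backing_file': '/var/lib/nova/instances/_base/9d36e...',
     'disk_name': 'disk'}
    """
    if not disk_info.get('backing_file'):
        # flat image: nothing shared with other vms to restore
        return False
    name = disk_info.get('disk_name', '')
    if name.startswith('disk.swap') or name.startswith('disk.eph'):
        # swap/ephemeral disks are recreated locally, never fetched
        return False
    # else branch: a shared backing file must be present on the host
    return True
```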
kashyap | sean-k-mooney: Morning. Do we document any of this anywhere, at a high-level? | 09:36 |
sean-k-mooney | that hard reboot does not redownload the image | 09:40 |
sean-k-mooney | i don't think so but that's more of an implementation detail | 09:40 |
sean-k-mooney | it should not be required in a normal workflow since we are simulating rebooting a physical server | 09:41 |
sean-k-mooney | and you would not reinstall the os on a physical server | 09:41 |
sean-k-mooney | the fact we have code to do it after a host reboot is a little odd | 09:41 |
sean-k-mooney | but i would guess it is there because of a bug | 09:41 |
kashyap | Yeah, that's fair. (I'm not saying impl detail should be documented. Just maybe somewhere in a debugging guide. I admit I can't find an appropriate place for it, though) | 09:42 |
sean-k-mooney | i.e. to work around a bug where perhaps at some point in the past the image cache got lost or something | 09:42 |
kashyap | "Image cache" ... /me runs for the hills | 09:42 |
kashyap | (Except no hills here in this part of the low lands :P) | 09:42 |
sean-k-mooney | ya i'll admit i'm kind of surprised we have that code path to try and heal broken/missing backing files | 09:43 |
kashyap | TIL, too | 09:43 |
sean-k-mooney | the fallback to using it from the image cache if it's not in glance is interesting... | 09:44 |
sean-k-mooney | it makes sense i guess | 09:44 |
sean-k-mooney | but this is an edge case of an edge case | 09:44 |
kashyap | Yeah, it definitely does to me. | 09:44 |
sean-k-mooney | i.e. after a host reboot the backing file needs to have been deleted and the image needs to have been deleted from glance but still exist in the host image cache for that to work | 09:45 |
sean-k-mooney | im sure that's the exact case someone hit and they added this to fix it | 09:46 |
* kashyap nods | 09:48 | |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Refactor volume connection cleanup out of _post_live_migration https://review.opendev.org/c/openstack/nova/+/864670 | 09:57 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Move pre-3.44 Cinder post live migration test to test_compute_mgr https://review.opendev.org/c/openstack/nova/+/864671 | 09:57 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail https://review.opendev.org/c/openstack/nova/+/863806 | 09:57 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration https://review.opendev.org/c/openstack/nova/+/864055 | 09:57 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: func: Add _live_migrate helper to InstanceHelperMixin https://review.opendev.org/c/openstack/nova/+/865381 | 09:57 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: func: Introduce a server_expected_state kwarg to InstanceHelperMixin._live_migrate https://review.opendev.org/c/openstack/nova/+/865382 | 09:57 |
auniyal | Hi sean-k-mooney | 10:20 |
auniyal | can you please review these patches - https://review.opendev.org/c/openstack/nova/+/864055 | 10:21 |
opendevreview | Anton Kurbatov proposed openstack/nova master: Fix VMs sorting fail in case of comparison with None https://review.opendev.org/c/openstack/nova/+/865037 | 10:23 |
sean-k-mooney | i can look quickly but i don't really have time for nova work this week outside of the vdpa rebases. | 10:23 |
sean-k-mooney | oh that's the train backport | 10:23 |
auniyal | yeah, the last patch is failing before migration | 10:24 |
gokhani | sean-k-mooney, you are right, the instance is in an inconsistent state and I tried to reboot it. The glance image is already deleted and so the base image is also deleted. I didn't activate the resume_state_on_host_boot option, it is false. | 10:42 |
sean-k-mooney | this code path is only taken if you don't have an auth token | 10:43 |
sean-k-mooney | so you should not be able to get here manually | 10:43 |
sean-k-mooney | gokhani: how did you trigger the hard reboot? | 10:43 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Refactor volume connection cleanup out of _post_live_migration https://review.opendev.org/c/openstack/nova/+/864670 | 10:46 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Move pre-3.44 Cinder post live migration test to test_compute_mgr https://review.opendev.org/c/openstack/nova/+/864671 | 10:46 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail https://review.opendev.org/c/openstack/nova/+/863806 | 10:46 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration https://review.opendev.org/c/openstack/nova/+/864055 | 10:46 |
gokhani | sean-k-mooney, first I ran "nova reset-state --active 3be526a1-664c-4166-8212-a0b979259ddf" and after that "openstack server reboot --hard 3be526a1-664c-4166-8212-a0b979259ddf" | 10:47 |
sean-k-mooney | the second command should not have taken this code path | 10:47 |
sean-k-mooney | it's guarded by "if context.auth_token is not None:" | 10:48 |
sean-k-mooney | if hard_reboot was triggered from the api then there should always be an auth token | 10:49 |
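[editor's note: a sketch of the guard quoted above. The class and function names are illustrative, not nova's actual API; the point is that only token-less internal contexts (e.g. resume_state_on_host_boot) reach the repair path.]

```python
# Sketch: an API-driven hard reboot always carries a user auth token,
# so it skips the glance re-fetch; internal reboots do not.
class RequestContext:
    def __init__(self, auth_token=None):
        self.auth_token = auth_token

def takes_repair_path(context):
    # mirrors "if context.auth_token is not None:" inverted: with no
    # token, nova attempts to re-create images and backing files
    return context.auth_token is None
```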
bauzas | sean-k-mooney: can you drop your -1 on https://review.opendev.org/c/openstack/nova/+/864418 now that I flipped the patches in the series ? | 10:49 |
sean-k-mooney | is that the gpu stuff | 10:50 |
sean-k-mooney | yep | 10:50 |
bauzas | yes | 10:50 |
* bauzas needs to take her daughter | 10:50 | |
sean-k-mooney | i'll swap it to a review priority +1 but i won't get to it today | 10:50 |
sean-k-mooney | i'll see if i can loop back to it later in the week | 10:51 |
gokhani | sean-k-mooney, I will try to reproduce this problem with trying to reboot another instance | 10:52 |
sean-k-mooney | gokhani: ack, if you can reproduce it with a known good test instance then that would help | 10:53 |
sean-k-mooney | gokhani: you're using victoria, correct? | 10:53 |
sean-k-mooney | i noticed nova 22.something in the trace | 10:53 |
gokhani | sean-k-mooney, yes our env is victoria | 10:54 |
sean-k-mooney | i didn't check the same section of code in master but it's possible that this was a bug that was fixed in between | 10:54 |
sean-k-mooney | no, that function is the same on master | 10:55 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L3931 | 10:55 |
sean-k-mooney | so if there is a bug it should happen there too. | 10:55 |
gokhani | sean-k-mooney, I tried to soft reboot an instance and it throws the error again | 10:58 |
gokhani | https://paste.openstack.org/show/b7FSkdeQOxcuWu296tfq/ | 10:58 |
gokhani | rebooting is triggered with horizon | 10:59 |
sean-k-mooney | so the initial error Cannot access backing file '/var/lib/nova/instances/_base/9d36e2c635ce070d95805f64f4b34655f3eae96b' of storage file '/var/lib/nova/instances/4a34a2f4-98a4-4cea-8d80-96d28f12edd5/disk' | 10:59 |
sean-k-mooney | indicates that something deleted the disk image behind nova's back | 11:00 |
sean-k-mooney | soft reboot escalates to hard reboot if it fails | 11:00 |
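[editor's note: a hedged sketch of the escalation just described: try a soft reboot first, fall back to a hard reboot when it fails or raises. The callables stand in for nova's internal reboot paths; names are illustrative.]

```python
# Sketch: soft reboot escalates to hard reboot on failure.
def reboot(instance, soft_reboot, hard_reboot):
    try:
        if soft_reboot(instance):
            return 'soft'
    except Exception:
        # a failed soft reboot is not fatal; escalate instead
        pass
    hard_reboot(instance)
    return 'hard'
```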
opendevreview | Amit Uniyal proposed openstack/nova stable/train: func: Introduce a server_expected_state kwarg to InstanceHelperMixin._live_migrate https://review.opendev.org/c/openstack/nova/+/865382 | 11:00 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Refactor volume connection cleanup out of _post_live_migration https://review.opendev.org/c/openstack/nova/+/864670 | 11:00 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Move pre-3.44 Cinder post live migration test to test_compute_mgr https://review.opendev.org/c/openstack/nova/+/864671 | 11:00 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail https://review.opendev.org/c/openstack/nova/+/863806 | 11:00 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration https://review.opendev.org/c/openstack/nova/+/864055 | 11:00 |
sean-k-mooney | gokhani: we might not be passing the auth token properly in the context when we do that | 11:01 |
sean-k-mooney | gokhani: the code path that it is taking would repair the issue that it detected in soft reboot | 11:01 |
sean-k-mooney | if the image still existed in glance | 11:01 |
sean-k-mooney | if it's deleted there is no way to fix the vm without copying the backing file from somewhere else in the cloud that still has it | 11:02 |
sean-k-mooney | if /var/lib/nova/instances/_base/9d36e2c635ce070d95805f64f4b34655f3eae96b exists somewhere else on your cloud and you copy it to that location on this host | 11:02 |
sean-k-mooney | and ensure it has the correct user permissions as the other images there | 11:03 |
sean-k-mooney | then you might be able to recover the vm | 11:03 |
sean-k-mooney | with another hard/soft reboot | 11:03 |
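[editor's note: a hypothetical recovery helper for the procedure described above: copy a backing file recovered from another compute node into this host's _base directory and match the permissions of the existing cache entries. The function name and layout are assumptions, not nova code.]

```python
import os
import shutil

def restore_backing_file(src, base_dir, filename):
    """Copy a recovered backing file into base_dir/filename and
    mirror the mode of an existing cache entry, if any."""
    dst = os.path.join(base_dir, filename)
    shutil.copy2(src, dst)
    siblings = [f for f in os.listdir(base_dir) if f != filename]
    if siblings:
        ref = os.stat(os.path.join(base_dir, siblings[0]))
        os.chmod(dst, ref.st_mode)  # match an existing entry's mode
        # os.chown(dst, ref.st_uid, ref.st_gid)  # requires root
    return dst
```

After the file is back in place with the right ownership, another hard/soft reboot may recover the vm, as noted above.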
gokhani | sean-k-mooney, I deleted the image of this instance in glance. | 11:04 |
sean-k-mooney | sure that should be fine | 11:04 |
sean-k-mooney | but something also deleted the image backing file on the compute node | 11:04 |
sean-k-mooney | that is not ok and should not happen if there is an instance based on that image on the host | 11:04 |
sean-k-mooney | you have configured your deployment to use qcow images for the vm with a backing file | 11:06 |
gokhani | sean-k-mooney, what can be the reason for the image backing file being deleted? Is there any config on nova? | 11:06 |
sean-k-mooney | this will only be deleted by nova if there are no vms on this host based on that glance image | 11:06 |
sean-k-mooney | you are not mounting this on an nfs share or something like that? | 11:07 |
gokhani | sean-k-mooney, yes, I have netapp storage and instance disks are on an nfs share | 11:08 |
sean-k-mooney | nfsv3...? | 11:08 |
gokhani | nfs4 | 11:09 |
sean-k-mooney | nfs v3 has consistency/locking issues and we strongly discourage using it. ideally you would use v4.2+ with nova. | 11:09 |
gokhani | nfsv4 | 11:09 |
sean-k-mooney | ok | 11:09 |
sean-k-mooney | so the only thing that comes to mind is if you are not using a separate share or directory per host | 11:10 |
sean-k-mooney | then if /var/lib/nova/instances/_base is shared, one of the other nova computes could have deleted it if it did not properly detect that it's on a shared file system | 11:11 |
gokhani | sean-k-mooney, I am using the same share for all compute nodes | 11:11 |
sean-k-mooney | in general we discourage putting the instance directory on nfs by the way. | 11:11 |
sean-k-mooney | you can do that and we know operators do, but it's not well tested and there are definitely more bugs in that config. | 11:12 |
sean-k-mooney | can you check if you have a file for me | 11:12 |
sean-k-mooney | one sec while i look for it | 11:13 |
gokhani | sean-k-mooney, I also have the instance disk under /var/lib/nova/instances/xxxxxxxxx/disk | 11:13 |
sean-k-mooney | yes that's where they are stored by default | 11:14 |
sean-k-mooney | do you have /var/lib/nova/instances/compute_nodes | 11:14 |
gokhani | yes I have | 11:14 |
sean-k-mooney | does it have multiple entries | 11:15 |
gokhani | I will check it | 11:15 |
sean-k-mooney | mine looks like this | 11:15 |
sean-k-mooney | nova-compute)[nova@cloud instances]$ cat /var/lib/nova/instances/compute_nodes | 11:15 |
sean-k-mooney | {"cloud": 1669200576.0961165} | 11:15 |
sean-k-mooney | im not sure if that should have multiple entries for nfs | 11:16 |
sean-k-mooney | but that is created by the image cache code and i think it's related to deletion on shared filesystems | 11:16 |
sean-k-mooney | im just wondering if it exists and if it has multiple entries, the content itself is not really that important | 11:16 |
gokhani | sean-k-mooney, https://paste.openstack.org/show/bldgeGSIq8BoLjp0u6sx/ | 11:18 |
sean-k-mooney | ack, that is what i was expecting to see | 11:18 |
sean-k-mooney | that's generated here https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/storage_users.py#L45-L72 | 11:19 |
sean-k-mooney | so each compute on shared storage adds itself to that list | 11:19 |
sean-k-mooney | https://opendev.org/openstack/nova/src/branch/master/nova/compute/manager.py#L10901-L10916 | 11:21 |
sean-k-mooney | we use that file to get the instances for all hosts on the shared storage | 11:21 |
sean-k-mooney | and then we clean the cache based on that | 11:21 |
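[editor's note: a sketch, not nova's exact storage_users.py code, of how the compute_nodes marker file shown above is maintained: each compute host on the shared store records its CONF.host name with a timestamp, and stale hosts are aged out. The `max_age` parameter and return value are assumptions for illustration.]

```python
import json
import os
import time

def register_storage_user(path, hostname, max_age=86400, now=None):
    """Record hostname in the compute_nodes JSON file and drop hosts
    that have not refreshed their entry within max_age seconds."""
    now = time.time() if now is None else now
    users = {}
    if os.path.exists(path):
        with open(path) as f:
            users = json.load(f)  # e.g. {"cloud": 1669200576.096}
    users[hostname] = now
    users = {h: t for h, t in users.items() if now - t < max_age}
    with open(path, 'w') as f:
        json.dump(users, f)
    return sorted(users)
```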
sean-k-mooney | on a normal non shared deployment like mine that only has one host | 11:22 |
sean-k-mooney | in your case it has many | 11:22 |
sean-k-mooney | i see prod-compute3 but not compute03 in that list | 11:23 |
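[editor's note: an illustrative sketch of the cleanup step described above: collect the base files referenced by the instances of every host listed in compute_nodes, and only unreferenced _base entries become removal candidates. If a host is missing from that list, its backing files look unused and can be deleted behind its back, which is the failure mode suspected here. Names are illustrative, not nova's.]

```python
def removable_cache_entries(cache_entries, instances_by_host):
    """Return _base entries referenced by no host's instances."""
    referenced = set()
    for backing_files in instances_by_host.values():
        referenced.update(backing_files)
    return sorted(set(cache_entries) - referenced)
```

For example, if 'base-b' is only referenced by a host absent from compute_nodes, it is wrongly reported as removable.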
gokhani | sean-k-mooney, we need to find out how the image backing file was deleted. I have this problem on instances whose images are deleted on glance | 11:23 |
sean-k-mooney | the delete probably happened because of the periodic cleanup | 11:24 |
sean-k-mooney | but can you verify if the compute node is compute03 or prod-compute3 in the hypervisor api or compute_node db table | 11:24 |
sean-k-mooney | we are using conf.host to populate that file | 11:25 |
sean-k-mooney | im wondering if you perhaps changed that at some point? | 11:25 |
sean-k-mooney | or change the hostname | 11:25 |
sean-k-mooney | to remove the prod- prefix | 11:26 |
sean-k-mooney | my guess is that currently either the [DEFAULT]/host value does not match the value in the compute_nodes file, or it's unset and your hostname no longer matches because it was changed somehow | 11:27 |
sean-k-mooney | that file is updated before every time we cleanup the image cache | 11:28 |
gokhani | sean-k-mooney, https://paste.openstack.org/show/bVzhWTyWsA3f1Umb5YGM/ | 11:28 |
sean-k-mooney | so the compute agents think they should be using the prod- prefix | 11:28 |
sean-k-mooney | ok | 11:28 |
sean-k-mooney | that at least aligns with what's in the file | 11:28 |
sean-k-mooney | although | 11:28 |
sean-k-mooney | that's showing the hypervisor hostname | 11:29 |
gokhani | sean-k-mooney, do I need to also verify on the db? | 11:29 |
sean-k-mooney | what i actually need is the host value rather than the hypervisor hostname i think | 11:29 |
sean-k-mooney | that would be the host value in the compute service entry for example, although it's also in the compute nodes table | 11:30 |
sean-k-mooney | im just trying to verify that they match what is in the file | 11:30 |
sean-k-mooney | oh never mind | 11:31 |
sean-k-mooney | it's fine | 11:31 |
sean-k-mooney | https://paste.openstack.org/show/b7FSkdeQOxcuWu296tfq/ | 11:31 |
sean-k-mooney | i had scrolled that over and the start of the lines was cut off | 11:31 |
sean-k-mooney | so it's prod-compute03 | 11:31 |
gokhani | yes | 11:31 |
sean-k-mooney | do you see this warning in any of the logs https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/imagecache.py#L331-L334 | 11:36 |
gokhani | sean-k-mooney, I dont see any warning like that | 11:41 |
gokhani | there are periodic checks like that https://paste.openstack.org/show/bGWC1FfgR5R7nXqiCOuQ/ | 11:43 |
sean-k-mooney | ya that's what we expect to see | 11:48 |
sean-k-mooney | when it's working correctly | 11:48 |
gokhani | sean-k-mooney, for rescuing these vms it seems there is only one option: create glance images from /var/lib/nova/instances/xxx/disk | 11:49 |
sean-k-mooney | glance does not allow you to create an image with a specific uuid | 11:50 |
sean-k-mooney | briefly looking at the code i don't see where the bug is | 11:50 |
sean-k-mooney | https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/imagecache.py#L43 | 11:50 |
sean-k-mooney | the list_running_instances code seems to take into account local and remote instances | 11:51 |
sean-k-mooney | i have a meeting now so unfortunately i can't continue to look at this now | 11:57 |
sean-k-mooney | the root cause so far is something deleted the backing file | 11:57 |
sean-k-mooney | since it's deleted in glance and it's deleted from the nfs share, im not sure there is a way to correct this | 11:58 |
gokhani | thanks sean-k-mooney for your help :) | 11:59 |
*** dasm|off is now known as dasm | 13:59 | |
opendevreview | Jorge San Emeterio proposed openstack/nova-specs master: Review usage of oslo-privsep library on Nova https://review.opendev.org/c/openstack/nova-specs/+/865432 | 14:12 |
opendevreview | Jorge San Emeterio proposed openstack/nova-specs master: Review usage of oslo-privsep library on Nova https://review.opendev.org/c/openstack/nova-specs/+/865432 | 14:14 |
opendevreview | Jorge San Emeterio proposed openstack/nova-specs master: Review usage of oslo-privsep library on Nova https://review.opendev.org/c/openstack/nova-specs/+/865432 | 15:10 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail https://review.opendev.org/c/openstack/nova/+/863806 | 16:02 |
opendevreview | Amit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration https://review.opendev.org/c/openstack/nova/+/864055 | 16:02 |
*** akekane is now known as abhishekk | 16:09 | |
bauzas | man, the gate is super flaky these days | 16:58 |
bauzas | lots of rechecks on networking and volume detaches :( | 16:58 |
gibi | probably the switch to jammy | 17:11 |
gibi | but this time I don't have time to look at what changed with the detach logic again | 17:11 |
sean-k-mooney | gibi: im not sure that it really has changed | 18:28 |
sean-k-mooney | gibi: i think they still have not fixed it | 18:28 |
sean-k-mooney | gibi: bauzas: by the way, if either of ye can approve https://review.opendev.org/c/openstack/nova/+/865031 it will simplify ralonsoh's life and help with the trunk port issue | 18:30 |
sean-k-mooney | https://review.opendev.org/c/openstack/neutron/+/837780 currently needs a config option because of our min version | 18:31 |
sean-k-mooney | but that can be removed if we increase the min version of os-vif | 18:31 |
opendevreview | Merged openstack/nova master: Reproducer for bug 1951656 https://review.opendev.org/c/openstack/nova/+/850673 | 19:59 |
*** dasm is now known as dasm|off | 23:49 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!