Thursday, 2024-08-15

00:50 *** bauzas_ is now known as bauzas
01:24 <opendevreview> Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with SEV-ES memory encryption  https://review.opendev.org/c/openstack/nova/+/926106
01:26 <opendevreview> Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with SEV-ES memory encryption  https://review.opendev.org/c/openstack/nova/+/926106
07:02 *** bauzas_ is now known as bauzas
08:40 <gibi> sean-k-mooney: artom: I'm still working through the relationship between the instance.pci_devices and the pci_tracker state. At some point the accounting of the two got out of sync, and I think that is the place we need to fix
08:40 <gibi> the refresh is just a band-aid
09:03 <zigo> bauzas: Hi there! Are you aware of this problem? https://bugs.launchpad.net/nova/+bug/2077070
09:03 <zigo> Is Sylvain on holiday, btw? :)
09:13 <frickler> isn't all of France on holiday during August? but this sounds related to what sean-k-mooney and gibi were discussing earlier, at least insofar as PCI devices are affected
09:44 <gibi> sean-k-mooney: artom: https://review.opendev.org/c/openstack/nova/+/710848/13#message-354946be57b39eb88b67872230c226361d8a3559 this is where I am at. The resource tracker correctly removes the device from the instance.pci_devices list while aborting the claim, but that update is lost when we return to the compute manager
09:49 <sean-k-mooney[m]> zigo: is the compute node marked as deleted?
09:49 <sean-k-mooney[m]> we don't actually delete the row
09:50 <gibi> sean-k-mooney: artom: yeah, the Claim ctor clones the instance https://github.com/openstack/nova/blob/a7c82399b237aff44810ac76b0c4d10416c3bdf9/nova/compute/claims.py#L65-L66
09:51 <sean-k-mooney[m]> gibi: I'm going to grab a coffee but I'll be back in 5 mins and I'll take a look
10:01 <sean-k-mooney> ok, actually
10:03 <gibi> the abort claim logic works on a copy of the instance, so the cleanup of the instance is lost when nova returns to the compute manager
10:04 <sean-k-mooney> do you want to jump on a google meet and debug this in 10 mins? that sounds plausible, yes, looking at the clone.
10:04 <gibi> sure I can
10:04 <sean-k-mooney> I believe we do that to make sure we don't modify the instance object during the claim
10:04 <sean-k-mooney> but I guess we did in one path and didn't in another
10:04 <sean-k-mooney> and the two are out of sync
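
A minimal standalone sketch of the lost-update mechanism described above, using simplified stand-in names (a plain dict for the instance) rather than nova's actual objects; in nova the clone happens in the Claim ctor in nova/compute/claims.py:

    import copy

    class Claim:
        def __init__(self, instance):
            # Like nova's Claim ctor: work on a clone so the caller's
            # object is not mutated while the claim is being processed.
            self.instance = copy.deepcopy(instance)

        def abort(self):
            # The PCI cleanup lands on the clone, not the caller's copy.
            self.instance['pci_devices'].clear()

    manager_instance = {'uuid': 'fake-uuid', 'pci_devices': ['0000:81:00.1']}
    claim = Claim(manager_instance)
    claim.abort()
    print(claim.instance['pci_devices'])    # []
    print(manager_instance['pci_devices'])  # ['0000:81:00.1'] -- stale
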
10:05 <gibi> should we wait for artom?
10:05 <gibi> maybe he is interested too
10:05 <sean-k-mooney> we can
10:05 <gibi> then I will grab a quick lunch and let you look at what I found in the meantime
10:06 <gibi> and we can jump on a call when artom is around
10:06 <sean-k-mooney> I didn't sleep well last night (didn't get to sleep until 4am) so I'm just starting my day now; I wanted to skim my email first while having my morning coffee
10:06 <sean-k-mooney> so we can chat when they are online
10:06 <gibi> OK
10:15 <sean-k-mooney> frickler: and yes, that is a different topic. we have a very old bug with an equally old bugfix that "works", but it's fixing the side effect rather than the cause, and we are trying to understand what the cause is and how to move forward
10:16 <zigo> sean-k-mooney: I'm not sure, I haven't checked for that. you're probably right that it just stays as deleted, but my bug still stands.
10:16 <sean-k-mooney> zigo: I have not looked at your bug yet :)
10:17 <sean-k-mooney> openstack compute service delete <uuid> deletes the compute service, not the compute node, by the way
10:17 <sean-k-mooney> it may do one as a side effect of the other, but those are two logically different things
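
For context on the service/node distinction and the soft delete mentioned above ("we don't actually delete the row"), a rough illustration with simplified stand-in classes, not nova's real DB models:

    from dataclasses import dataclass, field

    @dataclass
    class ComputeNode:
        hypervisor_hostname: str
        deleted: bool = False  # soft delete: the row is kept, only flagged

    @dataclass
    class Service:
        host: str
        binary: str = 'nova-compute'
        compute_nodes: list = field(default_factory=list)

        def delete(self):
            # Deleting the service may soft-delete its nodes as a side
            # effect, but service and node remain distinct records.
            for node in self.compute_nodes:
                node.deleted = True

    svc = Service(host='compute-1', compute_nodes=[ComputeNode('compute-1')])
    svc.delete()
    print(svc.compute_nodes[0].deleted)  # True: flagged, not removed
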
10:17 <sean-k-mooney> zigo: are there any instances on the compute node at the time you're deleting the compute service?
10:18 <sean-k-mooney> especially any that are using pci devices?
10:18 <zigo> sean-k-mooney: I had a case where that was the issue, but in another case, no, there was no instance at all, let alone one with a PCI device attached.
10:20 <sean-k-mooney> zigo: so reinstalling a node because of a hardware issue would not normally involve deleting the compute service
10:20 <sean-k-mooney> is there a reason you did that?
10:20 <sean-k-mooney> I'm not saying we should not look into the cleanup
10:21 <zigo> It's just our procedure to clean up nova, cinder and neutron services when we decommission a node...
10:21 <sean-k-mooney> but the procedure you're taking is not what I would expect/recommend if you are replacing a node but keeping the hostname
10:21 <zigo> Ah.
10:21 <sean-k-mooney> zigo: this is what I would expect if you were changing the hostname, i.e. scaling in and out
10:22 <sean-k-mooney> but not if you are preserving the hostname
10:23 <sean-k-mooney> I think that might be why we don't see this downstream
10:24 <sean-k-mooney> we have 2 different but related procedures. if it's just a hardware failure and you want to swap out the failed hardware, then we would expect you to physically do that and then just reinstall, but not delete the compute service
10:24 <sean-k-mooney> for hosts where the instances are on ceph, for example, that allows replacement of the failed hardware even if you have not evacuated the instances, although that is preferred
10:25 <sean-k-mooney> in our old installer, scaling in and out meant a new server ended up with a new hostname
10:25 <sean-k-mooney> so it would not hit this edge case
10:25 <zigo> sean-k-mooney: I would expect that I can do what I want, the way I want, without experiencing such a bug... :)
10:29 <sean-k-mooney> well, you can expect that :)
11:21 <opendevreview> Balazs Gibizer proposed openstack/nova master: Fix PCI passthrough cleanup on reschedule  https://review.opendev.org/c/openstack/nova/+/926407
12:30 *** bauzas_ is now known as bauzas
13:40 <sean-k-mooney> gibi: I think what you are proposing is a better fix than mine, as you have moved the cleanup to the correct function
13:40 <sean-k-mooney> gibi: but I left a comment inline.
13:42 <sean-k-mooney> I'm not sure we will have time to sync with artom before our internal meeting starts, so we might just do that after we finish with them, or we can do it async
13:43 <sean-k-mooney> and check back tomorrow
13:53 <gibi> let's try tomorrow. I don't think I will have the mental capacity after our internal call. but I will clean up the unit and functional test failures at least in the meantime
13:54 <gibi> also I agree about free_instance_claims, but I will check for fallout
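
Building on the earlier sketch, the general shape of the fix being discussed is to propagate the cleanup done on the clone back to the caller's instance. The following reuses the same hypothetical names and only illustrates that pattern; it is not the contents of change 926407:

    import copy

    class Claim:
        def __init__(self, instance):
            self.instance = copy.deepcopy(instance)

    def abort_claim(claim, manager_instance):
        # Compute the cleanup on the clone as before...
        claim.instance['pci_devices'].clear()
        # ...then mirror it onto the object the compute manager keeps, so
        # the two views of instance.pci_devices cannot diverge.
        manager_instance['pci_devices'] = list(claim.instance['pci_devices'])

    inst = {'uuid': 'fake-uuid', 'pci_devices': ['0000:81:00.1']}
    abort_claim(Claim(inst), inst)
    print(inst['pci_devices'])  # []: the cleanup now survives the abort
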
13:54 <artom> sean-k-mooney, gibi, what's up?
13:54 <sean-k-mooney> artom: gibi found the root cause of the bug
13:55 <artom> Sweet
13:55 <gibi> artom: this is the distilled version https://review.opendev.org/c/openstack/nova/+/926407
13:56 <gibi> the longer form is in the comments of https://review.opendev.org/c/openstack/nova/+/710848/13#message-354946be57b39eb88b67872230c226361d8a3559
13:57 <artom> Oh wow
13:57 <gibi> I still have 30 mins before the next call, so if you, artom, sean-k-mooney, want to sync on it, I can explain in gmeet
13:57 <sean-k-mooney> artom: I was suggesting this morning that we could hop on a call and try to find it together, but we said we would wait for you to come online before doing that, and in the meantime gibi made some progress
13:57 <artom> That is so insidious, well found!
13:58 <artom> Of course, then the fear is: what else are we affecting, and potentially breaking?
13:58 <sean-k-mooney> I'm meant to have a 1:1 in 3 minutes but I'm not sure if it's going to happen
13:58 <sean-k-mooney> artom: right
13:58 <artom> I feel like our test coverage is enough at this point that it's _probably_ fine
13:58 <sean-k-mooney> maybe
13:59 <gibi> artom: exactly. I'm also a bit afraid of the potential fallout
13:59 <gibi> the functional tests are clean (except one test case that I can explain and fix)
13:59 <sean-k-mooney> so what I was hoping is that you could integrate the changes from gibi and try to fix the test coverage as needed
13:59 <sean-k-mooney> then we could see what that looks like tomorrow
14:00 <gibi> sean-k-mooney: artom: I can clean up my patch so artom can focus on the socket id thing
14:00 <sean-k-mooney> ack, that works too
14:00 <gibi> and we can ask melwitt or dansmith for review as a second core
14:01 <sean-k-mooney> speaking of mel, I need to loop back to her unified limits changes
14:01 <sean-k-mooney> melwitt: did the test change to oslo.limit land yet?
14:02 * gibi goes to clean the tests
14:02 <sean-k-mooney> melwitt: oh, ok, you abandoned https://review.opendev.org/c/openstack/oslo.limit/+/924024
14:02 <sean-k-mooney> so I guess the intent is to proceed without that for now in https://review.opendev.org/c/openstack/nova/+/924025
14:04 <sean-k-mooney> melwitt: I have not looked at that in a while, but I see you still have it marked as work in progress in gerrit
14:04 <sean-k-mooney> are you happy that the current version is ready for review, or should I loop back in a few days?
14:50 <opendevreview> Balazs Gibizer proposed openstack/nova master: Fix PCI passthrough cleanup on reschedule  https://review.opendev.org/c/openstack/nova/+/926407
15:00 <melwitt> sean-k-mooney: yeah, the intent was to go for a nova-only approach. and oops, it's not supposed to be marked as WIP anymore, I'll unmark them
19:01 *** bauzas_ is now known as bauzas
23:12 *** bauzas_ is now known as bauzas
