Thursday, 2024-08-15

00:50 *** bauzas_ is now known as bauzas
01:24 <opendevreview> Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with SEV-ES memory encryption  https://review.opendev.org/c/openstack/nova/+/926106
01:26 <opendevreview> Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with SEV-ES memory encryption  https://review.opendev.org/c/openstack/nova/+/926106
07:02 *** bauzas_ is now known as bauzas
08:40 <gibi> sean-k-mooney: artom: I'm still working through the relationship between the instance.pci_devices and the pci_tracker state. At some point the accounting of the two got out of sync, and I think that is the place we need to fix
08:40 <gibi> the refresh is just a band-aid
09:03 <zigo> bauzas: Hi there! Are you aware of this problem? https://bugs.launchpad.net/nova/+bug/2077070
09:03 <zigo> Is Sylvain on holiday, btw? :)
09:13 <frickler> isn't all of France on holiday during August? but this sounds related to what sean-k-mooney and gibi were discussing earlier, at least insofar as PCI devices are affected
09:44 <gibi> sean-k-mooney: artom: https://review.opendev.org/c/openstack/nova/+/710848/13#message-354946be57b39eb88b67872230c226361d8a3559 this is where I am at. The resource tracker correctly removes the device from the instance.pci_devices list while aborting the claim, but that update is lost when we return to the compute manager
09:49 <sean-k-mooney[m]> zigo: is the compute node marked as deleted?
09:49 <sean-k-mooney[m]> we don't actually delete the row
09:50 <gibi> sean-k-mooney: artom: yeah, the Claim ctor clones the instance https://github.com/openstack/nova/blob/a7c82399b237aff44810ac76b0c4d10416c3bdf9/nova/compute/claims.py#L65-L66
09:51 <sean-k-mooney[m]> gibi: I'm going to grab a coffee but I'll be back in 5 mins and I'll take a look
10:01 <sean-k-mooney> ok, actually
10:03 <gibi> the abort claim logic works on a copy of the instance, so the cleanup of the instance is lost when nova returns to the compute manager
10:04 <sean-k-mooney> do you want to jump on a google meet and debug this in 10 mins? that sounds plausible, yes, looking at the clone.
10:04 <gibi> sure I can
10:04 <sean-k-mooney> I believe we do that to make sure we don't modify the instance object during the claim
10:04 <sean-k-mooney> but I guess we did in one path and didn't in another
10:04 <sean-k-mooney> and the two are out of sync
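
A minimal standalone sketch of the lost-update mechanism described above, using simplified stand-in names (a plain dict for the instance) rather than nova's actual objects; in nova the clone happens in the Claim ctor in nova/compute/claims.py:

    import copy

    class Claim:
        def __init__(self, instance):
            # Like nova's Claim ctor: work on a clone so the caller's
            # object is not mutated while the claim is being processed.
            self.instance = copy.deepcopy(instance)

        def abort(self):
            # The PCI cleanup lands on the clone, not the caller's copy.
            self.instance['pci_devices'].clear()

    manager_instance = {'uuid': 'fake-uuid', 'pci_devices': ['0000:81:00.1']}
    claim = Claim(manager_instance)
    claim.abort()
    print(claim.instance['pci_devices'])    # []
    print(manager_instance['pci_devices'])  # ['0000:81:00.1'] -- stale
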
10:05 <gibi> should we wait for artom?
10:05 <gibi> maybe he is interested too
10:05 <sean-k-mooney> we can
10:05 <gibi> then I will grab a quick lunch and let you look at what I found in the meantime
10:06 <gibi> and we can jump on a call when artom is around
10:06 <sean-k-mooney> I didn't sleep well last night (didn't get to sleep until 4am) so I'm just starting my day now; I wanted to skim my email first while having my morning coffee
10:06 <sean-k-mooney> so we can chat when they are online
10:06 <gibi> OK
10:15 <sean-k-mooney> frickler: and yes, that is a different topic. we have a very old bug with an equally old bugfix that "works", but it's fixing the side effect rather than the cause, and we are trying to understand what the cause is and how to move forward
10:16 <zigo> sean-k-mooney: I'm not sure, I haven't checked for that. you're probably right that it just stays as deleted, but my bug still stands.
10:16 <sean-k-mooney> zigo: I have not looked at your bug yet :)
10:17 <sean-k-mooney> openstack compute service delete <uuid> deletes the compute service, not the compute node, by the way
10:17 <sean-k-mooney> it may do one as a side effect of the other, but those are two logically different things
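
For context on the service/node distinction and the soft delete mentioned above ("we don't actually delete the row"), a rough illustration with simplified stand-in classes, not nova's real DB models:

    from dataclasses import dataclass, field

    @dataclass
    class ComputeNode:
        hypervisor_hostname: str
        deleted: bool = False  # soft delete: the row is kept, only flagged

    @dataclass
    class Service:
        host: str
        binary: str = 'nova-compute'
        compute_nodes: list = field(default_factory=list)

        def delete(self):
            # Deleting the service may soft-delete its nodes as a side
            # effect, but service and node remain distinct records.
            for node in self.compute_nodes:
                node.deleted = True

    svc = Service(host='compute-1', compute_nodes=[ComputeNode('compute-1')])
    svc.delete()
    print(svc.compute_nodes[0].deleted)  # True: flagged, not removed
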
10:17 <sean-k-mooney> zigo: are there any instances on the compute node at the time you're deleting the compute service?
10:18 <sean-k-mooney> especially any that are using pci devices?
10:18 <zigo> sean-k-mooney: I had a case where that was the issue, but in another case, no, there was no instance at all, let alone one with a PCI device attached.
10:20 <sean-k-mooney> zigo: so reinstalling a node because of a hardware issue would not normally involve deleting the compute service
10:20 <sean-k-mooney> is there a reason you did that?
10:20 <sean-k-mooney> I'm not saying we should not look into the cleanup
10:21 <zigo> It's just our procedure to clean up nova, cinder and neutron services when we decommission a node...
10:21 <sean-k-mooney> but the procedure you're taking is not what I would expect/recommend if you are replacing a node but keeping the hostname
10:21 <zigo> Ah.
10:21 <sean-k-mooney> zigo: this is what I would expect if you were changing the hostname, i.e. scaling in and out
10:22 <sean-k-mooney> but not if you are preserving the hostname
10:23 <sean-k-mooney> I think that might be why we don't see this downstream
10:24 <sean-k-mooney> we have 2 different but related procedures. if it's just a hardware failure and you want to swap out the failed hardware, then we would expect you to physically do that and then just reinstall, but not delete the compute service
10:24 <sean-k-mooney> for hosts where the instances are on ceph, for example, that allows replacement of the failed hardware even if you have not evacuated the instances, although that is preferred
10:25 <sean-k-mooney> in our old installer, scaling in and out meant a new server ended up with a new hostname
10:25 <sean-k-mooney> so it would not hit this edge case
10:25 <zigo> sean-k-mooney: I would expect that I can do what I want, the way I want, without experiencing such a bug... :)
10:29 <sean-k-mooney> well, you can expect that :)
11:21 <opendevreview> Balazs Gibizer proposed openstack/nova master: Fix PCI passthrough cleanup on reschedule  https://review.opendev.org/c/openstack/nova/+/926407
12:30 *** bauzas_ is now known as bauzas
13:40 <sean-k-mooney> gibi: I think what you are proposing is a better fix than mine, as you have moved the cleanup to the correct function
13:40 <sean-k-mooney> gibi: but I left a comment inline.
13:42 <sean-k-mooney> I'm not sure we will have time to sync with artom before our internal meeting starts, so we might just do that after we finish with them, or we can do it async
13:43 <sean-k-mooney> and check back tomorrow
13:53 <gibi> let's try tomorrow. I don't think I will have the mental capacity after our internal call. but I will clean up the unit and functional test failures at least in the meantime
13:54 <gibi> also I agree about free_instance_claims, but I will check for fallout
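
Building on the earlier sketch, the general shape of the fix being discussed is to propagate the cleanup done on the clone back to the caller's instance. The following reuses the same hypothetical names and only illustrates that pattern; it is not the contents of change 926407:

    import copy

    class Claim:
        def __init__(self, instance):
            self.instance = copy.deepcopy(instance)

    def abort_claim(claim, manager_instance):
        # Compute the cleanup on the clone as before...
        claim.instance['pci_devices'].clear()
        # ...then mirror it onto the object the compute manager keeps, so
        # the two views of instance.pci_devices cannot diverge.
        manager_instance['pci_devices'] = list(claim.instance['pci_devices'])

    inst = {'uuid': 'fake-uuid', 'pci_devices': ['0000:81:00.1']}
    abort_claim(Claim(inst), inst)
    print(inst['pci_devices'])  # []: the cleanup now survives the abort
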
13:54 <artom> sean-k-mooney, gibi, what's up?
13:54 <sean-k-mooney> artom: gibi found the root cause of the bug
13:55 <artom> Sweet
13:55 <gibi> artom: this is the distilled version https://review.opendev.org/c/openstack/nova/+/926407
13:56 <gibi> the longer form is in the comments of https://review.opendev.org/c/openstack/nova/+/710848/13#message-354946be57b39eb88b67872230c226361d8a3559
13:57 <artom> Oh wow
13:57 <gibi> I still have 30 mins before the next call, so if you, artom, sean-k-mooney, want to sync on it, I can explain in gmeet
13:57 <sean-k-mooney> artom: I was suggesting this morning that we could hop on a call and try to find it together, but we said we would wait for you to come online before doing that, and in the meantime gibi made some progress
13:57 <artom> That is so insidious, well found!
13:58 <artom> Of course, then the fear is: what else are we affecting, and potentially breaking?
13:58 <sean-k-mooney> I'm meant to have a 1:1 in 3 minutes but I'm not sure if it's going to happen
13:58 <sean-k-mooney> artom: right
13:58 <artom> I feel like our test coverage is enough at this point that it's _probably_ fine
13:58 <sean-k-mooney> maybe
13:59 <gibi> artom: exactly. I'm also a bit afraid of the potential fallout
13:59 <gibi> the functional tests are clean (except one test case that I can explain and fix)
13:59 <sean-k-mooney> so what I was hoping is that you could integrate the changes from gibi and try to fix the test coverage as needed
13:59 <sean-k-mooney> then we could see what that looks like tomorrow
14:00 <gibi> sean-k-mooney: artom: I can clean up my patch so artom can focus on the socket id thing
14:00 <sean-k-mooney> ack, that works too
14:00 <gibi> and we can ask melwitt or dansmith for review as a second core
14:01 <sean-k-mooney> speaking of mel, I need to loop back to her unified limits changes
14:01 <sean-k-mooney> melwitt: did the test change to oslo.limit land yet?
14:02 * gibi goes to clean the tests
14:02 <sean-k-mooney> melwitt: oh, ok, you abandoned https://review.opendev.org/c/openstack/oslo.limit/+/924024
14:02 <sean-k-mooney> so I guess the intent is to proceed without that for now in https://review.opendev.org/c/openstack/nova/+/924025
14:04 <sean-k-mooney> melwitt: I have not looked at that in a while, but I see you still have it marked as work in progress in gerrit
14:04 <sean-k-mooney> are you happy that the current version is ready for review, or should I loop back in a few days?
14:50 <opendevreview> Balazs Gibizer proposed openstack/nova master: Fix PCI passthrough cleanup on reschedule  https://review.opendev.org/c/openstack/nova/+/926407
15:00 <melwitt> sean-k-mooney: yeah, the intent was to go for a nova-only approach. and oops, it's not supposed to be marked as WIP anymore, I'll unmark them
19:01 *** bauzas_ is now known as bauzas
23:12 *** bauzas_ is now known as bauzas
