*** bauzas_ is now known as bauzas | 00:50 | |
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with SEV-ES memory encryption https://review.opendev.org/c/openstack/nova/+/926106 | 01:24 |
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with SEV-ES memory encryption https://review.opendev.org/c/openstack/nova/+/926106 | 01:26 |
*** bauzas_ is now known as bauzas | 07:02 | |
gibi | sean-k-mooney: artom: I'm still in the process of understanding the logic of the relationship between the instance.pci_devices and the pci_tracker state. At some point the accounting of the two got out of sync and I think that is the place we need to fix | 08:40
gibi | the refresh is just a band aid | 08:40 |
zigo | bauzas: Hi there! Are you aware of this problem? https://bugs.launchpad.net/nova/+bug/2077070 | 09:03 |
zigo | Is Sylvain on holiday btw? :) | 09:03
frickler | isn't all of France on holiday during August? but this sounds related to what sean-k-mooney and gibi were discussing earlier, at least inasmuch as pci devices are affected | 09:13
gibi | sean-k-mooney: artom: https://review.opendev.org/c/openstack/nova/+/710848/13#message-354946be57b39eb88b67872230c226361d8a3559 this is where I am at. The resource tracker correctly removes the device from the instance.pci_devices list while aborting the claim, but that update is lost when we return to the compute manager | 09:44
sean-k-mooney[m] | zigo: is the compute node marked as deleted? | 09:49
sean-k-mooney[m] | we don't actually delete the row | 09:49
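(A quick illustration of the soft-delete behaviour sean-k-mooney describes: nova keeps "deleted" rows and flags them, conventionally setting the deleted column to the row's id, rather than issuing a SQL DELETE. The schema below is a hypothetical simplification, not nova's actual compute_nodes table.)

```python
import sqlite3

# Hypothetical, simplified schema -- not nova's real compute_nodes table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compute_nodes ("
    "id INTEGER PRIMARY KEY, hypervisor_hostname TEXT, deleted INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO compute_nodes (hypervisor_hostname) VALUES ('node1')")

# "Deleting" the node only flags the row instead of removing it:
conn.execute("UPDATE compute_nodes SET deleted = id WHERE hypervisor_hostname = 'node1'")

# The row is still present, so a redeployed host reusing the hostname can
# collide with the stale record:
print(conn.execute("SELECT id, hypervisor_hostname, deleted FROM compute_nodes").fetchall())
# [(1, 'node1', 1)]
```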
gibi | sean-k-mooney: artom: yeah Claim ctor clones the instance https://github.com/openstack/nova/blob/a7c82399b237aff44810ac76b0c4d10416c3bdf9/nova/compute/claims.py#L65-L66 | 09:50 |
sean-k-mooney[m] | gibi: I'm going to grab a coffee but I'll be back in 5 mins and I'll take a look | 09:51
sean-k-mooney | ok, actually | 10:01
gibi | the abort claim logic is working on a copy of the instance, so the cleanup of the instance is lost when nova returns to the compute manager | 10:03
sean-k-mooney | do you want to jump on a Google Meet and debug this in 10 mins? that sounds plausible, yes, looking at the clone. | 10:04
gibi | sure I can | 10:04 |
sean-k-mooney | I believe we do that to make sure we don't modify the instance object during the claim | 10:04
sean-k-mooney | but I guess we did in one path and didn't in another | 10:04
sean-k-mooney | and the two are out of sync | 10:04 |
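(A minimal, self-contained sketch of the lost-update pattern being discussed, using hypothetical stand-in classes rather than nova's real objects: the claim deep-copies the instance, so cleanup applied during an abort mutates the copy and never reaches the caller's original.)

```python
import copy

class Instance:
    """Hypothetical stand-in for nova's Instance object."""
    def __init__(self, pci_devices):
        self.pci_devices = pci_devices

class Claim:
    """Hypothetical stand-in for nova.compute.claims.Claim."""
    def __init__(self, instance):
        # Mirrors the clone discussed above: the claim keeps its own copy
        # so the caller's instance is not mutated while the claim is held.
        self.instance = copy.deepcopy(instance)

    def abort(self):
        # Cleanup runs against the claim's copy only.
        self.instance.pci_devices.clear()

inst = Instance(pci_devices=["0000:81:00.1"])
claim = Claim(inst)
claim.abort()

print(claim.instance.pci_devices)  # [] -- cleaned up on the claim's copy
print(inst.pci_devices)            # ['0000:81:00.1'] -- the caller still sees the device
```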
gibi | should we wait for artom? | 10:05 |
gibi | maybe he is interested too | 10:05 |
sean-k-mooney | we can | 10:05 |
gibi | then I will grab a quick lunch, and let you look at what I found in the meantime | 10:05 |
gibi | and we can jump on a call when artom is around | 10:06
sean-k-mooney | I didn't sleep well last night (didn't get to sleep until 4am) so I'm just starting my day now and wanted to skim my email first while having my morning coffee | 10:06
sean-k-mooney | so we can chat when they are online | 10:06
gibi | OK | 10:06 |
sean-k-mooney | frickler: and yes that is a different topic. we have a very old bug, with an equally old bugfix that "works", but it's fixing the side effect rather than the cause, and we are trying to understand what the cause is and how to move forward | 10:15
zigo | sean-k-mooney: I'm not sure, I haven't checked for that, you're probably right that it just stays as deleted, but my bug still stands. | 10:16
sean-k-mooney | zigo: I have not looked at your bug yet :) | 10:16
sean-k-mooney | openstack compute service delete <uuid> is deleting the compute service, not the compute node, by the way | 10:17
sean-k-mooney | it may do one as a side effect of the other but those are two logically different things | 10:17
sean-k-mooney | zigo: are there any instances on the compute node at the time you're deleting the compute service? | 10:17
sean-k-mooney | especially any that are using pci devices | 10:18
zigo | sean-k-mooney: I had a case where that was the issue, but in another case, no, there was no instance at all, let alone one with a PCI device attached. | 10:18
sean-k-mooney | zigo: so reinstalling a node because of a hardware issue would not normally involve deleting the compute service | 10:20
sean-k-mooney | is there a reason you did that? | 10:20
sean-k-mooney | I'm not saying we should not look into the cleanup | 10:20
zigo | It's just our procedure to clean up nova, cinder and neutron services when we decommission a node... | 10:21
sean-k-mooney | but the procedure you're taking is not what I would expect/recommend if you are replacing a node but keeping the host name | 10:21
zigo | Ah. | 10:21 |
sean-k-mooney | zigo: this is what I would expect if you were changing the hostname, i.e. scaling in and out | 10:21
sean-k-mooney | but not if you are preserving the host name | 10:22
sean-k-mooney | I think that might be why we don't see this downstream | 10:23
sean-k-mooney | we have 2 different but related procedures. if it's just a hardware failure and you want to just swap out the failed hardware, then we would expect you to physically do that and then just reinstall, but not delete the compute service | 10:24
sean-k-mooney | for hosts where the instances are on ceph, for example, that allows replacement of the failed hardware even if you have not evacuated the instances, although that is preferred | 10:24
sean-k-mooney | in our old installer, scaling in and out a new server ended up with a new hostname | 10:25
sean-k-mooney | so it would not hit this edge case | 10:25
zigo | sean-k-mooney: I would expect that I can do what I want, the way I want, without experiencing such a bug ... :) | 10:25 |
sean-k-mooney | well you can expect that :) | 10:29 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Fix PCI passthrough cleanup on reschedule https://review.opendev.org/c/openstack/nova/+/926407 | 11:21 |
*** bauzas_ is now known as bauzas | 12:30 | |
sean-k-mooney | gibi: I think what you are proposing is a better fix than mine, as you have moved the cleanup to the correct function | 13:40
sean-k-mooney | gibi: but I left a comment inline. | 13:40
sean-k-mooney | I'm not sure we will have time to sync with artom before our internal meetings start, so we might just do that after we finish with them, or we can do it async | 13:42
sean-k-mooney | and check back tomorrow | 13:43
gibi | let's try tomorrow. I don't think I will have the mental capacity after our internal call. But I will clean up the unit and functional test failures at least in the meantime | 13:53 |
gibi | also I agree about free_instance_claims but I will check for fallout | 13:54
artom | sean-k-mooney, gibi, what's up? | 13:54 |
sean-k-mooney | artom: gibi found the root cause of the bug | 13:54
artom | Sweet | 13:55 |
gibi | artom: this is the distilled version https://review.opendev.org/c/openstack/nova/+/926407 | 13:55 |
gibi | the longer form is in the comments of https://review.opendev.org/c/openstack/nova/+/710848/13#message-354946be57b39eb88b67872230c226361d8a3559 | 13:56 |
artom | Oh wow | 13:57 |
gibi | I still have 30 mins before the next call, so if you, artom, sean-k-mooney, want to sync on it then I can explain in gmeet | 13:57
sean-k-mooney | artom: I was suggesting this morning that we could hop on a call and try to find it together, but we said we would wait for you to come online before doing that, and in the meantime gibi made some progress | 13:57
artom | That is so insidious, well found! | 13:57 |
artom | Of course then the fear is - what else are we affecting, and potentially breaking? | 13:58 |
sean-k-mooney | I'm meant to have a 1:1 in 3 minutes but not sure if it's going to happen | 13:58
sean-k-mooney | artom: right | 13:58 |
artom | I feel like our test coverage is enough at this point that it's _probably_ fine | 13:58
sean-k-mooney | maybe | 13:58 |
gibi | artom: exactly. I'm also a bit afraid of the potential fallout | 13:59
gibi | the functional tests are clean (except one test case that I can explain and fix) | 13:59
sean-k-mooney | so what I was hoping is that you could integrate the changes from gibi and try and fix the test coverage as needed | 13:59
sean-k-mooney | then we could see what that looks like tomorrow | 13:59
gibi | sean-k-mooney: artom: I can clean up my patch so artom can focus on the socket id thing | 14:00 |
sean-k-mooney | ack that works too | 14:00 |
gibi | and we can ask melwitt or dansmith for review as a second core | 14:00 |
sean-k-mooney | speaking of mel, I need to loop back to her unified limits changes | 14:01
sean-k-mooney | melwitt: did the test change to oslo.limit land yet? | 14:01
* gibi goes to clean the tests | 14:02 | |
sean-k-mooney | melwitt: oh ok, you abandoned https://review.opendev.org/c/openstack/oslo.limit/+/924024 | 14:02
sean-k-mooney | so I guess the intent is to proceed without that for now in https://review.opendev.org/c/openstack/nova/+/924025 | 14:02
sean-k-mooney | melwitt: I have not looked at that in a while but I see you still have it marked as work in progress in gerrit | 14:04
sean-k-mooney | are you happy that the current version is ready for review, or should I loop back in a few days? | 14:04
opendevreview | Balazs Gibizer proposed openstack/nova master: Fix PCI passthrough cleanup on reschedule https://review.opendev.org/c/openstack/nova/+/926407 | 14:50 |
melwitt | sean-k-mooney: yeah the intent was to go for a nova-only approach. and oops, not supposed to be marked as WIP anymore, I'll unmark them | 15:00 |
*** bauzas_ is now known as bauzas | 19:01 | |
*** bauzas_ is now known as bauzas | 23:12 |