opendevreview | Qiu Fossen proposed openstack/nova master: Allow migrating PMEM's data https://review.opendev.org/c/openstack/nova/+/802225 | 01:31 |
---|---|---|
opendevreview | melanie witt proposed openstack/nova master: Revert "consumer gen: more tests for delete allocation cases" https://review.opendev.org/c/openstack/nova/+/806292 | 01:32 |
opendevreview | melanie witt proposed openstack/nova master: Revert "Consumer gen support for delete instance allocations" https://review.opendev.org/c/openstack/nova/+/806293 | 01:32 |
melwitt | gibi, sean-k-mooney, lyarwood: fyi I have uploaded reverts https://review.opendev.org/c/openstack/nova/+/806292 and https://review.opendev.org/c/openstack/nova/+/806293 as one possible option for addressing the gate bug. another option is the WIP patch https://review.opendev.org/c/openstack/nova/+/688802 to add a "force" kwarg to allocation delete | 01:36 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 01:44 |
*** abhishekk is now known as akekane|home | 06:00 | |
*** akekane|home is now known as abhishekk | 06:00 | |
slaweq | gibi: hi | 06:42 |
slaweq | gibi: I see a lot of job's failures with error on deleting instance, like e.g. https://a91ddba9ffca8a8b5d72-1d82fafbdd07c6a56856d7b1449f73a8.ssl.cf5.rackcdn.com/800059/3/check/neutron-ovn-tempest-ovs-release/fd05a18/testr_results.html | 06:43 |
slaweq | and it's not only on neutron-tempest-plugin jobs | 06:43 |
slaweq | I saw it also on some devstack jobs too | 06:43 |
slaweq | gibi: is it known issue or should I open one? | 06:43 |
gibi | slaweq: hi | 07:04 |
frickler | slaweq: I would be hoping that these are solved by the above reverts https://review.opendev.org/c/openstack/nova/+/806293 etc. | 07:04 |
gibi | just a couple of lines above, melwitt proposed patches | 07:04 |
slaweq | gibi: frickler thx a lot | 07:05 |
gibi | I've just started my day but I will prioritize this issue | 07:05 |
slaweq | gibi: thx, I just wanted to ask as I knew that You will know better if that's something already reported :) | 07:06 |
gibi | sure | 07:06 |
frickler | slaweq: if you check n-cpu.log for errors, you'll see tracebacks about conflicts in placement allocations | 07:07 |
*** rpittau|afk is now known as rpittau | 07:07 | |
slaweq | frickler: yes, now I see it :) | 07:09 |
gibi | melwitt: thanks for proposing the patches. I read up on the bug discussion and the patches, and I prefer the force kwarg: it gives us a more fine-grained approach to handling deletes in different situations | 07:37 |
gibi | I will resolve the merge conflict in https://review.opendev.org/c/openstack/nova/+/688802 and then I will +2 it | 07:37 |
gibi | sean-k-mooney, lyarwood, stephenfin: ^^ | 07:37 |
*** mgoddard- is now known as mgoddard | 07:58 | |
lyarwood | Morning | 08:19 |
* lyarwood reads scrollback | 08:19 | |
gibi | lyarwood: morning | 08:33 |
lyarwood | gibi: sorry got distracted with $child things | 08:55 |
lyarwood | gibi: so how's the rebase going? Are we not going ahead with the reverts just yet? | 08:55 |
gibi | lyarwood: almost done | 08:55 |
lyarwood | ack | 08:55 |
gibi | lyarwood: for me the reverts are more complex than the force kwargs fix | 08:56 |
lyarwood | yup agreed | 08:56 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance https://review.opendev.org/c/openstack/nova/+/688802 | 09:01 |
gibi | lyarwood, sean-k-mooney, stephenfin: ^^ | 09:02 |
lyarwood | gibi: LGTM, had to do a little background reading to make sure but yeah still agree this is the better approach | 09:15 |
gibi | lyarwood: yepp the bug report has a good summary of what we are dealing with | 09:19 |
opendevreview | Fabian Wiesel proposed openstack/nova master: VmWare: Remove unused legacy_nodename regex https://review.opendev.org/c/openstack/nova/+/806336 | 09:28 |
gibi | lyarwood: thanks for the review on the pps series. I replied in https://review.opendev.org/c/openstack/nova/+/793619/15#message-a2916b6a95c6f4061429e9fb2898ed738dfc3427 | 09:46 |
opendevreview | FossenQiu proposed openstack/nova master: Allow migrating PMEM's data https://review.opendev.org/c/openstack/nova/+/802225 | 10:33 |
opendevreview | Fabian Wiesel proposed openstack/nova master: Vmware: Fix spelling in test https://review.opendev.org/c/openstack/nova/+/806348 | 10:36 |
opendevreview | Fabian Wiesel proposed openstack/nova master: VmWare: Use of id shadows built-in function https://review.opendev.org/c/openstack/nova/+/806390 | 10:43 |
lyarwood | gibi: https://review.opendev.org/c/openstack/nova/+/688802 has some unit and functional failures if you have time to look | 10:45 |
opendevreview | Fabian Wiesel proposed openstack/nova master: Vmware: Fix indentation in conditionals https://review.opendev.org/c/openstack/nova/+/806391 | 10:47 |
gibi | lyarwood: whaat? I did run those tests before pushing | 10:49 |
gibi | looking | 10:49 |
lyarwood | yeah assumed you had, weird. | 10:50 |
gibi | obviously I ran it on something else, as I see failures that I haven't seen before :/ | 10:54 |
gibi | anyhow, fixing it ... | 10:54 |
gibi | the two failing functional tests are actually testing our bug, which is now resolved with force=True | 11:09 |
sean-k-mooney | the reverts to me are still the better long-term solution, as overall I think it's more complex to have 2 ways to delete things | 11:09 |
gibi | sean-k-mooney: but the automatic reclaim of soft deleted instance should not win over a user initiated restore of the same soft deleted instance | 11:10 |
sean-k-mooney | although the force flag patch is slightly smaller | 11:10 |
sean-k-mooney | hum... I can see that, although I also don't think we should support soft delete in general | 11:12 |
sean-k-mooney | so you are saying the added complexity to support soft delete requires force | 11:12 |
sean-k-mooney | to ensure we can restore on the same host | 11:13 |
sean-k-mooney | without other allocations taking up the space | 11:13 |
sean-k-mooney | I'm still not sure we need to use a PUT in the soft-delete case | 11:13 |
gibi | what I'm saying is that a conflicting delete should be forced most of the time, except when the delete conflicts with a restore of a soft-deleted instance | 11:14 |
gibi | conflict means that the restore was received first; we should not blindly remove the allocation in this case as it will lead to an active instance without an allocation | 11:14 |
sean-k-mooney | would we not be able to do that with a lock on the instance uuid | 11:15 |
sean-k-mooney | have restore acquire the lock and have the periodic task try to acquire it too before it does the delete | 11:15 |
gibi | yes a lock would also be a solution | 11:16 |
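The lock-based alternative sketched above can be shown in miniature. This is a hypothetical illustration, not nova code (nova would use oslo.concurrency's `synchronized` decorator; a plain `threading.Lock` keyed by instance uuid is used here only to keep the sketch self-contained): restore and the periodic reclaim task serialize on the same per-instance lock, so whichever wins, the loser sees the updated state and backs off.

```python
import threading
from collections import defaultdict

# Hypothetical per-instance locks; instance state is a plain dict here.
_instance_locks = defaultdict(threading.Lock)

def restore_instance(instance, log):
    # user-initiated restore of a soft-deleted instance
    with _instance_locks[instance["uuid"]]:
        if instance["state"] == "soft_deleted":
            instance["state"] = "active"
            log.append("restored")

def reclaim_soft_deleted(instance, log):
    # periodic task: only reclaim if the instance is still soft-deleted
    with _instance_locks[instance["uuid"]]:
        if instance["state"] == "soft_deleted":
            instance["state"] = "deleted"
            log.append("reclaimed")

inst = {"uuid": "fake-uuid", "state": "soft_deleted"}
events = []
restore_instance(inst, events)
reclaim_soft_deleted(inst, events)  # sees "active" under the lock, does nothing
```

Because the reclaim re-checks the state under the lock, the restore is never clobbered, which is the same guarantee the force=False path is trying to provide.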
sean-k-mooney | if we want to proceed with the force flag for now I'm not necessarily against that | 11:17 |
sean-k-mooney | but I'm still not thrilled by doing deletes with PUT in general | 11:17 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance https://review.opendev.org/c/openstack/nova/+/688802 | 11:17 |
gibi | sean-k-mooney: I'd like to proceed with the force solution; reverting old patches feels messy. | 11:17 |
gibi | I need to go grab lunch, be back in 30 | 11:18 |
sean-k-mooney | ack enjoy | 11:19 |
sean-k-mooney | stephenfin: ya, removing the globals is certainly another valid option. If there is no global state we have nothing to reset. I'll take a look at that quickly | 11:20 |
opendevreview | sean mooney proposed openstack/nova master: db: Handle parameters in DB strings https://review.opendev.org/c/openstack/nova/+/805663 | 11:34 |
opendevreview | sean mooney proposed openstack/nova master: remove module level caching https://review.opendev.org/c/openstack/nova/+/806394 | 11:34 |
sean-k-mooney | stephenfin: ^ they were only ever used in that module, so there was no test coverage for the globals; simple to remove them as you suggested | 11:35 |
* gibi is back | 11:38 | |
gibi | sean-k-mooney: I agree less global is better | 11:39 |
sean-k-mooney | we can re-add them if this really is a problem, but personally I would prefer to use the lru_cache decorator on the function if it came to that | 11:42 |
sean-k-mooney | I think stephen is correct that this likely is not providing much, if any, useful performance benefit today | 11:42 |
gibi | sean-k-mooney: the lru_cache decorator probably stores info on the func def, so that is also a global, just a hidden one. It still needs the same reset logic as a pure module-level global | 11:44 |
gibi | I'm not sure which I like more, but I guess in this case a pure global is easier to notice | 11:44 |
sean-k-mooney | gibi: yes it would, but I just prefer using a library function for caching rather than inventing it ourselves | 11:44 |
sean-k-mooney | anyway im not suggesting we use caching here | 11:45 |
gibi | yepp, reinventing the wheel is the other side of the compromise | 11:45 |
gibi | agree, only add cache if we see slowness | 11:45 |
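gibi's point about lru_cache being a hidden global can be sketched as follows. The function name and return value are made up for illustration; the mechanics are standard `functools.lru_cache`: the cache lives on the function object, so tests reset it via `cache_clear()` instead of a module-level reset helper.

```python
import functools

calls = {"count": 0}

@functools.lru_cache(maxsize=None)
def get_min_service_version(binary):
    calls["count"] += 1          # stands in for an expensive DB query
    return 42

get_min_service_version("nova-compute")   # queries the "db"
get_min_service_version("nova-compute")   # served from the hidden cache
get_min_service_version.cache_clear()     # the reset logic gibi refers to
get_min_service_version("nova-compute")   # queries again after the reset
```

The cached state is just as global as a module-level variable, but `cache_clear()` comes for free from the library rather than being hand-rolled.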
sean-k-mooney | for what it's worth, where we do have our own caching I like the pattern we have in the compute rpc module https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/compute/rpcapi.py#L43-L50 and a few others | 11:46 |
sean-k-mooney | e.g. where we provide a reset_globals function in the module to handle that | 11:47 |
sean-k-mooney | as an external caller I don't need to look at the details then | 11:47 |
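The pattern sean-k-mooney links to can be sketched like this. The names and the cached value are illustrative, not copied from nova: the point is that the cache is an explicit module-level global and the module exports a reset function, so external callers (e.g. test fixtures) can drop the cached state without knowing how it is stored.

```python
# Module-level cached state, reset via an exported helper.
LAST_VERSION = None

def lookup_version():
    # stands in for a service-version DB query
    return "5.13"

def get_pinned_version():
    global LAST_VERSION
    if LAST_VERSION is None:
        LAST_VERSION = lookup_version()
    return LAST_VERSION

def reset_globals():
    """Drop module-level cached state (called between tests)."""
    global LAST_VERSION
    LAST_VERSION = None
```

Compared to lru_cache, the global is visible at a glance in the module, which is the trade-off gibi raises above.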
sean-k-mooney | by the way, speaking of global state, there was a patch I wanted to flag to you | 11:48 |
sean-k-mooney | gibi: https://review.opendev.org/c/openstack/nova/+/804985 | 11:49 |
gibi | looking.. | 11:50 |
gibi | sean-k-mooney: I agree with your comment on ^^. I do think eventlet and python threads do not play nicely together, and we have extensive separation between them on the compute side to prevent issues | 11:52 |
sean-k-mooney | their change would not break anything, but if they deployed the way we suggest, with only 1 thread and multiple processes, I don't think it would fix anything either, right | 11:52 |
sean-k-mooney | they would have to deploy in an "unsupported" configuration where we are using multiple threads currently, right? | 11:53 |
gibi | sean-k-mooney: if they use an oslo lock without external=True then the lock only helps with threads, not processes. | 11:55 |
gibi | and I do think we only support process-based scaling | 11:56 |
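gibi's point, sketched: an in-process lock only serializes the threads of one worker process, so with process-based scaling you need external=True, which oslo.concurrency implements as a file lock. The stand-in below uses `fcntl` directly (POSIX only) to show the mechanism; the lock-file naming is made up for illustration.

```python
import fcntl
import os
import tempfile

def external_lock(name, lock_dir=None):
    """Acquire an exclusive cross-process lock named `name` (POSIX only)."""
    lock_dir = lock_dir or tempfile.gettempdir()
    path = os.path.join(lock_dir, "demo-%s.lock" % name)  # hypothetical naming
    handle = open(path, "w")
    # flock blocks other *processes* (and other open handles of this file),
    # not just other threads of this process
    fcntl.flock(handle, fcntl.LOCK_EX)
    return handle

holder = external_lock("allocation-update")
# ... critical section, safe across worker processes ...
fcntl.flock(holder, fcntl.LOCK_UN)
holder.close()
```

Without the file lock (external=True), two nova-api worker processes would each acquire their own in-process lock and happily race each other.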
sean-k-mooney | looking at their response to my comment https://review.opendev.org/c/openstack/nova/+/804985/3#message-e6eddd2104d423b51d8c4beb14e3eaa7998cf177 | 11:56 |
sean-k-mooney | I'm wondering if we are timing out on an IO operation | 11:57 |
sean-k-mooney | rather than a race | 11:57 |
sean-k-mooney | I think I'm just going to ask them to file a bug. Part of the problem is we don't know what version of nova this happens in | 11:57 |
gibi | sean-k-mooney: yepp, we need to find a way to see if they are really using threads and that is causing the problem, or whether there is a problem even with eventlet | 12:03 |
sean-k-mooney | ok, I have asked them to file a bug report and linked them to the docs on our threading model in case that helps | 12:05 |
sean-k-mooney | gibi: I'm going to try and fix the pep8 issue in bauzas' mdev series and then split out the docs patch, as stephenfin requested, into its own patch at the end. Are you ok to re-review them if I do that? bauzas should be back on Monday or Tuesday but I would prefer to move this forward if we can before hitting FF | 12:15 |
gibi | sean-k-mooney: sure, ping me and I will review it | 12:17 |
sean-k-mooney | oh, it's a mypy failure: we declare we return a string but if we have an exception we return None. That explains why fast8 did not see it but pep8 does | 12:22 |
gibi | yes mypy only runs in the pep8 target not in the fast8 one | 12:24 |
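The failure described above, in miniature. This is an illustrative function, not the actual mdev code: a function annotated `-> str` that returns None on the exception path is rejected by mypy (which runs in the pep8 tox target, not fast8); widening the annotation to `Optional[str]` fixes it.

```python
from typing import Optional

def read_sysfs_value(path: str) -> Optional[str]:
    """Return the stripped contents of `path`, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # With a plain '-> str' annotation, mypy flags this None return
        # as an incompatible return value type.
        return None
```

Runtime behavior is identical either way, which is exactly why plain test runs and fast8 sail past it.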
stephenfin | sean-k-mooney: small nits on https://review.opendev.org/c/openstack/nova/+/806394 | 12:28 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 12:34 |
sean-k-mooney | stephenfin: yep, just saw them come in. I'll fix those up once I finish with bauzas' mdev series | 12:38 |
* sean-k-mooney current todo list is mdev series -> db issue -> multi queue extra specs series -> day of learning project if I get time | 12:40 | |
opendevreview | Merged openstack/nova master: [func test] move port resource request tests https://review.opendev.org/c/openstack/nova/+/801815 | 12:56 |
opendevreview | Merged openstack/nova master: [func test] create pps resource on OVS agent RP https://review.opendev.org/c/openstack/nova/+/787205 | 13:00 |
gibi | lyarwood, stephenfin: I'm not sure I get https://review.opendev.org/c/openstack/nova/+/800086/16/nova/tests/functional/test_servers_resource_request.py#b1669 | 13:02 |
stephenfin | gibi: lyarwood is looking at your earlier statement that those tests are passing even though they wouldn't work in real life, and is saying the tests are therefore misleading | 13:03 |
stephenfin | and therefore it would be better to comment them out until such a time as their real-world equivalent scenario would pass | 13:03 |
lyarwood | stephenfin / gibi ; well I'm saying removing them isn't the way to document that | 13:04 |
lyarwood | right | 13:04 |
lyarwood | just seems odd to add things, remove them because they are passing when they shouldn't be and then later add coverage back in again | 13:04 |
gibi | stephenfin: ahh so it is about the interface attach tests that are started passing earlier than the actual implementation? | 13:04 |
lyarwood | gibi: right | 13:04 |
stephenfin | yup, iiuc | 13:04 |
lyarwood | it's a nit | 13:04 |
lyarwood | I shouldn't really -1 for it, it's just confusing as a reviewer tbh as opposed to leaving them commented out with a note | 13:05 |
lyarwood | IMHO | 13:05 |
gibi | sorry , I got distracted with a call | 13:21 |
gibi | OK, I got confused as you marked code lines that are about booting VMs, not about interface attach | 13:24 |
gibi | anyhow | 13:24 |
gibi | those interface attach tests that are passing are not false positives | 13:24 |
gibi | those scenarios really work | 13:24 |
gibi | but those scenarios are error scenarios | 13:24 |
gibi | so they are pretty useless for the end user | 13:25 |
gibi | I will respin the patches to move the service version checks to the compute.api | 13:26 |
gibi | but I'm not sure what to do about the interface attach tests you commented on | 13:26 |
gibi | I cannot comment them out as they are in the base class where they pass | 13:26 |
lyarwood | ah crap right sorry, ignore that then | 13:27 |
gibi | I can redefine them with an empty body and a comment, but that feels more confusing than actually doing nothing | 13:27 |
sean-k-mooney | you can boot with one ovs and one sriov port today | 13:27 |
gibi | sean-k-mooney: ignore the context of the comment the comment is not about those test cases | 13:27 |
gibi | sean-k-mooney: it is about interface attach test cases that started to pass earlier (and therefore not mentioned in the code at all) than expected | 13:28 |
sean-k-mooney | right, but I'm wondering why they were previously marked as expected failure | 13:28 |
sean-k-mooney | was that because of resource requests? | 13:28 |
gibi | sean-k-mooney: yepp, | 13:28 |
sean-k-mooney | ah ok | 13:28 |
gibi | sean-k-mooney: we had first patches to reject every operation with the new extended resource request format | 13:28 |
gibi | sean-k-mooney: then this patch adds the impl for them and remove the rejection | 13:29 |
sean-k-mooney | got it | 13:29 |
sean-k-mooney | ok, I'll leave it for now since I would have to start at the beginning of the series to review properly | 13:29 |
gibi | sean-k-mooney: yeah it is a long one :) | 13:30 |
sean-k-mooney | well, it's more that if I jump in mid-series without context I won't really understand what you're doing, so it won't be helpful | 13:30 |
sean-k-mooney | and I don't have time today, unfortunately, to load all that context | 13:31 |
sean-k-mooney | but if there is anything you want me to look at just let me know | 13:31 |
gibi | sean-k-mooney: no worries, I got good reviews from stephenfin and lyarwood on the series | 13:31 |
gibi | sean-k-mooney: it is better to spread our effort, like making the mdev series land as well | 13:31 |
sean-k-mooney | ack yep thats why im working on it now | 13:32 |
gibi | cool | 13:32 |
sean-k-mooney | im hoping unified limits can also make it | 13:32 |
gibi | have dansmith's comments been resolved there? | 13:33 |
dansmith | not that I've seen, | 13:33 |
sean-k-mooney | i think melwitt was working on that | 13:33 |
gibi | ack | 13:33 |
dansmith | and I think unified limits had a long way to go to be landable even before | 13:33 |
dansmith | i.e. it's still reaching into oslo library internals and such, last I saw | 13:33 |
sean-k-mooney | oh ok, melwitt raised it as possibly at risk for this cycle but I think she was hopeful it could still be ready in time | 13:34 |
dansmith | wow, okay I... would be surprised | 13:34 |
sean-k-mooney | ill take your word for it since i have not been following the details of it | 13:35 |
dansmith | I didn't see tempest tests for it, | 13:35 |
dansmith | but I definitely shook out some stuff from the glance implementation when I wrote tempest tests for it | 13:36 |
dansmith | I don't see docs, and I would think it needs docs | 13:36 |
sean-k-mooney | there are some https://review.opendev.org/q/topic:%22bp%252Funified-limits-nova%22+(status:open%20OR%20status:merged) | 13:36 |
sean-k-mooney | https://review.opendev.org/c/openstack/tempest/+/790186 and https://review.opendev.org/c/openstack/tempest/+/804311 | 13:36 |
dansmith | I see gibi has a bunch of +2s over the patch I'm -1 on, but I think some of those have code that reaches into oslo.limit | 13:37 |
dansmith | ah okay cool, glad to see those tempest patches | 13:38 |
dansmith | missed them being WIP I guess | 13:38 |
gibi | dansmith: as far as I understand even if we land unified limits it will be a sort of experimental optional feature | 13:38 |
gibi | there was no API impact so I considered it safe to experiment | 13:38 |
dansmith | mmmkay :) | 13:39 |
gibi | :) | 13:39 |
sean-k-mooney | https://www.youtube.com/watch?v=6wJXBUfcIOE | 13:40 |
*** rpittau is now known as rpittau|afk | 13:42 | |
dansmith | gibi: did you see my comment on this patch you have +Wd? https://review.opendev.org/c/openstack/nova/+/712139 | 13:43 |
dansmith | I think those exceptions are wrong, no? | 13:43 |
stephenfin | gibi: I'm +2 on basically the whole QoS series now, bar the things you've already talked about and the last two patches (small issues there) | 13:45 |
gibi | dansmith: it could be a bit more chatty, yes, but ValueError tends to be used to point out which formal parameter got an unexpected value in the function call | 13:46 |
gibi | stephenfin: thanks, much appreciated | 13:47 |
dansmith | gibi: sure, but is that caught and logged appropriately? | 13:47 |
dansmith | gibi: usually that sort of thing ends up not as a trace but as a log line where you just see "ValueError: entity_type" and never know what was going on when it happened | 13:48 |
dansmith | but if so, cool | 13:48 |
gibi | dansmith: thanks for pointing that out. I dropped my vote. I have to go back and see how that error is propagated | 13:48 |
dansmith | gibi: I see there's a test for the specific string, so I guess it's intentional, but I still would like to know that it's turned into something useful :) | 13:49 |
gibi | dansmith: I agree with your concerns | 13:50 |
dansmith | btw, is the plan really to just outright revert the consumer types stuff because of that bug? | 13:53 |
gibi | it felt for multiple cycles that this work is just sitting there waiting, I'm happy that now it got enough attention that we can make progress with it | 13:53 |
dansmith | gibi: definitely been sidelined and I'm very happy to see it moving. having just implemented this for glance I've also got context on it, which is helpful since nova's situation is obviously more complex | 13:54 |
sean-k-mooney | dansmith: i think we are going with the force flag instead of a full revert | 13:54 |
sean-k-mooney | dansmith: although melwitt has also proposed patches for the revert, if we decide that is what we want to do | 13:54 |
dansmith | orly | 13:55 |
sean-k-mooney | sorry, melwitt's revert patches are for the delete of the allocation using PUT | 13:55 |
sean-k-mooney | is there a reason to revert the consumer types if the force flag patch merges | 13:56 |
dansmith | is the force flag safe? | 13:56 |
sean-k-mooney | gibi: that's already making its way through the gate, correct? | 13:56 |
dansmith | that conflict is specifically to avoid deleting or changing data you don't realize has changed right? | 13:56 |
sean-k-mooney | dansmith: it is, but I'm not sure the original premise of this was valid | 13:57 |
gibi | sean-k-mooney: I had to respin due to a couple of missed tests | 13:57 |
gibi | sean-k-mooney: so it is not in the gate yet; it hasn't passed the check queue | 13:57 |
dansmith | sean-k-mooney: premise of what, the force flag or the premise of the conflict? | 13:57 |
sean-k-mooney | the conflict on allocation delete | 13:57 |
sean-k-mooney | or rather the precedent in the non-soft-delete case | 13:58 |
sean-k-mooney | if we got a delete from the user, I think that should have taken precedence over allocation changes in general | 13:59 |
sean-k-mooney | gibi raised a valid concern with soft delete which the force flag patch addresses | 13:59 |
sean-k-mooney | e.g. restore should take precedence over soft delete | 13:59 |
dansmith | the bug discusses a retry, which is what I expect to do when I get a 409 from placement | 14:02 |
dansmith | but the proposed solution is to force? | 14:02 |
sean-k-mooney | perhaps it's better for gibi to summarise, or look at his patch, so I don't butcher the answer :) | 14:03 |
gibi | retry here means read the new allocation with the new generation and null it out in the response | 14:03 |
gibi | so for me that is logically the same as ignoring the current allocation and forcing the delete | 14:04 |
dansmith | null it out in the request? | 14:04 |
gibi | dansmith: yepp the current code deletes an allocation with an empty PUT | 14:04 |
gibi | "empty" as there is the generation | 14:04 |
dansmith | okay, the current proposed patch does a delete with no generation right off the bat if force=true, never tries "nicely" once AFAICT | 14:05 |
gibi | dansmith: correct | 14:06 |
sean-k-mooney | so I think doing an HTTP DELETE is the correct thing if you are not using soft delete and the user requested the instance to be deleted | 14:06 |
dansmith | okay, confused about the "retry" terminology then :) | 14:06 |
sean-k-mooney | which is one of the cases where we will pass force=true | 14:06 |
gibi | dansmith: we can only do a meaningful retry if our original intention depends on the current state of the resource we are updating | 14:07 |
gibi | in many delete cases we don't care about the current state, we just want to delete | 14:07 |
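The two delete paths under discussion can be sketched against a toy in-memory "placement" that enforces consumer generations the way the real API does: the non-forced path PUTs empty allocations with the generation (and can hit the 409-style conflict), while the forced path just DELETEs regardless of state. This is hypothetical illustration code, not the actual nova patch.

```python
class Conflict(Exception):
    """Stands in for placement's HTTP 409 on a generation mismatch."""

class FakePlacement:
    def __init__(self):
        self.allocations = {}  # consumer -> (resources, generation)

    def put_allocations(self, consumer, resources, generation):
        current = self.allocations.get(consumer, (None, 0))
        if generation != current[1]:
            raise Conflict("consumer generation conflict")
        if resources:
            self.allocations[consumer] = (resources, generation + 1)
        else:
            # an empty PUT removes the allocation, but only if the
            # caller's view of the consumer generation was current
            self.allocations.pop(consumer, None)

    def delete_allocations(self, consumer):
        # plain DELETE: no generation check at all
        self.allocations.pop(consumer, None)

def delete_allocation_for_instance(placement, consumer, generation, force=False):
    if force:
        # ignore whatever state the allocation is in and remove it
        placement.delete_allocations(consumer)
    else:
        # raises Conflict if something (e.g. a restore) changed it meanwhile
        placement.put_allocations(consumer, {}, generation)
```

With force=False a concurrent change (bumping the generation) makes the delete fail, giving the caller a chance to reconsider; with force=True the allocation is simply gone.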
dansmith | gibi: yeah, technically now we're just nulling out the allocations and PUTing the thing back as-is, and generation + retry would be the right thing to do in that case, | 14:08 |
dansmith | but I understand, if we're deleting the thing then maybe delete is the right answer | 14:08 |
dansmith | the thing I was saying above was that it seemed like you (all) were saying try the put, and if the put fails, delete to override | 14:11 |
dansmith | but I see that's not what the patch does | 14:11 |
sean-k-mooney | right, no, the patch adds a force flag and we are selectively setting it depending on which behavior we want | 14:12 |
sean-k-mooney | force=true to just delete regardless of state, and force=false if we want an opportunity to reconsider | 14:12 |
gibi | dansmith: if we know we force on failure, then what's the point of trying nicely first? | 14:12 |
dansmith | gibi: I'm not arguing that we should, I was just trying to understand the proposed new behavior | 14:13 |
gibi | ack | 14:13 |
dansmith | I guess I'm wondering why we did the PUT with empty allocations in the first place instead of delete | 14:14 |
sean-k-mooney | we used to do delete | 14:15 |
sean-k-mooney | we started doing put when we added generations | 14:15 |
dansmith | well, right :) | 14:15 |
sean-k-mooney | but that is why we also have the straight revert patches as an option to consider | 14:15 |
sean-k-mooney | melwitt, myself and possibly others were wondering the same thing | 14:15 |
sean-k-mooney | the force flag patch is kind of a middle ground | 14:16 |
dansmith | it also seems super odd to have force=True as the default | 14:16 |
dansmith | I'm guessing that's just for expediency, but.. | 14:17 |
sean-k-mooney | I think in most cases we really do just want to delete, not PUT | 14:17 |
sean-k-mooney | although calling it force is perhaps questionable in that case | 14:17 |
dansmith | right | 14:17 |
dansmith | inverting the logic to "extra_checks=False" or something if you opt into it might be good, | 14:18 |
dansmith | but again, | 14:18 |
dansmith | we're opting out of the force for things like move and soft delete, | 14:18 |
dansmith | which I kinda expect to exist more places than actual instance delete | 14:18 |
dansmith | anyway, I didn't mean to come in here and throw a wrench in things, I just wanted to understand what as going on | 14:18 |
dansmith | I saw 409s in failed runs yesterday, but I assumed those were just retries and didn't give it much thought | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: Provide the mdev class for every PCI device https://review.opendev.org/c/openstack/nova/+/802918 | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: Provide and use other RCs for mdevs if needed https://review.opendev.org/c/openstack/nova/+/803233 | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: Expose the mdev class https://review.opendev.org/c/openstack/nova/+/801743 | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] update vgpu docs to account for generic mdev support https://review.opendev.org/c/openstack/nova/+/806412 | 14:19 |
artom | Oooohhhh | 14:21 |
* artom just fell in love with Gerrit v3 | 14:21 | |
sean-k-mooney | this will be good. why so | 14:21 |
artom | If something's changed between PSN and PSN+1 that's not actually part of the change itself, it'll show the diff in yellow, not green | 14:21 |
sean-k-mooney | ah yes | 14:22 |
sean-k-mooney | it does | 14:22 |
dansmith | and sometimes blue right? | 14:22 |
sean-k-mooney | the colours depend on light vs dark mode | 14:22 |
dansmith | I haven't figured out the pattern | 14:22 |
dansmith | ah | 14:22 |
artom | I've only seen yellow so far | 14:22 |
dansmith | yeah, probably the dark switch, I had one machine in dark the other day | 14:23 |
artom | Reading the scrollback, what *was* the logic behind PUTing empty allocations instead of outright DELETE? | 14:23 |
artom | We had consumer generations, so... just because we could? | 14:23 |
artom | Like, we're deleting, why do races matter? | 14:23 |
dansmith | artom: well, I'm kinda wondering if it's related to races during move and such | 14:23 |
dansmith | artom: we delete allocations for all kinds of reasons other than deleting an instance | 14:23 |
artom | So the idea is - before we delete, make sure we have the latest version so that we can then POST it to a new RP? | 14:24 |
dansmith | which is why it seems to me like maybe we should only force=true on instance delete, and default to not everywhere else, instead of what is proposed which is always force | 14:24 |
sean-k-mooney | so just invert the default for force | 14:25 |
dansmith | we're not really checking the allocations though, so, probably not worth it | 14:25 |
dansmith | anyway, it just feels like we're removing a fence because it's currently getting in the way :) | 14:26 |
artom | Yeah - hence the question - why did we put the fence there in the first place? | 14:26 |
dansmith | well, generations went in to avoid us clobbering allocations all the time | 14:27 |
dansmith | during things like moves | 14:27 |
sean-k-mooney | artom: this is the original commit that added the PUT behavior https://review.opendev.org/c/openstack/nova/+/591597 | 14:28 |
artom | So, I'm sure there's a legitimate reason for it, but I don't grok it. Even during a move it, why would there be 2 bits of code trying to update allocations concurrently? | 14:28 |
artom | *move operation | 14:28 |
sean-k-mooney | it has some motivation for it in the commit message but not a full explanation | 14:28 |
sean-k-mooney | the main motivation I got from that was to intentionally be able to put the instance into error on delete if it was in an inconsistent state | 14:30 |
sean-k-mooney | to presumably give the operator or user a chance to investigate | 14:30 |
dansmith | originally before we were using migration allocations I think both sides fought over who was going to claim the resources and delete the other | 14:30 |
artom | sean-k-mooney, that's for instance deletion, dansmith is talking about moves needing generations | 14:30 |
dansmith | and of course same-host migrations made that hard | 14:31 |
artom | ... so we did a thing in placement because Nova was dumb? | 14:31 |
artom | Ah, resize to same host... | 14:31 |
sean-k-mooney | isn't that why placement was created :) | 14:31 |
dansmith | artom: no dude, placement was/is accessed by multiple services at once | 14:31 |
artom | Hah | 14:31 |
sean-k-mooney | but here this is a distributed lock problem | 14:32 |
sean-k-mooney | in that we don't have one | 14:32 |
artom | dansmith, right, I know that. So was this something like Nova updating the VCPUs or whatever, while Neutron concurrently tries to update something about the port? | 14:32 |
dansmith | well, generation is the lock | 14:33 |
sean-k-mooney | so generations were added ^ yep | 14:33 |
dansmith | I'm trying to remember how this works, but is the generation on the consumer actually? | 14:33 |
dansmith | such that one instance moving between two RPs shares a generation? | 14:33 |
sean-k-mooney | I think the generation is on the allocation | 14:33 |
sean-k-mooney | I don't know if they share the same generation when we have both in a migration case | 14:34 |
dansmith | we don't when migration is holding the resources on the other, | 14:35 |
* sean-k-mooney closes tabs... I have/had 4 tabs with different versions of the same patch open and it was confusing the hell out of me | 14:35 | |
dansmith | but the generation is on the allocation for the consumer across both compute nodes (if you do it that way) | 14:35 |
dansmith | so even with both compute nodes being "separate" they both had to update the allocation for the instance, | 14:35 |
dansmith | to claim resources on incoming and drop the resources from the outgoing | 14:35 |
dansmith | but this delete is actually _deleting_ the allocation | 14:36 |
dansmith | so this is from after that concept of a single allocation against both providers | 14:36 |
sean-k-mooney | ah, so we don't have 2 independent allocations | 14:37 |
dansmith | right | 14:37 |
sean-k-mooney | I thought we used the migration uuid to create an independent allocation | 14:37 |
dansmith | oh we do *now*, | 14:37 |
sean-k-mooney | ah ok | 14:37 |
dansmith | I'm saying before we did that, we were creating allocations for the instance against both computes | 14:38 |
sean-k-mooney | you were describing your initial implementation | 14:38 |
sean-k-mooney | before you refactored it | 14:38 |
dansmith | which required both to modify it during/after the migration | 14:38 |
dansmith | not mine, but the, yes | 14:38 |
dansmith | mine was the "use migration allocation" change | 14:38 |
sean-k-mooney | right ok that aligns with what i recall | 14:38 |
sean-k-mooney | so you think this was really only need before the move to seperate allocations? | 14:39 |
sean-k-mooney | or do you think it's still useful now, the PUT instead of delete? | 14:39 |
dansmith | I think the PUT came after we were already using two separate allocations, but not positive | 14:40 |
sean-k-mooney | for migrations that is | 14:40 |
sean-k-mooney | ok im not sure of the time line but that sound right | 14:41 |
dansmith | gibi wrote that initial patch you linked above to move *away* from delete specifically to get generation stuff, | 14:41 |
dansmith | and it was part of "nested allocation candidates" | 14:41 |
sean-k-mooney | the migration allocation change was a long time ago, relatively speaking | 14:41 |
dansmith | right | 14:41 |
sean-k-mooney | oh it was for nested resource providers | 14:41 |
sean-k-mooney | so maybe for port resource requests? | 14:42 |
dansmith | well, that's kinda what I was alluding to above, | 14:42 |
dansmith | with multiple things accessing here | 14:42 |
sean-k-mooney | ack | 14:42 |
dansmith | like does PUT {} avoid nuking a sub-allocation that neutron hasn't yet cleaned up? | 14:43 |
sean-k-mooney | i dont think we technically support neutron modifying the allocation directly | 14:43 |
sean-k-mooney | i think nova always handles that on its behalf | 14:43 |
sean-k-mooney | but valid question | 14:43 |
dansmith | okay I thought in some cases it would.. wasn't that the point of that blueprint? | 14:44 |
dansmith | I wasn't really involved in it so I dunno | 14:44 |
gibi | it was definitely before port resource request as nested a_c is a prerequisite for that | 14:44 |
sean-k-mooney | so there was the open question about modifying QOS policies for bound ports | 14:44 |
sean-k-mooney | which could change the allocations | 14:44 |
gibi | the only case neutron modifies allocations ^^ | 14:44 |
sean-k-mooney | but i dont think we support that yet do we | 14:44 |
gibi | we do | 14:45 |
sean-k-mooney | oh ok | 14:45 |
gibi | you can change QoS policy on a bound port, neutron will try to amend the instance allocation according to the new policy | 14:45 |
sean-k-mooney | so ya, changing the bandwidth or pps qos amount on a bound port would be the only time neutron could modify the allocation i think | 14:45 |
gibi | yepp I think too | 14:45 |
dansmith | okay, but this was for nested A-C, meaning another allocation against another provider coming back from placement, not any sort of actual nesting of allocations as stored right? | 14:45 |
sean-k-mooney | yes i dont think there is really a concept of nested allocations | 14:46 |
gibi | dansmith: right, we dont have sub allocations, we have multiple RPs in an allocation | 14:46 |
sean-k-mooney | just a single allocation with resources from multiple RPs today | 14:46 |
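To make the point above concrete, here is a sketch of what a single consumer allocation spanning two resource providers looks like. The shape matches placement's `PUT /allocations/{consumer_uuid}` body at microversion 1.28+, but the RP names, UUIDs, and amounts are made up for illustration:

```python
# One consumer allocation pulling resources from two resource
# providers (e.g. the compute RP and a nested bandwidth RP).
# Identifiers and amounts are illustrative only.
allocation = {
    "allocations": {
        "compute-rp-uuid": {
            "resources": {"VCPU": 2, "MEMORY_MB": 2048, "DISK_GB": 20},
        },
        "bandwidth-rp-uuid": {
            "resources": {"NET_BW_EGR_KILOBIT_PER_SEC": 1000},
        },
    },
    # One generation guards the whole consumer, not each provider.
    "consumer_generation": 5,
    "project_id": "project-uuid",
    "user_id": "user-uuid",
}

# There is no nesting of allocations themselves: it is a flat mapping
# of RP uuid -> resources under a single consumer.
rp_count = len(allocation["allocations"])
```

The key structural point from the discussion: the consumer generation sits alongside the whole `allocations` mapping, so any writer touching any of the RPs has to present (and bump) the same generation.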
dansmith | gibi: so you wrote this initial patch to move from delete to put specifically for "benefit" of generations.. are you thinking that was wrong now? | 14:47 |
gibi | dansmith: I tried to figure out this morning the reason why I had that patch | 14:48 |
gibi | the only thing I know is that bumping the placement api to use nested a_cs meant that we got generations as well | 14:48 |
gibi | so we had to do something with generations | 14:48 |
gibi | i.e. the nested a_c support was in a bigger placement microversion than the generation support | 14:49 |
gibi | 1.28 and 1.29 | 14:50 |
gibi | if the request from the user is to delete the instance but that deletion fails due to a parallel operation then I think now that the delete operation should be the one that wins | 14:51 |
dansmith | I don't think this patch was required just because we did generations and nested a_c, because you're updating the version you pass there, | 14:52 |
gibi | on the other hand, if the user requested restore of a soft deleted instance and the timer on the soft delete -> hard delete is racing with that request, then the restore should win | 14:52 |
dansmith | and each call uses whatever version it wants right? | 14:52 |
sean-k-mooney | gibi: microversions are per call though | 14:53 |
dansmith | right | 14:53 |
gibi | right | 14:53 |
gibi | I think the warning at the end of https://docs.openstack.org/placement/latest/placement-api-microversion-history.html#consumer-generation-support shows that at that time we were pretty serious about using generations strictly | 14:54 |
sean-k-mooney | gibi: for the soft delete case we are expecting that this will help because? restore will update the allocation generation and the periodic will fail? | 14:55 |
sean-k-mooney | for soft delete i assume we continue to claim the resources with the allocation until the actual hard delete happens | 14:56 |
sean-k-mooney | so would there be a need to update the allocation at all | 14:56 |
sean-k-mooney | for either soft delete or restore | 14:56 |
gibi | sean-k-mooney: if you are right then there could be no race between restore and soft -> hard delete | 14:56 |
sean-k-mooney | e.g. will the generation actually change | 14:56 |
gibi | btw the microversion history doc states | 14:57 |
dansmith | gibi: I haven't mapped it out, but is there some reason we can't specifically opt-into the nuke-and-destroy behavior when we're deleting an instance and still fail for other cases? | 14:57 |
gibi | Passing an empty allocations object along with a consumer_generation makes PUT /allocations/{consumer_uuid} a safe way to delete allocations for a consumer. | 14:57 |
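The contrast gibi quotes from the docs can be sketched as follows. `FakeSession` just records the HTTP call a real placement client (e.g. a keystoneauth1 adapter) would make; this is illustrative, not nova's actual code:

```python
# Sketch of the two ways to drop a consumer's allocations in
# placement. Illustrative only; FakeSession stands in for a real
# placement-aware HTTP client.

class FakeSession:
    def delete(self, path):
        return ('DELETE', path, None)

    def put(self, path, json=None):
        return ('PUT', path, json)


def delete_allocations(session, consumer_uuid, consumer_generation=None):
    if consumer_generation is None:
        # Blind delete: wins regardless of concurrent updates.
        return session.delete('/allocations/%s' % consumer_uuid)
    # Safe delete (microversion 1.28+): PUT an empty allocations object
    # with the consumer generation. Placement answers 409 if the
    # generation is stale, i.e. someone updated the allocations since
    # we last read them.
    payload = {'allocations': {},
               'consumer_generation': consumer_generation}
    return session.put('/allocations/%s' % consumer_uuid, json=payload)


blind = delete_allocations(FakeSession(), 'consumer-1')
safe = delete_allocations(FakeSession(), 'consumer-1',
                          consumer_generation=3)
```

The design trade-off in the whole thread is which of these two calls nova's instance-delete path should make by default.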
gibi | dansmith: you mean the user should be able to opt into the force? | 14:57 |
dansmith | no | 14:58 |
sean-k-mooney | dansmith: am i think the reason is that technically deletes are not allowed to have request bodies and we did not want to use a query arg | 14:58 |
dansmith | you're force=True by default | 14:58 |
sean-k-mooney | oh you mean the force default | 14:58 |
dansmith | yes | 14:58 |
gibi | dansmith: so just inverting the default | 14:58 |
gibi | dansmith: that is OK to me | 14:58 |
sean-k-mooney | we can set that to false and just pass true for a user delete | 14:58 |
sean-k-mooney | ya im fine with inverting too | 14:58 |
gibi | I think at some point melwitt stated that she used force=True as default as that was the more frequent call | 14:59 |
dansmith | right, I think that's better signaling, and also better for whoever comes along later and uses that to delete an allocation | 14:59 |
dansmith | I think making the dangerous activity explicit is worth the verbosity | 14:59 |
sean-k-mooney | i think its pretty close either way | 14:59 |
sean-k-mooney | in terms of calls | 15:00 |
dansmith | ack | 15:00 |
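The kwarg-default inversion agreed above can be sketched like this. The class and helper names are illustrative, not nova's real `SchedulerReportClient`; the point is only that the dangerous, generation-ignoring delete must be requested explicitly:

```python
# Rough sketch of the interface change being discussed: force defaults
# to False, so callers get the generation-checked removal unless they
# explicitly opt into the blind delete. Names are hypothetical.

class ReportClientSketch:
    def delete_allocation_for_instance(self, instance_uuid, force=False):
        if force:
            # e.g. the user asked to delete the instance: the delete
            # must win over any racing allocation update.
            return ('blind-delete', instance_uuid)
        # Default: PUT {} with the consumer generation, so a parallel
        # update makes this fail with a conflict instead of silently
        # wiping the allocations.
        return ('generation-checked-delete', instance_uuid)


client = ReportClientSketch()
default_call = client.delete_allocation_for_instance('inst-1')
user_delete = client.delete_allocation_for_instance('inst-1', force=True)
```

As dansmith argues, the verbosity of `force=True` at the instance-delete call sites is the explicit signal that the caller intends to nuke the allocation.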
dansmith | gibi: sean-k-mooney: do we know why the consumer types thing suddenly means we need to do this btw? I imagine the answer is just "more chance for the generation to change" but.. why specifically? | 15:40 |
gibi | dansmith: exactly, the consumer_types series merged the two transactions handling the allocation update into one, causing a longer transaction and hence a longer race window | 15:42 |
dansmith | really? I wouldn't expect that to be meaningfully long | 15:43 |
dansmith | as in, I would expect people running this in a larger cloud with lots of stuff going on would have hit this by now and not just in a gate run with few instances at any one point | 15:44 |
dansmith | I saw that analysis from melwitt in the bug and wasn't so sure | 15:44 |
dansmith | but I guess if it's really related to that change only then that has to be it | 15:44 |
gibi | there is a correlation with the consumer_type series landed and the uptick in the conflict failures in the gate | 15:45 |
gibi | I know correlation is not causation but it still feels relevant | 15:46 |
gibi | and the consumer_type series had this transaction logic change | 15:46 |
dansmith | hrm | 15:46 |
gibi | i have no better explanation at the moment | 15:46 |
dansmith | ack | 15:47 |
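The race being debated is the standard optimistic-concurrency pattern. A toy model (not placement's real code) of why two writers reading the same generation means the slower one loses, and why a longer transaction widens the window in which both can read the same value:

```python
# Toy model of placement's consumer-generation check: every successful
# write must present the current generation, which then bumps, so the
# slower of two racing writers gets a conflict. Illustrative only.

class GenerationConflict(Exception):
    """Stands in for placement's 409 on a stale consumer_generation."""


class FakeConsumer:
    def __init__(self):
        self.generation = 0
        self.allocations = {}

    def put_allocations(self, allocations, consumer_generation):
        if consumer_generation != self.generation:
            raise GenerationConflict()
        self.allocations = allocations
        self.generation += 1


consumer = FakeConsumer()
gen_seen_by_a = consumer.generation  # both writers read generation 0
gen_seen_by_b = consumer.generation

consumer.put_allocations({'rp1': {'VCPU': 1}}, gen_seen_by_a)  # A wins
try:
    consumer.put_allocations({}, gen_seen_by_b)  # B loses: stale gen
    raced = False
except GenerationConflict:
    raced = True
```

The longer the write transaction holds between "read generation" and "commit", the more wall-clock time another writer has to slip in and bump it first, which matches gibi's theory about the merged transaction.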
sean-k-mooney | dansmith: sorry was on internal call | 15:53 |
sean-k-mooney | ah am no, i know there is a correlation but not why | 15:54 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Support boot with extended resource request https://review.opendev.org/c/openstack/nova/+/800086 | 16:11 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 16:12 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Support move ops with extended resource request https://review.opendev.org/c/openstack/nova/+/800087 | 16:13 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [func test] refactor interface attach with qos https://review.opendev.org/c/openstack/nova/+/800088 | 16:14 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Support interface attach / detach with new resource request format https://review.opendev.org/c/openstack/nova/+/800089 | 16:15 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [func test] move unshelve test to the proper place https://review.opendev.org/c/openstack/nova/+/793621 | 16:17 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [nova-manage]support extended resource request https://review.opendev.org/c/openstack/nova/+/802060 | 16:18 |
opendevreview | sean mooney proposed openstack/nova master: Remove module level caching https://review.opendev.org/c/openstack/nova/+/806394 | 16:19 |
opendevreview | sean mooney proposed openstack/nova master: db: Handle parameters in DB strings https://review.opendev.org/c/openstack/nova/+/805663 | 16:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Reno for qos-minimum-guaranteed-packet-rate https://review.opendev.org/c/openstack/nova/+/805046 | 16:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance https://review.opendev.org/c/openstack/nova/+/688802 | 16:30 |
gibi | dansmith, sean-k-mooney, melwitt, lyarwood: based on the above discussion I inverted the force default value in ^^ | 16:30 |
gibi | lyarwood, stephenfin: I fixed up the nits in the pps series. thanks for the valuable feedback | 16:31 |
gibi | and with that I end my week. o/ | 16:33 |
sean-k-mooney | gibi: https://review.opendev.org/c/openstack/project-config/+/787523 has now merged so we should have the review priority label available, just an fyi | 16:42 |
gmann | finally :) | 16:45 |
sean-k-mooney | i think https://review.opendev.org/q/project:openstack/nova+Review-Priority:1 is how to use it | 16:46 |
sean-k-mooney | but we dont have any patches with it set | 16:46 |
sean-k-mooney | https://review.opendev.org/q/project:openstack/nova+Review-Priority:0 get hits however | 16:46 |
gmann | or event this, https://review.opendev.org/q/project:openstack/nova+Review-Priority:1+label:Verified%253D1++NOT+label:Workflow%253C%253D-1 | 16:56 |
gmann | sean-k-mooney: ^^ | 16:56 |
gmann | *even | 16:56 |
sean-k-mooney | perhaps we will need at least one patch marked as a priority to test it | 16:57 |
gmann | it can be appended with NOT+owner:self | 16:57 |
gmann | sean-k-mooney: you can test with 0 priority to test query | 16:57 |
melwitt | gibi: ack | 16:58 |
melwitt | dansmith: I didn't think changing to a single transaction would make that much of a timing difference either but the consumer types patch series hit that bug far more often than anything else did. I didn't see anything else in those patches that could possibly be related, so I guessed about the db transaction change. my guess might be wrong | 17:22 |
dansmith | I'm not saying you're wrong, I'm just surprised | 17:22 |
dansmith | and surprised lots of real deployments aren't having trouble with much busier systems before the consumer types patch made the transaction longer | 17:23 |
melwitt | I was surprised too and thought maybe it was a coincidence but once it merged everything was hitting the bug | 17:23 |
dansmith | and if that's really it, I sure hope we're not in for crazy pain if that made the transaction like waaaay longer or something | 17:23 |
melwitt | yeah, same. I haven't seen something like this before | 17:24 |
sean-k-mooney | i know that i have seen conflicts in ci jobs from update available resources before the consumer types change | 17:37 |
sean-k-mooney | so there were already some conflicts happening | 17:38 |
sean-k-mooney | i dont think it caused test failures but i have seen it in the logs | 17:38 |
dansmith | oh definitely | 17:43 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 17:57 |
melwitt | yeah, I know it was occurring prior to consumer types, just saying that on the consumer types series it hit the bug often and after it merged it's happening more often | 18:28 |
melwitt | and re: "yeah, same. I haven't seen something like this before" I haven't seen grouping writes into a single db transaction cause a significant timing difference before | 18:34 |
dansmith | melwitt: it failed zuul, but I don't see any conflict messages in there, although it is one of these "hung until timeout" sorts it seems | 18:53 |
dansmith | maybe that multicell failure is known and some other pattern? | 18:53 |
melwitt | dansmith: you're talking about the failure on the force kwarg patch right? I would expect fails on that to not be the conflict bug and at a glance it looks like it is indeed something different | 18:56 |
melwitt | unrelated to that, this is a new error in the controller compute log on that run: Remote error: DBReferenceError (pymysql.err.IntegrityError) (1452, 'Cannot add or update a child row: a foreign key constraint fails (`nova_cell1`.`instance_info_caches`, CONSTRAINT `instance_info_caches_instance_uuid_fkey` | 18:58 |
melwitt | https://zuul.opendev.org/t/openstack/build/575784eb62da4365ab8a1942da0353bc/log/controller/logs/screen-n-cpu.txt#30972 | 18:58 |
dansmith | I saw that too | 18:59 |
dansmith | maybe that dropped an update and that's why the test waited until timeout? | 18:59 |
melwitt | and on the instance from the failed tempest test: [instance: 99c8640e-cb21-4fa9-bb90-d3361201ce7c] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed | 19:00 |
melwitt | Aug 27 17:45:54.463587 ubuntu-focal-rax-ord-0026150414 nova-compute[111496]: ERROR nova.compute.manager [instance: 99c8640e-cb21-4fa9-bb90-d3361201ce7c] Traceback (most recent call last): | 19:00 |
melwitt | https://zuul.opendev.org/t/openstack/build/575784eb62da4365ab8a1942da0353bc/log/controller/logs/screen-n-cpu.txt#33831 | 19:00 |
* dansmith nods | 19:00 | |
melwitt | server failed to spawn. the info cache error was on a different instance | 19:02 |
melwitt | [instance: 217a81e4-b9cf-4e9e-97a9-edcb0fab8349] Can not refresh info_cache because instance was not found | 19:04 |
melwitt | trying to refresh info cache on an instance that's gone.. that's odd | 19:05 |
melwitt | oh, it looks like a network event went to the compute/cell that the instance was not in, so when it tried to refresh the info cache, it was not found | 19:25 |
opendevreview | Merged openstack/nova stable/wallaby: Avoid modifying the Mock class in test https://review.opendev.org/c/openstack/nova/+/805759 | 21:20 |
opendevreview | Merged openstack/nova master: [func test] move port creation to the NeutronFixture https://review.opendev.org/c/openstack/nova/+/787206 | 21:21 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!