opendevreview | Qiu Fossen proposed openstack/nova master: Allow migrating PMEM's data https://review.opendev.org/c/openstack/nova/+/802225 | 01:31 |
---|---|---|
opendevreview | melanie witt proposed openstack/nova master: Revert "consumer gen: more tests for delete allocation cases" https://review.opendev.org/c/openstack/nova/+/806292 | 01:32 |
opendevreview | melanie witt proposed openstack/nova master: Revert "Consumer gen support for delete instance allocations" https://review.opendev.org/c/openstack/nova/+/806293 | 01:32 |
melwitt | gibi, sean-k-mooney, lyarwood: fyi I have uploaded reverts https://review.opendev.org/c/openstack/nova/+/806292 and https://review.opendev.org/c/openstack/nova/+/806293 as one possible option for addressing the gate bug. another option is the WIP patch https://review.opendev.org/c/openstack/nova/+/688802 to add a "force" kwarg to allocation delete | 01:36 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 01:44 |
*** abhishekk is now known as akekane|home | 06:00 | |
*** akekane|home is now known as abhishekk | 06:00 | |
slaweq | gibi: hi | 06:42 |
slaweq | gibi: I see a lot of job's failures with error on deleting instance, like e.g. https://a91ddba9ffca8a8b5d72-1d82fafbdd07c6a56856d7b1449f73a8.ssl.cf5.rackcdn.com/800059/3/check/neutron-ovn-tempest-ovs-release/fd05a18/testr_results.html | 06:43 |
slaweq | and it's not only on neutron-tempest-plugin jobs | 06:43 |
slaweq | I saw it also on some devstack jobs too | 06:43 |
slaweq | gibi: is it known issue or should I open one? | 06:43 |
gibi | slaweq: hi | 07:04 |
frickler | slaweq: I would be hoping that these are solved by the above reverts https://review.opendev.org/c/openstack/nova/+/806293 etc. | 07:04 |
gibi | just a couple of lines above, melwitt proposed patches | 07:04 |
slaweq | gibi: frickler thx a lot | 07:05 |
gibi | I've just started my day but I will prioritize this issue | 07:05 |
slaweq | gibi: thx, I just wanted to ask as I knew that You will know better if that's something already reported :) | 07:06 |
gibi | sure | 07:06 |
frickler | slaweq: if you check n-cpu.log for errors, you'll see tracebacks about conflicts in placement allocations | 07:07 |
*** rpittau|afk is now known as rpittau | 07:07 | |
slaweq | frickler: yes, now I see it :) | 07:09 |
gibi | melwitt: thanks for proposing the patches. I read up on the bug discussion and the patches, and I prefer the force kwarg: it gives us a more fine-grained approach to handling deletes in different situations | 07:37 |
gibi | I will resolve the merge conflict in https://review.opendev.org/c/openstack/nova/+/688802 and then I will +2 it | 07:37 |
gibi | sean-k-mooney, lyarwood, stephenfin: ^^ | 07:37 |
*** mgoddard- is now known as mgoddard | 07:58 | |
lyarwood | Morning | 08:19 |
* lyarwood reads scrollback | 08:19 | |
gibi | lyarwood: morning | 08:33 |
lyarwood | gibi: sorry got distracted with $child things | 08:55 |
lyarwood | gibi: so how's the rebase going? Are we not going ahead with the reverts just yet? | 08:55 |
gibi | lyarwood: almost done | 08:55 |
lyarwood | ack | 08:55 |
gibi | lyarwood: for me the reverts are more complex than the force kwargs fix | 08:56 |
lyarwood | yup agreed | 08:56 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance https://review.opendev.org/c/openstack/nova/+/688802 | 09:01 |
gibi | lyarwood, sean-k-mooney, stephenfin: ^^ | 09:02 |
lyarwood | gibi: LGTM, had to do a little background reading to make sure but yeah still agree this is the better approach | 09:15 |
gibi | lyarwood: yepp the bug report has a good summary of what we are dealing with | 09:19 |
opendevreview | Fabian Wiesel proposed openstack/nova master: VmWare: Remove unused legacy_nodename regex https://review.opendev.org/c/openstack/nova/+/806336 | 09:28 |
gibi | lyarwood: thanks for the review on the pps series. I replied in https://review.opendev.org/c/openstack/nova/+/793619/15#message-a2916b6a95c6f4061429e9fb2898ed738dfc3427 | 09:46 |
opendevreview | FossenQiu proposed openstack/nova master: Allow migrating PMEM's data https://review.opendev.org/c/openstack/nova/+/802225 | 10:33 |
opendevreview | Fabian Wiesel proposed openstack/nova master: Vmware: Fix spelling in test https://review.opendev.org/c/openstack/nova/+/806348 | 10:36 |
opendevreview | Fabian Wiesel proposed openstack/nova master: VmWare: Use of id shadows built-in function https://review.opendev.org/c/openstack/nova/+/806390 | 10:43 |
lyarwood | gibi: https://review.opendev.org/c/openstack/nova/+/688802 has some unit and functional failures if you have time to look | 10:45 |
opendevreview | Fabian Wiesel proposed openstack/nova master: Vmware: Fix indentation in conditionals https://review.opendev.org/c/openstack/nova/+/806391 | 10:47 |
gibi | lyarwood: whaat? I did run those tests before pushing | 10:49 |
gibi | looking | 10:49 |
lyarwood | yeah assumed you had, weird. | 10:50 |
gibi | obviously I ran it on something else, as I see failures that I haven't seen before :/ | 10:54 |
gibi | anyhow, fixing it ... | 10:54 |
gibi | the two failing functional tests are actually testing our bug, which is now resolved with force=True | 11:09 |
sean-k-mooney | the reverts to me are still the better long-term solution, as overall I think it's more complex to have 2 ways to delete things | 11:09 |
gibi | sean-k-mooney: but the automatic reclaim of soft deleted instance should not win over a user initiated restore of the same soft deleted instance | 11:10 |
sean-k-mooney | although the force flag patch is slightly smaller | 11:10 |
sean-k-mooney | hum... I can see that, although I also don't think we should support soft delete in general | 11:12 |
sean-k-mooney | so you are saying the added complexity to support soft delete requires force | 11:12 |
sean-k-mooney | to ensure we can restore on the same host | 11:13 |
sean-k-mooney | without other allocations taking up the space | 11:13 |
sean-k-mooney | I'm still not sure we need to use a PUT in the soft-delete case | 11:13 |
gibi | what I'm saying is that a conflicting delete should be forced most of the time, except when the delete conflicts with a restore of a soft-deleted instance | 11:14 |
gibi | conflict means that the restore was received first; we should not blindly remove the allocation in this case as it will lead to an active instance without an allocation | 11:14 |
sean-k-mooney | would we not be able to do that with a lock on the instance uuid | 11:15 |
sean-k-mooney | have restore acquire the lock and have the periodic task try to acquire it too before it does the delete | 11:15 |
gibi | yes a lock would also be a solution | 11:16 |
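The lock-based alternative sketched above can be shown in miniature. This is a hypothetical illustration, not nova code (nova would use oslo.concurrency's `synchronized` decorator; a plain `threading.Lock` keyed by instance uuid is used here only to keep the sketch self-contained): restore and the periodic reclaim task serialize on the same per-instance lock, so whichever wins, the loser sees the updated state and backs off.

```python
import threading
from collections import defaultdict

# Hypothetical per-instance locks; instance state is a plain dict here.
_instance_locks = defaultdict(threading.Lock)

def restore_instance(instance, log):
    # user-initiated restore of a soft-deleted instance
    with _instance_locks[instance["uuid"]]:
        if instance["state"] == "soft_deleted":
            instance["state"] = "active"
            log.append("restored")

def reclaim_soft_deleted(instance, log):
    # periodic task: only reclaim if the instance is still soft-deleted
    with _instance_locks[instance["uuid"]]:
        if instance["state"] == "soft_deleted":
            instance["state"] = "deleted"
            log.append("reclaimed")

inst = {"uuid": "fake-uuid", "state": "soft_deleted"}
events = []
restore_instance(inst, events)
reclaim_soft_deleted(inst, events)  # sees "active" under the lock, does nothing
```

Because the reclaim re-checks the state under the lock, the restore is never clobbered, which is the same guarantee the force=False path is trying to provide.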
sean-k-mooney | if we want to proceed with the force flag for now I'm not necessarily against that | 11:17 |
sean-k-mooney | but I'm still not thrilled by doing deletes with PUT in general | 11:17 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance https://review.opendev.org/c/openstack/nova/+/688802 | 11:17 |
gibi | sean-k-mooney: I'd like to proceed with the force solution; reverting old patches feels messy. | 11:17 |
gibi | I need to go grab lunch, be back in 30 | 11:18 |
sean-k-mooney | ack enjoy | 11:19 |
sean-k-mooney | stephenfin: ya, removing the globals is certainly another valid option. If there is no global state we have nothing to reset. I'll take a look at that quickly | 11:20 |
opendevreview | sean mooney proposed openstack/nova master: db: Handle parameters in DB strings https://review.opendev.org/c/openstack/nova/+/805663 | 11:34 |
opendevreview | sean mooney proposed openstack/nova master: remove module level caching https://review.opendev.org/c/openstack/nova/+/806394 | 11:34 |
sean-k-mooney | stephenfin: ^ they were only ever used in that module, so there was no test coverage for the globals; simple to remove them as you suggested | 11:35 |
* gibi is back | 11:38 | |
gibi | sean-k-mooney: I agree less global is better | 11:39 |
sean-k-mooney | we can re-add them if this really is a problem, but personally I would prefer to use the lru_cache decorator on the function if it came to that | 11:42 |
sean-k-mooney | I think stephen is correct that this likely is not providing much, if any, useful performance benefit today | 11:42 |
gibi | sean-k-mooney: the lru_cache decorator probably stores info on the func def, so that is also a global, just a hidden one. It still needs the same reset logic as a pure module-level global | 11:44 |
gibi | I'm not sure which I like more, but I guess in this case a pure global is easier to notice | 11:44 |
sean-k-mooney | gibi: yes it would, but I just prefer using a library function for caching rather than inventing it ourselves | 11:44 |
sean-k-mooney | anyway im not suggesting we use caching here | 11:45 |
gibi | yepp, reinventing the wheel is the other side of the compromise | 11:45 |
gibi | agree, only add cache if we see slowness | 11:45 |
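gibi's point about lru_cache being a hidden global can be sketched as follows. The function name and return value are made up for illustration; the mechanics are standard `functools.lru_cache`: the cache lives on the function object, so tests reset it via `cache_clear()` instead of a module-level reset helper.

```python
import functools

calls = {"count": 0}

@functools.lru_cache(maxsize=None)
def get_min_service_version(binary):
    calls["count"] += 1          # stands in for an expensive DB query
    return 42

get_min_service_version("nova-compute")   # queries the "db"
get_min_service_version("nova-compute")   # served from the hidden cache
get_min_service_version.cache_clear()     # the reset logic gibi refers to
get_min_service_version("nova-compute")   # queries again after the reset
```

The cached state is just as global as a module-level variable, but `cache_clear()` comes for free from the library rather than being hand-rolled.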
sean-k-mooney | for what it's worth, where we do have our own caching I like the pattern we have in the compute rpc module https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/compute/rpcapi.py#L43-L50 and a few others | 11:46 |
sean-k-mooney | e.g. where we provide a reset_globals function in the module to handle that | 11:47 |
sean-k-mooney | as an external caller I don't need to look at the details then | 11:47 |
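The pattern sean-k-mooney links to can be sketched like this. The names and the cached value are illustrative, not copied from nova: the point is that the cache is an explicit module-level global and the module exports a reset function, so external callers (e.g. test fixtures) can drop the cached state without knowing how it is stored.

```python
# Module-level cached state, reset via an exported helper.
LAST_VERSION = None

def lookup_version():
    # stands in for a service-version DB query
    return "5.13"

def get_pinned_version():
    global LAST_VERSION
    if LAST_VERSION is None:
        LAST_VERSION = lookup_version()
    return LAST_VERSION

def reset_globals():
    """Drop module-level cached state (called between tests)."""
    global LAST_VERSION
    LAST_VERSION = None
```

Compared to lru_cache, the global is visible at a glance in the module, which is the trade-off gibi raises above.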
sean-k-mooney | by the way, speaking of global state, there was a patch I wanted to flag to you | 11:48 |
sean-k-mooney | gibi: https://review.opendev.org/c/openstack/nova/+/804985 | 11:49 |
gibi | looking.. | 11:50 |
gibi | sean-k-mooney: I agree with your comment on ^^. I do think eventlet and python threads do not play nicely together, and we have extensive separation between them on the compute side to prevent issues | 11:52 |
sean-k-mooney | their change would not break anything, but if they deployed the way we suggest, with only 1 thread and multiple processes, I don't think it would fix anything either, right | 11:52 |
sean-k-mooney | they would have to deploy in an "unsupported" configuration where we are using multiple threads currently, right? | 11:53 |
gibi | sean-k-mooney: if they use an oslo lock without external=True then the lock only helps with threads, not processes. | 11:55 |
gibi | and I do think we only support process-based scaling | 11:56 |
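gibi's point, sketched: an in-process lock only serializes the threads of one worker process, so with process-based scaling you need external=True, which oslo.concurrency implements as a file lock. The stand-in below uses `fcntl` directly (POSIX only) to show the mechanism; the lock-file naming is made up for illustration.

```python
import fcntl
import os
import tempfile

def external_lock(name, lock_dir=None):
    """Acquire an exclusive cross-process lock named `name` (POSIX only)."""
    lock_dir = lock_dir or tempfile.gettempdir()
    path = os.path.join(lock_dir, "demo-%s.lock" % name)  # hypothetical naming
    handle = open(path, "w")
    # flock blocks other *processes* (and other open handles of this file),
    # not just other threads of this process
    fcntl.flock(handle, fcntl.LOCK_EX)
    return handle

holder = external_lock("allocation-update")
# ... critical section, safe across worker processes ...
fcntl.flock(holder, fcntl.LOCK_UN)
holder.close()
```

Without the file lock (external=True), two nova-api worker processes would each acquire their own in-process lock and happily race each other.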
sean-k-mooney | looking at their response to my comment https://review.opendev.org/c/openstack/nova/+/804985/3#message-e6eddd2104d423b51d8c4beb14e3eaa7998cf177 | 11:56 |
sean-k-mooney | I'm wondering if we are timing out on an IO operation | 11:57 |
sean-k-mooney | rather than a race | 11:57 |
sean-k-mooney | I think I'm just going to ask them to file a bug. Part of the problem is we don't know what version of nova this happens in | 11:57 |
gibi | sean-k-mooney: yepp, we need to find a way to see if they are really using threads and that is causing the problem, or whether there is a problem even with eventlet | 12:03 |
sean-k-mooney | ok, I have asked them to file a bug report and linked them to the docs on our threading model in case that helps | 12:05 |
sean-k-mooney | gibi: I'm going to try and fix the pep8 issue in bauzas' mdev series and then split out the docs patch, as stephenfin requested, into its own patch at the end. Are you ok to re-review them if I do that? bauzas should be back on Monday or Tuesday but I would prefer to move this forward if we can before hitting FF | 12:15 |
gibi | sean-k-mooney: sure, ping me and I will review it | 12:17 |
sean-k-mooney | oh, it's a mypy failure: we declare we return a string but if we have an exception we return None. That explains why fast8 did not see it but pep8 does | 12:22 |
gibi | yes mypy only runs in the pep8 target not in the fast8 one | 12:24 |
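The failure described above, in miniature. This is an illustrative function, not the actual mdev code: a function annotated `-> str` that returns None on the exception path is rejected by mypy (which runs in the pep8 tox target, not fast8); widening the annotation to `Optional[str]` fixes it.

```python
from typing import Optional

def read_sysfs_value(path: str) -> Optional[str]:
    """Return the stripped contents of `path`, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # With a plain '-> str' annotation, mypy flags this None return
        # as an incompatible return value type.
        return None
```

Runtime behavior is identical either way, which is exactly why plain test runs and fast8 sail past it.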
stephenfin | sean-k-mooney: small nits on https://review.opendev.org/c/openstack/nova/+/806394 | 12:28 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 12:34 |
sean-k-mooney | stephenfin: yep, just saw them come in. I'll fix those up once I finish with bauzas' mdev series | 12:38 |
* sean-k-mooney current todo list is mdev series -> db issue -> multi queue extra specs series -> day of learning project if I get time | 12:40 | |
opendevreview | Merged openstack/nova master: [func test] move port resource request tests https://review.opendev.org/c/openstack/nova/+/801815 | 12:56 |
opendevreview | Merged openstack/nova master: [func test] create pps resource on OVS agent RP https://review.opendev.org/c/openstack/nova/+/787205 | 13:00 |
gibi | lyarwood, stephenfin: I'm not sure I get https://review.opendev.org/c/openstack/nova/+/800086/16/nova/tests/functional/test_servers_resource_request.py#b1669 | 13:02 |
stephenfin | gibi: lyarwood is looking at your earlier statement that those tests are passing even though they wouldn't work in real life, and is saying the tests are therefore misleading | 13:03 |
stephenfin | and therefore it would be better to comment them out until such a time as their real-world equivalent scenario would pass | 13:03 |
lyarwood | stephenfin / gibi ; well I'm saying removing them isn't the way to document that | 13:04 |
lyarwood | right | 13:04 |
lyarwood | just seems odd to add things, remove them because they are passing when they shouldn't be and then later add coverage back in again | 13:04 |
gibi | stephenfin: ahh so it is about the interface attach tests that are started passing earlier than the actual implementation? | 13:04 |
lyarwood | gibi: right | 13:04 |
stephenfin | yup, iiuc | 13:04 |
lyarwood | it's a nit | 13:04 |
lyarwood | I shouldn't really -1 for it, it's just confusing as a reviewer tbh as opposed to leaving them commented out with a note | 13:05 |
lyarwood | IMHO | 13:05 |
gibi | sorry , I got distracted with a call | 13:21 |
gibi | OK, I got confused as you marked code lines that are about booting VMs, not about interface attach | 13:24 |
gibi | anyhow | 13:24 |
gibi | those interface attach tests that are passing are not false positives | 13:24 |
gibi | those scenarios really work | 13:24 |
gibi | but those scenarios are error scenarios | 13:24 |
gibi | so they are pretty useless for the end user | 13:25 |
gibi | I will respin the patches to move the service version checks to the compute.api | 13:26 |
gibi | but I'm not sure what to do about the interface attach tests you commented on | 13:26 |
gibi | I cannot comment them out as they are in the base class where they pass | 13:26 |
lyarwood | ah crap right sorry, ignore that then | 13:27 |
gibi | I can redefine them with an empty body and a comment, but that feels more confusing than actually doing nothing | 13:27 |
sean-k-mooney | you can boot with one ovs and one sriov port today | 13:27 |
gibi | sean-k-mooney: ignore the context of the comment the comment is not about those test cases | 13:27 |
gibi | sean-k-mooney: it is about interface attach test cases that started to pass earlier (and therefore not mentioned in the code at all) than expected | 13:28 |
sean-k-mooney | right, but I'm wondering why they were previously marked as expected failure | 13:28 |
sean-k-mooney | was that because of resource requests? | 13:28 |
gibi | sean-k-mooney: yepp, | 13:28 |
sean-k-mooney | ah ok | 13:28 |
gibi | sean-k-mooney: we had first patches to reject every operation with the new extended resource request format | 13:28 |
gibi | sean-k-mooney: then this patch adds the impl for them and remove the rejection | 13:29 |
sean-k-mooney | got it | 13:29 |
sean-k-mooney | ok, I'll leave it for now since I would have to start at the beginning of the series to review properly | 13:29 |
gibi | sean-k-mooney: yeah it is a long one :) | 13:30 |
sean-k-mooney | well, it's more that if I jump in mid-series without context I won't really understand what you're doing, so it won't be helpful | 13:30 |
sean-k-mooney | and I don't have time today, unfortunately, to load all that context | 13:31 |
sean-k-mooney | but if there is anything you want me to look at just let me know | 13:31 |
gibi | sean-k-mooney: no worries, I got good reviews from stephenfin and lyarwood on the series | 13:31 |
gibi | sean-k-mooney: it is better to spread our effort, like making the mdev series land as well | 13:31 |
sean-k-mooney | ack yep thats why im working on it now | 13:32 |
gibi | cool | 13:32 |
sean-k-mooney | im hoping unified limits can also make it | 13:32 |
gibi | have dansmith's comments been resolved there? | 13:33 |
dansmith | not that I've seen, | 13:33 |
sean-k-mooney | i think melwitt was working on that | 13:33 |
gibi | ack | 13:33 |
dansmith | and I think unified limits had a long way to go to be landable even before | 13:33 |
dansmith | i.e. it's still reaching into oslo library internals and such, last I saw | 13:33 |
sean-k-mooney | oh ok, melwitt raised it as possibly at risk for this cycle but I think she was hopeful it could still be ready in time | 13:34 |
dansmith | wow, okay I... would be surprised | 13:34 |
sean-k-mooney | ill take your word for it since i have not been following the details of it | 13:35 |
dansmith | I didn't see tempest tests for it, | 13:35 |
dansmith | but I definitely shook out some stuff from the glance implementation when I wrote tempest tests for it | 13:36 |
dansmith | I don't see docs, and I would think it needs docs | 13:36 |
sean-k-mooney | there are some https://review.opendev.org/q/topic:%22bp%252Funified-limits-nova%22+(status:open%20OR%20status:merged) | 13:36 |
sean-k-mooney | https://review.opendev.org/c/openstack/tempest/+/790186 and https://review.opendev.org/c/openstack/tempest/+/804311 | 13:36 |
dansmith | I see gibi has a bunch of +2s over the patch I'm -1 on, but I think some of those have code that reaches into oslo.limit | 13:37 |
dansmith | ah okay cool, glad to see those tempest patches | 13:38 |
dansmith | missed them being WIP I guess | 13:38 |
gibi | dansmith: as far as I understand even if we land unified limits it will be a sort of experimental optional feature | 13:38 |
gibi | there was no API impact so I considered it safe to experiment | 13:38 |
dansmith | mmmkay :) | 13:39 |
gibi | :) | 13:39 |
sean-k-mooney | https://www.youtube.com/watch?v=6wJXBUfcIOE | 13:40 |
*** rpittau is now known as rpittau|afk | 13:42 | |
dansmith | gibi: did you see my comment on this patch you have +Wd? https://review.opendev.org/c/openstack/nova/+/712139 | 13:43 |
dansmith | I think those exceptions are wrong, no? | 13:43 |
stephenfin | gibi: I'm +2 on basically the whole QoS series now, bar the things you've already talked about and the last two patches (small issues there) | 13:45 |
gibi | dansmith: it could be a bit more chatty, yes, but ValueError tends to be used to point out which formal parameter got an unexpected value in the function call | 13:46 |
gibi | stephenfin: thanks, much appreciated | 13:47 |
dansmith | gibi: sure, but is that caught and logged appropriately? | 13:47 |
dansmith | gibi: usually that sort of thing ends up not as a trace but as a log line where you just see "ValueError: entity_type" and never know what was going on when it happened | 13:48 |
dansmith | but if so, cool | 13:48 |
gibi | dansmith: thanks for pointing that out. I dropped my vote. I have to go back and see how that error is propagated | 13:48 |
dansmith | gibi: I see there's a test for the specific string, so I guess it's intentional, but I still would like to know that it's turned into something useful :) | 13:49 |
gibi | dansmith: I agree with your concerns | 13:50 |
dansmith | btw, is the plan really to just outright revert the consumer types stuff because of that bug? | 13:53 |
gibi | it felt for multiple cycles that this work is just sitting there waiting, I'm happy that now it got enough attention that we can make progress with it | 13:53 |
dansmith | gibi: definitely been sidelined and I'm very happy to see it moving. having just implemented this for glance I've also got context on it, which is helpful since nova's situation is obviously more complex | 13:54 |
sean-k-mooney | dansmith: i think we are going with the force flag instead of a full revert | 13:54 |
sean-k-mooney | dansmith: although melwitt has also proposed patches for the revert, if we decide that is what we want to do | 13:54 |
dansmith | orly | 13:55 |
sean-k-mooney | sorry, melwitt's revert patches are for the delete of the allocation using PUT | 13:55 |
sean-k-mooney | is there a reason to revert the consumer types if the force flag patch merges | 13:56 |
dansmith | is the force flag safe? | 13:56 |
sean-k-mooney | gibi: that's already making its way through the gate, correct? | 13:56 |
dansmith | that conflict is specifically to avoid deleting or changing data you don't realize has changed right? | 13:56 |
sean-k-mooney | dansmith: it is, but I'm not sure the original premise of this was valid | 13:57 |
gibi | sean-k-mooney: I had to respin due to a couple of missed tests | 13:57 |
gibi | sean-k-mooney: so it is not in the gate yet; it hasn't passed the check queue | 13:57 |
dansmith | sean-k-mooney: premise of what, the force flag or the premise of the conflict? | 13:57 |
sean-k-mooney | the conflict on allocation delete | 13:57 |
sean-k-mooney | or rather the precedent in the non-soft-delete case | 13:58 |
sean-k-mooney | if we got a delete from the user, I think that should have taken precedence over allocation changes in general | 13:59 |
sean-k-mooney | gibi raised a valid concern with soft delete which the force flag patch addresses | 13:59 |
sean-k-mooney | e.g. restore should take precedence over soft delete | 13:59 |
dansmith | the bug discusses a retry, which is what I expect to do when I get a 409 from placement | 14:02 |
dansmith | but the proposed solution is to force? | 14:02 |
sean-k-mooney | perhaps it's better for gibi to summarise, or look at his patch, so I don't butcher the answer :) | 14:03 |
gibi | retry here means read the new allocation with the new generation and null it out in the response | 14:03 |
gibi | so for me that is logically the same as ignoring the current allocation and forcing the delete | 14:04 |
dansmith | null it out in the request? | 14:04 |
gibi | dansmith: yepp the current code deletes an allocation with an empty PUT | 14:04 |
gibi | "empty" as there is the generation | 14:04 |
dansmith | okay, the current proposed patch does a delete with no generation right off the bat if force=true, never tries "nicely" once AFAICT | 14:05 |
gibi | dansmith: correct | 14:06 |
sean-k-mooney | so I think doing an HTTP DELETE is the correct thing if you are not using soft delete and the user requested the instance to be deleted | 14:06 |
dansmith | okay, confused about the "retry" terminology then :) | 14:06 |
sean-k-mooney | which is one of the cases where we will pass force=true | 14:06 |
gibi | dansmith: we can only do a meaningful retry if our original intention depends on the current state of the resource we are updating | 14:07 |
gibi | in many delete cases we don't care about the current state, we just want to delete | 14:07 |
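The two delete paths under discussion can be sketched against a toy in-memory "placement" that enforces consumer generations the way the real API does: the non-forced path PUTs empty allocations with the generation (and can hit the 409-style conflict), while the forced path just DELETEs regardless of state. This is hypothetical illustration code, not the actual nova patch.

```python
class Conflict(Exception):
    """Stands in for placement's HTTP 409 on a generation mismatch."""

class FakePlacement:
    def __init__(self):
        self.allocations = {}  # consumer -> (resources, generation)

    def put_allocations(self, consumer, resources, generation):
        current = self.allocations.get(consumer, (None, 0))
        if generation != current[1]:
            raise Conflict("consumer generation conflict")
        if resources:
            self.allocations[consumer] = (resources, generation + 1)
        else:
            # an empty PUT removes the allocation, but only if the
            # caller's view of the consumer generation was current
            self.allocations.pop(consumer, None)

    def delete_allocations(self, consumer):
        # plain DELETE: no generation check at all
        self.allocations.pop(consumer, None)

def delete_allocation_for_instance(placement, consumer, generation, force=False):
    if force:
        # ignore whatever state the allocation is in and remove it
        placement.delete_allocations(consumer)
    else:
        # raises Conflict if something (e.g. a restore) changed it meanwhile
        placement.put_allocations(consumer, {}, generation)
```

With force=False a concurrent change (bumping the generation) makes the delete fail, giving the caller a chance to reconsider; with force=True the allocation is simply gone.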
dansmith | gibi: yeah, technically now we're just nulling out the allocations and PUTing the thing back as-is, and generation + retry would be the right thing to do in that case, | 14:08 |
dansmith | but I understand, if we're deleting the thing then maybe delete is the right answer | 14:08 |
dansmith | the thing I was saying above was that it seemed like you (all) were saying try the put, and if the put fails, delete to override | 14:11 |
dansmith | but I see that's not what the patch does | 14:11 |
sean-k-mooney | right, no, the patch adds a force flag and we are selectively setting it depending on which behavior we want | 14:12 |
sean-k-mooney | force=true to just delete regardless of state, and force=false if we want an opportunity to reconsider | 14:12 |
gibi | dansmith: if we know we force on failure, then what's the point of trying nicely first? | 14:12 |
dansmith | gibi: I'm not arguing that we should, I was just trying to understand the proposed new behavior | 14:13 |
gibi | ack | 14:13 |
dansmith | I guess I'm wondering why we did the PUT with empty allocations in the first place instead of delete | 14:14 |
sean-k-mooney | we used to do delete | 14:15 |
sean-k-mooney | we started doing put when we added generations | 14:15 |
dansmith | well, right :) | 14:15 |
sean-k-mooney | but that is why we also have the straight revert patches as an option to consider | 14:15 |
sean-k-mooney | melwitt, myself and possibly others were wondering the same thing | 14:15 |
sean-k-mooney | the force flag patch is kind of a middle ground | 14:16 |
dansmith | it also seems super odd to have force=True as the default | 14:16 |
dansmith | I'm guessing that's just for expediency, but.. | 14:17 |
sean-k-mooney | I think in most cases we really do just want to delete, not PUT | 14:17 |
sean-k-mooney | although calling it force is perhaps questionable in that case | 14:17 |
dansmith | right | 14:17 |
dansmith | inverting the logic to "extra_checks=False" or something if you opt into it might be good, | 14:18 |
dansmith | but again, | 14:18 |
dansmith | we're opting out of the force for things like move and soft delete, | 14:18 |
dansmith | which I kinda expect to exist more places than actual instance delete | 14:18 |
dansmith | anyway, I didn't mean to come in here and throw a wrench in things, I just wanted to understand what as going on | 14:18 |
dansmith | I saw 409s in failed runs yesterday, but I assumed those were just retries and didn't give it much thought | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: Provide the mdev class for every PCI device https://review.opendev.org/c/openstack/nova/+/802918 | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: Provide and use other RCs for mdevs if needed https://review.opendev.org/c/openstack/nova/+/803233 | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: Expose the mdev class https://review.opendev.org/c/openstack/nova/+/801743 | 14:19 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] update vgpu docs to account for generic mdev support https://review.opendev.org/c/openstack/nova/+/806412 | 14:19 |
artom | Oooohhhh | 14:21 |
* artom just fell in love with Gerrit v3 | 14:21 | |
sean-k-mooney | this will be good. why so | 14:21 |
artom | If something's changed between PSN and PSN+1 that's not actually part of the change itself, it'll show the diff in yellow, not green | 14:21 |
sean-k-mooney | ah yes | 14:22 |
sean-k-mooney | it does | 14:22 |
dansmith | and sometimes blue right? | 14:22 |
sean-k-mooney | the colours depend on light vs dark mode | 14:22 |
dansmith | I haven't figured out the pattern | 14:22 |
dansmith | ah | 14:22 |
artom | I've only seen yellow so far | 14:22 |
dansmith | yeah, probably the dark switch, I had one machine in dark the other day | 14:23 |
artom | Reading the scrollback, what *was* the logic behind PUTing empty allocations instead of outright DELETE? | 14:23 |
artom | We had consumer generations, so... just because we could? | 14:23 |
artom | Like, we're deleting, why do races matter? | 14:23 |
dansmith | artom: well, I'm kinda wondering if it's related to races during move and such | 14:23 |
dansmith | artom: we delete allocations for all kinds of reasons other than deleting an instance | 14:23 |
artom | So the idea is - before we delete, make sure we have the latest version so that we can then POST it to a new RP? | 14:24 |
dansmith | which is why it seems to me like maybe we should only force=true on instance delete, and default to not everywhere else, instead of what is proposed which is always force | 14:24 |
sean-k-mooney | so just invert the default for force | 14:25 |
dansmith | we're not really checking the allocations though, so, probably not worth it | 14:25 |
dansmith | anyway, it just feels like we're removing a fence because it's currently getting in the way :) | 14:26 |
artom | Yeah - hence the question - why did we put the fence there in the first place? | 14:26 |
dansmith | well, generations went in to avoid us clobbering allocations all the time | 14:27 |
dansmith | during things like moves | 14:27 |
sean-k-mooney | artom: this is the original commit that added the PUT behavior https://review.opendev.org/c/openstack/nova/+/591597 | 14:28 |
artom | So, I'm sure there's a legitimate reason for it, but I don't grok it. Even during a move it, why would there be 2 bits of code trying to update allocations concurrently? | 14:28 |
artom | *move operation | 14:28 |
sean-k-mooney | it has some motivation for it in the commit message but not a full explanation | 14:28 |
sean-k-mooney | the main motivation I got from that was to intentionally be able to put the instance into error on delete if it was in an inconsistent state | 14:30 |
sean-k-mooney | to presumably give the operator or user a chance to investigate | 14:30 |
dansmith | originally before we were using migration allocations I think both sides fought over who was going to claim the resources and delete the other | 14:30 |
artom | sean-k-mooney, that's for instance deletion, dansmith is talking about moves needing generations | 14:30 |
dansmith | and of course same-host migrations made that hard | 14:31 |
artom | ... so we did a thing in placement because Nova was dumb? | 14:31 |
artom | Ah, resize to same host... | 14:31 |
sean-k-mooney | isn't that why placement was created :) | 14:31 |
dansmith | artom: no dude, placement was/is accessed by multiple services at once | 14:31 |
artom | Hah | 14:31 |
sean-k-mooney | but here this is a distributed lock problem | 14:32 |
sean-k-mooney | in that we don't have one | 14:32 |
artom | dansmith, right, I know that. So was this something like Nova updating the VCPUs or whatever, while Neutron concurrently tries to update something about the port? | 14:32 |
dansmith | well, generation is the lock | 14:33 |
sean-k-mooney | so generations were added ^ yep | 14:33 |
dansmith | I'm trying to remember how this works, but is the generation on the consumer actually? | 14:33 |
dansmith | such that one instance moving between two RPs shares a generation? | 14:33 |
sean-k-mooney | I think the generation is on the allocation | 14:33 |
sean-k-mooney | I don't know if they share the same generation when we have both in a migration case | 14:34 |
dansmith | we don't when migration is holding the resources on the other, | 14:35 |
* sean-k-mooney closes tabs... I have/had 4 tabs with different versions of the same patch open and it was confusing the hell out of me | 14:35 | |
dansmith | but the generation is on the allocation for the consumer across both compute nodes (if you do it that way) | 14:35 |
dansmith | so even with both compute nodes being "separate" they both had to update the allocation for the instance, | 14:35 |
dansmith | to claim resources on incoming and drop the resources from the outgoing | 14:35 |
dansmith | but this delete is actually _deleting_ the allocation | 14:36 |
dansmith | so this is from after that concept of a single allocation against both providers | 14:36 |
sean-k-mooney | ah, so we don't have 2 independent allocations | 14:37 |
dansmith | right | 14:37 |
sean-k-mooney | I thought we used the migration uuid to create an independent allocation | 14:37 |
dansmith | oh we do *now*, | 14:37 |
sean-k-mooney | ah ok | 14:37 |
dansmith | I'm saying before we did that, we were creating allocations for the instance against both computes | 14:38 |
sean-k-mooney | you were describing your initial implementation | 14:38 |
sean-k-mooney | before you refactored it | 14:38 |
dansmith | which required both to modify it during/after the migration | 14:38 |
dansmith | not mine, but the, yes | 14:38 |
dansmith | mine was the "use migration allocation" change | 14:38 |
sean-k-mooney | right ok that aligns with what i recall | 14:38 |
sean-k-mooney | so you think this was really only need before the move to seperate allocations? | 14:39 |
sean-k-mooney | or do you think it's still useful now, the PUT instead of delete? | 14:39 |
dansmith | I think the PUT came after we were already using two separate allocations, but not positive | 14:40 |
sean-k-mooney | for migrations that is | 14:40 |
sean-k-mooney | ok im not sure of the time line but that sound right | 14:41 |
dansmith | gibi wrote that initial patch you linked above to move *away* from delete specifically to get generation stuff, | 14:41 |
dansmith | and it was part of "nested allocation candidates" | 14:41 |
sean-k-mooney | the migration allocation change was a long time ago, relatively speaking | 14:41 |
dansmith | right | 14:41 |
sean-k-mooney | oh it was for nested resource providers | 14:41 |
sean-k-mooney | so maybe for port resource requests? | 14:42 |
dansmith | well, that's kinda what I was alluding to above, | 14:42 |
dansmith | with multiple things accessing here | 14:42 |
sean-k-mooney | ack | 14:42 |
dansmith | like does PUT {} avoid nuking a sub-allocation that neutron hasn't yet cleaned up? | 14:43 |
sean-k-mooney | i dont think we technically support neutron modifying the allocation directly | 14:43 |
sean-k-mooney | i think nova always handles that on its behalf | 14:43 |
sean-k-mooney | but valid question | 14:43 |
dansmith | okay I thought in some cases it would.. wasn't that the point of that blueprint? | 14:44 |
dansmith | I wasn't really involved in it so I dunno | 14:44 |
gibi | it was definitely before port resource request as nested a_c is a prerequisite for that | 14:44 |
sean-k-mooney | so there was the open question about modifying QOS policies for bound ports | 14:44 |
sean-k-mooney | which could change the allocations | 14:44 |
gibi | the only case neutron modifies allocations ^^ | 14:44 |
sean-k-mooney | but i dont think we support that yet do we | 14:44 |
gibi | we do | 14:45 |
sean-k-mooney | oh ok | 14:45 |
gibi | you can change QoS policy on a bound port, neutron will try to amend the instance allocation according to the new policy | 14:45 |
sean-k-mooney | so ya, changing the bandwidth or pps qos amount on a bound port would be the only time neutron could modify the allocation i think | 14:45 |
gibi | yepp I think too | 14:45 |
dansmith | okay, but this was for nested A-C, meaning another allocation against another provider coming back from placement, not any sort of actual nesting of allocations as stored right? | 14:45 |
sean-k-mooney | yes i dont think there is really a concept of nested allocations | 14:46 |
gibi | dansmith: right, we dont have sub allocations, we have multiple RPs in an allocation | 14:46 |
sean-k-mooney | just a single allocation with resources from multiple RPs today | 14:46 |
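To make the point above concrete, here is a sketch of what a single consumer allocation spanning two resource providers looks like. The shape matches placement's `PUT /allocations/{consumer_uuid}` body at microversion 1.28+, but the RP names, UUIDs, and amounts are made up for illustration:

```python
# One consumer allocation pulling resources from two resource
# providers (e.g. the compute RP and a nested bandwidth RP).
# Identifiers and amounts are illustrative only.
allocation = {
    "allocations": {
        "compute-rp-uuid": {
            "resources": {"VCPU": 2, "MEMORY_MB": 2048, "DISK_GB": 20},
        },
        "bandwidth-rp-uuid": {
            "resources": {"NET_BW_EGR_KILOBIT_PER_SEC": 1000},
        },
    },
    # One generation guards the whole consumer, not each provider.
    "consumer_generation": 5,
    "project_id": "project-uuid",
    "user_id": "user-uuid",
}

# There is no nesting of allocations themselves: it is a flat mapping
# of RP uuid -> resources under a single consumer.
rp_count = len(allocation["allocations"])
```

The key structural point from the discussion: the consumer generation sits alongside the whole `allocations` mapping, so any writer touching any of the RPs has to present (and bump) the same generation.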
dansmith | gibi: so you wrote this initial patch to move from delete to put specifically for "benefit" of generations.. are you thinking that was wrong now? | 14:47 |
gibi | dansmith: I tried to figure out this morning the reason why I had that patch | 14:48 |
gibi | the only thing I know is that bumping the placement api to use nested a_cs meant that we got generations as well | 14:48 |
gibi | so we had to do something with generations | 14:48 |
gibi | i.e. the nested a_c support was in a bigger placement microversion than the generation support | 14:49 |
gibi | 1.28 and 1.29 | 14:50 |
gibi | if the request from the user is to delete the instance but that deletion fails due to a parallel operation then I think now that the delete operation should be the one that wins | 14:51 |
dansmith | I don't think this patch was required just because we did generations and nested a_c, because you're updating the version you pass there, | 14:52 |
gibi | on the other hand, if the user requested restore of a soft deleted instance and the timer on the soft delete -> hard delete is racing with that request, then the restore should win | 14:52 |
dansmith | and each call uses whatever version it wants right? | 14:52 |
sean-k-mooney | gibi: microversions are per call though | 14:53 |
dansmith | right | 14:53 |
gibi | right | 14:53 |
gibi | I think the warning at the end of https://docs.openstack.org/placement/latest/placement-api-microversion-history.html#consumer-generation-support shows that at that time we were pretty serious about using generations strictly | 14:54 |
sean-k-mooney | gibi: for the soft delete case we are expecting that this will help because? restore will update the allocation generation and the periodic will fail? | 14:55 |
sean-k-mooney | for soft delete i assume we continue to claim the resources with the allocation until the actual hard delete happens | 14:56 |
sean-k-mooney | so would there be a need to update the allocation at all | 14:56 |
sean-k-mooney | for either soft delete or restore | 14:56 |
gibi | sean-k-mooney: if you are right then there could be no race between restore and soft -> hard delete | 14:56 |
sean-k-mooney | e.g. will the generation actually change | 14:56 |
gibi | btw the microversion history doc states | 14:57 |
dansmith | gibi: I haven't mapped it out, but is there some reason we can't specifically opt-into the nuke-and-destroy behavior when we're deleting an instance and still fail for other cases? | 14:57 |
gibi | Passing an empty allocations object along with a consumer_generation makes PUT /allocations/{consumer_uuid} a safe way to delete allocations for a consumer. | 14:57 |
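The contrast gibi quotes from the docs can be sketched as follows. `FakeSession` just records the HTTP call a real placement client (e.g. a keystoneauth1 adapter) would make; this is illustrative, not nova's actual code:

```python
# Sketch of the two ways to drop a consumer's allocations in
# placement. Illustrative only; FakeSession stands in for a real
# placement-aware HTTP client.

class FakeSession:
    def delete(self, path):
        return ('DELETE', path, None)

    def put(self, path, json=None):
        return ('PUT', path, json)


def delete_allocations(session, consumer_uuid, consumer_generation=None):
    if consumer_generation is None:
        # Blind delete: wins regardless of concurrent updates.
        return session.delete('/allocations/%s' % consumer_uuid)
    # Safe delete (microversion 1.28+): PUT an empty allocations object
    # with the consumer generation. Placement answers 409 if the
    # generation is stale, i.e. someone updated the allocations since
    # we last read them.
    payload = {'allocations': {},
               'consumer_generation': consumer_generation}
    return session.put('/allocations/%s' % consumer_uuid, json=payload)


blind = delete_allocations(FakeSession(), 'consumer-1')
safe = delete_allocations(FakeSession(), 'consumer-1',
                          consumer_generation=3)
```

The design trade-off in the whole thread is which of these two calls nova's instance-delete path should make by default.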
gibi | dansmith: you mean the user should be able to opt into the force? | 14:57 |
dansmith | no | 14:58 |
sean-k-mooney | dansmith: am i think the reason is that technically deletes are not allowed to have request bodies and we did not want to use a query arg | 14:58 |
dansmith | you're force=True by default | 14:58 |
sean-k-mooney | oh you mean the force default | 14:58 |
dansmith | yes | 14:58 |
gibi | dansmith: so just inverting the default | 14:58 |
gibi | dansmith: that is OK to me | 14:58 |
sean-k-mooney | we can set that to false and just pass true for a user delete | 14:58 |
sean-k-mooney | ya im fine with inverting too | 14:58 |
gibi | I think at some point melwitt stated that she used force=True as default as that was the more frequent call | 14:59 |
dansmith | right, I think that's better signaling, and also better for whoever comes along later and uses that to delete an allocation | 14:59 |
dansmith | I think making the dangerous activity explicit is worth the verbosity | 14:59 |
sean-k-mooney | i think its pretty close either way | 14:59 |
sean-k-mooney | in terms of calls | 15:00 |
dansmith | ack | 15:00 |
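The kwarg-default inversion agreed above can be sketched like this. The class and helper names are illustrative, not nova's real `SchedulerReportClient`; the point is only that the dangerous, generation-ignoring delete must be requested explicitly:

```python
# Rough sketch of the interface change being discussed: force defaults
# to False, so callers get the generation-checked removal unless they
# explicitly opt into the blind delete. Names are hypothetical.

class ReportClientSketch:
    def delete_allocation_for_instance(self, instance_uuid, force=False):
        if force:
            # e.g. the user asked to delete the instance: the delete
            # must win over any racing allocation update.
            return ('blind-delete', instance_uuid)
        # Default: PUT {} with the consumer generation, so a parallel
        # update makes this fail with a conflict instead of silently
        # wiping the allocations.
        return ('generation-checked-delete', instance_uuid)


client = ReportClientSketch()
default_call = client.delete_allocation_for_instance('inst-1')
user_delete = client.delete_allocation_for_instance('inst-1', force=True)
```

As dansmith argues, the verbosity of `force=True` at the instance-delete call sites is the explicit signal that the caller intends to nuke the allocation.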
dansmith | gibi: sean-k-mooney: do we know why the consumer types thing suddenly means we need to do this btw? I imagine the answer is just "more chance for the generation to change" but.. why specifically? | 15:40 |
gibi | dansmith: exactly, the consumer_types series merged the two transactions handling the allocation update into one, causing a longer transaction and hence a longer race window | 15:42 |
dansmith | really? I wouldn't expect that to be meaningfully long | 15:43 |
dansmith | as in, I would expect people running this in a larger cloud with lots of stuff going on would have hit this by now and not just in a gate run with few instances at any one point | 15:44 |
dansmith | I saw that analysis from melwitt in the bug and wasn't so sure | 15:44 |
dansmith | but I guess if it's really related to that change only then that has to be it | 15:44 |
gibi | there is a correlation with the consumer_type series landed and the uptick in the conflict failures in the gate | 15:45 |
gibi | I know correlation is not causation but it still feels relevant | 15:46 |
gibi | and the consumer_type series had this transaction logic change | 15:46 |
dansmith | hrm | 15:46 |
gibi | i have no better explanation at the moment | 15:46 |
dansmith | ack | 15:47 |
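The race being debated is the standard optimistic-concurrency pattern. A toy model (not placement's real code) of why two writers reading the same generation means the slower one loses, and why a longer transaction widens the window in which both can read the same value:

```python
# Toy model of placement's consumer-generation check: every successful
# write must present the current generation, which then bumps, so the
# slower of two racing writers gets a conflict. Illustrative only.

class GenerationConflict(Exception):
    """Stands in for placement's 409 on a stale consumer_generation."""


class FakeConsumer:
    def __init__(self):
        self.generation = 0
        self.allocations = {}

    def put_allocations(self, allocations, consumer_generation):
        if consumer_generation != self.generation:
            raise GenerationConflict()
        self.allocations = allocations
        self.generation += 1


consumer = FakeConsumer()
gen_seen_by_a = consumer.generation  # both writers read generation 0
gen_seen_by_b = consumer.generation

consumer.put_allocations({'rp1': {'VCPU': 1}}, gen_seen_by_a)  # A wins
try:
    consumer.put_allocations({}, gen_seen_by_b)  # B loses: stale gen
    raced = False
except GenerationConflict:
    raced = True
```

The longer the write transaction holds between "read generation" and "commit", the more wall-clock time another writer has to slip in and bump it first, which matches gibi's theory about the merged transaction.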
sean-k-mooney | dansmith: sorry was on internal call | 15:53 |
sean-k-mooney | ah am no, i know there is a correlation but not why | 15:54 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Support boot with extended resource request https://review.opendev.org/c/openstack/nova/+/800086 | 16:11 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 16:12 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Support move ops with extended resource request https://review.opendev.org/c/openstack/nova/+/800087 | 16:13 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [func test] refactor interface attach with qos https://review.opendev.org/c/openstack/nova/+/800088 | 16:14 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Support interface attach / detach with new resource request format https://review.opendev.org/c/openstack/nova/+/800089 | 16:15 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [func test] move unshelve test to the proper place https://review.opendev.org/c/openstack/nova/+/793621 | 16:17 |
opendevreview | Balazs Gibizer proposed openstack/nova master: [nova-manage]support extended resource request https://review.opendev.org/c/openstack/nova/+/802060 | 16:18 |
opendevreview | sean mooney proposed openstack/nova master: Remove module level caching https://review.opendev.org/c/openstack/nova/+/806394 | 16:19 |
opendevreview | sean mooney proposed openstack/nova master: db: Handle parameters in DB strings https://review.opendev.org/c/openstack/nova/+/805663 | 16:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Reno for qos-minimum-guaranteed-packet-rate https://review.opendev.org/c/openstack/nova/+/805046 | 16:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance https://review.opendev.org/c/openstack/nova/+/688802 | 16:30 |
gibi | dansmith, sean-k-mooney, melwitt, lyarwood: based on the above discussion I inverted the force default value in ^^ | 16:30 |
gibi | lyarwood, stephenfin: I fixed up the nits in the pps series. thanks for the valuable feedback | 16:31 |
gibi | and with that I end my week. o/ | 16:33 |
sean-k-mooney | gibi: https://review.opendev.org/c/openstack/project-config/+/787523 has now merged so we should have the review priority label available, just an fyi | 16:42 |
gmann | finally :) | 16:45 |
sean-k-mooney | i think https://review.opendev.org/q/project:openstack/nova+Review-Priority:1 is how to use it | 16:46 |
sean-k-mooney | but we dont have any patches with it set | 16:46 |
sean-k-mooney | https://review.opendev.org/q/project:openstack/nova+Review-Priority:0 get hits however | 16:46 |
gmann | or event this, https://review.opendev.org/q/project:openstack/nova+Review-Priority:1+label:Verified%253D1++NOT+label:Workflow%253C%253D-1 | 16:56 |
gmann | sean-k-mooney: ^^ | 16:56 |
gmann | *even | 16:56 |
sean-k-mooney | perhaps we will need at least one patch marked as a priority to test it | 16:57 |
gmann | it can be appended with NOT+owner:self | 16:57 |
gmann | sean-k-mooney: you can test with 0 priority to test query | 16:57 |
melwitt | gibi: ack | 16:58 |
melwitt | dansmith: I didn't think changing to a single transaction would make that much of a timing difference either but the consumer types patch series hit that bug far more often than anything else did. I didn't see anything else in those patches that could possibly be related, so I guessed about the db transaction change. my guess might be wrong | 17:22 |
dansmith | I'm not saying you're wrong, I'm just surprised | 17:22 |
dansmith | and surprised lots of real deployments aren't having trouble with much busier systems before the consumer types patch made the transaction longer | 17:23 |
melwitt | I was surprised too and thought maybe it was a coincidence but once it merged everything was hitting the bug | 17:23 |
dansmith | and if that's really it, I sure hope we're not in for crazy pain if that made the transaction like waaaay longer or something | 17:23 |
melwitt | yeah, same. I haven't seen something like this before | 17:24 |
sean-k-mooney | i know that i have seen conflicts in ci jobs from update available resources before the consumer types change | 17:37 |
sean-k-mooney | so there were already some conflicts happening | 17:38 |
sean-k-mooney | i dont think it caused test failures but i have seen it in the logs | 17:38 |
dansmith | oh definitely | 17:43 |
opendevreview | Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest https://review.opendev.org/c/openstack/nova/+/806294 | 17:57 |
melwitt | yeah, I know it was occurring prior to consumer types, just saying that on the consumer types series it hit the bug often and after it merged it's happening more often | 18:28 |
melwitt | and re: "yeah, same. I haven't seen something like this before" I haven't seen grouping writes into a single db transaction cause a significant timing difference before | 18:34 |
dansmith | melwitt: it failed zuul, but I don't see any conflict messages in there, although it is one of these "hung until timeout" sorts it seems | 18:53 |
dansmith | maybe that multicell failure is known and some other pattern? | 18:53 |
melwitt | dansmith: you're talking about the failure on the force kwarg patch right? I would expect fails on that to not be the conflict bug and at a glance it looks like it is indeed something different | 18:56 |
melwitt | unrelated to that, this is a new error in the controller compute log on that run: Remote error: DBReferenceError (pymysql.err.IntegrityError) (1452, 'Cannot add or update a child row: a foreign key constraint fails (`nova_cell1`.`instance_info_caches`, CONSTRAINT `instance_info_caches_instance_uuid_fkey` | 18:58 |
melwitt | https://zuul.opendev.org/t/openstack/build/575784eb62da4365ab8a1942da0353bc/log/controller/logs/screen-n-cpu.txt#30972 | 18:58 |
dansmith | I saw that too | 18:59 |
dansmith | maybe that dropped an update and that's why the test waited until timeout? | 18:59 |
melwitt | and on the instance from the failed tempest test: [instance: 99c8640e-cb21-4fa9-bb90-d3361201ce7c] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed | 19:00 |
melwitt | Aug 27 17:45:54.463587 ubuntu-focal-rax-ord-0026150414 nova-compute[111496]: ERROR nova.compute.manager [instance: 99c8640e-cb21-4fa9-bb90-d3361201ce7c] Traceback (most recent call last): | 19:00 |
melwitt | https://zuul.opendev.org/t/openstack/build/575784eb62da4365ab8a1942da0353bc/log/controller/logs/screen-n-cpu.txt#33831 | 19:00 |
* dansmith nods | 19:00 | |
melwitt | server failed to spawn. the info cache error was on a different instance | 19:02 |
melwitt | [instance: 217a81e4-b9cf-4e9e-97a9-edcb0fab8349] Can not refresh info_cache because instance was not found | 19:04 |
melwitt | trying to refresh info cache on an instance that's gone.. that's odd | 19:05 |
melwitt | oh, it looks like a network event went to the compute/cell that the instance was not in, so when it tried to refresh the info cache, it was not found | 19:25 |
opendevreview | Merged openstack/nova stable/wallaby: Avoid modifying the Mock class in test https://review.opendev.org/c/openstack/nova/+/805759 | 21:20 |
opendevreview | Merged openstack/nova master: [func test] move port creation to the NeutronFixture https://review.opendev.org/c/openstack/nova/+/787206 | 21:21 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!