rm_work | so I'm still not hearing anything that is a problem for my deployment | 00:00 |
dansmith | heh, okay | 00:00 |
rm_work | and if I can do it, others can do it | 00:00 |
dansmith | here's one problem: file injection has been deprecated :P | 00:00 |
rm_work | yes, thanks | 00:00 |
rm_work | I happened to notice that recently ;P | 00:01 |
penick | So the root problem is you need to put arbitrary files in an instance, or you need instances to have x509 certs chains? | 00:02 |
rm_work | i don't think we're going to get anywhere today on this, maybe we pick up at the PTG | 00:02 |
rm_work | right now it's a cert-chain, PK, and agent config | 00:02 |
rm_work | so I guess "arbitrary files" | 00:02 |
rm_work | and they all contain data that we consider "sensitive" | 00:03 |
rm_work | (obviously, in the case of the PK) | 00:03 |
penick | Are they "shared" secrets, like the keypair for a public website? Barbican might be the right place for those. | 00:03 |
rm_work | well, specifically the PK | 00:03 |
rm_work | we do use Barbican, but our instances have no way to auth against it | 00:03 |
rm_work | one is shared | 00:04 |
rm_work | the other is generated specifically for the VM in question | 00:04 |
rm_work | (the PK) | 00:04 |
penick | We generate secrets on our instances, then have another system the instances call to have their CSR signed; it asserts their identity before it's signed by our root of trust | 00:05 |
rm_work | ok so that there is the important bit | 00:05 |
johnsom | Speaking of FF - had to go take care of that. Yeah I think at least a floppy disk's worth of storage is reasonable. lol Like the PXE boot image size. | 00:05 |
rm_work | we USE those certs/PK to assert identity | 00:05 |
rm_work | how do you assert VM's identity without that? | 00:05 |
rm_work | i mean, that is exactly our workflow | 00:06 |
rm_work | well ... ALMOST our workflow | 00:06 |
rm_work | we reach out to the VM, not the other way around | 00:06 |
johnsom | Yeah, this was the whole discussion that led us to what was implemented years ago. | 00:06 |
dansmith | penick is headed down the right path here, which is not to pass everything to nova and expect it to keep it (most people) or disavow it (some people), and only give nova enough information to let you interact with some service that can do what you want | 00:06 |
dansmith | information that is not sensitive forever | 00:06 |
johnsom | I'm just concerned that if we don't trust how we store and handle images we are in trouble before we even get to config data and establishing secure channels. | 00:07 |
penick | We create a signed bearer document that's time limited and place it in the instance, on boot the instance creates a PK and CSR, then sends those along with the attestation document (created as part of vendor data) to the token server, which verifies the signature in the attestation document (and then invalidates the document) then calls openstack to verify the details in the CSR | 00:08 |
rm_work | which "we" is that? | 00:08 |
penick | eg, ensure the IP, UUID, etc in the CSR match the instance | 00:08 |
rm_work | which service | 00:08 |
rm_work | because that does sound like the workflow we're aiming for | 00:08 |
penick | the service is called Athenz, and the system we've built to integrate it into OpenStack is called copper argos | 00:08 |
penick | I have a talk on it, one sec.. | 00:09 |
rm_work | i was hoping to just glance at the repo | 00:09 |
rm_work | https://github.com/yahoo/athenz ? | 00:09 |
penick | https://www.openstack.org/videos/vancouver-2018/attestable-service-identity-with-copper-argos | 00:10 |
penick | yup | 00:10 |
rm_work | https://github.com/yahoo/athenz/blob/master/docs/copper_argos_dev.md | 00:10 |
penick | Ayup, that's it | 00:10 |
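The Copper Argos flow penick outlines above — a time-limited, signed bearer document placed in the instance via vendor data, redeemed exactly once by a token server before the CSR is signed — can be sketched minimally as follows. This is a hedged illustration, not Athenz's actual implementation: an HMAC with a shared key stands in for the real signing root of trust, and every name here is a hypothetical stand-in.

```python
import hashlib
import hmac
import time

SHARED_KEY = b"demo-signing-key"  # stand-in for the provider's root of trust

def issue_attestation(instance_uuid, ttl=300):
    """Vendor-data side: create a time-limited, signed bearer document."""
    expires = int(time.time()) + ttl
    payload = f"{instance_uuid}:{expires}"
    sig = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

_redeemed = set()  # the token server invalidates documents after first use

def verify_attestation(doc, instance_uuid):
    """Token-server side: check signature, identity, expiry, and one-time use
    before the CSR would be signed."""
    payload, sig = doc["payload"], doc["sig"]
    expected = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    uuid, expires = payload.rsplit(":", 1)
    if uuid != instance_uuid or int(expires) < time.time():
        return False
    if payload in _redeemed:  # replayed bearer document -> reject
        return False
    _redeemed.add(payload)
    return True
```

In the real system the token server additionally calls back into OpenStack to check that the IP, UUID, etc. in the CSR match the instance; that step is omitted here.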
rm_work | so basically, we're screwed once Stein hits, and we have to get something like this working before then? :P | 00:11 |
rm_work | sounds like another day at the office, lol | 00:11 |
penick | I feel like it benefits me to say Yes :) | 00:11 |
rm_work | we'll investigate | 00:11 |
dansmith | rm_work: you should really read the spec you're freaking out about | 00:11 |
rm_work | I did | 00:11 |
dansmith | "Since personality file injection will still be supported with older microversions, there will be nothing removed from the backend compute code related to file injection" | 00:11 |
penick | We're eager to have other people use this, so lmk if y'all (who..are..you?) are interested in using Athenz. It'd be good to get other organizations using/contributing to Athenz | 00:12 |
rm_work | yeah, but in Octavia we don't necessarily control the nova deployments | 00:12 |
rm_work | so we can't guarantee they have the thing enabled | 00:12 |
rm_work | but we still need our stuff to work | 00:12 |
dansmith | rm_work: oooh, I have good news for you | 00:12 |
rm_work | penick: we'd be writing something like that into Octavia | 00:12 |
dansmith | rm_work: user_data will always work? see how nice it is to have features that don't come and go with the deployment choices? :) | 00:12 |
rm_work | lol | 00:13 |
rm_work | except user-data already doesn't work :P | 00:13 |
johnsom | Well, nova is a stable api, so it shouldn't be going away any time soon or they are dropping their stable assertion.... | 00:13 |
penick | We'll be using octavia with this in the near future. It's one of the things we have to suss out this qtr | 00:13 |
dansmith | you mean jamming a bus into your wallet won't work | 00:13 |
penick | But, we already have Athenz in place | 00:13 |
penick | dansmith: Well not with that attitude | 00:13 |
dansmith | johnsom: that's what I'm trying to point out | 00:13 |
rm_work | but you're saying it's already disabled in most nova deploys? | 00:14 |
dansmith | johnsom: which is what you get if you read a paragraph down below "and now lose your mind" | 00:14 |
dansmith | rm_work: no, we're saying that file injection is disabled, but as you pointed out we're putting those personality files into the config drive the first time we make it | 00:14 |
rm_work | [16:38:53] <dansmith>so this has been disabled by default for libvirt for a long time, | 00:15 |
rm_work | ^^ so what did that mean? | 00:15 |
dansmith | rm_work: file. injection. | 00:15 |
rm_work | yes, which has always worked via personality files? | 00:15 |
dansmith | rm_work: you saw the part where I said "I'm not sure how this is going into config drive" and then ... found and quoted the code right? | 00:15 |
rm_work | maybe? | 00:15 |
johnsom | dansmith I was shocked because we hadn't heard of this and it was the *way* to do this securely and reliably and user-data was .... less than ideal | 00:16 |
rm_work | https://github.com/openstack/nova/blob/master/nova/api/metadata/base.py#L191-L194 this link? | 00:16 |
rm_work | I thought that was via libvirt using the thing you said was disabled | 00:16 |
dansmith | johnsom: you know that config drive is disable-able and depending on it is also not reliable yeah? | 00:16 |
dansmith | rm_work: no | 00:16 |
dansmith | rm_work: I get that it says libvirt there, but... | 00:17 |
johnsom | dansmith We force require it as the metadata service was swiss cheese and blew up if you booted more than a few instances at a time | 00:17 |
rm_work | if that's not "file injection" then I don't know | 00:17 |
dansmith | rm_work: the rest of the spec is talking about file injection specifically, which has nothing to do with config drive and is all about violating the very sanctity of the image by forcing large things into small holes | 00:17 |
rm_work | err | 00:18 |
penick | rm_work what's generating the secrets that you're putting into the instance? (amphora vms?) | 00:18 |
rm_work | so *are we using file injection or not*? | 00:18 |
dansmith | I'm serious, you should totes read the spec :) | 00:18 |
rm_work | I read the spec | 00:18 |
rm_work | several sections more than once | 00:18 |
rm_work | so obviously whatever you're hinting at, i'm not going to get | 00:18 |
johnsom | Yeah, the terminology in that spec is super confusing compared to the nova API and client API | 00:18 |
dansmith | that's the point of the first #1 bullet | 00:19 |
rm_work | this whole conversation started because I asked "is what we are doing the deprecated file injection" and multiple people said "yes" | 00:19 |
dansmith | users can't know whether they will get the files they send, because either the deployment may have actual injection disabled (the default), | 00:19 |
rm_work | which #1 bullet, there are several | 00:19 |
dansmith | or they may have disabled config drive (the other way to get these files) | 00:19 |
dansmith | rm_work: I said the first :) | 00:19 |
rm_work | (in fact, I DID notice something new by re-reading -- that SECTION has two, rofl) | 00:20 |
openstackgerrit | Merged openstack/nova master: conf: Add '[neutron] physnets' and related options https://review.openstack.org/564440 | 00:20 |
dansmith | let me try to restate this whole thing | 00:20 |
dansmith | and if that doesn't help, then I'll leave and you can keep your torches and pitchforks for whatever you want | 00:21 |
dansmith | in the olden times, | 00:21 |
dansmith | there was a feature called "file injection" | 00:21 |
dansmith | there are two halves of said feature: | 00:21 |
dansmith | 1. The API (personality files) by which people provide this data which may get ignored if config is unfriendly | 00:21 |
johnsom | Anyhow, any chance we can bump that max size of user-data up to a floppy size? Is it just the API limitation and a DB column alter, or is cloud-init going to need to spin too? | 00:22 |
dansmith | 2. The actual injection part, where the virt driver (some not all) could inject files into images forcibly, literally by taking a hard-coded partition number, and writing over it with your data | 00:22 |
dansmith | are you with me? | 00:22 |
dansmith | config drive didn't exist at this point | 00:22 |
dansmith | aight, I guess nobody wants to hear my story | 00:24 |
rm_work | i'm trying to parse it | 00:24 |
dansmith | which part? | 00:24 |
rm_work | so, file-injection IS what we're using, correct? so right now, we are using both halves of this? | 00:24 |
dansmith | no, | 00:24 |
dansmith | you're using the first part, | 00:24 |
rm_work | or this was just the past, and it's changed now, and you're getting to that | 00:24 |
dansmith | and another part I haven't gotten to yet | 00:24 |
rm_work | k | 00:25 |
dansmith | the #2 part is the really nasty bit, which has been disabled by default, and which we _actually_ want to be rid of | 00:25 |
dansmith | however, the first part is problematic because we don't store it and it breaks several of our other features (agree to disagree on this) | 00:25 |
dansmith | so, in the middle ages, long before you showed up, | 00:25 |
dansmith | this config_drive thing was created | 00:25 |
dansmith | which was a way to avoid the metadata server's restrictions, complication, whatever | 00:26 |
dansmith | apparently when we create that the first time, we also put those files in there (TIL) | 00:26 |
dansmith | but we can't re-create it later, which is the #2 part of the spec problem section | 00:27 |
dansmith | so, | 00:27 |
dansmith | you're using the API part, and the config drive part, but not the actual injection thing which is the most smelly bit | 00:27 |
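The "API part" dansmith describes — personality files — is just a list of guest-path / base64-contents pairs in the server-create request body. A minimal sketch of building that list (field names follow the Nova API reference for `POST /servers`; the helper name is made up, and this whole mechanism is what's being deprecated here):

```python
import base64

def personality_entries(files):
    """Build the 'personality' list for a POST /servers request body:
    each entry pairs a guest path with base64-encoded file contents."""
    return [
        {"path": path, "contents": base64.b64encode(data).decode("ascii")}
        for path, data in sorted(files.items())
    ]

# Example request fragment (illustrative paths only):
body = {"server": {"name": "amphora-demo",
                   "personality": personality_entries(
                       {"/etc/octavia/agent.conf": b"[DEFAULT]\n"})}}
```

Nothing here reaches libvirt's partition-writing injection path; with injection disabled, these entries only surface via the config drive, as discussed above.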
rm_work | ha, right, which is funny because the #2 "problem" is actually WHY we chose this method | 00:27 |
dansmith | fine, but whatever | 00:27 |
rm_work | ok so if #2 was the bad part, and that's just not done anymore... why is the first part being removed? | 00:27 |
dansmith | #2 is related to the API not the really bad part | 00:28 |
rm_work | err | 00:28 |
rm_work | sorry, PART 1 and 2 | 00:28 |
dansmith | the API part is bad because it takes arbitrary files and then kinda keeps track of them, until a rebuild or something and then we lose them | 00:29 |
rm_work | per "1. The API (personality files) by which people provide this data" and "the #2 part is the really nasty bit, which has been disabled by default, and which we _actually_ want to be rid of" | 00:29 |
rm_work | hmmm | 00:29 |
dansmith | the #2 part is the libvirt injection partition thing | 00:29 |
dansmith | sorry | 00:29 |
dansmith | eff, | 00:29 |
rm_work | yeah | 00:30 |
dansmith | this straightening isn't going well | 00:30 |
rm_work | so right, #2 part (libvirt) isn't even done anymore | 00:30 |
rm_work | now it puts things into config-drive | 00:30 |
rm_work | which is ... fine? | 00:30 |
rm_work | it's just that nova then loses track of that data, which you consider bad (but we don't) | 00:30 |
rm_work | (and it has worked that way for a while?) | 00:31 |
dansmith | okay, you know, it's after 5pm and I'm getting more frustrated here, so I'm just going to go | 00:31 |
rm_work | kk | 00:31 |
rm_work | prolly just discussing at the PTG is best | 00:31 |
Ileixe | Hello guys | 00:32 |
Ileixe | Recently I implemented custom hooking code for the server create API in nova-api via the hook API. | 00:33 |
johnsom | My take away. There was some nasty bit taking files and making some strange partition at boot. We aren't using that and never have. Then there is the bit that takes files, stashes them in the config drive, and cloud-init drops them in the guest filesystem. This is what we use. However, to remove the partition stuff the config drive part got removed too | 00:33 |
Ileixe | Oh sorry, there was a conversation going on just now. Never mind, I'll ask later. | 00:35 |
rm_work | Ileixe: we are ... wrapped up on that :P | 00:35 |
rm_work | it's fine | 00:35 |
rm_work | lol | 00:35 |
Ileixe | Thanks rm_work :) just a simple question. I found the hook API was deprecated, and it was the right thing for my logic, so I wonder what replaces the hook API | 00:37 |
melwitt | argh, looks like we have a new gate failure as of today | 00:57 |
melwitt | http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Unsupported%20VIF%20type%20unbound%20convert%20'_nova_to_osvif_vif_unbound'%5C%22%20AND%20tags:screen-n-cpu.txt&from=7d | 00:57 |
melwitt | unless it's only the numa-aware-vswitches patches that are affected... looking closer | 00:58 |
melwitt | it's hitting several of the numa-aware-vswitches patches but is hitting other patches as well. started very recently | 01:04 |
openstackgerrit | Xiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks https://review.openstack.org/585928 | 01:07 |
mriedem | melwitt: i was noticing those randomly the last couple of weeks | 01:44 |
mriedem | unless it's major, just recheck | 01:44 |
melwitt | mriedem: oh, logstash was claiming it started today. and I was wondering if it might be related to https://review.openstack.org/522537 | 01:45 |
melwitt | I've rechecked the numa patches at least twice because of it so far. maybe it's a coincidence. I'll keep trying to recheck | 01:45 |
mriedem | hmm, yeah it might be, mostly hitting on the live migration and multinode jobs | 01:46 |
mriedem | which is where that is turned on | 01:46 |
mriedem | well that would be...awesome | 01:47 |
mriedem | can you report a neutron bug? | 01:47 |
melwitt | that patch landed at 13:00 (my time) which coincides with the logstash start of hits | 01:47 |
melwitt | mriedem: can do. was just writing it up for nova not realizing it's neutron. will copy it over and open for neutron | 01:48 |
mriedem | it could be either | 01:48 |
mriedem | just add both | 01:49 |
melwitt | oh, right. we can do that | 01:49 |
mriedem | Kevin_Zheng: fyi, might need to see if zhaobo can investigate this ^ | 01:49 |
mriedem | mlavalle is already gone for the day | 01:49 |
Kevin_Zheng | ACK, I will ask him | 01:49 |
mriedem | melwitt: there would be an easy way to disable it in nova if needed | 01:50 |
melwitt | k | 01:50 |
mriedem | and then could be tracked as an rc bug (it will need to be an rc bug anyway) | 01:50 |
mriedem | rather than revert | 01:50 |
openstackgerrit | Matt Riedemann proposed openstack/nova-specs master: Fix problem description number in deprecate file injection spec https://review.openstack.org/586385 | 01:51 |
mriedem | i'm also going to fast approve ^ b/c of the confusion i saw in the backscroll | 01:51 |
Kevin_Zheng | mriedem, could you provide an error log? | 01:55 |
dansmith | mriedem: way ahead of you | 01:55 |
Kevin_Zheng | mriedem, never mind, I got it | 01:55 |
melwitt | mriedem: https://bugs.launchpad.net/neutron/+bug/1783917 | 01:57 |
openstack | Launchpad bug 1783917 in OpenStack Compute (nova) "live migration fails with NovaException: Unsupported VIF type unbound convert '_nova_to_osvif_vif_unbound'" [Undecided,New] | 01:57 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: api-ref: document user_data length restriction https://review.openstack.org/586388 | 01:57 |
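The restriction that patch documents is the API schema's cap on the base64-encoded `user_data` string (65535 bytes per the api-ref). A hedged sketch of validating a payload client-side before boot — the constant mirrors the documented limit, and the helper name is made up:

```python
import base64

MAX_USER_DATA = 65535  # documented Nova API cap on the *encoded* user_data

def encode_user_data(raw):
    """Base64-encode user_data as the Nova API expects, rejecting payloads
    whose encoded form exceeds the documented limit."""
    encoded = base64.b64encode(raw).decode("ascii")
    if len(encoded) > MAX_USER_DATA:
        raise ValueError(
            f"user_data is {len(encoded)} bytes encoded; limit is {MAX_USER_DATA}")
    return encoded
```

Because base64 inflates data by roughly 4/3, the usable raw payload is well under 64 KiB — nowhere near a floppy's worth, which is the size johnsom was asking about bumping it to.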
melwitt | Kevin_Zheng ^ | 01:57 |
Kevin_Zheng | Thanks | 01:57 |
mriedem | i'll push up an e-r and nova wip patch and then i have to run i think | 01:57 |
melwitt | oh, I'm not 100% sure it makes live migration "fail", I meant to change that to "raises" | 01:58 |
mriedem | e-r query https://review.openstack.org/#/c/586389/ | 01:59 |
mriedem | it fails | 01:59 |
mriedem | http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Live%20migration%20failed%5C%22%20AND%20message%3A%5C%22Unsupported%20VIF%20type%20unbound%20convert%20'_nova_to_osvif_vif_unbound'%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d | 01:59 |
melwitt | although yeah, all the logstash hits containing the message are build failures | 02:00 |
melwitt | bah *changes it back* | 02:00 |
melwitt | cool, thanks for adding the e-r query | 02:01 |
sean-k-mooney | so I'm going to sleep now but http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_01_083831 looks like it's happening because we are calling unplug on the source node after we have activated the binding on the dest | 02:07 |
melwitt | sean-k-mooney: thanks. so maybe something we need to adjust given the use of the new binding API? I dunno | 02:08 |
melwitt | I'll add your comment to the bug | 02:08 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Temporarily disable port binding flows for live migration https://review.openstack.org/586391 | 02:09 |
mriedem | ^ is an option for temporarily disabling this while debugging a fix | 02:09 |
mriedem | i hope it doesn't have to come to that, but would understand if it's causing a lot of failures | 02:10 |
* melwitt nods | 02:10 | |
sean-k-mooney | melwitt: i can try and reproduce this in the morning. we probably need to store the original vif type, use that to construct the vif object, and use that to do the unplug on the host. | 02:10 |
melwitt | mriedem: okay, we'll decide what to do in the morning tomorrow when other people are around | 02:11 |
openstackgerrit | Merged openstack/nova-specs master: Fix problem description number in deprecate file injection spec https://review.openstack.org/586385 | 02:11 |
mriedem | yeah the error is from unplugging vifs in _post_live_migration which happens on the source, | 02:12 |
mriedem | https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/compute/manager.py#L6581 | 02:12 |
mriedem | right before that, | 02:12 |
mriedem | https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/compute/manager.py#L6572 | 02:12 |
mriedem | we activate the port bindings for the dest host | 02:13 |
melwitt | ah, I see | 02:13 |
sean-k-mooney | mriedem: yep that will deactivate all other port bindings for that port, meaning it will be in the unbound state on the source host | 02:13 |
melwitt | so just flip that? | 02:13 |
mriedem | https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/network/neutronv2/api.py#L2534 | 02:14 |
mriedem | i didn't know we couldn't unplug a deactivated port... | 02:14 |
melwitt | I wonder how it doesn't fail 100% of the time | 02:14 |
mriedem | melwitt: race | 02:14 |
mriedem | apparently | 02:14 |
melwitt | ah | 02:14 |
melwitt | yeah, what luck that the actual change *didn't* fail | 02:14 |
sean-k-mooney | mriedem: you're racing with the notification neutron sends for the port status change | 02:15 |
mriedem | hmm http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_974248 | 02:15 |
mriedem | melwitt: i had seen this once in the series and mlavalle debugged it and couldn't find anything wrong | 02:15 |
mriedem | Jul 27 01:44:00.974248 ubuntu-xenial-rax-dfw-0001002000 nova-compute[2629]: DEBUG nova.network.neutronv2.api [None req-33283139-ba55-4106-b76c-8751a025f153 service nova] [instance: 6b72a721-0995-446e-848f-f407b788c7f4] Port 21095ff0-6bcd-414b-9d6f-b63e03aacb23 binding to destination host ubuntu-xenial-rax-dfw-0001002004 is already ACTIVE. {{(pid=2629) migrate_instance_start /opt/stack/new/nova/nova/network/neutronv2/api.py:25 | 02:15 |
melwitt | ah, okay | 02:15 |
mriedem | oh i know why it's already active, | 02:16 |
mriedem | because we activate the dest host port binding during post-copy | 02:16 |
mriedem | which is the whole point of the blueprint - to shorten the window of time that you don't have networking on the dest host | 02:16 |
melwitt | right | 02:17 |
melwitt | shorten the window | 02:17 |
mriedem | this is the unplug event http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_069526 | 02:18 |
mriedem | this is where we activate the ports on the dest host during post-copy http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_58_561391 | 02:18 |
mriedem | we could have the live migration method wait for the unplug event before starting with post live migration, but (1) i'm not sure that helps anything and (2) it might not work that way for all virt drivers - only libvirt + post-copy has this | 02:19 |
melwitt | yeah, events are sketch depending on which networking backend too, right | 02:20 |
melwitt | like ovs vs other | 02:20 |
openstackgerrit | Yikun Jiang (Kero) proposed openstack/nova master: Change deprecated policies to policy https://review.openstack.org/583434 | 02:20 |
mriedem | melwitt: shouldn't be in this case, | 02:20 |
mriedem | odl should send the event on host binding changes | 02:20 |
openstackgerrit | Yikun Jiang (Kero) proposed openstack/nova master: Fix all invalid obj_make_compatible test case https://review.openstack.org/574240 | 02:20 |
openstackgerrit | Yikun Jiang (Kero) proposed openstack/nova master: Fix all invalid obj_make_compatible test case https://review.openstack.org/574240 | 02:20 |
mriedem | just not plug/unplug | 02:20 |
melwitt | oh, because neutron knows about it and not relying on anything else? ok | 02:20 |
melwitt | just remember getting burned by the whole plug event thing for reboot | 02:21 |
melwitt | but that was because we do os-vif plug only, not any call to neutron, and the agent (or something) has to notice it | 02:21 |
sean-k-mooney | melwitt: the binding change is handled in the common ml2 layer if i remember correctly, yes. the port wire up/tear down event however has to come from the backend, not the common layer, hence the delta between odl/ovs in that case | 02:22 |
melwitt | sean-k-mooney: yeah, I was having trouble remembering what the deal was. thanks | 02:22 |
sean-k-mooney | melwitt: the reason it did not work with linux bridge is that it polls. the reason it did not work for odl was they were missing the handler for the event in odl to send it to the websocket created by networking-odl. i think they have fixed that. maybe | 02:24 |
sean-k-mooney | anyway nova is receiving the port update event in this case from neutron and it's updating the network info cache, so by the time we call nova_to_osvif_vif the vif_type is set to unbound and boom. if we still have the migration data object at this point we should have a copy of the original vif object that we could use instead of the info_cache version to work around it. | 02:27 |
mriedem | so migrate_instance_start() was always a noop before this series, | 02:28 |
*** shaohe_feng has quit IRC | 02:28 | |
mriedem | so its order in _post_live_migration would have never mattered except for nova-network | 02:28 |
mriedem | given we already call migrate_instance_start during post-copy, i don't think moving the order of those calls in _post_live_migration will matter, | 02:29 |
mriedem | because from these logs, i can see that when we call migrate_instance_start from _post_live_migration, it's a noop b/c the dest port binding is already active | 02:29 |
mriedem | http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_974248 | 02:30 |
mriedem | so i would think it means, we need to handle unbound vifs during unplug in the driver? | 02:30 |
mriedem | or just not call unplug_vifs in certain cases | 02:30 |
mriedem | not totally sure though | 02:30 |
mriedem | all the libvirt driver does in post_live_migration_at_source is unplug_vifs | 02:31 |
sean-k-mooney | if we don't call unplug_vifs we could leak the linux bridges we create for ovs hybrid plug | 02:32 |
mriedem | umm... | 02:33 |
mriedem | oh i see what you were saying about storing off the vif_type then | 02:33 |
mriedem | b/c i was going to say, we could just not call unplug_vifs if the vif type (after refreshing the network info cache from neutron) was now 'unbound' | 02:33 |
mriedem | if it is, we can temporarily heal that using migrate_data.vifs | 02:33 |
mriedem | that has the vif type in it | 02:34 |
sean-k-mooney | mriedem: yep | 02:34 |
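The race and the proposed migrate_data workaround can be modeled as a toy simulation — all names here are hypothetical stand-ins, not Nova's real code; it only mirrors the ordering being discussed:

```python
class PortCache:
    """Toy stand-in for Nova's instance network info cache."""
    def __init__(self, vif_type):
        self.vif_type = vif_type

def activate_dest_binding(cache):
    # Activating the dest port binding deactivates the source binding; once
    # the cache is refreshed from Neutron, the source-side vif reads "unbound".
    cache.vif_type = "unbound"

def unplug(vif_type):
    # Mirrors os-vif conversion blowing up on an unbound vif type.
    if vif_type == "unbound":
        raise RuntimeError("Unsupported VIF type unbound")
    return f"unplugged {vif_type}"

cache = PortCache("ovs")
saved_vif_type = cache.vif_type      # snapshot kept in migrate_data before migration

# The failing ordering seen in the gate: activate dest binding, cache refresh,
# then unplug on the source using the (now stale-to-unbound) cache.
activate_dest_binding(cache)
try:
    unplug(cache.vif_type)           # raises: cache now says "unbound"
except RuntimeError:
    pass

# Workaround: unplug with the vif saved in migrate_data, not the refreshed cache.
result = unplug(saved_vif_type)
```

This also shows why skipping unplug entirely isn't safe: the saved vif still has to be torn down, or the hybrid-plug linux bridges leak, as sean-k-mooney notes.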
mriedem | ok i could try cooking something up real quick, | 02:34 |
mriedem | my wife is going to kill me though | 02:34 |
melwitt | you could do tomorrow morning? | 02:34 |
sean-k-mooney | i can try this in the morning too. i just need a 2 node vanila devstack install right | 02:35 |
melwitt | unless you were thinking to fast-approve this tonight | 02:35 |
mriedem | why would the vif type be unbound? | 02:36 |
mriedem | shouldn't it be bound to the dest host? | 02:36 |
mriedem | since we activated it there? | 02:36 |
sean-k-mooney | mriedem: it is. each host has its own binding now. only one will be in the bound state all the rest will be unbound | 02:37 |
mriedem | but i think the port in our info cache is not host-aware... | 02:38 |
mriedem | i need to check | 02:38 |
mriedem | http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_726935 | 02:39 |
mriedem | that's where we refresh the info cache in _post_live_migration | 02:40 |
mriedem | after activating the dest host port binding | 02:40 |
mriedem | [{"profile": {"migrating_to": "ubuntu-xenial-rax-dfw-0001002004"}, "ovs_interfaceid": null, "preserve_on_delete": false, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.1.0.10"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "10.1.0.0/28", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.1.0.1"}}], "meta": {"in | 02:40 |
mriedem | ed": false, "tenant_id": "7dbeedd7076e472091193779ebbcf887", "mtu": 1400}, "id": "1d8de970-331e-46b5-8c7b-574821e891e5", "label": "tempest-LiveMigrationTest-411356071-network"}, "devname": "tap21095ff0-6b", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {}, "address": "fa:16:3e:34:c9:90", "active": false, "type": "unbound", "id": "21095ff0-6bcd-414b-9d6f-b63e03aacb23", "qbg_params": null}] | 02:40 |
mriedem | yeah...that's wrong | 02:40 |
mriedem | it should be bound to the dest host | 02:40 |
sean-k-mooney | well it was bound shortly before http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_311896 | 02:44 |
mriedem | yup we hit post-copy callback here and activate the dest host port binding http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_58_561391 | 02:46 |
mriedem | refresh nw info cache here http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_310738 | 02:47 |
mriedem | then we get an unplugged vif event from neutron | 02:48 |
mriedem | could be concurrently | 02:48 |
sean-k-mooney | what's happening is likely that when the ovs neutron agent sees the tap device disappear, it sends an update to notify us the port state has changed on the source node. | 02:48 |
mriedem | yeah we get the unplugged event and refresh the cache and it's unbound http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_726935 | 02:48 |
*** shaohe_feng has quit IRC | 02:48 | |
mriedem | post live migrate the dest host port binding is already active http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_44_00_974248 | 02:49 |
mriedem | then we unplug and kablammo | 02:50 |
mriedem | doesn't help that we route all of these plug/unplug neutron events to the source host only, that's a nova limitation during live migration right now | 02:50 |
mriedem | and there might be some kind of delay in the state updates or something in the neutron db? | 02:50 |
*** shaohe_feng has joined #openstack-nova | 02:51 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova master: Fix create_all() to replace_all() in comments https://review.openstack.org/586396 | 02:51 |
mriedem | anyway, i can hack around this a bit i think but kind of sucks | 02:51 |
sean-k-mooney | mriedem: well there is a delay in the neutron agent sending the update over the rabbit rpc bus to the neutron-server and then the rest call to nova. | 02:51 |
mriedem | i just worry the port isn't wired up on the dest or something, but that shouldn't be the case b/c we plug_vifs on the dest host during pre_live_migration now | 02:52 |
mriedem | it's just inactive until post-copy | 02:52 |
sean-k-mooney | we could probably hack in a filter to ignore any info cache updates where the vif type is unbound and the port profile contains a migrating_to field | 02:52 |
mriedem | yeah... | 02:53 |
mriedem | that would coincide with this http://logs.openstack.org/63/585163/1/check/nova-live-migration/1b2aebb/logs/screen-n-cpu.txt#_Jul_27_01_43_59_310738 | 02:53 |
sean-k-mooney | mriedem: yes if the pulgin fails in pre_live_migration we bail out early and try another host so at this point the dest networking shoudl be fully set up | 02:54 |
mriedem | also, if we get the info cache based on what's setup for the dest host, we could have changed vif types, so unplugging on the source could be a different vif type...couldn't it? | 02:55 |
mriedem | this gets a bit wonky | 02:56 |
mriedem | we do have an exact copy of the source_vif in the migrate data vifs | 02:56 |
sean-k-mooney | yes it could have changed. | 02:56 |
sean-k-mooney | yep | 02:56 |
sean-k-mooney | the migrate data has everything you need. | 02:56 |
sean-k-mooney | just look up the vif by the port uuid and unplug or better yet just loop over all the vifs in migrate data instead of instance | 02:57 |
mriedem | that's kind of what i'm going to do, will hack something up quick and post it then flesh it out more in the morning | 02:57 |
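A rough sketch of the approach mriedem describes: during post live migration, unplug using the exact source VIFs preserved in the migrate data rather than the (possibly refreshed and now unbound) instance info cache. The data shapes and the unplug callable are stand-ins for the real nova/os-vif objects.

```python
def unplug_source_vifs(migrate_data_vifs, unplug):
    """Unplug each source VIF recorded in the migrate data.

    source_vif is a snapshot taken before the dest binding was
    activated, so its vif type still reflects the source host even
    if the info cache has since been refreshed to 'unbound'.
    """
    for vif in migrate_data_vifs:
        unplug(vif["source_vif"])


unplugged = []
unplug_source_vifs(
    [{"port_id": "21095ff0", "source_vif": {"type": "ovs"}}],
    unplugged.append,
)
print(unplugged)  # [{'type': 'ovs'}]
```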
mriedem | sean-k-mooney: and for the love of toast go to bed | 02:57 |
bzhao__ | Sorry for the network break. I had a brief look at the neutron log from the link. For the failing test instance, it seems to work correctly on the Neutron side. | 02:58 |
sean-k-mooney | haha it's only 4 am, but ya. i'll be back in 6-8 hours and i'll take a look at it then. night o/ | 02:58 |
*** shaohe_feng has quit IRC | 02:59 | |
*** shaohe_feng has joined #openstack-nova | 03:00 | |
melwitt | bzhao__: thanks. feel free to add a comment to explain about the neutron side in https://bugs.launchpad.net/neutron/+bug/1783917 see comment #6 | 03:01 |
openstack | Launchpad bug 1783917 in OpenStack Compute (nova) "live migration fails with NovaException: Unsupported VIF type unbound convert '_nova_to_osvif_vif_unbound'" [High,Confirmed] | 03:01 |
bzhao__ | melwitt: Thanks, I will. ;-) | 03:02 |
mriedem | got a patch, pretty simple, no tests but can be easily added by someone else tonight or in the morning | 03:07 |
sapd | Hi everyone. I got this error when attach a SR-IOV port to instance http://paste.openstack.org/show/726723/ Please help me | 03:09 |
*** shaohe_feng has quit IRC | 03:09 | |
mriedem | sapd: read through https://docs.openstack.org/neutron/latest/admin/config-sriov.html and check everything in there | 03:10 |
melwitt | mriedem: coolness, sounds good | 03:10 |
*** shaohe_feng has joined #openstack-nova | 03:12 | |
sapd | mriedem: yep, I have read it and followed the guide to configure everything. My setup is correct, because I already launched an instance using SR-IOV successfully, but it did not receive DHCP. So I launched another instance using Open vSwitch, then added an SR-IOV port to it, and got the above error. | 03:14 |
melwitt | sapd: looks like the bug has been around for awhile and still not resolved https://bugs.launchpad.net/nova/+bug/1708433 they say you can boot with the port if you pass it during server create, but that attaching port separately is broken | 03:16 |
openstack | Launchpad bug 1708433 in OpenStack Compute (nova) "Attaching sriov nic VM fail with keyError pci_slot" [Undecided,Expired] | 03:16 |
*** abhishekk has quit IRC | 03:17 | |
melwitt | sapd: what release of nova are you using? | 03:18 |
sapd | melwitt: I'm using queens version. 17.0.4 | 03:18 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Use source vifs when unplugging on source during post live migrate https://review.openstack.org/586402 | 03:18 |
mriedem | melwitt: bzhao__: Kevin_Zheng: sean-k-mooney: ^ just needs unit tests | 03:18 |
melwitt | sapd: okay, I'm going to re-open that bug and mention what version you saw it in. it will need to be worked on | 03:19 |
Kevin_Zheng | mriedem, got it, just finish reading launchpad report | 03:19 |
mriedem | ask sahid to look at it | 03:19 |
mriedem | the sriov bug i mean | 03:19 |
*** shaohe_feng has quit IRC | 03:19 | |
melwitt | k | 03:19 |
sean-k-mooney[m] | melwitt: we used to have an api check at one point to expressly forbid attaching sriov ports to existing instances. | 03:20 |
melwitt | hmm, interesting. I wonder what happened to that | 03:20 |
sapd | melwitt: I'm waiting. | 03:21 |
melwitt | hah | 03:21 |
*** shaohe_feng has joined #openstack-nova | 03:21 | |
sean-k-mooney[m] | melwitt: i'm guessing some of artom's changes | 03:21 |
melwitt | okay, I'll ask him about it | 03:23 |
*** dave-mccowan has quit IRC | 03:24 | |
openstackgerrit | Merged openstack/os-vif stable/rocky: Add vif_plug_noop to setup.cfg packages https://review.openstack.org/586340 | 03:26 |
melwitt | hot dog | 03:26 |
bzhao__ | mriedem: So so quick.... =。= | 03:29 |
*** shaohe_feng has quit IRC | 03:29 | |
*** annp has quit IRC | 03:31 | |
*** tiendc has quit IRC | 03:31 | |
*** trungnv has quit IRC | 03:31 | |
melwitt | I think I'm gonna give up on rechecking the r-3 patches, seems like a pretty high fail rate with the live migration thing | 03:31 |
*** shaohe_feng has joined #openstack-nova | 03:32 | |
*** tiendc has joined #openstack-nova | 03:32 | |
*** trungnv has joined #openstack-nova | 03:32 | |
melwitt | get the fix sorted in the morning and go from there | 03:32 |
*** annp has joined #openstack-nova | 03:32 | |
*** gbarros has quit IRC | 03:39 | |
*** shaohe_feng has quit IRC | 03:40 | |
*** shaohe_feng has joined #openstack-nova | 03:40 | |
*** vladikr has quit IRC | 03:45 | |
*** vladikr has joined #openstack-nova | 03:45 | |
mriedem | should have tests done pretty soon | 03:48 |
*** shaohe_feng has quit IRC | 03:50 | |
*** vladikr has quit IRC | 03:51 | |
*** vladikr has joined #openstack-nova | 03:51 | |
*** shaohe_feng has joined #openstack-nova | 03:52 | |
*** links has joined #openstack-nova | 03:52 | |
*** Dinesh_Bhor has quit IRC | 03:52 | |
*** gongysh has quit IRC | 03:52 | |
*** yamahata has joined #openstack-nova | 03:53 | |
*** Dinesh_Bhor has joined #openstack-nova | 03:54 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Use source vifs when unplugging on source during post live migrate https://review.openstack.org/586402 | 03:56 |
mriedem | alright gang there it is with a test ^ | 03:57 |
* melwitt clicks | 03:58 | |
*** Dinesh_Bhor has quit IRC | 04:00 | |
*** shaohe_feng has quit IRC | 04:00 | |
*** shaohe_feng has joined #openstack-nova | 04:01 | |
*** vladikr has quit IRC | 04:03 | |
*** vladikr has joined #openstack-nova | 04:03 | |
mriedem | and now i'm going to bed | 04:04 |
mriedem | o/ | 04:04 |
*** mriedem has quit IRC | 04:04 | |
melwitt | gnite | 04:04 |
*** mschuppert has joined #openstack-nova | 04:06 | |
*** tiendc has quit IRC | 04:10 | |
*** shaohe_feng has quit IRC | 04:10 | |
*** tiendc has joined #openstack-nova | 04:11 | |
*** slaweq has joined #openstack-nova | 04:11 | |
*** shaohe_feng has joined #openstack-nova | 04:11 | |
*** slaweq has quit IRC | 04:16 | |
*** shaohe_feng has quit IRC | 04:21 | |
*** mdnadeem has joined #openstack-nova | 04:21 | |
*** itlinux has joined #openstack-nova | 04:22 | |
*** shaohe_feng has joined #openstack-nova | 04:22 | |
*** pcaruana has joined #openstack-nova | 04:28 | |
*** pcaruana has quit IRC | 04:30 | |
*** shaohe_feng has quit IRC | 04:31 | |
*** shaohe_feng has joined #openstack-nova | 04:33 | |
*** shaohe_feng has quit IRC | 04:41 | |
*** shaohe_feng has joined #openstack-nova | 04:41 | |
openstackgerrit | Xiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks https://review.openstack.org/585928 | 04:47 |
*** gongysh has joined #openstack-nova | 04:50 | |
*** shaohe_feng has quit IRC | 04:51 | |
*** shaohe_feng has joined #openstack-nova | 04:53 | |
*** vladikr has quit IRC | 04:53 | |
*** vladikr has joined #openstack-nova | 04:54 | |
*** flwang1 has quit IRC | 04:59 | |
*** shaohe_feng has quit IRC | 05:02 | |
*** shaohe_feng has joined #openstack-nova | 05:02 | |
*** vladikr has quit IRC | 05:05 | |
*** itlinux has quit IRC | 05:05 | |
*** tbachman has joined #openstack-nova | 05:06 | |
*** vladikr has joined #openstack-nova | 05:08 | |
*** tbachman has quit IRC | 05:11 | |
vishakha | melwitt : Hi, waiting for your response https://review.openstack.org/#/c/580271/. Thanks | 05:11 |
*** shaohe_feng has quit IRC | 05:12 | |
*** slaweq has joined #openstack-nova | 05:13 | |
*** shaohe_feng has joined #openstack-nova | 05:14 | |
*** tbachman has joined #openstack-nova | 05:16 | |
*** Bhujay has joined #openstack-nova | 05:17 | |
*** slaweq has quit IRC | 05:17 | |
*** Bhujay has quit IRC | 05:21 | |
*** shaohe_feng has quit IRC | 05:22 | |
*** shaohe_feng has joined #openstack-nova | 05:23 | |
*** vladikr has quit IRC | 05:27 | |
*** vladikr has joined #openstack-nova | 05:29 | |
*** shaohe_feng has quit IRC | 05:32 | |
*** sridharg has joined #openstack-nova | 05:32 | |
*** shaohe_feng has joined #openstack-nova | 05:34 | |
*** shaohe_feng has quit IRC | 05:43 | |
*** shaohe_feng has joined #openstack-nova | 05:46 | |
*** tbachman has quit IRC | 05:46 | |
*** vladikr has quit IRC | 05:48 | |
*** josecastroleon has joined #openstack-nova | 05:48 | |
*** vladikr has joined #openstack-nova | 05:51 | |
*** trungnv has quit IRC | 05:51 | |
*** annp has quit IRC | 05:51 | |
*** tiendc has quit IRC | 05:51 | |
*** tiendc has joined #openstack-nova | 05:52 | |
*** trungnv has joined #openstack-nova | 05:52 | |
*** annp has joined #openstack-nova | 05:52 | |
*** zigo_ has joined #openstack-nova | 05:53 | |
*** zigo has quit IRC | 05:53 | |
*** shaohe_feng has quit IRC | 05:53 | |
*** shaohe_feng has joined #openstack-nova | 05:54 | |
*** Luzi has joined #openstack-nova | 05:54 | |
*** vladikr has quit IRC | 06:01 | |
*** vladikr has joined #openstack-nova | 06:02 | |
*** shaohe_feng has quit IRC | 06:03 | |
*** shaohe_feng has joined #openstack-nova | 06:05 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 06:08 |
*** shaohe_feng has quit IRC | 06:13 | |
*** shaohe_feng has joined #openstack-nova | 06:15 | |
*** alexchadin has joined #openstack-nova | 06:15 | |
*** sapd has quit IRC | 06:22 | |
*** sapd has joined #openstack-nova | 06:23 | |
*** shaohe_feng has quit IRC | 06:24 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 06:25 |
*** shaohe_feng has joined #openstack-nova | 06:26 | |
*** tiendc_ has joined #openstack-nova | 06:28 | |
*** tiendc has quit IRC | 06:30 | |
ileixe | Hello again | 06:32 |
ileixe | Does any body know how to expand APIExtensionBase for pre-processing not for post-processing..? | 06:33 |
*** shaohe_feng has quit IRC | 06:34 | |
*** shaohe_feng has joined #openstack-nova | 06:35 | |
*** abhishekk has joined #openstack-nova | 06:41 | |
*** mgoddard has joined #openstack-nova | 06:41 | |
*** shaohe_feng has quit IRC | 06:44 | |
*** shaohe_feng has joined #openstack-nova | 06:45 | |
*** vladikr has quit IRC | 06:45 | |
openstackgerrit | Xiaohan Zhang proposed openstack/nova master: compute node local_gb_used include swap disks https://review.openstack.org/585928 | 06:47 |
*** vladikr has joined #openstack-nova | 06:48 | |
*** mgoddard has quit IRC | 06:50 | |
*** brault has joined #openstack-nova | 06:51 | |
*** tesseract has joined #openstack-nova | 06:52 | |
*** shaohe_feng has quit IRC | 06:54 | |
*** shaohe_feng has joined #openstack-nova | 06:56 | |
*** rcernin has quit IRC | 07:00 | |
*** ispp has joined #openstack-nova | 07:00 | |
*** liuyulong__ has joined #openstack-nova | 07:02 | |
*** shaohe_feng has quit IRC | 07:05 | |
*** shaohe_feng has joined #openstack-nova | 07:05 | |
*** liuyulong_ has quit IRC | 07:06 | |
*** ileixe has quit IRC | 07:09 | |
*** ttsiouts has joined #openstack-nova | 07:14 | |
*** shaohe_feng has quit IRC | 07:15 | |
openstackgerrit | Chen proposed openstack/nova master: Make nova-manage capable of syncing all cell databases https://review.openstack.org/519275 | 07:15 |
*** tiendc has joined #openstack-nova | 07:15 | |
*** tiendc_ has quit IRC | 07:16 | |
*** shaohe_feng has joined #openstack-nova | 07:16 | |
*** ccamacho has joined #openstack-nova | 07:20 | |
*** dtantsur|afk is now known as dtantsur | 07:21 | |
*** ttsiouts has quit IRC | 07:24 | |
*** shaohe_feng has quit IRC | 07:25 | |
*** shaohe_feng has joined #openstack-nova | 07:26 | |
*** ileixe has joined #openstack-nova | 07:27 | |
*** ispp has quit IRC | 07:27 | |
*** AlexeyAbashkin has joined #openstack-nova | 07:29 | |
*** gibi is now known as giblet | 07:30 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 07:33 |
*** shaohe_feng has quit IRC | 07:35 | |
*** shaohe_feng has joined #openstack-nova | 07:37 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova master: Fix create_all() to replace_all() in comments https://review.openstack.org/586396 | 07:43 |
*** shaohe_feng has quit IRC | 07:46 | |
*** shaohe_feng has joined #openstack-nova | 07:46 | |
*** tssurya has joined #openstack-nova | 07:48 | |
*** ispp has joined #openstack-nova | 07:48 | |
*** alexchadin has quit IRC | 07:52 | |
*** ttsiouts has joined #openstack-nova | 07:54 | |
*** shaohe_feng has quit IRC | 07:56 | |
*** shaohe_feng has joined #openstack-nova | 07:57 | |
*** rpittau has quit IRC | 07:57 | |
*** rpittau has joined #openstack-nova | 07:57 | |
*** dtantsur is now known as dtantsur|bbl | 08:00 | |
*** abhishekk has quit IRC | 08:04 | |
*** alexchadin has joined #openstack-nova | 08:05 | |
*** shaohe_feng has quit IRC | 08:06 | |
*** vladikr has quit IRC | 08:07 | |
*** vladikr has joined #openstack-nova | 08:08 | |
*** shaohe_feng has joined #openstack-nova | 08:08 | |
*** mgoddard has joined #openstack-nova | 08:12 | |
*** tetsuro has quit IRC | 08:14 | |
*** vladikr has quit IRC | 08:15 | |
*** vladikr has joined #openstack-nova | 08:15 | |
*** shaohe_feng has quit IRC | 08:16 | |
*** shaohe_feng has joined #openstack-nova | 08:19 | |
*** bauzas is now known as PapaOurs | 08:19 | |
kashyap | Hey folks, I'm hitting a "POST_FAILURE" state for the 'nova-live-migration' CI job; seems like a Zuul problem? | 08:20 |
kashyap | (For this change: https://review.openstack.org/#/c/567258/) | 08:20 |
PapaOurs | kashyap: nothing raised by infra AFAIK | 08:21 |
PapaOurs | kashyap: but maybe you should ask in #openstack-infra ? | 08:21 |
kashyap | Nod; in the past I've seen the channel topic being changed when such errors occurred. | 08:21 |
kashyap | PapaOurs: Yep, was just about to check there. | 08:21 |
kashyap | When I look into the log, it's the SSH failing | 08:22 |
*** derekh has joined #openstack-nova | 08:23 | |
*** shaohe_feng has quit IRC | 08:27 | |
*** shaohe_feng has joined #openstack-nova | 08:28 | |
*** avolkov has joined #openstack-nova | 08:28 | |
*** mgoddard has quit IRC | 08:34 | |
*** flwang1 has joined #openstack-nova | 08:34 | |
*** shaohe_feng has quit IRC | 08:37 | |
*** jaosorior has quit IRC | 08:38 | |
*** shaohe_feng has joined #openstack-nova | 08:38 | |
*** vivsoni has quit IRC | 08:41 | |
openstackgerrit | Vishakha Agarwal proposed openstack/nova master: No change in field 'updated' in server https://review.openstack.org/586446 | 08:43 |
*** mgoddard has joined #openstack-nova | 08:43 | |
*** flwang1 has quit IRC | 08:46 | |
*** shaohe_feng has quit IRC | 08:47 | |
*** shaohe_feng has joined #openstack-nova | 08:49 | |
*** lifeless has quit IRC | 08:54 | |
*** vladikr has quit IRC | 08:55 | |
*** vladikr has joined #openstack-nova | 08:55 | |
*** vishakha has quit IRC | 08:57 | |
*** shaohe_feng has quit IRC | 08:57 | |
*** jaosorior has joined #openstack-nova | 08:58 | |
*** shaohe_feng has joined #openstack-nova | 08:58 | |
*** vivsoni has joined #openstack-nova | 09:05 | |
*** shaohe_feng has quit IRC | 09:08 | |
*** shaohe_feng has joined #openstack-nova | 09:08 | |
*** flwang1 has joined #openstack-nova | 09:09 | |
*** josecastroleon has quit IRC | 09:09 | |
*** lifeless has joined #openstack-nova | 09:11 | |
*** vladikr has quit IRC | 09:11 | |
*** vladikr has joined #openstack-nova | 09:12 | |
*** akki has joined #openstack-nova | 09:12 | |
*** akki has quit IRC | 09:13 | |
*** akki has joined #openstack-nova | 09:13 | |
akki | can we take lxd container snapshots and use them to launch new containers? | 09:15 |
*** cdent has joined #openstack-nova | 09:18 | |
*** naichuans has quit IRC | 09:18 | |
*** shaohe_feng has quit IRC | 09:18 | |
*** josecastroleon has joined #openstack-nova | 09:18 | |
PapaOurs | do folks have any idea why we stupidly set the device owner of a port to be compute:<instance_az> ? | 09:18 |
openstackgerrit | huanhongda proposed openstack/nova master: hypervisor-stats shows wrong disk usages with shared storage https://review.openstack.org/149878 | 09:18 |
*** vladikr has quit IRC | 09:21 | |
*** shaohe_feng has joined #openstack-nova | 09:21 | |
*** shaohe_feng has quit IRC | 09:28 | |
*** shaohe_feng has joined #openstack-nova | 09:29 | |
*** MultipleCrashes has joined #openstack-nova | 09:29 | |
MultipleCrashes | Looking for further review from sometime , please have a look https://review.openstack.org/#/c/563418/ | 09:29 |
openstackgerrit | huanhongda proposed openstack/nova master: Change the metadata re to match the unicode https://review.openstack.org/536236 | 09:32 |
*** vladikr has joined #openstack-nova | 09:33 | |
*** MultipleCrashes has quit IRC | 09:37 | |
*** shaohe_feng has quit IRC | 09:38 | |
*** shaohe_feng has joined #openstack-nova | 09:41 | |
*** Dinesh_Bhor has joined #openstack-nova | 09:45 | |
*** andymccr- has joined #openstack-nova | 09:47 | |
*** shaohe_feng has quit IRC | 09:49 | |
*** jaosorior has quit IRC | 09:49 | |
*** shaohe_feng has joined #openstack-nova | 09:49 | |
*** andymccr_ has quit IRC | 09:50 | |
*** johnthetubaguy has quit IRC | 09:52 | |
*** flwang1 has quit IRC | 09:55 | |
*** flwang1 has joined #openstack-nova | 09:56 | |
*** shaohe_feng has quit IRC | 09:59 | |
*** shaohe_feng has joined #openstack-nova | 10:00 | |
*** flwang1 has quit IRC | 10:00 | |
*** vladikr has quit IRC | 10:03 | |
*** stakeda has quit IRC | 10:03 | |
*** vladikr has joined #openstack-nova | 10:04 | |
*** andymccr has quit IRC | 10:04 | |
*** andymccr- is now known as andymccr | 10:05 | |
*** liuzz_ has quit IRC | 10:09 | |
*** shaohe_feng has quit IRC | 10:09 | |
*** ispp has quit IRC | 10:09 | |
*** shaohe_feng has joined #openstack-nova | 10:10 | |
*** Dinesh_Bhor has quit IRC | 10:10 | |
*** flwang1 has joined #openstack-nova | 10:13 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client https://review.openstack.org/583667 | 10:15 |
*** trungnv has quit IRC | 10:18 | |
*** shaohe_feng has quit IRC | 10:19 | |
*** shaohe_feng has joined #openstack-nova | 10:21 | |
*** alexchadin has quit IRC | 10:26 | |
*** cdent has quit IRC | 10:27 | |
*** shaohe_feng has quit IRC | 10:30 | |
*** shaohe_feng has joined #openstack-nova | 10:30 | |
*** ttsiouts has quit IRC | 10:31 | |
*** ispp has joined #openstack-nova | 10:33 | |
*** vladikr has quit IRC | 10:36 | |
*** vladikr has joined #openstack-nova | 10:36 | |
sean-k-mooney[m] | kashyap: post_failure means the job failed to upload the logs/result | 10:36 |
kashyap | sean-k-mooney[m]: Ah, I see | 10:36 |
kashyap | sean-k-mooney[m]: I hit a recheck, let's see if it goes through. | 10:37 |
kashyap | sean-k-mooney[m]: Would you happen to have time to have a gander at this: https://review.openstack.org/#/c/567258/ ("libvirt: Remove usage of migrateToURI{2} APIs") | 10:37 |
kashyap | Fairly mechanical, but some churn in there. | 10:37 |
kashyap | (The 'recheck' is still in progress, though.) | 10:38 |
kashyap | It's slow as molasses. | 10:38 |
sean-k-mooney[m] | Am sure. I'll take a look once i get coffee | 10:38 |
sean-k-mooney[m] | It's feature freeze time; the gate is under a lot of load. Recheck is all you could have done in this case | 10:39 |
*** alexchadin has joined #openstack-nova | 10:39 | |
*** shaohe_feng has quit IRC | 10:40 | |
*** shaohe_feng has joined #openstack-nova | 10:42 | |
kashyap | Ah, right | 10:47 |
*** gongysh has quit IRC | 10:47 | |
*** dtantsur|bbl is now known as dtantsur | 10:49 | |
*** sridharg has quit IRC | 10:50 | |
*** brault_ has joined #openstack-nova | 10:50 | |
*** shaohe_feng has quit IRC | 10:50 | |
*** shaohe_feng has joined #openstack-nova | 10:51 | |
openstackgerrit | Merged openstack/nova master: doc: add missing permission for the vCenter service account https://review.openstack.org/585683 | 10:52 |
*** brault has quit IRC | 10:53 | |
*** savvas has quit IRC | 10:53 | |
*** savvas has joined #openstack-nova | 10:53 | |
*** vladikr has quit IRC | 10:55 | |
*** vladikr has joined #openstack-nova | 10:55 | |
*** gilfoyle has joined #openstack-nova | 10:58 | |
gilfoyle | I'm trying to replicate some of what nova (the cli util) is doing. This is an old deployment of openstack. My goal is to understand how it gets the zone-related information from the database when no zones are created | 10:59 |
gilfoyle | could someone help me by pointing out where in the repos should I be looking for this? | 11:00 |
gilfoyle | the relevant command is `nova availability-zone-list` | 11:00 |
*** shaohe_feng has quit IRC | 11:00 | |
*** shaohe_feng has joined #openstack-nova | 11:01 | |
sean-k-mooney | gilfoyle: what is the result you are getting and what were you expecting | 11:04 |
sean-k-mooney | there are 2 default azs that exist without you creating any | 11:05 |
sean-k-mooney | internal and nova | 11:05 |
sean-k-mooney | the controller nodes will be in internal and all computes will be in nova | 11:05 |
*** dave-mccowan has joined #openstack-nova | 11:06 | |
*** pooja_jadhav has quit IRC | 11:08 | |
sean-k-mooney | kashyap: i was going to ask why there is a migrateToURI(), migrateToURI2() and migrateToURI3(), then i remembered libvirt is written in c... | 11:08 |
gilfoyle | sean-k-mooney: my issue is that I'm running a query against the database that's not returning me any of the computes in the `nova` az, yet from the nova command above I do see them there | 11:10 |
*** shaohe_feng has quit IRC | 11:11 | |
sean-k-mooney | gilfoyle: yes i think the api layer injects the nova az before it gets to the client | 11:11 |
*** shaohe_feng has joined #openstack-nova | 11:11 | |
*** takedakn has joined #openstack-nova | 11:12 | |
gilfoyle | is it the case that if a compute node has been added without specifying an AZ, the reporting then returns it as being in `nova`? that's how I've handled it in the past | 11:13 |
*** s10 has joined #openstack-nova | 11:15 | |
sean-k-mooney | gilfoyle: yes and that is still how it's handled today | 11:15 |
gilfoyle | or, let me restate: if the compute node has not been added to an AZ, it ends up in 'nova'? I've seen occasions where the aggregates.name came up as NULL, so I used the following shortcut in mysql `IFNULL(aggregates.name, 'nova') as zone` | 11:16 |
sean-k-mooney | gilfoyle: ah no, if you have added a host to a host aggregate and you have set the availability_zone metadata key on the aggregate, it should not show up in nova anymore | 11:17 |
gilfoyle | ah, that explains my conundrum then, however, I now have a different question/ask | 11:18 |
gilfoyle | what's the case where aggregates.name is NULL? | 11:18 |
gilfoyle | if this isn't an obvious one, then I'll go back to the drawing board and try to analyse it further :) | 11:19 |
sean-k-mooney | gilfoyle: i believe we allow you to have a host aggregate where you only set the uuid | 11:19 |
sean-k-mooney | i can't remember off the top of my head why, however | 11:20 |
*** shaohe_feng has quit IRC | 11:21 | |
gilfoyle | ah, cool :) | 11:22 |
*** flwang1 has quit IRC | 11:22 | |
*** vivsoni has quit IRC | 11:23 | |
*** takedakn has quit IRC | 11:23 | |
*** shaohe_feng has joined #openstack-nova | 11:24 | |
sean-k-mooney | gilfoyle: the name field on the aggregate is not the availability_zone name, by the way. it's the host aggregate name, just in case you thought they were the same | 11:24 |
sean-k-mooney | i mean i personally always set them the same but they don't have to be | 11:24 |
gilfoyle | sean-k-mooney: Oh. interesting, I've been using a query with a relationship between aggregates, aggregate_hosts, compute_nodes and services tables to try and get all nodes for all AZs | 11:25 |
*** flwang1 has joined #openstack-nova | 11:26 | |
sean-k-mooney | gilfoyle: an availability zone isn't really a thing in nova. it's just a host aggregate with a metadata key called availability_zone in it | 11:27 |
*** cdent has joined #openstack-nova | 11:28 | |
*** Shilpa has quit IRC | 11:28 | |
*** ttsiouts has joined #openstack-nova | 11:29 | |
sean-k-mooney | so to get all hosts in an az you just find the host aggregate with the correct metadata key, then list its hosts. | 11:29 |
*** pooja_jadhav has joined #openstack-nova | 11:29 | |
sean-k-mooney | the nova and internal az are special however | 11:29 |
gilfoyle | could you possibly eyeball this and see if you can spot any obvious assumption(s) https://paste.ubuntu.com/p/sDFRDffzpy/ ? | 11:30 |
sean-k-mooney | i think the nova az is calculated by generating a list of hosts that are not part of another az | 11:31 |
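The behaviour sean-k-mooney outlines can be sketched like this: an availability zone is just a host aggregate carrying an "availability_zone" metadata key, and the default "nova" AZ is whatever compute hosts are left over after every real AZ has claimed its members. The data shapes below are invented for illustration, not nova's actual objects.

```python
def hosts_by_az(all_compute_hosts, aggregates, default_az="nova"):
    """Group compute hosts by availability zone.

    Aggregates without an availability_zone metadata key are plain
    host aggregates and do not define an AZ; hosts claimed by no AZ
    fall into the default 'nova' zone.
    """
    result = {}
    claimed = set()
    for agg in aggregates:
        az = agg.get("metadata", {}).get("availability_zone")
        if az is None:
            continue  # a host aggregate, but not an AZ
        result.setdefault(az, set()).update(agg["hosts"])
        claimed.update(agg["hosts"])
    result[default_az] = set(all_compute_hosts) - claimed
    return result


azs = hosts_by_az(
    ["cn1", "cn2", "cn3"],
    [{"metadata": {"availability_zone": "az1"}, "hosts": ["cn1"]}],
)
print(sorted(azs["nova"]))  # ['cn2', 'cn3']
```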
*** jamesde__ has joined #openstack-nova | 11:31 | |
*** shaohe_feng has quit IRC | 11:31 | |
*** jamesden_ has quit IRC | 11:32 | |
*** flwang1 has quit IRC | 11:33 | |
*** jamesde__ has quit IRC | 11:34 | |
*** shaohe_feng has joined #openstack-nova | 11:34 | |
gilfoyle | that seems to make sense to me, so I assume it does that as a separate step/query in the `nova` cli? would you have any idea where this is defined in the source? | 11:34 |
sean-k-mooney | gilfoyle: i think services.topic = 'compute' can be changed in the nova conf, so that might be more fragile than looking at the service.binary | 11:34 |
*** flwang1 has joined #openstack-nova | 11:34 | |
*** alexchadin has quit IRC | 11:34 | |
*** alexchadin has joined #openstack-nova | 11:35 | |
sean-k-mooney | gilfoyle: but that should list the capacity of all compute nodes ordered by the az they are in | 11:35 |
*** alexchadin has quit IRC | 11:35 | |
*** alexchadin has joined #openstack-nova | 11:36 | |
gilfoyle | yes, that's the goal, but for a cluster w/o any zones, I don't see its only compute node in the output. Probably because it needs to be a separate query as you suggested above :) | 11:36 |
sean-k-mooney | actually no it won't | 11:36 |
*** alexchadin has quit IRC | 11:36 | |
*** alexchadin has joined #openstack-nova | 11:36 | |
sean-k-mooney | ya, that's because you are matching on the aggregate name, not the az name | 11:37 |
*** alexchadin has quit IRC | 11:37 | |
sean-k-mooney | actually that's not quite true either | 11:37 |
sean-k-mooney | by default you will not have any aggregates, so the left join on aggregate_hosts.host = compute_nodes.hypervisor_hostname will filter out all the hosts | 11:38 |
gilfoyle | yup, that became apparent after your nugget above, too :) | 11:39 |
*** kholkina has joined #openstack-nova | 11:39 | |
sean-k-mooney | gilfoyle: so what you need to do, rather than set aggregate.name to nova if null, is to also join this result with a subquery on the compute_nodes table for every host that is not in the first result set | 11:40 |
gilfoyle | thank you sean-k-mooney! :) | 11:41 |
*** shaohe_feng has quit IRC | 11:41 | |
sean-k-mooney | gilfoyle: do you want to view this by host_aggregate or availability zone, by the way? | 11:42 |
sean-k-mooney | the service has the az embedded https://github.com/openstack/nova/blob/2afc5fed1f60077e7ff0b9e81b64cff4e4dbabfc/nova/objects/service.py#L190 | 11:42 |
*** shaohe_feng has joined #openstack-nova | 11:42 | |
gilfoyle | by availability zone :) | 11:42 |
*** abhishekk has joined #openstack-nova | 11:46 | |
openstackgerrit | Merged openstack/nova master: [placement] Use base test in placement functional tests https://review.openstack.org/585778 | 11:49 |
*** shaohe_feng has quit IRC | 11:52 | |
kashyap | sean-k-mooney: Was AFK for lunch | 11:52 |
kashyap | sean-k-mooney: Hehe, yeah. I linked to a libvirt commit that explains it | 11:52 |
*** tiendc has quit IRC | 11:53 | |
*** shaohe_feng has joined #openstack-nova | 11:53 | |
*** shaohe_feng has quit IRC | 12:02 | |
*** shaohe_feng has joined #openstack-nova | 12:03 | |
*** linkmark has joined #openstack-nova | 12:03 | |
*** savvas has quit IRC | 12:04 | |
*** medberry has joined #openstack-nova | 12:04 | |
*** ispp has quit IRC | 12:08 | |
*** savvas has joined #openstack-nova | 12:09 | |
*** savvas has quit IRC | 12:11 | |
*** savvas_ has joined #openstack-nova | 12:11 | |
*** shaohe_feng has quit IRC | 12:12 | |
*** shaohe_feng has joined #openstack-nova | 12:14 | |
*** alexchadin has joined #openstack-nova | 12:16 | |
*** edmondsw has joined #openstack-nova | 12:17 | |
*** johnthetubaguy has joined #openstack-nova | 12:17 | |
*** alexchadin has quit IRC | 12:20 | |
*** ispp has joined #openstack-nova | 12:20 | |
*** armaan has joined #openstack-nova | 12:22 | |
*** shaohe_feng has quit IRC | 12:22 | |
*** shaohe_feng has joined #openstack-nova | 12:23 | |
*** sridharg has joined #openstack-nova | 12:24 | |
*** wolverineav has joined #openstack-nova | 12:26 | |
*** annp has quit IRC | 12:27 | |
*** Shilpa has joined #openstack-nova | 12:31 | |
*** mdnadeem has quit IRC | 12:32 | |
*** alexchadin has joined #openstack-nova | 12:33 | |
*** shaohe_feng has quit IRC | 12:33 | |
*** shaohe_feng has joined #openstack-nova | 12:33 | |
*** lyan has joined #openstack-nova | 12:34 | |
*** lyan is now known as Guest87808 | 12:34 | |
*** vladikr has quit IRC | 12:35 | |
*** mriedem has joined #openstack-nova | 12:35 | |
*** ispp has quit IRC | 12:36 | |
*** savvas_ has quit IRC | 12:40 | |
*** armaan has quit IRC | 12:41 | |
*** shaohe_feng has quit IRC | 12:43 | |
*** flwang1 has quit IRC | 12:43 | |
*** shaohe_feng has joined #openstack-nova | 12:44 | |
*** armaan has joined #openstack-nova | 12:45 | |
*** savvas has joined #openstack-nova | 12:45 | |
*** flwang1 has joined #openstack-nova | 12:46 | |
mriedem | http://status.openstack.org/elastic-recheck/index.html#1783917 is clearly our top code-related gate failure so need eyes on the proposed fix https://review.openstack.org/#/c/586402/ | 12:47 |
giblet | mriedem: as sean-k-mooney is +1 on the change I'm going to approve it | 12:49 |
mriedem | giblet: ok. i'm looking at what other calls we make on the source, | 12:49 |
*** armaan has quit IRC | 12:49 | |
mriedem | rollback_live_migration looks OK - nothing directly using the info cache in there | 12:49 |
*** savvas has quit IRC | 12:50 | |
*** armaan has joined #openstack-nova | 12:50 | |
*** ttsiouts has quit IRC | 12:50 | |
PapaOurs | mriedem: there were some POST_FAILURE gate issues this morning too | 12:51 |
mriedem | PapaOurs: that's not code related | 12:51 |
PapaOurs | yup, I know, just FYI | 12:51 |
mriedem | and has been a known issue the last few weeks with one of the node providers | 12:51 |
PapaOurs | that I didn't know of | 12:51 |
PapaOurs | either way, giblet +Wd your change | 12:52 |
*** shaohe_feng has quit IRC | 12:53 | |
*** armaan has quit IRC | 12:54 | |
*** savvas has joined #openstack-nova | 12:54 | |
*** shaohe_feng has joined #openstack-nova | 12:55 | |
*** ttsiouts has joined #openstack-nova | 12:56 | |
mriedem | i do see one potential place i missed | 12:57 |
*** savvas has quit IRC | 12:59 | |
mriedem | giblet: comment inline, i'll do a follow up | 12:59 |
*** rmart04 has joined #openstack-nova | 12:59 | |
giblet | mriedem: OK, cool | 13:00 |
*** pchavva has joined #openstack-nova | 13:01 | |
*** vladikr has joined #openstack-nova | 13:01 | |
mriedem | hyperv ci failed but on unrelated tests | 13:01 |
mriedem | looks like those were failing due to ssh and timeouts | 13:01 |
mriedem | {7} tempest.api.volume.test_volumes_extend.VolumesExtendTest.test_volume_extend_when_volume_has_snapshot [365.093541s] ... FAILED | 13:01 |
mriedem | huh | 13:03 |
mriedem | 2018-07-27 05:15:36.661 5060 105049744 MainThread WARNING nova.scheduler.client.report [req-640b132e-9a1b-4f75-8f8d-7ae96964af72 c329c90c52a44fe2889e0284651a21f0 82e0a447215e49079fe42481922ccd81 - default default] Failed to save allocation for 390d33d0-36e2-469e-85be-8ec10658e953. Got HTTP 400: {"errors": [{"status": 400, "request_id": "req-fc67d1c6-b641-475a-afdf-27075995c0ff", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: {} does not have enough properties Failed validating 'minProperties' in schema['properties']['allocations']['items']['properties']['resources']: {'additionalProperties': False, 'minProperties': 1, 'patternProperties': {'^[0-9A-Z_]+$': {'minimum': 1, 'type': 'integer'}}, 'type': 'object'} On instance['allocations'][0]['resources']: {} ", "title": "Bad Request"}]} | 13:03 |
*** shaohe_feng has quit IRC | 13:03 | |
*** ispp has joined #openstack-nova | 13:03 | |
mriedem | Sending updated allocation [{'resource_provider': {'uuid': u'b2979fd7-376b-4f9e-a1b9-b4c69d619cb9'}, 'resources': {}}] for instance 390d33d0-36e2-469e-85be-8ec10658e953 | 13:03 |
mriedem | 2018-07-27 05:15:36.513 5060 105049744 MainThread INFO nova.compute.manager [req-640b132e-9a1b-4f75-8f8d-7ae96964af72 c329c90c52a44fe2889e0284651a21f0 82e0a447215e49079fe42481922ccd81 - default default] [instance: 390d33d0-36e2-469e-85be-8ec10658e953] Doing legacy allocation math for migration 8221f52a-c72b-4b7b-81d9-67cb67fb37bc after instance move | 13:04 |
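The 400 in the paste above is placement's JSON-schema validation rejecting an empty `resources` dict. A pure-Python sketch of the two rules that matter here (placement's real validator uses jsonschema; this just mirrors the 'minProperties': 1 and patternProperties checks from the error message):

```python
import re

# Toy re-implementation of the two schema rules from the error above:
# resources must have at least one key, and each key must match
# ^[0-9A-Z_]+$ with an integer amount >= 1.
def validate_resources(resources):
    if len(resources) < 1:
        return False, "{} does not have enough properties"
    for name, amount in resources.items():
        if not re.match(r"^[0-9A-Z_]+$", name):
            return False, "bad resource class name: %s" % name
        if not isinstance(amount, int) or amount < 1:
            return False, "bad amount for %s: %r" % (name, amount)
    return True, None

# The allocation nova sent had an empty resources dict -> HTTP 400.
ok, err = validate_resources({})
print(ok, err)

# What a healthy resources fragment looks like:
ok2, _ = validate_resources({"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1})
print(ok2)  # True
```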
*** shaohe_feng has joined #openstack-nova | 13:04 | |
mriedem | i'm not sure why the hyperv ci would be hitting that in rocky | 13:05 |
mriedem | edmondsw: powervm in-tree ci took over 5 hours here and timed out https://review.openstack.org/#/c/586402/ | 13:06 |
mriedem | fyi | 13:06 |
*** mgariepy has quit IRC | 13:08 | |
*** mgariepy has joined #openstack-nova | 13:10 | |
*** edleafe is now known as figleaf | 13:11 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client https://review.openstack.org/583667 | 13:11 |
cdent | Is this already a known thing: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Unsupported%20VIF%20type%20unbound%20convert%5C%22 | 13:12 |
cdent | oh never mind, my search on launchpad just hit | 13:13 |
mriedem | http://status.openstack.org/elastic-recheck/index.html#1783917 | 13:13 |
cdent | it didn't when I was missing a closing t | 13:13 |
mriedem | fix is in the gate | 13:13 |
cdent | cool, thanks | 13:13 |
*** flwang1 has quit IRC | 13:14 | |
*** antosh has joined #openstack-nova | 13:14 | |
*** shaohe_feng has quit IRC | 13:14 | |
*** shaohe_feng has joined #openstack-nova | 13:14 | |
mriedem | based on the 50 mocks i have to do in _post_live_migration, clearly that method is too big | 13:15 |
cdent | ugh | 13:16 |
*** savvas has joined #openstack-nova | 13:16 | |
*** cdent has quit IRC | 13:16 | |
*** savvas has quit IRC | 13:21 | |
*** savvas has joined #openstack-nova | 13:21 | |
*** abhishekk has quit IRC | 13:21 | |
*** shaohe_feng has quit IRC | 13:24 | |
*** shaohe_feng has joined #openstack-nova | 13:25 | |
edmondsw | mriedem the powervm ci is borked right now. I'm trying to help get it fixed | 13:26 |
*** ttsiouts has quit IRC | 13:26 | |
*** mdrabe has joined #openstack-nova | 13:26 | |
*** gbarros has joined #openstack-nova | 13:27 | |
*** flwang1 has joined #openstack-nova | 13:31 | |
*** jistr is now known as jistr|mtg | 13:32 | |
*** Luzi has quit IRC | 13:32 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Use placement 1.28 in scheduler report client https://review.openstack.org/583667 | 13:33 |
*** shaohe_feng has quit IRC | 13:34 | |
dansmith | efried: what should happen if I have compute nodes with MISC_SHARES (and thus no DISK_GB inventory)? Should the scheduler receive split allocations from placement with disk on the sharing provider? | 13:35 |
*** shaohe_feng has joined #openstack-nova | 13:35 | |
dansmith | because I have yet to convince it to do that in a functional test | 13:36 |
*** alexchadin has quit IRC | 13:36 | |
*** medberry has quit IRC | 13:37 | |
*** gilfoyle has quit IRC | 13:37 | |
*** alexchadin has joined #openstack-nova | 13:42 | |
*** burt has joined #openstack-nova | 13:43 | |
*** tbachman has joined #openstack-nova | 13:44 | |
*** shaohe_feng has quit IRC | 13:44 | |
*** shaohe_feng has joined #openstack-nova | 13:45 | |
*** fanzhang has quit IRC | 13:45 | |
*** fanzhang has joined #openstack-nova | 13:45 | |
*** alexchadin has quit IRC | 13:49 | |
*** shaohe_feng has quit IRC | 13:55 | |
*** ttsiouts has joined #openstack-nova | 13:56 | |
*** shaohe_feng has joined #openstack-nova | 13:56 | |
*** mlavalle has joined #openstack-nova | 13:57 | |
mriedem | speaking of, i think this is going to be the money patch https://review.openstack.org/#/c/586363/ | 13:57 |
mriedem | creates a shared storage provider using the DISK_GB calculated from the compute node provider, then removes the compute node provider's DISK_GB inventory before the compute service host is discovered | 13:58 |
*** awaugama has joined #openstack-nova | 13:58 | |
*** med_ has quit IRC | 13:58 | |
s10 | Please check this bug: https://bugs.launchpad.net/nova/+bug/1784006 | 13:59 |
openstack | Launchpad bug 1784006 in OpenStack Compute (nova) "Instances misses neutron QoS on their ports after unrescue and soft reboot" [Undecided,New] | 13:59 |
*** ttsiouts has quit IRC | 14:00 | |
*** blkart has quit IRC | 14:01 | |
s10 | User can easily drop QoS limitations on ports with _soft_reboot() or unrescue() for libvirt driver. | 14:01 |
*** blkart has joined #openstack-nova | 14:01 | |
*** ttsiouts has joined #openstack-nova | 14:02 | |
mriedem | s10: i think we do plug_vifs on hard reboot now, but maybe not in pike... | 14:04 |
mriedem | or maybe only for certain types of vifs... | 14:04 |
mriedem | it's kind of a mess | 14:04 |
*** shaohe_feng has quit IRC | 14:05 | |
s10 | plug_vifs is executed on hard reboot and in spawn(), but not on soft reboot. In master: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L2706 | 14:06 |
mriedem | oh right, i missed the "without" here: "Execute nova reboot (without parameter --hard)" | 14:06 |
melwitt | hm, I thought for soft reboot they shouldn't have been unplugged in the first place, but the bug says a new domain is created, which I didn't think happened either. I wonder if something changed there | 14:07 |
*** shaohe_feng has joined #openstack-nova | 14:08 | |
*** gbarros has quit IRC | 14:08 | |
dansmith | soft reboot will turn into a hard reboot if the guest doesn't shut down voluntarily right? | 14:08 |
melwitt | shutdown and then a create | 14:08 |
mriedem | correct | 14:08 |
dansmith | it's trivial for me to make my guest not shut down when asked | 14:09 |
mriedem | but apparently in this case soft reboot works | 14:09 |
melwitt | looking at the code, indeed it does a guest.shutdown() followed by a create. so you'd think you'd have to plug the vifs in again, I wonder how this normally works? | 14:09 |
*** links has quit IRC | 14:10 | |
dansmith | hmm, it doesn't do an actual reboot? | 14:10 |
*** gilfoyle has joined #openstack-nova | 14:10 | |
melwitt | doesn't look like it? I guess I've never looked at soft reboot in detail before https://github.com/openstack/nova/blob/stable/pike/nova/virt/libvirt/driver.py#L2547 | 14:11 |
dansmith | hmm, yeah, I didn't think this was like this | 14:11 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Pass source vifs to driver.cleanup in _post_live_migration https://review.openstack.org/586568 | 14:12 |
mriedem | giblet: ^ | 14:12 |
giblet | mriedem: looking | 14:12 |
dansmith | I thought if we were running a virt that could do real reboot, we did that and only fell back to the shutdown/restart if not | 14:13 |
dansmith | but I don't see that | 14:13 |
melwitt | mriedem: why were you thinking not to use the source vifs throughout the entire method? just wondering | 14:13 |
mriedem | no particular reason, just wanted to minimize the amount of change, | 14:14 |
mriedem | but we could just do that at the top rather than get the refreshed nw info cache | 14:14 |
*** gilfoyle has quit IRC | 14:14 | |
mriedem | i.e. here https://review.openstack.org/#/c/586568/1/nova/compute/manager.py@6555 | 14:15 |
*** eharney has joined #openstack-nova | 14:15 | |
*** shaohe_feng has quit IRC | 14:15 | |
mriedem | i can definitely make that change if it makes more sense | 14:17 |
*** shaohe_feng has joined #openstack-nova | 14:17 | |
melwitt | yeah, I'm not 100% sure but it feels like it should be consistent throughout. but I guess that's never guaranteed anyway, because neutron events in flight could change the network_info as it goes through the method? | 14:17 |
mriedem | shouldn't | 14:18 |
*** jlvacation is now known as jlvillal | 14:18 | |
mriedem | an event would be processed separately and shouldn't be able to modify that network_info variable by reference | 14:18 |
melwitt | oh, yeah, okay | 14:18 |
mriedem | the instance.info_cache might be updated concurrently, sure | 14:18 |
mriedem | but we're using the local variable in most places | 14:18 |
*** r-daneel has joined #openstack-nova | 14:18 | |
melwitt | yeah | 14:18 |
mriedem | the versioned notifications will still use instance.info_cache | 14:18 |
*** felipemonteiro has joined #openstack-nova | 14:19 | |
mriedem | left that as a comment so giblet can also ponder it | 14:20 |
mriedem | i didn't do it in https://review.openstack.org/#/c/586402/ because (1) it was late and (2) i just wanted to get the immediate fire put out | 14:20 |
melwitt | I guess I could see the rationale in only using the source vifs for the relevant actions because like I think you mentioned, maybe the notifications should reflect the state of the network info cache at the time it was queried | 14:20 |
melwitt | that's the only other thing network info is used for in that method, I assume? | 14:21 |
pooja_jadhav | mriedem: hello | 14:21 |
*** gilfoyle has joined #openstack-nova | 14:22 | |
pooja_jadhav | sean-k-mooney : hello | 14:22 |
mriedem | and unfilter_instance in the firewall driver, | 14:22 |
mriedem | i looked at how it was used in the various drivers and it was just getting the mac address off the vifs in one case | 14:23 |
mriedem | which i don't think should change | 14:23 |
mriedem | but, | 14:23 |
mriedem | admittedly, only passing the source vifs from migrate_data to 2 spots indicates tight coupling into knowing exactly what those methods are doing with network_info | 14:23 |
giblet | mriedem, melwitt: I think having the current network info sent in the notification is the valid thing, as we are notifying about the current state | 14:23 |
pooja_jadhav | sean-k-mooney, mriedem: I am trying live migrate and using nfs storage, its failing for "Binding failed for port e973dde6-d68c-4aec-a70d-86dcd81fa11b and host Neha-VirtualBox." | 14:24 |
mriedem | pooja_jadhav: i can't really help you debug that right now | 14:24 |
melwitt | giblet, mriedem: I think that makes sense too, the more I think about it | 14:24 |
mriedem | pooja_jadhav: i'd suggest using something besides devstack if you want a more sophisticated deployment tool for multi-node with live migration, like openstack-ansible | 14:24 |
pooja_jadhav | mriedem: ok | 14:24 |
mriedem | melwitt: i'm totally fine with making the generic switch at the top of the method | 14:25 |
mriedem | i don't like the tight coupling that's in here really | 14:25 |
mriedem | i just wanted to reduce any exposure to regression | 14:25 |
mriedem | pooja_jadhav: or look at a nova-live-migration job config and see how it set things up | 14:25 |
mriedem | but those don't use nfs | 14:25 |
*** shaohe_feng has quit IRC | 14:25 | |
mriedem | http://logs.openstack.org/02/586402/2/check/nova-live-migration/2db7a54/ | 14:25 |
*** med_ has joined #openstack-nova | 14:25 | |
*** med_ has quit IRC | 14:25 | |
*** med_ has joined #openstack-nova | 14:25 | |
*** alexchadin has joined #openstack-nova | 14:26 | |
mriedem | pooja_jadhav: binding failed means something failed in neutron | 14:26 |
mriedem | so network is messed up | 14:26 |
*** gilfoyle has quit IRC | 14:26 | |
pooja_jadhav | mriedem: Hmm | 14:26 |
*** mdrabe has quit IRC | 14:26 | |
*** shaohe_feng has joined #openstack-nova | 14:27 | |
melwitt | mriedem: yeah, I'm thinking I agree with giblet though, that we should leave it the way you have it. let the notifications use the fresh network info and not artificially send source vif. I think the only reason to use source vifs there is if somehow a notifications listener might want to know which vif is actually being acted upon during the actions in the method. hmm. | 14:27 |
pooja_jadhav | mriedem: But I am not able to see any error logs at neutron side.. thats the problem | 14:28 |
mriedem | melwitt: giblet: well, only the versioned notifications will use the instance.info_cache, | 14:28 |
mriedem | the legacy ones would end up using the source vifs | 14:28 |
melwitt | oh | 14:29 |
*** links has joined #openstack-nova | 14:29 | |
mriedem | anyway, we could always change this later i guess if it causes some other unanticipated problem | 14:29 |
melwitt | yeah | 14:30 |
mriedem | let me check to make sure the mac address on the vif is the same between source and dest | 14:30 |
mriedem | since that's used in the firewall driver to unfilter | 14:30 |
*** alexchadin has quit IRC | 14:30 | |
giblet | mriedem: in the current code the legacy notification uses the local network_info and I guess that is the same as what the versioned gets from instance.info_cache | 14:31 |
mriedem | source vif "address": "fa:16:3e:cc:ff:66" | 14:31 |
mriedem | from the cache: "address": "fa:16:3e:cc:ff:66" | 14:31 |
mriedem | so yeah the mac doesn't change | 14:31 |
mriedem | giblet: yes | 14:31 |
giblet | mriedem: then I still think that the current code in your patch is good | 14:32 |
*** cdent has joined #openstack-nova | 14:32 | |
*** breton has quit IRC | 14:33 | |
*** gilfoyle has joined #openstack-nova | 14:34 | |
s10 | What could be done with the unrescue/soft reboot QoS issue? Should we use _create_domain_and_network() in those functions instead of a simple _create_domain()? Or call plug_vifs()? | 14:34 |
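A schematic of the gap s10 is reporting: the hard-reboot path goes through VIF plugging (which is where neutron QoS gets reapplied on the host), while soft reboot's shutdown-and-create path does not. The class and method names below echo the libvirt driver, but the bodies are stand-ins modeling the reported behavior, not the real driver:

```python
# Toy model only: the point is structural, i.e. which reboot path ends up
# calling plug_vifs() again.
class FakeDriver:
    def __init__(self):
        self.qos_applied = False

    def plug_vifs(self):
        self.qos_applied = True   # QoS rules are (re)applied here

    def _create_domain(self):
        pass                      # bare domain create, no VIF replug

    def _create_domain_and_network(self):
        self.plug_vifs()          # domain create plus VIF replug

    def soft_reboot(self):
        # shutdown + bare create: QoS state on the host is silently lost
        self.qos_applied = False
        self._create_domain()

    def hard_reboot(self):
        self.qos_applied = False
        self._create_domain_and_network()

drv = FakeDriver()
drv.plug_vifs()
drv.soft_reboot()
print(drv.qos_applied)  # False: the QoS setup never came back

drv.hard_reboot()
print(drv.qos_applied)  # True
```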
* giblet is logging off for the weekend | 14:35 | |
*** shaohe_feng has quit IRC | 14:36 | |
*** shaohe_feng has joined #openstack-nova | 14:37 | |
*** jistr|mtg is now known as jistr | 14:39 | |
*** tidwellr has joined #openstack-nova | 14:39 | |
*** gilfoyle has quit IRC | 14:39 | |
*** tidwellr has quit IRC | 14:39 | |
*** tidwellr has joined #openstack-nova | 14:39 | |
*** bhagyashris has quit IRC | 14:41 | |
*** flwang1 has quit IRC | 14:41 | |
mriedem | woot ceph shared storage change got through stack.sh and is now running tempest | 14:42 |
*** felipemonteiro_ has joined #openstack-nova | 14:42 | |
cdent | huzzah | 14:42 |
dansmith | cdent: did you see my question to efried earlier? | 14:42 |
cdent | dansmith: no sir, what's up? | 14:43 |
openstackgerrit | Chris Dent proposed openstack/nova master: [placement] Retry allocation writes server side https://review.openstack.org/586048 | 14:43 |
dansmith | [06:36:22] <dansmith>efried: what should happen if I have compute nodes with MISC_SHARES (and thus no DISK_GB inventory)? Should the scheduler receive split allocations from placement with disk on the sharing provider? | 14:43 |
dansmith | [06:36:46] <dansmith>because I have yet to convince it to do that in a functional test | 14:43 |
dansmith | cdent: ^ | 14:43 |
cdent | one sec, let me find something | 14:44 |
cdent | dansmith: this is current passing: https://github.com/cdent/placecat/blob/master/gabbits/fridge.yaml#L204-L213 | 14:45 |
cdent | which is an example of some allocations with sharing providers | 14:45 |
*** [fcandido] has joined #openstack-nova | 14:45 | |
cdent | so in theory it should work, but I'm not clear on what needs to happen on compute-node side to set things up | 14:45 |
*** felipemonteiro has quit IRC | 14:46 | |
dansmith | cdent: that is asserting what? that one of the providers only has a part of the whole? | 14:46 |
*** shaohe_feng has quit IRC | 14:46 | |
dansmith | or, two providers in the request | 14:46 |
mriedem | # but there are two resource providers in that one allocations block | 14:46 |
*** shaohe_feng has joined #openstack-nova | 14:46 | |
cdent | ^ | 14:46 |
dansmith | yeah | 14:46 |
*** gilfoyle has joined #openstack-nova | 14:47 | |
dansmith | so, that tells me that a single non-fancy request to placement should return a split allocation | 14:47 |
mriedem | dansmith: we should know shortly from this ceph patch i have | 14:47 |
cdent | If we need a specific functional test for something, I'm semi idle right now, so could make something if someone tells me what it needs to be | 14:47 |
dansmith | and the scheduler is doing a non-fancy request, so it should be getting back a split allocation I guess | 14:47 |
melwitt | mriedem: in https://review.openstack.org/586568 is that taking care of the live migration rollback scenario? or is that still an open question | 14:47 |
mriedem | melwitt: i looked at rollback and didn't see anything that needed this type of thing | 14:47 |
dansmith | cdent: well, I've tried writing a very hacky one and placement is returning no allocation requests | 14:47 |
melwitt | mriedem: ack | 14:47 |
cdent | dansmith: do you want to push it up and I'll tune it and you can go review something? | 14:48 |
mriedem | melwitt: i'd say if we ever go the generic route in _post_live_migration, we'd want to do the same in _rollback_live_migration | 14:48 |
mriedem | rollback is likely less of an issue b/c if we failed live migration, we won't activate the dest host port bindings and get into this mess | 14:48 |
*** lucasbagb has joined #openstack-nova | 14:49 | |
[fcandido] | http://eavesdrop.openstack.org/meetings | 14:49 |
*** efried is now known as fried_rice | 14:49 | |
*** [fcandido] has left #openstack-nova | 14:49 | |
melwitt | ack | 14:49 |
openstackgerrit | Dan Smith proposed openstack/nova master: WIP: funtional test with sharing providers https://review.openstack.org/586589 | 14:49 |
dansmith | cdent: ^ | 14:49 |
cdent | on it | 14:49 |
dansmith | cdent: warning, it's very, uh, forced | 14:50 |
cdent | ha, noted | 14:50 |
*** flwang1 has joined #openstack-nova | 14:50 | |
fried_rice | dansmith/superdan: I haven't caught up on the whole conversation, but you're asking about a compute node that's marked as a sharing provider? | 14:50 |
dansmith | cdent: attempts to create a provider with disk, associate with the compute host providers, nuke the disk inventory from one and then try to boot and see if we got the shared bit | 14:50 |
cdent | ✔ | 14:51 |
dansmith | fried_rice: no, not a compute node marked as sharing, just a compute with no disk because it's associated to a shared disk provider | 14:51 |
*** mlavalle has quit IRC | 14:52 | |
*** imacdonn has quit IRC | 14:52 | |
*** mlavalle has joined #openstack-nova | 14:52 | |
mriedem | dansmith: why not write a simple fake virt driver that doesn't report DISK_GB inventory? | 14:52 |
*** imacdonn has joined #openstack-nova | 14:52 | |
dansmith | mriedem: because this was quick | 14:52 |
dansmith | mriedem: obviously not mergeable | 14:52 |
*** fgonzales_ has joined #openstack-nova | 14:53 | |
mriedem | your max_unit is wrong | 14:54 |
melwitt | argh, gate bug fix just failed merge for POST_FAILURE | 14:54 |
mriedem | your sharing provider has 1gb | 14:54 |
mriedem | unless flavor1 doesn't have any root_gb | 14:55 |
*** bacape has joined #openstack-nova | 14:56 | |
*** breton has joined #openstack-nova | 14:56 | |
dansmith | it has 1024 GB | 14:56 |
*** Bellesse has joined #openstack-nova | 14:56 | |
*** jfinck has joined #openstack-nova | 14:56 | |
*** shaohe_feng has quit IRC | 14:56 | |
mriedem | but max you can request in a chunk is 1 right? | 14:56 |
dansmith | oh max_unit | 14:56 |
cdent | i'll mess with it | 14:57 |
mriedem | max_unit should equal total | 14:57 |
*** shaohe_feng has joined #openstack-nova | 14:57 | |
dansmith | still no dice | 14:57 |
dansmith | er, hmm it didn't update | 14:57 |
dansmith | ah, I'm setting inventory twice for some reason | 14:58 |
sean-k-mooney | melwitt: the live migrate one? | 14:58 |
mriedem | yeah | 14:58 |
mriedem | you might be using a 1 root_gb flavor anyway | 14:58 |
mriedem | so the max_unit being 1 might not make a difference | 14:59 |
dansmith | I was, and still no difference | 14:59 |
dansmith | yeah | 14:59 |
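The max_unit point above in a short sketch: placement rejects any single allocation larger than max_unit, no matter how much total inventory the provider has. This is a simplified stand-in for placement's capacity check, not its actual code:

```python
# Simplified model of the inventory capacity check being discussed.
def can_allocate(requested, total, used, max_unit,
                 reserved=0, step_size=1, allocation_ratio=1.0):
    # A request must fit in a single chunk of at most max_unit...
    if requested > max_unit:
        return False
    if requested % step_size != 0:
        return False
    # ...and within overall capacity.
    capacity = (total - reserved) * allocation_ratio
    return used + requested <= capacity

# total=1024 but max_unit=1: a 10 GB root disk request fails.
print(can_allocate(10, total=1024, used=0, max_unit=1))     # False
# With max_unit equal to total, the same request fits.
print(can_allocate(10, total=1024, used=0, max_unit=1024))  # True
```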
melwitt | sean-k-mooney: yeah | 14:59 |
fried_rice | dansmith/superdan: Okay, you're trying to make a setup that has its disk allocated from a sharing provider, not the compute node. And then what, migrate it? | 14:59 |
mriedem | boot and then migrate | 14:59 |
mriedem | but boot fails? | 14:59 |
dansmith | fried_rice: well, boot first would be nice | 14:59 |
fried_rice | bhagyashri got that working live and in a func test with the libvirt driver. | 15:00 |
dansmith | fried_rice: I believe migrate will mangle the allocations, but trying to prove it | 15:00 |
fried_rice | Have you located that func test yet? | 15:00 |
dansmith | nope | 15:00 |
fried_rice | dansmith: I suspect you may be right. | 15:00 |
fried_rice | okay, stand by... | 15:00 |
mriedem | fried_rice: that libvirt func test doesn't go through the scheduler though right? | 15:00 |
dansmith | fried_rice: yeah, so in that case, I want to remove the bit of the libvirt inventory thing that will not expose disk_gb, because people may turn that on, and then be mangling their allocations with migrations for a couple days before realizing it | 15:01 |
fried_rice | mriedem: I sure thought it did. | 15:01 |
fried_rice | https://review.openstack.org/#/c/560459/ | 15:01 |
mriedem | hmm yeah https://review.openstack.org/#/c/560459/17/nova/tests/functional/libvirt/test_shared_resource_provider.py | 15:01 |
fried_rice | yup | 15:02 |
*** links has quit IRC | 15:02 | |
dansmith | yeah, so I dunno why it's not working for me | 15:02 |
dansmith | but that's fine | 15:03 |
sean-k-mooney | dansmith: only the allocation for the compute resources needs to be migrated, correct? the shared storage allocation should remain the same. | 15:03 |
mriedem | sean-k-mooney: well, that's the point of the test, | 15:03 |
dansmith | sean-k-mooney: right, but we don't do that properly | 15:03 |
fried_rice | dansmith: Building on that one and trying a migration would be informative. I would be surprised if it works properly, because we have no logic to do ^ | 15:03 |
mriedem | because we have FIXME notes all over the migration code | 15:03 |
sean-k-mooney | i guess unless we are migrating with a block migration to a different storage provider | 15:03 |
dansmith | fried_rice: I have fixmes about it being broken and known | 15:03 |
fried_rice | yup | 15:04 |
*** alexchadin has joined #openstack-nova | 15:04 | |
dansmith | fried_rice: so, yeah, I'm not sure why we landed the patch to do that for inventory in that case, but.. alas | 15:04 |
fried_rice | dansmith: So that we wouldn't be double-reporting inventory allocations. | 15:04 |
*** gilfoyle_ has joined #openstack-nova | 15:05 | |
fried_rice | dansmith: Can't you only migrate an instance that's on volume storage anyway? | 15:05 |
dansmith | fried_rice: right, but that has been broken since forever, and this change means we *lose* data | 15:05 |
dansmith | no | 15:05 |
fried_rice | what happens to the disk? | 15:05 |
mriedem | ssh to the dest | 15:05 |
dansmith | hah | 15:05 |
fried_rice | eek, really? | 15:05 |
dansmith | it gets migrated | 15:05 |
*** mdrabe has joined #openstack-nova | 15:05 | |
dansmith | either block migration or shared (non-volume) storage in the backend | 15:05 |
fried_rice | Okay, so what are we expecting to happen here? | 15:05 |
dansmith | for live, and yeah, scp to dest for the cold migration case | 15:06 |
fried_rice | I would have thought we would ssh the data to whatever disk got allocated on the dest. | 15:06 |
*** gilfoyle has quit IRC | 15:06 | |
dansmith | I think we need to remove that bit of the inventory logic that doesn't expose DISK_GB | 15:06 |
dansmith | so that we don't get split allocations that we trash during a migration | 15:06 |
dansmith | because we'll end up with instances with no DISK_GB allocation at all | 15:06 |
fried_rice | which may or may not be the same provider as we started on, but to a different spot on that disk - which would be something to fix later | 15:06 |
dansmith | and then start overcommitting | 15:06 |
*** shaohe_feng has quit IRC | 15:06 | |
fried_rice | I don't understand that thinking. And IMO it is premature to land a patch to yank that out until we've demonstrated that anything bad happens. | 15:07 |
*** shaohe_feng has joined #openstack-nova | 15:07 | |
dansmith | that's why I'm trying to write a test | 15:07 |
fried_rice | sounds good. | 15:07 |
fried_rice | need help? | 15:07 |
*** josecastroleon has quit IRC | 15:07 | |
dansmith | I asked for help and now am working on using that functional test to do my bidding | 15:08 |
* cdent is still poking at the test too | 15:08 | |
*** r-daneel_ has joined #openstack-nova | 15:08 | |
*** r-daneel has quit IRC | 15:09 | |
*** r-daneel_ is now known as r-daneel | 15:09 | |
mriedem | i believe this is the problem https://github.com/openstack/nova/blob/master/nova/conductor/tasks/migrate.py#L48 | 15:09 |
mriedem | b/c we're assuming only allocations on the source compute node provider | 15:09 |
fried_rice | I think the worst that happens is we fail to remove the original allocation for the DISK_GB on the sharing provider. What happens after that depends on whether we migrated to a compute node with or without sharing disk. But the doubled allocation leaves us in no worse shape than we were before this fix, I would have thought. | 15:09 |
mriedem | and copy those to the migration consumer | 15:09 |
mriedem | which won't include the DISK_GB allocation on the shared provider | 15:09 |
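A toy model of the failure mode mriedem points at: if the move-allocation code assumes everything lives on the source compute node provider, a DISK_GB allocation on a sharing provider is silently dropped from the migration record. The UUIDs and helper names here are made up for illustration; the real code is in nova/conductor/tasks/migrate.py:

```python
# Instance allocations as placement returns them: keyed by resource
# provider UUID, with disk living on a sharing provider rather than the
# compute node. UUIDs are illustrative placeholders.
SOURCE_CN = "cn-uuid"
SHARED_DISK_RP = "shared-disk-uuid"

instance_allocs = {
    SOURCE_CN: {"resources": {"VCPU": 2, "MEMORY_MB": 2048}},
    SHARED_DISK_RP: {"resources": {"DISK_GB": 20}},
}

def copy_source_allocs_buggy(allocs, source_cn_uuid):
    """The pattern under discussion: copy only the source compute node's
    entry to the migration consumer."""
    return {source_cn_uuid: allocs[source_cn_uuid]}

def copy_source_allocs_fixed(allocs):
    """Copy the whole allocation set, sharing providers included."""
    return dict(allocs)

migration_allocs = copy_source_allocs_buggy(instance_allocs, SOURCE_CN)
# The DISK_GB allocation never makes it onto the migration consumer, so a
# revert restores allocations with no disk at all.
print(SHARED_DISK_RP in migration_allocs)                           # False
print(SHARED_DISK_RP in copy_source_allocs_fixed(instance_allocs))  # True
```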
sean-k-mooney | fried_rice: dansmith: do we handle flavors with root_gb=0 in placement, by the way? pre-placement we just did not track their disk usage properly. im assuming that is still broken | 15:09 |
mriedem | sean-k-mooney: fixed like 1 week ago | 15:10 |
mriedem | sean-k-mooney: https://review.openstack.org/#/q/topic:bug/1469179+(status:open+OR+status:merged) | 15:10 |
sean-k-mooney | mriedem: fixed by reading the disk size from the image? | 15:10 |
dansmith | fried_rice: and I think we lose the disk allocation silently | 15:10 |
fried_rice | dansmith: mriedem: oic, yeah, that makes sense. | 15:10 |
mriedem | sean-k-mooney: no, we don't request DISK_GB allocations for bfv | 15:10 |
sean-k-mooney | mriedem: i was thinking about the non-boot-from-volume case | 15:11 |
fried_rice | I didn't realize we don't go through GET /a_c to request the resources on the dest. | 15:11 |
mriedem | yes this is what removes the instances allocations https://github.com/openstack/nova/blob/master/nova/conductor/tasks/migrate.py#L60 | 15:11 |
mriedem | from all providers | 15:11 |
mriedem | fried_rice: we do to pick the dest host during scheduling | 15:12 |
sean-k-mooney | mriedem: for example the nano flavor with the cirros image in devstack, with no volume for the guest. | 15:12 |
fried_rice | mriedem: GET /a_c or just GET /rps?resources=... ? | 15:12 |
mriedem | GET /a_c, | 15:12 |
mriedem | we have to do that in the scheduler to figure out which providers to filter for a dest host | 15:12 |
fried_rice | mriedem: and then ignore that result and just copy the resources from the src to the dest? | 15:12 |
mriedem | i'm looking to confirm that | 15:13 |
fried_rice | mriedem: well, you could have used GET /rps?resources=... as well | 15:13 |
mriedem | sure but we don't in the scheduler | 15:13 |
fried_rice | The right thing would be to use GET /a_c to pick the host *and* create the allocations. Then we wouldn't be having this problem. | 15:13 |
mriedem | oh you know what, | 15:13 |
mriedem | yes that's what we do | 15:14 |
mriedem | we move the existing allocs from the instance on the source node to the migration record, | 15:14 |
mriedem | and then call the scheduler and claim on the dest host | 15:14 |
mriedem | so the migration has allocs on source host and instance has allocs on dest host | 15:14 |
mriedem | then on successful migration we delete the migration allocs on the source host | 15:14 |
mriedem | on failure, we delete allocs for instance on dest and move allocs from migration on source host back to the instance | 15:15 |
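mriedem's allocation-swap flow as a small state sketch. This is a toy dict model of the steps he lists (the real logic is spread across nova.conductor and the scheduler report client):

```python
# Toy model of the migration allocation dance:
# 1. move the instance's source allocs onto the migration consumer,
# 2. the scheduler claims fresh allocs for the instance on the dest,
# 3. success deletes the migration's source allocs,
# 4. failure drops the dest allocs and moves the source ones back.
def start_migration(allocations, instance, migration, dest_allocs):
    allocations[migration] = allocations.pop(instance)      # step 1
    allocations[instance] = dest_allocs                     # step 2

def finish_migration(allocations, instance, migration, success):
    if success:
        del allocations[migration]                          # step 3
    else:
        # Overwriting the instance entry discards the dest allocs and
        # restores the source ones in one move.
        allocations[instance] = allocations.pop(migration)  # step 4

allocs = {"inst-1": {"src-cn": {"VCPU": 2}}}
start_migration(allocs, "inst-1", "mig-1", {"dest-cn": {"VCPU": 2}})
print(allocs)  # migration holds source allocs, instance holds dest allocs

finish_migration(allocs, "inst-1", "mig-1", success=False)
print(allocs)  # {'inst-1': {'src-cn': {'VCPU': 2}}}
```

Note this model also shows why the sharing-provider case breaks: step 1 only moves whatever the copy logic hands it, so anything missing there is gone by the time step 4 restores it.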
fried_rice | oh, so what's actually happening is we're erroneously losing the DISK_GB allocation for a minute during the migration, but picking it up again on the dest. | 15:15 |
mriedem | so we don't hit _move_operation_alloc_request in the scheduler report client | 15:15 |
cdent | dansmith: the root problem in your test is that two compute nodes are not in the aggregate, when you do the put for that it is coming up 404, so the resource providers don't exist yet, not sure why that is | 15:15 |
dansmith | cdent: ah, okay | 15:15 |
dansmith | mriedem: hmm, so we end up double-claiming on the shared provider? | 15:16 |
dansmith | mriedem: I thought even with the new accounting we had to grab the allocation for the provider in question and regenerate it, which would mean the instance's allocation on the dest wouldn't include the shared one | 15:16 |
mriedem | i don't think so...as eric said, we'll remove the allocs for the instance on the shared provider, | 15:16 |
mriedem | then claim on the dest during scheduling | 15:16 |
*** shaohe_feng has quit IRC | 15:17 | |
dansmith | because we do a full regular schedule? | 15:17 |
mriedem | i think on a revert or failed migration we'd eff that up though | 15:17 |
mriedem | yes | 15:17 |
mriedem | *EXCEPT* in the case of forced live migrate | 15:17 |
dansmith | oh you're saying we drop the disk allocation but only because we don't copy it for the migration | 15:17 |
mriedem | we don't go through the scheduler there | 15:17 |
mriedem | dansmith: yeah | 15:17 |
dansmith | so, | 15:18 |
*** shaohe_feng has joined #openstack-nova | 15:18 | |
mriedem | on a revert or failed resize, we'll delete the allocs for the instance on the dest host (created by the scheduler) and move those back from the migration to the instance, but the migration allocs won't be on the sharing provider | 15:18 |
dansmith | what happens if placement picks a different sharing provider than we had before? our disk doesn't actually move | 15:18 |
mriedem | so we'd lost the DISK_GB alocs in that case | 15:18 |
dansmith | ah, yeah, anything where we use the migration's allocations would be wrong | 15:18 |
*** niraj_singh has quit IRC | 15:18 | |
mriedem | dansmith: yup that's this https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4138 | 15:19 |
mriedem | well, we wouldn't hit that yet | 15:19 |
mriedem | the migration consumer will only have VCPU and MEMORY_MB allocations against the source node | 15:19 |
mriedem | so this https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4155 | 15:19 |
mriedem | so that's definitely busted - we could easily test that with a resize revert test and verify the DISK_GB allocation for the instance is gone | 15:20 |
dansmith | and forced live | 15:20 |
mriedem | i haven't stepped through forced live yet (or evac for that matter) | 15:20 |
sean-k-mooney | dansmith: we should only be able to pick a different provider in a block migrate case, correct? if we do not set that flag we should not allow the sharing provider to change | 15:21 |
dansmith | also, | 15:21 |
dansmith | migrate to an older node that doesn't have this will drop the shared disk allocation | 15:21 |
dansmith | because placement will allocate from its own disk inventory, even though it's the same pool, | 15:21 |
mriedem | sean-k-mooney: re: "for example the nano flavor with with the cirros image in devstack with no volume for the guest." i don't know what you're asking me | 15:21 |
dansmith | and then when we upgrade that node, we won't convert the allocations | 15:21 |
dansmith | in fact, | 15:21 |
dansmith | any upgrade where we boot up on rocky code and change our inventory will break all our allocations right? | 15:22 |
dansmith | fried_rice: cdent what happens if I have allocations against my disk_gb inventory and then I update my inventory with no disk_gb ? | 15:22 |
fried_rice | The inv update will bounce 409 InventoryInUse. | 15:22 |
mriedem | i don't think you can do that | 15:22 |
mriedem | yeah | 15:22 |
fried_rice | on every periodic | 15:23 |
dansmith | okay, so anyone with MISC_SHARES now will failboat on upgrade to rocky | 15:23 |
fried_rice | update_from_provider_tree will never succeed. | 15:23 |
dansmith | and anyone that sets that on non-empty computes will stop reporting | 15:23 |
mriedem | oh right b/c upt removes the DISK_GB from the node provider if it sees it's in a sharing provider relationship | 15:23 |
dansmith | yeah | 15:24 |
fried_rice | Note that we didn't document that you could do this. | 15:24 |
mriedem | and if that DISK_GB is being used it will blow up on the remove | 15:24 |
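Placement's behavior here can be modeled in a few lines (a toy check, not placement's actual code): an inventory update that drops a resource class with live allocations is rejected with 409 InventoryInUse, which is why update_from_provider_tree keeps failing on every periodic.

```python
class InventoryInUse(Exception):
    pass

def update_inventory(inventory, allocations, new_inventory):
    # Reject removal of any resource class that still has allocations,
    # mirroring placement's 409 InventoryInUse response.
    used = {rc for alloc in allocations.values() for rc in alloc}
    removed = set(inventory) - set(new_inventory)
    if used & removed:
        raise InventoryInUse(sorted(used & removed))
    inventory.clear()
    inventory.update(new_inventory)

inv = {'VCPU': 8, 'MEMORY_MB': 16384, 'DISK_GB': 100}
allocs = {'inst': {'DISK_GB': 20}}
try:
    # The Rocky libvirt driver stops reporting DISK_GB when it sees the
    # sharing setup -- with existing disk allocations, this 409s.
    update_inventory(inv, allocs, {'VCPU': 8, 'MEMORY_MB': 16384})
except InventoryInUse as exc:
    print('409 InventoryInUse:', exc)
```

With no allocations against DISK_GB the same update succeeds, which is why only non-empty computes hit the failure dansmith describes.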
mriedem | fried_rice: heh i know | 15:24 |
dansmith | fried_rice: and yet, it's in documentation and people have tried it, hence the bug yeah? | 15:24 |
sean-k-mooney | mriedem: in that instance, the flavor has root_gb=0, the image is like 20MB in glance, and we boot it on the dest without claiming space in placement. the vm can use as much space as the disk topology in the image specifies | 15:24 |
*** alexchadin has quit IRC | 15:24 | |
*** tssurya has quit IRC | 15:24 | |
fried_rice | dansmith: The bug was opened because bhagyashri was working on it and I said it should have a bug report. | 15:24 |
mriedem | shared storage providers is definitely a feature/spec | 15:25 |
fried_rice | ...which we don't claim works yet. | 15:25 |
mriedem | given the upgrade/CI/etc | 15:25 |
mriedem | i know, but | 15:25 |
*** ttsiouts has quit IRC | 15:25 | |
fried_rice | we should document that we *don't* support it. | 15:25 |
dansmith | well, there's mention of that trait in our own docs, and given what it breaks it's not trivial, IMHO | 15:25 |
fried_rice | and then take some time to resolve these issues correctly. | 15:25 |
mriedem | that's what dansmith and melwitt were talking about last night | 15:25 |
sean-k-mooney | mriedem: anyway that's unrelated to dan's question except for the fact we don't track the disk usage correctly in placement | 15:25 |
*** ttsiouts has joined #openstack-nova | 15:26 | |
melwitt | yes, described here L8 https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo | 15:26 |
mriedem | sean-k-mooney: yes i think that's correct and likely a bug; i'm not entirely sure how the resource tracker reports disk usage for a flavor like that which *is* using local disk | 15:26 |
*** fgonzales_ has quit IRC | 15:26 | |
mriedem | where root_gb=0 | 15:26 |
mriedem | sean-k-mooney: we've also said you shouldn't use root_gb=0 except for volume-backed flavors | 15:27 |
mriedem | and added a policy rule in rocky to disable that | 15:27 |
*** shaohe_feng has quit IRC | 15:27 | |
sean-k-mooney | mriedem: oh cool | 15:27 |
sean-k-mooney | is it set by default? | 15:27 |
mriedem | sean-k-mooney: https://github.com/openstack/nova/commit/763fd62464e9a0753e061171cc1fd826055bbc01 | 15:28 |
mriedem | the plan was to disable that by default starting in stein | 15:28 |
*** jfinck has quit IRC | 15:28 | |
mriedem | so you can't boot a server with a root_gb=0 flavor unless you're doing boot from volume | 15:28 |
cdent | dansmith: microversions :( | 15:28 |
mriedem | how are microversions related to this? | 15:28 |
dansmith | I assume because I messed up a version in my test | 15:29 |
*** shaohe_feng has joined #openstack-nova | 15:29 | |
cdent | (sorry, his test) | 15:29 |
mriedem | ah | 15:29 |
mriedem | whew | 15:29 |
sean-k-mooney | mriedem: right ok cool. if you have existing instances booted that way we would have to update the embedded flavor or resource dict to indicate that, or live migration will explode | 15:29 |
mriedem | sean-k-mooney: so figuring out how we track disk usage for those types of flavors in the resource tracker would be good to know | 15:29 |
mriedem | because if it was never tracked as usage before, then it's not really a huge regression to not be tracking it in placement | 15:30 |
sean-k-mooney | mriedem: im pretty sure we track it as 0 | 15:30 |
sean-k-mooney | e.g. we dont track it at all | 15:30 |
*** ttsiouts has quit IRC | 15:30 | |
mriedem | sean-k-mooney: i think that too b/c https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L1461 | 15:30 |
mriedem | object_or_dict.flavor.root_gb | 15:30 |
sean-k-mooney | it was a way to bypass quota in the past | 15:31 |
mriedem | the is_bfv in there was just recently added in the same series of fixes for the bfv thing | 15:31 |
mriedem | right, so to summarize, don't set flavor root_gb=0 *unless* those flavors are only used with bfv instances, | 15:31 |
mriedem | and we have the is_bfv root_gb reporting in the RT and placement fixed in rocky | 15:31 |
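mriedem's summary boils down to one conditional in the resource tracker (a simplified sketch; the real accounting is in nova/compute/resource_tracker.py):

```python
def reported_disk_gb(flavor_root_gb, is_bfv, image_size_gb=0):
    """Simplified view of the root disk usage the RT reports.

    With the Rocky-era is_bfv fix, volume-backed instances report no
    local DISK_GB. A root_gb=0 flavor that is *not* volume-backed
    still reports 0 -- the gap sean-k-mooney points out: the real
    usage would come from the image (image_size_gb here), which the
    RT never sees.
    """
    if is_bfv:
        return 0
    return flavor_root_gb  # ignores image_size_gb when root_gb == 0

assert reported_disk_gb(20, is_bfv=False) == 20
assert reported_disk_gb(20, is_bfv=True) == 0
# The untracked case: local disk in use, but 0 reported.
assert reported_disk_gb(0, is_bfv=False, image_size_gb=1) == 0
```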
dansmith | melwitt: aight, well, anyway, my recommendation is that we just remove that inventory quirk for rocky since it can't work and it's one line. alternatively, at least a known-issue reno just to cover our butts in case someone hits it | 15:32 |
sean-k-mooney | mriedem: ya i think though we will have to fix up the allcoation for existing instance that are not bfv going forward | 15:32 |
mriedem | sean-k-mooney: we do | 15:32 |
dansmith | melwitt: it's like having a half-merged feature.. doesn't really serve any purpose and is externally tickle-able to failure | 15:32 |
dansmith | obviously your call on what to do | 15:33 |
mriedem | sean-k-mooney: https://review.openstack.org/#/c/583715/ | 15:33 |
mriedem | sean-k-mooney: we'll heal on move | 15:33 |
cdent | any swag on how hard to make it go, now-ish? | 15:33 |
mriedem | "make it go" == make it work? | 15:33 |
mriedem | we don't even have multi-node shared storage provider CI | 15:33 |
mriedem | so very high risk IMO | 15:33 |
sean-k-mooney | mriedem: in the non BFV case we need to read the size form the image if the flavor root_gb=0 | 15:33 |
mriedem | way too late | 15:34 |
mriedem | sean-k-mooney: yup | 15:34 |
dansmith | yeah, way too late to try to make any of the broken non-broken | 15:34 |
mriedem | but that's not reported to the RT as far as i know | 15:34 |
fried_rice | dansmith: I can propose that if you like. | 15:34 |
dansmith | fried_rice: which? | 15:34 |
mriedem | dansmith: so we should likely start with a bug saying this stuff will nuke your DISK_GB allocations on failure or revert at least | 15:34 |
*** andymccr has quit IRC | 15:34 | |
mriedem | fried_rice: melwitt: ^ | 15:34 |
fried_rice | dansmith: Taking that line out of the libvirt driver. | 15:34 |
dansmith | mriedem: for sure | 15:35 |
*** andymccr has joined #openstack-nova | 15:35 | |
dansmith | fried_rice: sure, I'm happy to do it as well, either way | 15:35 |
mriedem | and we can track the various bugs in a spec in stein if we're going to go full on and support this | 15:35 |
fried_rice | dansmith: You want to write up the bug, I'll do the patch? | 15:36 |
mriedem | b/c we need a spec for the upgrade impacts obviously, and how to deploy the thing, plus CI requirements (which i'm already half-way done with) | 15:36 |
melwitt | sounds like a plan | 15:36 |
dansmith | fried_rice: if mriedem isn't going to | 15:36 |
mriedem | go ahead | 15:36 |
dansmith | fried_rice: I think mriedem really wants to do it | 15:36 |
dansmith | I heard him say earlier | 15:36 |
dansmith | so I don't want to step on his toes | 15:36 |
mriedem | i'm cleaning up stephenfin's last 2 changes in his vswitch series | 15:36 |
dansmith | because I think he measures his weekly progress by bugs reported | 15:37 |
dansmith | mriedem: more? | 15:37 |
mriedem | plus, zuul just f'ed my ceph ci run that was almost done! | 15:37 |
melwitt | this is new, gate failure RETRY_LIMIT | 15:37 |
melwitt | great | 15:37 |
mriedem | melwitt: yes same | 15:37 |
openstackgerrit | Chris Dent proposed openstack/nova master: WIP: funtional test with sharing providers https://review.openstack.org/586589 | 15:37 |
mriedem | infra just posted a status | 15:37 |
mriedem | #status alert A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes | 15:37 |
*** shaohe_feng has quit IRC | 15:37 | |
mriedem | so don't recheck | 15:37 |
cdent | dansmith: ^ that gets the test actually making reasonable requests, but no more than that | 15:38 |
cdent | not sure if you care given the earlier discussion, but in case you do... | 15:38 |
*** shaohe_feng has joined #openstack-nova | 15:38 | |
dansmith | cdent: yeah, probably don't care now that I found this other one | 15:38 |
dansmith | but thanks for setting me straight | 15:38 |
-openstackstatus- NOTICE: A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes | 15:39 | |
*** ChanServ changes topic to "A zuul config error slipped through and caused a pile of job failures with retry_limit - a fix is being applied and should be back up in a few minutes" | 15:39 | |
*** hongbin_ has joined #openstack-nova | 15:39 | |
fried_rice | cdent: All I care about is that you misspelled funtional | 15:40 |
mriedem | i can write the bug if no one has started yet | 15:40 |
fried_rice | Do it. And let the English see you do it. | 15:41 |
mriedem | alright | 15:41 |
cdent | fried_rice: that was dansmith in this case | 15:42 |
dansmith | I was rushing | 15:42 |
cdent | but I can see how it being me would be unsurprising | 15:42 |
* cdent is always rushing | 15:42 | |
*** Shilpa has quit IRC | 15:42 | |
cdent | is my new excuse | 15:42 |
fried_rice | cdent: If it had been three weeks ago, and it had been fuctional, I would have totally known it was you. | 15:42 |
cdent | I fuctional tests all the time | 15:43 |
fried_rice | Is there a way to mark a normal funtional test as an xfail? | 15:44 |
fried_rice | oh, shit, bhagyashri's test still succeeds with that bit commented out :( | 15:45 |
*** mdrabe has quit IRC | 15:46 | |
fried_rice | ignore me, phew. | 15:46 |
*** mdrabe has joined #openstack-nova | 15:46 | |
*** AlexeyAbashkin has quit IRC | 15:46 | |
cdent | fried_rice: https://docs.python.org/3/library/unittest.html#unittest.expectedFailure | 15:47 |
cdent | https://docs.python.org/3/library/unittest.html#skipping-tests-and-expected-failures | 15:47 |
fried_rice | thanks cdent | 15:47 |
fried_rice | only py3? Are we running func tests on only py3 these days? | 15:47 |
*** shaohe_feng has quit IRC | 15:47 | |
sean-k-mooney | fried_rice: i think we have both. still | 15:48 |
fried_rice | yup sean-k-mooney | 15:48 |
sean-k-mooney | you can probably do a version check and just skip on 2 and expect failure on 3 | 15:49 |
*** shaohe_feng has joined #openstack-nova | 15:49 | |
mriedem | here you go https://bugs.launchpad.net/nova/+bug/1784020 | 15:50 |
openstack | Launchpad bug 1784020 in OpenStack Compute (nova) "Shared storage providers are not supported and will break things if used" [High,Triaged] | 15:50 |
mriedem | dansmith: fried_rice: melwitt: ^ | 15:50 |
dansmith | oh thanks | 15:50 |
* dansmith closes the empty bug report he had open | 15:50 | |
cdent | fried_rice: https://docs.python.org/2.7/library/unittest.html#skipping-tests-and-expected-failures | 15:50 |
melwitt | now that's a bug report | 15:50 |
dansmith | mine would have been 5% of that | 15:51 |
mriedem | fried_rice: testtools has an expectedFailure thing | 15:51 |
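Per cdent's links, `expectedFailure` exists in both py2.7 and py3 unittest (and testtools has its own), so no version check is needed. A minimal example of marking a known-broken test as xfail:

```python
import unittest

class SharedStorageTests(unittest.TestCase):
    @unittest.expectedFailure
    def test_known_broken(self):
        # Documents a known bug: the run stays green while this fails,
        # and flips to "unexpected success" once the bug is fixed.
        self.assertEqual(1, 2)

    def test_working(self):
        self.assertEqual(1, 1)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedStorageTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The expected failure is recorded in `result.expectedFailures` rather than `result.failures`, so the overall run still reports success.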
melwitt | hah, I know | 15:51 |
mriedem | you said i take pride in it... | 15:51 |
dansmith | "s'broken, kthx" | 15:51 |
mriedem | mostly because if i don't put those details in there, i'll totally forget wtf we talked about a year from now | 15:51 |
melwitt | yeah. the details are super helpful | 15:51 |
sean-k-mooney | mriedem: i think that bug also falls into the category of "we don't have ci for it so it's broken by default" | 15:52 |
mriedem | well, | 15:52 |
mriedem | we don't have CI for a lot of things | 15:52 |
mriedem | and we still support them, <cough>evacuate</cough> | 15:52 |
*** gyee has joined #openstack-nova | 15:52 | |
sean-k-mooney | yes and i assume they are broken by default unless proven otherwise by it working when i use it and being happy | 15:53 |
dansmith | evacuate is hard to test for legit reasons, but this shared thing is not | 15:53 |
dansmith | and it's also often broken | 15:53 |
mriedem | yup | 15:54 |
mriedem | btw, yes, forced host live migrate/evacuate will drop the DISK_GB allocation on the shared provider | 15:54 |
*** flwang1 has quit IRC | 15:55 | |
dansmith | mriedem: from your test? | 15:56 |
mriedem | no just looking at the code | 15:57 |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/scheduler/utils.py#L500 | 15:57 |
mriedem | we only get the allocations for the instance against the source node | 15:57 |
dansmith | oh | 15:57 |
mriedem | and copy those to the dest node for the instance | 15:57 |
mriedem | double up | 15:57 |
mriedem | doesn't put anything on the migration record in the force cas | 15:58 |
mriedem | *case | 15:58 |
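The forced path (the `claim_resources_on_destination` helper in nova/scheduler/utils.py mriedem is describing) can be modeled as a toy: it copies the instance's source-node allocations onto the dest under the *same* consumer, leaving nothing on the migration record to revert to.

```python
def forced_claim(allocs, instance, source_rp, dest_rp):
    # Copy the instance's allocations against the source provider onto
    # the dest provider under the same consumer -- a doubled claim.
    # Nothing moves to the migration record, so a rollback has nothing
    # to restore from.
    allocs[instance][dest_rp] = dict(allocs[instance][source_rp])

allocs = {'inst': {'src': {'VCPU': 2, 'MEMORY_MB': 512}}}
forced_claim(allocs, 'inst', 'src', 'dst')
assert allocs['inst'] == {'src': {'VCPU': 2, 'MEMORY_MB': 512},
                          'dst': {'VCPU': 2, 'MEMORY_MB': 512}}
```

Note it only copies allocations against the source compute node provider, so any DISK_GB allocation held on a sharing provider is simply dropped on the dest.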
*** shaohe_feng has quit IRC | 15:58 | |
*** rpittau has quit IRC | 15:58 | |
*** r-daneel_ has joined #openstack-nova | 15:58 | |
mriedem | hmm, which makes me wonder if we ever cleanup the dest host allocations on a failed live migration | 15:58 |
mriedem | that is forced | 15:58 |
*** flwang1 has joined #openstack-nova | 16:00 | |
mriedem | looks like post_live_migration will give you a warning but remove the doubled allocation | 16:00 |
*** shaohe_feng has joined #openstack-nova | 16:00 | |
*** r-daneel has quit IRC | 16:00 | |
*** r-daneel_ is now known as r-daneel | 16:00 | |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/manager.py#L6638L6669 | 16:00 |
mriedem | oops | 16:00 |
mriedem | i'll write a functional test for the rollback forced live migration case | 16:01 |
*** openstackgerrit has quit IRC | 16:04 | |
mriedem | https://bugs.launchpad.net/nova/+bug/1784022 | 16:06 |
openstack | Launchpad bug 1784022 in OpenStack Compute (nova) "Failed forced live migration does not rollback doubled up allocations in placement" [High,Triaged] | 16:06 |
mriedem | looks like we regressed that in queens | 16:07 |
*** shaohe_feng has quit IRC | 16:08 | |
mriedem | blarg https://review.openstack.org/#/c/507638/25/nova/compute/manager.py@6252 | 16:08 |
*** shaohe_feng has joined #openstack-nova | 16:09 | |
*** jangutter has quit IRC | 16:10 | |
dansmith | mriedem: are you saying we don't have a migration record if we do a forced? | 16:11 |
mriedem | dansmith: we do, but we don't put the allocations on it | 16:11 |
*** lbragstad_ is now known as lbragstad | 16:11 | |
mriedem | b/c we don't go through the scheduler for forced | 16:11 |
*** ispp has quit IRC | 16:11 | |
dansmith | um | 16:11 |
mriedem | this is just one of the many reasons for the dreaded -5 in dublin | 16:11 |
*** flwang1 has quit IRC | 16:12 | |
mriedem | dansmith: forced live migration calls this method to double up the allocations from the source to the forced dest https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/scheduler/utils.py#L473 | 16:12 |
mriedem | that's from pike when doubling was all the rage | 16:12 |
dansmith | okay, so you're saying on forced we don't do the migration allocations, we just allocate against the newhost, then if we have to revert, we don't have the migration allocations to revert to the instance? | 16:12 |
melwitt | is it safe to recheck yet? I didn't see another status update | 16:12 |
mriedem | dansmith: correct | 16:13 |
mriedem | melwitt: yeah i just did | 16:13 |
melwitt | ok | 16:13 |
mriedem | dansmith: i'll write a functional test for it when i'm back from lunch | 16:13 |
dansmith | mriedem: okay but the doubling is not intentional, just incidental since we didn't replace the instance allocs with the migration one yeah? | 16:13 |
mriedem | it's intentional | 16:13 |
mriedem | it mimics the behavior of doubling in the scheduler from before quens | 16:13 |
mriedem | *queens | 16:14 |
dansmith | right, but we shouldn't be doing any doubling anymore | 16:14 |
mriedem | sure, | 16:14 |
mriedem | but we are :) | 16:14 |
mriedem | for forced | 16:14 |
mriedem | b/c forced is FUN | 16:14 |
mriedem | -20! | 16:14 |
*** links has joined #openstack-nova | 16:14 | |
dansmith | I'm saying we shouldn't intend to be doing that, | 16:14 |
mriedem | not anymore no | 16:14 |
dansmith | which means it's a case we missed in converting to non-doubling | 16:14 |
mriedem | but we missed it in queens with your bp | 16:14 |
mriedem | yes | 16:14 |
dansmith | right, that's what I mean | 16:14 |
dansmith | unintentional | 16:14 |
mriedem | yeah | 16:14 |
mriedem | ok lunch | 16:15 |
*** mriedem is now known as mriedem_away | 16:15 | |
*** links has quit IRC | 16:17 | |
*** links has joined #openstack-nova | 16:17 | |
*** shaohe_feng has quit IRC | 16:18 | |
*** shaohe_feng has joined #openstack-nova | 16:19 | |
*** artom_ has joined #openstack-nova | 16:22 | |
*** links has quit IRC | 16:23 | |
*** Sundar_ has joined #openstack-nova | 16:23 | |
sean-k-mooney | mriedem_away: im going to choose to read -20! as -(20 factorial) to give it the weight it should have | 16:23 |
Sundar_ | efried: Please ping me when you have the time | 16:25 |
*** openstackgerrit has joined #openstack-nova | 16:26 | |
openstackgerrit | Eric Fried proposed openstack/nova master: libvirt: Revert non-reporting DISK_GB if sharing https://review.openstack.org/586614 | 16:26 |
fried_rice | mriedem_away, dansmith, cdent, melwitt: ^ | 16:26 |
fried_rice | Sundar_: Bad timing :( I have to run for a bit. Will you be around in a couple of hours? | 16:26 |
Sundar_ | NP, sure | 16:27 |
*** harlowja has joined #openstack-nova | 16:27 | |
*** flwang1 has joined #openstack-nova | 16:28 | |
*** shaohe_feng has quit IRC | 16:28 | |
*** shaohe_feng has joined #openstack-nova | 16:29 | |
*** derekh has quit IRC | 16:30 | |
*** tesseract has quit IRC | 16:32 | |
*** fried_rice is now known as fried_rolls | 16:33 | |
*** vladikr has quit IRC | 16:35 | |
*** vladikr has joined #openstack-nova | 16:35 | |
*** shaohe_feng has quit IRC | 16:39 | |
dansmith | mriedem_away: when you're back: I guess I don't really see the thing requiring the dynamic opts registration as being a bad thing | 16:40 |
dansmith | mriedem_away: it forces us to think about it when we write new code and the tests for it | 16:40 |
*** shaohe_feng has joined #openstack-nova | 16:41 | |
*** Bellesse has quit IRC | 16:44 | |
*** rmart04 has quit IRC | 16:46 | |
*** shaohe_feng has quit IRC | 16:49 | |
openstackgerrit | Dan Smith proposed openstack/nova master: Assorted cleanups from numa-aware-vswitches series https://review.openstack.org/582651 | 16:49 |
openstackgerrit | Dan Smith proposed openstack/nova master: Add additional functional tests for NUMA networks https://review.openstack.org/585385 | 16:49 |
*** shaohe_feng has joined #openstack-nova | 16:49 | |
*** felipemonteiro__ has joined #openstack-nova | 16:52 | |
*** felipemonteiro_ has quit IRC | 16:52 | |
cdent | melwitt, dansmith, mriedem_away : next week I'm pretty broadly available, so if stuff comes up and you want to wind me up and point me particular places, please ask. | 16:52 |
melwitt | will do, thanks | 16:54 |
*** shaohe_feng has quit IRC | 16:59 | |
*** shaohe_feng has joined #openstack-nova | 17:04 | |
*** felipemonteiro_ has joined #openstack-nova | 17:06 | |
*** mgoddard has quit IRC | 17:07 | |
*** yamahata has quit IRC | 17:07 | |
*** burt has quit IRC | 17:08 | |
*** shaohe_feng has quit IRC | 17:09 | |
*** felipemonteiro__ has quit IRC | 17:10 | |
*** dtantsur is now known as dtantsur|afk | 17:10 | |
*** gbarros has joined #openstack-nova | 17:11 | |
*** shaohe_feng has joined #openstack-nova | 17:12 | |
*** bacape_ has joined #openstack-nova | 17:16 | |
*** felipemonteiro__ has joined #openstack-nova | 17:18 | |
*** felipemonteiro_ has quit IRC | 17:18 | |
*** bacape_ has quit IRC | 17:18 | |
*** bacape has quit IRC | 17:20 | |
*** shaohe_feng has quit IRC | 17:20 | |
*** shaohe_feng has joined #openstack-nova | 17:20 | |
*** mriedem_away is now known as mriedem | 17:23 | |
*** gbarros has quit IRC | 17:23 | |
*** artom has joined #openstack-nova | 17:23 | |
*** jmlowe has joined #openstack-nova | 17:24 | |
*** artom_ has quit IRC | 17:26 | |
*** savvas has quit IRC | 17:29 | |
*** shaohe_feng has quit IRC | 17:30 | |
*** harlowja has quit IRC | 17:31 | |
*** shaohe_feng has joined #openstack-nova | 17:32 | |
*** felipemonteiro_ has joined #openstack-nova | 17:34 | |
*** felipemonteiro__ has quit IRC | 17:37 | |
*** cfriesen_ has quit IRC | 17:39 | |
*** shaohe_feng has quit IRC | 17:40 | |
*** shaohe_feng has joined #openstack-nova | 17:41 | |
*** gbarros has joined #openstack-nova | 17:42 | |
*** mgoddard has joined #openstack-nova | 17:43 | |
*** yamahata has joined #openstack-nova | 17:43 | |
*** colby_ has joined #openstack-nova | 17:46 | |
colby_ | Hey Everyone. Im trying to get metrics based filtering working in nova. I tried enabling compute_monitors but I always get an error in the logs: | 17:47 |
colby_ | compute_monitors=["nova.compute.monitors.cpu.virt_driver", "numa_mem_bw.virt_driver"] | 17:47 |
colby_ | 2018-07-27 17:43:14.001 2295696 WARNING nova.compute.monitors [req-51711d41-c626-4af2-92fd-dde09c576fb2 - - - - -] Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors). | 17:48 |
colby_ | Ive tried variations on the monitor: cpu.virt_driver & just virt_driver. It always gives the same error | 17:48 |
colby_ | Im on pike, Centos, kvm | 17:49 |
colby_ | I have gnocchi running and collecting resources | 17:49 |
colby_ | 2018-07-27 17:44:36.110 2800963 INFO nova.filters [req-0e8215e5-e029-4104-8578-a917bf9edddc e28435e0a66740968c523e6376c57f68 18882d9c32ba42aeaa33c4703ad84b2c - default default] Filter MetricsFilter returned 0 hosts | 17:50 |
colby_ | Not sure where the problem is | 17:50 |
*** shaohe_feng has quit IRC | 17:50 | |
colby_ | weight_setting=compute.node.cpu.percent=-1.0 | 17:51 |
dansmith | colby_: I really can't help you, but I can tell you that metrics have nothing to do with gnocchi/ceilo | 17:51 |
colby_ | ok I thought I read somewhere that it used the gnocchi metrics... | 17:51 |
dansmith | colby_: the computes have to be configured to report them in order to use the filter | 17:51 |
dansmith | nope | 17:51 |
*** shaohe_feng has joined #openstack-nova | 17:52 | |
colby_ | ok so then the compute_monitors is the issue then | 17:52 |
dansmith | the metrics come from libvirt, reported by the compute, used by the filter | 17:52 |
colby_ | ok then Im not sure why Im not getting the metrics | 17:53 |
colby_ | besides the filed driver load | 17:53 |
colby_ | or monitor load I mean | 17:53 |
dansmith | yeah, I can't really help beyond that | 17:53 |
sean-k-mooney | dansmith: colby_ if you enable the metric reporting on the compute node ceilometer is able to read them form the message bus and store them but that is a sideffect | 17:54 |
colby_ | Ok so does that mean my metrics reporting is working? | 17:55 |
sean-k-mooney | colby_: by the way memory bandwidth monitoring is broken on skylake. both read and write metrics are actually read... | 17:56 |
colby_ | Im actually just interested in the cpu.percent | 17:56 |
colby_ | I want to not put instances on nodes with high cpu usage. We have a large memory node and the scheduler always puts instances there even when its way overcommited on cpu | 17:57 |
sean-k-mooney | ah well you could just change the order of the weigher to prefer weighing on cpus before memory. am but i have not used the metric based weigher myself so i have not tried to configure it before | 17:58 |
*** penick is now known as OcataGuy | 17:58 | |
*** OcataGuy is now known as MostlyOcataGuy | 17:58 | |
colby_ | ah ok. I treid weight_setting=cpu.percent=-1.0 | 17:59 |
colby_ | but I got zero hosts returned with metrics filter enabled | 17:59 |
*** Sundar_ has quit IRC | 18:00 | |
colby_ | I was not aware that changing weigher order made any difference | 18:00 |
*** shaohe_feng has quit IRC | 18:01 | |
colby_ | I just used: nova.scheduler.weights.all_weighers | 18:01 |
colby_ | I thought it was all just based on multipliers | 18:02 |
*** savvas has joined #openstack-nova | 18:02 | |
*** med_ has quit IRC | 18:02 | |
sean-k-mooney | colby_: well strictly speaking it does not, but what i meant was listing only the weighers you care about and then setting their multipliers | 18:02 |
*** jdillaman has quit IRC | 18:03 | |
*** shaohe_feng has joined #openstack-nova | 18:03 | |
sean-k-mooney | if you only care about cpus then you can simply enable only the cpu weigher | 18:03 |
colby_ | hmm ok | 18:04 |
colby_ | thanks | 18:04 |
melwitt | colby_: are you specifying compute_monitors= under the [DEFAULT] section of the nova.conf? | 18:04 |
colby_ | yes | 18:04 |
colby_ | but I get the error: Excluding nova.compute.monitors.cpu monitor virt_driver. Not in the list of enabled monitors (CONF.compute_monitors) | 18:04 |
melwitt | okay. the log message you posted earlier is saying it doesn't find the monitor in the list from the conf option. hm | 18:05 |
*** gbarros_ has joined #openstack-nova | 18:05 | |
colby_ | oh wait...there is a typo <smacks head> | 18:06 |
sean-k-mooney | colby_: you could probably get a similar effect by setting ram_weight_multiplier=0 or 0.1 so that ram is basically ignored when weighing if that does not work | 18:08 |
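Pulling this exchange together, a nova.conf along these lines should cover what colby_ is after (Pike-era option names; treat the exact filter list and weigher class paths as an assumption to check against your release):

```ini
[DEFAULT]
# Must match the monitor's registered name exactly ("cpu.virt_driver"),
# or the compute logs "Excluding ... Not in the list of enabled monitors".
compute_monitors = cpu.virt_driver

[filter_scheduler]
enabled_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,MetricsFilter
weight_classes = nova.scheduler.weights.all_weighers
# Downplay free-RAM weighting so the big-memory host stops winning every time.
ram_weight_multiplier = 0.1

[metrics]
# Negative multiplier: prefer hosts with lower CPU utilization.
weight_setting = cpu.percent=-1.0
```

The metrics come straight from the compute's monitors via libvirt, as dansmith says; gnocchi/ceilometer are not involved in scheduling.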
colby_ | ok thanks for your help! | 18:08 |
*** gbarros has quit IRC | 18:09 | |
*** gbarros_ has quit IRC | 18:09 | |
*** gbarros has joined #openstack-nova | 18:10 | |
*** shaohe_feng has quit IRC | 18:11 | |
*** gbarros_ has joined #openstack-nova | 18:12 | |
*** gbarros__ has joined #openstack-nova | 18:13 | |
*** mriedem1 has joined #openstack-nova | 18:14 | |
*** mriedem has quit IRC | 18:14 | |
*** gbarro___ has joined #openstack-nova | 18:14 | |
*** gbarros has quit IRC | 18:15 | |
*** shaohe_feng has joined #openstack-nova | 18:15 | |
*** gbarros has joined #openstack-nova | 18:15 | |
*** harlowja has joined #openstack-nova | 18:15 | |
*** gbarros_ has quit IRC | 18:16 | |
*** gbarros__ has quit IRC | 18:18 | |
*** sridharg has quit IRC | 18:18 | |
*** gbarro___ has quit IRC | 18:18 | |
sean-k-mooney | melwitt: mriedem1 https://review.openstack.org/#/c/586568/ hit the retry_limit issue after your last recheck. is that issue(retry_limit) still happening in the gate | 18:19 |
melwitt | I think it's been fixed | 18:19 |
sean-k-mooney | well there is no gate job for that patch at the moment. will i retry it? | 18:20 |
melwitt | yeah, go ahead. I didn't realize that one hadn't been rechecked | 18:21 |
*** shaohe_feng has quit IRC | 18:21 | |
sean-k-mooney | melwitt: it had. you did it at 5:14 but it hit the error again | 18:22 |
sean-k-mooney | you probably missed the fix by a few minutes | 18:22 |
melwitt | yeah, guh | 18:22 |
mriedem1 | dansmith: danicus, i have good pleasurable news | 18:22 |
*** mriedem1 is now known as mriedem | 18:22 | |
dansmith | um | 18:22 |
*** shaohe_feng has joined #openstack-nova | 18:22 | |
mriedem | bug 1784022 isn't a problem | 18:23 |
openstack | bug 1784022 in OpenStack Compute (nova) queens "Failed forced live migration does not rollback doubled up allocations in placement" [High,Triaged] https://launchpad.net/bugs/1784022 | 18:23 |
mriedem | it's handled | 18:23 |
dansmith | oh yeah? | 18:23 |
dansmith | that is indeed pleasurable | 18:23 |
melwitt | dansmith: wanna ack this? https://review.openstack.org/586614 | 18:24 |
dansmith | yup | 18:25 |
*** artom has quit IRC | 18:26 | |
melwitt | hooray | 18:26 |
melwitt | dangit, missed artom again. I had wanted to ask him about https://bugs.launchpad.net/nova/+bug/1708433 | 18:27 |
openstack | Launchpad bug 1708433 in OpenStack Compute (nova) "Attaching sriov nic VM fail with keyError pci_slot" [Undecided,New] | 18:27 |
mriedem | dansmith: i'll push up the functional test anyway since it didn't look like we had one, only for the non-forced rollback checks | 18:28 |
dansmith | okay | 18:28 |
dansmith | mriedem: did you see my comment above about stephen's set? | 18:29 |
dansmith | and I pushed up the other fixes to that, btw | 18:29 |
dansmith | since you hadn't and seemingly got distracted with this other thing | 18:29 |
*** Sundar_ has joined #openstack-nova | 18:29 | |
dansmith | oh I see you did | 18:29 |
dansmith | cool | 18:29 |
*** rmart04 has joined #openstack-nova | 18:31 | |
*** jmlowe has quit IRC | 18:31 | |
*** shaohe_feng has quit IRC | 18:31 | |
mriedem | sure did | 18:33 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add functional test for forced live migration rollback allocs https://review.openstack.org/586636 | 18:33 |
*** shaohe_feng has joined #openstack-nova | 18:34 | |
mriedem | well, just in time for us to kill the shared storage provider support, i got it passing the ceph job http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/ | 18:35 |
dansmith | presumably because we're left with broken allocations after a revert or something, but don't check/assert them? | 18:36 |
*** artom has joined #openstack-nova | 18:37 | |
mriedem | right tempest won't assert any of that stuff, | 18:37 |
mriedem | we do have a post-test hook in the nova-next job for making sure there are no orphaned allocations but only on compute node providers | 18:37 |
dansmith | we had some sanity checking and logging in the RT when we removed the healing.. maybe there is some evidence in there? | 18:37 |
mriedem | oh nvm it's not just computes, it's all resource providers | 18:38 |
mriedem | but we don't run it on that job | 18:38 |
dansmith | http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_35_337258 | 18:39 |
mriedem | yeah i don't see any obvious warnings related to allocations | 18:40 |
mriedem | i think if we ran our post-test leaked allocation hook on this job it would fail | 18:40 |
*** flwang1 has quit IRC | 18:41 | |
mriedem | well, maybe not for single node | 18:41 |
*** shaohe_feng has quit IRC | 18:42 | |
*** flwang1 has joined #openstack-nova | 18:42 | |
dansmith | yeah, so there are 133 logs of instance fd563ed2-d42c-4dc1-a614-8700c6e6c8fd | 18:42 |
dansmith | having non-cleaned-up allocations | 18:43 |
*** shaohe_feng has joined #openstack-nova | 18:43 | |
dansmith | although really the allocations that we'd destroy wouldn't be against the compute node, | 18:43 |
dansmith | and would be gone not stale | 18:43 |
dansmith | so even your check probably wouldn't catch it | 18:43 |
dansmith | because we'd be _losing_ not _leaking_ disk allocations | 18:43 |
dansmith | also, um | 18:45 |
dansmith | I just noticed that we're logging an entire console log out of privsep somewhere | 18:45 |
dansmith | http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_550670 | 18:45 |
dansmith | you could argue that is a security issue if instances log sensitive info to their console | 18:46 |
mriedem | nice, 9 of those | 18:46 |
mriedem | you can open that bug | 18:46 |
*** r-daneel_ has joined #openstack-nova | 18:47 | |
*** r-daneel has quit IRC | 18:47 | |
*** r-daneel_ is now known as r-daneel | 18:47 | |
dansmith | okay | 18:47 |
dansmith | does privsep daemon log everything over the channel or something? | 18:48 |
*** s10 has quit IRC | 18:48 | |
sean-k-mooney | dansmith: that log is because setting a route in the guest failed http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_555618 | 18:49 |
dansmith | not sure about that | 18:50 |
sean-k-mooney | i think | 18:50 |
Sundar_ | efried: I need to take off for lunch. I'll look for your response in https://review.openstack.org/#/c/577438/. We need to get this discussion to a closure. | 18:50 |
dansmith | don't think so, I'm not sure why we'd log the console output in that case | 18:50 |
*** Sundar_ has quit IRC | 18:50 | |
dansmith | the route errors on the console are just there because we're logging it, if that's what you're looking at | 18:51 |
*** rmart04 has quit IRC | 18:51 | |
sean-k-mooney | yes it was but this looks like the output of dmesg when we are running through cloud-init | 18:51 |
sean-k-mooney | well i guess its the main console log | 18:52 |
*** shaohe_feng has quit IRC | 18:52 | |
dansmith | sean-k-mooney: it's the instance console log | 18:53 |
*** shaohe_feng has joined #openstack-nova | 18:53 | |
dansmith | which is more than dmesg | 18:53 |
*** rmart04 has joined #openstack-nova | 18:53 | |
sean-k-mooney | well its a debug log. i wonder if it's related to http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_18_07_23_548723 | 18:54 |
dansmith | it looks to me like privsep daemon is logging anything sent over the channel, | 18:54 |
sean-k-mooney | by that i mean its a debug log so at least it does not do this normally | 18:54 |
*** rmart04 has quit IRC | 18:54 | |
dansmith | and since we're using it to do a readpty of the console, it gets logged | 18:55 |
dansmith | sean-k-mooney: lots of people run with debug on all the time | 18:55 |
dansmith | https://bugs.launchpad.net/nova/+bug/1784062 | 18:55 |
openstack | Launchpad bug 1784062 in OpenStack Compute (nova) "Instance console data is logged at DEBUG" [Undecided,New] | 18:55 |
dansmith | melwitt: ^ | 18:55 |
dansmith | I dunno what will be involved in squelching that, | 18:55 |
*** gbarros has quit IRC | 18:55 | |
dansmith | but might be good to fix that before GA, IMHO | 18:55 |
melwitt | gah, moar bugs | 18:55 |
melwitt | yeah, agreed. I'll put it on the RC1 list | 18:56 |
sean-k-mooney | dansmith: well i know privsep propagates any exceptions back over the unix socket and any logging within the privsep daemon is redirected to the parent too as far as i know | 18:57 |
dansmith | I'd like to point to mriedem's statement that we should be finding and fixing critical bugs during this phase instead of rushing on a lot of FFEs | 18:57 |
dansmith | the last 24 hours has been pretty ... that. | 18:57 |
*** fried_rolls is now known as fried_rice | 18:59 | |
*** MostlyOcataGuy is now known as penick | 19:00 | |
mriedem | SWEET VALIDATION | 19:01 |
*** shaohe_feng has quit IRC | 19:02 | |
sean-k-mooney | dansmith: its coming from this line https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L442 | 19:02 |
dansmith | sean-k-mooney: that wouldn't make much sense | 19:03 |
dansmith | I expect it's the one below, L455 | 19:03 |
dansmith | TestNetworkBasicOps-1426085565] privsep: reply[140593546325360]: (4, '') | 19:03 |
sean-k-mooney | sorry yes l455 | 19:03 |
*** shaohe_feng has joined #openstack-nova | 19:03 | |
dansmith | yup | 19:03 |
*** r-daneel has quit IRC | 19:04 | |
melwitt | that doesn't look very squelchable | 19:04 |
sean-k-mooney | so should we just delete those? | 19:04 |
dansmith | melwitt: agree, it's sticky, but .. imagine what else we might be logging when we're running commands as root... | 19:04 |
melwitt | no, I agree. just thinking, how can we stop it | 19:05 |
dansmith | melwitt: maybe we recommend squelching privsep DEBUG logs in the levels as a security measure? | 19:05 |
dansmith | but still, | 19:05 |
dansmith | something better likely needs doing | 19:05 |
sean-k-mooney | we could add a conf option for extra verbose logging to privsep. | 19:05 |
dansmith | we control that to some degree in our default levels for libraries, | 19:05 |
dansmith | assuming the daemon starts with our config | 19:06 |
*** gbarros has joined #openstack-nova | 19:06 | |
sean-k-mooney | things like os-vif plugins create their own privsep daemons | 19:06 |
sean-k-mooney | it would be nice to turn that off by default globally | 19:07 |
*** gbarros has quit IRC | 19:09 | |
dansmith | decorating privsep methods as "may return sensitive stuff" would be one way, and let the daemon just not log the result | 19:11 |
dansmith | for the DoS case, limiting what we log to 256 chars max or something seems prudent | 19:11 |
melwitt | are you talking about changes to oslo_privsep or nova? | 19:12 |
*** shaohe_feng has quit IRC | 19:12 | |
dansmith | well, the decoration would be both | 19:12 |
dansmith | we'd decorate our things, and the daemon code would have to honor it | 19:13 |
melwitt | okay, I see | 19:13 |
dansmith | the log length limit would be purely privsep | 19:13 |
melwitt | gotcha | 19:13 |
dansmith | and our forcing of a log level for our own daemon could maybe be all on our end, but not sure | 19:13 |
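[A rough sketch of the two mitigations floated just above -- a "this is sensitive" marker the daemon would honor, plus a cap on logged reply size. The names here are invented for illustration; this is not actual oslo.privsep API.]

```python
MAX_LOGGED = 256  # cap reply logging to mitigate giant console dumps


def sensitive(func):
    """Mark a privsep-exposed function whose result must never be logged."""
    func._privsep_sensitive = True
    return func


def log_reply(func, reply):
    """What daemon-side reply logging might do if it honored such a flag."""
    if getattr(func, '_privsep_sensitive', False):
        return '<redacted>'
    text = repr(reply)
    # truncate anything huge (e.g. an entire instance console log)
    if len(text) > MAX_LOGGED:
        text = text[:MAX_LOGGED] + '...'
    return text
```

[A `readpty` of an instance console would be decorated `@sensitive`, so the daemon logs `<redacted>` instead of the console contents.]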
*** shaohe_feng has joined #openstack-nova | 19:14 | |
melwitt | yeah, I was looking for where the default log levels come from and didn't find it yet | 19:14 |
dansmith | well, we control them for our libraries you know, | 19:15 |
dansmith | but I think the daemon itself is logging this | 19:15 |
sean-k-mooney | melwitt: well this is a devstack run so we probably hardcode the log level to debug in the nova conf | 19:15 |
melwitt | the decorator idea sounds like a good feature but I don't know how hard it would be to coordinate that with oslo in the next week or so | 19:15 |
dansmith | but, I assumed it was following our debug=true, so.. | 19:15 |
*** jaypipes has joined #openstack-nova | 19:16 | |
dansmith | I wonder if we've been doing this since this patch merged... | 19:16 |
melwitt | yeah, I mean how do we configure another library to log at a certain different level | 19:16 |
dansmith | surely thought we'd have heard of it | 19:16 |
dansmith | there's a default log levels thing | 19:18 |
sean-k-mooney | well privsep has its own log handler that redirects everything over the unix socket https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L144 | 19:18 |
dansmith | https://docs.openstack.org/kilo/config-reference/content/list-of-compute-config-options.html | 19:18 |
dansmith | default_log_levels = | 19:19 |
*** med_ has joined #openstack-nova | 19:19 | |
*** med_ has quit IRC | 19:19 | |
*** med_ has joined #openstack-nova | 19:19 | |
dansmith | default contains, for example: oslo.messaging=INFO | 19:19 |
dansmith | heh, that's kilo, but... :) | 19:19 |
melwitt | oh, never knew about that. cool | 19:19 |
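[For reference, the knob being discussed looks roughly like this in nova.conf. The exact logger name to pin depends on what the relay logs under (dansmith's eventual patch targets oslo.privsep.daemon), so treat this as a sketch rather than the final change:]

```ini
[DEFAULT]
# Append the privsep daemon logger at INFO to the stock defaults, so that
# debug=True no longer dumps privsep replies (e.g. console reads) to logs.
default_log_levels = amqp=WARN,sqlalchemy=WARN,oslo.messaging=INFO,oslo_privsep.daemon=INFO
```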
dansmith | I don't see that we much control the execution of the daemon really, | 19:21 |
dansmith | so not sure if it even knows what our config is | 19:21 |
dansmith | or how it knows to have debug on | 19:22 |
dansmith | but yeah, if it's being fed into our logger (like sean-k-mooney is suggesting) then setting a level in that config might affect it | 19:22 |
*** shaohe_feng has quit IRC | 19:23 | |
melwitt | hm, yeah. the example shows all kinds of libraries that aren't openstack things as being affected | 19:23 |
sean-k-mooney | well this is what is handeling the log message on the nova side of the call https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L206 | 19:23 |
dansmith | melwitt: it has nothing to do with openstack | 19:24 |
*** shaohe_feng has joined #openstack-nova | 19:24 | |
dansmith | melwitt: it's in our config of the root logger, | 19:24 |
dansmith | which any library will ultimately use | 19:24 |
*** mgoddard has quit IRC | 19:24 | |
dansmith | it just matters that it's in our process space | 19:24 |
*** rtjure has quit IRC | 19:25 | |
dansmith | so the daemon being outside, would be unaffected (unless it's looking at our config), but if it's redirecting all the log traffic over the channel and we have something our side reading that and logging _as_ privsep.daemon in our process, | 19:25 |
dansmith | then our root logger config would affect it | 19:25 |
melwitt | okay, I see. thanks for explaining that | 19:25 |
sean-k-mooney | dansmith: in this case it's even going to work across processes because both the rootwrap and fork clients swap out the logger to redirect it over the socket | 19:26 |
dansmith | sean-k-mooney: yeah I just said that :) | 19:26 |
mriedem | so we just need to hard-code oslo.privsep=INFO or something in our default_log_levels yeah for that bug? didn't read all the backscroll | 19:27 |
sean-k-mooney | hah yep. i was typing when you did :) | 19:27 |
dansmith | sean-k-mooney: heh okay | 19:27 |
dansmith | mriedem: yeah, sounds like it | 19:27 |
mriedem | easy peasy | 19:27 |
dansmith | yup | 19:27 |
mriedem | melwitt: don't forget to defer a bunch of these https://blueprints.launchpad.net/nova/rocky | 19:28 |
melwitt | mriedem: right, thanks | 19:28 |
mriedem | i only see 3 in there that wouldn't be deferred | 19:28 |
mriedem | mox-removal, versioned notifications and stephen's numa vswitch bp | 19:28 |
melwitt | thanks | 19:28 |
sean-k-mooney | dansmith: um, could we use a decorator/context manager to also change the config for a specific call? | 19:29 |
dansmith | sean-k-mooney: not sure I parsed that, but I think we'd not want to override log levels in a context manager | 19:29 |
sean-k-mooney | basically im thinking about your previous suggestion of a decorator for the "this is sensitive, never log it" cases | 19:30 |
dansmith | sean-k-mooney: yep, something intentional for this might be good | 19:30 |
*** eharney has quit IRC | 19:31 | |
melwitt | setting the default log level for oslo.privsep is a good mitigation we can do immediately. then we can look at the idea of adding something to oslo.privsep to control this in a better, non-overrideable way (though I guess one could argue if the user really wants to override, they should be able to) | 19:31 |
sean-k-mooney | the default log level change is also good but that read tty call probably should never be logged | 19:31 |
sean-k-mooney | melwitt: if the user really want to log it that much they could add a print() | 19:32 |
sean-k-mooney | or remove the decorator | 19:32 |
sean-k-mooney | its likely that you would only want to do this if you're debugging | 19:32 |
*** shaohe_feng has quit IRC | 19:33 | |
melwitt | yeah, I just meant to point out it's a consideration. not arguing either way | 19:33 |
*** shaohe_feng has joined #openstack-nova | 19:34 | |
*** rtjure has joined #openstack-nova | 19:35 | |
sean-k-mooney | ya thats true | 19:36 |
mriedem | the default_log_levels thing is backportable, which i'm assuming this needs to be | 19:36 |
mriedem | we've had privsep in for awhile | 19:36 |
dansmith | I've been trying to git-review this mofo for a few minutes now | 19:37 |
sean-k-mooney | well the default_log_levels can be set in deployment tools so it can be done downstream also even if it was not upstream | 19:38 |
openstackgerrit | Dan Smith proposed openstack/nova master: Force oslo.privsep.daemon logging to INFO level https://review.openstack.org/586643 | 19:38 |
dansmith | thar ^ | 19:38 |
dansmith | we can check the logs after a run of that and make sure theres no privsep debug noise in there | 19:39 |
*** lbragstad_ has joined #openstack-nova | 19:40 | |
*** lbragstad has quit IRC | 19:41 | |
*** shaohe_feng has quit IRC | 19:43 | |
*** shaohe_feng has joined #openstack-nova | 19:45 | |
*** mchlumsky_ has quit IRC | 19:47 | |
* mriedem goes to get ma child | 19:48 | |
*** mchlumsky has joined #openstack-nova | 19:50 | |
sean-k-mooney | dansmith: the only down side to this change is i used to use some of those log messages to debug os-vif plugging stuff but in hindsight i should have probably questioned why they were there | 19:51 |
sean-k-mooney | dansmith: that said http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_34_229614 | 19:51 |
dansmith | sean-k-mooney: this is just the default, you can still override it in config to turn it on | 19:51 |
sean-k-mooney | this is being logged from the privsep daemon but reported as oslo_concurrency | 19:52 |
sean-k-mooney | dansmith: oh i know, what will we do in the gate? | 19:52 |
dansmith | well, we can override this for the gate, it just needs to not be on by default | 19:53 |
*** shaohe_feng has quit IRC | 19:53 | |
dansmith | sean-k-mooney: are you sure? that doesn't look like the privsep format | 19:54 |
dansmith | and processutils would log something like that | 19:54 |
dansmith | maybe it's inside the daemon, but running processutils, which is emitting the actual log? | 19:54 |
sean-k-mooney | dansmith: that code is executed via privsep but that message is not from that log | 19:54 |
sean-k-mooney | dansmith: yes | 19:54 |
dansmith | okay I'm confused about what you're saying | 19:55 |
sean-k-mooney | sorry one sec | 19:55 |
*** shaohe_feng has joined #openstack-nova | 19:56 | |
sean-k-mooney | its basically this https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L155 | 19:56 |
*** lbragstad_ is now known as lbragstad | 19:56 | |
*** mchlumsky has quit IRC | 19:56 | |
sean-k-mooney | which invokes processutils here https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L58 | 19:56 |
sean-k-mooney | the actual privsep request message is printed here http://logs.openstack.org/63/586363/3/check/legacy-tempest-dsvm-full-devstack-plugin-ceph/569c574/logs/screen-n-cpu.txt.gz#_Jul_27_17_31_34_229139 | 19:58 |
*** mchlumsky has joined #openstack-nova | 19:58 | |
sean-k-mooney | but any logging privileged functions do is also relayed to the parent over the socket. | 19:59 |
*** pchavva has quit IRC | 19:59 | |
*** ccamacho1 has joined #openstack-nova | 20:00 | |
*** mgoddard has joined #openstack-nova | 20:01 | |
*** ccamacho has quit IRC | 20:01 | |
sean-k-mooney | anyway lets see if that config option just affects the default log level of oslo.privsep's own internal logging or also the stuff called via a privsep context | 20:02 |
*** itlinux has joined #openstack-nova | 20:03 | |
dansmith | okay, I'm still not sure what your concern is | 20:03 |
dansmith | but it's probably my friday brain | 20:03 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: DNM: Extra logs for volume detach device tags cleanup https://review.openstack.org/584032 | 20:03 |
*** shaohe_feng has quit IRC | 20:04 | |
sean-k-mooney | well im hoping that oslo.privsep.daemon=INFO just disables the debug logging for privsep's own debug logs but not debug logs from things called via privsep | 20:04 |
dansmith | why? | 20:05 |
*** mgoddard has quit IRC | 20:05 | |
dansmith | it should affect anything that logs with oslo.privsep.daemon, not anything else | 20:05 |
*** mchlumsky has quit IRC | 20:05 | |
*** shaohe_feng has joined #openstack-nova | 20:05 | |
dansmith | if those concurrency logs are logged with a logger name of oslo.concurrency.processutils, then they should be unaffected | 20:06 |
dansmith | is that what you mean? | 20:06 |
sean-k-mooney | yes. | 20:06 |
dansmith | okay I think we'll be okay on that, assuming it works the way I think it does | 20:06 |
dansmith | I expect there is some code in privsep that does: | 20:06 |
dansmith | for message_logged_in_the_daemon: logger.getLogger(message.log_name).log.$level(message.msg) | 20:07 |
*** mchlumsky has joined #openstack-nova | 20:07 | |
dansmith | so my change should only affect actual messages logged on the daemon log name | 20:07 |
dansmith | not anything logged in the context of the daemon at all | 20:07 |
*** liuyulong__ has quit IRC | 20:08 | |
dansmith | hmm, that code was kindof nonsense, let me try again: | 20:08 |
sean-k-mooney | its this code that i was unsure about https://github.com/openstack/oslo.privsep/blob/master/oslo_privsep/daemon.py#L249-L254 | 20:08 |
itlinux | hello Nova guys, when spinning up a VM, and the hypervisor is asking to pull the image from glance does that go over the storage network? thanks | 20:08 |
*** liuyulong__ has joined #openstack-nova | 20:08 | |
dansmith | sean-k-mooney: that's the daemon-side code that intercepts the logs to redirect | 20:09 |
sean-k-mooney | yes | 20:09 |
dansmith | it's the non-daemon code that does the actual logging and would do what I surmised above | 20:09 |
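[dansmith's pseudocode above, tidied into a runnable sketch of the client-side relay as he surmises it works (assumed behavior, not the real oslo_privsep code): records shipped back from the daemon are re-emitted under their original logger name, which is exactly why a default_log_levels entry in the parent process can filter them.]

```python
import logging


def handle_remote_log(message):
    """Re-emit a log record relayed from the privsep daemon.

    `message` is assumed to be (logger_name, level, text) -- a stand-in
    for whatever the real wire format carries.
    """
    name, level, text = message
    # Emitted through the parent's logging config, so the parent's
    # per-logger levels (default_log_levels) apply to daemon-side logs.
    logging.getLogger(name).log(level, text)
```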
*** s10 has joined #openstack-nova | 20:09 | |
sean-k-mooney | well part of it | 20:09 |
sean-k-mooney | anyway we will see soon. | 20:09 |
*** errantekarmico has joined #openstack-nova | 20:14 | |
*** shaohe_feng has quit IRC | 20:14 | |
*** shaohe_feng has joined #openstack-nova | 20:15 | |
*** slaweq has joined #openstack-nova | 20:15 | |
dansmith | yup | 20:15 |
*** errantekarmico has left #openstack-nova | 20:16 | |
mnaser | there technically should never be rows with cell_id=NULL in instance_mappings.. right? | 20:19 |
dansmith | mnaser: mappings have no cell until they're scheduled | 20:20 |
mnaser | dansmith: right, but yknow, not an instance from march lets say | 20:21 |
mnaser | :p | 20:21 |
dansmith | they should always end up scheduled, to cell0 at least, but they can be there transiently and/or if something fails | 20:21 |
mnaser | alright so i think i'll have to write something to look in our cell vs cell0 and update mappings to make the db consistent | 20:22 |
*** dtruong_ has quit IRC | 20:23 | |
*** shaohe_feng has quit IRC | 20:24 | |
*** shaohe_feng has joined #openstack-nova | 20:25 | |
*** dtruong_ has joined #openstack-nova | 20:26 | |
*** med_ has quit IRC | 20:27 | |
*** savvas has quit IRC | 20:28 | |
*** savvas has joined #openstack-nova | 20:28 | |
*** savvas has quit IRC | 20:30 | |
*** savvas has joined #openstack-nova | 20:30 | |
*** artom has quit IRC | 20:32 | |
*** shaohe_feng has quit IRC | 20:34 | |
*** mchlumsky has quit IRC | 20:37 | |
*** tidwellr has quit IRC | 20:38 | |
*** slaweq has quit IRC | 20:40 | |
*** shaohe_feng has joined #openstack-nova | 20:40 | |
*** felipemonteiro_ has quit IRC | 20:40 | |
*** felipemonteiro_ has joined #openstack-nova | 20:40 | |
*** shaohe_feng has quit IRC | 20:45 | |
*** shaohe_feng has joined #openstack-nova | 20:45 | |
mriedem | mnaser: same issue from last week right? | 20:46 |
mriedem | could have been rpc outage so a failed db update | 20:46 |
mriedem | er db? | 20:46 |
mriedem | failed write i mean | 20:46 |
*** cdent has quit IRC | 20:46 | |
mnaser | mriedem: no it looks like over the lifetime of our cloud any rpc or db related things might have accumulated a lot of things in nova_api with cell_id = NONE | 20:46 |
mnaser | like, 20000 worth. | 20:46 |
mriedem | i had also identified one spot in conductor where the build request will be gone and we don't set the instance mapping to cell0 | 20:46 |
mnaser | however for 99.9999% of those, they were actually assigned a cell and not buried in cell0 | 20:47 |
mnaser | dansmith, mriedem: http://paste.openstack.org/show/726767/ might be a useful little tool if someone ends up in the same situation | 20:47 |
mnaser | connect to api db, get all cells, go over them all and check where it can find the instance, and then print out an update statement for manual fix | 20:48 |
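[A hypothetical reconstruction of what the pasted repair tool does -- the data shapes and helper names here are invented for illustration; the real script talks to the API and cell databases directly.]

```python
def find_cell_for_instance(cells, instance_uuid):
    """Return the first cell whose instances table contains the uuid."""
    for cell in cells:
        if instance_uuid in cell['instance_uuids']:
            return cell  # an instance should only ever be in one cell
    return None  # never scheduled, or build request still pending


def emit_fixups(unmapped_uuids, cells):
    """Print UPDATE statements for manual review, one per fixable row."""
    for uuid in unmapped_uuids:
        cell = find_cell_for_instance(cells, uuid)
        if cell is None:
            continue  # leave the mapping alone; nothing to point it at
        print("UPDATE instance_mappings SET cell_id = %d "
              "WHERE instance_uuid = '%s';" % (cell['id'], uuid))
```

[Emitting SQL for manual application rather than writing directly keeps the operator in the loop, which matches how the paste is described above.]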
mriedem | we could nova-manage cell_v2 that baby | 20:48 |
mnaser | i can push up an initial patch but i dunno how much i can iterate/test/etc because i've been a bit overwhelmed | 20:49 |
mnaser | and it would have to be updated to use nova objects too i guess | 20:49 |
mriedem | np, or just report a bug and put this paste in it as a template | 20:49 |
mriedem | latter is fine ^ | 20:49 |
mnaser | good idea | 20:49 |
mriedem | is this finding instances in non-cell0 cells? | 20:50 |
mriedem | that aren't in error state? | 20:50 |
mnaser | mriedem: im not sure about the exact logic, but i grab a list of all cells, connect to them, and loop until i find an entry inside 'instances' table with the same id | 20:50 |
mnaser | if that is logically wrong, i can fix it | 20:51 |
mriedem | it makes sense | 20:51 |
mriedem | if the instance mapping doesn't tell what cell it's in, we have to iterate the cells looking for it | 20:51 |
mnaser | and there is no change it ever being in two cells | 20:52 |
mnaser | s/change/chance/ | 20:52 |
mriedem | is that a question? | 20:52 |
mnaser | yes | 20:52 |
mriedem | shouldn't be no | 20:52 |
mnaser | okay sounds good, because i break off once i find it and stop looping | 20:52 |
mriedem | but this shouldn't be happening in the first place | 20:52 |
mnaser | yeah :\ but i dunno how much to blame nova when it might have been an infra problem | 20:52 |
mriedem | i mean in a normal case we create the instance in the cell here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1257 | 20:53 |
mriedem | if the user goes over quota we should put the instance into error state and mark the instance mapping https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1370 | 20:53 |
*** david-lyle has joined #openstack-nova | 20:54 | |
mriedem | in a normal case, we update the instance mapping here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1322 | 20:54 |
mnaser | in any case -- https://bugs.launchpad.net/nova/+bug/1784074 | 20:54 |
openstack | Launchpad bug 1784074 in OpenStack Compute (nova) "Instances end up with no cell assigned in instance_mappings" [Undecided,New] | 20:54 |
mriedem | before deleting the build request and casting to compute | 20:54 |
mnaser | hmm | 20:54 |
mnaser | i wonder if i wanna update that script | 20:54 |
mnaser | to check if a build_request exists | 20:55 |
mriedem | if anything fails in between there we could fail to update the mapping | 20:55 |
mriedem | mnaser: maybe - if the build request exists, the instance shouldn't be in a cell | 20:55 |
*** manjeets_ has joined #openstack-nova | 20:55 | |
mriedem | so L42 in your script is where i'd look for a build request | 20:55 |
mriedem | as a sanity check | 20:55 |
*** shaohe_feng_ has joined #openstack-nova | 20:56 | |
mnaser | mriedem: yeah i was planning to just run the mysql till a certain point and assume the rest was just unscheduled stuff but it could be confusing to hand off to others | 20:56 |
*** dklyle_ has quit IRC | 20:57 | |
mnaser | i'm leaning toward checking if a build request exists at L27 so a) i dont hit the cells and b) if a build request exists, technically there shouldn't be an issue because api calls will interact with that build request | 20:57 |
*** manjeets has quit IRC | 20:57 | |
*** anupn_ has quit IRC | 20:57 | |
mnaser | i think the problem is there when a build request AND cell mapping is missing | 20:57 |
mnaser | but i believe if build request is there but cell mapping is missing, it'll work just fine and not do any weird 404s on instances | 20:57 |
mriedem | correct | 20:57 |
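[The state table mnaser and mriedem just agreed on, as a tiny sketch (names invented for illustration): a mapping with no cell is only pathological once its build request is gone too.]

```python
def mapping_is_broken(has_build_request, mapping_has_cell):
    """Is an instance_mapping with these properties pathological?

    While the build request exists, the API can serve the instance from
    it, so a NULL cell_id is a normal transient state. Once the build
    request is deleted, a NULL cell_id leaves the instance listable (via
    a cell scan) but impossible to show or delete.
    """
    return not has_build_request and not mapping_has_cell
```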
mriedem | this was the case i was worried about last week https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243 | 20:58 |
*** karimull has quit IRC | 20:58 | |
mriedem | in that case, the api has deleted the build request, and we haven't updated the instance mapping | 20:58 |
mriedem | but, we wouldn't put the instance in cell0 b/c the user deleted the instance before we created it (via build request) | 20:58 |
mriedem | mnaser: might be nice info to know if these unmapped instances are deleted | 20:59 |
melwitt | one thing that's interesting that I learned recently is that if, for some reason, there is a case where a build request exists but *no* instance mapping exists, the API does not handle it in that, the "instance" will show up in a 'nova list' but it can't be deleted because delete will raise NotFound | 20:59 |
*** shaohe_feng has quit IRC | 20:59 | |
*** shaohe_feng_ is now known as shaohe_feng | 20:59 | |
mriedem | i don't know how that could happen | 20:59 |
mriedem | we create the build request and the instance mapping in _provision_instances | 20:59 |
mriedem | *and request spec | 21:00 |
melwitt | and via code inspection, I don't know how that state could be gotten into other than nova-api restarting at precisely the moment after the build request is created but before the instance mapping was | 21:00 |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L930 and then https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L942 | 21:00 |
*** anupn has joined #openstack-nova | 21:00 | |
mnaser | melwitt: yeah that's essentially the state that these vms are in | 21:00 |
mriedem | or the db failing the instance mapping insert | 21:00 |
*** karimull has joined #openstack-nova | 21:01 | |
melwitt | mnaser: I thought you had instance mappings though, right? | 21:01 |
melwitt | yeah, or that | 21:01 |
mnaser | melwitt: instance_mapping is there sure, but cell_id=NONE | 21:01 |
mnaser | so some of those are list-able, but not delete-able | 21:01 |
melwitt | yeah, that's different than what I said. your case will let a delete work | 21:01 |
mriedem | mnaser: are you listing as admin? | 21:01 |
melwitt | oh really? | 21:01 |
mriedem | to list out deleted instances? | 21:01 |
mnaser | nope, i had a user complain they could list an instance but could not delete it | 21:02 |
mriedem | i have to think you're hitting this https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243 | 21:02 |
mnaser | hell i cant even delete it | 21:02 |
melwitt | hm, okay, that is a new case I didn't know | 21:02 |
mnaser | let me dig th eticket | 21:02 |
*** brault_ has quit IRC | 21:02 | |
melwitt | I guess what it must do is, get the instance mapping, see cell_id=None and then think "I can't lookup the instance, therefore I can't delete it" | 21:03 |
mriedem | well, | 21:03 |
* melwitt looks at the code | 21:03 | |
mnaser | ok so confirmed here | 21:03 |
mriedem | it will fallback to trying to lookup the instance from the locally configured (in the api) [database]/connection | 21:03 |
mnaser | nova list --all-tenants | grep 1812c2eb-cfbc-4659-9817-4694ad3d2c37 < returns the instance with ERROR/NOSTATE | 21:03 |
mnaser | nova show 1812c2eb-cfbc-4659-9817-4694ad3d2c37 => ERROR (CommandError): No server with a name or ID of '1812c2eb-cfbc-4659-9817-4694ad3d2c37' exists. | 21:03 |
mriedem | mnaser: is that instance deleted? | 21:03 |
mriedem | instances.deleted != 0 | 21:04 |
mnaser | let me double check | 21:04 |
mnaser | fwiw though cell_id=NULL | 21:04 |
mnaser | checking instances | 21:04 |
*** edmondsw has quit IRC | 21:04 | |
mnaser | deleted=0 but this one is in cell0 | 21:05 |
mriedem | melwitt: this is what i'm thinking of https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L1768 | 21:05 |
*** edmondsw has joined #openstack-nova | 21:05 | |
mriedem | mnaser: hmm, ok so the instance was created in cell0 but the instance mapping update failed | 21:05 |
mnaser | in this case yes | 21:05 |
*** yamahata has quit IRC | 21:05 | |
melwitt | that's not what runs for a delete though | 21:05 |
*** shaohe_feng has quit IRC | 21:05 | |
mriedem | melwitt: it has to lookup the instance right? | 21:05 |
mriedem | _lookup_instance is called via API.get() | 21:05 |
mnaser | yeah i cant even look it up, it just 404s | 21:05 |
melwitt | yeah but it goes here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L2333 | 21:05 |
*** r-daneel has joined #openstack-nova | 21:05 | |
*** shaohe_feng has joined #openstack-nova | 21:06 | |
mnaser | let me check | 21:06 |
mnaser | it probably doesnt have a build request | 21:06 |
mnaser | no build request indeed | 21:07 |
mnaser | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L2353 | 21:07 |
mnaser | so ending up here afaik | 21:07 |
mriedem | how are we listing it then... | 21:08 |
mnaser | maybe list just hits the cells and ignores api stuff? | 21:09 |
mnaser | i can help if i knew where the list code is :p | 21:09 |
mriedem | https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/instance_list.py#L98 | 21:09 |
*** edmondsw has quit IRC | 21:10 | |
melwitt | _lookup_instance is called via API().delete, _get_instance is called via API().get | 21:10 |
melwitt | and the API (nova/api/openstack/compute/servers.py) does a API().get first before doing anything with an instance | 21:10 |
mriedem | mnaser: you're right, we'll just iterate the cells | 21:10 |
mnaser | i guess in an ideal world you retrieve list of vms from nova_api, and then generate a subsequent list to each cell with a list of instance uuids to request | 21:11 |
mnaser | which might even eliminate extra calls if a user is located in one cell | 21:12 |
melwitt | so in the case of a build request with an instance mapping with cell_mapping = None, it will return build_request.instance, and I'm not sure what will happen if you try to delete that | 21:12 |
mriedem | mnaser: that's what this is for https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/instance_list.py#L101 | 21:12 |
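The per-cell iteration being linked can be pictured as a scatter-gather: query every cell database and merge the results, stopping once a requested limit is reached. A minimal toy sketch, with dicts standing in for cell databases — all names here are illustrative, not nova's real `CellMapping`/`instance_list` objects:

```python
# Toy model of cross-cell instance listing: each cell database is a dict of
# instances keyed by uuid. Illustrative stand-ins for nova's scatter-gather
# machinery in nova/compute/instance_list.py.

CELLS = {
    "cell0": {"uuid-a": {"uuid": "uuid-a", "host": None}},
    "cell1": {"uuid-b": {"uuid": "uuid-b", "host": "compute1"}},
}

def list_instances(limit=None):
    """Scatter the query across every cell and gather the merged results."""
    results = []
    for cell_name, cell_db in CELLS.items():
        for inst in cell_db.values():
            results.append(inst)
            if limit is not None and len(results) >= limit:
                return results
    return results
```

Note this path never consults the instance mappings, which is why an unmapped instance can still show up in a listing.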
melwitt | presumably it fails | 21:12 |
mriedem | and that's what cern uses | 21:12 |
mnaser | wouldn't it be safer to only delete the build request once the cell has been set? | 21:13 |
melwitt | so that means build_request.instance gets passed to compute API().delete | 21:13 |
mriedem | melwitt: in that case we should go through here https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L1877 | 21:14 |
mriedem | mnaser: the idea is if the user deletes the build request before the instance has been scheduled to a cell, we never create the instance in the cell, | 21:15 |
mriedem | so there is nothing to do with the instance mapping b/c it's not in a cell | 21:15 |
*** shaohe_feng has quit IRC | 21:15 | |
mriedem | and shouldn't get listed either b/c it's (1) not a build request and (2) not in a cell | 21:16 |
*** r-daneel has quit IRC | 21:16 | |
mnaser | yeah so maybe the issue here is really inside list? | 21:16 |
melwitt | right, so the delete of the build request would succeed, but then the lookup of the instance will fail because it was just a build_request.instance shell | 21:16 |
mriedem | which if that is really working, we get here in conductor after the build request was deleted in api https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243 | 21:16 |
melwitt | or well, maybe not. _lookup_instance would return None, None in the cell_mapping = None case | 21:17 |
mriedem | i wonder why we don't update the instance mapping right after this https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1257 | 21:17 |
*** shaohe_feng has joined #openstack-nova | 21:18 | |
*** yamahata has joined #openstack-nova | 21:18 | |
mriedem | melwitt: right, if _delete_while_booting returns True, we exit https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/api.py#L1877 | 21:18 |
melwitt | hm, so I'm not seeing how delete would fail in that case | 21:19 |
*** manjeets_ is now known as manjeets | 21:20 | |
melwitt | mnaser: is there any chance the service version in one of the records in the 'services' table is < 15? | 21:22 |
mriedem | heh, i asked that last week too :) | 21:22 |
mnaser | melwitt: i checked that with mriedem last time we tried to look into this and no, none | 21:22 |
melwitt | I guess that wouldn't make sense. all of your instance GET would fail in that case | 21:22 |
mriedem | btw, i think we should probably remove that service version check now | 21:22 |
mriedem | commented on the bug https://bugs.launchpad.net/nova/+bug/1784074/comments/1 | 21:23 |
openstack | Launchpad bug 1784074 in OpenStack Compute (nova) "Instances end up with no cell assigned in instance_mappings" [Undecided,New] | 21:23 |
mriedem | with what *might* be happening | 21:23 |
mriedem | but you'd have errors in the logs | 21:23 |
melwitt | this doesn't make any sense how delete returns 404 | 21:23 |
mriedem | melwitt: read ^ that comment in the bug because i think that could explain a window where it could happen | 21:23 |
*** liuyulong_ has joined #openstack-nova | 21:24 | |
mriedem | mnaser: i wonder if these are instances getting created as part of a multi-create request where they all get created in a cell, then when we go to update mappings, something fails and then the rest are left unmapped | 21:24 |
*** liuyulong__ has quit IRC | 21:24 | |
mriedem | the user attempts to delete the instance, they delete the build request, but then they can still list it, | 21:24 |
mriedem | but can't delete it b/c the build request is gone and the instance mapping isn't pointing at a cell | 21:24 |
mriedem | hence your fix up script | 21:24 |
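The stuck state mriedem describes can be reproduced with a toy model of the delete path: the instance row exists in a cell, but the build request is gone and the mapping never got a cell, so delete has nowhere to look and 404s even though the listing (which walks the cells directly) still shows the instance. All names below are illustrative, not nova's actual API:

```python
# Toy reproduction of "listable but not deletable". Dicts stand in for the
# API-level and cell databases; NotFound stands in for the HTTP 404.

class NotFound(Exception):
    pass

def delete_instance(uuid, build_requests, instance_mappings, cells):
    if uuid in build_requests:              # roughly _delete_while_booting
        del build_requests[uuid]
        return "build request deleted"
    mapping = instance_mappings.get(uuid)
    if mapping is None or mapping["cell"] is None:
        raise NotFound(uuid)                # nowhere to look up the instance
    del cells[mapping["cell"]][uuid]
    return "instance deleted"
```

With a healthy mapping the same call finds and deletes the cell record, which is essentially what the fix-up script restores.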
melwitt | ohhh | 21:25 |
mriedem | this goes back to something we've talked about before where the schedule_and_build_instances method was split into a few phases where it was originally one | 21:25 |
*** shaohe_feng has quit IRC | 21:26 | |
mriedem | so now we (1) get hosts from scheduler (2) create instances in cells (3) recheck quota (4) do some other stuff including updating instance mappings and casting to compute to build | 21:26 |
*** awaugama has quit IRC | 21:26 | |
mriedem | if anything fails in the loop in #4 we'd have this situation | 21:26 |
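The window in those phases can be sketched with a toy version of the conductor flow: instances are created in their cells in one loop, and the mappings only updated in a later loop, so any failure in between strands created-but-unmapped instances. Names and the failure injection are illustrative:

```python
# Sketch of the ordering problem discussed above (illustrative, not nova's
# schedule_and_build_instances): create-in-cell and mapping-update happen in
# separate loops, leaving a failure window between them.

def schedule_and_build(instances, hosts, cells, instance_mappings,
                       fail_before_mapping=False):
    # phase 2: create each instance in its target cell
    for inst, host in zip(instances, hosts):
        cells[host["cell"]][inst["uuid"]] = inst
    # phases 3/4: quota recheck, notifications, mapping updates, casts --
    # a failure anywhere in here reproduces the unmapped-instance bug
    if fail_before_mapping:
        raise RuntimeError("simulated rabbit/db failure")
    for inst, host in zip(instances, hosts):
        instance_mappings[inst["uuid"]] = {"cell": host["cell"]}
```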
mnaser | these could be a multi create | 21:26 |
mnaser | let me double check | 21:26 |
*** shaohe_feng has joined #openstack-nova | 21:26 | |
mriedem | mnaser: you'd have to find the request spec and look that up | 21:27 |
melwitt | yeah, gosh | 21:27 |
mnaser | i know of a customer that uses this feature all the time | 21:27 |
mnaser | so it could just be them | 21:27 |
mriedem | there should be a num_instances field in the request spec for any of those instances | 21:27 |
mnaser | nope, at least one i randomly picked out is not a multi create | 21:27 |
mriedem | ok, well, | 21:27 |
mriedem | i think the theory still applies | 21:27 |
mriedem | if we fail *before* setting the instance mapping but after we've created the instance in the cell, we're toast | 21:28 |
mriedem | did we ever figure out if rabbit being down for notifications could screw us up too? because we send notifications before we update the instance mapping... | 21:29 |
melwitt | I don't know | 21:30 |
mriedem | i'll throw something up quick before i have to head out | 21:31 |
mnaser | so my audit script helped bring them from 20k down to 308 left which have no build_requests, no cell_id in the mapping | 21:32 |
mnaser | and not existing in any cells | 21:32 |
mriedem | mnaser: ok those are likely just instance mappings for deleted and purged instances | 21:32 |
mriedem | do you archive/purge the cell dbs often/ | 21:32 |
mriedem | ? | 21:32 |
mriedem | b/c it wasn't until i think rocky that we added instance mapping and reqspec hard delete to nova-manage db archive_deleted_rows when instances are archived | 21:33 |
mriedem | or maybe you run your own archive/purge script? | 21:33 |
mnaser | select created_at from instances order by id asc limit 1; => 2014-12-14 02:38:53 | 21:33 |
mnaser | ...ha. | 21:33 |
mnaser | but i think i'm mostly waiting for the rocky archive delete stuff | 21:34 |
*** shaohe_feng has quit IRC | 21:36 | |
*** shaohe_feng has joined #openstack-nova | 21:37 | |
mriedem | do you run your own archive script or nova-manage db archive_deleted_rows? | 21:41 |
mnaser | mriedem: none of the above, we just have a really really really big database | 21:42 |
mnaser | mysql indexing seems fast enough that it hasn't really affected us much other than just.. being a big db. | 21:42 |
sean-k-mooney | mriedem: fyi i left a comment on the review but is the call to self.driver.cleanup in https://review.openstack.org/#/c/586568/1 against the source or dest node? | 21:42 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Update instance mapping as soon as instance is created in cell https://review.openstack.org/586713 | 21:44 |
mriedem | mnaser: melwitt: throwing things at the wall ^ | 21:44 |
mriedem | sean-k-mooney: source | 21:44 |
mriedem | _post_live_migration and _rollback_live_migration run on the source host | 21:44 |
*** liuyulong__ has joined #openstack-nova | 21:45 | |
mriedem | sean-k-mooney: replie | 21:45 |
mriedem | *replied | 21:45 |
sean-k-mooney | mriedem: oh ok then yes it probably should have the source vif then however i dont think it actually will need them unless we replug the vifs | 21:45 |
mriedem | that's not what you said last night | 21:46 |
*** rtjure has quit IRC | 21:46 | |
mriedem | something something ovs hybrid plug cleanup | 21:46 |
mriedem | but it was 4am and you were maybe loopy | 21:46 |
sean-k-mooney | mriedem: for the cleanup | 21:46 |
*** shaohe_feng has quit IRC | 21:46 | |
sean-k-mooney | mriedem: self.driver.post_live_migration_at_source should use the old source vifs so it can unplug correctly | 21:47 |
*** liuyulong_ has quit IRC | 21:47 | |
*** shaohe_feng has joined #openstack-nova | 21:47 | |
mriedem | sean-k-mooney: yes, same thing | 21:48 |
sean-k-mooney | i dont know what self.driver.cleanup does. if its on the source however it should also probably be using the source vifs | 21:48 |
mriedem | sean-k-mooney: in the commit message, i pointed out that if post_live_migration_at_source is successful, destroy_vifs=False and the libvirt driver won't try to unplug in cleanup() | 21:48 |
mriedem | however, not all virt drivers adhere to that destroy_vifs flag | 21:49 |
mriedem | the hyperv driver doesn't for example | 21:49 |
sean-k-mooney | ah ok then yes that all looks good then | 21:49 |
mriedem | it looks...beautiful | 21:49 |
sean-k-mooney | normally i like shorter function names but the at_source and at_destination really help keep context in this code | 21:50 |
mriedem | that's why i did https://review.openstack.org/#/c/551371/ | 21:51 |
mriedem | because knowing wtf is going on in the 20 methods involved in live migration is not something you can keep in your head | 21:51 |
mriedem | also https://docs.openstack.org/nova/latest/reference/live-migration.html | 21:52 |
melwitt | yes. every time I figure out code like that, a few months later I end up wishing I had added a lot of code comments to it, if nothing else | 21:52 |
mriedem | yup also https://review.openstack.org/#/c/496861/ | 21:53 |
melwitt | two thumbs up | 21:53 |
mriedem | thanks ebert | 21:54 |
melwitt | looking at your change, trying to remember why the instance.create() was split up from the inst mapping update in the first place | 21:54 |
mriedem | RIP | 21:54 |
mriedem | melwitt: the quota stuff | 21:54 |
mriedem | i can find a review comment where we talked about the split | 21:54 |
melwitt | yeah, trying to re-remember | 21:54 |
sean-k-mooney | ya i have that bookmarked, i just didn't see we were still in _post_live_migration. that function does a lot | 21:54 |
mriedem | too much | 21:54 |
melwitt | I think it was something about, if we failed a quota recheck in the middle of a multi create, and to nix all the instances before creating any mappings | 21:55 |
melwitt | but we ended up not doing that and putting them in ERROR state | 21:55 |
sean-k-mooney | part of the issue is it's implementing a state machine and all of that context is mixed in with what it's doing | 21:55 |
melwitt | so that ended up being the wrong thing to do, I think | 21:55 |
mriedem | melwitt: https://review.openstack.org/#/c/501408/2/nova/conductor/manager.py@1020 | 21:56 |
*** antosh has quit IRC | 21:56 | |
mriedem | too bad i didn't link that irc convo in | 21:56 |
*** shaohe_feng has quit IRC | 21:56 | |
melwitt | yeah, this is coming back to me. there were other things like, at the time I was thinking don't create the BDMs etc until after we know we're good after the quota recheck | 21:58 |
*** shaohe_feng has joined #openstack-nova | 21:58 | |
melwitt | but we discussed on IRC and determined that all had a failure path to clean up anything that was created, and so should have been okay to just do everything normally and check quota at the end | 21:58 |
melwitt | in one loop instead of two | 21:59 |
mriedem | http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2017-09-06.log.html#t2017-09-06T20:33:51 | 21:59 |
mriedem | it was also a refactor we didn't want to backport | 21:59 |
melwitt | right yeah | 21:59 |
mriedem | i had a todo to combine back to a single loop on my desk for a long time, b/c i had in mind how to do it, | 22:00 |
mriedem | but long forgot now | 22:00 |
sean-k-mooney | mriedem: haha i was just looking at the irc logs to see if i could find it for you. | 22:00 |
melwitt | I added it to my todo list too so hopefully one of us will do it this time. I had forgotten about it | 22:00 |
*** savvas has quit IRC | 22:01 | |
*** med_ has joined #openstack-nova | 22:02 | |
*** med_ has quit IRC | 22:02 | |
*** med_ has joined #openstack-nova | 22:02 | |
mriedem | "dansmith: mriedem: we wouldn't know where to find the instance record to mark it as deleted when they deleted the buildreq, so we'd leave that undeleted but unfindable instance forever" | 22:02 |
mriedem | heh | 22:02 |
mriedem | sound familiar? | 22:03 |
mriedem | "mriedem: i shit my pants everytime we touch nova these days" | 22:04 |
mriedem | ha | 22:04 |
melwitt | haha, relatable | 22:04 |
mriedem | mnaser: again, congratulations to you to continue running a business on top of stuff we're still talking about fixing almost 1 year later :) | 22:04 |
*** itlinux has quit IRC | 22:05 | |
openstackgerrit | karim proposed openstack/nova master: Updated AggregateImagePropertiesIsolation filter illustration https://review.openstack.org/586317 | 22:05 |
*** felipemonteiro_ has quit IRC | 22:06 | |
mriedem | i think the tl;dr from the irc convo is just combine the loops and move the quota check to the end | 22:06 |
mriedem | "locally" deleting the instance will automatically delete the tags and bdms along with the instance from the cell | 22:06 |
melwitt | I'm trying to think, why didn't we move the instance mapping update earlier last time? | 22:06 |
melwitt | yeah, that's what I'm getting from it too, merge the loops and check quota at the end | 22:07 |
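One hedged sketch of the fix being converged on here: do the create and the mapping update together in a single loop, then recheck quota once at the end. How a quota failure should be handled (local delete vs. ERROR state) was itself debated earlier in the conversation, so this version just raises. Names are illustrative:

```python
# Illustrative single-loop variant: the mapping is updated immediately after
# the create, so delete and list always agree, and the quota recheck moves
# to the end. Not nova's actual conductor code.

def build_single_loop(instances, hosts, cells, instance_mappings, quota_ok):
    for inst, host in zip(instances, hosts):
        cells[host["cell"]][inst["uuid"]] = inst
        # mapping set right away: no created-but-unmapped window
        instance_mappings[inst["uuid"]] = {"cell": host["cell"]}
    if not quota_ok():
        raise RuntimeError("quota recheck failed after create")
```

Even if the quota recheck raises, every created instance is already mapped, so it remains findable and deletable.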
*** shaohe_feng has quit IRC | 22:07 | |
*** jmlowe has joined #openstack-nova | 22:07 | |
*** shaohe_feng has joined #openstack-nova | 22:07 | |
mriedem | idk, my guess is tunnel vision on the fix at hand | 22:07 |
melwitt | wait, that change (last year) *did* move the inst mapping update earlier to right after the instance.create(). looking to see what happened to that | 22:10 |
*** itlinux has joined #openstack-nova | 22:10 | |
*** savvas has joined #openstack-nova | 22:11 | |
*** rtjure has joined #openstack-nova | 22:13 | |
mriedem | but only if the quota check failed | 22:14 |
mriedem | b/c we exit after that | 22:14 |
mriedem | we don't bury in cell0 if quota check fails because the instances are already created in cells at that point | 22:15 |
*** figleaf is now known as edleafe | 22:15 | |
melwitt | I mean this, this is showing an update of the instance mapping right after we create the instance record https://review.openstack.org/#/c/501408/2/nova/conductor/manager.py@1003 | 22:16 |
*** savvas has quit IRC | 22:16 | |
mriedem | oh right yewah | 22:17 |
mriedem | *yeah | 22:17 |
melwitt | but in the current version of the code, the instance mapping update isn't right after the instance create anymore | 22:17 |
melwitt | and I can't find how that changed, looking at git blame and failing | 22:17 |
*** shaohe_feng has quit IRC | 22:17 | |
mriedem | _populate_instance_mapping was only ever used in the cellsv1 path | 22:17 |
mriedem | the build_instances method | 22:17 |
mriedem | i'm pretty sure | 22:17 |
melwitt | but in that old patch, it's in schedule_and_build_instances | 22:17 |
*** shaohe_feng has joined #openstack-nova | 22:18 | |
mriedem | because mnaser was re-using it | 22:19 |
mriedem | you mean why did we talk him out of that? | 22:19 |
melwitt | no I mean, as of that patch, the instance mapping update was right after instance create, but the current code has the mapping update much later, and I was wondering why that was moved. I assume it was to fix some other bug or something | 22:20 |
*** savvas has joined #openstack-nova | 22:20 | |
mriedem | looks like it was changed as a result of the irc convo | 22:21 |
melwitt | oh gaaaahhh, I didn't realize I was looking at an earlier PS | 22:21 |
melwitt | okay so the final version only added a mapping update to the cleanup method, like you said earlier I think. so the normal path for updating the mapping was always later on | 22:24 |
*** sambetts_ has quit IRC | 22:24 | |
*** savvas has quit IRC | 22:25 | |
melwitt | ok | 22:25 |
*** sambetts_ has joined #openstack-nova | 22:26 | |
mriedem | yup. alright gotta run. o/ | 22:27 |
melwitt | o/ | 22:27 |
*** shaohe_feng has quit IRC | 22:27 | |
*** shaohe_feng has joined #openstack-nova | 22:28 | |
*** shaohe_feng has quit IRC | 22:37 | |
*** savvas has joined #openstack-nova | 22:38 | |
*** shaohe_feng has joined #openstack-nova | 22:39 | |
*** avolkov has quit IRC | 22:40 | |
*** hongbin_ has quit IRC | 22:42 | |
*** shaohe_feng has quit IRC | 22:48 | |
*** shaohe_feng has joined #openstack-nova | 22:48 | |
*** mhg has quit IRC | 22:53 | |
*** shaohe_feng has quit IRC | 22:58 | |
*** shaohe_feng has joined #openstack-nova | 23:01 | |
*** mschuppert has quit IRC | 23:03 | |
*** gilfoyle_ has quit IRC | 23:08 | |
*** shaohe_feng has quit IRC | 23:08 | |
*** harlowja has quit IRC | 23:09 | |
*** shaohe_feng has joined #openstack-nova | 23:10 | |
*** shaohe_feng has quit IRC | 23:18 | |
*** shaohe_feng has joined #openstack-nova | 23:20 | |
openstackgerrit | Merged openstack/nova master: Use source vifs when unplugging on source during post live migrate https://review.openstack.org/586402 | 23:27 |
openstackgerrit | Merged openstack/nova master: Pass source vifs to driver.cleanup in _post_live_migration https://review.openstack.org/586568 | 23:27 |
openstackgerrit | Merged openstack/nova master: Update queued-for-delete from the ComputeAPI during deletion/restoration https://review.openstack.org/566813 | 23:27 |
*** shaohe_feng has quit IRC | 23:29 | |
melwitt | finally \o/ | 23:30 |
*** shaohe_feng has joined #openstack-nova | 23:32 | |
*** itlinux has quit IRC | 23:37 | |
*** shaohe_feng has quit IRC | 23:39 | |
*** shaohe_feng has joined #openstack-nova | 23:40 | |
*** gongysh has joined #openstack-nova | 23:43 | |
mnaser | well i know its late | 23:49 |
mnaser | but now there's even another whole interesting failure | 23:49 |
mnaser | no record in nova_api but one in the cell | 23:49 |
mnaser | lol | 23:49 |
*** shaohe_feng has quit IRC | 23:49 | |
*** shaohe_feng has joined #openstack-nova | 23:50 | |
*** itlinux has joined #openstack-nova | 23:50 | |
*** itlinux has quit IRC | 23:50 | |
*** itlinux has joined #openstack-nova | 23:51 | |
*** itlinux has quit IRC | 23:51 | |
*** wolverineav has quit IRC | 23:53 | |
*** wolverineav has joined #openstack-nova | 23:54 | |
melwitt | mnaser: no build request or instance mapping? | 23:54 |
mnaser | melwitt: build request, no instance mapping | 23:55 |
mnaser | wait sorry | 23:55 |
mnaser | it doesnt exist in the cell, sorry | 23:55 |
melwitt | build request, instance in cell, no instance mapping | 23:56 |
melwitt | build request only? | 23:56 |
mnaser | yes | 23:56 |
mnaser | build request only | 23:56 |
melwitt | that's the exact same thing rdo cloud ran into | 23:56 |
mnaser | so shows up in list but not deletable etc | 23:56 |
melwitt | right | 23:56 |
mnaser | i guess i can just delete the build request and have it disappear? | 23:56 |
melwitt | do you have several or just a few? like does it happen a lot? | 23:56 |
melwitt | yes, that's what I told rdo cloud to do too | 23:57 |
mnaser | i mean after running my fixup script, i still had a few instances that were stuck BUILD/scheduling | 23:57 |
melwitt | I dug around in the code and didn't see a way it can happen other than nova-api going down at the precise moment between the build_request.create() and the instance_mapping.create() or the instance_mapping.create() somehow failing | 23:57 |
mnaser | so for context it is possible that rpc and/or db both had issues at the time | 23:57 |
mnaser | does the build request and instance_mapping get created at the same time or? | 23:58 |
melwitt | which seems it would be crazy rare ... so maybe we're missing some other way it could happen | 23:58 |
melwitt | pretty much yeah. let me grab a link | 23:58 |
melwitt | https://github.com/openstack/nova/blob/3e0b17b1e138615b66293976ca5b55c291957844/nova/compute/api.py#L930-L942 | 23:58 |
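The two creates in the linked snippet are not atomic, which is the crash window melwitt describes: if nova-api dies between `build_request.create()` and `instance_mapping.create()`, an orphan build request is left with no mapping — listable but not deletable or mappable. A toy reproduction (illustrative names):

```python
# Toy model of the non-atomic pair of creates at boot time. A crash between
# them leaves a build request with no instance mapping.

def start_boot(uuid, build_requests, instance_mappings, crash_between=False):
    build_requests[uuid] = {"uuid": uuid}       # build_request.create()
    if crash_between:
        raise RuntimeError("nova-api died between the creates")
    instance_mappings[uuid] = {"cell": None}    # instance_mapping.create()
```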
* mnaser is learning so much lol | 23:59 | |
melwitt | yeah, soon you can come fix all these bugs | 23:59 |
mnaser | ok that's interesting | 23:59 |
mnaser | haha | 23:59 |
*** wolverineav has quit IRC | 23:59 | |
*** shaohe_feng has quit IRC | 23:59 | |
mnaser | so build request was created, instance mapping was *not* created. unless there was an attempt to delete the instance while it was still in build request | 23:59 |