*** brinzhang has joined #openstack-nova | 00:10 | |
*** ociuhandu has joined #openstack-nova | 00:10 | |
*** brinzhang_ has quit IRC | 00:12 | |
*** ociuhandu has quit IRC | 00:15 | |
*** tetsuro has joined #openstack-nova | 00:15 | |
*** spatel has joined #openstack-nova | 00:19 | |
*** mvkr has joined #openstack-nova | 00:22 | |
*** gyee has quit IRC | 00:26 | |
*** ivve has quit IRC | 00:48 | |
*** ab-a has quit IRC | 00:58 | |
*** jkulik has quit IRC | 00:58 | |
*** ab-a has joined #openstack-nova | 00:59 | |
*** jkulik has joined #openstack-nova | 00:59 | |
*** Liang__ has joined #openstack-nova | 01:00 | |
*** dave-mccowan has quit IRC | 01:04 | |
*** dave-mccowan has joined #openstack-nova | 01:09 | |
*** ociuhandu has joined #openstack-nova | 01:12 | |
*** ociuhandu has quit IRC | 01:17 | |
*** brinzhang has quit IRC | 01:19 | |
*** brinzhang has joined #openstack-nova | 01:20 | |
*** brinzhang_ has joined #openstack-nova | 01:22 | |
*** brinzhang has quit IRC | 01:24 | |
*** brinzhang has joined #openstack-nova | 01:25 | |
*** jdillaman has quit IRC | 01:26 | |
*** brinzhang_ has quit IRC | 01:26 | |
*** jdillaman has joined #openstack-nova | 01:27 | |
*** Xuchu has joined #openstack-nova | 01:33 | |
*** brinzhang_ has joined #openstack-nova | 01:44 | |
*** nanzha has joined #openstack-nova | 01:44 | |
*** brinzhang has quit IRC | 01:47 | |
*** abaindur has quit IRC | 01:48 | |
openstackgerrit | Merged openstack/nova stable/stein: Switch to opensuse-15 nodeset https://review.opendev.org/692033 | 01:51 |
---|---|---|
openstackgerrit | Merged openstack/nova master: Add functional test for two-cell scheduler behaviors https://review.opendev.org/452006 | 01:51 |
*** spatel has quit IRC | 01:53 | |
*** dave-mccowan has quit IRC | 01:56 | |
*** ociuhandu has joined #openstack-nova | 01:59 | |
*** ociuhandu has quit IRC | 02:07 | |
*** markvoelker has joined #openstack-nova | 02:26 | |
*** markvoelker has quit IRC | 02:30 | |
openstackgerrit | Brin Zhang proposed openstack/nova-specs master: Allow specify user to reset password https://review.opendev.org/682302 | 02:37 |
*** ociuhandu has joined #openstack-nova | 02:55 | |
*** ociuhandu has quit IRC | 03:02 | |
*** mkrai has joined #openstack-nova | 03:05 | |
*** brinzhang_ has quit IRC | 03:24 | |
*** brinzhang_ has joined #openstack-nova | 03:25 | |
*** mtreinish has quit IRC | 03:28 | |
*** psachin has joined #openstack-nova | 03:35 | |
*** factor has joined #openstack-nova | 03:47 | |
*** sapd1_ has joined #openstack-nova | 04:02 | |
*** sapd1 has quit IRC | 04:02 | |
*** ociuhandu has joined #openstack-nova | 04:08 | |
*** brinzhang has joined #openstack-nova | 04:09 | |
*** brinzhang_ has quit IRC | 04:11 | |
*** ociuhandu has quit IRC | 04:13 | |
*** ociuhandu has joined #openstack-nova | 04:14 | |
*** mkrai has quit IRC | 04:19 | |
*** ociuhandu has quit IRC | 04:26 | |
*** links has joined #openstack-nova | 04:31 | |
*** tetsuro has quit IRC | 04:34 | |
*** tetsuro has joined #openstack-nova | 04:36 | |
*** udesale has joined #openstack-nova | 04:38 | |
*** brinzhang_ has joined #openstack-nova | 04:58 | |
*** brinzhang has quit IRC | 05:02 | |
*** mkrai has joined #openstack-nova | 05:06 | |
*** mdbooth has quit IRC | 05:06 | |
*** brinzhang has joined #openstack-nova | 05:08 | |
*** brinzhang_ has quit IRC | 05:11 | |
*** mdbooth has joined #openstack-nova | 05:12 | |
*** trident has quit IRC | 05:18 | |
*** trident has joined #openstack-nova | 05:25 | |
*** ociuhandu has joined #openstack-nova | 05:31 | |
*** mkrai has quit IRC | 05:31 | |
*** psachin has quit IRC | 05:35 | |
*** ociuhandu has quit IRC | 05:35 | |
*** nanzha has quit IRC | 06:00 | |
*** nanzha has joined #openstack-nova | 06:01 | |
*** threestrands has quit IRC | 06:02 | |
*** mkrai has joined #openstack-nova | 06:20 | |
*** mkrai has quit IRC | 06:26 | |
*** mkrai has joined #openstack-nova | 06:26 | |
*** jawad_axd has joined #openstack-nova | 06:30 | |
openstackgerrit | Sundar Nadathur proposed openstack/nova-specs master: Updated Nova-Cyborg interaction spec. https://review.opendev.org/684151 | 06:31 |
*** mkrai has quit IRC | 06:37 | |
*** mkrai has joined #openstack-nova | 06:38 | |
*** psachin has joined #openstack-nova | 06:45 | |
*** dlbewley has quit IRC | 06:51 | |
*** lpetrut has quit IRC | 06:55 | |
*** Xuchu has quit IRC | 07:00 | |
*** ociuhandu has joined #openstack-nova | 07:02 | |
*** ociuhandu has quit IRC | 07:07 | |
*** Xuchu has joined #openstack-nova | 07:07 | |
*** jawad_axd has quit IRC | 07:13 | |
*** brinzhang_ has joined #openstack-nova | 07:13 | |
openstackgerrit | Merged openstack/nova master: Refactor rebuild_instance https://review.opendev.org/688419 | 07:16 |
*** brinzhang has quit IRC | 07:16 | |
*** brinzhang has joined #openstack-nova | 07:19 | |
*** brinzhang_ has quit IRC | 07:22 | |
*** pcaruana has joined #openstack-nova | 07:27 | |
*** jawad_axd has joined #openstack-nova | 07:27 | |
*** yaawang has quit IRC | 07:31 | |
*** jawad_ax_ has joined #openstack-nova | 07:36 | |
*** nanzha has quit IRC | 07:36 | |
*** nanzha has joined #openstack-nova | 07:38 | |
*** jawad_axd has quit IRC | 07:38 | |
*** ccamacho has quit IRC | 07:43 | |
*** panda|pto has quit IRC | 07:46 | |
*** panda has joined #openstack-nova | 07:46 | |
*** Xuchu has quit IRC | 07:47 | |
*** mkrai has quit IRC | 07:56 | |
*** jamesden_ has quit IRC | 08:09 | |
*** jamesdenton has joined #openstack-nova | 08:09 | |
*** mkrai has joined #openstack-nova | 08:18 | |
openstackgerrit | Merged openstack/nova stable/stein: libvirt: Ignore volume exceptions during post_live_migration https://review.opendev.org/691282 | 08:23 |
*** mkrai has quit IRC | 08:24 | |
*** mkrai has joined #openstack-nova | 08:24 | |
*** yaawang has joined #openstack-nova | 08:26 | |
*** markvoelker has joined #openstack-nova | 08:28 | |
openstackgerrit | Silvan Kaiser proposed openstack/nova master: Move Nova Quobyte driver to LibvirtMountedFileSystemVolumeDriver https://review.opendev.org/687066 | 08:31 |
*** ivve has joined #openstack-nova | 08:32 | |
openstackgerrit | Silvan Kaiser proposed openstack/nova master: [WIP]Move Nova Quobyte driver to LibvirtMountedFileSystemVolumeDriver https://review.opendev.org/687066 | 08:32 |
*** markvoelker has quit IRC | 08:33 | |
*** tkajinam has quit IRC | 08:53 | |
*** mkrai has quit IRC | 08:57 | |
*** mkrai has joined #openstack-nova | 09:01 | |
*** ralonsoh has joined #openstack-nova | 09:07 | |
*** Dinesh_Bhor has quit IRC | 09:07 | |
*** trident has quit IRC | 09:27 | |
*** links has quit IRC | 09:30 | |
*** links has joined #openstack-nova | 09:33 | |
*** trident has joined #openstack-nova | 09:34 | |
*** ociuhandu has joined #openstack-nova | 09:40 | |
*** Liang__ has quit IRC | 09:59 | |
*** mkrai has quit IRC | 10:01 | |
*** trident has quit IRC | 10:30 | |
*** sapd1 has joined #openstack-nova | 10:31 | |
*** ociuhandu has quit IRC | 10:34 | |
*** trident has joined #openstack-nova | 10:36 | |
*** mkrai has joined #openstack-nova | 10:39 | |
*** jraju__ has joined #openstack-nova | 10:40 | |
*** links has quit IRC | 10:41 | |
*** tbachman has quit IRC | 10:42 | |
*** brinzhang has quit IRC | 11:10 | |
*** sapd1 has quit IRC | 11:12 | |
*** liuyulong has joined #openstack-nova | 11:22 | |
*** pcaruana has quit IRC | 11:26 | |
*** dviroel has joined #openstack-nova | 11:37 | |
*** pcaruana has joined #openstack-nova | 11:41 | |
*** cdent has joined #openstack-nova | 11:44 | |
*** sapd1_ has quit IRC | 11:52 | |
*** sapd1 has joined #openstack-nova | 11:53 | |
*** mkrai has quit IRC | 12:00 | |
*** sapd1 has quit IRC | 12:18 | |
*** sapd1 has joined #openstack-nova | 12:18 | |
*** markvoelker has joined #openstack-nova | 12:24 | |
*** udesale has quit IRC | 12:38 | |
*** udesale has joined #openstack-nova | 12:38 | |
*** psachin has quit IRC | 12:38 | |
*** francoisp has quit IRC | 12:38 | |
*** spatel has joined #openstack-nova | 12:40 | |
*** Sundar has joined #openstack-nova | 12:47 | |
*** udesale has quit IRC | 12:59 | |
*** mmethot_ has joined #openstack-nova | 13:03 | |
openstackgerrit | Merged openstack/nova master: Default AZ for instance if cross_az_attach=False and checking from API https://review.opendev.org/469675 | 13:05 |
*** mmethot has quit IRC | 13:05 | |
*** mriedem has joined #openstack-nova | 13:08 | |
*** trident has quit IRC | 13:09 | |
openstackgerrit | Matt Riedemann proposed openstack/nova stable/rocky: libvirt: Ignore volume exceptions during post_live_migration https://review.opendev.org/691283 | 13:09 |
mriedem | need a couple of cores to review gibi's evacuate + qos ports changes, the top is trivial, the bottom is a little more work but not bad, lots of test coverage and i'm +2 on it https://review.opendev.org/#/q/topic:bp/support-move-ops-with-qos-ports-ussuri+status:open | 13:12 |
mriedem | s/a couple of cores/one other core/ | 13:14 |
*** trident has joined #openstack-nova | 13:15 | |
efried | I tried to hit those on Wed but didn't have it in me. I'll try again today. | 13:23 |
efried | mriedem, dansmith: do you think the cyborg tempest job should ultimately be voting or nonvoting in nova? The ironic job is nonvoting but the neutron jobs are voting, so I'm not sure there's a clear precedent for "integration test with $service" | 13:24 |
sean-k-mooney | efried: well if neutron is broken we can't boot vms | 13:25 |
sean-k-mooney | efried: if ironic is broken we just can't manage baremetal but the other virt drivers should still work | 13:25 |
sean-k-mooney | efried: i would assume it would be non voting at least at first | 13:25 |
efried | seems like a reasonable line of thought. Thanks. | 13:26 |
*** tbachman has joined #openstack-nova | 13:26 | |
*** READ10 has joined #openstack-nova | 13:27 | |
mriedem | the ironic job is voting in other projects, just not nova, likely because no one cared to make it voting in nova | 13:33 |
mriedem | and historically the ironic jobs were very inconsistent and failed / timed out frequently | 13:33 |
Sundar | sean-k-mooney, efried: I agree it should be non-voting now. With the fake driver, we are putting up a VM without devices. | 13:33 |
sean-k-mooney | right its just testing the api workflow | 13:34 |
mriedem | we also have a barbican job in the experimental queue, but i'm not sure how stable that is either | 13:34 |
sean-k-mooney | even so that is valuable to test | 13:34 |
sean-k-mooney | with zuulv3 and the fact we can manage this all in repo it's not hard to change if we think it's stable and worth it | 13:35 |
Sundar | mriedem, sean-k-mooney, efried: What is the criterion to place a job in experimental, instead of check alone? | 13:36 |
efried | "We don't want this to run with every patch set" | 13:36 |
efried | "only on demand" | 13:36 |
efried | Considering how integrated the cyborg callouts are in the general flow of nova, IMO this one needs to be in check | 13:37 |
*** damien_r has joined #openstack-nova | 13:37 | |
*** damien_r has quit IRC | 13:38 | |
efried | ...once the code is in place. Which is what's signified by the job-add patch being on top of the series. | 13:38 |
*** damien_r has joined #openstack-nova | 13:39 | |
sean-k-mooney | efried: ya i think it should be in check too as non voting | 13:39 |
*** damien_r has quit IRC | 13:39 | |
*** damien_r has joined #openstack-nova | 13:39 | |
Sundar | efried: Re. the criterion "We don't want this to run with every patch set" -- we may want to run the tests when anything changes in the compute manager or virt driver, at least. | 13:41 |
Sundar | Well, at least libvirt driver | 13:41 |
sean-k-mooney | Sundar: right, eric was commenting about why it would be in experimental | 13:41 |
sean-k-mooney | if it's in experimental one of the reasons for that is we don't want it to run on every patch | 13:41 |
sean-k-mooney | for cyborg we want it to run on most patches | 13:42 |
sean-k-mooney | so it's better to keep it in check | 13:42 |
Sundar | sean-k-mooney: Sure, makes sense. How about gate? | 13:42 |
efried | Sundar: If you wanted to try to come up with a reasonable setting for 'irrelevant-files' you could do that. But for something coupled into the compute manager and virt driver and scheduler and conductor like this, I think that would be very difficult to nail down effectively. | 13:42 |
efried | Sundar: if it's nonvoting, it doesn't make sense for it to be in the gate queue. | 13:42 |
dansmith | definitely | 13:42 |
dansmith | maybe nova/doc :) | 13:43 |
Sundar | efried: So check only? | 13:43 |
sean-k-mooney | the default irrelevant-files we use for tempest jobs should be fine | 13:43 |
efried | Sundar: yes. Check, nonvoting. I commented as such in the patch. | 13:43 |
sean-k-mooney | Sundar: i think so, we don't run most of the backend specific jobs in gate | 13:43 |
efried | which btw is here https://review.opendev.org/#/c/670999/ for those following along at home | 13:44 |
efried | Sundar: if we made it voting at some point, we would probably want to add it to gate for the same reasons that motivated us to make it voting (whatever those are). | 13:44 |
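The 'irrelevant-files' idea discussed above can be sketched in a few lines of Python. This is only a model of the behaviour as described in the conversation (a job is skipped when every changed file matches one of the listed patterns); the pattern list and the `job_should_run` helper are hypothetical and are not taken from nova's actual Zuul configuration.

```python
import re

# Hypothetical patterns for illustration; not nova's real job definition.
IRRELEVANT_FILES = [
    r"^doc/.*$",
    r"^releasenotes/.*$",
    r"^.*\.rst$",
]


def job_should_run(changed_files, irrelevant_patterns):
    """Return True if at least one changed file matches no irrelevant pattern."""
    compiled = [re.compile(p) for p in irrelevant_patterns]
    for path in changed_files:
        if not any(rx.match(path) for rx in compiled):
            # This file is relevant to the job, so the job runs.
            return True
    # Every changed file matched an irrelevant pattern: skip the job.
    return False
```

A change touching only docs would skip the job under this model, while any change that also touches the compute manager or a virt driver would trigger it, which is why a narrow list is hard to get right for something coupled into so many services.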
Sundar | efried: Good. Yumeng wasn't clear about that last night. | 13:45 |
*** damien_r has quit IRC | 13:45 | |
Sundar | efried, sean-k-mooney: On the PTG front, I'll schedule a cross-project Cyborg/Nova session. | 13:46 |
*** artom has quit IRC | 13:46 | |
Sundar | for whatever topics are remaining -- specs or patch | 13:47 |
mriedem | some non-voting jobs in check are more restrictive on when they run, see the nova-lvm job | 13:47 |
mriedem | talking about a gating cyborg job in nova is jumping the gun by a mile | 13:47 |
Sundar | mriedem: Sure, was just asking | 13:47 |
mriedem | dansmith: the state of the gate sub-thread about slowness and timing out of the POST to resize a server reminded me of something you said in my cross-cell series at some point, the api currently calls conductor which calls the scheduler and it's all blocking until we pick a dest compute and cast to it, | 13:48 |
mriedem | you said something about adding long_rpc_timeout to that, | 13:49 |
sean-k-mooney | Sundar: i can try to follow some of the etherpad remotely but i won't be at the ptg | 13:49 |
dansmith | mriedem: oh for resize... | 13:49 |
mriedem | i don't think that would help the particular case found in the email (tempest would have timed out after 3 minutes i think) but in general it's maybe something we should do anyway since that could be a lengthy blocking resize call on a big deployment like cern | 13:49 |
dansmith | mriedem: yeah | 13:49 |
mriedem | i was surprised when i found that was a blocking call in the first place | 13:50 |
mriedem | since usually that's a no-no from the api | 13:50 |
dansmith | mriedem: when looking into the cells scheduler fail, I found one of those page_cleaner errors too, but it was only like 8s over the expected 1s, yours is 161s, which is a big(ger) deal | 13:50 |
mriedem | dansmith: so one question, and maybe the answer is "meh, doesn't matter", but would we want to only use the long_rpc_timeout in the call from the api but *not* the reschedule call from the compute where we won't hit the scheduler again and should be much faster? | 13:51 |
sean-k-mooney | mriedem: so a resize blocks all the way until the scheduler selects a destination? yeah, i would have expected that to be async | 13:51 |
mriedem | or just keep it simple and use long_rpc_timeout for both | 13:51 |
mriedem | sean-k-mooney: more than that, until it selects a dest and conductor casts to its prep_resize method | 13:51 |
dansmith | mriedem: well, long_rpc for the first call for sure, because that's the cascade.. on the second one, yeah I dunno.. could do both and if they're really running, then let 'em run | 13:52 |
mriedem | the reschedule still has to re-swap allocations i guess... | 13:52 |
mriedem | yeah seems simple to just do both | 13:52 |
dansmith | so, | 13:53 |
dansmith | crazy thought just entered my head | 13:53 |
*** kaisers_ has joined #openstack-nova | 13:53 | |
* sean-k-mooney braces for what is coming | 13:53 | |
dansmith | we *could* add configs for each service that lets you provide rpc method names that we do as long-call, | 13:53 |
*** nanzha has quit IRC | 13:53 | |
dansmith | which would make it easier to experiment with that on a running system when there's an issue like this | 13:54 |
*** nanzha has joined #openstack-nova | 13:54 | |
dansmith | we could hard-code the ones we know are important, but others could be converted by virtue of being mentioned | 13:54 |
dansmith | I know, it's a little crazy, but.. | 13:54 |
mriedem | seems a bit fickle, meaning typos would be easy to screw that up, or someone renaming a method, though i doubt that ever happens | 13:55 |
sean-k-mooney | how would we do that? add a decorator that inspects the config on invocation and converts the call to a long rpc if found, or something like that | 13:55 |
mriedem | you'd just check if the method name is in the configured list | 13:55 |
mriedem | if 'migrate_server' in CONF.conductor.long_rpc_timeout_calls | 13:56 |
mriedem | is one way | 13:56 |
dansmith | mriedem: yep all true | 13:56 |
sean-k-mooney | right i was just wondering whether this would be spread through the code or whether we could centralise it in one method that they all use | 13:56 |
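A minimal sketch of the config-driven idea floated here, with the check centralised in one helper. `FakeConf` is a stand-in dataclass rather than oslo.config, `long_rpc_timeout_calls` is the hypothetical option name from the conversation, `timeout_for` and `HARD_CODED_LONG_CALLS` are invented for illustration, and the timeout values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FakeConf:
    """Stand-in for a service's config group; not oslo.config."""
    rpc_response_timeout: int = 60     # placeholder value
    long_rpc_timeout: int = 1800       # placeholder value
    # Hypothetical option: RPC method names to treat as long calls.
    long_rpc_timeout_calls: list = field(default_factory=list)


# Methods that always get the long timeout (illustrative choice only).
HARD_CODED_LONG_CALLS = frozenset({"migrate_server"})


def timeout_for(method, conf):
    """Pick the RPC timeout for a method name in one central place."""
    if method in HARD_CODED_LONG_CALLS or method in conf.long_rpc_timeout_calls:
        return conf.long_rpc_timeout
    return conf.rpc_response_timeout
```

With `FakeConf(long_rpc_timeout_calls=["select_destinations"])`, both `migrate_server` and `select_destinations` would get the long timeout and everything else the default. A typo or a renamed method in the list would silently fall back to the default timeout.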
mriedem | i think that would be more useful before we had long_rpc_timeout where you otherwise just had to configure the global rpc_response_timeout for everything if you had one slow thing, like select_destinations | 13:57 |
mriedem | i'm sure vio deals with that | 13:57 |
mriedem | now having said that, | 13:57 |
mriedem | i think we know it's an issue for some object indirection api db queries on big lists | 13:57 |
mriedem | e.g. https://review.opendev.org/#/c/633042/ | 13:58 |
mriedem | but that would also be harder to configure properly i think | 13:58 |
dansmith | we should make the indirection methods use it always anyway probably | 13:59 |
mriedem | long_rpc_timeout? i thought about it when investigating that bug | 13:59 |
dansmith | yes | 13:59 |
sean-k-mooney | what would be the cost of making long_rpc the default, out of interest? i assume there is a reason for being selective beyond the fact many rpc methods don't strictly need it | 14:00 |
dansmith | it increases rabbit load a little bit | 14:01 |
sean-k-mooney | ok i'm just wondering how much of a whack-a-mole problem this is | 14:02 |
mriedem | for object methods, could we distinguish between an object list query and just a single object query? | 14:02 |
dansmith | it's exactly a whack-a-mole problem, which is why making it config-driven so people can whack said moles without code changes might be reasonable :) | 14:02 |
dansmith | mriedem: yes | 14:02 |
dansmith | mriedem: if you're requesting so many objects that you can't get it done in under the default limit, there are probably other issues, and if it's a load thing, then small queries can be delayed behind big ones, so.. I dunno how meaningful that distinction is | 14:04 |
mriedem | yeah. at what point do we let someone configure themselves to wait too long rather than solve their bigger load problems by scaling out/up their control plane. | 14:05 |
mriedem | i guess that answer probably comes down to their SLAs | 14:05 |
mriedem | or whatever their performance testing thresholds are | 14:05 |
dansmith | yeah, I'm not really suggesting we let them configure long rpc calls externally to solve problems, | 14:06 |
dansmith | more to experiment and report (or even just for our own experimentation in devstack) | 14:06 |
dansmith | not that it's really easier for us I guess | 14:06 |
dansmith | anyway, it was just a thought | 14:06 |
*** eharney has joined #openstack-nova | 14:11 | |
*** jaosorior has joined #openstack-nova | 14:13 | |
Sundar | mriedem, sean-k-mooney, dansmith: What would you like to see, either in the PTG or in terms of spec/patches, to move Cyborg forward? | 14:15 |
dansmith | Sundar: I'm not going to be at the ptg | 14:17 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Use long_rpc_timeout in conductor migrate_server RPC API call https://review.opendev.org/692550 | 14:20 |
Sundar | dansmith: I see. Would you like to see any changes in https://review.opendev.org/#/q/status:open+project:openstack/nova+bp/nova-cyborg-interaction ? Just checking to see what it takes to merge this patch series. | 14:23 |
dansmith | Sundar: reviews. doesn't look like it's had much of any, and the little it has had seems pretty un-diverse | 14:23 |
*** jawad_ax_ has quit IRC | 14:23 | |
dansmith | which I know is why you're asking, just saying | 14:23 |
dansmith | I popped up one to see and it looks like it's in merge conflict and still has stuff like random whitespace damage in it (I'll comment in a sec), but that kind of stuff doesn't make it look like it's ready to go, fwiw | 14:24 |
*** jawad_axd has joined #openstack-nova | 14:24 | |
*** maciejjozefczyk has joined #openstack-nova | 14:25 | |
Sundar | The conflicts happened recently. The series has passed Zuul checks for the most part and has been ready to go since Train merged. | 14:26 |
Sundar | Please ignore the conflicts for now and focus on the content | 14:26 |
* dansmith wanders off to some patches not in merge conflict | 14:28 | |
mriedem | is there functional testing anywhere in the series? | 14:28 |
mriedem | i'm not even talking tempest, | 14:28 |
mriedem | but like creating a cyborg fixture in nova like we have for cinder and neutron, | 14:28 |
*** mtreinish has joined #openstack-nova | 14:28 | |
*** jawad_axd has quit IRC | 14:28 | |
mriedem | so that we can start writing functional in-tree tests that use that fixture and at least exercise the runtime code in nova, despite using a fixture for the cyborg apis | 14:28 |
mriedem | i didn't ask/expect anyone to review my cross-cell resize series until i had some solid functional testing of at least the happy paths, | 14:29 |
mriedem | i don't think i even started asking for reviews until i had a multi-cell gate job working | 14:29 |
mriedem | right now this all looks like unit tests | 14:29 |
dansmith | mriedem: please ignore the conflicts and lack of testing and focus on the content | 14:30 |
mriedem | the next closest thing to the cyborg stuff is probably gibi's qos ports series | 14:30 |
mriedem | and he did a lot of functional test for those b/c of the complicated nature of dealing with nested provider allocations | 14:30 |
mriedem | so maybe you could build on some of that while also working on a beginning of a cyborg fixture for functional testing - at least the basic happy path, | 14:31 |
mriedem | create a server with arqs and then delete the server and verify the arqs are deleted/unbound whatever happens to those | 14:31 |
mriedem | the fixture would obviously have to track the state of the arqs associated with a server | 14:31 |
mriedem | which we do in the NeutronFixture for ports and volumes in the CinderFixture | 14:31 |
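The kind of state tracking described for such a fixture could look roughly like the sketch below. It is a plain in-memory fake, not nova's NeutronFixture or CinderFixture and not the real Cyborg API; the class name, method names and the "Initial"/"Bound" state strings are all assumptions made for illustration.

```python
import uuid


class FakeCyborgFixture:
    """Hypothetical in-memory stand-in for the Cyborg ARQ API."""

    def __init__(self):
        # Maps ARQ uuid -> dict holding its state and bound instance.
        self.arqs = {}

    def create_arqs(self, device_profile_name, count=1):
        created = []
        for _ in range(count):
            arq = {
                "uuid": str(uuid.uuid4()),
                "device_profile_name": device_profile_name,
                "state": "Initial",      # illustrative state name
                "instance_uuid": None,
            }
            self.arqs[arq["uuid"]] = arq
            created.append(arq)
        return created

    def bind_arqs(self, arq_uuids, instance_uuid):
        for arq_uuid in arq_uuids:
            arq = self.arqs[arq_uuid]
            arq["instance_uuid"] = instance_uuid
            arq["state"] = "Bound"       # illustrative state name

    def arqs_for_instance(self, instance_uuid):
        return [a for a in self.arqs.values()
                if a["instance_uuid"] == instance_uuid]

    def delete_arqs_for_instance(self, instance_uuid):
        # Iterate over a list copy so deleting from the dict is safe.
        for arq in self.arqs_for_instance(instance_uuid):
            del self.arqs[arq["uuid"]]
```

A happy-path functional test would then create a server with ARQs, delete it, and assert that `arqs_for_instance` returns nothing for that server.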
*** ociuhandu has joined #openstack-nova | 14:31 | |
Sundar | mriedem: Looking at the discussion we had at the Cyborg/Nova cross-project at Denver PTG: https://etherpad.openstack.org/p/ptg-train-xproj-nova-cyborg | 14:32 |
Sundar | The criterion for integration was said to be upstream tempest CI with a fake driver. We got the fake driver done, and tempest can create/delete VMs with that fake driver now | 14:33 |
dansmith | Sundar: yeah, that's necessary too, | 14:34 |
dansmith | Sundar: but just about anything with tentacles running from API to the deepest depths of the compute service has serious functional testing | 14:34 |
Sundar | mriedem, dansmith: Just to be clear, you are talking about extending today's Nova functional testing, right? Not gabbi? | 14:35 |
mriedem | nova doesn't use gabbi | 14:35 |
mriedem | placement does | 14:35 |
*** ociuhandu has quit IRC | 14:36 | |
mriedem | Sundar: some solid examples are what gibi did with qos port functional testing, for which this is the parent class https://github.com/openstack/nova/blob/master/nova/tests/functional/test_servers.py#L5565 | 14:36 |
mriedem | so here is an example https://github.com/openstack/nova/blob/master/nova/tests/functional/test_servers.py#L5996 | 14:37 |
mriedem | "Tests creating a server with a pre-existing port that has a resource request for a QoS minimum bandwidth policy." | 14:37 |
mriedem | we have other tests that interact with cinder https://github.com/openstack/nova/blob/master/nova/tests/functional/test_boot_from_volume.py | 14:37 |
mriedem | well, using a CinderFixture | 14:38 |
mriedem | https://github.com/openstack/nova/blob/master/nova/tests/functional/test_multiattach.py | 14:38 |
Sundar | mriedem: Thanks, will look at it and follow up. | 14:40 |
mriedem | i traced the instance create/delete through the conductor and compute logs for the test_server_basic_ops tempest test and it looked fine, i saw the creating/waiting/deleting arqs stuff, no weird errors in the logs, so that's all good to see | 14:44 |
mriedem | once the series is rebased, if tests are passing and it's ready for review, put it into a runway slot | 14:44 |
mriedem | Sundar: ^ | 14:45 |
Sundar | mriedem: Thanks, will do. | 14:47 |
mriedem | Sundar: this is going to require a microversion bump https://review.opendev.org/#/c/631245/36/nova/api/openstack/compute/schemas/server_external_events.py@36 | 14:48 |
mriedem | you can't change schema on a versioned api without a new microversion | 14:48 |
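The microversion point can be illustrated with a small sketch: a new allowed value only becomes visible to requests at or above the version that introduced it, so older clients keep seeing the old schema. This is not nova's validation code; the event name `hypothetical-new-event`, the version `2.99` and both helper functions are placeholders invented for the example.

```python
def parse_version(text):
    """Turn 'major.minor' into a tuple so versions compare numerically."""
    major, minor = text.split(".")
    return int(major), int(minor)


# Event names accepted by the pre-existing schema (subset, for illustration).
BASE_EVENT_NAMES = ["network-changed", "network-vif-plugged"]

# Placeholder microversion at which the new event name would be accepted.
NEW_EVENT_MICROVERSION = "2.99"


def allowed_event_names(requested_version):
    """Return the event names valid for the requested microversion."""
    names = list(BASE_EVENT_NAMES)
    if parse_version(requested_version) >= parse_version(NEW_EVENT_MICROVERSION):
        names.append("hypothetical-new-event")
    return names
```

Comparing parsed tuples rather than strings matters here: as strings, "2.100" would sort before "2.99".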
dansmith | this does not look "ready to go since train" to me | 14:49 |
dansmith | there are things used in early patches that aren't added until later patches | 14:49 |
dansmith | and presumably very large test (even unit) gaps to allow that to happen | 14:50 |
*** jangutter has quit IRC | 14:50 | |
Sundar | dansmith: "used in early patches that aren't added until later patches" -- What are you referring to? | 14:53 |
dansmith | Sundar: the event change. I left comments | 14:53 |
mriedem | Sundar: https://review.opendev.org/#/c/631244/43/nova/compute/manager.py@2609 | 14:54 |
mriedem | Sundar: note that nova still generally tries to subscribe to a philosophy of anything we merge today can be in production today, and people can CD nova | 14:54 |
mriedem | even if that's not the reality of how the vast majority of openstack consumers consume openstack, it's a development philosophy to try and avoid merging broken code | 14:55 |
*** bnemec is now known as beekneemech | 14:56 | |
*** artom has joined #openstack-nova | 14:57 | |
mriedem | having said that i can't easily find anything about that in https://docs.openstack.org/nova/latest/contributor/ but i know it comes up every so often, and not all projects in openstack subscribe to the same philosophy, e.g. if it's all working by RC1 whatever, let it ride | 14:57 |
*** jraju__ has quit IRC | 14:58 | |
dansmith | probably because it's core openstack philosophy that not everyone has adhered to | 14:59 |
mriedem | closest is, https://docs.openstack.org/nova/latest/contributor/process.html#smooth-upgrades "upgrade from any commit, to any future commit, within the same major release" | 14:59 |
dansmith | is, or was | 14:59 |
mriedem | dansmith: i'm just not sure it's clearly documented anywhere | 14:59 |
mriedem | and probably should be | 14:59 |
mriedem | like something simple in https://docs.openstack.org/nova/latest/contributor/policies.html | 15:00 |
dansmith | sounds like a good thing to do to me | 15:01 |
Sundar | I have generally thought of the patch series as a unified whole. Merging individual parts will not give you any useful functionality. The base patch has a procedural -2, so that the whole series will merge together. | 15:01 |
dansmith | it's fine to not have incremental functionality, | 15:01 |
Sundar | That said, I don't believe in merging 'broken patches' either | 15:01 |
dansmith | it's not fine to merge things that use something before the thing is available | 15:01 |
dansmith | and we can't arrange for the whole series to merge as a whole, | 15:02 |
dansmith | which means the -2 holds things until all the pieces are ready, but they still have to be mergeable in isolation, because they will | 15:02 |
mriedem | especially since it takes about 3 days of rechecks to merge anything in nova right now | 15:02 |
dansmith | three days? please | 15:03 |
mriedem | ok ok 5 days | 15:03 |
dansmith | my notifications thing has been going all week.. maybe since last week | 15:03 |
mriedem | for big series like this it's always best to work from the bottom up where everything is noop until "turned on" in the API at the end if possible | 15:03 |
dansmith | yeah, since monday | 15:03 |
mriedem | so there is less risk in merging the bottom noop stuff | 15:04 |
*** JamesBenson has joined #openstack-nova | 15:05 | |
Sundar | mriedem: The code flow starts from the API, such as getting device profile info based on the extra specs. If the API stuff came later, wouldn't the patch sequence be out of order? We can only do UT for the earlier patches then. | 15:07 |
cdent | mriedem: I assume you saw my response on the gate+placement email? Sorry I've not been able to dig more deeply. I've been engaged elsewhere...To such an extent that I may come off the mailing list pretty soon to maintain sanity. | 15:08 |
efried | Sundar: in this case the "master switch" could conceivably be the part that processes the device profile out of the flavor. | 15:09 |
mriedem | cdent: yup, and replied, and thanks | 15:09 |
mriedem | cdent: i think the likeliest culprit is noisy neighbors on oversubscribed nodes as you said | 15:09 |
*** maciejjozefczyk has quit IRC | 15:13 | |
*** kaisers_ is now known as kaisers_away | 15:14 | |
*** maciejjozefczyk has joined #openstack-nova | 15:14 | |
cdent | mriedem: the gist of what I can find on that innodb error (now that I've managed to actually read your responses) is "your disk i/o is way constrained" | 15:17 |
*** cdent has left #openstack-nova | 15:18 | |
*** cdent has joined #openstack-nova | 15:18 | |
*** kaisers_away is now known as kaisers_ | 15:18 | |
*** kaisers_ is now known as kaisers_away | 15:19 | |
mriedem | cdent: heh, yeah looking at this dstat graph, i/o is just 0 from 16:52 to 16:55 | 15:19 |
cdent | oh dear | 15:19 |
cdent | there should be some kind of "how am I supposed to work in these conditions" meme we can interject here | 15:20 |
*** gyee has joined #openstack-nova | 15:20 | |
efried | https://makeameme.org/meme/i-cant-work-2sl5sn | 15:21 |
mriedem | https://toddroblog.files.wordpress.com/2017/02/drunk-hobo-flips-double-bird-1.jpg?w=809 ? | 15:21 |
mriedem | oh blurred out?! | 15:21 |
*** kaisers_away is now known as kaisers_ | 15:22 | |
mriedem | i wish i knew about https://lamada.eu/dstat-graph/# years ago | 15:23 |
cdent | yeah, why don't things like that get passed around more easily? | 15:26 |
*** dlbewley has joined #openstack-nova | 15:26 | |
mriedem | probably because i was too shy to say "i'm dumb, how can i see this dstat output in a graph?" | 15:27 |
mriedem | i assume all of my co-workers write tools to generate graphs on the fly for everything they do | 15:27 |
cdent | https://www.youtube.com/watch?v=MKIaS0lh-uo | 15:29 |
*** macz has joined #openstack-nova | 15:29 | |
mriedem | heh, it's rare to get a repo man reference in here | 15:30 |
cdent | what provider is that 0 I/O on? | 15:33 |
mriedem | inap | 15:33 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: doc: link to nova code review guide from dev policies https://review.opendev.org/692569 | 15:35 |
*** igordc has joined #openstack-nova | 15:36 | |
*** TxGirlGeek has joined #openstack-nova | 15:38 | |
*** kaisers_ is now known as kaisers_away | 15:39 | |
*** igordc has quit IRC | 15:47 | |
*** artom has quit IRC | 15:47 | |
*** TxGirlGeek has quit IRC | 15:47 | |
* dansmith puts on "Round and Round" as he goes round with Zuul again | 15:49 | |
efried | I already rechecked a few dansmith | 15:54 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Document CD mentality policy for nova contributors https://review.opendev.org/692572 | 15:54 |
mriedem | dansmith: i took a crack at documenting that ^ | 15:54 |
dansmith | efried: in that case, https://www.youtube.com/watch?v=0u8teXR8VE4 | 15:54 |
mriedem | dansmith: thoughts on the no graceful shutdown thing i mentioned here? http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010495.html - wondering if we should go back to like a 10 second shutdown timeout in case that is screwing up the guest during things that transfer the root disk or snapshot it (shelve/unshelve) leading to ssh failures later | 15:56 |
mriedem | i had made that change in devstack b/c of a real bug where we see the 60 second shutdown timeout in nova not doing anything, potentially leading to overall job timeouts if we're waiting a full minute to power off guests during a tempest run | 15:57 |
dansmith | mriedem: I doubt that cirros is responding to the graceful shutdown request at all | 15:57 |
dansmith | mriedem: so I'm not sure it matters | 15:58 |
dansmith | if you have a devstack up you should be able to test | 15:58 |
dansmith | even still, unless it's writing to the disk when the destroy happens, it wouldn't likely do any damage (and shouldn't really anyway) | 15:58 |
dansmith | don't we get the console of the guest we fail to ssh into if we do? unless it's sitting at a panic or some unhappy state, I'd not be concerned | 15:59 |
mriedem | yes we do | 15:59 |
*** mmethot_ has quit IRC | 15:59 | |
*** kaisers_away is now known as kaisers_ | 16:00 | |
*** kaisers_ is now known as kaisers_away | 16:00 | |
sean-k-mooney | mriedem: are you suspecting disk corruption? | 16:00 |
mriedem | there are several guest ssh failure bugs in the gate right now, many different reasons (dhcp lease fails, out of space on the guest disk [growroot fails], kernel panic) | 16:00 |
sean-k-mooney | due to an aggressive shutdown? | 16:00 |
mriedem | i'm grasping at straws | 16:00 |
mriedem | in one case that i caught in the gate where the 60 second shutdown timed out, i got the guest console log http://paste.openstack.org/show/752002/ | 16:01 |
mriedem | cdent wondered (in the bug) if the metadata api retry loop in the guest was making it ignore the shutdown request | 16:01 |
mriedem | https://bugs.launchpad.net/nova/+bug/1829896 | 16:01 |
openstack | Launchpad bug 1829896 in OpenStack Compute (nova) "libvirt: "Instance failed to shutdown in 60 seconds." in the gate" [Undecided,New] | 16:01 |
sean-k-mooney | well in that case it got a dhcp lease but then could not hit the metadata service | 16:02 |
mriedem | i'm also wondering if there are ideas on figuring out what is eating up all of the disk in these guests when growroot fails | 16:02 |
sean-k-mooney | so that looks like a neutron issue with the metadata proxy | 16:02 |
sean-k-mooney | if glean/cloud init is waiting for the metadata then | 16:03 |
sean-k-mooney | the ssh service may not have started | 16:03 |
openstackgerrit | Lee Yarwood proposed openstack/nova stable/queens: Add 'path' query parameter to console access url https://review.opendev.org/692573 | 16:03 |
*** ivve has quit IRC | 16:03 | |
lyarwood | mriedem: yup, working on the other change now thanks | 16:06 |
*** jmlowe has quit IRC | 16:07 | |
*** maciejjozefczyk has quit IRC | 16:07 | |
sean-k-mooney | mriedem: i do know we have had issue with hot plug if the early init in the guest takes too long | 16:07 |
*** maciejjozefczyk has joined #openstack-nova | 16:07 | |
mriedem | lyarwood: i'm sort of inclined to say you guys (RH) should keep that one downstream | 16:08 |
sean-k-mooney | but we also see Starting acpid: OK | 16:08 |
mriedem | since i don't see people clamoring for using the bleeding edge novnc stuff back on queens | 16:08 |
mriedem | i'm not like -2, but those are my "feelings" | 16:09 |
mriedem | what you humans would call feelings anyway | 16:09 |
lyarwood | haha yeah no issues, thought I'd at least post it upstream first to see if anyone wanted it | 16:09 |
sean-k-mooney | ah ha proof mriedem is actually an irc bot !!! | 16:10 |
sean-k-mooney | mriedem: one other random straw is there seems to be very little entropy in the guest | 16:10 |
sean-k-mooney | random: dd urandom read with 31 bits of entropy available | 16:10 |
sean-k-mooney | the topic of enabling the virtio random number generator by default has come up before. i wonder if it could be related to that. | 16:12 |
*** markvoelker has quit IRC | 16:12 | |
mriedem | i seem to remember a recentish patch of kashyap's related to this | 16:14 |
*** artom has joined #openstack-nova | 16:14 | |
mriedem | i'm just thinking of this https://review.opendev.org/#/c/577385/ | 16:15 |
*** artom has joined #openstack-nova | 16:15 | |
sean-k-mooney | ya there is that one but we also talked a little about adding the random number generator to the libvirt xml by default rather than needing to set the image property | 16:16 |
mriedem | we don't use that in our devstack guests though since we don't have hw_rng_model=virtio in the image | 16:16 |
mriedem | oh | 16:16 |
mriedem | well, i have to go to what you humans would call lunch now | 16:16 |
*** mriedem is now known as mriedem_feeds | 16:16 | |
*** tbachman has quit IRC | 16:23 | |
*** markvoelker has joined #openstack-nova | 16:23 | |
*** jmlowe has joined #openstack-nova | 16:23 | |
*** tbachman has joined #openstack-nova | 16:24 | |
openstackgerrit | Lee Yarwood proposed openstack/nova stable/queens: Reduce scope of 'path' query parameter to noVNC consoles https://review.opendev.org/692581 | 16:28 |
*** jmlowe has quit IRC | 16:32 | |
*** antonym has quit IRC | 16:48 | |
*** nanzha has quit IRC | 16:48 | |
*** kaisers_away is now known as kaisers_ | 16:50 | |
*** kaisers_ is now known as kaisers_away | 16:50 | |
*** ociuhandu has joined #openstack-nova | 16:52 | |
*** cdent has quit IRC | 16:53 | |
*** Sundar has quit IRC | 16:57 | |
*** tbachman has quit IRC | 16:58 | |
*** kaisers_away is now known as kaisers_ | 17:00 | |
*** kaisers_ is now known as kaisers_away | 17:00 | |
*** tbachman has joined #openstack-nova | 17:00 | |
*** jaosorior has quit IRC | 17:01 | |
*** tbachman has quit IRC | 17:04 | |
*** ivve has joined #openstack-nova | 17:05 | |
*** jmlowe has joined #openstack-nova | 17:07 | |
*** tbachman has joined #openstack-nova | 17:10 | |
*** igordc has joined #openstack-nova | 17:13 | |
dansmith | cripes | 17:19 |
dansmith | cache notifications patch is gonna fail again | 17:19 |
openstackgerrit | John Garbutt proposed openstack/nova-specs master: Add Unified Limits Spec https://review.opendev.org/602201 | 17:21 |
*** antonym has joined #openstack-nova | 17:22 | |
*** ociuhandu has quit IRC | 17:27 | |
*** ociuhandu has joined #openstack-nova | 17:29 | |
*** ociuhandu has quit IRC | 17:33 | |
*** maciejjozefczyk has quit IRC | 17:33 | |
*** jawad_axd has joined #openstack-nova | 17:33 | |
*** mriedem_feeds is now known as mriedem | 17:34 | |
*** ganso has quit IRC | 17:34 | |
*** jawad_axd has quit IRC | 17:34 | |
mriedem | dansmith: it's because of your sin | 17:34 |
* dansmith goes to repent | 17:34 | |
efried | dansmith: I'm +A on mriedem's CD doc patch, but the predecessor (link shuffle) needs to be sent as well | 17:35 |
efried | https://review.opendev.org/#/c/692569/1 | 17:36 |
dansmith | efried: oh sorry I missed that earlier I guess | 17:36 |
dansmith | I dun getified it | 17:36 |
*** ganso has joined #openstack-nova | 17:38 | |
mriedem | dansmith: i looked at a dstat from one of the 'timed out waiting for cell' failed jobs and the only thing around that time is a spike in paging and networking which doesn't tell me much | 17:40 |
dansmith | mriedem: yeah I looked pretty in depth at one dstat from a cell fail job | 17:40 |
dansmith | and it looked overly normal to me | 17:40 |
dansmith | mysql wasn't spiking that I could tell, it was the big memory user, but only like 400MB at the time | 17:40 |
dansmith | IO wasn't crazy, etc | 17:41 |
mriedem | yeah nothing obvious like the 0 i/ops in the other thing earlier today | 17:41 |
dansmith | I didn't catch the zero iops thing | 17:42 |
mriedem | and there is no rabbit involved b/c we're going straight to the db yeah? | 17:42 |
dansmith | you were assuming zero iops meant bad? | 17:42 |
mriedem | the 0 iops thing was related to another bug in the ML where we get long timeouts | 17:42 |
mriedem | which triggered the resize long_rpc_timeout discussion | 17:42 |
mriedem | and a very obvious message in the mysql error log | 17:43 |
dansmith | you mean zero iops for a long period of time or something I assume? | 17:43 |
mriedem | 3 minutes yeah | 17:43 |
mriedem | while placement was processing a POST /allocations request | 17:43 |
dansmith | okay, yeah, one or two samples with zero iops wouldn't be alarming to me, but minutes of it, for sure | 17:43 |
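The distinction drawn above (a couple of zero-I/O samples is noise, minutes of it is a stall) can be sketched in code. This is only an illustration: the `(timestamp_seconds, iops)` sample format, the function name `zero_io_stalls`, and the 60-second default threshold are assumptions for the example, not something dstat or the gate tooling provides.

```python
from typing import List, Tuple


def zero_io_stalls(
    samples: List[Tuple[int, float]], min_seconds: int = 60
) -> List[Tuple[int, int]]:
    """Return (start, end) timestamps of runs where I/O stayed at zero.

    `samples` is a hypothetical list of (timestamp_seconds, iops) pairs in
    time order. Runs shorter than `min_seconds` are ignored as noise.
    """
    stalls = []
    run_start = None  # timestamp of the first zero sample in the current run
    last_zero = None  # timestamp of the most recent zero sample
    for ts, iops in samples:
        if iops == 0:
            if run_start is None:
                run_start = ts
            last_zero = ts
        else:
            # A non-zero sample closes the current run, if there is one.
            if run_start is not None and last_zero - run_start >= min_seconds:
                stalls.append((run_start, last_zero))
            run_start = None
    # Handle a run that is still open at the end of the data.
    if run_start is not None and last_zero - run_start >= min_seconds:
        stalls.append((run_start, last_zero))
    return stalls
```

With one-second samples, a three-minute gap like the one described would show up as a single `(start, end)` pair, while isolated zero samples would be dropped.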
*** READ10 has quit IRC | 17:44 | |
*** eharney has quit IRC | 17:44 | |
mriedem | and https://bugs.launchpad.net/nova/+bug/1825584 shouldn't be related to the scheduler timing out waiting for a response from cell1 b/c there is no uwsgi involved there | 17:46 |
openstack | Launchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,Confirmed] | 17:46 |
mriedem | just scheduler running with eventlet | 17:46 |
*** henriqueof has joined #openstack-nova | 17:53 | |
*** kaisers_away is now known as kaisers_ | 17:55 | |
*** kaisers_ is now known as kaisers_away | 17:55 | |
*** ralonsoh has quit IRC | 18:06 | |
*** tbachman has quit IRC | 18:12 | |
*** ociuhandu has joined #openstack-nova | 18:20 | |
*** jaosorior has joined #openstack-nova | 18:24 | |
*** ociuhandu has quit IRC | 18:28 | |
openstackgerrit | Merged openstack/nova stable/rocky: libvirt: Ignore volume exceptions during post_live_migration https://review.opendev.org/691283 | 18:30 |
openstackgerrit | Matt Riedemann proposed openstack/nova stable/queens: libvirt: Ignore volume exceptions during post_live_migration https://review.opendev.org/691284 | 18:36 |
*** kaisers_away is now known as kaisers_ | 18:38 | |
*** kaisers_ is now known as kaisers_away | 18:38 | |
*** kaisers_away is now known as kaisers_ | 18:40 | |
*** kaisers_ is now known as kaisers_away | 18:40 | |
mriedem | interesting that we don't have a task_state change while confirming a resized server | 18:48 |
mriedem | to prevent other actions from happening while still confirming and cleaning up the source host | 18:48 |
mriedem | but we do have task_states.RESIZE_REVERTING for reverting a resize | 18:57 |
*** kaisers_away is now known as kaisers_ | 18:58 | |
*** kaisers_ is now known as kaisers_away | 18:58 | |
*** luksky has joined #openstack-nova | 19:10 | |
*** jmlowe has quit IRC | 19:12 | |
*** maciejjozefczyk has joined #openstack-nova | 19:14 | |
*** jaosorior has quit IRC | 19:25 | |
*** sapd1 has quit IRC | 19:27 | |
*** sapd1 has joined #openstack-nova | 19:28 | |
*** maciejjozefczyk has quit IRC | 19:32 | |
*** eharney has joined #openstack-nova | 19:35 | |
openstackgerrit | Merged openstack/nova master: doc: link to nova code review guide from dev policies https://review.opendev.org/692569 | 19:36 |
*** markvoelker has quit IRC | 19:36 | |
openstackgerrit | Merged openstack/nova master: Document CD mentality policy for nova contributors https://review.opendev.org/692572 | 19:36 |
*** jmlowe has joined #openstack-nova | 19:39 | |
*** dklyle has quit IRC | 19:44 | |
mriedem | dansmith: MessagingTimeouts in nova-next in the latest failure in https://review.opendev.org/#/c/691390/ | 19:46 |
mriedem | looks like just a slow node? | 19:47 |
mriedem | dstat shows ram usage around 6.5gb for most of the run | 19:47 |
mriedem | well, at least around the time it was timing out | 19:47 |
mriedem | load average really spikes around the time of the timeouts | 19:48 |
*** dklyle has joined #openstack-nova | 19:51 | |
*** markvoelker has joined #openstack-nova | 19:53 | |
*** jamesdenton has quit IRC | 19:57 | |
*** kaisers_away is now known as kaisers_ | 19:58 | |
*** kaisers_ is now known as kaisers_away | 19:58 | |
*** kaisers_away is now known as kaisers_ | 19:59 | |
*** kaisers_ is now known as kaisers_away | 19:59 | |
sean-k-mooney | mriedem: you ruled out memory right | 19:59 |
sean-k-mooney | i allocated a little too many hugepages when playing with dpdk today and when i almost exhausted memory in the vm i was getting big cpu spikes | 20:00 |
*** dklyle has quit IRC | 20:00 | |
*** dklyle has joined #openstack-nova | 20:00 | |
sean-k-mooney | like all cores 100% for 3 seconds then normal then spike then normal | 20:00 |
mriedem | peakmemtracker shows mysql as the top memory user but it doesn't look like it's running out | 20:08 |
mriedem | should be 8gb of ram on the node | 20:08 |
sean-k-mooney | ya you are probably fine | 20:08 |
sean-k-mooney | i had no swap and allocated 4G of hugepages instead of 2 by mistake so i was running out | 20:09 |
*** JamesBenson has quit IRC | 20:15 | |
*** JamesBenson has joined #openstack-nova | 20:15 | |
*** kaisers_away is now known as kaisers_ | 20:18 | |
*** kaisers_ is now known as kaisers_away | 20:18 | |
efried | jroll: You available to brainstorm some of this vTPM stuff? | 20:21 |
jroll | efried: maybe? | 20:21 |
jroll | I'm not really an expert on the TPM bits, but I have a few minutes to chat, yes | 20:22 |
efried | jroll: I'm just trying to understand how the nova flow needs to work. | 20:22 |
jroll | sure, I can try to help. | 20:22 |
efried | If each boot needs to be able to specify an independent value, that's a bigger change than other things. | 20:23 |
efried | Basically what I'm trying to figure out is where the value of `secret` comes from in | 20:24 |
efried | <tpm model='tpm-tis'> | 20:24 |
efried | <backend type='emulator' version='2.0'> | 20:24 |
efried | <encryption secret='6dd3e4a5-1d76-44ce-961f-f119f5aad935'/> | 20:24 |
efried | </backend> | 20:24 |
efried | </tpm> | 20:24 |
jroll | right | 20:24 |
jroll | so take the fact that it's a tpm out of your head | 20:24 |
efried | hah, that's easy enough | 20:24 |
efried | I can't even spell tpm yet. | 20:24 |
jroll | basically, there's a file for the instance on the hypervisor's disk, which needs to be encrypted | 20:24 |
jroll | to encrypt a file, you need a secret | 20:25 |
efried | to your understanding, is the value of `secret` the actual secret, or is it an identifier for a secret that needs to be looked up? | 20:25 |
jroll | the 6dd3e4a5-1d76-44ce-961f-f119f5aad935 in your paste is a reference to a secret object stored in libvirt (i.e. virsh secret-set-value 6dd3e4a5-1d76-44ce-961f-f119f5aad935 myawesomesecret) | 20:25 |
efried | "a reference" <= ✔ | 20:25 |
jroll | see https://libvirt.org/formatsecret.html#vTPMUsageType | 20:26 |
efried | "stored in libvirt" -- from the host or from within the VM | 20:26 |
efried | ...okay... | 20:26 |
jroll | from the host, yeah | 20:26 |
jroll | the VM knows nothing about this file | 20:26 |
efried | oh, look, a vTPM usage man page, that's convenient. | 20:26 |
jroll | :D | 20:26 |
jroll | I found it from a libvirt page you linked in that spec :P | 20:26 |
efried | butbutbut... | 20:26 |
efried | I thought the whole point of this was that someone who walks away with the disk can't get at the TPM | 20:27 |
jroll | correct, because the file representing the TPM is encrypted | 20:27 |
efried | with a key that's available on the host tho? | 20:27 |
jroll | I *assume*, but have not verified, that this secret is stored in memory, not on disk | 20:27 |
jroll | which is why I mentioned we may want to call virSecretSetValue on every instance start() | 20:28 |
jroll | as if I'm correct about it being in-memory, a host reboot or instance migration would mean we need to set that secret again | 20:28 |
efried | I thought I read somewhere that it's in memory | 20:28 |
efried | https://github.com/qemu/qemu/blob/master/docs/specs/tpm.txt | 20:28 |
efried | L14 | 20:29 |
efried | and 31 | 20:29 |
jroll | The TIS interface makes a memory mapped IO region in the area 0xfed40000 - | 20:29 |
jroll | 0xfed44fff available to the guest operating system. | 20:29 |
jroll | I read that as it maps the decrypted version of the file on disk into memory, but I'm not good with terms at this level | 20:30 |
*** ociuhandu has joined #openstack-nova | 20:30 | |
jroll | I thought I read somewhere that swtpm keeps the state on disk, but I don't see it now | 20:31 |
efried | "...reboot or ... migration ... set that secret again" | 20:31 |
jroll | so if swtpm keeps it in memory, and encrypted, that's even better | 20:31 |
efried | If the host has possession of the master secret, everything sounds easy. But it also sounds like it doesn't satisfy your use case, which is "must not be compromised if someone walks off with the disk". | 20:32 |
jroll | if the host keeps the master secret in memory only, that means it is not compromised if someone walks off with the disk | 20:32 |
efried | okay, how does the master secret get into memory? | 20:32 |
jroll | via virSecretSetValue | 20:33 |
jroll | per https://libvirt.org/formatsecret.html#vTPMUsageType | 20:33 |
efried | ...with a value or file that comes from where? | 20:33 |
jroll | that's the TODO in this spec that I don't have an answer to | 20:33 |
jroll | :) | 20:33 |
efried | Because I don't expect a flow like | 20:34 |
efried | - Host boots | 20:34 |
efried | - Operator intervention is required to type in a password or load up a file | 20:34 |
efried | - now your VMs can start | 20:34 |
efried | to be acceptable in general | 20:34 |
jroll | I agree | 20:34 |
jroll | so we'd have to have something in the instance database record, which is either a secret, or a reference to a secret | 20:35 |
jroll | ideally the secret would be in barbican or equivalent | 20:35 |
jroll | and that secret could be fetched by nova-compute as needed | 20:35 |
*** ociuhandu has quit IRC | 20:36 | |
efried | and you're allowed to fetch it by virtue of... being an authed user? | 20:36 |
jroll | I would assume by being the nova user | 20:37 |
jroll | or having access to the rabbit bus, if we had nova-compute ask nova-conductor for it | 20:37 |
efried | <red flag> upcall </red flag> | 20:38 |
jroll | heh | 20:38 |
efried | I was looking over some stuff about trusted certs for image signatures | 20:38 |
jroll | do computes not ever ask the conductor for info? what about all the database reads? | 20:38 |
efried | https://docs.openstack.org/nova/latest/user/certificate-validation.html | 20:39 |
*** ociuhandu has joined #openstack-nova | 20:39 | |
efried | jroll: at least in theory, traffic only ever flows one way, from the conductor down to the compute | 20:39 |
efried | There may be one upcall left over, I can't remember, but we've been working hard to purge them and not allow any new ones. | 20:39 |
jroll | erm, how does a compute read from the database? | 20:40 |
efried | I think there are two databases | 20:40 |
jroll | it isn't an RPC upcall, I get that, but it's still a call to the conductor, just buried in o.vo | 20:40 |
efried | I'm not real well versed on this stuff I'm afraid | 20:40 |
jroll | ok | 20:41 |
umbSublime | I'm not sure that would apply, but say the secret were stored in barbican couldn't the compute host query barbican by means of the service_token ? | 20:41 |
jroll | just sayin': https://github.com/openstack/nova/blob/master/nova/cmd/compute.py#L53 | 20:41 |
jroll | umbSublime: probably, yeah | 20:41 |
* jroll reads this cert validation thing | 20:41 | |
efried | jroll: I don't know how relevant the certificate validation thing is; I was just looking at how it tries to bootstrap secrets. | 20:42 |
*** ociuhandu has quit IRC | 20:43 | |
efried | unfortunately it seems to have added new CLI/API fields -- that would be the "specify a secret to nova boot" route. I would hate to have to do that because it blows up the feature scope significantly. | 20:43 |
jroll | right | 20:43 |
efried | without that, we like add a new extra spec and parlay it into xml and we're done. | 20:44 |
efried | with it, we've got microversions and client/CLI enhancements and on and on. | 20:44 |
jroll | it looks similar to how TPM secret storage could work, in that nova-compute has to fetch a secret from barbican to boot the instance | 20:44 |
jroll | in the image validation case, the user provides the secret reference, because the user signs the image | 20:44 |
jroll | in the vTPM case, only nova needs to know the secret, not the user, so it can handle everything internally | 20:45 |
efried | But how does nova know the secret? | 20:45 |
efried | In a way that doesn't allow you to get at it if you steal the disk? | 20:45 |
jroll | nova could generate the secret for a TPM at instance create time, put it in barbican, and then pull it later as needed (migration, etc) | 20:45 |
*** kaisers_away is now known as kaisers_ | 20:46 | |
*** kaisers_ is now known as kaisers_away | 20:46 | |
*** slaweq has joined #openstack-nova | 20:46 | |
efried | Okay. I guess I should go figure out how barbican works. Hopefully VMG's homegrown key manager equivalent responds the same way as barbican/vault does? | 20:46 |
jroll | close enough | 20:47 |
efried | because penick said something about using that in place of barbican | 20:47 |
jroll | if we do this all in castellan, we can make a castellan backend for our thing | 20:47 |
efried | ...yeah | 20:47 |
jroll | seems like the right thing to do anyway | 20:47 |
*** ociuhandu has joined #openstack-nova | 20:47 | |
efried | right, iiuc nova requires whatever key manager to be brokered/brokerable through castellan. | 20:47 |
jroll | mhm | 20:48 |
*** kaisers_away is now known as kaisers_ | 20:49 | |
*** kaisers_ is now known as kaisers_away | 20:49 | |
efried | okay, so | 20:50 |
efried | - nova creates a new secret in $backend via castellan | 20:50 |
efried | - nova uses secret to virSecretSetValue to get $secret_uuid | 20:50 |
efried | - nova boots instance with <encryption secret='$secret_uuid'> | 20:50 |
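The three steps above can be sketched roughly as follows. This is a sketch under the assumptions made in this conversation, not nova's implementation: `InMemoryKeyManager` is a hypothetical stand-in for whatever backend castellan brokers (barbican or otherwise), `create_vtpm_secret` and `tpm_device_xml` are illustrative names, and the libvirt step (defining the secret and calling virSecretSetValue) is left as a comment rather than guessed at.

```python
import base64
import secrets
import uuid
from xml.etree import ElementTree as ET


class InMemoryKeyManager:
    """Hypothetical stand-in for a castellan-brokered key manager backend."""

    def __init__(self):
        self._store = {}

    def store(self, passphrase: bytes) -> str:
        # Return an opaque reference, as a real key manager would.
        secret_id = str(uuid.uuid4())
        self._store[secret_id] = passphrase
        return secret_id

    def get(self, secret_id: str) -> bytes:
        return self._store[secret_id]


def create_vtpm_secret(key_manager) -> str:
    """Step 1: generate a passphrase at instance create time and store it.

    Only the returned reference would be kept with the instance record.
    """
    passphrase = base64.b64encode(secrets.token_bytes(32))
    return key_manager.store(passphrase)


# Step 2 (not shown): on the compute host, fetch the passphrase by its
# reference, define a libvirt secret, and set its value with
# virSecretSetValue. That yields the libvirt secret UUID used below.


def tpm_device_xml(libvirt_secret_uuid: str) -> str:
    """Step 3: build the <tpm> element that references the libvirt secret."""
    tpm = ET.Element("tpm", model="tpm-tis")
    backend = ET.SubElement(tpm, "backend", type="emulator", version="2.0")
    ET.SubElement(backend, "encryption", secret=libvirt_secret_uuid)
    return ET.tostring(tpm, encoding="unicode")
```

Keeping only a reference in the instance record, and the passphrase itself in the key manager, is what would let the secret be re-fetched after a host reboot or migration without it ever living on the hypervisor's disk.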
efried | Now does the tpm show up to the VM user as virginal? | 20:50 |
jroll | efried: anything else I can help with? | 20:50 |
jroll | I assume just like a new TPM | 20:50 |
jroll | or hope? | 20:50 |
efried | And the VM user bootstraps the tpm with his own secret, which only he knows | 20:50 |
jroll | that is a question for a qemu or swtpm person, I guess | 20:50 |
jroll | right | 20:50 |
*** igordc has quit IRC | 20:50 | |
efried | so nova ("the nova user") has to be the root of trust | 20:51 |
efried | in this scenario | 20:51 |
*** kaisers_away is now known as kaisers_ | 20:51 | |
*** kaisers_ is now known as kaisers_away | 20:51 | |
*** abaindur has joined #openstack-nova | 20:51 | |
jroll | I guess by some definition, yes | 20:52 |
jroll | it always comes down to the "hardware" supplier :) | 20:52 |
efried | because at any time "the nova user" could go grab that secret, decrypt the vtpm, and get at the VM user's secret | 20:52 |
*** kaisers_away is now known as kaisers_ | 20:52 | |
*** abaindur has quit IRC | 20:52 | |
*** kaisers_ is now known as kaisers_away | 20:52 | |
jroll | right | 20:52 |
efried | okay, so "walk away with disk" is a requirement, but "compromised root" is not. | 20:52 |
jroll | assuming the nova user has the encrypted vtpm state | 20:52 |
efried | yeah, ability to read memory or whatever. | 20:52 |
*** abaindur has joined #openstack-nova | 20:53 | |
jroll | yeah, I think that's something we just have to live with, in an emulated world | 20:53 |
jroll | if you've root on the hypervisor you can just go ahead and read the VM's memory anyway, right | 20:53 |
efried | I guess? | 20:53 |
efried | Okay, thanks for the talk jroll. If our assumptions are correct, I think I can work with this. | 20:54 |
*** abaindur has quit IRC | 20:54 | |
jroll | efried: you're welcome, I hope they are correct :) | 20:54 |
efried | penick said he was going to get "security people" to vet things, but it would be a while before they were available. | 20:54 |
*** abaindur has joined #openstack-nova | 20:54 | |
jroll | ¯\_(ツ)_/¯ | 20:54 |
efried | I'll at least write up spec content under these assumptions, so that they can have something to point at and say it's wrong, rather than just a void. | 20:55 |
jroll | I forgot about this thing for the last couple months, until today | 20:55 |
openstackgerrit | Merged openstack/nova master: Switch to devstack-plugin-ceph-tempest-py3 for ceph https://review.opendev.org/691765 | 20:56 |
*** ociuhandu has quit IRC | 20:57 | |
*** ociuhandu has joined #openstack-nova | 21:01 | |
*** slaweq has quit IRC | 21:05 | |
*** ociuhandu has quit IRC | 21:06 | |
*** dklyle has quit IRC | 21:08 | |
*** dklyle has joined #openstack-nova | 21:14 | |
*** spatel has quit IRC | 21:21 | |
*** dklyle has quit IRC | 21:22 | |
*** dklyle has joined #openstack-nova | 21:23 | |
*** luksky has quit IRC | 21:25 | |
*** JamesBen_ has joined #openstack-nova | 21:26 | |
*** JamesBen_ has quit IRC | 21:27 | |
*** JamesBenson has quit IRC | 21:29 | |
openstackgerrit | Merged openstack/nova master: Avoid error 500 on shelve task_state race https://review.opendev.org/692206 | 21:34 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Convert legacy nova-live-migration and nova-multinode-grenade to py3 https://review.opendev.org/692374 | 21:38 |
*** dklyle has quit IRC | 21:38 | |
*** mriedem has quit IRC | 21:39 | |
*** dklyle has joined #openstack-nova | 21:44 | |
*** dklyle has quit IRC | 21:51 | |
*** dklyle has joined #openstack-nova | 21:52 | |
*** dklyle has quit IRC | 21:53 | |
*** dklyle has joined #openstack-nova | 21:53 | |
*** ociuhandu has joined #openstack-nova | 21:53 | |
*** ociuhandu has quit IRC | 22:04 | |
*** ociuhandu has joined #openstack-nova | 22:11 | |
*** ociuhandu has quit IRC | 22:19 | |
*** dklyle has quit IRC | 22:22 | |
*** jamesdenton has joined #openstack-nova | 22:35 | |
*** dklyle has joined #openstack-nova | 22:37 | |
*** gyee has quit IRC | 22:39 | |
*** pcaruana has quit IRC | 22:44 | |
openstackgerrit | Merged openstack/nova stable/train: Add regression test for bug 1824435 https://review.opendev.org/691402 | 22:50 |
openstack | bug 1824435 in OpenStack Compute (nova) train "fill_virtual_interface_list migration fails on second attempt" [Medium,In progress] https://launchpad.net/bugs/1824435 - Assigned to melanie witt (melwitt) | 22:50 |
openstackgerrit | Merged openstack/nova stable/queens: libvirt: Ignore volume exceptions during post_live_migration https://review.opendev.org/691284 | 22:50 |
*** macz has quit IRC | 23:02 | |
*** ociuhandu has joined #openstack-nova | 23:11 | |
*** dannins has quit IRC | 23:12 | |
*** ociuhandu has quit IRC | 23:16 | |
*** mdbooth has quit IRC | 23:22 | |
*** mdbooth has joined #openstack-nova | 23:23 | |
*** ociuhandu has joined #openstack-nova | 23:38 | |
*** ociuhandu has quit IRC | 23:46 | |
*** ociuhandu has joined #openstack-nova | 23:47 | |
openstackgerrit | Artom Lifshitz proposed openstack/nova stable/train: Avoid error 500 on shelve task_state race https://review.opendev.org/692628 | 23:50 |
*** ociuhandu has quit IRC | 23:54 |