*** ociuhandu has joined #openstack-nova | 00:10 | |
*** factor has joined #openstack-nova | 00:12 | |
*** ociuhandu has quit IRC | 00:15 | |
*** ganso has quit IRC | 00:23 | |
*** avolkov has quit IRC | 00:29 | |
*** gbarros has joined #openstack-nova | 00:30 | |
*** sapd1_x has quit IRC | 00:35 | |
*** macz has quit IRC | 00:39 | |
*** gbarros has quit IRC | 00:40 | |
*** gyee has quit IRC | 00:42 | |
*** gbarros has joined #openstack-nova | 00:46 | |
*** markvoelker has joined #openstack-nova | 00:46 | |
*** markvoelker has quit IRC | 00:48 | |
*** markvoelker has joined #openstack-nova | 00:49 | |
*** spatel has joined #openstack-nova | 00:49 | |
*** spatel has quit IRC | 00:53 | |
*** gbarros has quit IRC | 00:57 | |
*** markvoelker has quit IRC | 00:59 | |
*** markvoelker has joined #openstack-nova | 00:59 | |
*** markvoelker has quit IRC | 01:04 | |
*** nicolasbock has quit IRC | 01:04 | |
*** nicolasbock has joined #openstack-nova | 01:04 | |
*** markvoelker has joined #openstack-nova | 01:07 | |
*** markvoelker has quit IRC | 01:17 | |
*** markvoelker has joined #openstack-nova | 01:18 | |
*** markvoelker has quit IRC | 01:23 | |
*** mtanino has joined #openstack-nova | 01:25 | |
*** hongbin has joined #openstack-nova | 01:35 | |
*** awalende has joined #openstack-nova | 01:46 | |
*** markvoelker has joined #openstack-nova | 01:48 | |
*** awalende has quit IRC | 01:50 | |
openstackgerrit | Merged openstack/python-novaclient master: Microversion 2.79: Add delete_on_termination to volume-attach API https://review.opendev.org/673485 | 01:59 |
*** nicolasbock has quit IRC | 02:01 | |
*** spsurya has joined #openstack-nova | 02:17 | |
*** markvoelker has quit IRC | 02:20 | |
*** markvoelker has joined #openstack-nova | 02:22 | |
*** icarusfactor has joined #openstack-nova | 02:25 | |
openstackgerrit | Akira KAMIO proposed openstack/nova master: VMware: disk_io_limits settings are not reflected when resize https://review.opendev.org/680296 | 02:27 |
*** factor has quit IRC | 02:27 | |
*** ganso has joined #openstack-nova | 02:33 | |
*** larainema has joined #openstack-nova | 02:33 | |
*** macz has joined #openstack-nova | 02:36 | |
*** macz has quit IRC | 02:41 | |
*** BjoernT has joined #openstack-nova | 03:01 | |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: New objects for NUMA live migration https://review.opendev.org/634827 | 03:01 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: LM: Use Claims to update numa-related XML on the source https://review.opendev.org/635229 | 03:01 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: NUMA live migration support https://review.opendev.org/634606 | 03:01 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: Deprecate CONF.workarounds.enable_numa_live_migration https://review.opendev.org/640021 | 03:01 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: Functional tests for NUMA live migration https://review.opendev.org/672595 | 03:01 |
*** hamzy_ has quit IRC | 03:06 | |
*** BjoernT has quit IRC | 03:24 | |
*** igordc has quit IRC | 03:31 | |
*** dave-mccowan has quit IRC | 03:35 | |
*** ash2307 has joined #openstack-nova | 03:36 | |
*** Izza_ has joined #openstack-nova | 03:37 | |
Izza_ | hello... good day, i'm doing tempest testing on an openstack-helm environment but i encountered the error "Got Server Fault" | 03:38 |
Izza_ | tempest.lib.exceptions.ServerFault: Got server fault | 03:38 |
Izza_ | Failed 1 tests - output below: | 03:39 |
Izza_ | "/usr/lib/python2.7/site-packages/tempest/test.py", line 172, in setUpClass | 03:39 |
*** Izza_ has quit IRC | 03:39 | |
*** Izza_ has joined #openstack-nova | 03:41 | |
Izza_ | Failed 1 tests - output below: | 03:41 |
Izza_ | "/usr/lib/python2.7/site-packages/tempest/test.py", line 172, in setUpClass | 03:41 |
*** Izza_ has quit IRC | 03:41 | |
*** Izza_ has joined #openstack-nova | 03:43 | |
Izza_ | hi can u pls help me on my tempest testing, scenario: tempest.api.compute.servers.test_create_server.ServersTestJSON , error: tempest.lib.exceptions.ServerFault: Got server fault | 03:44 |
Izza_ | Details: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. | 03:44 |
*** hongbin has quit IRC | 03:46 | |
*** mkrai has joined #openstack-nova | 03:59 | |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: NUMA live migration support https://review.opendev.org/634606 | 03:59 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: Deprecate CONF.workarounds.enable_numa_live_migration https://review.opendev.org/640021 | 03:59 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: Functional tests for NUMA live migration https://review.opendev.org/672595 | 03:59 |
*** udesale has joined #openstack-nova | 04:07 | |
*** etp has joined #openstack-nova | 04:18 | |
*** mtanino has quit IRC | 04:19 | |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: ksa auth conf and client for Cyborg access https://review.opendev.org/631242 | 04:30 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Add Cyborg device profile groups to request spec. https://review.opendev.org/631243 | 04:30 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Create and bind Cyborg ARQs. https://review.opendev.org/631244 | 04:30 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Get resolved Cyborg ARQs and add PCI BDFs to VM's domain XML. https://review.opendev.org/631245 | 04:30 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Delete ARQs for an instance when the instance is deleted. https://review.opendev.org/673735 | 04:30 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: [WIP] add cyborg tempest job https://review.opendev.org/670999 | 04:30 |
*** ociuhandu has joined #openstack-nova | 04:30 | |
*** ociuhandu has quit IRC | 04:34 | |
*** Luzi has joined #openstack-nova | 04:36 | |
*** macz has joined #openstack-nova | 04:37 | |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Add Cyborg device profile groups to request spec. https://review.opendev.org/631243 | 04:39 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Create and bind Cyborg ARQs. https://review.opendev.org/631244 | 04:39 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Get resolved Cyborg ARQs and add PCI BDFs to VM's domain XML. https://review.opendev.org/631245 | 04:39 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: Delete ARQs for an instance when the instance is deleted. https://review.opendev.org/673735 | 04:39 |
openstackgerrit | Sundar Nadathur proposed openstack/nova master: [WIP] add cyborg tempest job https://review.opendev.org/670999 | 04:39 |
*** etp_ has joined #openstack-nova | 04:41 | |
*** macz has quit IRC | 04:41 | |
*** etp_ has quit IRC | 04:41 | |
*** etp_ has joined #openstack-nova | 04:42 | |
*** damien_r has joined #openstack-nova | 04:42 | |
*** etp_ has quit IRC | 04:43 | |
*** mkrai has quit IRC | 04:43 | |
*** igordc has joined #openstack-nova | 04:44 | |
*** etp_ has joined #openstack-nova | 04:44 | |
*** etp has quit IRC | 04:46 | |
*** etp has joined #openstack-nova | 04:46 | |
*** damien_r has quit IRC | 04:47 | |
*** etp_ has quit IRC | 04:47 | |
*** etp_ has joined #openstack-nova | 04:47 | |
*** markvoelker has quit IRC | 04:48 | |
*** etp_ has quit IRC | 04:48 | |
*** etp_ has joined #openstack-nova | 04:48 | |
*** etp has quit IRC | 04:50 | |
*** etp_ is now known as etp | 04:50 | |
*** etp_ has joined #openstack-nova | 04:50 | |
*** etp has quit IRC | 04:51 | |
*** ricolin has joined #openstack-nova | 05:00 | |
*** ratailor has joined #openstack-nova | 05:04 | |
*** etp_ has quit IRC | 05:06 | |
*** markvoelker has joined #openstack-nova | 05:26 | |
*** markvoelker has quit IRC | 05:30 | |
*** ralonsoh has joined #openstack-nova | 05:38 | |
*** pots has quit IRC | 05:42 | |
*** etp has joined #openstack-nova | 05:47 | |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Claim resources in resource tracker https://review.opendev.org/678452 | 06:13 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Enable driver discovering PMEM namespaces https://review.opendev.org/678453 | 06:13 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: report VPMEM resources by provider tree https://review.opendev.org/678454 | 06:13 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Support VM creation with vpmems and vpmems cleanup https://review.opendev.org/678455 | 06:13 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Parse vpmem related flavor extra spec https://review.opendev.org/678456 | 06:13 |
*** dtantsur|afk is now known as dtantsur | 06:15 | |
*** arshad777 has joined #openstack-nova | 06:20 | |
*** dklyle has quit IRC | 06:20 | |
*** dklyle has joined #openstack-nova | 06:21 | |
arshad777 | I have created a sync replicated volume with peer persistence enabled. Attached this volume | 06:24 |
*** sapd1_x has joined #openstack-nova | 06:25 | |
*** N3l1x has quit IRC | 06:27 | |
*** jawad_axd has joined #openstack-nova | 06:29 | |
*** slaweq has joined #openstack-nova | 06:41 | |
*** ileixe has joined #openstack-nova | 06:45 | |
*** luksky has joined #openstack-nova | 06:52 | |
*** ileixe has quit IRC | 06:52 | |
*** igordc has quit IRC | 07:00 | |
*** avolkov has joined #openstack-nova | 07:01 | |
*** tesseract has joined #openstack-nova | 07:05 | |
*** awalende has joined #openstack-nova | 07:08 | |
*** rcernin has quit IRC | 07:09 | |
*** damien_r has joined #openstack-nova | 07:10 | |
*** damien_r has quit IRC | 07:11 | |
*** damien_r has joined #openstack-nova | 07:11 | |
*** ociuhandu has joined #openstack-nova | 07:14 | |
*** maciejjozefczyk has joined #openstack-nova | 07:15 | |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Enable driver configuring PMEM namespaces https://review.opendev.org/679640 | 07:18 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Add functional tests for virtual persistent memory https://review.opendev.org/678470 | 07:18 |
*** threestrands has quit IRC | 07:20 | |
*** ociuhandu has quit IRC | 07:21 | |
*** rpittau|afk is now known as rpittau | 07:28 | |
*** cdent has joined #openstack-nova | 07:30 | |
*** tssurya has joined #openstack-nova | 07:31 | |
openstackgerrit | Takashi NATSUME proposed openstack/python-novaclient master: doc: Add support microversions for options https://review.opendev.org/681174 | 07:36 |
*** macz has joined #openstack-nova | 07:39 | |
*** macz has quit IRC | 07:43 | |
*** trident has quit IRC | 07:50 | |
*** lpetrut has joined #openstack-nova | 07:55 | |
*** jangutter has joined #openstack-nova | 07:58 | |
*** trident has joined #openstack-nova | 08:01 | |
*** priteau has joined #openstack-nova | 08:07 | |
*** sapd1_x has quit IRC | 08:18 | |
*** ociuhandu has joined #openstack-nova | 08:20 | |
bauzas | stephenfin: once you're there, see dansmith's comments | 08:21 |
bauzas | stephenfin: he wants to squash https://review.opendev.org/#/c/680983/ | 08:21 |
bauzas | stephenfin: so please try to provide the new revisions this morning | 08:21 |
*** tkajinam has quit IRC | 08:22 | |
*** panda|rover has quit IRC | 08:23 | |
*** panda has joined #openstack-nova | 08:24 | |
*** ociuhandu has quit IRC | 08:25 | |
*** ociuhandu has joined #openstack-nova | 08:26 | |
*** ociuhandu has quit IRC | 08:30 | |
*** ociuhandu has joined #openstack-nova | 08:30 | |
*** takashin has left #openstack-nova | 08:32 | |
*** derekh has joined #openstack-nova | 08:33 | |
*** ociuhandu has quit IRC | 08:40 | |
*** ociuhandu has joined #openstack-nova | 08:41 | |
*** ociuhandu has quit IRC | 08:44 | |
*** yaawang_ has joined #openstack-nova | 08:49 | |
*** yaawang has quit IRC | 08:49 | |
stephenfin | sure thing | 08:51 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Support reverting migration / resize with bandwidth https://review.opendev.org/676140 | 09:02 |
*** jaosorior has joined #openstack-nova | 09:11 | |
openstackgerrit | weibin proposed openstack/nova master: Add support for using ceph RBD ereasure code https://review.opendev.org/681188 | 09:11 |
*** shilpasd has joined #openstack-nova | 09:17 | |
*** tetsuro has joined #openstack-nova | 09:18 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Func test for migrate re-schedule with bandwidth https://review.opendev.org/676972 | 09:19 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Support migrating SRIOV port with bandwidth https://review.opendev.org/676980 | 09:21 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Allow migrating server with port resource request https://review.opendev.org/671497 | 09:23 |
stephenfin | bauzas: trivial patch with no conflicts here https://review.opendev.org/#/c/679339/ | 09:24 |
stephenfin | (if you'd be so kind) | 09:24 |
bauzas | reminder : I'm French, I'm never kind | 09:25 |
stephenfin | bauzas: I'm going to rebase the cpu-resources series on top of the SEV series to head off the incoming merge conflicts too | 09:25 |
stephenfin | while I rework it, that is | 09:25 |
bauzas | stephenfin: +W'd FWIW | 09:26 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Do not query allocations twice in finish_revert_resize https://review.opendev.org/678827 | 09:26 |
bauzas | stephenfin: cool, I'll look at gibi's series then | 09:26 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Allow resizing server with port resource request https://review.opendev.org/679019 | 09:28 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Extract pf$N literals as constants from func test https://review.opendev.org/680991 | 09:30 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Improve dest service level func tests https://review.opendev.org/680998 | 09:30 |
*** yaawang_ has quit IRC | 09:33 | |
*** yaawang has joined #openstack-nova | 09:33 | |
openstackgerrit | Brin Zhang proposed openstack/python-novaclient master: Microversion 2.80: Add user_id/project_id to migration-list API https://review.opendev.org/675023 | 09:34 |
*** boxiang has joined #openstack-nova | 09:35 | |
*** yaawang has quit IRC | 09:37 | |
*** yaawang has joined #openstack-nova | 09:37 | |
openstackgerrit | Brin Zhang proposed openstack/python-novaclient master: Microversion 2.80: Add user_id/project_id to migration-list API https://review.opendev.org/675023 | 09:38 |
*** rcernin has joined #openstack-nova | 09:38 | |
*** yaawang has quit IRC | 09:41 | |
*** aarents has quit IRC | 09:42 | |
*** yaawang has joined #openstack-nova | 09:43 | |
bauzas | gibi: I'm almost done with https://review.opendev.org/#/c/676140 but I'll need to do some checks later today | 09:48 |
* bauzas disappears for the gym | 09:48 | |
openstackgerrit | Brin Zhang proposed openstack/python-novaclient master: Microversion 2.80: Add user_id/project_id to migration-list API https://review.opendev.org/675023 | 09:57 |
*** dklyle has quit IRC | 10:03 | |
*** dklyle has joined #openstack-nova | 10:04 | |
*** udesale has quit IRC | 10:09 | |
aspiers | stephenfin: I'm here in case you have any last minute questions about https://review.opendev.org/#/c/644565/ | 10:09 |
*** udesale has joined #openstack-nova | 10:10 | |
*** markvoelker has joined #openstack-nova | 10:16 | |
*** tetsuro has quit IRC | 10:17 | |
*** markvoelker has quit IRC | 10:21 | |
gibi | bauzas: thanks | 10:21 |
*** macz has joined #openstack-nova | 10:25 | |
*** macz has quit IRC | 10:30 | |
openstackgerrit | Alexandra Settle proposed openstack/nova master: Fixing broken links https://review.opendev.org/681206 | 10:31 |
openstackgerrit | Alexandra Settle proposed openstack/nova master: Fixing broken links https://review.opendev.org/681206 | 10:32 |
*** jawad_axd has quit IRC | 10:40 | |
*** tbachman has quit IRC | 10:42 | |
*** markvoelker has joined #openstack-nova | 10:55 | |
aspiers | Is there a way to get a guest's libvirt XML via the nova api? I'm guessing no | 11:00 |
*** ociuhandu has joined #openstack-nova | 11:00 | |
*** markvoelker has quit IRC | 11:00 | |
aspiers | Would be nice if it was available from https://docs.openstack.org/api-ref/compute/?expanded=show-server-diagnostics-detail#show-server-diagnostics | 11:01 |
*** ociuhandu has quit IRC | 11:01 | |
aspiers | Without the XML I am struggling to see how Tempest can verify that an SEV guest was actually booted with SEV enabled | 11:01 |
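For reference, the closest the compute API gets today is the standardized server diagnostics call (microversion 2.48); a minimal sketch of calling it, assuming placeholder credentials, endpoint and server uuid. The response carries driver-agnostic state/cpu/nic/disk details, but not the domain XML:

```python
# a sketch: fetch the standardized (2.48+) server diagnostics via the compute
# API using keystoneauth directly; credentials and the server uuid are placeholders
from keystoneauth1 import adapter, identity, session

auth = identity.Password(auth_url='http://controller/identity/v3',
                         username='admin', password='secret', project_name='admin',
                         user_domain_id='default', project_domain_id='default')
compute = adapter.Adapter(session=session.Session(auth=auth),
                          service_type='compute')

server_id = 'replace-with-a-server-uuid'
diags = compute.get('/servers/%s/diagnostics' % server_id,
                    headers={'OpenStack-API-Version': 'compute 2.48'}).json()
print(diags)  # driver-agnostic state/uptime/cpu/nic/disk details -- no domain XML
```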
*** nicolasbock has joined #openstack-nova | 11:06 | |
gmann | aspiers: same case for NFV use case testing from artom L180 -https://etherpad.openstack.org/p/qa-train-ptg | 11:07 |
aspiers | hmm | 11:07 |
aspiers | gmann: which line? | 11:07 |
*** ociuhandu has joined #openstack-nova | 11:07 | |
gmann | in QA PTG, we accepted the idea of adding that tempest plugin into QA but it was action item for artom to propose that. | 11:08 |
gmann | L180 | 11:08 |
aspiers | thanks | 11:08 |
aspiers | oh nice | 11:08 |
aspiers | yeah "white box plugin" sounds like a good concept | 11:08 |
gmann | yeah and it can be expanded with more use cases which are out of scope for Tempest | 11:09 |
aspiers | right | 11:09 |
*** rouk has quit IRC | 11:09 | |
aspiers | gmann: although I wonder if adding libvirt XML to the show-server-diagnostics API call might be a simpler solution | 11:10 |
aspiers | showing the XML could be an admin-only thing | 11:11 |
gmann | but even admin-only, does it expose more info than nova should? (from a security point of view) | 11:12 |
gmann | I think it was discussed previously also but sean-k-mooney or artom might know more on that. | 11:13 |
*** udesale has quit IRC | 11:14 | |
*** larainema has quit IRC | 11:15 | |
artom | gmann, yeah, I still want to do that, I was meant to propose a spec for Train but only have a very WIP up | 11:16 |
*** dave-mccowan has joined #openstack-nova | 11:17 | |
artom | The code exists in RDO project's gerrit, and a Red Hat QE and myself wanted to clear outstanding reviews on there and merge its tests for NUMA live migration | 11:17 |
artom | But that didn't happen yet | 11:17 |
gmann | i see. | 11:17 |
sean-k-mooney | gmann: sorry i was not following chat | 11:18 |
sean-k-mooney | what were we talking about | 11:18 |
gmann | sean-k-mooney: question from aspiers on exposing the libvirt XML to the show-server-diagnostics API | 11:18 |
sean-k-mooney | aspiers: no there is not a way to get the xml from the api | 11:18 |
sean-k-mooney | and there never will be | 11:19 |
aspiers | that's a bold statement :) | 11:19 |
sean-k-mooney | it completely violates the cloud abstraction to expose that level of detail via the api | 11:19 |
sean-k-mooney | a non-admin is not even meant to know the hypervisor that is in use | 11:19 |
aspiers | sean-k-mooney: as admin-only diagnostics there is no violation | 11:20 |
*** jawad_axd has joined #openstack-nova | 11:20 | |
aspiers | sean-k-mooney: we're not talking about non-admins | 11:20 |
sean-k-mooney | as admin-only, technically they could just ssh into the host and look at the xml | 11:20 |
aspiers | sean-k-mooney: not from tempest they can't | 11:20 |
aspiers | plus that's a lot less convenient | 11:20 |
sean-k-mooney | tempest should not be asserting the behavior of the xml generation | 11:20 |
openstackgerrit | Brin Zhang proposed openstack/python-novaclient master: Microversion 2.80: Add user_id/project_id to migration-list API https://review.opendev.org/675023 | 11:20 |
aspiers | sean-k-mooney: please first read the use case above to understand the need ^^^ | 11:21 |
sean-k-mooney | that is what functional or white box testing is for | 11:21 |
artom | aspiers, yeah, that's quite explicitly out of scope for tempest | 11:21 |
sean-k-mooney | tempest is for blackbox testing | 11:21 |
artom | But... quite explicitly *in* scope for whitebox :) | 11:21 |
gmann | yeah | 11:21 |
artom | So now you've landed yourself on the list of people interested, and will be poked mercilessly once it's ready ;) | 11:21 |
aspiers | again, this is repeating discussion from a few minutes ago when we talked about the white box plugin | 11:21 |
aspiers | if there is a white box plugin for tempest, then that means white box testing *is* in scope for the tempest ecosystem, even if not the core | 11:22 |
sean-k-mooney | aspiers: so why cant you boot a vm and ssh in and detect that sev is configured from within the vm? | 11:22 |
sean-k-mooney | or at least available | 11:22 |
aspiers | sean-k-mooney: how would I detect that? | 11:22 |
sean-k-mooney | lscpu? | 11:22 |
sean-k-mooney | is there not an msr or cpu flag for sev | 11:22 |
aspiers | sean-k-mooney: have you tested that? | 11:22 |
sean-k-mooney | no i dont have sev hardware | 11:22 |
gmann | aspiers: within scope of QA ecosystem not Tempest ecosystem. Tempest is just a tool under QA :) | 11:23 |
artom | aspiers, yeah, we're not saying "don't do it", we're saying "don't propose patches for it to Tempest" | 11:23 |
aspiers | artom: OK, it sounded like the former before :) | 11:23 |
aspiers | gmann: the white box tempest plugin isn't in the tempest ecosystem? ;-) | 11:24 |
sean-k-mooney | aspiers: whitebox and the intel nfv test repo use tempest as a framework | 11:24 |
sean-k-mooney | to do this type of testing | 11:24 |
sean-k-mooney | but its not in tempest as its out of scope for tempest | 11:24 |
sean-k-mooney | but a whitebox style tempest plugin for sev would be fine | 11:24 |
sean-k-mooney | or add it to whitebox | 11:24 |
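The whitebox-style check being discussed boils down to something like the following sketch: ssh to the compute host and assert the SEV launchSecurity element is present in the guest's domain XML (host and domain names here are placeholders):

```python
# a sketch: ssh to the compute host, dump the guest's domain XML with virsh and
# assert the SEV launchSecurity element is present (host/domain names are placeholders)
import subprocess
import xml.etree.ElementTree as ET

xml = subprocess.check_output(
    ['ssh', 'root@compute-0', 'virsh', 'dumpxml', 'instance-00000001'])
root = ET.fromstring(xml)
assert root.find("./launchSecurity[@type='sev']") is not None
```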
artom | sean-k-mooney, unrelated, but we had some discussion around saving the new NUMA topology in https://review.opendev.org/#/c/634606/75/nova/compute/manager.py@7223 - and in func tests at least, that instance.refresh() isn't necessary (yes, the tests check the InstanceNUMATopology) | 11:25 |
gmann | aspiers: It will be the QA ecosystem. it can be done via a tempest plugin or a separate testing framework like extreme-testing (which never got progress), but it will be a separate project under QA with a separate team. | 11:25 |
artom | sean-k-mooney, I'll try to get to the office later this morning, to see what's up with my machine, would you have the bandwidth to play around with that in the meantime in your env? | 11:25 |
sean-k-mooney | artom: since this is apparently our highest priority i can make time | 11:26 |
sean-k-mooney | what exactly do you want me to test | 11:26 |
sean-k-mooney | remove the instance.refresh | 11:26 |
artom | sean-k-mooney, I did that already in the latest patchset | 11:26 |
artom | Making sure that the new instance NUMA topology is saved in the DB | 11:27 |
sean-k-mooney | and check both the db and virsh to confirm that the state is updated correctly? | 11:27 |
sean-k-mooney | ok | 11:27 |
sean-k-mooney | ya ill do that now | 11:27 |
artom | Thank you (for the ∞'s time) | 11:27 |
artom | :) | 11:27 |
aspiers | sean-k-mooney: BTW lscpu on the guest does not mention sev at all | 11:28 |
sean-k-mooney | as i said before i have exposed the servers im using for testing via port forwarding, so if you continue to have issues then you can ssh into them | 11:28 |
sean-k-mooney | aspiers: is there anything in dmidecode/dmesg to indicate sev | 11:29 |
sean-k-mooney | i thought the guest had to set bit 48 to 1 to enable the encryption | 11:29 |
sean-k-mooney | for pointers | 11:29 |
aspiers | sean-k-mooney: http://paste.openstack.org/show/774681/ | 11:30 |
aspiers | doesn't even look right | 11:30 |
artom | sean-k-mooney, IIRC when I tried connecting last time I couldn't - but yeah, what's the connection info again? | 11:31 |
sean-k-mooney | https://events.linuxfoundation.org/wp-content/uploads/2017/12/Extending-Secure-Encrypted-Virtualization-with-SEV-ES-Thomas-Lendacky-AMD.pdf looking at slide 12 we might be able to check it via the guest msr | 11:32 |
sean-k-mooney | artom: i have two routers, my isp one and my ubiquiti one. my isp router firewall was blocking it so i turned it off and it started working | 11:32 |
aspiers | sean-k-mooney: I will ask the experts | 11:33 |
artom | sean-k-mooney, sure, but I still don't have the IP/FQDN in my bash history for some reason | 11:34 |
sean-k-mooney | artom: ya i know im looking it up in mine/my router config | 11:35 |
*** priteau has quit IRC | 11:38 | |
*** mtreinish has joined #openstack-nova | 11:39 | |
kashyap | aspiers: Randomly chiming in, but there is an MSR for SEV: https://www.kernel.org/doc/html/latest/x86/amd-memory-encryption.html | 11:41 |
kashyap | aspiers: And it's reported via `cpuid` | 11:41 |
kashyap | "Support for SME and SEV can be determined through the CPUID instruction. The CPUID function 0x8000001f reports information related to SME:" | 11:41 |
kashyap | And: | 11:41 |
kashyap | "If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if SEV is active: [...]" | 11:42 |
aspiers | Yeah, I've already used that in the past | 11:42 |
kashyap | Ah, then disregard me. | 11:42 |
* kashyap goes back to fiddling with what he needs to fiddle with | 11:42 | |
aspiers | No, the reminder is appreciated | 11:42 |
kashyap | aspiers: Your goal is to check if the instance (in the upstream CI) has indeed booted with SEV, yeah? | 11:43 |
aspiers | right | 11:43 |
aspiers | I guess the JeOS image will need to include cpuid | 11:44 |
kashyap | Isn't the JeOS (Just Enough OS, I presume) CirrOS in this case? | 11:44 |
aspiers | It's whatever tempest is configured with | 11:44 |
kashyap | (Nod) | 11:45 |
sean-k-mooney | aspiers: has one of the upstream ci providers provided you with a label that will run on SEV hardware | 11:47 |
sean-k-mooney | aspiers: because 90% of the upstream ci clouds are probably running intel x86_64 | 11:47 |
sean-k-mooney | although rackspace did run power for a while | 11:47 |
aspiers | sean-k-mooney: SUSE has our own SEV boxes which can run 3rd party CI | 11:48 |
sean-k-mooney | ah so its for the suse third party ci not for upstream | 11:48 |
aspiers | well if upstream has SEV hardware then great, but I was not expecting that any time soon | 11:48 |
sean-k-mooney | aspiers: i dont think it currently does | 11:49 |
sean-k-mooney | or at least not in a way you can target | 11:49 |
sean-k-mooney | im sure some of the clouds probably have at least a small amd epyc inventory, if for nothing else but their own internal validation | 11:49 |
aspiers | yeah | 11:50 |
donnyd | sean-k-mooney: have the numa jobs been running? | 12:00 |
donnyd | I saw last night it was having some inbound ssh timeout issues. | 12:02 |
sean-k-mooney | i have not checked this morning but i think clarkb kicked off a job around 4/5 am so ill see if that passed | 12:04 |
sean-k-mooney | donnyd: so that one failed at 04:54 | 12:04 |
*** markvoelker has joined #openstack-nova | 12:04 | |
sean-k-mooney | donnyd: was that after your l3 agent restart | 12:05 |
donnyd | i didn't restart till well after you were talking about it on infra last night | 12:05 |
sean-k-mooney | donnyd: ya looking at the infra scroll back clarkb kicked that off after you did the restart | 12:06 |
openstackgerrit | Chris Dent proposed openstack/os-resource-classes master: Update api-ref link to canonical location https://review.opendev.org/681235 | 12:06 |
donnyd | oh yea i see | 12:06 |
donnyd | It really just makes no sense though | 12:07 |
donnyd | all the rest of the labels seem to work without issue | 12:07 |
donnyd | mostly without issue... not anymore than any other provider at least | 12:07 |
*** tbachman has joined #openstack-nova | 12:08 | |
openstackgerrit | Chris Dent proposed openstack/os-traits master: Update README to be a bit more clear https://review.opendev.org/681237 | 12:09 |
donnyd | So the old label we have setup for numa, does that still work? | 12:09 |
donnyd | I haven't set it back on my end | 12:10 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Shrink the race window in confirm resize func test https://review.opendev.org/681238 | 12:12 |
*** etp has quit IRC | 12:16 | |
kashyap | aspiers: Is there a way to detect from _inside_ the guest that it has indeed booted with SEV? | 12:25 |
kashyap | aspiers: (From the host, we can use the CPUID bit or other ways) | 12:25 |
*** macz has joined #openstack-nova | 12:26 | |
kashyap | aspiers: Meanwhile, I learnt that if SEV isn't provided by the host, an SEV-enabled kernel should fail to boot. | 12:26 |
openstackgerrit | Alexandra Settle proposed openstack/nova master: Fixing broken links https://review.opendev.org/681206 | 12:27 |
openstackgerrit | Chris Dent proposed openstack/os-traits master: Update README to be a bit more clear https://review.opendev.org/681237 | 12:29 |
gibi | efried, aspiers: Is there anything I can do regarding the SEV series? | 12:30 |
stephenfin | aspiers: Random question: why can we not add the iommu attribute for all those device types? | 12:30 |
stephenfin | gibi: I'm reviewing the last patch now. Think you've already hit it though | 12:30 |
stephenfin | by which I mean the first one | 12:30 |
stephenfin | since the other two are +2 | 12:30 |
stephenfin | +W | 12:30 |
gibi | stephenfin: cool. Yeah I tried to find somebody who can look at the first | 12:30 |
*** macz has quit IRC | 12:30 | |
gibi | but then it is in good hands now | 12:31 |
kashyap | aspiers: Also, when you're about, see my response to your comment: https://review.opendev.org/#/c/348394/10 | 12:33 |
kashyap | aspiers: If you can confirm that /usr/share/qemu/ovmf-x86_64-suse-code.bin is indeed the binary built with Secure Boot, then we can fix it | 12:34 |
luyao | Hi everyone, I have a question: when an instance is being rebuilt, how many allocations will it have? Both new and old? | 12:34 |
brinzhang | efried: Could you please review this patch https://review.opendev.org/#/c/681151/, it changes the novaclient version to 15.1.0, needed by https://review.opendev.org/#/c/673725/ | 12:43 |
*** derekh has quit IRC | 12:43 | |
*** derekh has joined #openstack-nova | 12:43 | |
stephenfin | aspiers: Is that follow-up for https://review.opendev.org/#/c/666616/ around yet? | 12:44 |
gibi | stephenfin: I did not find the followup either so I guess it isn't | 12:47 |
*** tbachman has quit IRC | 12:47 | |
*** mriedem has joined #openstack-nova | 12:55 | |
*** udesale has joined #openstack-nova | 12:57 | |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Apply SEV-specific guest config when SEV is required https://review.opendev.org/644565 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Reject live migration and suspend on SEV guests https://review.opendev.org/680158 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Enable booting of libvirt guests with AMD SEV memory encryption https://review.opendev.org/666616 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: Start reporting PCPU inventory to placement https://review.opendev.org/671793 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: '_get_(v|p)cpu_total' to '_get_(v|p)cpu_available' https://review.opendev.org/672693 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: objects: Add 'InstanceNUMATopology.cpu_pinning' property https://review.opendev.org/680106 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Validate CPU config options against running instances https://review.opendev.org/680107 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: trivial: Use sane indent https://review.opendev.org/680229 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: objects: Add 'NUMACell.pcpuset' field https://review.opendev.org/680108 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: hardware: Differentiate between shared and dedicated CPUs https://review.opendev.org/671800 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: Start reporting 'HW_CPU_HYPERTHREADING' trait https://review.opendev.org/675571 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Add support for translating CPU policy extra specs, image meta https://review.opendev.org/671801 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: fakelibvirt: Make 'Connection.getHostname' unique https://review.opendev.org/681060 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: Mock 'libvirt_utils.file_open' properly https://review.opendev.org/681061 | 12:58 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Add reshaper for PCPU https://review.opendev.org/674895 | 12:58 |
bauzas | gibi: added a question for you https://review.opendev.org/#/c/676140/ | 12:58 |
bauzas | hadù | 12:58 |
gibi | bauzas: looking | 12:59 |
*** hamzy_ has joined #openstack-nova | 12:59 | |
*** nweinber has joined #openstack-nova | 12:59 | |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Apply SEV-specific guest config when SEV is required https://review.opendev.org/644565 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Reject live migration and suspend on SEV guests https://review.opendev.org/680158 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Enable booting of libvirt guests with AMD SEV memory encryption https://review.opendev.org/666616 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: Start reporting PCPU inventory to placement https://review.opendev.org/671793 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: '_get_(v|p)cpu_total' to '_get_(v|p)cpu_available' https://review.opendev.org/672693 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: objects: Add 'InstanceNUMATopology.cpu_pinning' property https://review.opendev.org/680106 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Validate CPU config options against running instances https://review.opendev.org/680107 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: trivial: Use sane indent https://review.opendev.org/680229 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: objects: Add 'NUMACell.pcpuset' field https://review.opendev.org/680108 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: hardware: Differentiate between shared and dedicated CPUs https://review.opendev.org/671800 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: Start reporting 'HW_CPU_HYPERTHREADING' trait https://review.opendev.org/675571 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Add support for translating CPU policy extra specs, image meta https://review.opendev.org/671801 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: fakelibvirt: Make 'Connection.getHostname' unique https://review.opendev.org/681060 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: libvirt: Mock 'libvirt_utils.file_open' properly https://review.opendev.org/681061 | 13:00 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Add reshaper for PCPU https://review.opendev.org/674895 | 13:00 |
*** tbachman has joined #openstack-nova | 13:00 | |
*** BjoernT has joined #openstack-nova | 13:02 | |
*** macz has joined #openstack-nova | 13:03 | |
stephenfin | bauzas, alex_xu: I've merged in those follow-ups to https://review.opendev.org/#/c/671793/ and the next two patches, per dansmith's request, if you fancy re +2ing | 13:04 |
bauzas | ack | 13:05 |
gibi | bauzas: replied in https://review.opendev.org/#/c/676140 | 13:07 |
bauzas | gibi: cool thanks | 13:07 |
*** macz has quit IRC | 13:07 | |
luyao | efried: Hi efried, are you around? | 13:07 |
mriedem | gibi: do you have a bug reported for https://954d3ddb67e757934983-a9cc155153d08dd30dfffbbf1d71d234.ssl.cf5.rackcdn.com/676138/16/gate/nova-tox-functional-py36/fb5d235/testr_results.html.gz yet? | 13:08 |
gibi | mriedem: not yet. I have a patch that shrinks the window | 13:08 |
gibi | https://review.opendev.org/#/c/681238/ | 13:08 |
*** jdillaman has quit IRC | 13:09 | |
*** tbachman has quit IRC | 13:09 | |
*** nweinber_ has joined #openstack-nova | 13:09 | |
efried | luyao: o/ | 13:09 |
bauzas | gibi: mriedem: just an unrelated thought, should we somehow persist VGPU resource request to the RequestSpec requested_resources field ? | 13:10 |
luyao | efried: I would like another patch to refactor this later if necessary. The current approach works well and it will be fine for me to change it after it's merged. https://review.opendev.org/#/c/678452/22/nova/compute/resource_tracker.py@1168 | 13:10 |
gibi | requested_resources field is intentionally not persisted in the db | 13:10 |
gibi | bauzas: ^^ | 13:10 |
bauzas | k | 13:11 |
efried | bauzas: imo VGPU should take its lead from what's been done with vpmem | 13:11 |
*** arshad777 has quit IRC | 13:11 | |
efried | move vgpu into the resources field | 13:11 |
bauzas | for the moment, we don't really do anything, we just take the allocation that was done | 13:11 |
stephenfin | dansmith: When you're around, we're going to need to discuss https://review.opendev.org/#/c/671801/40/nova/conf/scheduler.py@208 so I can grasp where you're coming from | 13:11 |
*** nweinber has quit IRC | 13:11 | |
bauzas | efried: WDYM? sorry, don't have a lot of context on VPMEM apart from the spec | 13:11 |
*** eharney has joined #openstack-nova | 13:12 | |
efried | luyao: Yes, I completely overlooked the fact that the report client's provider tree doesn't have the resources in it. I agree we should keep it this way for now, but I think for the refactor we may want to reduce the data structure stored in the RT to just a dict, keyed by rp_uuid, of lists of resources. | 13:12 |
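A tiny sketch of the simplified structure described there, purely illustrative rather than the actual ResourceTracker attribute:

```python
# illustrative only: assigned resources keyed by the resource provider uuid they
# were claimed from, instead of nesting them per instance
import collections

assigned_resources = collections.defaultdict(list)
assigned_resources['rp-uuid-1'].extend(['pmem_ns_0', 'pmem_ns_1'])
assigned_resources['rp-uuid-2'].append('pmem_ns_2')
print(dict(assigned_resources))
```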
*** gbarros has joined #openstack-nova | 13:12 | |
efried | bauzas: Unless you were thinking to do this for Train, let's talk about it after FF. | 13:13 |
bauzas | efried: of course not, just considering the next steps | 13:13 |
efried | stephenfin: Are you going to have time today to review the vpmem series? | 13:13 |
luyao | efried: Okey, we can discuss the details in the future. :) | 13:13 |
bauzas | since we plan to do VGPU affinity | 13:13 |
stephenfin | efried: I am, and I'm going to hold off on the follow-up requested for https://review.opendev.org/#/c/674895/ because reviews seem more important at the moment, right? | 13:14 |
efried | brinzhang: looks like you already got your novaclient release merging, yah? | 13:14 |
efried | stephenfin: totally | 13:15 |
openstackgerrit | Merged openstack/nova master: Indent fake libvirt host capabilities fixtures more nicely https://review.opendev.org/679339 | 13:16 |
gibi | mriedem: filed a bug https://bugs.launchpad.net/nova/+bug/1843433 | 13:16 |
openstack | Launchpad bug 1843433 in OpenStack Compute (nova) "functional test test_migrate_server_with_qos_port fails intermittently due to race condition" [Undecided,New] | 13:16 |
bauzas | gibi: you could triage it as critical since it impacts the gate IMHO | 13:17 |
bauzas | gibi: any ideas of the failure rate? | 13:17 |
gibi | bauzas: does it block the gate? | 13:17 |
mriedem | http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22in%20test_migrate_server_with_qos_port%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d | 13:17 |
mriedem | it's not critical | 13:17 |
mriedem | yes it hits the gate but doesn't block it | 13:17 |
bauzas | mriedem: okay, thanks, I was about to ask logstash | 13:18 |
mriedem | gibi: ack, i left a few comments in the patch to close the race | 13:18 |
gibi | mriedem: thanks. I will respin soon | 13:18 |
bauzas | gibi: what's the patch up for this ? | 13:18 |
gibi | bauzas: https://review.opendev.org/#/c/681238/1 | 13:18 |
bauzas | gibi: tbh, I'm a bit afraid it's at the top of the series | 13:19 |
bauzas | if you need to respin, I'd appreciate you move it down | 13:19 |
luyao | efried: Could you help confirm whether a rebuilt instance will have two groups of allocations? I know a migrating instance will change one allocation consumer to the migration uuid. https://review.opendev.org/#/c/678452/23/nova/compute/resource_tracker.py@406 | 13:19 |
gibi | bauzas: I think it is top of master | 13:19 |
bauzas | yup | 13:19 |
gibi | bauzas: so it is not connected to the series | 13:19 |
gibi | bauzas: just by topic | 13:19 |
mriedem | right, it's a race in an already merged functional test, | 13:20 |
mriedem | so let's just fix the race and merge it | 13:20 |
bauzas | gibi: my bad, you're right | 13:20 |
gibi | mriedem: good suggestion about instance event, that will fix the race | 13:20 |
mriedem | \o/ | 13:20 |
gibi | I was afraid we'd need to poll the migration allocation in placement to close the race | 13:20 |
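For comparison, the polling fallback would look roughly like this generic helper (illustrative only, not one of nova's existing test fixtures; the predicate in the usage comment is hypothetical):

```python
# a generic polling helper of the kind a functional test could use to close a
# race window; not one of nova's fixtures
import time

def wait_for(predicate, timeout=10.0, interval=0.2):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError('condition not met within %.1f seconds' % timeout)

# usage (hypothetical check): wait_for(lambda: migration_allocation_exists(migration_uuid))
```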
gibi | mriedem: I also fixed the bug grenade found in revert | 13:21 |
gibi | mriedem: unfortunately we still don't have a test run on that patch | 13:21 |
aspiers | kashyap: oh, I thought you were talking about cpuid inside the guest | 13:21 |
mriedem | gibi: yeah i saw you updated but i haven't looked at your replies or changes, | 13:22 |
aspiers | kashyap: I think there are multiple binaries with SB | 13:22 |
mriedem | it kind of sucks that we don't store the information needed for revert in the migration context | 13:22 |
gibi | mriedem: I have better plans, described in my reply | 13:22 |
*** Sundar has joined #openstack-nova | 13:22 | |
gibi | mriedem: we can let neutron remember the mapping | 13:22 |
kashyap | aspiers: Sorry, I was not clear. I don't _know_ if it is also reported in the guest `cpuid` -- maybe you can tell from your hardware? | 13:23 |
gibi | mriedem: by using multiple portbinding | 13:23 |
aspiers | kashyap: I don't have a guest with cpuid installed currently | 13:23 |
aspiers | kashyap: I think I misremembered - before I probably ran it on the host | 13:23 |
kashyap | aspiers: For SB -- IMHO, you don't _need_ all those 4M, 2M variants -- it's beyond overkill. Your life, and the admin's life, will be far simpler if you use the "two pairs" approach I noted in the comment | 13:24 |
aspiers | kashyap: You're preaching to the choir. You really need to file a bug report on bugzilla.opensuse.org | 13:24 |
mriedem | gibi: ah yeah | 13:24 |
sean-k-mooney | artom: i have confirmed that the migration context correctly contains the new numa topology blob but it does not get saved to the db | 13:24 |
*** ratailor has quit IRC | 13:25 | |
aspiers | kashyap: or reach the right team in some other way | 13:25 |
kashyap | aspiers: Aaah, sorry; didn't realize we're on the same line, same word | 13:25 |
openstackgerrit | Adam Spiers proposed openstack/nova master: Improve SEV documentation and other minor tweaks https://review.opendev.org/681254 | 13:25 |
aspiers | efried, gibi, stephenfin: there's the follow-up ^^^ | 13:25 |
sean-k-mooney | artom: so doing apply migration context followed by drop to delete it from the db does not work | 13:25 |
sean-k-mooney | artom: i would assume that for some reason the field is not flagged as dirty and is not being saved | 13:26 |
gibi | aspiers: ack! | 13:26 |
*** BjoernT_ has joined #openstack-nova | 13:27 | |
Sundar | Hi sean-k-mooney, how are you doing? | 13:27 |
sean-k-mooney | Sundar: hi | 13:28 |
efried | luyao: I'm trying to make sense of those scenarios right now. | 13:28 |
Sundar | sean-k-mooney: The Cyborg patches for nova-integ have merged; https://review.opendev.org/#/q/project:openstack/cyborg+branch:master+topic:nova-integ+owner:Sundar | 13:28 |
sean-k-mooney | Sundar: ok so its just the nova code that is pending | 13:28 |
Sundar | There is one related patch for Nova notification https://review.opendev.org/674520 that is close | 13:29 |
*** BjoernT has quit IRC | 13:29 | |
aspiers | kashyap: Where did you hear that an SEV-enabled kernel would fail to boot if SEV is not provided? I have empirically proven that false by booting the same image both with and without SEV | 13:29 |
sean-k-mooney | Sundar: ok yes without that nova will timeout waiting and rollback the spawn | 13:29 |
mriedem | Sundar: no testing at all for those cyborg patches? | 13:30 |
Sundar | Yes. BTW, I updated the Nova patches too. Notification works with 674520. If you look at the seeming UT failures, they are mostly unrelated to Cyborg | 13:30 |
kashyap | aspiers: DanPB; but to be fair to him, he used a qualifier "IIUC". If you've tested it, I'll go with your data | 13:30 |
aspiers | kashyap: gotcha | 13:30 |
Sundar | mriedem: What do you mean by 'no testing'? | 13:30 |
sean-k-mooney | Sundar: have we had an end-to-end run of the tempest job | 13:30 |
mriedem | Sundar: there are no tests associated with that patch | 13:31 |
Sundar | The tempest CI is working with the patches, and we are hoping to merge it by this week. | 13:31 |
efried | mriedem: do we use "migration instance" for evac? | 13:31 |
mriedem | efried: no | 13:31 |
efried | so what happens to allocations? | 13:31 |
mriedem | efried: you mean migratoin-based allocations? | 13:31 |
efried | yeah | 13:31 |
mriedem | efried: they sit on the source | 13:31 |
efried | thanks | 13:31 |
mriedem | which is why we can't delete resource providers when deleting a compute service that has evacuated instance allocations against it | 13:31 |
mriedem | that whole thread i had in the ML | 13:32 |
*** Luzi has quit IRC | 13:32 | |
mriedem | efried: related to this https://review.opendev.org/#/c/678100/ | 13:32 |
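As an aside, checking which providers still hold allocations for a given consumer (instance or migration uuid) is a single placement call; a hedged sketch with keystoneauth, using placeholder credentials and uuids:

```python
# a sketch: list which resource providers still hold allocations for a given
# consumer (instance or migration uuid); credentials and uuid are placeholders
from keystoneauth1 import adapter, identity, session

auth = identity.Password(auth_url='http://controller/identity/v3',
                         username='admin', password='secret', project_name='admin',
                         user_domain_id='default', project_domain_id='default')
placement = adapter.Adapter(session=session.Session(auth=auth),
                            service_type='placement')

consumer_uuid = 'replace-with-an-instance-or-migration-uuid'
resp = placement.get('/allocations/%s' % consumer_uuid,
                     headers={'OpenStack-API-Version': 'placement 1.28'})
for rp_uuid, alloc in resp.json()['allocations'].items():
    print(rp_uuid, alloc['resources'])
```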
sean-k-mooney | mriedem: that has been changed recently right. we delete the allocations if they exist, if and only if you bring back up the compute agent on the failed host | 13:32 |
sean-k-mooney | so that only helps if you repair whatever the issue was | 13:32 |
mriedem | sean-k-mooney: yes the source allocations are deleted if you bring up the evacuated-from compute service | 13:33 |
sean-k-mooney | if you dont you get into the situation in that ml thread | 13:33 |
mriedem | i think it's probably not uncommon to have a host failure, down the compute service, evacuate it, and then try to delete the compute service before redeploying on that host | 13:33 |
mriedem | Sundar: tempest only covers happy path testing for the most part; unit tests are good for testing error conditions and such - exceptional cases | 13:34 |
mriedem | anyway, that's up to the cyborg core team for how they want to enforce testing standards | 13:34 |
efried | mriedem: also rebuild? | 13:35 |
mriedem | efried: a rebuild isn't a migration | 13:35 |
mriedem | and you can't rebuild on a down host | 13:35 |
*** bbowen_ has joined #openstack-nova | 13:35 | |
mriedem | so the host stays the same and the flavor stays the same, but the image might change on a rebuild | 13:35 |
sean-k-mooney | Sundar: have the dependencies been fixed in https://review.opendev.org/#/c/670999/ | 13:36 |
efried | mriedem: right, separate topic, I'm saying: does rebuild also end up with multiple sets of allocations for the same instance uuid? | 13:36 |
sean-k-mooney | Sundar: you rebased it but did not run the new job via "check experimental" | 13:36 |
sean-k-mooney | so we still dont have a run that worked | 13:36 |
sean-k-mooney | the previous run failed https://c3308e17743765936b80-6c7fec3fffbf24afb7394804bcdecfae.ssl.cf2.rackcdn.com/670999/7/experimental/cyborg-tempest/2fe52ec/testr_results.html.gz | 13:37 |
mriedem | efried: no | 13:37 |
sean-k-mooney | that said we knew the dependencies were wrong so we expected that | 13:37 |
*** bbowen__ has joined #openstack-nova | 13:37 | |
*** bbowen has quit IRC | 13:37 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Fixing broken links https://review.opendev.org/681206 | 13:38 |
luyao | mriedem: what about in the process of rebuild | 13:39 |
*** bbowen_ has quit IRC | 13:39 | |
luyao | mriedem: rebuild is not done, how many allocations will an instance have? | 13:40 |
efried | we destroy the instance but keep the allocations, then respawn it with the existing allocations? | 13:40 |
efried | ...and maybe a different image? | 13:40 |
sean-k-mooney | Sundar: it looks like there are 3 patches still remaining on the cyborg side. 2 for nova integration and 1 for python3 support | 13:40 |
mriedem | luyao: i don't understand your question | 13:41 |
mriedem | efried: correct | 13:41 |
efried | mriedem: luyao and I are asking about the same thing. Thanks for the help. | 13:41 |
mriedem | rebuild is basically re-spawn in place on the same host with maybe a different image but the same ports/volumes/flavor - if you're on shared storage you keep your root disk, if not your root disk is rebuilt from the specified image | 13:42 |
Sundar | sean-k-mooney: I'll rerun with 'check experimental'. Re. dependencies for https://review.opendev.org/#/c/670999/, one has already merged, and the other is the tempest code itself, which is working with the patches and should merge soon. There should be only 1 for nova-integ (i.e. Nova notification) -- I fixed the topic now. | 13:42 |
sean-k-mooney | Sundar: wait a minute | 13:42 |
*** eharney has quit IRC | 13:42 | |
mriedem | though i might be thinking of evacuate for that root disk comment | 13:42 |
sean-k-mooney | im going to fix it to not waste gate resources and fix the dependencies | 13:42 |
mriedem | anyway, root disk doesn't matter for what you're asking | 13:43 |
mriedem | sean-k-mooney: cyborg integration isn't happening in nova in train - where are we with test runs on the numa live migration series? | 13:43 |
efried | stephenfin, dansmith: btw, vpmem now has a CI passing. I haven't opened one up yet, but it's taking half an hour, so it's doing... *something* :P | 13:43 |
luyao | mriedem: I saw an instance have new and old allocations in a functional test when it's evacuating. | 13:43 |
sean-k-mooney | mriedem: i confirmed that it is not saving the updated numa topology so it looks like we need instance.refresh | 13:43 |
*** jawad_axd has quit IRC | 13:43 | |
sean-k-mooney | mriedem: the migration context has the correct numa topology | 13:44 |
luyao | mriedem: I thought rebuild is similar to evacuate | 13:44 |
sean-k-mooney | mriedem: so it looks like it's not marking the field as changed for some reason | 13:44 |
sean-k-mooney | mriedem: perhaps because apply migration context uses setattr | 13:44 |
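For background on "flagged as dirty": a minimal, standalone illustration of the oslo.versionedobjects change tracking that save() relies on, using a fake object rather than nova's Instance (note that setting a field through the normal attribute path does mark it as changed):

```python
# standalone illustration of oslo.versionedobjects change tracking; FakeInstance
# is a made-up object, not nova's Instance
from oslo_versionedobjects import base, fields

@base.VersionedObjectRegistry.register
class FakeInstance(base.VersionedObject):
    fields = {'numa_topology': fields.StringField(nullable=True)}

inst = FakeInstance(numa_topology='old-blob')
inst.obj_reset_changes()                    # pretend it was freshly loaded from the DB
setattr(inst, 'numa_topology', 'new-blob')  # what apply_migration_context effectively does
print(inst.obj_what_changed())              # {'numa_topology'} -> save() would persist it
```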
mriedem | luyao: evacuate is rebuild to another host when the source host is down | 13:44 |
sean-k-mooney | mriedem: i -1'd the patch | 13:44 |
*** rcernin has quit IRC | 13:45 | |
mriedem | evacuate and rebuild use the same code flows in conductor and compute services with conditionals for any differences | 13:45 |
efried | luyao, mriedem: I find all of it so confusing that I don't even try to remember, because that just makes things worse. It goes against the very fiber of my being, but this is one where I'll ask. Every. Time. | 13:45 |
*** jawad_axd has joined #openstack-nova | 13:45 | |
mriedem | e.g. evacuate does a claim on the dest host, rebuild does not | 13:45 |
*** jawad_axd has quit IRC | 13:45 | |
mriedem | sean-k-mooney: ok i didn't realize artom removed the instance.refresh in post live migration | 13:45 |
mriedem | but haven't looked at changes since yesterday | 13:46 |
mriedem | efried: rebuild+evacuate and resize+cold-migrate are confusing in that they share code but don't do the exact same things | 13:46 |
mriedem | but are like 90% the same | 13:46 |
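To make the difference concrete, a hedged sketch of invoking the two operations through python-novaclient (credentials, uuids and host names are placeholders; microversion 2.14 is used so evacuate does not need the on_shared_storage flag):

```python
# a sketch of the two operations via python-novaclient; credentials, image uuid
# and destination host are placeholders
from keystoneauth1 import identity, session
from novaclient import client

auth = identity.Password(auth_url='http://controller/identity/v3',
                         username='admin', password='secret', project_name='admin',
                         user_domain_id='default', project_domain_id='default')
nova = client.Client('2.14', session=session.Session(auth=auth))
server = nova.servers.list()[0]

# rebuild: same host, same flavor, possibly a different image
nova.servers.rebuild(server, image='replace-with-an-image-uuid')

# evacuate: re-spawn on another host; only valid when the source host is down
nova.servers.evacuate(server, host='replace-with-a-destination-host')
```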
mriedem | efried: do we need a "what's the difference between evacuate and rebuild" thing in the contributor docs? | 13:47 |
mriedem | b/c that's probably pretty easy to write up | 13:47 |
efried | that must be why luyao was conflating rebuild+evacuate wrt allocations. That must fall in the 10% that's different. | 13:47 |
mriedem | part of it yeah | 13:48 |
efried | mriedem: I'll admit I didn't even go looking for a doc on this. | 13:48 |
mriedem | there likely isn't one | 13:48 |
mriedem | but this isn't the first time this has come up, | 13:48 |
efried | There was that one blog post | 13:48 |
efried | "one" | 13:48 |
mriedem | so instead of explaining it each time, one could just link to a doc | 13:48 |
jangutter | efried: http://www.danplanet.com/blog/2016/03/03/evacuate-in-nova-one-command-to-confuse-us-all/ this one? | 13:48 |
mriedem | not really the same | 13:48 |
efried | that looks familiar jangutter | 13:49 |
*** jawad_axd has joined #openstack-nova | 13:49 | |
mriedem | that's about nova client commands that live migrate all instances off a host | 13:49 |
openstackgerrit | Adam Spiers proposed openstack/nova master: Improve SEV documentation and other minor tweaks https://review.opendev.org/681254 | 13:49 |
luyao | efried: yes, actually I found an evacuating instance has two groups of allocations, and I thought rebuild is the same. | 13:49 |
* alex_xu tried to understand luyao's question | 13:49 | |
mriedem | "After a compute host has failed, rebuild my instance from the original image in another place, keeping my name, uuid, network addresses, and any other allocated resources that I had before." | 13:49 |
efried | anyway, I envision some kind of table with all these related operations on one axis, and things like "same/different destination host", "what happens to allocations", "instance/migration UUIDs", etc on the other. | 13:50 |
mriedem | that's a much bigger doc if you're trying to lump in all move operations, | 13:51 |
openstackgerrit | sean mooney proposed openstack/nova master: [WIP] add cyborg tempest job https://review.opendev.org/670999 | 13:51 |
mriedem | because you'd have to consider resize on the same host, which is complicated in different ways | 13:51 |
efried | mriedem: is there a todo list on which that ^ could be registered so it is not forgotten, but not started until after FF? We need you for more urgent things this week. | 13:51 |
sean-k-mooney | Sundar: that ^ should test with the correct deps. im going back to testing artoms code | 13:51 |
mriedem | efried: i'm not signing up to write a doc with all of the axis of evil, | 13:52 |
mriedem | writing up something quick and easy about "what's the difference between rebuild and evacuate" is pretty easy for me to crank out | 13:52 |
efried | even so | 13:52 |
mriedem | a bug is good enough for that | 13:52 |
mriedem | docs bug, link to the irc question that started tihs | 13:52 |
efried | ack | 13:52 |
*** jawad_axd has quit IRC | 13:53 | |
sean-k-mooney | mriedem: im going to add back in the instance.refresh locally and add some logging to see what the numa topology blob looks like before and after. | 13:53 |
*** med_ has joined #openstack-nova | 13:53 | |
sean-k-mooney | mriedem: artom is in dad taxi mode currently but he should be back soon ish | 13:53 |
*** nweinber__ has joined #openstack-nova | 13:53 | |
sean-k-mooney | although looking at that function im not clear why that helps | 13:55 |
Sundar | Thanks, sean-k-mooney | 13:55 |
*** eharney has joined #openstack-nova | 13:55 | |
*** mlavalle has joined #openstack-nova | 13:56 | |
efried | https://bugs.launchpad.net/nova/+bug/1843439 | 13:56 |
openstack | Launchpad bug 1843439 in OpenStack Compute (nova) "doc: Rebuild vs. Evacuate (and other move-ish ops)" [Low,Confirmed] - Assigned to Matt Riedemann (mriedem) | 13:56 |
*** nweinber_ has quit IRC | 13:56 | |
efried | mriedem: there's a 'doc' tag and a 'docs' tag. Which one is real? | 13:56 |
efried | (I used both) | 13:56 |
*** ociuhandu has quit IRC | 13:57 | |
*** ociuhandu has joined #openstack-nova | 13:58 | |
*** tbachman has joined #openstack-nova | 13:58 | |
bauzas | efried: IIRC, it was 'docs' 3 years ago | 13:59 |
bauzas | oops, I meant 'doc' | 14:00 |
bauzas | holy shit | 14:00 |
sean-k-mooney | its doc | 14:00 |
*** BjoernT has joined #openstack-nova | 14:00 | |
sean-k-mooney | https://wiki.openstack.org/wiki/Nova/BugTriage#Tag_Owner_List | 14:00 |
bauzas | efried: yeah, I'm correct : https://bugs.launchpad.net/nova/+bugs?field.tag=doc vs. https://bugs.launchpad.net/nova/+bugs?field.tag=docs | 14:00 |
efried | bauzas: are tags autovivified or do they need to be explicitly created somewhere? | 14:01 |
*** Izza_ has quit IRC | 14:01 | |
bauzas | efried: nope, you can add any tag | 14:01 |
efried | Like, could we go through the latter four bugs and remove 'docs' and that tag would disappear? | 14:01 |
mriedem | there is an official list | 14:01 |
bauzas | efried: I'm just updating the 4 'docs' | 14:01 |
mriedem | anyone can add any tag, | 14:01 |
sean-k-mooney | efried: if you go to the wiki i linked | 14:01 |
*** BjoernT_ has quit IRC | 14:01 | |
bauzas | s/docs/doc | 14:01 |
mriedem | but the list of official tags that auto-complete is curated | 14:01 |
sean-k-mooney | there is a link to manage the official tags | 14:01 |
efried | cool | 14:01 |
efried | mriedem: oh, in that case both 'docs' and 'doc' must be in there | 14:02 |
sean-k-mooney | https://bugs.launchpad.net/nova/+manage-official-tags | 14:02 |
efried | because both autocompleted for me | 14:02 |
mriedem | https://bugs.launchpad.net/nova/+manage-official-tags | 14:02 |
sean-k-mooney | no idea how to use that however | 14:02 |
*** tkajinam has joined #openstack-nova | 14:02 | |
mriedem | you move things in or out of the 'official tags' column | 14:02 |
bauzas | efried: mriedem: I removed usage for 'docs' https://bugs.launchpad.net/nova/+bugs?field.tag=docs | 14:02 |
efried | okay, I'm removing 'docs' from official... | 14:03 |
bauzas | and yeah, I just removed 'docs' from the official list | 14:03 |
bauzas | heh, jinxed | 14:03 |
*** ociuhandu has quit IRC | 14:03 | |
efried | sorted | 14:03 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: objects: Remove custom comparison methods https://review.opendev.org/472285 | 14:03 |
mriedem | i just added "documentation" | 14:03 |
efried | you're a bastard | 14:03 |
mriedem | and "docify" | 14:03 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: objects: Remove custom comparison methods https://review.opendev.org/472285 | 14:04 |
efried | gibi: Did I see something about a grenade failure - should I wait to recheck stuff? | 14:04 |
bauzas | efried: mriedem: either way, it's been a long time since I triaged a single bug in Launchpad, floor is yours, folks | 14:04 |
gibi | efried: don't have to wait | 14:04 |
efried | thanks | 14:04 |
gibi | efried: it was a failure in an open review | 14:04 |
efried | o good :) | 14:04 |
gibi | efried: which I respinned with a fix | 14:04 |
efried | I'm going to need to bring in a ringer for this vpmem patch https://review.opendev.org/#/c/678455/ | 14:05 |
efried | I can handle the rest, but that one is beyond me. | 14:05 |
efried | stephenfin is already lined up for it. alex_xu is a co-author. | 14:05 |
efried | sean-k-mooney, kashyap, aspiers, artom: I would proxy a +2 from one of you if you're able --^ | 14:06 |
alex_xu | also the vpmem CI is pretty close https://review.opendev.org/#/c/679640/14 \o/ | 14:06 |
kashyap | efried: vPMEM? | 14:06 |
* kashyap clicks | 14:07 | |
efried | kashyap: Yeah, that patch is just "do shit efried doesn't understand to the libvirt xml" | 14:07 |
efried | so I don't think you need to understand vpmem | 14:07 |
sean-k-mooney | efried: the pmem series. I can take a look if kashyap can't | 14:07 |
kashyap | efried: It's been forever on my to-read list, but I was pulled into urgent downstream stuff last/this week | 14:07 |
bauzas | efried: once I'm done with gibi's series and stephenfin's cpu-resources, I could give some look at https://review.opendev.org/#/q/topic:bp/virtual-persistent-memory | 14:07 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Fixing broken links https://review.opendev.org/681206 | 14:08 |
efried | Thanks folks. bauzas kashyap sean-k-mooney in the interest of focusing resources, if it's possible to start with https://review.opendev.org/#/c/678455/ rather than going through the whole series, that's the one where we have a known reviewer gap. | 14:09 |
alex_xu | bauzas: kashyap sean-k-mooney thanks also | 14:09 |
bauzas | efried: honestly, reviewing the series needs me looking from the start, but I can surely just look at the patches without really commenting them | 14:09 |
sean-k-mooney | yes, if that is the one that needs to be reviewed I can start there in about 5 mins | 14:10 |
kashyap | efried: Nod; that's what I clicked on | 14:10 |
gibi | stephenfin: a question about the cpu series. vcpu_pin_set can be an empty string and nova will use every host CPU for the vCPUs. Is this behaviour preserved for cpu_shared_set and cpu_dedicated_set? | 14:10 |
*** brinzhang has quit IRC | 14:11 | |
efried | gibi: iiuc no, you'll get an error if you try that. | 14:11 |
bauzas | gibi: good question, we haven't agreed on that in the spec | 14:12 |
*** brinzhang has joined #openstack-nova | 14:12 | |
gibi | efried: so I can assume that when vcpu_pin_set is removed there will be no way to implicitly offer every host CPU as VCPU, and I will have to explicitly list them in one of the new configs? | 14:13 |
stephenfin | gibi: It's preserved, yes | 14:13 |
bauzas | stephenfin: only if you don't use the new options, right? | 14:14 |
bauzas | stephenfin: because if you start messing with your options, then something will come up :) | 14:14 |
stephenfin | any(vcpu_pin_set, compute.cpu_shared_set, compute.cpu_dedicated_set) == False --> allocate all host CPUs for PCPU inventory | 14:14 |
stephenfin | bauzas: correct | 14:14 |
sean-k-mooney | what's the status on the SEV series by the way? stephen, did you get a chance to look at the bottom patch? I think that was all that was left to be +W'd and the rest were ready to go? | 14:14 |
efried | Right, I guess we don't have to make the decision about what gibi is asking until a future release. | 14:15 |
efried | sean-k-mooney: it's gateward | 14:15 |
stephenfin | sean-k-mooney: It's gone through | 14:15 |
stephenfin | going, rather | 14:15 |
sean-k-mooney | ok, so I can unload the SEV context from my brain | 14:15 |
sean-k-mooney | at least for a while, that's good | 14:15 |
stephenfin | burn it. burn it all | 14:15 |
stephenfin | at least until people start using it in two years or so | 14:15 |
gibi | efried: sure, the part of what happens when we remove the vcpu_pin_set is for the future not for Train | 14:15 |
gibi | stephenfin: thanks | 14:16 |
stephenfin | gibi, efried: fwiw, when we remove 'vcpu_pin_set' I'd like to preserve the same behavior | 14:16 |
stephenfin | though sean-k-mooney disagrees | 14:16 |
stephenfin | in U or V or whatever, the above simply becomes | 14:16 |
stephenfin | any(compute.cpu_shared_set, compute.cpu_dedicated_set) == False --> allocate all host CPUs for PCPU inventory | 14:16 |
gibi | stephenfin: I have conflicting requirements internally about what an empty vcpu_pin_set "should" mean | 14:17 |
stephenfin | damn, sorry, VCPU | 14:17 |
stephenfin | in both cases | 14:17 |
gibi | stephenfin: yeah, VCPU. even better | 14:17 |
efried | alex_xu: even if not yet +2able, do you see https://review.opendev.org/#/c/671800/ being generally sane, enough to unblock the bottom and start merging cpu-resources patches? | 14:17 |
stephenfin | yeah, my mistake | 14:17 |
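For readers following the config discussion above, here is a minimal Python sketch of the Train-era fallback stephenfin describes; the helper name and structure are illustrative assumptions, not nova's actual code. If none of vcpu_pin_set, [compute] cpu_shared_set or [compute] cpu_dedicated_set are set, every host CPU is reported as VCPU inventory, preserving the old empty-vcpu_pin_set behaviour.

```python
# Illustrative sketch only; helper name and signature are hypothetical.
def vcpus_to_report(vcpu_pin_set, cpu_shared_set, cpu_dedicated_set,
                    all_host_cpus):
    """Pick the host CPUs to expose as VCPU inventory."""
    if not any([vcpu_pin_set, cpu_shared_set, cpu_dedicated_set]):
        # No CPU partitioning configured: fall back to every host CPU.
        return set(all_host_cpus)
    # Otherwise the explicit shared set (or the legacy pin set) wins; with
    # only cpu_dedicated_set configured, nothing is reported as VCPU and the
    # dedicated CPUs become PCPU inventory elsewhere.
    return set(cpu_shared_set or vcpu_pin_set or [])
```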
alex_xu | efried: yes, I think it is good. the case I tested passed | 14:18 |
efried | stephenfin: any reservations about that ^ ? | 14:18 |
stephenfin | efried: Nope. I need to fix a thing with quotas and discuss whether we need the scheduler option with dansmith ([1]) but neither of those should affect anything [1] https://review.opendev.org/#/c/671801 | 14:19 |
stephenfin | *anything lower in the series | 14:19 |
stephenfin | sean-k-mooney: Want to skim through https://review.opendev.org/#/c/671793/ again now that the fixes for your issues are merged directly in? | 14:20 |
sean-k-mooney | em, I'll add it to the queue after the vpmem patch, but sure | 14:20 |
efried | bauzas: would you please re+2 and +W the bottom cpu-resources https://review.opendev.org/#/c/671793/ once sean-k-mooney has done that ^ ? | 14:20 |
*** awalende has quit IRC | 14:24 | |
*** awalende has joined #openstack-nova | 14:25 | |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Claim resources in resource tracker https://review.opendev.org/678452 | 14:25 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Enable driver discovering PMEM namespaces https://review.opendev.org/678453 | 14:25 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: report VPMEM resources by provider tree https://review.opendev.org/678454 | 14:25 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Support VM creation with vpmems and vpmems cleanup https://review.opendev.org/678455 | 14:25 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Parse vpmem related flavor extra spec https://review.opendev.org/678456 | 14:25 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Enable driver configuring PMEM namespaces https://review.opendev.org/679640 | 14:25 |
*** awalende has quit IRC | 14:25 | |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Add functional tests for virtual persistent memory https://review.opendev.org/678470 | 14:25 |
*** dtantsur is now known as dtantsur|afk | 14:25 | |
*** awalende has joined #openstack-nova | 14:26 | |
artom | sean-k-mooney, thanks for testing, I saw your comments while waiting at the doctor's office. I'm going to try and fix the func tests, and then get a better handle on why instance.refresh() makes it work | 14:27 |
sean-k-mooney | artom: well it does not | 14:27 |
*** brinzhang_ has joined #openstack-nova | 14:27 | |
artom | sean-k-mooney, hah, so it never worked? | 14:27 |
sean-k-mooney | at least adding it before apply does not work | 14:27 |
sean-k-mooney | I'm going to try adding it after the migration context is dropped | 14:27 |
artom | sean-k-mooney, no, it was on the source, before calling driver.cleanup() | 14:27 |
sean-k-mooney | oh i thought it was in post live migrate | 14:28 |
artom | The theory was that driver.cleanup(), at least in libvirt, calls instance.save() and clobbers what the dest saved | 14:28 |
sean-k-mooney | oh ok | 14:28 |
kashyap | efried: Ah, nice - just noticed that Luyao also wrote the upstream libvirt-proper part of the 'pmem' / related NVDIMM support bits :-). (/me needs more time, it's a massive context-switch for my slow brain.) | 14:28 |
sean-k-mooney | I'll look at the old code and try and figure out where to add it | 14:28 |
sean-k-mooney | if you know, then I can test it | 14:28 |
* kashyap bbiab; neighbour called for help | 14:29 | |
sean-k-mooney | or feel free to update it on the host I gave you access to | 14:29 |
efried | kashyap: that sounds impressive | 14:29 |
sean-k-mooney | whichever works | 14:29 |
artom | mriedem, ^^ ... would we be OK using instance.refresh() without a clear handle on why it's needed? I'm assuming not | 14:29 |
*** awalende has quit IRC | 14:30 | |
*** maciejjozefczyk has quit IRC | 14:30 | |
sean-k-mooney | efried: https://review.opendev.org/#/c/678455/24/nova/virt/libvirt/config.py is the primary file you want our input on, yes? | 14:31 |
efried | sean-k-mooney: yeah, I can at least somewhat understand the others. | 14:32 |
sean-k-mooney | I have looked at the others, and assuming this file looks sane those do too | 14:32 |
*** ociuhandu has joined #openstack-nova | 14:32 | |
sean-k-mooney | I'm almost done, but so far nothing jumps out at me as obviously wrong | 14:32 |
*** ociuhandu has quit IRC | 14:33 | |
efried | luyao: https://review.opendev.org/#/c/678456/ quick fix here please | 14:33 |
*** ociuhandu has joined #openstack-nova | 14:33 | |
luyao | efried : ok | 14:34 |
mriedem | artom: i thought sean-k-mooney said it was needed | 14:35 |
sean-k-mooney | mriedem: I said it does not work without it. I have not tested re-adding instance.refresh() in the correct place | 14:35 |
mriedem | artom: like I said in review, my guess is the copy of the instance on the source has a dirty migration_context field, the dest applies its copy and drops it, and then the source does an instance.save() and saves the dirty copy | 14:35 |
sean-k-mooney | the current code is setting the correct NUMA topology in the migration context but it's not making it into the numa_topology field in the instance_extra table | 14:36 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Parse vpmem related flavor extra spec https://review.opendev.org/678456 | 14:36 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: libvirt: Enable driver configuring PMEM namespaces https://review.opendev.org/679640 | 14:36 |
openstackgerrit | Luyao Zhong proposed openstack/nova master: Add functional tests for virtual persistent memory https://review.opendev.org/678470 | 14:36 |
sean-k-mooney | luyao: efried alex_xu where are ye subtracting the alignment from the requested size? | 14:36 |
sean-k-mooney | self.target_size = kwargs.get("size_kb", 0) - self.align_size | 14:37 |
artom | mriedem, it *is* needed, sean-k-mooney confirmed this morning | 14:37 |
artom | But I don't have a good understanding of why | 14:37 |
sean-k-mooney | luyao: efried alex_xu https://review.opendev.org/#/c/678455/24/nova/virt/libvirt/config.py@3180 | 14:37 |
artom | Err, brb, my client's all wonky | 14:37 |
*** ociuhandu has quit IRC | 14:37 | |
*** artom has quit IRC | 14:38 | |
*** artom has joined #openstack-nova | 14:38 | |
*** ociuhandu has joined #openstack-nova | 14:39 | |
luyao | efried, sean-k-mooney: because the label will occupy some space, and the size must be aligned | 14:39 |
sean-k-mooney | luyao: sure, but just subtracting the alignment is not correct | 14:40 |
sean-k-mooney | it will result in a smaller allocation than you asked for. well | 14:40 |
sean-k-mooney | there are two ways to handle it | 14:40 |
sean-k-mooney | if we know the label size | 14:40 |
artom | mriedem, btw, I removed all the allocation-style pinning/non-NUMA stuff that we talked about last night. You and sean-k-mooney convinced me it was 1. tangential (long-standing issue not related to instance NUMA topology, fixed with a hard reboot) 2. scope creep that's too risky at this point. So I pushed a PS last night without it | 14:40 |
artom | mriedem, so, in my mind, all that's left is the instance.refresh() thing | 14:40 |
sean-k-mooney | we can add that and then round up to the next size, or we do what you're doing | 14:40 |
luyao | sean-k-mooney : Yes it is, it will be smaller | 14:41 |
mriedem | artom: ok. the instance.refresh() certainly doesn't hurt so i have no issues with leaving that in | 14:41 |
dansmith | same, | 14:41 |
sean-k-mooney | luyao: I normally would have assumed we would handle this by adding the overhead size and rounding up to the next alignment boundary, like we do for block devices, but I guess we can't do that as we need to fit the size in the placement allocation | 14:42 |
dansmith | I had said I wasn't totally sure why, but wasn't surprised it was needed | 14:42 |
sean-k-mooney | luyao: so you chose to subtract the alignment | 14:42 |
luyao | sean-k-mooney: the size must be aligned by alignsize | 14:42 |
mriedem | i think it would be pretty simple to debug after the fact if we have a working job - just dump the source instance.migration_context to logs, call post at dest, then refresh and dump it again to the logs | 14:42 |
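For anyone who later wants to do the debugging mriedem describes, a rough sketch of what the log dumps could look like; where exactly these calls sit in the live migration flow is an assumption for illustration, not the actual nova code.

```python
# Hedged sketch: dump the instance's migration_context around the
# post-live-migration step so the before/after values can be compared.
from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def dump_migration_context(instance, when):
    # migration_context holds the "new" NUMA topology/resources during a move.
    LOG.debug('migration_context (%s): %s', when, instance.migration_context)


# On the source host (illustrative ordering only):
#   dump_migration_context(instance, 'before post at dest')
#   ... destination applies its copy of the context and drops it ...
#   instance.refresh()
#   dump_migration_context(instance, 'after refresh')
```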
sean-k-mooney | luyao: yes i know | 14:42 |
sean-k-mooney | im not arguing that point | 14:42 |
dansmith | mriedem: yeah | 14:42 |
sean-k-mooney | but that code does not actually guarantee that | 14:42 |
artom | mriedem, dansmith, ack, so first thing I'll do is put it back in, and while you guys check whether I've addressed the rest of the feedback satisfactorily, I can fix the func tests for not picking that up, and get a better handle on why it's needed in the first place | 14:43 |
luyao | sean-k-mooney: let me check it again | 14:43 |
dansmith | artom: ack | 14:43 |
sean-k-mooney | luyao: I'm referring to https://review.opendev.org/#/c/678455/24/nova/virt/libvirt/config.py@3180 by the way | 14:44 |
sean-k-mooney | luyao: oh, you are relying on the placement step size to prevent non-aligned allocations | 14:45 |
luyao | sean-k-mooney: sorry, I don't understand | 14:46 |
luyao | sean-k-mooney: the target size is aligned by alignsize, this is guaranteed by creating the namespace, then I subtract the alignment, so it will be aligned by alignsize also. | 14:47 |
luyao | sean-k-mooney: what is placement step size | 14:48 |
sean-k-mooney | yes, but this code is relying on other code making sure that size_kb is a multiple of the alignment size | 14:48 |
sean-k-mooney | it is not actually checking for that here, it's just assuming it | 14:48 |
sean-k-mooney | which is why this normally would not guarantee it's still aligned, but in this case it will be | 14:49 |
sean-k-mooney | luyao: ignore the placement step size, we decided to model it differently | 14:50 |
luyao | sean-k-mooney: the size is returned from the ndctl utility, ndctl will guarantee the size is aligned. the alignsize is also from ndctl | 14:51 |
sean-k-mooney | we had two proposals, which is why it would have been better for me to review all of the series rather than just this patch | 14:51 |
sean-k-mooney | ok | 14:51 |
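A worked example of the alignment arithmetic discussed above, with made-up numbers; the only point is that if size_kb is already a multiple of align_size (which the code assumes because both values come from ndctl), subtracting one alignment keeps the result aligned while leaving room for the namespace label.

```python
# Made-up values purely for illustration of the alignment argument.
align_size_kb = 2 * 1024           # e.g. a 2 MiB alignment
size_kb = 4 * 1024 * 1024          # a 4 GiB namespace, already aligned

target_size_kb = size_kb - align_size_kb

assert size_kb % align_size_kb == 0           # precondition from ndctl
assert target_size_kb % align_size_kb == 0    # still aligned, just smaller
```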
efried | luyao: FYI: step_size is the granularity with which you're allowed to make an allocation for a resource. | 14:51 |
efried | So if you have 64 units of something, and your step_size is 8, you're allowed to allocate 8, 16, 24... but not 9 or 3 or 14. | 14:51 |
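A tiny numeric check of the step_size rule just described, with the same made-up numbers: a total of 64 units and a step_size of 8 allow only multiples of 8 to be requested.

```python
# Illustration of placement's step_size semantics with made-up numbers.
total, step_size = 64, 8


def is_valid_request(amount):
    return 0 < amount <= total and amount % step_size == 0


assert is_valid_request(8) and is_valid_request(16) and is_valid_request(24)
assert not is_valid_request(3) and not is_valid_request(9) and not is_valid_request(14)
```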
sean-k-mooney | efried: yeah, so in the current proposal we are tracking inventories of namespace size | 14:52 |
sean-k-mooney | so I think it's 1 | 14:52 |
sean-k-mooney | but we had debated modeling it differently at one point | 14:52 |
sean-k-mooney | in its current form it's not relevant, but I need to remind myself of some of the other decisions | 14:53 |
luyao | efried, sean-k-mooney: thanks, I see. it has no step_size, it is not partitionable | 14:54 |
sean-k-mooney | luyao: well, it's set to 1 | 14:54 |
efried | Yes, I remember this now. 1 is appropriate, because they're single namespaces | 14:55 |
sean-k-mooney | so you get pmem in units of 1 namespace | 14:55 |
luyao | sean-k-mooney: yes | 14:55 |
efried | ^ | 14:55 |
efried | and the size/align shouldn't actually leak into any kind of placement, scheduling, etc. code. | 14:55 |
efried | If it's anywhere at all, it would be deep in the libvirt-isms. | 14:55 |
efried | because as far as nova+placement are concerned, a namespace is a single unit. | 14:56 |
sean-k-mooney | efried: at one point we had discussed the idea of dynamically creating the namespaces, but we rejected that | 14:56 |
sean-k-mooney | when we were considering it, the alignment was one of the possible values for the step_size I believe | 14:57 |
sean-k-mooney | anyway, that is not relevant | 14:57 |
efried | right | 14:57 |
* efried ==> meetings... | 14:57 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Fix the race in confirm resize func test https://review.opendev.org/681238 | 14:57 |
gibi | mriedem: ^^ | 14:57 |
*** nweinber_ has joined #openstack-nova | 14:58 | |
efried | dansmith, stephenfin: FYI on the pmem CI, the reason it's passing on patches outside of the pmem series is because currently the job is configured to pull down the top patch (https://review.opendev.org/#/c/678470) before running. | 14:59 |
efried | I'm still digging into what the tests are actually doing. | 14:59 |
gibi | mriedem: it seems that the top of the bw series hits the race so I'm wondering what to do. Rebase the series to include the fix at the bottom, or wait for the grenade test result first as the gate is slow | 14:59 |
dansmith | efried: are you talking about vpmem or PCPU? | 14:59 |
*** udesale has quit IRC | 15:00 | |
*** udesale has joined #openstack-nova | 15:00 | |
efried | dansmith: vpmem | 15:00 |
*** nweinber__ has quit IRC | 15:01 | |
efried | bbiab | 15:01 |
*** efried is now known as efried_afk | 15:01 | |
*** gbarros has quit IRC | 15:02 | |
mriedem | gibi: a nit inline on that race test fix | 15:03 |
mriedem | gibi: i think we'll just get the race fix merged today and then it's the same as you rebasing your series on it | 15:03 |
gibi | mriedem: OK. will put back the lower() call. | 15:04 |
gibi | mriedem: ahh good point, the gate will rebase the series for the test anyhow | 15:05 |
*** spsurya has quit IRC | 15:05 | |
*** jawad_axd has joined #openstack-nova | 15:06 | |
*** brinzhang_ has quit IRC | 15:07 | |
mriedem | gibi: left another comment, don't know if you want to fup again or not, | 15:07 |
mriedem | but since you changed all of these copy/paste blocks of confirmResize/wait for status/wait for event, those could go into a single helper method | 15:08 |
gibi | mriedem: OK I will put that in a helper. Do you suggest to only change the new confirm resize tests? | 15:09 |
mriedem | i'm not sure what you mean by 'new confirm resize tests' | 15:09 |
mriedem | like the new bw provider migration tests? | 15:10 |
gibi | mriedem: the current patch adds the instance action event waiting for every confirm resize tests in test_servers | 15:10 |
gibi | mriedem: not just to the bw related confirm resize tests | 15:10 |
mriedem | yeah, I didn't expect to see all of those other changes in tests that aren't doing exotic allocations | 15:10 |
*** Sundar has quit IRC | 15:10 | |
mriedem | i mean, it's fine, and we could just put the helper in ServerMovingTests | 15:11 |
gibi | this race does not depend on any exotic allocation. | 15:11 |
gibi | true we only saw the issue in the bw related tests | 15:12 |
mriedem | putting it in ServerMovingTests is probably not good enough then, since the new tests don't use that | 15:12 |
mriedem | in PortResourceRequestBasedSchedulingTestBase | 15:12 |
mriedem | could go in ProviderUsageBaseTestCase though... | 15:13 |
*** tkajinam has quit IRC | 15:13 | |
mriedem | up to you - i just don't love the copy/paste so we should use a helper at some point, fup or whatever - up to you | 15:13 |
*** tssurya has quit IRC | 15:13 | |
gibi | mriedem: OK. I will respin the patch | 15:13 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: NUMA live migration support https://review.opendev.org/634606 | 15:14 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: Deprecate CONF.workarounds.enable_numa_live_migration https://review.opendev.org/640021 | 15:14 |
openstackgerrit | Artom Lifshitz proposed openstack/nova master: Functional tests for NUMA live migration https://review.opendev.org/672595 | 15:14 |
bauzas | gibi: could you please give me again the race fix change ? | 15:14 |
bauzas | gibi: can't easily find it | 15:14 |
gibi | bauzas: https://review.opendev.org/#/c/681238 | 15:15 |
bauzas | gibi: thanks | 15:15 |
bauzas | ah, that's because the topic name changed :) | 15:15 |
sean-k-mooney | alex_xu: luyao have you tested the vpmem code with an instance that requests hugepages or pinning but doesn't specify a NUMA topology? | 15:16 |
artom | dansmith, mriedem, ^^ | 15:17 |
*** gbarros has joined #openstack-nova | 15:17 | |
*** shilpasd has quit IRC | 15:17 | |
bauzas | gibi: so you'll respin https://review.opendev.org/#/c/681238 ? | 15:18 |
alex_xu | sean-k-mooney: I guess luyao has tested that. But that will be a case like a normal instance with NUMA, since when we specify pinning or hugepages, we will create a one-node NUMA topology for the instance, right? | 15:18 |
* bauzas context switches to cpu-resources then | 15:18 | |
sean-k-mooney | alex_xu: we should, but I need to make sure https://review.opendev.org/#/c/678455/25/nova/virt/libvirt/driver.py@4664 does not break that | 15:19 |
*** ociuhandu has quit IRC | 15:19 | |
sean-k-mooney | e.g. an instance with hw:cpu_policy=dedicated + vpmem needs to be pinned | 15:19 |
*** ociuhandu has joined #openstack-nova | 15:19 | |
sean-k-mooney | and I need to ensure adding "not need_pin" won't break that | 15:19 |
alex_xu | sean-k-mooney: https://review.opendev.org/#/c/678455/25/nova/virt/libvirt/driver.py@5454, for that case, the need_pin should be True, since the instance has numa_topology | 15:20 |
sean-k-mooney | instance.numa_topology should be set in the claim, but I need to triple check | 15:20 |
alex_xu | yea | 15:20 |
gibi | bauzas: yeah, I will respin soon. | 15:20 |
bauzas | cool | 15:21 |
alex_xu | sean-k-mooney: actually we set instance.numa_topology in the API layer I think | 15:21 |
sean-k-mooney | alex_xu: I think the code is correct but I'm not sure I'm happy with this being in a different location than the other places we create a NUMA topology implicitly | 15:21 |
alex_xu | sean-k-mooney: yeah, that was the agreement at the PTG, people said they want vpmem on non-NUMA instances, but I totally understand your point | 15:22 |
*** ociuhandu has quit IRC | 15:22 | |
alex_xu | I can't remember who is asking, but it should be redhat people | 15:23 |
*** ociuhandu has joined #openstack-nova | 15:23 | |
sean-k-mooney | well, as a redhat person I did not, but I think maybe dansmith preferred supporting this. I know we said we would do that at the PTG however | 15:23 |
alex_xu | yeah, actually I was on your side in the beginning | 15:24 |
sean-k-mooney | alex_xu: do you have an XML I can inspect? | 15:24 |
dansmith | sean-k-mooney: um, what? | 15:24 |
sean-k-mooney | e.g. an example XML that was generated using this code | 15:24 |
* dansmith feels like sean-k-mooney has been blaming a lot of stuff on him lately | 15:24 | |
*** tbachman has quit IRC | 15:25 | |
alex_xu | sean-k-mooney: luyao has, but she is on the way home~ | 15:25 |
sean-k-mooney | dansmith: you preferred to add pmem support without requiring NUMA in the initial version, correct? | 15:25 |
sean-k-mooney | dansmith: I could be misremembering, if so my apologies | 15:25 |
sean-k-mooney | dansmith: I just remember you and I were the most active redhatters in that discussion at the PTG. | 15:26 |
alex_xu | sean-k-mooney: here is one http://52.27.155.124/95/674895/35/check/pmem-tempest-plugin-filtered/af5fdef/controller/logs/screen-n-cpu.txt.gz | 15:26 |
alex_xu | sean-k-mooney: search the 'pmem', you will see the xml output by nova-compute log | 15:27 |
dansmith | sean-k-mooney: I seriously doubt I said that | 15:27 |
*** ociuhandu has quit IRC | 15:28 | |
sean-k-mooney | dansmith: in that case my apologies for invoking your name :) | 15:29 |
sean-k-mooney | alex_xu: so this is the generated domain | 15:29 |
sean-k-mooney | http://paste.openstack.org/show/774840/ | 15:29 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Fix the race in confirm resize func test https://review.opendev.org/681238 | 15:30 |
gibi | mriedem, bauzas: ^^ | 15:30 |
sean-k-mooney | alex_xu: it is pinning both the RAM and the cores of the guest to float over host NUMA node 0 | 15:30 |
alex_xu | sean-k-mooney: yes, but I think that is based on an old patchset, since that CI job always pulls an old version of luyao's patch | 15:31 |
*** ociuhandu has joined #openstack-nova | 15:31 | |
alex_xu | in new version, it shouldn't have the pinning | 15:31 |
sean-k-mooney | alex_xu: but it is not specifying a constraint on the pmem device as far as I can see | 15:31 |
alex_xu | sean-k-mooney: yes, that is the output for patchset13, sorry | 15:33 |
sean-k-mooney | alex_xu: I'm not sure that is a good thing, but ok. do you have an updated run? | 15:33 |
sean-k-mooney | I'll check the CI logs I guess | 15:33 |
alex_xu | sean-k-mooney: luyao is trying to get one for you | 15:33 |
sean-k-mooney | http://52.27.155.124/93/671793/23/check/pmem-tempest-plugin-filtered/23b32fa/ should have them, right? | 15:34 |
sean-k-mooney | that's from PS 23 | 15:34 |
sean-k-mooney | I found one | 15:34 |
alex_xu | sean-k-mooney: no, it won't, the CI always tries to apply PS13; Rui is trying to remove that. That is why you can see other patches also pass the pmem CI test | 15:35 |
*** macz has joined #openstack-nova | 15:35 | |
*** gyee has joined #openstack-nova | 15:35 | |
sean-k-mooney | in the new code, will both the numatune and cputune elements be removed if only a pmem device is requested in the flavor? | 15:37 |
*** ociuhandu has quit IRC | 15:38 | |
kashyap | sean-k-mooney: To my eyes this whole NUMA and PMEM interaction looks subtle enough that some more functional testing is required... | 15:39 |
*** tbachman has joined #openstack-nova | 15:41 | |
sean-k-mooney | kashyap: well there are functional tests in https://review.opendev.org/#/c/678470/27 | 15:41 |
sean-k-mooney | but i have not looked at them yet | 15:41 |
kashyap | sean-k-mooney: Ah, missed it; thx | 15:42 |
sean-k-mooney | ... it's locking up my Firefox window trying to open it for some reason | 15:43 |
bauzas | gibi: https://review.opendev.org/#/c/681238/3 | 15:44 |
bauzas | gibi: can I update the commit msg ? | 15:44 |
sean-k-mooney | kashyap: the functional tests will not assert anything about the XML generation | 15:44 |
gibi | bauzas: sure go ahead | 15:44 |
*** factor has joined #openstack-nova | 15:45 | |
kashyap | sean-k-mooney: Sure, I see those are already in the main change. | 15:45 |
*** icarusfactor has quit IRC | 15:45 | |
sean-k-mooney | there are no assertions made about the numa element in the main change as far as I can tell | 15:46 |
sean-k-mooney | rather numatune and cputune | 15:46 |
*** ociuhandu has joined #openstack-nova | 15:47 | |
bauzas | gibi: cool, will do and +2 | 15:48 |
gibi | bauzas: thanks a lot | 15:48 |
mriedem | gibi: nits in https://review.opendev.org/#/c/676140/19 for a follow up | 15:48 |
openstackgerrit | Sylvain Bauza proposed openstack/nova master: Fix the race in confirm resize func test https://review.opendev.org/681238 | 15:48 |
sean-k-mooney | efried_afk: alex_xu I'll try to come back to the pmem stuff in an hour or so. I need to review stephenfin's patch then sync with artom | 15:49 |
gibi | mriedem: ack. thanks | 15:49 |
sean-k-mooney | stephenfin: you wanted me to review https://review.opendev.org/#/c/671793/ specifically, right? should I also review the rest? | 15:50 |
stephenfin | sean-k-mooney: would be helpful, aye | 15:50 |
alex_xu | sean-k-mooney: i got one http://paste.openstack.org/show/774842/ | 15:50 |
stephenfin | I'm marching through the pmem stuff | 15:50 |
* bauzas is done for the day, sorry stephenfin :( | 15:50 | |
bauzas | but I'll look at your series tomorrow morning | 15:51 |
sean-k-mooney | alex_xu: looking | 15:51 |
artom | sean-k-mooney, for once, I think I'm good, and do not require your excellent services :) | 15:51 |
sean-k-mooney | alex_xu: I think that is correct. it is creating a virtual NUMA topology but it is not tying it to the host in any way | 15:52 |
sean-k-mooney | artom: the pmem device is also associated with the virtual guest NUMA node 0 | 15:53 |
*** gbarros has quit IRC | 15:53 | |
sean-k-mooney | artom: are you working on fixing the persistence issue? | 15:53 |
sean-k-mooney | artom: I just wanted to circle back and see if you had anything for me to test, or if I should start looking into where to fix the issue | 15:54 |
artom | sean-k-mooney, I added instance.refresh() back in, so that's settled | 15:54 |
alex_xu | sean-k-mooney: but I think this is wrong https://review.opendev.org/#/c/678455/25/nova/virt/libvirt/driver.py@5458 | 15:54 |
sean-k-mooney | artom: ok did you push that? | 15:54 |
artom | The func tests weren't hitting it because driver.cleanup() is called conditionally, and the func test env isn't meeting those conditions | 15:54 |
artom | sean-k-mooney, I did push | 15:54 |
alex_xu | sean-k-mooney: I checked the nova show, I saw there is "hw:numa_nodes" being added, so I guess that is persisted in the db | 15:55 |
sean-k-mooney | ok then ill check it locally unless you have already | 15:55 |
artom | If I just change the code to always call it, I can reproduce | 15:55 |
sean-k-mooney | artom: that is not correct | 15:55 |
artom | And instance.refresh() does indeed fix it | 15:55 |
artom | sean-k-mooney, I know :) It was just to test | 15:55 |
mriedem | stephenfin: did you ever re-post your PCPU upgrade ML thread with [nova] tagged on it to actually get operator visibility? | 15:55 |
artom | Next step is to trigger driver.cleanup in the "real" way in func tests | 15:55 |
sean-k-mooney | we should not see hw:numa_nodes in nova show | 15:55 |
alex_xu | sean-k-mooney: yes, we shouldn't | 15:55 |
sean-k-mooney | artom: sorry, that was for alex | 15:56 |
sean-k-mooney | artom: i have not looked at your change | 15:56 |
stephenfin | mriedem: No, it didn't seem necessary since we'd solved the upgrade issue in a way that didn't require anything special from the operator | 15:56 |
*** damien_r has quit IRC | 15:56 | |
stephenfin | outside of bog standard config options | 15:56 |
sean-k-mooney | alex_xu: we should not see hw:numa_nodes=1 if its not in the flavor | 15:56 |
sean-k-mooney | alex_xu: we do not see that when you get implicit NUMA topologies in other cases | 15:56 |
sean-k-mooney | alex_xu: so if you are seeing it then the code is incorrect | 15:57 |
artom | sean-k-mooney, we're good, don't worry :) | 15:57 |
mriedem | stephenfin: isn't dansmith's comment all about a nasty upgrade problem? | 15:57 |
alex_xu | sean-k-mooney: I think the problem is that the patch changes instance.flavor directly; after driver.spawn, nova-compute updates the instance object, then persists it into the db. | 15:57 |
mriedem | to which operators, like mnaser, might want to weigh in? | 15:57 |
sean-k-mooney | alex_xu: if we were to cold migrate the instance, the behavior would change if we save it to the db | 15:57 |
mnaser | hm | 15:57 |
* mnaser can read context now while munching food | 15:57 | |
dansmith | mriedem: yeah, the "plan" doesn't seem super great to me as currently laid out :/ | 15:57 |
sean-k-mooney | alex_xu: yeah, we should not be changing the flavor at all | 15:57 |
*** ociuhandu has quit IRC | 15:58 | |
stephenfin | mriedem: an intractable one though. Even if operators don't like the little dance we're doing, I fail to see how there's an alternative | 15:58 |
mriedem | mnaser: you'd need someone to tl;dr it (i would also) | 15:58 |
sean-k-mooney | alex_xu: the other cases don't change the flavor, they just create a NUMA topology | 15:58 |
mriedem | since there are 5 conversations going on at once in here right now | 15:58 |
mnaser | sounds like an ML post? | 15:58 |
mnaser | that i can read | 15:58 |
mriedem | mnaser: there was one which no operators read :) | 15:58 |
mriedem | b/c it wasn't tagged for [nova] or [ops] | 15:58 |
sean-k-mooney | so the pmem code is taking a shortcut by updating the flavor. on a hard reboot that instance would be pinned | 15:58 |
*** ociuhandu has joined #openstack-nova | 15:58 | |
stephenfin | sean-k-mooney: I'm looking at that at the moment. I don't like it. | 15:59 |
stephenfin | Not at all | 15:59 |
stephenfin | Assuming you're referring to https://review.opendev.org/#/c/678455/25/nova/virt/libvirt/driver.py@5458 | 15:59 |
mriedem | stephenfin: for the sake of everyone's clarity, could you post a new ML thread with the proposed upgrade path for PCPU and tag with [nova] and [ops]? | 15:59 |
sean-k-mooney | stephenfin: if we need to create a NUMA topology we should move it to where we do it for hugepages, right? | 15:59 |
sean-k-mooney | stephenfin: yes | 15:59 |
sean-k-mooney | stephenfin: that is the hack that i dont like | 16:00 |
stephenfin | sean-k-mooney: exactly what I'm writing in a comment as we speak | 16:00 |
stephenfin | what is it with people trying to hack flavors :D | 16:00 |
stephenfin | mriedem: sure | 16:00 |
stephenfin | though I really don't see the point | 16:00 |
dansmith | stephenfin: "intractable" is a bit of a silly characterization :) | 16:00 |
alex_xu | sean-k-mooney: yea, agree with you | 16:00 |
stephenfin | because the only people that can solve this are in this channel/on the review already | 16:00 |
sean-k-mooney | stephenfin: the issue is that we want a NUMA topology in the XML, but not in the NUMA topology filter | 16:01 |
stephenfin | dansmith: Possibly :) I have been thinking about this for quite some time though and we've gone through a lot of options, so it starts looking like that to me, heh | 16:01 |
*** ociuhandu has quit IRC | 16:02 | |
*** ociuhandu has joined #openstack-nova | 16:03 | |
sean-k-mooney | alex_xu: since stephenfin is looking at it im gong to review his cpu code then ill come back to this after i test artoms code | 16:03 |
mriedem | stephenfin: saying "because the only people that can solve this are in this channel/on the review already" is not true imo - if you've got a hard upgrade thing coming for operators, you likely should get some feedback from them before pushing forward | 16:03 |
alex_xu | sean-k-mooney: thanks a lot | 16:03 |
sean-k-mooney | mriedem: the upgrade will be significantly harder if we also have to deal with NUMA in placement in the same release | 16:04 |
mriedem | sean-k-mooney: i don't know what that has to do with this at all | 16:04 |
sean-k-mooney | mriedem: if we defer PCPUs in placement to U we will have to deal with both in one go | 16:04 |
mriedem | i didn't say anything about deferring | 16:04 |
mriedem | i said, does anyone outside of the 3 people reviewing this that will actually have to deal with the upgrade know what the plan is | 16:05 |
mriedem | and are they ok with it | 16:05 |
*** tbachman has quit IRC | 16:05 | |
sean-k-mooney | right, but the current upgrade approach is the best we could come up with, and we went to the MLs and asked if the toggle was ok | 16:05 |
mriedem | and no operators even saw that thread, | 16:05 |
mriedem | which is why i asked (again) if it could be posed with a [nova][ops] tag | 16:06 |
mriedem | to get visibility | 16:06 |
mriedem | lack of feedback from operators is not agreement | 16:06 |
*** tesseract has quit IRC | 16:06 | |
sean-k-mooney | well we did ask cern in irc | 16:06 |
*** rpittau is now known as rpittau|afk | 16:06 | |
dansmith | I think it's probably good to get feedback not just from ops,. | 16:06 |
sean-k-mooney | but I would have liked others to comment too | 16:07 |
dansmith | but from people that have to do this in the deployment tools | 16:07 |
mriedem | it would be a lot better to know before releasing train that "this sucks but it's not terrible" rather than "this is a no-go for me" | 16:07 |
dansmith | as this adds at least one more atomic reconfigure/restart of the deployment | 16:07 |
mriedem | sure, i lump mnaser into the ops and tooling (OSA) camps | 16:07 |
dansmith | yup | 16:08 |
stephenfin | I don't see what the actual issue is though | 16:08 |
sean-k-mooney | dansmith: for what it's worth we talked about this internally with our tripleo folks that will be implementing it, and they were ok and actually preferred the separate config flip step | 16:08 |
mriedem | sean-k-mooney: and cern (surya? belmiro?) said what? | 16:08 |
sean-k-mooney | mriedem: we asked belmiro | 16:08 |
dansmith | sean-k-mooney: preferred to what? | 16:08 |
sean-k-mooney | and he was ok with the config | 16:08 |
stephenfin | You do your upgrade and nothing changes. At some point after the upgrade, you go tweak knobs on the compute nodes followed by a knob on the scheduler | 16:08 |
stephenfin | and you're done | 16:08 |
sean-k-mooney | ill see if i can find the irc logs | 16:08 |
dansmith | stephenfin: and restart the whole deployment atomically :) | 16:08 |
stephenfin | no, you don't need to do that | 16:09 |
dansmith | no? | 16:09 |
dansmith | you say "immediately" in your comment | 16:09 |
stephenfin | I said we'd have to do that immediately if I wasn't doing the things I was doing to prevent that | 16:09 |
sean-k-mooney | dansmith: the alternative was to do the double reporting of resources as both VCPU and PCPU, by the way, and that was not done for a specific reason I can't remember | 16:10 |
*** tbachman has joined #openstack-nova | 16:10 | |
dansmith | sean-k-mooney: yeah, that's a terrible alternative, agreed :) | 16:10 |
dansmith | "Would you prefer an extra config step or a kick in the nuts?" | 16:10 |
dansmith | I could get most people to agree to the first | 16:11 |
sean-k-mooney | it did not require the config and it was self healing but ok | 16:11 |
dansmith | seems to me that with the current plan, | 16:11 |
dansmith | after they've upgraded, | 16:12 |
sean-k-mooney | i think we did not do it because of an issue with reshapes | 16:12 |
stephenfin | sean-k-mooney: Not reshapes, no | 16:12 |
stephenfin | http://lists.openstack.org/pipermail/openstack-discuss/2019-August/008501.html | 16:12 |
dansmith | they have to change all (or some fraction) of their computes to the new config to expose the new resources, | 16:12 |
dansmith | restart them, | 16:12 |
dansmith | then tweak the scheduler config to ask for the new thing, then restart those, | 16:12 |
dansmith | then fix the rest of the computes before running out of capacity, and then restart those | 16:12 |
dansmith | right? | 16:12 |
*** lpetrut has quit IRC | 16:12 | |
sean-k-mooney | dansmith more or less | 16:12 |
dansmith | that's the most graceful thing | 16:12 |
stephenfin | dansmith: exactly, yeah | 16:13 |
dansmith | which sounds like (a) hard to automate and (b) laborious | 16:13 |
dansmith | otherwise you're looking for full atomic downtime while you do all that in one go | 16:13 |
dansmith | I can't imagine OSA is going to decide what the sufficient fraction for conversion is, | 16:13 |
sean-k-mooney | well, for FFU we take down the whole cloud control plane so that's not unprecedented | 16:13 |
dansmith | convert that set, reconfig/restart control services, etc | 16:14 |
stephenfin | the scheduler option exists to prevent the need for that atomic upgrade | 16:14 |
dansmith | sean-k-mooney: this is for rolling one release | 16:14 |
stephenfin | *exists solely | 16:14 |
*** bbowen__ has quit IRC | 16:14 | |
mriedem | sean-k-mooney: vexxhost doesn't need to FFU because they actually don't suck at CD | 16:14 |
dansmith | stephenfin: which likely only works for the case where the humans decide when to throw that switch | 16:14 |
sean-k-mooney | mriedem: :) yes, but telcos don't upgrade until the last second. | 16:15 |
stephenfin | dansmith: Yeah, I've told mschuppert et al internally to not even try automating this in TripleO | 16:15 |
dansmith | stephenfin: exactly | 16:15 |
stephenfin | dansmith: But manual human intervention is going to be necessary anyway | 16:15 |
mriedem | oh right, openstack's only consumer, telco's | 16:15 |
dansmith | stephenfin: I don't think that's a given | 16:15 |
stephenfin | yeah, it is | 16:15 |
dansmith | alright, well, end of discussion then huh? | 16:16 |
sean-k-mooney | stephenfin: well, they will be automating it as a separate step that you run, but the full details are tbd | 16:16 |
stephenfin | wait, I'm preparing my longer answer :) | 16:16 |
stephenfin | we can't tell if a host is intended for pinned workloads, unpinned workloads or (bad!) both | 16:16 |
dansmith | if converting from the current cpuset config to the new one is not something a computer can automate, then this is all unreasonable, IMHO | 16:17 |
stephenfin | so we can't therefore tell whether we should be mapping 'vcpu_pin_set' to '[compute] cpu_dedicated_set' or '[compute] cpu_shared_set' | 16:17 |
stephenfin | assuming 'vcpu_pin_set' is even set, which it doesn't have to be | 16:17 |
mriedem | stephenfin: couldn't we detect that if the compute was reporting a trait saying what it's configured for? | 16:17 |
dansmith | mriedem: he's saying the config doesn't currently include the intended behavior | 16:17 |
dansmith | which is fine, the operator may have to tell the tool what they're using their pinning for, | 16:18 |
stephenfin | mriedem: To paraphrase a kid with a spoon, there is no trait | 16:18 |
dansmith | but the actual conversion of the formats does not need to be hand-edited everywhere | 16:18 |
mriedem | stephenfin: i don't know what that means (the spoon kid thing) but sure there is never a trait unless we add one | 16:18 |
dansmith | matrix | 16:18 |
dansmith | can't believe you didn't get a 90s movie reference dude | 16:18 |
mriedem | my point was, if the control plane needs to know things about how the compute is configured/supported, then we use traits for that now | 16:18 |
mriedem | i'm not a matrix fanboy | 16:19 |
* dansmith glares | 16:19 | |
stephenfin | mriedem: We don't need a trait though - we have resources | 16:19 |
dansmith | right, that's not really the problem | 16:19 |
stephenfin | the problem is that we're going from a world where two different types of resource have been munged together, and we're trying to unmunge them as cleanly as possible | 16:20 |
dansmith | the controller side of this seems easy to make flexible enough to handle the rolling config of computes to me | 16:20 |
stephenfin | using service versions? | 16:21 |
dansmith | I would say that the scheduler should ask for the new format by default. If placement returns some options, then we filter and schedule to those if possible. Basically, prefer the upgraded machines | 16:21 |
dansmith | if we get back no candidates or filter them all out, | 16:21 |
sean-k-mooney | we are going from a world where the VCPU resource was the number of virtual CPUs available, meaning the number of shared CPUs | 16:21 |
dansmith | we check the service version to determine if there are old computes in the deployment. If so, we query again for the older format to see if there's any room that way | 16:21 |
artom | So... maybe we need entirely new resource names then? | 16:21 |
dansmith | potentially cache that determination for ten minutes or something, but... | 16:22 |
mriedem | artom: moving away from VCPU is a non starter to me | 16:22 |
dansmith | then you can upgrade and convert computes in one step, which is what OSA and other tools are going to want to do.. | 16:22 |
mriedem | it's baked into *everything* | 16:22 |
dansmith | they want to upgrade and fix the config as one step generally | 16:22 |
artom | mriedem, I'm not saying remove it, I'm saying leave it "legacy CPU resource thing", and come up with new resources to mean shared CPU and dedicated CPU | 16:23 |
dansmith | artom: -3 | 16:23 |
* artom shuts up in a corner | 16:23 | |
artom | Anyways, I have 0 context and func tests to fix | 16:23 |
* artom tries to stop bike-shed-astinating | 16:24 | |
* dansmith whips artom like a mule | 16:24 | |
sean-k-mooney | artom: we talked about SCPU and PCPU in the past but we said no, we want to keep VCPU resources | 16:24 |
artom | dansmith, hey man, dinner first | 16:24 |
sean-k-mooney | so it's way too late to go back down that route | 16:24 |
stephenfin | dansmith: so tl;dr: kill the static scheduler-only config option and instead do "give me PCPU, but if you can't give me PCPU then search for VCPU" instead | 16:24 |
dansmith | stephenfin: yes, but only the last part of there are old computes around. once you do that one time and find everything is upgraded, stop even doing that check | 16:25 |
dansmith | stephenfin: remove that compat step in U, no deprecation cycle needed for that | 16:25 |
stephenfin | What do you mean by old computes? | 16:25 |
sean-k-mooney | dansmith: if we did that we would need a trait to make the second query safe | 16:25 |
mriedem | stephenfin: older than train computes | 16:26 |
mriedem | based on the nova-compute service rpc api version | 16:26 |
dansmith | stephenfin: service version will tell you if all computes have been upgraded.. so actually, maybe just always do that in T if no candidates, because they could do the upgrade and the config tweak separately | 16:26 |
sean-k-mooney | we would need a support_PCPU capability trait and need to add it as a forbidden trait for the second query | 16:26 |
*** bbowen__ has joined #openstack-nova | 16:26 | |
stephenfin | that won't work though | 16:26 |
dansmith | stephenfin: so yeah, what you said | 16:26 |
stephenfin | because by default we don't report PCPU on Train | 16:26 |
stephenfin | doing so would force people to set new config options as soon as they upgrade | 16:27 |
stephenfin | which we can't do | 16:27 |
sean-k-mooney | stephenfin: if you look at artom's migration code, we ignore the "can live migrate with numa" config if everything is upgraded | 16:27 |
dansmith | stephenfin: that's why I backed away from the version check there | 16:27 |
sean-k-mooney | that would still require everything to be restarted however. | 16:28 |
stephenfin | sean-k-mooney: right, but that's because the "can live migrate with numa" check only needs to know that code is new enough. it doesn't need the operator to tweak some config first | 16:28 |
stephenfin | which this does | 16:28 |
stephenfin | dansmith: Ah, yeah, I missed the "so actually" | 16:28 |
dansmith | stephenfin: the reason I was heading in that direction, is because: | 16:29 |
sean-k-mooney | I think if we were to do the double query, as I said, we need the compute capability trait to protect against landing on a new host when old hosts are available and we ask for VCPUs | 16:29 |
mriedem | you'd only do that fallback for VCPU if the flavor in the request spec (in the scheduler) has resources:PCPU=x right? | 16:29 |
stephenfin | sean-k-mooney: not gonna happen - the NUMATopologyFilter or libvirt driver protect us | 16:30 |
dansmith | stephenfin: I was going to suggest we also always expose the inventory, even if we have to synthesize it from the older config, but I'm guessing you're going to say we'd potentially expose as shared or dedicated and be wrong for the intention of the operator right? | 16:30 |
sean-k-mooney | mriedem: no we are translating hw:cpu_policy=dedicated into PCPU requests | 16:30 |
dansmith | stephenfin: require them to convert their configs before U, but expose the new inventory right away | 16:30 |
mriedem | ok but my point is, just make sure we're not unconditionally doing that fallback re-query | 16:30 |
stephenfin | sean-k-mooney: because if you land on a Train compute node, that'll have explicitly set 'NUMATopology.pcpuset' to None | 16:30 |
dansmith | mriedem: yes, only do the fallback query for PCPU things | 16:31 |
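To make the proposal above concrete, here is a very rough Python sketch of the query-then-requery idea; every name here is a placeholder standing in for real scheduler and placement plumbing, not nova's actual code. Ask placement for the PCPU-based request first, and only if that returns nothing, and the request actually involved pinned CPUs, retry with the legacy VCPU translation while not-yet-reconfigured computes may still exist.

```python
# Hedged pseudocode sketch; the callables stand in for real scheduler and
# placement plumbing and are assumptions, not nova's actual interfaces.
def get_candidates_with_fallback(request_spec, query_placement,
                                 wants_pcpu, old_computes_may_exist):
    # New-style request: hw:cpu_policy=dedicated translated to resources:PCPU.
    candidates = query_placement(request_spec, use_pcpu=True)
    if candidates:
        return candidates
    # Fall back only for requests that asked for PCPU in the first place,
    # and only while not-yet-reconfigured computes may still exist.
    if wants_pcpu(request_spec) and old_computes_may_exist():
        # Legacy translation: pinned CPUs still requested as VCPU.
        return query_placement(request_spec, use_pcpu=False)
    return []
```

As the discussion above notes, something like the NUMATopologyFilter (or a compute capability trait) would still be needed to keep the fallback results off hosts that only expose the new-style inventory.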
sean-k-mooney | ok, we could have an issue with the limit parameter on placement | 16:31 |
stephenfin | and if that's None, we've got nothing to pin to | 16:31 |
stephenfin | so placement will pass but the filter will fail | 16:31 |
sean-k-mooney | but if we do the second query without a limit then yeah, the NUMA topology filter would prevent that | 16:31 |
sean-k-mooney | so we dont need the trait | 16:31 |
stephenfin | dansmith: Yeah, exactly | 16:32 |
dansmith | I don't think the limit is a problem | 16:32 |
*** N3l1x has joined #openstack-nova | 16:32 | |
dansmith | stephenfin: so they express their intent right now how? | 16:32 |
sean-k-mooney | dansmith: doesn't CERN set it to like 15 | 16:32 |
stephenfin | dansmith: My previous solution had been to expose CPU inventory on hosts without the new configuration as both PCPU and VCPU | 16:32 |
dansmith | sean-k-mooney: we're asking placement for a query that will return things with PCPU resources.. if placement returns nothing then there's nothing that will fit | 16:33 |
sean-k-mooney | the default limit is 1000 so that should not be a problem, but if it's really low then I expect it could be | 16:33 |
dansmith | sean-k-mooney: their problem is different | 16:33 |
sean-k-mooney | dansmith: I'm not talking about the first query for PCPUs | 16:33 |
dansmith | stephenfin: yeah, I don't think that's a good plan | 16:33 |
sean-k-mooney | dansmith: I meant the second query for VCPUs | 16:33 |
*** gbarros has joined #openstack-nova | 16:33 | |
stephenfin | Yeah, it's really not. I detailed why in that ML post | 16:33 |
dansmith | sean-k-mooney: I don't see what the problem is | 16:34 |
sean-k-mooney | the new host can have inventories of both | 16:34 |
sean-k-mooney | we could get 15 new hosts and no old hosts | 16:34 |
dansmith | stephenfin: I'm asking about today.. they use this ambiguous config thing.. how do they control which type land where? | 16:34 |
sean-k-mooney | then the NUMA topology filter would eliminate all hosts | 16:34 |
stephenfin | the scheduler option and the NUMATopologyFilter | 16:35 |
*** mdbooth has joined #openstack-nova | 16:35 | |
dansmith | stephenfin: meaning static config to determine how to treat pinned cpu requests? | 16:35 |
stephenfin | If the scheduler option is set to False (so it's not translating 'hw:cpu_policy=dedicated' to 'resources:PCPU') then we'll keep requesting VCPU from placement | 16:36 |
dansmith | no, | 16:36 |
dansmith | I'm asking about TODAY | 16:36 |
stephenfin | oh, today | 16:36 |
dansmith | not the mythical future where your patches are landed | 16:36 |
stephenfin | gotcha | 16:36 |
stephenfin | host aggregates | 16:36 |
dansmith | right, okay | 16:36 |
stephenfin | and metadata | 16:36 |
dansmith | that's what I thought | 16:36 |
stephenfin | that's what you're _supposed_ to use, of course. We never enforced it | 16:36 |
dansmith | so the problem is that computes literally don't have access to the information they need to know what kind of thing to expose, because they have config, but no access to the aggregate info | 16:37 |
sean-k-mooney | and Wind River developed a host agent to allow you to mix | 16:37 |
sean-k-mooney | so they exploited that we did not enforce it | 16:37 |
stephenfin | correct | 16:37 |
dansmith | stephenfin: so, if we defaulted to the looser interpretation of the inventory, | 16:37 |
stephenfin | the aggregate metadata isn't standardized | 16:37 |
stephenfin | and it doesn't even need to be set | 16:37 |
dansmith | then the aggregate configs would still be in place and we could fall back appropriately maybe? | 16:37 |
efried_afk | stephenfin: Not having looked thoroughly, if I addressed your -1s on the pmem series, a) would I still be able to +2 in your opinion; b) would it do any good over waiting for luyao to hit it overnight, in terms of you being able to get back to it today? | 16:37 |
dansmith | anyway, we're getting off on a tangent a bit I think, so let me summarize: | 16:38 |
dansmith | I think we can do away with the scheduler knob by doing the query-nay-requery approach to prioritize upgraded and converted computes | 16:39 |
*** igordc has joined #openstack-nova | 16:39 | |
dansmith | it would be nice if we could make the compute side smarter and/or default to a closer runtime scenario, | 16:39 |
stephenfin | efried_afk: So if the comments were addressed could you +2 in my absence? I guess, but I'm happy to review again tomorrow too | 16:39 |
dansmith | but I'm less concerned about that if they can be reconfigured and restarted in isolation to be picked up by the scheduler by default | 16:39 |
*** efried_afk is now known as efried | 16:39 | |
efried | stephenfin: I meant me +2ing on myself, not assuming your +2. | 16:39 |
stephenfin | ohh | 16:40 |
stephenfin | Yeah, sure. There's no major rework necessary in anything I've reviewed so far | 16:40 |
stephenfin | this will need a good bit of modification though, I suspect https://review.opendev.org/#/c/678455/25/nova/virt/libvirt/driver.py | 16:41 |
*** jawad_axd has quit IRC | 16:41 | |
stephenfin | dansmith: Yup, that all makes sense to me | 16:41 |
efried | I ain't touching that patch | 16:43 |
stephenfin | good call :) | 16:43 |
gibi | mriedem: replied in https://review.opendev.org/#/c/676140 with some questions regarding the private helper you suggested | 16:43 |
*** igordc has quit IRC | 16:44 | |
*** derekh has quit IRC | 16:44 | |
* gibi needs to drop for today | 16:45 | |
*** maciejjozefczyk has joined #openstack-nova | 16:47 | |
mriedem | gibi: replied | 16:48 |
*** awalende has joined #openstack-nova | 16:55 | |
*** maciejjozefczyk has quit IRC | 16:56 | |
*** awalende has quit IRC | 17:00 | |
*** ociuhandu_ has joined #openstack-nova | 17:00 | |
*** ociuhandu has quit IRC | 17:03 | |
*** brault has joined #openstack-nova | 17:03 | |
*** ociuhandu_ has quit IRC | 17:05 | |
*** brault has quit IRC | 17:08 | |
sean-k-mooney | I need to take a break for a bit to clear my head so I'm going to have something to eat. | 17:11 |
sean-k-mooney | I just kicked off stacking with the latest version of artom's code | 17:11 |
sean-k-mooney | I'll start testing it when I get back | 17:11 |
*** udesale has quit IRC | 17:12 | |
*** tbachman has quit IRC | 17:14 | |
openstackgerrit | Merged openstack/os-resource-classes master: Update api-ref link to canonical location https://review.opendev.org/681235 | 17:21 |
*** brault has joined #openstack-nova | 17:22 | |
efried | dansmith: bottom two vpmems on your radar for today? https://review.opendev.org/#/c/678447/ | 17:23 |
efried | hopefully easy, minor updates per your prior comments | 17:24 |
dansmith | efried: no, I gotta give a talk in a bit and I spent too much time this morning on reviews already | 17:24 |
dansmith | if comments were addressed it's probably easy for other people to confirm, | 17:24 |
efried | if you're okay with that, then sure. | 17:25 |
dansmith | but for the record, I'm scared of both pcpu and vpmems at this point | 17:25 |
efried | imo the changes are simple and in line with what you requested | 17:25 |
efried | I don't blame you. But hey, it's just FF. We have *weeks* to fix bugs :P | 17:25 |
* efried feeds face | 17:26 | |
*** brault has quit IRC | 17:27 | |
*** ralonsoh has quit IRC | 17:32 | |
*** cdent has quit IRC | 17:33 | |
*** nicolasbock has quit IRC | 17:35 | |
*** tbachman has joined #openstack-nova | 17:37 | |
*** ralonsoh has joined #openstack-nova | 17:39 | |
*** nicolasbock has joined #openstack-nova | 17:40 | |
mriedem | i'll call it, the first major bugs from pmem and pcpu regressions will be after vexxhost upgrades to train, and then in about 18-24 months when other deployments start upgrading to train :) | 17:43 |
mriedem | maybe cern in a year | 17:43 |
dansmith | heh, yeah, we won't find any of the bugs in the FF->release window | 17:43 |
*** igordc has joined #openstack-nova | 17:44 | |
mriedem | is there any way to test pcpu in a gate job if we ran tempest smoke tests in serial or something? like wouldn't that just be a matter of creating a flavor with PCPU, single node devstack? | 17:45 |
mriedem | and configuring n-cpu for the dedicated CPUs on the host? | 17:45 |
mriedem | basically all of them | 17:45 |
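A rough sketch of the pieces such a single-node devstack job would need, shown as Python data purely for illustration (the hw:cpu_policy extra spec and the [compute] cpu_dedicated_set/cpu_shared_set option names are the Train ones; every value here is an assumption, not an existing job definition):

    # Illustrative only: a flavor that requests dedicated CPUs, plus the
    # compute-side CPU partitioning a PCPU gate job would likely configure.
    pcpu_flavor = {
        "name": "pcpu-smoke",
        "vcpus": 1,
        "ram": 256,
        "disk": 1,
        # hw:cpu_policy=dedicated makes the request go to PCPU instead of VCPU
        "extra_specs": {"hw:cpu_policy": "dedicated"},
    }

    # nova.conf on the single compute node, expressed as a dict for readability:
    compute_conf = {
        "compute": {
            "cpu_dedicated_set": "2-7",  # host CPUs reserved for pinned guests
            "cpu_shared_set": "0-1",     # host CPUs left for floating guests
        },
    }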
*** trident has quit IRC | 17:46 | |
dansmith | I'm less concerned about if it actually works in a contrived scenario, and more about an existing deployment trying to get through the upgrade and/or being able to use it without regressions or other issues | 17:46 |
mriedem | sure, and functional tests are good, they just aren't a replacement for the real thing | 17:47 |
dansmith | presumably vpmem testing is pretty much impossive | 17:47 |
dansmith | *impossible | 17:47 |
mriedem | i assume so, hence the 3rd party ci | 17:47 |
*** igordc has quit IRC | 17:50 | |
*** gbarros has quit IRC | 17:51 | |
artom | Which, btw, came up super fast (though I dunno if it had been in the works for a long time before) | 17:57 |
*** igordc has joined #openstack-nova | 17:57 | |
artom | RH can learn a thing or 2 | 17:57 |
*** gbarros has joined #openstack-nova | 17:58 | |
*** trident has joined #openstack-nova | 17:59 | |
dansmith | mriedem: so numa LM should be ready for you, if I understand the current state right? | 18:02 |
dansmith | I hit the object patch again a bit ago | 18:02 |
dansmith | I think the third patch should be good too, but you wanted some verification from the NUMA boyz which sent it off into that cpu pinning tangent, so not sure if you're still waiting for something there | 18:03 |
*** igordc has quit IRC | 18:06 | |
*** igordc has joined #openstack-nova | 18:06 | |
mriedem | artom: came up super fast? you mean the 3rd party CI did? | 18:10 |
artom | mriedem, yeah | 18:10 |
artom | Intel_Zuul appeared a few days ago | 18:10 |
mriedem | dansmith: yeah it's sitting in a tab, was going through gibi's series until i hit a stopping point, which i just did | 18:10 |
*** jdillaman has joined #openstack-nova | 18:10 | |
mriedem | gibi: https://review.opendev.org/#/c/676972/ re-introduces the race you just fixed, so i'll rebase the series and fix that and +W once your other change is merged | 18:10 |
artom | I talked with efried a few days ago, mentioned the apparent disconnect between the "burden of proof" on me for NUMA LM, vs the VPMEM stuff that was apparently ready to go in without any public demonstrations of it working | 18:11 |
mriedem | and "Intel NFV CI" comments on everything immediately just to say it was skipped, which is annoying | 18:11 |
mriedem | efried: any way you can get "Intel NFV CI" to shut up if it's not going to do anything? | 18:11 |
mriedem | it hasn't done anything for years | 18:11 |
artom | Dunno if Intel_Zuul popping up was related, but the coincidence was interesting :) | 18:11 |
mriedem | artom: fwiw i'm not ready for vpmem to go in | 18:12 |
efried | mriedem, artom: The Intel_Zuul is actually running. It's only running pmem, three tests. Unfortunately at the moment it's hardcoded to pull down an old version of the pmem series. | 18:12 |
dansmith | artom: well, one difference is that yours has the potential to break existing functionality, whereas vpmem hopefully won't break anything existing, only people that try to use it | 18:12 |
efried | right ^ | 18:12 |
artom | dansmith, yeah, efried made the same argument - and it's true | 18:12 |
dansmith | artom: not that it's no risk at all, but it is a teensy bit different | 18:12 |
artom | dansmith, yep, I get you | 18:13 |
mriedem | except all of the resource tracker weird side bugs refactoring that code will probably introduce on everyone | 18:13 |
dansmith | well, that's true | 18:13 |
efried | mriedem: retrieving the allocations earlier than we were before. That's the only difference. | 18:13 |
efried | All of it runs under COMPUTE_RESOURCE_SEMAPHORE | 18:13 |
artom | Anyways, my point was: Intel spun up a CI for their RFE in a matter of days (apparently). RH sucks if we can't do the same | 18:13 |
efried | did before, does now. | 18:13 |
mriedem | artom: the guy was working on that since denver i think | 18:14 |
mriedem | efried: btw i was harassing you about "Intel NFV CI" not Intel_Zuul | 18:14 |
artom | mriedem, on the CI? | 18:14 |
efried | yeah, I can go ask, but it wasn't like days | 18:14 |
mriedem | "Intel NFV CI" is an old thing that no longer does anything except comment that it's not doing anything | 18:14 |
mriedem | artom: yeah | 18:14 |
mriedem | we were talking about 3rd party CI in denver | 18:14 |
mriedem | for vpmem | 18:14 |
artom | mriedem, OK, I'll eat my words then | 18:14 |
mriedem | you and the other red hat bros might have been hungover still from the bar the night before :P | 18:15 |
efried | mriedem: fwiw Intel NFV CI they're trying to resurrect to do the thing it was originally intended for. | 18:15 |
*** igordc has quit IRC | 18:15 | |
mriedem | efried: i've been hearing that for months | 18:15 |
efried | first step was turning it back on to be a no-op | 18:15 |
mriedem | i'd like it to shut up until it actually delivers | 18:15 |
efried | yeah I know. | 18:15 |
artom | mriedem, I have Russian roots, calling me an alcoholic is a noop ;) | 18:15 |
sean-k-mooney | mriedem: we should be able to test the pcpu stuff in the gate yes | 18:16 |
mriedem | https://www.youtube.com/watch?v=soNcOfRvOtg | 18:16 |
dansmith | <3 | 18:17 |
dansmith | I love me some early 'priest | 18:17 |
sean-k-mooney | i was talking to a few people on the infra channel and i think we might be able to replace the intel nfv ci with first party ci. | 18:17 |
*** eharney has quit IRC | 18:17 | |
sean-k-mooney | we might be able to get multi numa nested virt labels from limestone and vexxhost in the future | 18:18 |
sean-k-mooney | we can do non-numa testing in the gate already as we have 3 providers with nested virt capability | 18:18 |
sean-k-mooney | or rather single numa | 18:19 |
sean-k-mooney | the testing i set up for the numa live migration was/is testing with cpu pinning, multiple numa nodes and hugepages | 18:21 |
mriedem | hmm, it seems weird to me that the intel pmem job is running and passing on patches that don't have anything to do with pmem when pmem isn't merged | 18:23 |
mriedem | like, how does that even work? | 18:23 |
sean-k-mooney | mriedem: they have hardcoded a specific version of the patch to be merged in | 18:24 |
sean-k-mooney | they were trying to undo that earlier | 18:25 |
mriedem | 2019-09-10 03:27:01.729430 | TASK [upgrade-libvirt-qemu : apply vpmem patch] | 18:25 |
mriedem | aha | 18:25 |
efried | yeah | 18:25 |
mriedem | refs/changes/70/678470/13 -> FETCH_HEAD | 18:26 |
efried | unfortunately it's applying PS13, not the latest | 18:26 |
mriedem | that patch is up to PS27 now | 18:26 |
sean-k-mooney | yep | 18:26 |
efried | known, Rui is supposed to unwind that asap. | 18:26 |
mriedem | i'm not sure why you wouldn't just only run it on patches in that series and skip for anything else | 18:26 |
efried | that would have been another way to do it. | 18:26 |
efried | though it wouldn't have made sense to do it for patches at the bottom either | 18:27 |
mriedem | artom: sean-k-mooney: so this is the numa lm patch/job we care about right? https://review.opendev.org/#/c/680739/ | 18:27 |
sean-k-mooney | yes more or less | 18:27 |
mriedem | which hasn't run on the latest series of patches | 18:27 |
sean-k-mooney | yes so if you recheck it | 18:28 |
sean-k-mooney | it will just run the single job we want | 18:28 |
sean-k-mooney | shall i do it | 18:29 |
mriedem | ye shall | 18:29 |
artom | mriedem, it hasn't - what's FN's status, are we back up? | 18:29 |
artom | donnyd ^^? | 18:29 |
donnyd | yea | 18:29 |
donnyd | it was back up yesterday | 18:29 |
artom | 👍 | 18:29 |
sean-k-mooney | ya there is occasionally a network issue | 18:29 |
sean-k-mooney | because we currently cant fail over to another cloud its more obvious with this job | 18:30 |
donnyd | im not sure why this particular job keeps doing that too... it seems to be failing more often than not | 18:31 |
donnyd | every now and again i get something that pukes... but by and large it works | 18:32 |
sean-k-mooney | well this job cant retry on another provider and im not sure if zuul will retry on the same one | 18:32 |
sean-k-mooney | so i think its more a case of it works or you're out of luck | 18:33 |
openstackgerrit | melanie witt proposed openstack/nova master: WIP Include error 'details' in dynamic vendordata log https://review.opendev.org/681329 | 18:34 |
donnyd | Well if FN was having this kind of failure rate we are seeing with this job... I am pretty sure the other projects would have already kilt me | 18:35 |
sean-k-mooney | well i admit it does look like there is something else going on but im at a loss as to what | 18:36 |
donnyd | I don't think we were having this much of an issue when it was in a separate pool on the other label | 18:36 |
sean-k-mooney | well we have hit a few different issues. 1 was quotas when we changed pools | 18:37 |
sean-k-mooney | then the other issue is the ssh connection | 18:37 |
sean-k-mooney | i think the quota issue has gone away since you are no longer managing that on your end and controlling it via nodepool | 18:38 |
donnyd | well there is no more quota (within reason) max-servers is set to 70 and quota for instances set to 100 | 18:38 |
donnyd | everything else on quota is -1 | 18:38 |
sean-k-mooney | ya | 18:38 |
sean-k-mooney | the pools were both the same project on the same cloud right | 18:39 |
*** ociuhandu has joined #openstack-nova | 18:39 | |
sean-k-mooney | so it should not affect this behavior | 18:39 |
donnyd | correct | 18:39 |
donnyd | should is the opportune word | 18:39 |
sean-k-mooney | yes this also should work :) | 18:39 |
donnyd | LOL sean-k-mooney | 18:39 |
mriedem | https://review.opendev.org/#/c/634606/ went from PS75 to PS83 in a hurry | 18:39 |
donnyd | So the ssh thing I have some theories on and a fix in flight | 18:40 |
donnyd | my edge router could be a little (lot) better | 18:40 |
donnyd | so its possible that its struggling with all of the connections... or i have a knob i need to turn | 18:40 |
donnyd | load is not high according to cpu / mem / network... but that doesn't mean something else isn't borked... so I am digging | 18:41 |
sean-k-mooney | ya perhaps you said you're using bgp to advertise the block right but i assume you dont need to advertise each /64 you are delegating to the vms and have a /48 or something from your isp | 18:42 |
donnyd | each vm is on the same /64 that is advertised (i know... not cloudy) at the edge | 18:43 |
*** ociuhandu has quit IRC | 18:44 | |
donnyd | the zuul tenant has one /64 and its already advertised.. so if we had a routing issue it would be pretty big | 18:44 |
sean-k-mooney | donnyd: oh you have a /64 for the site and the vms are getting a /128? from that subnet | 18:44 |
donnyd | its also possible my state table is too small | 18:44 |
donnyd | I have tweaked some knobs to start | 18:44 |
donnyd | so hopefully the issue goes away | 18:44 |
sean-k-mooney | i know that a lot of hardware routers assume that endpoints are /64 so using /128 or more strict routes can cause issues sometimes | 18:45 |
donnyd | yea, i tried to get the functionality to work pretty much like v4 so people don't get too confused | 18:45 |
donnyd | The way its setup is exactly like if your isp provided you a /64 | 18:46 |
sean-k-mooney | right | 18:46 |
sean-k-mooney | ok makes sense | 18:46 |
donnyd | all the systems on your network could get addresses off that /64 and the isp routes it | 18:46 |
*** ralonsoh has quit IRC | 18:46 | |
donnyd | that is exactly how I have FN setup | 18:46 |
donnyd | not great for billing (don't have billing at FN), but great for actual networky things | 18:47 |
donnyd | so if two instances needed to talk in the same tenant... they could just start talking.. | 18:47 |
sean-k-mooney | im currently trying to get ipv6 working via a HE.net ipv4->ipv6 tunnel but i have only got as far as my router has ipv6 | 18:47 |
sean-k-mooney | when i had my client get ipv6 i had mtu issues | 18:47 |
sean-k-mooney | so you're doing better than me at getting ipv6 to work | 18:48 |
donnyd | i have my /48 routed to me via HE and then I am able to pass out /64's to tenants | 18:48 |
sean-k-mooney | ya that is what i setup on my router too but couldnt get the mtu clamping to work | 18:49 |
*** gyee has quit IRC | 18:49 | |
sean-k-mooney | so i could get to ipv6 sites but packets over about 1350-1400 bytes were dropped | 18:49 |
donnyd | that is strange | 18:50 |
donnyd | what are you using at your edge? | 18:50 |
*** gyee has joined #openstack-nova | 18:50 | |
donnyd | could be ISP not doing you any favors | 18:50 |
donnyd | i have a business connection, so they pretty much leave me alone | 18:50 |
sean-k-mooney | a ubiquiti edgerouter-x, well actually that is not my edge router | 18:50 |
sean-k-mooney | my isp router is in front of it | 18:50 |
sean-k-mooney | so ya my isp router could be messing it up | 18:51 |
* dansmith is thankful to have native ipv6 these days | 18:51 | |
dansmith | also a business connection here and they'll give me multiple /64s for my subnets | 18:51 |
sean-k-mooney | i pay extra for a static ip but they wont let me pay for a business connection and that is the only way to get native ipv6 | 18:52 |
dansmith | our residential connections here all have native ipv6, | 18:52 |
dansmith | but not sure those people can get multiple /64s like I can | 18:52 |
sean-k-mooney | most are on cable broadband here but not vdsl | 18:52 |
donnyd | dansmith: verizon fios business doesn't even have v6 to offer | 18:53 |
dansmith | donnyd: sucks | 18:53 |
sean-k-mooney | all the fiber to home stuff is ipv6 enabled | 18:53 |
dansmith | mine is cable | 18:53 |
donnyd | yea.. its pretty frustrating | 18:53 |
dansmith | I don't like to say nice things about comcast, but they do have the ipv6 stuff pretty well sorted at this point (and have for a couple years) | 18:53 |
donnyd | the network is pretty quick... so the HE overlay doesn't really seem to be hurting for performance | 18:53 |
dansmith | yeah, I loved my fios business for speed when I had it | 18:54 |
donnyd | yea I had them when i was in CoSprings and their business class was pretty good | 18:54 |
dansmith | but then moved outside their area and they mostly stopped expanding it | 18:54 |
*** brinzhang has quit IRC | 18:54 | |
*** brinzhang has joined #openstack-nova | 18:54 | |
*** panda is now known as panda|rover|off | 18:54 | |
sean-k-mooney | artom: sorry this took so long but the latest version is updating the numa topology blob in the db correctly | 18:55 |
sean-k-mooney | so putting back in the instance.refresh() or whatever you did fixed that issue | 18:56 |
sean-k-mooney | im going to try testing a bunch of different cases but are there any in particular people want me to test | 18:58 |
*** ricolin has quit IRC | 18:58 | |
mriedem | artom: so don't respin now since we need to get a result from that ci job, but queue this locally https://review.opendev.org/#/c/634606/83 | 19:05 |
artom | mriedem, ack | 19:06 |
artom | sean-k-mooney, cool, thank you :) | 19:06 |
artom | I managed to hit it in func tests, as I said. But... that was by cheating and setting do_cleanup = True in the code itself, to trigger the driver.cleanup() call | 19:07 |
mriedem | smells like a networking orgy in here | 19:08 |
artom | So I'm trying to do it correctly by forcing is_shared_instance_path to be false | 19:08 |
artom | Which leads to a whole other rabbit hole... | 19:08 |
dansmith | mriedem: efried: not sure if you saw this, but alex_xu confirmed that we just stop caring about quotas on pcpu instances (IIUC) per my question about how it works: https://review.opendev.org/#/c/674895/32/nova/virt/libvirt/driver.py | 19:08 |
mriedem | stephenfin said something about dealing with quota needs to happen yet | 19:09 |
dansmith | mriedem: efried: I can see the solution being a new quota class (ick) or lumping them together (which may be confusing) but just leaving them ignored doesn't seem like a good plan to me, especially since it differs for the user based on whether or not the deployment is configured for placement | 19:09 |
sean-k-mooney | dansmith: for what its worth i pointed out the PCPU quota issue in the unified limits spec | 19:11 |
*** eharney has joined #openstack-nova | 19:11 | |
sean-k-mooney | if we implement it next cycle we either will have 2 quotas or have to use both vcpu and pcpu count when looking at the cpu quota | 19:12 |
dansmith | we have to do something other than just pretend they're not there regardless of when we land it | 19:13 |
sean-k-mooney | well we did not intend to pretend they are not there | 19:13 |
dansmith | since it's all about consuming whole cpus, I'm pretty sure that operators won't be okay with just allowing anyone to boot enough pcpu guests to exhaust the whole deployment | 19:13 |
sean-k-mooney | but yes | 19:13 |
dansmith | I know, but... | 19:13 |
sean-k-mooney | personally i prefer having two separate quotas one per RC but i can see why some wont want to distinguish | 19:14 |
dansmith | yeah, I mean, | 19:15 |
dansmith | you'd think that the operators would want to quota those separately | 19:15 |
sean-k-mooney | so they can bill for them separately | 19:15 |
dansmith | no, | 19:15 |
sean-k-mooney | because one cost them a lot more | 19:15 |
dansmith | they will bill separately, they just need to make sure one tenant doesn't eat them all up | 19:15 |
sean-k-mooney | well ya that too | 19:16 |
sean-k-mooney | if we treat them separately we dont need to special case for them | 19:16 |
sean-k-mooney | its just 1 limit per resource class | 19:16 |
sean-k-mooney | so its simpler to reason about | 19:16 |
sean-k-mooney | the down side is we need to teach people that there would now be two limits on cpus | 19:17 |
dansmith | well, only if they configure flavors with those things | 19:18 |
dansmith | I can't imagine anyone that enables this functionality will be fine with treating them as the same.. one costing like 100x the other :) | 19:18 |
sean-k-mooney | by those things you mean cpu pinning | 19:18 |
dansmith | no, I mean allowing people to ask for dedicated cpus | 19:19 |
dansmith | isn't that the point of this? | 19:19 |
sean-k-mooney | of pcpu in placement or unified limits | 19:19 |
*** pcaruana has quit IRC | 19:20 | |
sean-k-mooney | my understanding is we are still going to request dedicated cpus the same way we always did with hw:cpu_policy=dedicated | 19:20 |
dansmith | I think I've lost grasp of this conversation | 19:21 |
sean-k-mooney | me too | 19:21 |
dansmith | heh | 19:21 |
sean-k-mooney | i was going to say i dont think the way we request things is going to change | 19:21 |
mriedem | but but but i could override my flavor vcpu with resources:VCPU=0 and define resources:PCPU=1 | 19:21 |
sean-k-mooney | i am 99% sure stephen added a check in his code to block that | 19:22 |
sean-k-mooney | we definitely discussed adding one | 19:22 |
sean-k-mooney | but yes today without his patches you could | 19:22 |
mriedem | it would be confusing anyway since flavor.vcpus would be...what? | 19:22 |
sean-k-mooney | flavor.vcpus would be 1 | 19:22 |
mriedem | even though you're not getting vcpu, you're getting pcpu | 19:23 |
sean-k-mooney | yes | 19:23 |
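For reference, the override being discussed looks roughly like this. A sketch only, since whether it should be blocked is exactly what is being debated here:

    # Hypothetical flavor with a resource-class override; values for illustration.
    flavor = {
        "vcpus": 1,  # what the guest sees, and what legacy quota counting uses
        "extra_specs": {
            "resources:VCPU": "0",  # zero out the implicit VCPU request
            "resources:PCPU": "1",  # ask placement for one dedicated host CPU
        },
    }
    # The scheduler request to placement then asks for PCPU=1 rather than VCPU=1.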
dansmith | mriedem: right, I'm hoping we go in the direction of a resource override in the flavor and not the hw:$foo stuff | 19:23 |
mriedem | yeah you can't even create a flavor with vcpu=0 | 19:23 |
sean-k-mooney | the placement resource class should never have been vcpu. flavor.vcpu means virtual cpu count exposed to the guest | 19:24 |
sean-k-mooney | dansmith: we are currently not going in that direction | 19:24 |
sean-k-mooney | dansmith: we are currently explicitly planning to block resource class overrides | 19:24 |
dansmith | mriedem: yeah, but the "rewriting" patch will cause you to get an allocation with vcpu=0, which is weirdish | 19:24 |
dansmith | sean-k-mooney: huh? | 19:24 |
dansmith | I dunno who the "we" is in that scenario | 19:25 |
mriedem | we = stephen and the people approving the changes | 19:25 |
dansmith | do you mean specifically for PCPU things? | 19:25 |
dansmith | mriedem: heh, yeah | 19:25 |
sean-k-mooney | well yes to both | 19:25 |
dansmith | I definitely don't agree with blocking resource based overrides in the future :) | 19:25 |
sean-k-mooney | eric stephen alex gibi and i were talking about this about a month ago | 19:26 |
sean-k-mooney | dansmith: the reason is if we dont you need to modify your flavor if the topology of resources changes in placement | 19:26 |
sean-k-mooney | e.g. your flavor would break if we move cpus under numa nodes or cache nodes | 19:26 |
mriedem | i get the reason in the short term | 19:26 |
dansmith | sean-k-mooney: I don't see what that has to do with it at all | 19:27 |
sean-k-mooney | you would have to change the resources: syntax to the numbered group form | 19:27 |
efried | I'm not fully swapped into this conversation, but last time I looked you get an error if flavor.vcpu != PCPU | 19:27 |
efried | (when PCPU is specified) | 19:27 |
sean-k-mooney | am that is not always correct | 19:28 |
dansmith | which is also really weird and confusing | 19:28 |
efried | except for something something hyperthread | 19:28 |
efried | yes, it's confusing, but it's the compromise that was agreed on in the spec. | 19:28 |
sean-k-mooney | except if you have hw:emulator_threads_policy=isolate | 19:28 |
efried | then you add 1 | 19:28 |
efried | right? | 19:28 |
sean-k-mooney | in that case we allocate 1 additional pcpu for the emulator thread | 19:29 |
sean-k-mooney | yes | 19:29 |
efried | that was in there too. | 19:29 |
dansmith | well, I don't see how we can be landing any of this without the quota bit in place at the very least | 19:29 |
mriedem | which is because of this right? https://github.com/openstack/nova/blob/stable/stein/nova/virt/libvirt/driver.py#L852 | 19:29 |
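A minimal sketch of the accounting described above (an illustration of the behaviour as relayed in this discussion, not nova's actual code):

    def requested_pcpus(flavor_vcpus, extra_specs):
        """Sketch: how many dedicated CPUs a pinned flavor translates into."""
        pcpus = flavor_vcpus
        if extra_specs.get("hw:emulator_threads_policy") == "isolate":
            pcpus += 1  # one extra dedicated core just for the emulator threads
        return pcpus

    # e.g. a 4-vCPU pinned flavor with isolate ends up requesting 5 PCPUs
    assert requested_pcpus(4, {"hw:emulator_threads_policy": "isolate"}) == 5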
mriedem | is hw:emulator_threads_policy and that +1 thing for pcpu restricted to the libvirt driver or does that logic creep into the api? | 19:30 |
*** nweinber_ has quit IRC | 19:30 | |
sean-k-mooney | am well that is hard to answer | 19:30 |
mriedem | GREAT | 19:31 |
sean-k-mooney | only the libvirt driver supports pinning | 19:31 |
sean-k-mooney | and this only works with pinning | 19:31 |
sean-k-mooney | so this only works with the libvirt driver | 19:31 |
dansmith | that is not the right answer | 19:31 |
dansmith | letting stuff like that creep into the API because only one driver supports it is how we have a ton of xen-specific warts on the api | 19:31 |
sean-k-mooney | the right answer is we hope to remove that in U | 19:31 |
sean-k-mooney | because we hope to remove it in U | 19:32 |
mriedem | delicious agent builds | 19:32 |
sean-k-mooney | now that we have support for share | 19:32 |
sean-k-mooney | which maps the emulator threads to the same cpu pool as the floating vms we dont think we need isolate anymore | 19:33 |
mriedem | sean-k-mooney: so by remove "that" you mean hw:emulator_threads_policy ? | 19:33 |
sean-k-mooney | yes | 19:33 |
sean-k-mooney | what stephen and i would like to propose in U is this. if you have cpu_shared_set defined the emulator threads run there. if not they run on the same cores as your pinned vm | 19:34 |
sean-k-mooney | that is what you get if you do hw:emulator_threads_policy=share today | 19:34 |
sean-k-mooney | and if we only support 1 value for the option we can remove it entirely and just do that by default | 19:35 |
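That proposal, as described, boils down to something like this sketch (an illustration of the idea, not an implementation):

    def emulator_thread_cpus(cpu_shared_set, instance_pinned_cpus):
        # Proposed U behaviour as outlined above (what emulator_threads_policy=share
        # gives today): emulator threads float over the shared set when one is
        # configured, otherwise they run on the same cores as the pinned instance.
        return cpu_shared_set if cpu_shared_set else instance_pinned_cpus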
mriedem | efried: so you might want to not drop the -2 on https://review.opendev.org/#/c/671793/23 until the quota issue is sorted out | 19:36 |
mriedem | this quota issue https://review.opendev.org/#/c/674895/32/nova/virt/libvirt/driver.py@7457 | 19:36 |
efried | done | 19:36 |
sean-k-mooney | do we have a quota issue with this? | 19:36 |
sean-k-mooney | i thought quota was being counted on flavor.vcpu | 19:36 |
dansmith | sean-k-mooney: did you read the comments? | 19:36 |
sean-k-mooney | not on the resource class in train | 19:36 |
sean-k-mooney | dansmith: no not yet | 19:37 |
*** panda|rover|off has quit IRC | 19:37 | |
dansmith | sean-k-mooney: maybe do that :) | 19:37 |
efried | good, now everyone can focus on vpmem | 19:37 |
sean-k-mooney | ok but since we only support vms with all pinned or all shared cpus with this series it does not change things as far as i was aware in train | 19:37 |
*** panda has joined #openstack-nova | 19:38 | |
dansmith | before this, we'd limit you to at least your vcpu quota | 19:38 |
dansmith | after this, no limits AFAICT | 19:38 |
dansmith | honestly, I'm not sure how you're going to fix it so it works the same with placement quotas and nova quotas | 19:39 |
sean-k-mooney | we should still be limiting on flavor.vcpu | 19:39 |
dansmith | not with placement | 19:39 |
dansmith | did you read the comments? | 19:39 |
sean-k-mooney | im trying to find which patch you put it on | 19:39 |
mriedem | (2:36:21 PM) mriedem: this quota issue https://review.opendev.org/#/c/674895/32/nova/virt/libvirt/driver.py@7457 | 19:39 |
sean-k-mooney | oh but i think i know what your saying | 19:40 |
sean-k-mooney | we have the option to track cpu quota in placement already | 19:40 |
mriedem | by default we don't use placement for quota usage counting | 19:40 |
sean-k-mooney | we enabled it by default in stein right | 19:40 |
mriedem | f naw | 19:40 |
sean-k-mooney | oh we dont | 19:40 |
mriedem | where is melwitt | 19:40 |
dansmith | we should be in train I think right? | 19:40 |
dansmith | meaning, | 19:40 |
dansmith | we should be turning that on but haven't | 19:41 |
melwitt | mriedem: hiding | 19:41 |
sean-k-mooney | dansmith: so if we turn that on then yes we have a problem | 19:41 |
sean-k-mooney | if we dont we should not | 19:41 |
dansmith | no | 19:41 |
dansmith | we have a problem because people could have that on | 19:41 |
sean-k-mooney | true | 19:41 |
dansmith | you know config knobs can be set to not the default right? :) | 19:41 |
mriedem | cern uses it | 19:41 |
sean-k-mooney | ok i see the problem. i was aware that was a thing but ya i had not considered the impact to this | 19:43 |
mriedem | quota is generally the last thing anyone thinks about | 19:43 |
dansmith | if CPUs weren't the most expensive and constrained resource in the cloud, then maybe less of an issue, but.... :D | 19:44 |
mriedem | [quota]/injected_file_content_bytes is a major concern of mine | 19:45 |
sean-k-mooney | im not sure if sarcasm or if i should feel really bad for you | 19:45 |
mriedem | heh, sarcasm | 19:45 |
sean-k-mooney | dansmith: i always ran out of ram first but ya. i think the only thing we could do would be to count both inventories. at least short term | 19:46 |
sean-k-mooney | where does that code live | 19:47 |
sean-k-mooney | e.g. where we check the quota | 19:47 |
melwitt | enforcing quota on other resource classes (DISK_GB and more) is part of the proposal in the unified limits spec, as I think sean-k-mooney mentioned earlier | 19:47 |
dansmith | melwitt: right, but this set is taking dedicated cpus out of the equation for the cpus quota, | 19:48 |
dansmith | effectively making them unconstrained (and unconstrainable) | 19:48 |
melwitt | sean-k-mooney: nova/quota.py is the main file | 19:48 |
dansmith | but they're the most expensive things | 19:48 |
mriedem | sean-k-mooney: start here https://github.com/openstack/nova/blob/f4ca3e70852c0a7ed7904a9f2d7177c9118d3d1c/nova/quota.py#L1281 | 19:48 |
sean-k-mooney | melwitt: thanks | 19:48 |
dansmith | and combining them together as a hack is pretty dumb, because the whole point of this effort is to make dedicated cpus be first class citizens instead of a hack | 19:49 |
dansmith | the other problem is, | 19:49 |
dansmith | we wouldn't want the quota behavior to be different depending on whether or not you're using placement for quota | 19:50 |
melwitt | dansmith: yeah, sorry, was trying to say that if enforcing quota for PCPU resource class is what's needed then that's a big amount of work | 19:50 |
dansmith | somewhat unrelated, | 19:50 |
dansmith | melwitt: were't we going to enable quota in placement by default soon? | 19:50 |
sean-k-mooney | dansmith: well if we combined them together it would maintain the previous behavior | 19:50 |
dansmith | melwitt: yep, and I'm saying that's the right solution | 19:50 |
dansmith | sean-k-mooney: the point of this work is to make pcpus not work like vcpus right? :) | 19:50 |
melwitt | dansmith: I don't remember a specific timeline, just that we wanted to let it bake with cern for awhile before making it default | 19:51 |
mriedem | idk that counting PCPU for cores quota from placement is that hard, you'd just sum VCPU and PCPU here wouldn't you? https://github.com/openstack/nova/blob/f4ca3e70852c0a7ed7904a9f2d7177c9118d3d1c/nova/scheduler/client/report.py#L2344 | 19:51 |
sean-k-mooney | and the two main goals of this work were coexistence of pinned and floating vms on the same host and removing the race on claiming pinned cpus | 19:51 |
mriedem | and here https://github.com/openstack/nova/blob/f4ca3e70852c0a7ed7904a9f2d7177c9118d3d1c/nova/scheduler/client/report.py#L2356 | 19:51 |
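In other words, something along these lines; a rough illustration of the "combine them" option written against a plain usages dict, not the real report client code:

    # Hypothetical helper: collapse placement usages so the existing "cores"
    # quota covers both shared and dedicated CPUs.
    def cores_usage(placement_usages):
        # placement_usages looks like {"VCPU": 3, "PCPU": 2, "MEMORY_MB": ...}
        return placement_usages.get("VCPU", 0) + placement_usages.get("PCPU", 0)

    assert cores_usage({"VCPU": 3, "PCPU": 2, "MEMORY_MB": 2048}) == 5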
dansmith | melwitt: ack | 19:51 |
sean-k-mooney | dansmith: not directly, that is a side effect | 19:51 |
dansmith | mriedem: if you combine them yes | 19:51 |
dansmith | mriedem: that's a hack to make it work, not the right solution | 19:52 |
mriedem | i know we dont want to combine them externally, but the non-placement counting method is going to be based on instances.vcpus which comes from the embedded instance.flavor.vcpus | 19:52 |
sean-k-mooney | dansmith: the follow up spec that wants to have a vm with some pinned cores and some floating cores needs them to be different things | 19:52 |
dansmith | mriedem: right but we should be moving to placement quotas anyway | 19:52 |
dansmith | sean-k-mooney: AFAIK, this work is aimed at making pcpus not just a per-host special case of pretending that they're like vcpus | 19:53 |
dansmith | sean-k-mooney: so, yeah, obviously separate quotas is not the goal of this, | 19:53 |
dansmith | but all of this is important to get right | 19:53 |
sean-k-mooney | yes | 19:53 |
sean-k-mooney | im not disputing that at all | 19:53 |
dansmith | otherwise people really need to segregate these hosts still, which means the rest of it doesn't get us anything | 19:53 |
sean-k-mooney | im not sure i agree on the last point but im not going to rathole on it either | 19:54 |
sean-k-mooney | i think this will change in U again with unified limits | 19:54 |
sean-k-mooney | so i would prefer to group them in Train to keep the pre-train behavior | 19:54 |
sean-k-mooney | and then only change the behavior once | 19:54 |
sean-k-mooney | in U when everything goes to unifed limits | 19:55 |
dansmith | lol | 19:55 |
dansmith | riiiight | 19:55 |
mriedem | i want to say making count_usage_from_placement=True by default was dependent on consumer types to smooth out some of the inconsistencies with the legacy counting method today in edge cases | 19:55 |
sean-k-mooney | the lol being ever getting to unified limits | 19:55 |
mriedem | like during a resize | 19:55 |
melwitt | unified limits is going to go through the same bake time that counting usage from placement is. it will begin defaulted to False | 19:56 |
dansmith | mriedem: I didn't think that's why we didn't make it default, because we'd still do the other thing for the non-placement-able resources | 19:56 |
dansmith | sean-k-mooney: I'm saying don't count your eggs before they hatch | 19:56 |
sean-k-mooney | ok | 19:56 |
dansmith | unified limits has been a long time coming, so.. | 19:56 |
*** gbarros has quit IRC | 19:57 | |
sean-k-mooney | ya i know. so are you opposed to internally in nova combining them for Train | 19:57 |
mriedem | dansmith: without digging up old review comments and ML thread conversations, i want to say my recollection was (1) counting usage from placement has at least 3 differences in behavior from legacy counting - which are documented in the config option help text and (2) the main benefit is for multi-cell deployments, of which there are few, | 19:57 |
mriedem | and (3) it landed late in stein, | 19:57 |
sean-k-mooney | when counting using placement | 19:57 |
mriedem | so to reduce the risk on non-multi-cell deployments with the behavior change, default to legacy counting | 19:57 |
mriedem | melwitt: ^ is that what you remember? | 19:58 |
melwitt | it landed early in train, the data migration landed late in stein | 19:58 |
dansmith | oh, I thought it was stein | 19:58 |
mriedem | i thought it was stein too... | 19:58 |
dansmith | in that case, can't default it in train | 19:58 |
* dansmith has to go to a thing | 19:58 | |
mriedem | https://review.opendev.org/#/c/638073/ it was train | 19:59 |
mriedem | time flies | 19:59 |
sean-k-mooney | well if we cant default it in train then we still need to support it in train with pcpu in placement right | 19:59 |
melwitt | mriedem: but yeah I think the combo of all those reasons were why we default to false | 19:59 |
melwitt | big delta from legacy counting and only multi-cell ppl likely to "need" it (for the down cell resilience). so we chose not to impose the big delta in behavior on the majority who don't need down cell resilience | 20:00 |
sean-k-mooney | so do we just want to add another line here https://github.com/openstack/nova/blob/f4ca3e70852c0a7ed7904a9f2d7177c9118d3d1c/nova/scheduler/client/report.py#L2356 to make it work by combining them? | 20:01 |
sean-k-mooney | *them being PCPUs | 20:01 |
melwitt | mriedem pointed that out earlier | 20:01 |
mriedem | to be fair though, some of those differences in behavior in counting also changed since i think ocata or pike | 20:01 |
mriedem | when we moved to counting in general and dropped reservations | 20:01 |
mriedem | and no one apparently noticed until we noticed in stein :) | 20:01 |
sean-k-mooney | yes im wondering if we should comment that on stephen's patch as the way forward so he can do it tomorrow | 20:01 |
melwitt | mriedem: I think there was only one change in pike, doesn't leave room for a revert resize, IIRC (I documented all of it on the review comment) | 20:02 |
mriedem | sean-k-mooney: i think one could say, "one way forward is ...." | 20:02 |
melwitt | the new stuff, there's way more deltas than that, a big laundry list | 20:02 |
mriedem | what is a laundry list anyway? 1. put stuff in washer and wash, 2. put stuff in dryer, 3. fold. wouldn't a more accurate list be a grocery list? | 20:03 |
sean-k-mooney | mriedem: ok but stephen isnt here and i wanted to provide a summary to him but ill just tell him to read scrollback in the morning | 20:03 |
melwitt | mriedem: hah | 20:03 |
melwitt | I dunno where that saying comes from actually | 20:03 |
mriedem | sean-k-mooney: whatever, that's better than nothing | 20:03 |
mriedem | https://english.stackexchange.com/questions/437507/what-is-the-origin-of-the-phrase-laundry-list | 20:04 |
melwitt | I still haven't wrapped my head around what "combining" the resource classes means and whether/how it's different from legacy counting | 20:04 |
mriedem | it wouldn't be different from legacy counting | 20:04 |
melwitt | I mean, I know it means VCPU + PCPU | 20:04 |
melwitt | ok | 20:04 |
mriedem | so i think the summary is, | 20:04 |
mriedem | 1. with legacy counting, pcpu is counted the same since we use instance.vcpus which comes from the flavor | 20:05 |
mriedem | 2. with placement counting, we'd have to combine VCPU and PCPU usage to match ^ | 20:05 |
dansmith | 3. neither are actually right | 20:05 |
mriedem | 3. long-term combining them goes against the goal of separate them as countable trackable resources | 20:05 |
mriedem | right | 20:05 |
mriedem | so the immediate question is, is #2 good enough for train with punting sorting out #3 to the future | 20:06 |
sean-k-mooney | so this is a bit of nova i dont really fully understand. melwitt is there anything we can do before unified limits to support not combining them if that is preferable | 20:07 |
sean-k-mooney | or do we just wait for unified limits to do #3 | 20:07 |
melwitt | I guess if it's not worse than what's possible with legacy counting (inability to separate) then the hack doesn't sound like a huge issue to me. it would obviously need a TODO on it to get rid of it when unified limit support exists and is enabled | 20:07 |
melwitt | sean-k-mooney: nothing reasonable, really. you'd have to add a new quota config and resource for "pcpu_whatever" and use that to let people set the limit separately. because until unified limits, there would be no way for nova to consume the unified limit for PCPU that a person sets in keystone | 20:09 |
sean-k-mooney | right and we dont really want to do that because tech debt | 20:10 |
efried | From where I sit (outside the swirling maelstrom of actual understanding) it sounds like we've had a pretty big imbalance "forever" where we've been counting dedicated and virtual cpus "the same" from a quota perspective. If so, we're not making it worse by continuing to do that, just now we're doing it with a separate resource class for the former. | 20:10 |
melwitt | yeah, that would be a big tech debt | 20:10 |
melwitt | whereas the hack of combining VCPU and PCPU doesn't sound like big debt to me, unless I've missed something more complex about it | 20:11 |
melwitt | "small debt" | 20:11 |
sean-k-mooney | it should be small to do i think. i assume its just the line mriedem linked but this code is new to me | 20:13 |
mriedem | plus functional testing | 20:13 |
sean-k-mooney | we need to handle this branch too https://github.com/openstack/nova/blob/f4ca3e70852c0a7ed7904a9f2d7177c9118d3d1c/nova/scheduler/client/report.py#L2344 | 20:13 |
mriedem | i'd want to see a functional test that creates a server that allocates PCPU inventory and then assert that cores usage for that tenant is incremented | 20:13 |
* melwitt nods | 20:14 | |
sean-k-mooney | i think stephen has a functional test for the first half so the quota check should not be too hard to add | 20:14 |
sean-k-mooney | ok so he has functional tests in the reshape patch at a minimum | 20:17 |
sean-k-mooney | ok yes he added more here https://review.opendev.org/#/c/671801/43/nova/tests/functional/libvirt/test_numa_servers.py | 20:18 |
sean-k-mooney | so thats probably where it makes sense to add the test | 20:18 |
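The assertion mriedem describes, reduced to a self-contained unit test over that hypothetical combination helper (the real thing would be a nova functional test that actually boots a pinned server):

    import unittest

    def cores_usage(usages):
        # same hypothetical "VCPU + PCPU" combination sketched earlier
        return usages.get("VCPU", 0) + usages.get("PCPU", 0)

    class TestCoresQuotaCountsPCPU(unittest.TestCase):
        def test_pcpu_counts_toward_cores(self):
            before = {"VCPU": 2, "PCPU": 0}   # one 2-vCPU floating server
            after = {"VCPU": 2, "PCPU": 1}    # plus a 1-PCPU pinned server
            self.assertEqual(cores_usage(after) - cores_usage(before), 1)

    if __name__ == "__main__":
        unittest.main()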
*** nweinber_ has joined #openstack-nova | 20:28 | |
*** spatel has joined #openstack-nova | 20:31 | |
*** spatel has quit IRC | 20:35 | |
*** ociuhandu has joined #openstack-nova | 20:37 | |
mriedem | artom: https://review.opendev.org/#/c/640021/48 | 20:41 |
mriedem | i probably won't get to the functional test patch tonight | 20:45 |
sean-k-mooney | mriedem: for what its worth i did test the [upgrade_levels]/compute=stein case previously | 20:46 |
sean-k-mooney | i can test it again tomorrow im more or less done for the day | 20:47 |
sean-k-mooney | but if either node has [upgrade_levels]/compute=stein we end up with the stein behavior | 20:47 |
sean-k-mooney | the migration succeeds as long as the cores used on the source host exist on the dest | 20:47 |
sean-k-mooney | but no xml is updated | 20:47 |
mriedem | so it goes back to the bug behavior | 20:47 |
mriedem | right? | 20:47 |
sean-k-mooney | yep | 20:48 |
sean-k-mooney | it goes back to the current master/stein behavior | 20:48 |
mriedem | yeah i don't know that we really need to bend over backward to try and detect that from conductor | 20:48 |
mriedem | if you're computes are fully upgraded and restarted and reporting as train service versions you should have also removed any manual pins | 20:49 |
sean-k-mooney | if you dont have can_live_migrate_numa set we will block it anyway | 20:49 |
mriedem | *your | 20:49 |
mriedem | not with the new checks | 20:49 |
mriedem | we would totally ignore can_live_migrate_numa | 20:49 |
mriedem | as long as all computes are reporting train | 20:49 |
sean-k-mooney | i think we have a min compute service check too before we ignore it | 20:50 |
sean-k-mooney | oh thats the compute service | 20:50 |
sean-k-mooney | not the RPC | 20:50 |
mriedem | if we pass that check we ignore the config | 20:50 |
mriedem | right | 20:50 |
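As a sketch of that decision (the version number, names and shape here are made up for illustration, they are not nova's actual conductor code), the check discussed above boils down to:

    NUMA_LM_SERVICE_VERSION = 40  # hypothetical minimum compute service version

    def decide_numa_live_migration(min_compute_service_version,
                                   workaround_enabled):
        """Sketch of the conductor-side decision; the workaround flag is the
        one referred to above as can_live_migrate_numa."""
        if min_compute_service_version >= NUMA_LM_SERVICE_VERSION:
            # All computes report Train or newer: do NUMA-aware live migration
            # and ignore the workaround config entirely. (An RPC pin via
            # [upgrade_levels]/compute=stein still degrades to the old
            # behaviour, which, per the discussion, isn't worth detecting here.)
            return "numa-aware"
        if workaround_enabled:
            return "legacy"   # old behaviour: migrates but the XML isn't updated
        return "blocked"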
sean-k-mooney | well ya i agree we likely dont need to bend over backwards to check this in the conductor | 20:51 |
sean-k-mooney | anyway my concentration is totally gone so ill call it a day. | 20:52 |
*** ociuhandu has quit IRC | 20:52 | |
sean-k-mooney | artom: ill test v49 tomorrow when you have addressed mriedem's comments | 20:52 |
*** ociuhandu has joined #openstack-nova | 20:53 | |
*** ociuhandu has quit IRC | 20:57 | |
*** luksky has quit IRC | 21:00 | |
mriedem | melwitt: brinzhang has been asking me to review this api change https://review.opendev.org/#/c/674243/ but i haven't had time with the bw provider migrate and numa lm stuff - maybe you can peruse it? | 21:10 |
artom | mriedem, ack, thanks! | 21:11 |
artom | (The func test can wait until after FF if that's what it comes to, right?) | 21:11 |
artom | I mean, it's not plan A, but... | 21:11 |
mriedem | my priority is sean's manual testing and the gate job | 21:13 |
mriedem | brinzhang: a follow up is not the correct answer to all issues https://review.opendev.org/#/c/679413/4 | 21:13 |
melwitt | mriedem: will do | 21:19 |
*** BjoernT has quit IRC | 21:20 | |
openstackgerrit | Dustin Cowles proposed openstack/nova master: Use SDK for add/remove instance info from node https://review.opendev.org/659691 | 21:25 |
openstackgerrit | Dustin Cowles proposed openstack/nova master: Use SDK for getting network metadata from node https://review.opendev.org/670213 | 21:25 |
*** hemna has joined #openstack-nova | 21:29 | |
*** henriqueof1 has joined #openstack-nova | 21:30 | |
*** henriqueof has quit IRC | 21:31 | |
*** hemna has quit IRC | 21:34 | |
*** mriedem has quit IRC | 21:37 | |
*** nweinber_ has quit IRC | 21:43 | |
*** aloga has quit IRC | 21:45 | |
openstackgerrit | Dustin Cowles proposed openstack/nova master: Use SDK for add/remove instance info from node https://review.opendev.org/659691 | 21:50 |
openstackgerrit | Dustin Cowles proposed openstack/nova master: Use SDK for getting network metadata from node https://review.opendev.org/670213 | 21:50 |
*** N3l1x has quit IRC | 21:50 | |
*** hemna has joined #openstack-nova | 21:57 | |
*** adriant has quit IRC | 21:59 | |
*** mriedem has joined #openstack-nova | 22:01 | |
*** slaweq has quit IRC | 22:21 | |
*** panda has quit IRC | 22:26 | |
*** panda has joined #openstack-nova | 22:28 | |
*** aloga has joined #openstack-nova | 22:29 | |
*** ociuhandu has joined #openstack-nova | 22:30 | |
*** ociuhandu has quit IRC | 22:35 | |
*** mriedem has quit IRC | 22:41 | |
*** brault has joined #openstack-nova | 22:43 | |
*** avolkov has quit IRC | 22:47 | |
*** hemna has quit IRC | 22:50 | |
*** TxGirlGeek has joined #openstack-nova | 22:57 | |
*** brault has quit IRC | 22:59 | |
*** rcernin has joined #openstack-nova | 22:59 | |
*** henriqueof1 has quit IRC | 23:02 | |
*** tkajinam has joined #openstack-nova | 23:03 | |
*** macz has quit IRC | 23:05 | |
*** slaweq has joined #openstack-nova | 23:11 | |
*** slaweq has quit IRC | 23:15 | |
*** spatel has joined #openstack-nova | 23:29 | |
*** TxGirlGeek has quit IRC | 23:30 | |
*** spatel has quit IRC | 23:34 | |
openstackgerrit | Merged openstack/nova master: api-ref: fix server topology "host_numa_node" field param name https://review.opendev.org/680775 | 23:34 |
*** mlavalle has quit IRC | 23:37 | |
*** adriant has joined #openstack-nova | 23:39 | |
*** mtreinish has quit IRC | 23:43 |