*** slaweq has joined #openstack-nova | 00:11 | |
*** mvkr has joined #openstack-nova | 00:12 | |
*** slaweq has quit IRC | 00:16 | |
*** eandersson has joined #openstack-nova | 00:16 | |
*** gyee has quit IRC | 00:16 | |
*** hshiina has joined #openstack-nova | 00:23 | |
*** pcaruana has quit IRC | 00:25 | |
*** hshiina has quit IRC | 00:27 | |
*** hshiina has joined #openstack-nova | 00:28 | |
*** brinzhang has joined #openstack-nova | 00:37 | |
*** tetsuro has joined #openstack-nova | 00:40 | |
*** k_mouza has quit IRC | 00:46 | |
*** Dinesh_Bhor has joined #openstack-nova | 01:03 | |
*** slaweq has joined #openstack-nova | 01:11 | |
*** erlon_ has quit IRC | 01:15 | |
*** slaweq has quit IRC | 01:16 | |
*** Dinesh_Bhor has quit IRC | 01:25 | |
*** Dinesh_Bhor has joined #openstack-nova | 01:31 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova master: Add description of custom resource classes https://review.openstack.org/616721 | 01:32 |
openstackgerrit | Takashi NATSUME proposed openstack/nova master: Fix a help string in nova-manage https://review.openstack.org/616723 | 01:50 |
openstackgerrit | Merged openstack/nova master: Harden placement init under wsgi https://review.openstack.org/610034 | 01:53 |
*** hamzy has joined #openstack-nova | 02:15 | |
*** hongbin has joined #openstack-nova | 02:42 | |
*** mrsoul has quit IRC | 02:46 | |
*** eharney has quit IRC | 02:54 | |
openstackgerrit | Merged openstack/nova master: Use SleepFixture instead of mocking _ThreadingEvent.wait https://review.openstack.org/615724 | 03:09 |
*** slaweq has joined #openstack-nova | 03:11 | |
*** slaweq has quit IRC | 03:15 | |
*** Dinesh_Bhor has quit IRC | 03:17 | |
openstackgerrit | 98k proposed openstack/os-traits master: Add python 3.6 unit test job https://review.openstack.org/616749 | 03:18 |
*** Dinesh_Bhor has joined #openstack-nova | 03:20 | |
*** tbachman has quit IRC | 03:27 | |
*** Dinesh_Bhor has quit IRC | 04:08 | |
*** janki has joined #openstack-nova | 04:36 | |
*** Dinesh_Bhor has joined #openstack-nova | 04:40 | |
*** openstackstatus has quit IRC | 04:59 | |
*** hongbin has quit IRC | 05:07 | |
*** slaweq has joined #openstack-nova | 05:11 | |
*** moshele has joined #openstack-nova | 05:12 | |
*** slaweq has quit IRC | 05:16 | |
*** moshele has quit IRC | 05:18 | |
*** openstack has joined #openstack-nova | 07:09 | |
*** ChanServ sets mode: +o openstack | 07:09 | |
*** dpawlik has joined #openstack-nova | 07:12 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova master: Fix server query examples https://review.openstack.org/616834 | 07:13 |
*** sahid has joined #openstack-nova | 07:18 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova master: Remove mox in unit/network/test_neutronv2.py (6) https://review.openstack.org/574113 | 07:20 |
openstackgerrit | Takashi NATSUME proposed openstack/nova master: Remove mox in unit/network/test_neutronv2.py (7) https://review.openstack.org/574974 | 07:20 |
openstackgerrit | Takashi NATSUME proposed openstack/nova master: Remove mox in unit/network/test_neutronv2.py (8) https://review.openstack.org/575311 | 07:20 |
*** pcaruana has joined #openstack-nova | 07:21 | |
*** dpawlik has quit IRC | 07:28 | |
*** dpawlik has joined #openstack-nova | 07:29 | |
*** dpawlik has quit IRC | 07:29 | |
*** dpawlik has joined #openstack-nova | 07:30 | |
*** slaweq has joined #openstack-nova | 07:45 | |
*** jangutter has quit IRC | 07:50 | |
*** jangutter has joined #openstack-nova | 07:50 | |
openstackgerrit | Brin Zhang proposed openstack/nova-specs master: Support admin to specify project to create snapshot https://review.openstack.org/616843 | 07:52 |
gibi | melwitt: I think there is only one patch left from use-nested-allocation-candidates that handles allocations that only use nested RPs and not the root RP https://review.openstack.org/#/c/608298/ | 07:57 |
gibi | melwitt: that happens to be lower prio for me as it's not needed for the GPU and Bandwidth work. It would be needed for the NUMA work though | 07:58 |
gibi | melwitt: so I think we can pull use-nested-allocation-candidates from runway | 08:01 |
*** takashin has left #openstack-nova | 08:03 | |
gibi | melwitt: I did pull that from the runway now on the etherpad | 08:03 |
melwitt | gibi: got it, thank you. great work on all of that btw. it is exciting | 08:05 |
openstackgerrit | Jeffrey Zhang proposed openstack/nova master: Add feature to flatten the volume from glance image snapshort https://review.openstack.org/616461 | 08:06 |
*** tssurya has joined #openstack-nova | 08:07 | |
gibi | melwitt: thanks for looking at the bandwidth patches. It was motivating for me to see them moving forward. | 08:07 |
gibi | melwitt: this week I finally resumed work on that series | 08:08 |
melwitt | np, it is really cool to see it all coming together now | 08:08 |
*** trident has quit IRC | 08:12 | |
*** trident has joined #openstack-nova | 08:14 | |
*** ralonsoh has joined #openstack-nova | 08:15 | |
*** helenaAM has joined #openstack-nova | 08:31 | |
*** sridharg has joined #openstack-nova | 08:54 | |
*** sridharg has quit IRC | 08:54 | |
*** sridharg has joined #openstack-nova | 08:55 | |
*** hshiina has quit IRC | 09:09 | |
*** tetsuro has quit IRC | 09:20 | |
*** panda|rover|off is now known as panda|rover | 09:24 | |
*** Dinesh_Bhor has quit IRC | 09:34 | |
*** derekh has joined #openstack-nova | 09:42 | |
*** k_mouza has joined #openstack-nova | 09:52 | |
*** k_mouza has quit IRC | 09:53 | |
*** k_mouza has joined #openstack-nova | 09:54 | |
*** Dinesh_Bhor has joined #openstack-nova | 09:57 | |
*** Dinesh_Bhor has quit IRC | 10:01 | |
*** ttsiouts has joined #openstack-nova | 10:05 | |
*** ttsiouts has quit IRC | 10:10 | |
*** ttsiouts has joined #openstack-nova | 10:11 | |
*** ttsiouts has quit IRC | 10:15 | |
*** ttsiouts has joined #openstack-nova | 10:20 | |
*** ttsiouts has quit IRC | 10:21 | |
*** ttsiouts has joined #openstack-nova | 10:22 | |
*** ttsiouts has quit IRC | 10:26 | |
*** ttsiouts has joined #openstack-nova | 10:30 | |
openstackgerrit | zhouxinyong proposed openstack/nova master: delete unavailable links https://review.openstack.org/616870 | 10:30 |
openstackgerrit | Surya Seetharaman proposed openstack/nova master: Make _instances_cores_ram_count() be smart about cells https://review.openstack.org/569055 | 10:31 |
openstackgerrit | Surya Seetharaman proposed openstack/nova master: Make _instances_cores_ram_count() be smart about cells https://review.openstack.org/569055 | 10:32 |
*** sambetts_ is now known as sambetts|afk | 10:33 | |
*** davidsha has joined #openstack-nova | 10:36 | |
openstackgerrit | zhouxinyong proposed openstack/nova master: delete unavailable links https://review.openstack.org/616870 | 10:52 |
*** maciejjozefczyk has quit IRC | 10:59 | |
*** ttsiouts has quit IRC | 10:59 | |
*** ttsiouts has joined #openstack-nova | 11:00 | |
*** maciejjozefczyk has joined #openstack-nova | 11:01 | |
*** maciejjozefczyk has joined #openstack-nova | 11:01 | |
*** tssurya has quit IRC | 11:04 | |
*** dpawlik has quit IRC | 11:08 | |
*** rodolof has joined #openstack-nova | 11:12 | |
*** k_mouza has quit IRC | 11:31 | |
*** k_mouza has joined #openstack-nova | 11:33 | |
*** k_mouza has quit IRC | 11:52 | |
*** dtantsur is now known as dtantsur|brb | 11:59 | |
*** k_mouza has joined #openstack-nova | 12:05 | |
*** brinzhang has quit IRC | 12:10 | |
*** janki has quit IRC | 12:11 | |
*** ondrejme has joined #openstack-nova | 12:13 | |
*** maciejjozefczyk has quit IRC | 12:15 | |
*** maciejjozefczyk has joined #openstack-nova | 12:16 | |
*** erlon has joined #openstack-nova | 12:29 | |
*** panda|rover is now known as panda|rover|lch | 12:31 | |
*** alexchadin has joined #openstack-nova | 12:39 | |
*** rodolof has quit IRC | 12:55 | |
openstackgerrit | zhouxinyong proposed openstack/nova master: modify the avaliable link https://review.openstack.org/616905 | 13:08 |
*** dtantsur|brb is now known as dtantsur | 13:09 | |
*** k_mouza has quit IRC | 13:11 | |
*** Dinesh_Bhor has joined #openstack-nova | 13:18 | |
*** k_mouza has joined #openstack-nova | 13:23 | |
openstackgerrit | Lee Yarwood proposed openstack/nova master: DNM WIP zuul: Add a lioadm based multiattach job https://review.openstack.org/616916 | 13:26 |
*** Dinesh_Bhor has quit IRC | 13:34 | |
*** sahid has quit IRC | 13:42 | |
*** sahid has joined #openstack-nova | 13:47 | |
*** mriedem has joined #openstack-nova | 14:08 | |
*** tbachman has joined #openstack-nova | 14:11 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Calculate port_id rp_uuid mapping for binding https://review.openstack.org/616239 | 14:18 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Pass allocations and traits to neturonv2 api https://review.openstack.org/616240 | 14:18 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Send RP uuid in the port binding https://review.openstack.org/569459 | 14:18 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Test boot with more ports with bandwidth request https://review.openstack.org/573317 | 14:18 |
openstackgerrit | Ivaylo Mitev proposed openstack/nova master: VMware: Attach volumes using adapter type from instance https://review.openstack.org/616599 | 14:18 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: libvirt: change "Ignoring supplied device name" warning to info https://review.openstack.org/616952 | 14:19 |
*** davidsha has quit IRC | 14:32 | |
*** eharney has joined #openstack-nova | 14:34 | |
*** liuyulong has joined #openstack-nova | 14:35 | |
*** eharney has quit IRC | 14:54 | |
*** tssurya has joined #openstack-nova | 14:57 | |
*** maciejjozefczyk has quit IRC | 15:00 | |
*** lbragstad has quit IRC | 15:02 | |
*** lbragstad has joined #openstack-nova | 15:03 | |
*** maciejjozefczyk has joined #openstack-nova | 15:07 | |
*** maciejjozefczyk has quit IRC | 15:07 | |
*** maciejjozefczyk has joined #openstack-nova | 15:09 | |
*** alexchadin has quit IRC | 15:10 | |
*** k_mouza has quit IRC | 15:16 | |
*** hongbin has joined #openstack-nova | 15:18 | |
*** artom has quit IRC | 15:21 | |
*** artom has joined #openstack-nova | 15:21 | |
*** k_mouza has joined #openstack-nova | 15:24 | |
*** maciejjozefczyk has quit IRC | 15:33 | |
*** maciejjozefczyk has joined #openstack-nova | 15:34 | |
*** maciejjozefczyk has quit IRC | 15:34 | |
*** maciejjozefczyk has joined #openstack-nova | 15:34 | |
*** maciejjozefczyk has quit IRC | 15:35 | |
*** maciejjozefczyk has joined #openstack-nova | 15:35 | |
*** rnoriega has quit IRC | 15:36 | |
*** maciejjozefczyk has quit IRC | 15:36 | |
*** rnoriega has joined #openstack-nova | 15:36 | |
mriedem | aspiers: done https://review.openstack.org/#/c/609779/ | 15:43 |
mriedem | i'm really not thrilled at the amount of technical debt we'll be taking on if we add this | 15:43 |
mriedem | but such is life i suppose | 15:43 |
mriedem | by technical debt i mean "oh i can create and destroy a sev-enabled vm, but that's all i can do with it" | 15:43 |
openstackgerrit | Jack Ding proposed openstack/nova master: Add cache=none option for qemu-img convert https://review.openstack.org/616692 | 15:45 |
*** spatel has joined #openstack-nova | 15:48 | |
spatel | sean-k-mooney: Howdy!!! | 15:50 |
spatel | Morning | 15:50 |
spatel | https://bugs.launchpad.net/nova/+bug/1792763 | 15:50 |
openstack | Launchpad bug 1792763 in OpenStack Compute (nova) "tap TX packet drops during high cpu load " [Undecided,Invalid] | 15:50 |
spatel | Yes, this can be resolved or closed.. because it's a design question (I won't say it's a BUG) | 15:51 |
*** bnemec is now known as beekneemech | 15:53 | |
sean-k-mooney | ya the drops you were seeing were a limitation of linux bridge as you said, so this is not something nova can fix | 15:53 |
sean-k-mooney | spatel: the kernel can only handle about 1.4mpps on a 3.4GHz cpu, and generally it's less than that | 15:54 |
sean-k-mooney | once you exceed that level you get drops. | 15:54 |
spatel | In my test after 50kpps i was seeing TX drop on tap interface | 15:55 |
*** eharney has joined #openstack-nova | 15:56 | |
spatel | I think this is an issue of the tap interface design; it runs in kernel space, which is overhead on the kernel | 15:56 |
spatel | virtio i meant | 15:56 |
sean-k-mooney | to get to 1.4 you need kernel ovs with the kernel vhost module to accelerate it | 15:56 |
sean-k-mooney | spatel: yes so its not something openstack can remedy | 15:57 |
spatel | ++ | 15:57 |
spatel | In my case i am using linuxbridge (not ovs) | 15:57 |
spatel | do you think OVS is better in performance compare to linuxbridge ? | 15:58 |
sean-k-mooney | yes it is | 15:58 |
sean-k-mooney | bar multicast tunnelling | 15:58 |
sean-k-mooney | if you have a multicast-heavy workload use linux bridge, as ovs falls back to unicast | 15:58 |
sean-k-mooney | but in general ovs outperforms linux bridge in vm based workloads | 15:59 |
spatel | when you say multicast what is the relation here? | 15:59 |
*** jistr is now known as jistr|call | 16:00 | |
sean-k-mooney | linux bridge supports using multicast endpoints for tenant networks, meaning it can more efficiently handle tenant traffic with a high proportion of broadcast or multicast traffic | 16:00 |
sean-k-mooney | ovs does not, and has to fall back to a unicast mesh topology for vxlan | 16:01 |
sean-k-mooney | but for typical workloads ovs will outperform linux bridge | 16:01 |
dansmith | mriedem: I threw a comment in there about using sysmeta to let virt drivers declare some ops as invalid for an instance. is there some reason that's not reasonable? | 16:04 |
*** burt has joined #openstack-nova | 16:04 | |
dansmith | presumably 403 is allowed for pretty much any operation on any microversion, so I would think it'd be not a huge deal, and immediately applies to existing operations in certain situations | 16:04 |
*** jangutter has quit IRC | 16:07 | |
*** Luzi has quit IRC | 16:07 | |
*** janki has joined #openstack-nova | 16:09 | |
mriedem | dansmith: yeah, replied | 16:09 |
mriedem | it really goes back to the capabilities thing we've discussed several times before | 16:09 |
mriedem | i'm mostly concerned about snapshot, because if you can't move the instance, users are at least going to want to be able to snapshot it i'd think before it has to be destroyed and recreated elsewhere because the compute it's on is going away | 16:11 |
mriedem | of course this is where someone says, "just attach a data volume and rewrite the application to use that" | 16:12 |
spatel | sean-k-mooney: thanks for clear that point.. :) i have all unicast workload | 16:12 |
*** lbragstad has quit IRC | 16:14 | |
dansmith | mriedem: did he say snapshot wasn't supported? I would think it would be | 16:15 |
*** lbragstad has joined #openstack-nova | 16:15 | |
dansmith | the airplane wifi is sucking too hard for me to even open it again | 16:16 |
mriedem | it wasn't mentioned | 16:16 |
mriedem | that's why i asked, because it sure seems like a lot can't be supported | 16:16 |
*** jistr|call is now known as jistr | 16:17 | |
*** imacdonn has quit IRC | 16:18 | |
mriedem | we got a bug b/c of the limit of tenant ids for the aggregate multitenancy isolation filter, that's resolved with the placement request filter, but doesn't look like the docs for the placement filter mention you can namespace the metadata so you can add as many tenants as you want | 16:18 |
dansmith | mriedem: the suspend/resume and live migration are about in-memory state, which is why they're hard to support I think | 16:18 |
dansmith | snapshot, reboot, cold migrate should all be fine I would think | 16:18 |
dansmith | based on my reading and assumptions about how this works | 16:18 |
*** cfriesen has joined #openstack-nova | 16:18 | |
dansmith | mriedem: hmm, I was sure I put that in there | 16:19 |
mriedem | don't see it, i can push up something for that | 16:19 |
dansmith | okay | 16:20 |
mriedem | and i'll probably update the docs for the old filter to mention the limitation (and link to the bug) and say the placement one is a better replacement | 16:20 |
dansmith | ack | 16:21 |
dansmith | did I have it in the commit message or something? | 16:21 |
dansmith | I was sure I wrote words about this | 16:21 |
*** etp has quit IRC | 16:21 | |
*** gyee has joined #openstack-nova | 16:22 | |
*** tssurya has quit IRC | 16:22 | |
mriedem | https://review.openstack.org/#/c/545002/27 "This also allows making this filter advisory but not required, and supports multiple tenants per aggregate, unlike the original filter." | 16:22 |
mriedem | maybe that | 16:22 |
dansmith | nova with the ``filter_tenant_id`` key (optionally suffixed with any string for | 16:24 |
dansmith | multiple tenants, | 16:24 |
dansmith | https://review.openstack.org/#/c/557490/8/releasenotes/notes/tenant_aggregate_placement_filter-c2fed8889f43b6e3.yaml | 16:24 |
dansmith | in the reno not hte docs | 16:24 |
dansmith | tha's mah bad | 16:25 |
mriedem | k, i'll copy that | 16:25 |
*** etp has joined #openstack-nova | 16:27 | |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Calculate port_id rp_uuid mapping for binding https://review.openstack.org/616239 | 16:28 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Pass allocations and traits to neturonv2 api https://review.openstack.org/616240 | 16:28 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Send RP uuid in the port binding https://review.openstack.org/569459 | 16:28 |
openstackgerrit | Balazs Gibizer proposed openstack/nova master: Test boot with more ports with bandwidth request https://review.openstack.org/573317 | 16:29 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Mention meta key suffix in tenant isolation with placement docs https://review.openstack.org/616991 | 16:33 |
mriedem | sean-k-mooney: is it just me or is NeutronLinuxBridgeInterfaceDriver completely replaced with os-vif now? | 16:38 |
sean-k-mooney | ill need to check but probably | 16:41 |
mriedem | i think it would only be used via the linuxnet_interface_driver config option, but i don't see anything with neutron in nova using the code path that hits that option | 16:42 |
mriedem | only the nova-network l3 stuff | 16:42 |
sean-k-mooney | we can likely kill it when we kill nova networks | 16:42 |
mriedem | sure but this is a neutron-specific driver | 16:42 |
sean-k-mooney | mriedem: ill look into it next week while and see what its actually used for, but you're right | 16:43 |
sean-k-mooney | * while ye are at the summit | 16:44 |
*** spatel has quit IRC | 16:44 | |
cfriesen | stephenfin: question about your commit https://review.openstack.org/#/c/526329 One of our guys says he ran into a scenario in pike where "image_chunks" itself was None due to things like firewall breakage or server-side problems. Does that get handled properly currently? | 16:45 |
openstackgerrit | Merged openstack/nova stable/pike: Fix the request context in ServiceFixture https://review.openstack.org/599839 | 16:46 |
openstackgerrit | Merged openstack/nova stable/pike: Add functional test for affinity with multiple cells https://review.openstack.org/599840 | 16:46 |
*** k_mouza has quit IRC | 16:46 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Delete NeutronLinuxBridgeInterfaceDriver https://review.openstack.org/616995 | 16:54 |
*** spatel has joined #openstack-nova | 16:54 | |
*** helenaAM has quit IRC | 16:59 | |
*** dtantsur is now known as dtantsur|afk | 17:00 | |
*** hongbin has quit IRC | 17:02 | |
cfriesen | mriedem: regarding the disk sector size issue...are you aware of any 8K sector disks or did you suggest it for future expansion? | 17:03 |
stephenfin | cfriesen: Based on that as-is, it would not | 17:03 |
stephenfin | cfriesen: Though I'd have expected to see an exception raised by the client, more so than anything else | 17:04 |
mriedem | cfriesen: was just suggesting based on what was noted in the bug report | 17:06 |
sean-k-mooney | im not aware of any 8k sector discs, but i believe we can also discover the sector size by querying the disk via sysfs, so we probably dont need to hardcode it | 17:07 |
sean-k-mooney | that said 4k and 512 are the most common | 17:07 |
cfriesen | I thought that 4K was still the highest supported physical sector size (since you'd want to be able to read a whole disk sector into a memory page) | 17:07 |
dansmith | not everyone uses 4k pages :) | 17:08 |
sean-k-mooney | power pc i think is 16k | 17:08 |
dansmith | I thought there were some SAN types that used larger sector sizes just because of the network optimization, | 17:08 |
dansmith | even if not backed by actual 8k | 17:08 |
cfriesen | as far as I know the block size can be different from the sector size | 17:09 |
dansmith | also, netapp I think uses some super odd sizes, even to the point of having weirdly low-level-formatted drives for them | 17:09 |
sean-k-mooney | i know some raid controls can be configured to exposed larger sector sizes but i dont know how common that is anymore | 17:09 |
dansmith | yep | 17:09 |
sean-k-mooney | mriedem: i think one of the things you suggested was just making a config option correct | 17:10 |
sean-k-mooney | something like directio_sector_sizes=512,4096 | 17:11 |
dansmith | ah, but it's hidden to the LUN: https://kb.netapp.com/app/answers/answer_view/a_id/1001353/~/how-can-the-bytes%2Fsector-be-changed-in-a-luns-geometry%3F- | 17:13 |
cfriesen | the goal here is to figure out if the filesystem supports O_DIRECT. according to the man page, this should be set to the logical block size of the underlying storage, which can be determined using the ioctl() BLKSSZGET operation or by calling "blockdev --getss" | 17:14 |
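The `BLKSSZGET` query cfriesen mentions can be sketched in a few lines (an illustrative sketch, not nova code; the ioctl constant is taken from Linux's `<linux/fs.h>` and the function name is made up):

```python
import fcntl
import struct

# _IO(0x12, 104) from <linux/fs.h>: logical sector size in bytes
BLKSSZGET = 0x1268


def logical_block_size(device_path):
    """Return the logical sector size of a block device, equivalent
    to running `blockdev --getss <device>`."""
    with open(device_path, 'rb') as dev:
        buf = fcntl.ioctl(dev.fileno(), BLKSSZGET, struct.pack('I', 0))
        return struct.unpack('I', buf)[0]
```

Calling this needs read access to the block device node (e.g. `/dev/sda`), which is why nova would have to do it via privsep.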
sean-k-mooney | cfriesen: i think part of the issue is that on older kernels (< 2.4) O_DIRECT required aligned access | 17:16 |
sean-k-mooney | but on bsd and newer linux kernels O_DIRECT did not require aligned access | 17:16 |
sean-k-mooney | cfriesen: im not actually sure if we need to do the alignment check we are doing anymore | 17:17 |
openstackgerrit | Merged openstack/nova stable/queens: fixtures: Track volume attachments within CinderFixtureNewAttachFlow https://review.openstack.org/612494 | 17:17 |
cfriesen | the man page says: Under Linux 2.4, transfer sizes, and the alignment of the user buffer | 17:17 |
cfriesen | and the file offset must all be multiples of the logical block size | 17:17 |
cfriesen | of the filesystem. Since Linux 2.6.0, alignment to the logical block | 17:17 |
cfriesen | size of the underlying storage (typically 512 bytes) suffices. | 17:17 |
cfriesen | oh, ick. sorry | 17:17 |
openstackgerrit | Merged openstack/nova stable/queens: Add regression test for bug#1784353 https://review.openstack.org/612495 | 17:17 |
sean-k-mooney | cfriesen: ok so we still need aligned access, but the function is meant to determine if directio is possible; what it's currently doing is determining if direct 512B-aligned access is possible, which is a different thing | 17:19 |
cfriesen | agreed. I think switching to 4K would cover 95% of the cases. | 17:19 |
cfriesen | doing it totally correctly would require querying the block size from the OS for that specific device | 17:19 |
mriedem | sean-k-mooney: i suggested a config option as an option because this sounds very hit or miss | 17:20 |
sean-k-mooney | well the backing store for libvirt instance is only going to be on one mountpoint | 17:21 |
mriedem | this is definitely not something i've got a lot of experience in though | 17:21 |
sean-k-mooney | presumable it will all have the same alignment/sector size so we could jsut have a single valus and defualt it to 512 and they could set it to 4k or 8k if they have something else | 17:22 |
sean-k-mooney | the other option is just super over align to like a 64K bondary | 17:23 |
sean-k-mooney | that said im sure someone will have a 128K lun now that i have said that | 17:23 |
cfriesen | so guaranteed setting it to 4K will work for both 4K and 512b disks, so I think 4K should be the default | 17:24 |
sean-k-mooney | cfriesen: its also the most common sector size on most new disks so ya that should work | 17:24 |
dansmith | for years now | 17:25 |
cfriesen | I'd be okay with a config option if someone has weird hardware | 17:25 |
sean-k-mooney | cfriesen: so you're going to submit a patch :) | 17:26 |
cfriesen | there's already a patch in progress | 17:26 |
cfriesen | by someone else | 17:26 |
mriedem | wee https://bugs.launchpad.net/nova/+bug/1798688 | 17:27 |
openstack | Launchpad bug 1798688 in OpenStack Compute (nova) "AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim" [Undecided,Triaged] | 17:27 |
mriedem | looks like our scheduler allocation claim races have shot up since nov 4 | 17:27 |
cfriesen | mriedem: did you want to get the person to update the 4K patch to add a config option? or change the hardcoded number to 4K (which is still an improvement) and add the config option later if someone complains? | 17:28 |
dansmith | mriedem: what does that mean exactly? just the scheduler having to retry the allocation part? | 17:29 |
mriedem | dansmith: we already retry on PUT allocations in the scheduler | 17:29 |
mriedem | maybe the error message changed and we're not retrying properly now? | 17:29 |
mriedem | i haven't dug in yet | 17:29 |
mriedem | cfriesen: the patch already just changes 512 to 4k right? | 17:29 |
mriedem | cfriesen: i believe i just said we might want a 'fixes' release note for it | 17:30 |
mriedem | as a heads up | 17:30 |
melwitt | cfriesen: this is the method we have to checking for directio, if that's the same thing you mentioned earlier https://github.com/openstack/nova/blob/master/nova/privsep/utils.py#L34 | 17:30 |
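The check melwitt links has roughly the following shape (a simplified sketch of nova's supports-direct-io probe, not the actual code, with the 4096 default that is under discussion; the test filename is made up):

```python
import errno
import mmap
import os

ALIGN_SIZE = 4096  # proposed default; the current code hardcodes 512


def supports_direct_io(dirpath, align=ALIGN_SIZE):
    """Probe whether files under dirpath accept O_DIRECT writes by
    attempting a single aligned write.  EINVAL means no O_DIRECT
    support (or a larger required alignment); anything else is a
    real error and is re-raised."""
    testfile = os.path.join(dirpath, '.directio.test')
    fd = None
    try:
        fd = os.open(testfile, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
        # O_DIRECT requires the buffer, length and offset to be
        # aligned to the logical block size; an anonymous mmap gives
        # page-aligned memory, which satisfies any align <= page size.
        buf = mmap.mmap(-1, align)
        os.write(fd, buf)
        return True
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return False
        raise
    finally:
        if fd is not None:
            os.close(fd)
        if os.path.exists(testfile):
            os.unlink(testfile)
```

As the man page note above says, a write aligned to 4096 is also aligned to 512, so raising the probe size only risks false negatives on storage that genuinely requires a still larger alignment.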
dansmith | mriedem: I know we do, I'm wondering if you mean it's just having to retry more lately or if it's failing | 17:30 |
mriedem | haven't dug into the logs yet | 17:30 |
mriedem | hopefully we log if we are retrying | 17:30 |
sean-k-mooney | mriedem: ya i was under the impression we had planned at least to retry in this case, which is why we have the generation on the resource providers in the first place | 17:30 |
*** markvoelker has quit IRC | 17:30 | |
dansmith | sean-k-mooney: we do retry | 17:31 |
mriedem | the scheduler doesn't do anything with generations for this as far as i know | 17:31 |
mriedem | just duplicated another bug in triage to this if someone is looking for work https://bugs.launchpad.net/nova/+bug/1783338 | 17:31 |
openstack | Launchpad bug 1783338 in OpenStack Compute (nova) "Unexpected exception in API method: ValueError: year is out of range" [Medium,Confirmed] - Assigned to Ghanshyam Mann (ghanshyammann) | 17:31 |
mriedem | something in the simple tenant usage code | 17:31 |
cfriesen | mriedem: in irc you were talking about a config option I thought. but yeah, I'd be cool with just a release note for now. | 17:32 |
*** imacdonn has joined #openstack-nova | 17:32 | |
mriedem | gmann: looks like https://bugs.launchpad.net/nova/+bug/1783338 was due to bad data in the db? you're probably traveling, but if you don't plan on handling this we should unassign you https://bugs.launchpad.net/nova/+bug/1783338 | 17:32 |
openstack | Launchpad bug 1783338 in OpenStack Compute (nova) "Unexpected exception in API method: ValueError: year is out of range" [Medium,Confirmed] - Assigned to Ghanshyam Mann (ghanshyammann) | 17:32 |
cfriesen | melwitt: yes, that's the one. there's a patch in review to change the 512 to 4096 in there. which is good, but maybe not sufficient for exotic hardware | 17:32 |
mriedem | i'm mriedem | 17:33 |
melwitt | I said something to him earlier | 17:33 |
melwitt | cfriesen: ok, cool. *looks for the patch* | 17:34 |
mriedem | oh missed that | 17:34 |
cfriesen | melwitt: https://review.openstack.org/#/c/616580 | 17:34 |
melwitt | probably because our nicks blend together. maybe I need to be jgwentworth all the time | 17:34 |
mriedem | dansmith: looks like, from the logs, that we're not retrying | 17:35 |
dansmith | mriedem: maybe something changed recently then? | 17:36 |
*** derekh has quit IRC | 17:37 | |
melwitt | cfriesen: ok, so looks like trying to decide which value to use for the check | 17:39 |
*** sahid has quit IRC | 17:40 | |
mriedem | dansmith: my guess would be https://review.openstack.org/#/c/583667/ | 17:40 |
mriedem | because the scheduler logs are saying we're doing a double up allocation | 17:40 |
mriedem | Nov 06 19:48:36.969356 ubuntu-xenial-inap-mtl01-0000379614 nova-scheduler[12154]: DEBUG nova.scheduler.client.report [None req-f266a0ff-2840-413d-9877-4500e61512f5 tempest-ServersNegativeTestJSON-477704048 tempest-ServersNegativeTestJSON-477704048] Doubling-up allocation_request for move operation. {{(pid=13677) _move_operation_alloc_request /opt/stack/nova/nova/scheduler/client/report.py:203}} | 17:40 |
mriedem | but in this test, we're just unshelving a shelved offloaded server | 17:41 |
mriedem | so that shouldn't really double up any allocations | 17:41 |
cfriesen | melwitt: basically, yes. 4096 would work for the vast majority of systems | 17:42 |
mriedem | Nov 06 19:48:36.969659 ubuntu-xenial-inap-mtl01-0000379614 nova-scheduler[12154]: DEBUG nova.scheduler.client.report [None req-f266a0ff-2840-413d-9877-4500e61512f5 tempest-ServersNegativeTestJSON-477704048 tempest-ServersNegativeTestJSON-477704048] New allocation_request containing both source and destination hosts in move operation: {'allocations': {u'3ceb7eab-549c-40ba-a70c-320822c310ab': {'resources': {u'VCPU': 2, u'MEMORY_MB': 128}}}} {{(pid=13677) _move_operation_alloc_request /opt/stack/nova/nova/scheduler/client/report.py:234}} | 17:42 |
dansmith | mriedem: it also touches the code near where we raise retry... | 17:42 |
mriedem | ^ is definitely wrong | 17:42 |
mriedem | there is only one provider in that log | 17:43 |
dansmith | mriedem: it seems to specifically exclude the consumer generation conflict from the case where we retry | 17:43 |
dansmith | mriedem: do you see "another process changed the consumer" in the log? | 17:46 |
mriedem | yes | 17:46 |
dansmith | then it's hitting that consumer case and not retrying | 17:46 |
mriedem | that's why we don't retry | 17:46 |
dansmith | yeah | 17:46 |
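The non-retried path being described can be sketched as follows (a hypothetical illustration of the control flow, not nova's actual report client; `session`, the string matching, and the backoff are all made up for the sketch):

```python
import time

CONCURRENT_UPDATE = 'placement.concurrent_update'


def put_allocations(session, consumer_uuid, payload, retries=3):
    """PUT allocations, retrying only on provider-generation
    conflicts.  A consumer-generation conflict ("another process
    changed the consumer") falls through without a retry, which is
    the behavior being observed in the scheduler logs."""
    resp = None
    for attempt in range(retries):
        resp = session.put('/allocations/%s' % consumer_uuid,
                           json=payload)
        if resp.status_code != 409:
            return resp
        if 'another process changed the consumer' in resp.text:
            return resp  # consumer generation conflict: not retried
        if CONCURRENT_UPDATE in resp.text:
            time.sleep(0.1 * (attempt + 1))  # provider conflict: retry
            continue
        return resp  # some other 409: give up
    return resp
```

The design question in the conversation is exactly whether the consumer-conflict branch should instead re-read the consumer generation and retry like the provider branch does.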
mriedem | i also don't know why it thinks we're starting with existing allocations for a shelved offloaded server | 17:47 |
dansmith | and that's causing it to try to double? | 17:47 |
mriedem | well, it goes into _move_operation_alloc_request but doesn't actually double anything | 17:48 |
mriedem | Nov 06 19:48:36.969659 ubuntu-xenial-inap-mtl01-0000379614 nova-scheduler[12154]: DEBUG nova.scheduler.client.report [None req-f266a0ff-2840-413d-9877-4500e61512f5 tempest-ServersNegativeTestJSON-477704048 tempest-ServersNegativeTestJSON-477704048] New allocation_request containing both source and destination hosts in move operation: {'allocations': {u'3ceb7eab-549c-40ba-a70c-320822c310ab': {'resources': {u'VCPU': 2, u'MEMORY_MB': 128}}}} {{(pid=13677) _move_operation_alloc_request /opt/stack/nova/nova/scheduler/client/report.py:234}} | 17:48 |
mriedem | there is only one provider in that body | 17:48 |
mriedem | this is the error from placement | 17:51 |
mriedem | Nov 06 19:48:37.013780 ubuntu-xenial-inap-mtl01-0000379614 nova-scheduler[12154]: WARNING nova.scheduler.client.report [None req-f266a0ff-2840-413d-9877-4500e61512f5 tempest-ServersNegativeTestJSON-477704048 tempest-ServersNegativeTestJSON-477704048] Failed to save allocation for 6665f00a-dcf1-4286-b075-d7dcd7c37487. Got HTTP 409: {"errors": [{"status": 409, "request_id": "req-c9ba6cbd-3b6e-4e5d-b550-9588be8a49d2", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 1 ", "title": "Conflict"}]} | 17:51 |
*** ralonsoh has quit IRC | 17:51 | |
*** ttsiouts has quit IRC | 17:51 | |
mriedem | idk wtf is going on, but i see 3 different PUT allocations in the placement logs for that consumer | 17:58 |
mriedem | first is probably for the initial scheduler, and then we offload and remove allocations | 17:58 |
mriedem | 2nd is for the unshelve | 17:59 |
*** ivve has joined #openstack-nova | 18:00 | |
mriedem | aha | 18:07 |
mriedem | the allocation delete on unshelve changed with this patch https://review.openstack.org/#/c/591597/ | 18:07 |
mriedem | so we no longer actually delete allocations, we PUT {} | 18:07 |
sean-k-mooney | quick question. in what cases does nova update the network info cache? | 18:07 |
*** Swami has joined #openstack-nova | 18:07 | |
sean-k-mooney | i know it does it in response to neutron notifications. there is also a periodic heal task right | 18:08 |
sean-k-mooney | is that it? | 18:08 |
mriedem | in all cases | 18:08 |
mriedem | attach vifs | 18:08 |
mriedem | etc | 18:08 |
mriedem | dansmith: yeah so there are 3 PUTs for allocations, 1st for initial schedule, 2nd for shelve offload (PUT /allocations with {}) and then the 3rd is scheduling during unshelve | 18:08 |
mriedem | since the allocations aren't deleted on shelve offload, the consumer must persist in placement with its existing consumer generation | 18:09 |
mriedem | which the scheduler in the 3rd PUT doesn't account for | 18:09 |
dansmith | okay | 18:09 |
dansmith | the scheduler assumes that the consumer is gone? | 18:09 |
dansmith | I'm surprised it would care, | 18:09 |
dansmith | because it doesn't know if we're doing an initial boot or a move right? | 18:10 |
mriedem | it would know if we're doing a move if the consumer already has allocations elsewhere | 18:10 |
dansmith | right, but otherwise it doesn't, | 18:10 |
dansmith | and in this case there are no remaining allocations right? | 18:11 |
mriedem | well, | 18:11 |
dansmith | or are you saying it assumes that if you have no allocations the consumer can't exist? | 18:11 |
*** sridharg has quit IRC | 18:11 | |
mriedem | that's not what the scheduler thinks, because it goes down that _move_operation_alloc_request path | 18:11 |
mriedem | maybe the tempest test isn't really waiting for the instance to be fully shelved offloaded before it unshelves | 18:11 |
mriedem | ah indeed, | 18:11 |
mriedem | we set the instance vm_state to SHELVED_OFFLOADED *before* we remove allocations | 18:12 |
dansmith | I guess I'm still surprised it cares | 18:12 |
mriedem | which is definitely a race | 18:12 |
*** zul has quit IRC | 18:12 | |
dansmith | why? | 18:16 |
dansmith | just because it signals to tempest that it's done? | 18:16 |
dansmith | if we didn't, | 18:17 |
dansmith | and we crashed right between deleting the allocations and marking it as offloaded we'd have lost some information I would think | 18:17 |
mriedem | well, this also doesn't seem right | 18:17 |
mriedem | https://review.openstack.org/#/c/591597/8/nova/scheduler/client/report.py@2091 | 18:17 |
mriedem | "# removing all resources from the allocation will auto delete the | 18:17 |
mriedem | # consumer in placement" | 18:17 |
dansmith | I guess maybe your point is that it's a race between starting the unshelve and there still being allocations... | 18:18 |
mriedem | maybe that is happening, i'm not sure | 18:18 |
mriedem | correct | 18:18 |
mriedem | b/c i'm seeing this being true during unshelve https://github.com/openstack/nova/blob/e27905f482ba26d2bbf3ae5d948dee37523042d5/nova/scheduler/client/report.py#L1824 | 18:18 |
dansmith | not sure that's better either way | 18:18 |
mriedem | which shouldn't be the case | 18:18 |
*** k_mouza has joined #openstack-nova | 18:19 | |
dansmith | it's certainly possible that the move to PUT{} from DELETE didn't bring over some "and delete the consumer" part | 18:19 |
dansmith | but again, I'm not sure why it should matter to the scheduler that it exists | 18:20 |
dansmith | although... | 18:20 |
dansmith | we have no api for looking at the consumer to get the generation if it already exists, IIRC | 18:20 |
dansmith | so maybe without seeing an allocation, and not being able to see the consumer directly, we have no alternative? | 18:20 |
mriedem | right, it's supposed to come back on the GET /allocations/{consumer} call | 18:20 |
dansmith | I expect jaypipes to pop in here any second and say "ah hah!" | 18:20 |
mriedem | jaypipes is busy chefing it up | 18:21 |
dansmith | his fave | 18:21 |
mriedem | and getting ready to sleep for a week while the rest of us are in berlin | 18:21 |
dansmith | lucky bastard | 18:21 |
dansmith | I'm going to be landing pretty soon, FYI | 18:21 |
mriedem | i'm overdue for lunch as well | 18:22 |
mriedem | so, can't claim_resources in the scheduler still just retry if it hits that consumer generation conflict? | 18:22 |
dansmith | well, | 18:22 |
dansmith | not if it doesn't know what the generation is | 18:23 |
dansmith | that's what I was saying.. it might not be able to find out what it is, | 18:23 |
dansmith | with no consumer api and no existing allocation to look at | 18:23 |
mriedem | i would expect GET /allocations/{consumer_uuid} to return the consumer generation | 18:23 |
mriedem | even if allocations are {} | 18:23 |
dansmith | with no alloc records? | 18:24 |
dansmith | I dunno | 18:24 |
mriedem | but i guess i'd have to dig into the placement code | 18:24 |
dansmith | I would expect that code returns 404 if none come back, | 18:24 |
*** k_mouza has quit IRC | 18:24 | |
mriedem | should probably also log in placement when the consumer is deleted b/c allocations went to 0 | 18:24 |
dansmith | because it would only get the consumer through the join, or afterwards I would expect | 18:24 |
mriedem | no it doesn't 404, you get {"allocations": {}} | 18:24 |
mriedem | if there are no allocations for the consumer | 18:24 |
dansmith | oh? | 18:24 |
mriedem | yeah it's confusing | 18:24 |
dansmith | that seems supremely weird to me, but okay | 18:25 |
mriedem | what does taylor think about all this? | 18:25 |
dansmith | she's busy with her own work | 18:26 |
jaypipes | mriedem: fuck chef. fuck ansible. fuck docker. it's all a bunch of complete assbaggery. | 18:26 |
dansmith | my little mobile wifi router lets us share the same crappy airline wifi, so after that came online, I might as well not be sitting next to her | 18:26 |
mriedem | jaypipes: but salt?! | 18:26 |
* jaypipes reads back to see something about consumers. | 18:26 | |
dansmith | jaypipes: you're gonna love it | 18:27 |
sean-k-mooney | mriedem: i think jaypipes has enough salt in his life right now | 18:27 |
mriedem | jaypipes: notes are in https://bugs.launchpad.net/nova/+bug/1798688 | 18:27 |
openstack | Launchpad bug 1798688 in OpenStack Compute (nova) "AllocationUpdateFailed_Remote: Failed to update allocations for consumer. Error: another process changed the consumer after the report client read the consumer state during the claim" [Undecided,Triaged] | 18:27 |
dansmith | jaypipes: question.. should placement have a consumers endpoint? | 18:27 |
mriedem | looks like we're racing between shelve offload (allocation removal) and unshelve (put new allocations) and hitting a consumer generation conflict | 18:27 |
sean-k-mooney | jaypipes: at least you don't have to use TripleO where we use yaml to drive heat to drive puppet to drive ansible to deploy docker containers ... | 18:28 |
sean-k-mooney | or maybe the ansible drives puppet its hard to keep track of | 18:28 |
dansmith | sean-k-mooney: it drives ... me insane | 18:29 |
mriedem | so, i see in placement handler code where it handles the "allocations are being removed on PUT" case, and it ensures a consumer exists, but then i don't see where that consumer is deleted | 18:30 |
mriedem | like the note in the compute code | 18:30 |
mriedem | https://github.com/openstack/nova/blob/e27905f482ba26d2bbf3ae5d948dee37523042d5/nova/api/openstack/placement/handlers/allocation.py#L404 | 18:32 |
mriedem | oh i guess it should happen here https://github.com/openstack/nova/blob/e27905f482ba26d2bbf3ae5d948dee37523042d5/nova/api/openstack/placement/objects/resource_provider.py#L2099 | 18:34 |
mriedem | https://github.com/openstack/nova/blob/e27905f482ba26d2bbf3ae5d948dee37523042d5/nova/api/openstack/placement/objects/consumer.py#L70 might be broken | 18:36 |
mriedem | in the same way that ensure was broken https://github.com/openstack/nova/commit/730936e535e67127c76d4f27649a16d8cf05efc9#diff-fcca11e34c1b5fce52a4ddbc418aa2d5 | 18:36 |
openstackgerrit | Merged openstack/nova stable/queens: conductor: Recreate volume attachments during a reschedule https://review.openstack.org/612496 | 18:37 |
openstackgerrit | Merged openstack/nova master: Update the description to make it more accuracy https://review.openstack.org/615362 | 18:37 |
mriedem | i can't really tell where delete_consumers_if_no_allocations is tested though... | 18:39 |
mriedem | some gabbit i'm sure | 18:40 |
dansmith | seems easily unit testable, | 18:41 |
dansmith | and I definitely can't look at that and tell that it works | 18:41 |
dansmith | since it's joining on consume id and asserting that it's none in one case | 18:41 |
mriedem | yeah i can't do anything with the sql w/o testing it | 18:42 |
dansmith | time to pack up, back later | 18:42 |
mriedem | oh i guess DeleteConsumerIfNoAllocsTestCase | 18:42 |
mriedem | i'll tweak that after lunch | 18:43 |
*** bigdogstl has joined #openstack-nova | 18:50 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add debug logs when doubling-up allocations during scheduling https://review.openstack.org/617016 | 18:53 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Log consumers_to_check when calling delete_consumers_if_no_allocations https://review.openstack.org/617017 | 18:53 |
mriedem | jaypipes: debug logging needed for this gate bug ^ | 18:53 |
jaypipes | still reading back, sorry | 18:54 |
*** mriedem is now known as mriedem_hangry | 18:55 | |
jaypipes | dansmith, mriedem_hangry: there is a check at the end of the server-side of PUT /allocations that will auto-delete the consumer record if there are no allocations still referring to it. | 19:04 |
jaypipes | dansmith: and yes, I've said for a long time that we should have a GET /consumers endpoint. There are placement devs that vehemently disagreed with that. | 19:07 |
*** bigdogstl has quit IRC | 19:08 | |
*** bigdogstl has joined #openstack-nova | 19:12 | |
*** ivve has quit IRC | 19:12 | |
*** janki has quit IRC | 19:22 | |
*** zigo has quit IRC | 19:25 | |
*** bigdogstl has quit IRC | 19:26 | |
sean-k-mooney | mriedem_hangry: would you have any objection to backporting https://review.openstack.org/#/c/591607/9 to newton? im pretty sure we have a customer that is hitting this as they reported instances restarting after a host reboot missing interfaces that show up in nova interface-list | 19:27 |
*** bigdogstl has joined #openstack-nova | 19:30 | |
sean-k-mooney | mriedem_hangry: actually i just realised newton is way older than i remembered and is eol | 19:34 |
*** mriedem_hangry is now known as mriedem | 19:34 | |
mriedem | jaypipes: yup found that, and the related functional test | 19:34 |
mriedem | sean-k-mooney: not to mention that isn't even approved on master | 19:35 |
*** bigdogstl has quit IRC | 19:35 | |
sean-k-mooney | mriedem: yes :) im aware. i was more asking do you think this is something that can be backported in general upstream | 19:35 |
sean-k-mooney | once it lands in master | 19:36 |
mriedem | idk | 19:36 |
mriedem | it seems to be pretty controversial | 19:37 |
mriedem | like my david bowie costume on halloween | 19:37 |
sean-k-mooney | im waiting on more logs for the downstream bug to confirm this is actully the issue | 19:38 |
sean-k-mooney | mriedem: ill try to review and digest these changes more on monday | 19:40 |
sean-k-mooney | mriedem: are you flying out to berlin today/tomorrow? | 19:41 |
mriedem | the problem the huawei ops team ran into was policy changed on the neutron side which started returning an empty list of ports, which was then saved into the info cache in the nova db, | 19:41 |
mriedem | and the heal periodic relies on the info cache rather than the source of truth to fix the cache | 19:41 |
mriedem | tonight | 19:41 |
mriedem | there are other ways to simply rebuild the cache if that's what is needed, e.g. https://docs.openstack.org/python-novaclient/latest/cli/nova.html#nova-refresh-network | 19:41 |
sean-k-mooney | right. i think we discussed this in the past at some point too it feels familiar but i have not reviewed this before | 19:42 |
mriedem | it came up at the ptg i think | 19:42 |
mriedem | and the public cloud sig has brought it up before (OVH obviously) | 19:42 |
sean-k-mooney | mriedem: perhaps. oh so we can force a rebuild via the nova client today? | 19:42 |
* sean-k-mooney clicks | 19:42 | |
mriedem | it doesn't rebuild from neutron | 19:43 |
mriedem | i don't think anyway | 19:43 |
mriedem | it just sends a network-changed event to the compute | 19:43 |
sean-k-mooney | oh it rebuilds from the vif table in the nova db? | 19:43 |
mriedem | yes | 19:43 |
mriedem | well, | 19:43 |
mriedem | from the info cache | 19:43 |
mriedem | iow, | 19:44 |
sean-k-mooney | ok but if the info cache got corrupted when the host rebooted you're still stuck | 19:44 |
mriedem | the network-changed event and _heal_instance_info_cache periodic do the same thing today | 19:44 |
mriedem | correct | 19:44 |
mriedem | hence the reason for making the periodic actually "heal" from the source of truth, which is neutron | 19:44 |
mriedem | and not our potentially corrupted cache | 19:44 |
sean-k-mooney | so i suggested they detach the missing interfaces and reattach them as a workaround for now to try and force nova and neutron to resync | 19:45 |
sean-k-mooney | i think that would work in this case but its not ideal | 19:45 |
sean-k-mooney | mriedem: anyway thanks. i was scratching my head for the last day or so trying to parse what was going on from incomplete logs but im 98% sure this is it. | 19:47 |
sean-k-mooney | mriedem: have a safe trip. | 19:47 |
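The distinction mriedem draws above (the periodic "healing" from the cache itself versus healing from neutron, the source of truth) can be sketched with a toy example. All helper names here are invented for illustration; this is not nova's code.

```python
# Toy sketch of the info-cache discussion above: rebuilding from the cache
# re-saves corruption, while rebuilding from neutron's port list repairs it.
# Hypothetical helpers -- not nova's actual _heal_instance_info_cache.

def heal_from_cache(cached_ports):
    # What network-changed / the old periodic effectively do: trust the cache.
    return list(cached_ports)

def heal_from_neutron(neutron_ports):
    # The proposed behavior: rebuild from neutron's view of the ports.
    return [port['id'] for port in neutron_ports]

# A cache emptied by a bad policy response, as in the huawei ops case:
corrupted_cache = []
neutron_ports = [{'id': 'port-1'}, {'id': 'port-2'}]

print(heal_from_cache(corrupted_cache))   # [] -- corruption persists
print(heal_from_neutron(neutron_ports))   # ['port-1', 'port-2']
```

This is why detaching and reattaching the interfaces works as a workaround: it forces fresh data from neutron into the cache by a different path.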
*** bigdogstl has joined #openstack-nova | 19:53 | |
*** bigdogstl has quit IRC | 19:57 | |
*** sambetts|afk has quit IRC | 20:00 | |
*** sambetts_ has joined #openstack-nova | 20:02 | |
*** efried is now known as fried_rice | 20:09 | |
*** munimeha1 has joined #openstack-nova | 20:19 | |
*** erlon has quit IRC | 20:47 | |
mriedem | thanks | 20:48 |
*** eharney has quit IRC | 20:53 | |
dansmith | mriedem: so you found a test that confirmed the behavior of that thing? | 20:58 |
dansmith | mriedem: that deletes the consumer? | 20:58 |
*** bigdogstl has joined #openstack-nova | 20:59 | |
mriedem | DeleteConsumerIfNoAllocsTestCase is the functional test that covers that case, | 21:02 |
mriedem | and it looks like a correct test to me, | 21:02 |
mriedem | creates 2 consumers each with 2 allocations on different resource classes, | 21:03 |
mriedem | clears the allocations for one of them and asserts the consumer is gone | 21:03 |
mriedem | i think we're just hitting a race with the shelve offloaded status change before we cleanup the allocations | 21:03 |
mriedem | but i've posted a couple of patches to add debug logs to help determine if that's the case | 21:03 |
mriedem | https://review.openstack.org/617016 | 21:03 |
dansmith | okay I'm not sure how we could race and see no allocations but a consumer and get that generation conflict | 21:04 |
dansmith | it'd be one thing if we thought the consumer was there and then disappeared out from under us | 21:05 |
*** lbragstad has quit IRC | 21:08 | |
*** bigdogstl has quit IRC | 21:09 | |
*** bigdogstl has joined #openstack-nova | 21:13 | |
mriedem | during unshelve the scheduler does see allocations | 21:17 |
mriedem | and it thinks we're doing a move | 21:17 |
dansmith | okay I thought you pasted a line showing that there was only one allocation going back to placement | 21:17 |
mriedem | there are 3 PUTs for allocations | 21:18 |
mriedem | 1. create the server - initial | 21:18 |
mriedem | 2. shelve offload - wipe the allocations to {} - which should delete the consumer | 21:18 |
mriedem | 3. unshelve - scheduler claims resources with the wrong consumer generation | 21:18 |
mriedem | and when 3 happens, the scheduler gets allocations for the consumer and they are there, | 18:18 |
dansmith | ...right | 21:18 |
*** bigdogstl has quit IRC | 21:18 | |
mriedem | so it uses the consumer generation (1) from those allocations | 21:18 |
mriedem | then i think what happens is, | 21:19 |
dansmith | oh, so it passes generation=1 instead of generation=0, meaning new consumer? | 21:19 |
mriedem | placement recreates the consumer which will have generation null | 21:19 |
mriedem | yes | 21:19 |
dansmith | okay I see | 21:19 |
dansmith | I thought you were seeing consumer generation was null or zero or whatever in the third put, but still getting a conflict | 21:19 |
dansmith | but that makes sense now | 21:19 |
mriedem | Nov 06 19:48:37.013780 ubuntu-xenial-inap-mtl01-0000379614 nova-scheduler[12154]: WARNING nova.scheduler.client.report [None req-f266a0ff-2840-413d-9877-4500e61512f5 tempest-ServersNegativeTestJSON-477704048 tempest-ServersNegativeTestJSON-477704048] Failed to save allocation for 6665f00a-dcf1-4286-b075-d7dcd7c37487. Got HTTP 409: {"errors": [{"status": 409, "request_id": "req-c9ba6cbd-3b6e-4e5d-b550-9588be8a49d2", "code": "placement.concurrent_update", "detail": "There was a conflict when trying to complete your request.\n\n consumer generation conflict - expected null but got 1 ", "title": "Conflict"}]} | 21:20 |
mriedem | consumer generation conflict - expected null but got 1 | 21:20 |
mriedem | yup - so new consumer but we're passing a generation of 1 | 21:20 |
mriedem | from the old, now deleted consumer | 21:20 |
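The three-PUT race mriedem lays out can be reproduced with a toy in-memory model of placement's consumer-generation bookkeeping. Everything here (class, method names, the generation arithmetic) is invented for illustration and is not placement's actual API or implementation; it only mirrors the behavior described in the log.

```python
# Toy model of the race: 1) initial schedule creates the consumer,
# 2) shelve offload PUTs {} which auto-deletes the consumer,
# 3) unshelve claims with the stale generation read before the delete.

class ConflictError(Exception):
    pass

class FakePlacement:
    """Stand-in for placement's consumer-generation checks (illustrative)."""
    def __init__(self):
        self.consumers = {}  # uuid -> {'generation': int, 'allocations': {}}

    def put_allocations(self, uuid, allocations, consumer_generation):
        consumer = self.consumers.get(uuid)
        expected = consumer['generation'] if consumer else None
        if consumer_generation != expected:
            raise ConflictError(
                'consumer generation conflict - expected %s but got %s'
                % (expected, consumer_generation))
        if not allocations:
            # PUT {} removes all resources, which auto-deletes the consumer
            # (the delete_consumers_if_no_allocations behavior).
            self.consumers.pop(uuid, None)
            return
        gen = consumer['generation'] + 1 if consumer else 1
        self.consumers[uuid] = {'generation': gen, 'allocations': allocations}

    def get_allocations(self, uuid):
        consumer = self.consumers.get(uuid)
        if not consumer:
            # No 404: an unknown consumer just returns empty allocations.
            return {'allocations': {}}
        return {'allocations': consumer['allocations'],
                'consumer_generation': consumer['generation']}


placement = FakePlacement()
# 1. initial schedule: new consumer, generation None
placement.put_allocations('inst', {'rp1': {'VCPU': 2}}, None)
# unshelve's scheduler reads the allocations *before* offload finishes (the race)
stale = placement.get_allocations('inst')['consumer_generation']
# 2. shelve offload: PUT {} wipes the allocations and deletes the consumer
placement.put_allocations('inst', {}, stale)
# 3. unshelve: claim with the stale generation -> the 409 from the log
try:
    placement.put_allocations('inst', {'rp2': {'VCPU': 2}}, stale)
except ConflictError as exc:
    print(exc)  # consumer generation conflict - expected None but got 1
```

The last line reproduces the "expected null but got 1" conflict: the consumer was recreated-from-scratch territory (expected generation None) while the scheduler still held generation 1 from the deleted consumer.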
dansmith | cool | 21:21 |
mriedem | so, | 21:21 |
dansmith | I wish there was something better to communicate that, but any time we get "expected null" in that case, we should be able to re-try but as a non-move sort of thing | 21:21 |
mriedem | we can paper over this by deleting the allocations before marking the instance as shelved offloaded, but that's whack-a-moley | 21:21 |
dansmith | yeah | 21:21 |
mriedem | right we need to retry from claim_resources but i'm not sure what's the best way to do that | 21:22 |
dansmith | and like I said, I think it's not really any better, it just changes the problem | 21:22 |
dansmith | yeah | 21:22 |
mriedem | if we do retry that method, the next get for allocations will see there are none and we should be good | 21:22 |
dansmith | right | 21:22 |
mriedem | b/c we'll pass consumer_generation=None | 21:22 |
mriedem | i think i know what we can do | 21:22 |
mriedem | if we hit | 21:23 |
mriedem | if 'consumer generation conflict' in err['detail']: | 21:23 |
mriedem | we get the allocs again, and if empty, | 21:23 |
mriedem | we retry | 21:23 |
mriedem | easy peasy | 21:23 |
mriedem | it's a double get but meh? | 21:23 |
dansmith | yeah, it's just that it takes us an extra op, | 21:23 |
dansmith | when "expected null" should be enough | 21:23 |
dansmith | yeah | 21:23 |
mriedem | i can parse that out of the message if we want.. | 21:23 |
dansmith | I know we can, it's just icky and unfortunate | 21:23 |
mriedem | with a TODO for more granular error codes later | 21:23 |
dansmith | like all the other cases in there | 21:24 |
mriedem | right | 21:24 |
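The retry idea mriedem settles on (conflict, re-GET, retry as a new consumer if the allocations are empty) can be sketched as follows. This is a hedged illustration, not the actual patch under review; the function and client names are invented.

```python
# Sketch of the retry discussed above: on a consumer generation conflict,
# re-read the allocations; if they are now empty the consumer was
# auto-deleted, so retry the claim once with consumer_generation=None.

class ConflictError(Exception):
    def __init__(self, detail):
        super().__init__(detail)
        self.detail = detail


def claim_resources(client, consumer_uuid, allocations, generation):
    try:
        client.put_allocations(consumer_uuid, allocations, generation)
        return True
    except ConflictError as err:
        # TODO: match a granular error code instead of parsing the message.
        if 'consumer generation conflict' not in err.detail:
            raise
        # The "double get" -- only paid on the (rare) conflict path.
        if client.get_allocations(consumer_uuid)['allocations']:
            # Another process really holds allocations: genuine conflict.
            return False
        # Consumer is gone: retry once as a brand-new consumer.
        client.put_allocations(consumer_uuid, allocations, None)
        return True


class FakeClient:
    """Illustrative stand-in: the consumer was just deleted by shelve
    offload's PUT {}, but the caller still holds stale generation 1."""
    def __init__(self):
        self.calls = []
        self.generation = None  # no consumer record exists

    def put_allocations(self, uuid, allocations, generation):
        self.calls.append(generation)
        if generation != self.generation:
            raise ConflictError(
                'consumer generation conflict - expected %s but got %s'
                % (self.generation, generation))
        self.generation = 1

    def get_allocations(self, uuid):
        return {'allocations': {}}


client = FakeClient()
result = claim_resources(client, 'inst', {'rp': {'VCPU': 1}}, 1)
print(result, client.calls)  # True [1, None]
```

The first PUT fails with the stale generation, the empty GET proves the consumer is gone, and the second PUT succeeds as a new consumer.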
mriedem | i haven't even started packing yet | 21:24 |
mriedem | laura is starting to check in on me every 30 minutes | 21:24 |
mriedem | "this is what i'm wearing all week! god!" | 21:24 |
dansmith | hah | 21:24 |
mriedem | plus my mother in law is here, | 21:25 |
mriedem | so lots of teenage angst memories coming back right now | 21:25 |
mriedem | the coffee and metallica doesn't help | 21:25 |
dansmith | isn't that a good reason to pack and get out? | 21:25 |
mriedem | i've just holed up in my office | 21:26 |
mriedem | i'll crank out a patch for this bug and be off | 21:26 |
*** eharney has joined #openstack-nova | 21:43 | |
openstackgerrit | Jack Ding proposed openstack/nova master: Use virt.images.convert_image for qemu-img convert https://review.openstack.org/616692 | 21:45 |
fried_rice | mriedem: Sanity check, please. The compute manager has a report client via the scheduler client, that's *not* the same as the report client the resource tracker has. | 21:52 |
fried_rice | which means my current SIGHUP doesn't do shit to the RT's cache | 21:53 |
fried_rice | I need to make the report client a singleton. | 21:53 |
mriedem | correct | 21:53 |
mriedem | we have report clients all over the place | 21:53 |
mriedem | api, conductor, scheduler | 21:53 |
mriedem | etc | 21:53 |
fried_rice | that's a scroo | 21:54 |
fried_rice | mriedem: So - make the report client a singleton (per process), or just diddle the compute manager's reset to hit the rt's reportclient instead. | 21:54 |
fried_rice | f, without knowing what the various ones in the compute manager are used for, is it really safe to make them a singleton? | 21:55 |
mriedem | the latter would be a smaller blast area | 21:56 |
fried_rice | I think I may have actually done this to myself, by removing that LazyLoader | 21:57 |
fried_rice | I suspect that guy was incidentally singleton-ing. | 21:57 |
* fried_rice looks... | 21:57 | |
fried_rice | nah, that should still have been creating separate instances per scheduler client. | 21:58 |
dansmith | fried_rice: yeah I thought that was making it a singleton a month ago when I was looking at it | 22:03 |
dansmith | a lot of stuff in nova used to be lazy loaded because.. um, terrible reasons | 22:04 |
dansmith | lazy loaded or pluggable | 22:04 |
fried_rice | dansmith: In this case it was supposedly because of a circular import. Whether that was ever really an issue, it isn't now, so I ripped it out. But having just looked, I still don't think it was making the report client a singleton. Care to confirm? | 22:04 |
dansmith | I think I confirmed that a month ago when I was looking into a seemingly recent memory leak | 22:05 |
dansmith | so I think it's fine that it's gone | 22:05 |
dansmith | I agree that randomly making it a singleton now should be done with care | 22:06 |
dansmith | but I don't really know how to convince myself that it's okay once its done, tbh | 22:06 |
openstackgerrit | Jack Ding proposed openstack/nova master: Use virt.images.convert_image for qemu-img convert https://review.openstack.org/616692 | 22:06 |
fried_rice | well, using the RT's report client fixed the problem I was having. So maybe I pretend singleton was never suggested. | 22:07 |
fried_rice | oh, f, this is gonna break all over the place. I can't see a reason why the compute manager would possibly want or need to use separate report clients. I'd really like to put 'em together. If not making it a singleton, at least using only one of them from the compute manager. | 22:10 |
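A minimal sketch of the per-process singleton option fried_rice floats, assuming a thread-safe accessor is acceptable. The names are invented and this is not nova's implementation; it just shows how one shared report client (and thus one provider-tree cache) per process could work.

```python
# Illustrative per-process singleton report client, so the compute manager
# and the resource tracker see the same cache and a SIGHUP-triggered cache
# clear is visible to both. Hypothetical names throughout.
import threading

class ReportClient:
    def __init__(self):
        # Stands in for the provider tree / association caches being discussed.
        self.provider_tree_cache = {}

    def clear_cache(self):
        self.provider_tree_cache.clear()

_lock = threading.Lock()
_instance = None

def get_report_client():
    """Return the process-wide ReportClient, creating it on first use."""
    global _instance
    if _instance is None:
        with _lock:
            if _instance is None:  # double-checked under the lock
                _instance = ReportClient()
    return _instance

# Every caller gets the same object, so resetting it resets everyone's view.
assert get_report_client() is get_report_client()
```

The alternative with the smaller blast radius, as mriedem notes, is to leave the many report clients alone and just have the compute manager's reset poke the RT's client directly.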
* dansmith ->plane | 22:10 | |
fried_rice | o/ | 22:10 |
*** eharney has quit IRC | 22:15 | |
mriedem | fried_rice: the compute manager / RT using the same report client is probably fine, | 22:15 |
mriedem | a lot of that compute manager / RT code was cleaned up way back in ocata i think when jaypipes made the RT a singleton that tracked multiple compute nodes, | 22:16 |
fried_rice | mriedem: Ima put up an independent patch for that | 22:16 |
mriedem | whereas before it was 1 RT per compute node | 22:16 |
fried_rice | ah | 22:16 |
mriedem | they are very tightly coupled, like how the compute manager passes the virt driver into the RT | 22:16 |
openstackgerrit | Jack Ding proposed openstack/nova master: Use virt.images.convert_image for qemu-img convert https://review.openstack.org/616692 | 22:16 |
openstackgerrit | Eric Fried proposed openstack/nova master: SIGHUP n-cpu to refresh provider tree cache https://review.openstack.org/615646 | 22:18 |
openstackgerrit | Eric Fried proposed openstack/nova master: Reduce calls to placement from _ensure https://review.openstack.org/615677 | 22:18 |
openstackgerrit | Eric Fried proposed openstack/nova master: Consolidate inventory refresh https://review.openstack.org/615695 | 22:18 |
openstackgerrit | Eric Fried proposed openstack/nova master: Commonize _update code path https://review.openstack.org/615705 | 22:18 |
openstackgerrit | Eric Fried proposed openstack/nova master: Turn off rp association refresh in nova-next https://review.openstack.org/616033 | 22:18 |
fried_rice | let's see how that goes | 22:18 |
fried_rice | mriedem: Oh, my removal of lazyload probably reinstated "lockutils spam" mentioned in nova/compute/api.py@257 | 22:21 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Retry on consumer delete race in claim_resources https://review.openstack.org/617040 | 22:25 |
mriedem | dansmith: jaypipes: fried_rice: ^ bingo bango | 22:25 |
mriedem | gibi: you too ^ | 22:25 |
mriedem | the commit message is longer than the code | 22:25 |
mriedem | and with that i'm off | 22:29 |
*** mriedem has quit IRC | 22:29 | |
*** bigdogstl has joined #openstack-nova | 22:51 | |
openstackgerrit | Eric Fried proposed openstack/nova master: Rip the report client out of SchedulerClient https://review.openstack.org/617042 | 22:52 |
openstackgerrit | Eric Fried proposed openstack/nova master: Rip the report client out of SchedulerClient https://review.openstack.org/617042 | 22:54 |
*** spatel has quit IRC | 22:56 | |
*** betherly has joined #openstack-nova | 23:02 | |
*** bigdogstl has quit IRC | 23:03 | |
*** bigdogstl has joined #openstack-nova | 23:05 | |
*** betherly has quit IRC | 23:06 | |
*** bigdogstl has quit IRC | 23:10 | |
*** bigdogstl has joined #openstack-nova | 23:11 | |
*** munimeha1 has quit IRC | 23:12 | |
openstackgerrit | Vlad Gusev proposed openstack/nova stable/pike: libvirt: Reduce calls to qemu-img during update_available_resource https://review.openstack.org/604039 | 23:17 |
*** elod has quit IRC | 23:17 | |
openstackgerrit | Merged openstack/nova stable/pike: Make scheduler.utils.setup_instance_group query all cells https://review.openstack.org/599841 | 23:18 |
openstackgerrit | Vlad Gusev proposed openstack/nova stable/pike: libvirt: Reduce calls to qemu-img during update_available_resource https://review.openstack.org/604039 | 23:18 |
*** elod has joined #openstack-nova | 23:18 | |
openstackgerrit | Merged openstack/nova master: Add recreate test for bug 1799892 https://review.openstack.org/613304 | 23:20 |
openstack | bug 1799892 in OpenStack Compute (nova) rocky "Placement API crashes with 500s in Rocky upgrade with downed compute nodes" [Medium,Triaged] https://launchpad.net/bugs/1799892 | 23:20 |
*** bigdogstl has quit IRC | 23:24 | |
aspiers | mriedem: thanks for the review! Regarding technical debt, my understanding is that the intention is very much for SUSE/AMD to carry on working to flesh out the functionality after implementation of the MVP described in the initial spec, rather than just to dump some half-baked implementation upstream and then vanish ;-) This would include adding support for attestation, migration etc. | 23:25 |
aspiers | ah, he's gone | 23:26 |
*** bigdogstl has joined #openstack-nova | 23:27 | |
openstackgerrit | Vlad Gusev proposed openstack/nova stable/pike: libvirt: Use os.stat and os.path.getsize for RAW disk inspection https://review.openstack.org/607544 | 23:27 |
*** s10 has joined #openstack-nova | 23:28 | |
*** bigdogstl has quit IRC | 23:32 | |
*** bigdogstl has joined #openstack-nova | 23:43 | |
*** bigdogstl has quit IRC | 23:54 | |
openstackgerrit | Merged openstack/nova master: Mention meta key suffix in tenant isolation with placement docs https://review.openstack.org/616991 | 23:56 |
openstackgerrit | Ken'ichi Ohmichi proposed openstack/nova master: api-ref: Add a description about sort order https://review.openstack.org/616773 | 23:57 |
*** slaweq has quit IRC | 23:57 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!