*** hongbin has quit IRC | 00:01 | |
*** BjoernT has quit IRC | 00:02 | |
*** ttsiouts has joined #openstack-nova | 00:08 | |
*** artom has joined #openstack-nova | 00:15 | |
*** gyee has quit IRC | 00:36 | |
*** xek has quit IRC | 00:37 | |
*** dpawlik has joined #openstack-nova | 00:38 | |
*** ttsiouts has quit IRC | 00:41 | |
openstackgerrit | Brin Zhang proposed openstack/nova master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659201 | 00:42 |
*** dpawlik has quit IRC | 00:42 | |
*** ricolin has joined #openstack-nova | 00:50 | |
openstackgerrit | Brin Zhang proposed openstack/nova-specs master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659205 | 00:53 |
*** dpawlik has joined #openstack-nova | 00:54 | |
*** awalende has joined #openstack-nova | 00:57 | |
*** dpawlik has quit IRC | 00:59 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova stable/stein: Skip _exclude_nested_providers() if not nested https://review.opendev.org/659206 | 00:59 |
*** d34dh0r53 has joined #openstack-nova | 01:01 | |
*** awalende has quit IRC | 01:02 | |
*** bbowen has quit IRC | 01:27 | |
*** nicolasbock has quit IRC | 01:36 | |
*** lbragstad has quit IRC | 01:42 | |
*** guozijn has joined #openstack-nova | 01:48 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova stable/rocky: Skip _exclude_nested_providers() if not nested https://review.opendev.org/659207 | 01:51 |
*** lbragstad has joined #openstack-nova | 01:51 | |
*** BjoernT has joined #openstack-nova | 02:00 | |
*** whoami-rajat has joined #openstack-nova | 02:06 | |
*** cfriesen has quit IRC | 02:10 | |
*** guozijn_ has joined #openstack-nova | 02:11 | |
*** guozijn has quit IRC | 02:13 | |
*** guozijn_ is now known as guozijn | 02:13 | |
*** dklyle has joined #openstack-nova | 02:14 | |
*** BjoernT has quit IRC | 02:17 | |
*** BjoernT has joined #openstack-nova | 02:20 | |
openstackgerrit | Boxiang Zhu proposed openstack/nova master: Fix failure to boot instances with qcow2 format images https://review.opendev.org/640271 | 02:22 |
*** JamesBenson has joined #openstack-nova | 02:26 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Update usage in RT.drop_move_claim during confirm resize https://review.opendev.org/641806 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Robustify attachment tracking in CinderFixtureNewAttachFlow https://review.opendev.org/658904 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add Migration.cross_cell_move and get_by_uuid https://review.opendev.org/614012 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add InstanceAction/Event create() method https://review.opendev.org/614036 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add Instance.hidden field https://review.opendev.org/631123 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add TargetDBSetupTask https://review.opendev.org/627892 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add CrossCellMigrationTask https://review.opendev.org/631581 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Execute TargetDBSetupTask https://review.opendev.org/633853 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add prep_snapshot_based_resize_at_dest compute method https://review.opendev.org/633293 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add PrepResizeAtDestTask https://review.opendev.org/627890 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add prep_snapshot_based_resize_at_source compute method https://review.opendev.org/634832 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add nova.compute.utils.delete_image https://review.opendev.org/637605 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add PrepResizeAtSourceTask https://review.opendev.org/627891 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Refactor ComputeManager.remove_volume_connection https://review.opendev.org/642183 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add power_on kwarg to ComputeDriver.spawn() method https://review.opendev.org/642590 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add finish_snapshot_based_resize_at_dest compute method https://review.opendev.org/635080 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add FinishResizeAtDestTask https://review.opendev.org/635646 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add Destination.allow_cross_cell_move field https://review.opendev.org/614035 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Support cross-cell moves in external_instance_event https://review.opendev.org/658478 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add cross-cell resize policy rule and enable in API https://review.opendev.org/638269 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Enable cross-cell resize in the nova-multi-cell job https://review.opendev.org/656656 | 02:27 |
openstackgerrit | Merged openstack/nova master: Fix retry of instance_update_and_get_original https://review.opendev.org/658845 | 02:32 |
guilhermesp | hello! I'm currently running an upgrade N -> O. There is something weird going on when I run nova-manage cell_v2 map_instances <id>. It seems that nova_api.instance_mappings stays stuck for a while, incrementing slowly one by one, then after a huge amount of time it starts to re-map the instances in a faster way | 02:33 |
guilhermesp | the deployment has 160k rows in nova.instances | 02:33 |
guilhermesp | we currently re-mapped 80K instances, but sometimes the command finishes and there are still rows to remap | 02:34 |
guilhermesp | the active instances of the deployment are currently given 404, which means they don't reside in the cell newly created during the deployment | 02:34 |
*** BjoernT has quit IRC | 02:35 | |
guilhermesp | we are sometimes adding --max-count to the command, like 50k, 100k, and trying to track the remapping by watching the rows in nova_api.instance_mappings | 02:35 |
guilhermesp | does anyone know what could be the trick in that scenario | 02:35 |
openstackgerrit | Merged openstack/nova master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659201 | 02:35 |
guilhermesp | we have a staging env where this didn't happen | 02:36 |
guilhermesp | but we used to have around 1.8k rows in nova.instances | 02:36 |
guilhermesp | maybe the amount of records in the other env could be a possible reason? | 02:37 |
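(For reference, a minimal sketch of batching map_instances for a large nova.instances table. It assumes the documented return-code convention for nova-manage cell_v2 map_instances, i.e. exit 1 while more instances remain and exit 0 once everything is mapped; the cell UUID and batch size are placeholders.)

    # hypothetical cell UUID; substitute the one created during the upgrade
    CELL_UUID=<cell-uuid>
    # loop until map_instances reports there is nothing left to map
    while true; do
        nova-manage cell_v2 map_instances --cell_uuid "$CELL_UUID" --max-count 50000
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "all instances mapped"
            break
        elif [ "$rc" -ne 1 ]; then
            echo "map_instances exited with $rc" >&2
            break
        fi
    done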
*** ttsiouts has joined #openstack-nova | 02:38 | |
*** lbragstad has quit IRC | 02:40 | |
*** lbragstad has joined #openstack-nova | 02:43 | |
*** dpawlik has joined #openstack-nova | 02:55 | |
*** dpawlik has quit IRC | 03:00 | |
*** dpawlik has joined #openstack-nova | 03:11 | |
*** ttsiouts has quit IRC | 03:11 | |
*** hongbin has joined #openstack-nova | 03:14 | |
*** dpawlik has quit IRC | 03:15 | |
*** licanwei has joined #openstack-nova | 03:26 | |
*** udesale has joined #openstack-nova | 03:53 | |
*** hongbin has quit IRC | 03:55 | |
*** JamesBenson has quit IRC | 04:01 | |
*** janki has joined #openstack-nova | 04:43 | |
*** bhagyashris has joined #openstack-nova | 04:46 | |
*** guozijn has quit IRC | 04:47 | |
bhagyashris | sean-k-mooney: Hi, | 04:47 |
*** _d34dh0r53_ has joined #openstack-nova | 04:55 | |
*** guozijn has joined #openstack-nova | 04:56 | |
*** lbragstad has quit IRC | 04:59 | |
*** _d34dh0r53_ has quit IRC | 05:08 | |
*** ttsiouts has joined #openstack-nova | 05:09 | |
*** _d34dh0r53_ has joined #openstack-nova | 05:11 | |
*** dpawlik has joined #openstack-nova | 05:12 | |
*** dpawlik has quit IRC | 05:17 | |
*** ratailor has joined #openstack-nova | 05:21 | |
*** dpawlik has joined #openstack-nova | 05:28 | |
*** Luzi has joined #openstack-nova | 05:40 | |
*** ttsiouts has quit IRC | 05:41 | |
*** ivve has quit IRC | 05:41 | |
*** hamzy has quit IRC | 05:44 | |
*** udesale has quit IRC | 06:05 | |
*** udesale has joined #openstack-nova | 06:06 | |
*** hamzy has joined #openstack-nova | 06:21 | |
*** ccamacho has joined #openstack-nova | 06:31 | |
*** trondham has joined #openstack-nova | 06:33 | |
*** pmannidi has joined #openstack-nova | 06:34 | |
*** udesale has quit IRC | 06:51 | |
*** udesale has joined #openstack-nova | 06:52 | |
*** xek has joined #openstack-nova | 06:54 | |
*** ivve has joined #openstack-nova | 06:58 | |
*** rcernin has quit IRC | 07:01 | |
*** udesale has quit IRC | 07:02 | |
*** openstackgerrit has quit IRC | 07:03 | |
*** udesale has joined #openstack-nova | 07:03 | |
*** tesseract has joined #openstack-nova | 07:05 | |
*** jaosorior has quit IRC | 07:06 | |
*** pcaruana has joined #openstack-nova | 07:13 | |
*** guozijn_ has joined #openstack-nova | 07:15 | |
*** rpittau|afk is now known as rpittau | 07:17 | |
*** guozijn has quit IRC | 07:18 | |
*** guozijn_ is now known as guozijn | 07:18 | |
*** openstackgerrit has joined #openstack-nova | 07:20 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova stable/stein: Fix assert methods in unit tests https://review.opendev.org/657708 | 07:20 |
*** xek has quit IRC | 07:21 | |
*** helenafm has joined #openstack-nova | 07:21 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova stable/rocky: Fix assert methods in unit tests https://review.opendev.org/657709 | 07:21 |
*** ralonsoh has joined #openstack-nova | 07:23 | |
openstackgerrit | zhangboye proposed openstack/nova-specs master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659470 | 07:24 |
kashyap | weshay: Was asleep when you pinged. | 07:26 |
* kashyap reads the scroll | 07:28 | |
*** tssurya has joined #openstack-nova | 07:31 | |
*** udesale has quit IRC | 07:36 | |
kashyap | weshay: I see that melwitt (thanks) has already investigated the issue and posted a patch to skip logging the trace: https://review.opendev.org/#/c/659374/ | 07:37 |
*** udesale has joined #openstack-nova | 07:37 | |
*** ttsiouts has joined #openstack-nova | 07:39 | |
*** slaweq has joined #openstack-nova | 07:40 | |
*** awalende has joined #openstack-nova | 07:49 | |
*** tesseract has quit IRC | 07:52 | |
*** tesseract has joined #openstack-nova | 07:52 | |
*** ttsiouts has quit IRC | 07:57 | |
openstackgerrit | Gorka Eguileor proposed openstack/nova master: Fix doc requirements https://review.opendev.org/659484 | 08:05 |
*** maciejjozefczyk has joined #openstack-nova | 08:07 | |
openstackgerrit | Gorka Eguileor proposed openstack/nova master: Use os-brick locking for volume attach and detach https://review.opendev.org/614190 | 08:07 |
*** awalende_ has joined #openstack-nova | 08:15 | |
*** ttsiouts has joined #openstack-nova | 08:16 | |
*** awalend__ has joined #openstack-nova | 08:18 | |
*** awalende has quit IRC | 08:19 | |
*** awalende_ has quit IRC | 08:20 | |
*** awalende has joined #openstack-nova | 08:21 | |
*** awalend__ has quit IRC | 08:24 | |
*** cdent has joined #openstack-nova | 08:25 | |
*** brinzh has joined #openstack-nova | 08:26 | |
*** brinzhang has quit IRC | 08:29 | |
*** tkajinam has quit IRC | 08:32 | |
*** derekh has joined #openstack-nova | 08:38 | |
*** dtantsur|afk is now known as dtantsur | 08:50 | |
*** sridharg has joined #openstack-nova | 08:53 | |
*** rabel has quit IRC | 09:03 | |
openstackgerrit | Arnaud Morin proposed openstack/nova master: Always Set dhcp_server in network_info https://review.opendev.org/658362 | 09:10 |
*** panda is now known as panda|rover | 09:17 | |
*** brinzh has quit IRC | 09:19 | |
*** ttsiouts has quit IRC | 09:20 | |
*** brinzh has joined #openstack-nova | 09:20 | |
bhagyashris | stephenfin: Hi, just want to ask about reserved_host_cpus: if we set the reserved value of cpus to 0 for both PCPU and VCPU, how will the reservation be maintained? I mean there should be some cpus reserved for host processes | 09:28 |
sean-k-mooney | bhagyashris: that is achieved by not listing them in the cpu_shared_set or the cpu_dedicated_set | 09:32 |
sean-k-mooney | so on an 8 core system you can do cpu_shared_set=2,3 cpu_dedicated_set=4-7 and that leaves cores 0 and 1 for the host | 09:33 |
sean-k-mooney | this is how you reserve cores for the host using vcpu_pin_set today | 09:34 |
bhagyashris | sean-k-mooney: that means the operator should keep in mind that some cpus should be left for host processes while setting the cpu_dedicated_set and cpu_shared_set | 09:34 |
sean-k-mooney | yes | 09:34 |
bhagyashris | okay | 09:34 |
bhagyashris | sean-k-mooney: thank you for clearing my doubt. | 09:34 |
sean-k-mooney | although today reserved_host_cpus defaults to 0 and, by default, vcpu_pin_set if unset enables all host cpus | 09:35 |
sean-k-mooney | so if they dont do that it wont be a change in behavior | 09:35 |
bhagyashris | sean-k-mooney: okay, but suppose the operator assigns all the cpus to the dedicated set, then what will happen in that case? host processes won't have any cpu to run on. | 09:38 |
sean-k-mooney | the host will steal cpu time from the pinned instances | 09:38 |
sean-k-mooney | but again that is what happens today if you dont omit cores from the vcpu_pin_set for the host to use | 09:39 |
*** takashin has left #openstack-nova | 09:40 | |
bhagyashris | okay, so should that case be handled by code or by the operator then | 09:41 |
sean-k-mooney | by the operator | 09:41 |
bhagyashris | okay | 09:41 |
bhagyashris | Thank you :) | 09:42 |
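(For reference, a minimal nova.conf sketch of the split sean-k-mooney describes above for an 8-core host; the option names follow the in-progress cpu-resources work being discussed, so treat it as illustrative rather than a supported configuration on current releases.)

    [compute]
    # cores 2-3 back the shared (VCPU) inventory
    cpu_shared_set = 2,3
    # cores 4-7 back the dedicated/pinned (PCPU) inventory
    cpu_dedicated_set = 4-7
    # cores 0 and 1 appear in neither option, so they remain reserved for host processes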
*** bhagyashris has quit IRC | 09:43 | |
*** pmannidi has quit IRC | 10:05 | |
openstackgerrit | Arnaud Morin proposed openstack/nova master: Always Set dhcp_server in network_info https://review.opendev.org/658362 | 10:08 |
*** awalende_ has joined #openstack-nova | 10:11 | |
*** jaosorior has joined #openstack-nova | 10:12 | |
*** awalende has quit IRC | 10:12 | |
*** mdbooth_ is now known as mdbooth | 10:24 | |
mdbooth | Any reason https://review.opendev.org/658845 shouldn't be backported to stable? | 10:25 |
* mdbooth has posted the backports to Queens and they're all clean | 10:25 | |
*** pcaruana has quit IRC | 10:27 | |
*** xek has joined #openstack-nova | 10:27 | |
*** mmethot has quit IRC | 10:29 | |
*** guozijn has quit IRC | 10:37 | |
*** ricolin has quit IRC | 10:39 | |
*** davidsha has joined #openstack-nova | 10:40 | |
*** tbachman has quit IRC | 10:50 | |
*** luksky has joined #openstack-nova | 10:52 | |
*** nicolasbock has joined #openstack-nova | 10:57 | |
*** panda|rover is now known as panda|rover|eat | 11:04 | |
*** bbowen has joined #openstack-nova | 11:11 | |
*** udesale has quit IRC | 11:15 | |
*** ttsiouts has joined #openstack-nova | 11:17 | |
*** mmethot has joined #openstack-nova | 11:19 | |
*** guozijn has joined #openstack-nova | 11:24 | |
*** yan0s has joined #openstack-nova | 11:25 | |
*** guozijn has quit IRC | 11:33 | |
*** pcaruana has joined #openstack-nova | 11:37 | |
*** dtantsur is now known as dtantsur|brb | 11:46 | |
*** ttsiouts has quit IRC | 11:49 | |
*** ttsiouts has joined #openstack-nova | 11:49 | |
*** florius has joined #openstack-nova | 11:51 | |
*** rabel has joined #openstack-nova | 11:58 | |
rabel | hi. there seems to be a problem with api microversions. i have an openstack rocky (deployed with packstack), but my openstackclient (newest version, directly from git) tells me that the compute api version is 2.1 | 11:59 |
*** florius has left #openstack-nova | 12:01 | |
artom | rabel, it's a known osc behaviour/bug. While novaclient autonegotiates to the greatest common supported microversion, osc defaults to 2.1, and you have to manually specify which microversion you want | 12:02 |
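(For example, a hedged illustration of the difference; the microversion value below is arbitrary and must be one the cloud actually supports.)

    # novaclient negotiates the microversion automatically
    nova list
    # osc stays at 2.1 unless told otherwise, so pass the version explicitly
    openstack --os-compute-api-version 2.60 server list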
*** tbachman has joined #openstack-nova | 12:06 | |
*** ttsiouts has quit IRC | 12:06 | |
*** panda|rover|eat is now known as panda | 12:06 | |
*** tbachman has quit IRC | 12:10 | |
*** panda is now known as panda|rover | 12:11 | |
rabel | artom thanks | 12:13 |
*** tbachman has joined #openstack-nova | 12:13 | |
*** awalende has joined #openstack-nova | 12:27 | |
*** janki has quit IRC | 12:28 | |
*** awalende_ has quit IRC | 12:29 | |
*** ttsiouts has joined #openstack-nova | 12:38 | |
*** tetsuro has joined #openstack-nova | 12:41 | |
*** tetsuro has quit IRC | 12:41 | |
*** ttsiouts has quit IRC | 12:43 | |
alex_xu | efried: I think the in_tree take care of all the cases for https://review.opendev.org/#/c/645520/ also | 12:56 |
*** ccamacho has quit IRC | 12:56 | |
*** ccamacho has joined #openstack-nova | 12:57 | |
*** ratailor has quit IRC | 12:58 | |
*** mchlumsky has joined #openstack-nova | 12:59 | |
*** udesale has joined #openstack-nova | 13:02 | |
*** mchlumsky has quit IRC | 13:06 | |
*** mchlumsky has joined #openstack-nova | 13:08 | |
efried | alex_xu: Okay. I'll take a closer look at that patch soon. | 13:08 |
alex_xu | efried: thx | 13:08 |
*** lbragstad has joined #openstack-nova | 13:09 | |
*** itlinux has quit IRC | 13:10 | |
*** ttsiouts has joined #openstack-nova | 13:12 | |
*** mriedem has joined #openstack-nova | 13:15 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Stop logging traceback when skipping quiesce https://review.opendev.org/659374 | 13:16 |
*** dtantsur|brb is now known as dtantsur | 13:18 | |
*** artom has quit IRC | 13:18 | |
*** ttsiouts has quit IRC | 13:19 | |
*** ttsiouts has joined #openstack-nova | 13:20 | |
*** tbachman has quit IRC | 13:20 | |
*** wwriverrat has quit IRC | 13:22 | |
mriedem | dansmith: random idea, but with something like https://bugs.launchpad.net/nova/+bug/1829349 where the RT is constantly failing b/c of conflicting pinned CPUs on the same host, we could leverage the build failure stats thing and if the RT fails after x hits or whatever we try to take that host out of scheduling | 13:23 |
openstack | Launchpad bug 1829349 in OpenStack Compute (nova) "Resource Tracker failed to update usage in case numa topology conflict happen" [Undecided,New] | 13:23 |
mriedem | they did the naughty and enabled live migration for numa instances, 2 landed on the same host with conflicting pinned CPUs, and now the RT constantly fails but the scheduler continues to pick the host b/c it looks free since its usage isn't getting reported | 13:24 |
dansmith | oye | 13:25 |
dansmith | isn't the build failure count going up in that case anyway? | 13:25 |
mriedem | maybe, i can ask in the bug report | 13:25 |
sean-k-mooney | mriedem: did we fix the typo so that that config option works yet | 13:26 |
* sean-k-mooney goes to check | 13:26 | |
mriedem | the qemu vs QEMU thing? | 13:27 |
mriedem | it wouldn't matter, | 13:28 |
*** xek_ has joined #openstack-nova | 13:28 | |
mriedem | because in this case they enabled the workaround to allow live migrating instances with numa | 13:28 |
sean-k-mooney | ya we were checking for hypervisor = kvm but it's always qemu | 13:28 |
sean-k-mooney | yes we fixed it https://review.opendev.org/#/c/635350/ | 13:29 |
sean-k-mooney | we never actually released with it broken so we should be good | 13:29 |
mriedem | well, this person is on rocky | 13:29 |
mriedem | but yes https://review.opendev.org/#/q/I5127227a1e3d76dd413a820b048547ba578aff6b | 13:29 |
*** xek has quit IRC | 13:30 | |
*** tbachman has joined #openstack-nova | 13:31 | |
sean-k-mooney | mriedem: i would have hoped the numa topology filter would have prevented new vms from being scheduled, by the way, if there was no space | 13:31 |
sean-k-mooney | the one that was migrated would not be fixed but ya | 13:31 |
aarents | Hi, mdbooth I updated https://review.opendev.org/#/c/659054/ if you want to have a look, maybe also mriedem ? | 13:32 |
sean-k-mooney | which reminds me i have a meeting, then i need to update the docs for the sriov migration | 13:32 |
mdbooth | aarents: Thanks, still +1. | 13:34 |
adrianc | sean-k-mooney: Hi, id like to see if we can converge on the FUP for SR-IOV LM, would be great if you can take a quick look :) | 13:34 |
*** brinzh has quit IRC | 13:41 | |
sean-k-mooney | adrianc: sure i think this was more or less fine so ill take a look at the latest version | 13:41 |
adrianc | sean-k-mooney, i did not upload a new PS as i wanted to reach an agreement on the property thing, anyway... you will see | 13:44 |
*** Luzi has quit IRC | 13:45 | |
*** takashin has joined #openstack-nova | 13:49 | |
*** mgariepy has quit IRC | 13:55 | |
efried | Nova meeting in 4 minutes in #openstack-meeting | 13:56 |
*** pcaruana has quit IRC | 13:56 | |
*** artom has joined #openstack-nova | 13:57 | |
mriedem | tssurya: i'm going to remove the locked reason stuff from the runways etherpad since the nova and novaclient stuff is done https://etherpad.openstack.org/p/nova-runways-train | 14:02 |
*** JamesBenson has joined #openstack-nova | 14:02 | |
*** mgariepy has joined #openstack-nova | 14:03 | |
*** JamesBenson has quit IRC | 14:07 | |
*** UdayTKumar has joined #openstack-nova | 14:07 | |
efried | sean-k-mooney: Last week's nova meeting we got a request to review | 14:09 |
efried | https://review.opendev.org/#/c/643023/ | 14:09 |
efried | https://review.opendev.org/#/c/643024/ | 14:09 |
efried | which it looked like you were actively looking into. Would you be able to follow up please? | 14:09 |
sean-k-mooney | yes | 14:10 |
tssurya | mriedem: ack | 14:10 |
*** alex_xu has quit IRC | 14:10 | |
efried | thanks sean-k-mooney | 14:10 |
*** JamesBenson has joined #openstack-nova | 14:14 | |
*** JamesBenson has quit IRC | 14:18 | |
sean-k-mooney | adrianc: just replied on https://review.opendev.org/#/c/659101/2/nova/objects/instance_pci_requests.py@47 | 14:18 |
adrianc | thx sean-k-mooney, ill take a look shortly | 14:19 |
sean-k-mooney | im going to go review https://review.opendev.org/#/q/topic:bug/1809095+(status:open+OR+status:merged) before my next meeting | 14:20 |
sean-k-mooney | which is your fix for bug #1809095 | 14:21 |
openstack | bug 1809095 in OpenStack Compute (nova) "Wrong representor port was unplugged from OVS during cold migration" [Medium,In progress] https://launchpad.net/bugs/1809095 - Assigned to Adrian Chiris (adrian.chiris) | 14:21 |
*** mlavalle has joined #openstack-nova | 14:22 | |
*** itlinux has joined #openstack-nova | 14:27 | |
*** itlinux has quit IRC | 14:31 | |
mriedem | btw, i remember mtreinish trying to get glance-api moved to wsgi in pike and it got hung up on threading stuff with eventlet, hence https://review.opendev.org/#/c/549743/ | 14:32 |
dansmith | yup | 14:32 |
mriedem | oh and https://review.opendev.org/#/c/531498/ | 14:34 |
mriedem | ^ taskflow evenlet async stuff + wsgi + glance-api | 14:35 |
*** d34dh0r53 has quit IRC | 14:36 | |
mtreinish | yep, but I couldn't get anyone to ever really review that or explain how to use the async glance APIs. And there is no dsvm coverage for any of it | 14:36 |
*** d34dh0r53 has joined #openstack-nova | 14:37 | |
*** pcaruana has joined #openstack-nova | 14:37 | |
mriedem | mtreinish: for context we're going over similar in nova-api https://review.opendev.org/#/c/650172/ | 14:37 |
cdent | I reckon the root of the problems in both glance and nova is that we are trying to do async work in an environment (an HTTP API) where it doesn't make sense. The traditional python solution to this problem is to dump into an offboard async task queue (celery and the like). The scatter/gather thing is not quite in the same boat as the glance stuff (which is more amenable to the celery thing perhaps) | 14:41 |
*** d34dh0r53 is now known as Guest28904 | 14:41 | |
mtreinish | cdent: yeah, that was the best way to fix it, to just do the async bits in another service | 14:46 |
mtreinish | but I assumed just launching another process or thread to handle it should work as a quick workaround | 14:46 |
cdent | it ought to | 14:46 |
cdent | but I would guess that zombie blocked processes would present the same concern as blocked threads | 14:47 |
cdent | which might suggest: that's the way unix goes, deal... | 14:47 |
*** JamesBenson has joined #openstack-nova | 14:51 | |
sean-k-mooney | i mean we could just not set a limit on the worker threads in the thread pool or make it a config option | 14:53 |
sean-k-mooney | python will grow and shrink the thread pool automatically if you dont specify the number of workers | 14:53 |
dansmith | that's the problem, these are a threadpool, which means they're re-used and capped at some max number, so if enough api requests come in during a cell-down event, you immediately exhaust the pool of workers until things time out | 14:53 |
dansmith | sean-k-mooney: only up to N*CPUs I think I read | 14:54 |
dansmith | where N is like 3 or something | 14:54 |
sean-k-mooney | dansmith: maybe not sure | 14:54 |
sean-k-mooney | but we could leave it as a config option and default it to -1, meaning let python figure it out | 14:55 |
sean-k-mooney | or set it to X if you care and have measured for your system | 14:55 |
mriedem | https://docs.python.org/3.6/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor | 14:55 |
mriedem | "If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor." | 14:55 |
dansmith | right, so with 5 cells, that becomes $CPUs | 14:56 |
dansmith | and everyone has at least two | 14:56 |
cdent | when calculating the number of threads that will be around, another thing to keep in mind is mod_wsgi will often run with multiple processes and multiple threads in those processes. what is the pool associated with? | 14:57 |
cdent | every request or thread or process? | 14:57 |
* cdent has lost track of the actual code, reviews | 14:58 | |
dansmith | idk | 14:58 |
cdent | process | 14:58 |
sean-k-mooney | so we could revert to creating the thread pool per call and set it to one thread per cell | 14:59 |
sean-k-mooney | but the issue with that is (a) it's expensive to spawn threads and (b) we no longer have a limit on them | 15:00 |
*** ratailor has joined #openstack-nova | 15:00 | |
sean-k-mooney | but we will not have the issue of potentially exhausting the thread pool | 15:00 |
sean-k-mooney | cdent: the pool is a module global in the current patch so it would be associated with the process | 15:01 |
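(For illustration, a minimal sketch of the module-global, bounded thread-pool scatter/gather being discussed; the cell objects and query function are hypothetical and this is not the actual patch under review.)

    import concurrent.futures

    # module-global pool, shared per wsgi process like the patch under review;
    # with max_workers=None, Python 3.6 sizes it at 5 * CPU count
    _EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=None)

    def scatter_gather(cells, query_fn, timeout=60):
        """Run query_fn against every cell in parallel and collect the results."""
        futures = {_EXECUTOR.submit(query_fn, cell): cell for cell in cells}
        results = {}
        # as_completed raises TimeoutError if any cell takes longer than timeout
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            cell = futures[future]
            try:
                results[cell] = future.result()
            except Exception as exc:
                # a down cell surfaces here as an exception (or a sentinel in nova)
                results[cell] = exc
        return results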
*** takashin has left #openstack-nova | 15:02 | |
cdent | sean-k-mooney: [t 6MVa] | 15:03 |
purplerbot | <cdent> process [2019-05-16 14:58:44.001687] [n 6MVa] | 15:03 |
*** artom has quit IRC | 15:03 | |
sean-k-mooney | ? | 15:03 |
cdent | [l 6MVa] | 15:04 |
purplerbot | http://p.anticdent.org/6MVa | 15:04 |
cdent | in other words, I figured it out by looking at the code a few minutes ago | 15:04 |
sean-k-mooney | oh right :) | 15:04 |
mriedem | efried: on that ironic job failure, it looks like that job has probably been broken since stein when we started reporting capability traits for nodes http://logs.openstack.org/32/634832/29/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/fba9197/controller/logs/devstacklog.txt.gz#_2019-05-16_03_28_21_590 | 15:06 |
jangutter | cdent: I really thought you were mistyping 6 million volt-ampere. | 15:06 |
mriedem | if i'm reading that correctly it's expecting only the CUSTOM_GOLD trait | 15:06 |
sean-k-mooney | dansmith: what was the behavior that motivated the scatter-gather code in the first place, do you remember? was it to deal with down cells or was it a performance improvement? | 15:06 |
dansmith | sean-k-mooney: it's to avoid a sequential walk of all the cell databases, serializing their latencies | 15:07 |
sean-k-mooney | ah ok so it was to address performance, ok | 15:07 |
cdent | Are there numbers on those latencies? | 15:07 |
dansmith | sean-k-mooney: cern has 70 cells, some of which are remote | 15:07 |
cdent | ah. the remote part | 15:08 |
dansmith | as in, in another country | 15:08 |
efried | mriedem: looks like the fix is just to take out the square brackets, nah? | 15:08 |
sean-k-mooney | ya even if they were all in the same data center it could be a problem as the number of cells grows | 15:08 |
dansmith | sean-k-mooney: especially if one in the middle is timing out, and we wait for that to complete before moving to the next one and get its response | 15:09 |
dansmith | and everyone has two of these at minimum so moving to sequential means 2N latency, which is twice as bad, regardless of what N is | 15:09 |
sean-k-mooney | i was just wondering if the initial approach to solving the problem was painting us into a corner that could be avoided by revisiting the original use case. cool | 15:10 |
sean-k-mooney | effectively what we want to do is queue up a bunch of async requests and wait for them to complete | 15:11 |
dansmith | ..a thing for which eventlet is ideal | 15:11 |
sean-k-mooney | eventlet and thread pools both allow that with different tradeoffs | 15:11 |
*** awalende has quit IRC | 15:11 | |
sean-k-mooney | the issue isn't the use of eventlet itself | 15:11 |
*** awalende has joined #openstack-nova | 15:12 | |
sean-k-mooney | it's that mod_wsgi and uwsgi have both exposed different bugs caused by using eventlet with them | 15:12 |
*** cfriesen has joined #openstack-nova | 15:12 | |
*** ivve has quit IRC | 15:13 | |
sean-k-mooney | does wsgi provide a way to run requests like this in the background that we could leverage, and fall back to eventlet when not running under wsgi? | 15:15 |
*** artom has joined #openstack-nova | 15:15 | |
*** awalende_ has joined #openstack-nova | 15:15 | |
sean-k-mooney | google found me https://pypi.org/project/uwsgi-tasks/ but no idea if that would help | 15:16 |
*** awalende has quit IRC | 15:16 | |
cdent | that could work for the uwsgi situation, but not mod_wsgi, presumably | 15:18 |
*** luksky has quit IRC | 15:19 | |
sean-k-mooney | i dont know, i personally would prefer to try melwitt's approach first and maybe ask cern to try it and get feedback | 15:19 |
*** JamesBenson has quit IRC | 15:19 | |
*** awalende_ has quit IRC | 15:20 | |
*** JamesBenson has joined #openstack-nova | 15:20 | |
cdent | sean-k-mooney: you mean the current code under review? that seems reasonable. I guess the issue is that we don't know if the problem is a problem until it gets hit... | 15:20 |
sean-k-mooney | cdent: right | 15:20 |
cdent | mriedem seemed to think it could be caused with some effort in devstack | 15:21 |
sean-k-mooney | and yes the current code under review | 15:21 |
efried | mriedem: I'm going to propose that fix unless you've arleady got it in the pipe | 15:21 |
mriedem | efried: i don't, i'm trying to recreate with bash locally | 15:23 |
efried | ack | 15:23 |
mriedem | cdent: i know how to manually test multiple cells in a single node devstack and make one or more down, but that doesn't mean i know how to recreate the issue | 15:24 |
mriedem | we haven't hit this problem in the gate as far as i know with nova-api under wsgi + 2 cells | 15:24 |
mriedem | we also now have the nova-multi-cell job which is effectively 3 cells | 15:24 |
mriedem | cell0/1/2 | 15:25 |
melwitt | imacdonn: do you happen to know why/how the uwsgi/mod_wsgi problem doesn't seem to show itself in the gate? | 15:26 |
cdent | as I remember (which may be wrong) the problem with eventlet needs some work, a long pause, then some attempted work to show up. And the expected problem with threads requires connections to databases (or other I/O) that block weirdly and don't time out in natural ways | 15:26 |
mriedem | efried: removing the brackets gives me this: | 15:26 |
mriedem | jq: error (at <stdin>:1): array (["COMPUTE_N...) and string ("CUSTOM_GOLD") cannot have their containment checked | 15:26 |
cdent | melwitt: I reckon it's the lack of pause | 15:26 |
*** ccamacho has quit IRC | 15:26 | |
melwitt | aye | 15:26 |
efried | mriedem: ugh, I guess it would help me to know wtf jq is. | 15:27 |
mriedem | efried: json parser | 15:27 |
mriedem | efried: i considered just re-writing this with osc-placement and grep | 15:27 |
efried | seems like that would be better. | 15:27 |
mriedem | openstack --os-placement-api-version 1.6 resource provider trait list $provider_id | grep $trait | 15:27 |
mriedem | it would definitely be easier to f'ing read | 15:27 |
efried | agreed | 15:27 |
mriedem | i'll push that up | 15:28 |
efried | ight | 15:28 |
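(For illustration, hedged examples of the containment check; the trait list shown is a made-up sample, not the job's actual output.)

    # jq: index() matches an exact element, so this exits 0 when the trait is present
    echo '["COMPUTE_NET_ATTACH_INTERFACE", "CUSTOM_GOLD"]' | jq -e 'index("CUSTOM_GOLD") != null'
    # contains() also works but wants an array argument (and does substring matching on strings)
    echo '["COMPUTE_NET_ATTACH_INTERFACE", "CUSTOM_GOLD"]' | jq -e 'contains(["CUSTOM_GOLD"])'
    # or skip jq entirely and grep the osc-placement output, as suggested above
    openstack --os-placement-api-version 1.6 resource provider trait list $provider_id | grep CUSTOM_GOLD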
efried | mriedem: I made a story | 15:28 |
mriedem | this would also suggest all ironic jobs would be busted since stein and if that were the case i'd think we would have heard about it by now... | 15:28 |
efried | mriedem: https://storyboard.openstack.org/#!/story/2005725 | 15:28 |
efried | you would think, yah | 15:28 |
mriedem | cdent: melwitt: dansmith: you don't think it's as simple as adding an https://eventlet.net/doc/modules/greenthread.html#eventlet.greenthread.sleep between calls in the scatter gather do you? | 15:29 |
imacdonn | melwitt: I don't know enough about what happens in the gate ... but the problem manifests when nova-api does something that requires an API call, then goes quiet for more than a minute or so, then gets another request that requires RPC | 15:30 |
dansmith | the problem is that we're not monkeypatched properly such that background threads aren't being run and thus the heartbeat for the rabbit connection isn't working | 15:30 |
dansmith | which seems like it should manifest in the gate easily if it's fundamental to how we're doing service setup, and which also means other things shouldn't really be working | 15:31 |
dansmith | so I'm kinda suspicious of the whole thing, and don't really want to "solve" it by changing our whole threading model for one service | 15:31 |
dansmith | but I don't really have time to dig deep, as I said | 15:31 |
cdent | dansmith: I thought the "more than a minute or so" was a critical factor as it is resulting in mod_wsgi doing $something to the python process that breaks the monkey patching | 15:32 |
cdent | but I have to admit that finding the sensible thread through all this is hard so I'm not sure if I'm remembering the right stuff | 15:32 |
mriedem | going quiet for a minute or so is something that wouldn't happen in the gate yeah | 15:33 |
imacdonn | cdent: rabbitmq, in a typical configuration, will drop a connection if it has not seen heartbeats for over a minute | 15:33 |
mriedem | tempest is all hammer until it's done and then we teardown | 15:33 |
imacdonn | cdent: that's what leads to a "connection reset by peer" when nova-api wakes up again and tries to use the connection that it thinks is still healthy | 15:33 |
dansmith | so you think we're just not running heartbeat background tasks if no api requests are coming in? | 15:33 |
cdent | it does sound a bit that way | 15:33 |
imacdonn | I have observed that with debug logging | 15:34 |
dansmith | seems like that should be easy to reproduce in a devstack no? | 15:34 |
mriedem | could we maybe tickle this in a post test script after tempest is done, sleep a minute or something, and then try to create or do something to a server, like stop it? | 15:34 |
*** helenafm has quit IRC | 15:35 | |
kashyap | efried: Thanks for representing on #openstack-meeting (still haven't caught up). Stuck in consecutive calls, afraid. | 15:35 |
dansmith | mriedem: or just a devstack | 15:35 |
mriedem | yeah..i will leave devstack running for many more minutes before coming back to it to do something | 15:35 |
*** ttsiouts has quit IRC | 15:35 | |
dansmith | changing the whole threading model for this seems like a big hammer | 15:35 |
mriedem | also note that vexxhost is running stein and hasn't reported this being a problem... | 15:36 |
cdent | mriedem: is your devstack using mod_wsgi, by default it won't be | 15:36 |
mriedem | cdent: oh no i use the default (uwsgi) | 15:36 |
cdent | and it's not clear the problem is the same uwsgi as mod_wsgi | 15:36 |
cdent | it _might_ be | 15:36 |
dansmith | I thought imacdonn said that it repros the same in uwsgi? | 15:36 |
cdent | but since the only people we've heard of this problem from are tripleo, it's not clear | 15:36 |
melwitt | I think imacdonn said it behaves the same uwsgi and mod_wsgi, IIRC | 15:37 |
imacdonn | I have easily reproduced the exact symptoms (including absence of heartbeats with debug logging) with both mod_wsgi and uWSGI | 15:37 |
cdent | ah, okay | 15:37 |
cdent | cool | 15:37 |
imacdonn | and I'm not tripleo, FWIW | 15:38 |
dansmith | ...said with beaming pride :) | 15:38 |
*** _d34dh0r53_ is now known as d34dh0r53 | 15:38 | |
imacdonn | heh | 15:38 |
cdent | so is the other way around the problem to catch the reset by peer and reopen? because expecting an idle web service to make rpc heartbeats is ... odd | 15:38 |
cdent | imacdonn: sorry, like I said, it's been hard to keep track of the details on this one | 15:38 |
dansmith | yeah | 15:38 |
dansmith | so that's the other thing, | 15:38 |
*** ricolin has joined #openstack-nova | 15:38 | |
dansmith | why is the connection not just immediately re-youknow-connected? | 15:38 |
dansmith | because that would be a lot more sane | 15:39 |
* cdent nods | 15:39 | |
*** macza has joined #openstack-nova | 15:39 | |
imacdonn | the whole point of heartbeats is to keep connections alive when there's no traffic | 15:39 |
imacdonn | for example, a firewall idle timeout could bite you otherwise | 15:39 |
cdent | sure, but a web server process may not even exist in some deployment scenarios when there are no requests | 15:39 |
cdent | who is going to send the heartbeat? | 15:40 |
dansmith | imacdonn: but only really for things where you might be expecting to hear something from the other side | 15:40 |
dansmith | imacdonn: the api server receives no messages other than replies, | 15:40 |
imacdonn | the wsgi app container, I guess? and if that doesn't exist, there's no problem ... a new one would start a new connection | 15:40 |
dansmith | so there's really no reason for a connection except when it's processing a reuest | 15:40 |
imacdonn | I guess I might have assumed that setting up a new connection incurs some expense ... so it's better to keep it open once you have it | 15:41 |
dansmith | open connections that aren't used are pretty expensive in rabbit terms ;) | 15:42 |
imacdonn | kindof like the REST concept | 15:42 |
imacdonn | re. "things where you might be expecting to hear something from the other side" - I wonder if there are cases where that could break - maybe an async RPC call that takes a long time to complete ? | 15:43 |
dansmith | imacdonn: in that case you're processing a request still | 15:44 |
dansmith | what I don't understand is, if the wsgi container has spawned a process to do a thing, and has not killed it off during times of inactivity, what is it doing that prevents the normal threading background stuff from running like any other process? | 15:44 |
dansmith | unless we've tickled something in the python threading library before patching or whatever, but if that's the problem that's the kind of thing we need to know | 15:45 |
dansmith | I'd be interested in what the differences are between you and everyone else.. same exact (unmodified) nova, oslo, et al code? same version of deps, etc? | 15:46 |
dansmith | these are the things worth chasing, imho | 15:46 |
imacdonn | do we know of anyone running stein nova-api on wsgi successfully ? | 15:46 |
imacdonn | cos I only know of the other guys who are seeing the same issue as me | 15:47 |
cdent | it's always running on wsgi, whether you're using nova-api script or using mod_wsgi or uwsgi. the distinguishing factor is standalone eventlet server versus everything else | 15:47 |
*** pcaruana has quit IRC | 15:47 | |
imacdonn | dansmith: see also https://bugs.launchpad.net/nova/+bug/1825584/comments/7 - I haven't really tried to digest that yet | 15:48 |
openstack | Launchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,New] | 15:48 |
sean-k-mooney | cdent: is it running on wsgi when using the console script | 15:48 |
sean-k-mooney | i did not think that was the case | 15:48 |
cdent | yes, wsgi is the protocol by which a web request is translated into an environ dict that is passed to a callable handler along with a start_response function | 15:49 |
cdent | it's the reason why it is possible to run what amounts to the same code with the eventlet server based console script and "everything else" | 15:50 |
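(A minimal sketch of the protocol cdent describes: whether it's the eventlet-server console script, uwsgi or mod_wsgi, the server ultimately calls an application object shaped like this.)

    def application(environ, start_response):
        # The WSGI contract: the server passes a per-request environ dict and a
        # start_response callable; the app returns an iterable of body bytes.
        body = b'{"versions": []}'
        start_response('200 OK', [('Content-Type', 'application/json'),
                                  ('Content-Length', str(len(body)))])
        return [body]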
dansmith | imacdonn: okay if it's stopping the interpreter, why does using another real thread change it? does it just pause the main thread's interpreter and release the GIL? | 15:50 |
sean-k-mooney | cdent: i did not know https://github.com/openstack/nova/blob/master/nova/cmd/api.py was implementing the wsgi protocol but it looks like it is | 15:50 |
dansmith | if the wsgi server is pausing the interpreter, I take that to mean it expects the thing to stop doing stuff, and thus, shouldn't really be doing heartbeats? | 15:50 |
cdent | imacdonn: have you tried melwitt's patch? does it "fix it". Because if not, maybe it doesn't... | 15:51 |
*** ttsiouts has joined #openstack-nova | 15:51 | |
cdent | and what we said above about reopening the connection is the real fix | 15:51 |
imacdonn | cdent: depends which patch, I guess .. there was a set of 3 related ones, and the last one removed the monkey patching ... I did some basic testing of that, and it seemed to work OK | 15:52 |
imacdonn | by "that", I mean "all three" | 15:52 |
cdent | imacdonn: this one: https://review.opendev.org/#/c/650172/ | 15:52 |
imacdonn | I think I did (most of) these: https://review.opendev.org/#/q/topic:cell-scatter-gather-futurist+(status:open+OR+status:merged) | 15:53 |
*** wwriverrat has joined #openstack-nova | 15:54 | |
dansmith | so if you disable monkeypatching but still use eventlet, I think everything just becomes synchronous, right? | 15:54 |
dansmith | in that case, the "it works" is probably just because you're doing everything serially, so unless you apply the other patches to make it actually use the native threading, it's not a relevant test, IMHO | 15:55 |
sean-k-mooney | dansmith: if you dont call eventlet directly i think so | 15:55 |
*** ttsiouts has quit IRC | 15:55 | |
sean-k-mooney | im not sure what happens if you do call eventlet spawn et al | 15:55 |
dansmith | again, I think we need to repro this in a clean devstack and/or the gate before we get too much further.. or figure out why we can't. | 15:56 |
sean-k-mooney | you mean the rabbitmq issues | 15:57 |
sean-k-mooney | is the theory that they will start happening if the api is not actively used for a few minutes | 15:58 |
*** rpittau is now known as rpittau|afk | 15:58 | |
sean-k-mooney | e.g. it should break if we stack, go have coffee and come back? | 15:58 |
*** imacdonn has quit IRC | 15:58 | |
bnemec | There's some analysis of what was happening in http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html | 16:01 |
*** imacdonn has joined #openstack-nova | 16:02 | |
imacdonn | I missed the last ~5 mins ... *^&&*$% stupid VPN | 16:02 |
*** sridharg has quit IRC | 16:02 | |
imacdonn | for repro, I think the best way is to enable debug logging for oslo_messaging and observe heartbeats (or lack thereof) | 16:03 |
cdent | dansmith, mriedem I put a comment with links to this discussion and some other summaries on the ThreadPoolExecutor patch | 16:03 |
imacdonn | (after making an API call that requires RPC - like getting the VNC console URL for an instance) | 16:03 |
dansmith | imacdonn: is it specific to vnc console by chance? because many api calls make rpc calls.. pretty much anything other than get/list instances | 16:04 |
fungi | mriedem et al: i found what was wrong with http://status.openstack.org/reviews/ and got content updating again, though i'm at a loss for explanation as to how it broke (one of the dependencies we pip install for reviewday, cheetah, was somehow not installed on the server and all reasonable explanations have been ruled out) | 16:04 |
fungi | anyway, let us know if you're relying on it and see it go stale again somehow | 16:04 |
fungi | maybe i'll be able to track it down that way | 16:05 |
fungi | should update roughly every 30 minutes | 16:05 |
imacdonn | dansmith: I don't believe so .. that's just an easy way to make an RPC call happen | 16:05 |
*** dims has quit IRC | 16:05 | |
*** tesseract has quit IRC | 16:05 | |
*** yan0s has quit IRC | 16:07 | |
dansmith | confirming that would be good, because I think we have mucked with console stuff recently | 16:07 |
dansmith | and would also help narrow down the differences | 16:07 |
*** ratailor has quit IRC | 16:07 | |
imacdonn | I have observed other things failing, but didn't make note of what specifically .. is there a particular call that you would suggest? | 16:09 |
*** awalende has joined #openstack-nova | 16:09 | |
dansmith | I suggest reproducing this with a devstack | 16:09 |
imacdonn | working on that (devstack) now | 16:09 |
mriedem | fungi: thanks | 16:11 |
fungi | any time! | 16:11 |
*** awalende has quit IRC | 16:13 | |
*** yonglihe has quit IRC | 16:14 | |
*** cdent has quit IRC | 16:17 | |
*** dtantsur is now known as dtantsur|afk | 16:19 | |
*** luksky has joined #openstack-nova | 16:21 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Support cross-cell moves in external_instance_event https://review.opendev.org/658478 | 16:40 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add cross-cell resize policy rule and enable in API https://review.opendev.org/638269 | 16:40 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Enable cross-cell resize in the nova-multi-cell job https://review.opendev.org/656656 | 16:40 |
aspiers | kashyap, efried: are you good with https://review.opendev.org/#/c/655268/ now? | 16:40 |
*** jmlowe has quit IRC | 16:41 | |
mriedem | efried: i never +2ed this https://review.opendev.org/#/c/659374/ | 16:41 |
mriedem | well, | 16:41 |
mriedem | i did before -1ing it | 16:41 |
*** ricolin has quit IRC | 16:44 | |
mriedem | but now i am :) | 16:44 |
*** ttsiouts has joined #openstack-nova | 16:50 | |
imacdonn | dansmith: I've reproduced the lack of heartbeats on devstack (and confirmed that the heartbeats happen if monkey patching is removed) ... but somehow (so-far) devstack is able to recover from the unexpected loss of amqp connection, where my RDO environment is (sometimes) not | 16:51 |
dansmith | okay so devstack is behaving like we would expect, assuming the pause-the-interpreter wsgi behavior you describe | 16:51 |
imacdonn | so-far, I suppose so | 16:52 |
imacdonn | if the pause is considered "expected" | 16:52 |
*** mrch_ has quit IRC | 16:53 | |
*** bauzas has quit IRC | 16:54 | |
dansmith | if the pause isn't, then the problem is in the configuration of the wsgi container or something else | 16:56 |
*** bauzas has joined #openstack-nova | 16:57 | |
imacdonn | but it only happens when python has been monkeyed-with | 16:57 |
dansmith | mm, I think you're misunderstanding | 16:58 |
dansmith | I think that wsgi is pausing regardless, | 16:58 |
*** davidsha has quit IRC | 16:58 | |
dansmith | and in the eventlet case, there is only one thing to pause, so it pauses | 16:58 |
dansmith | and without eventlet, we spawn a real thread that the wsgi container does not pause (only the main thread) | 16:58 |
dansmith | which is why I say that if the design of the wsgi container is to pause the whole thing between requests, then us subverting that with a real thread seems wrong | 16:59 |
imacdonn | I definitely don't understand the mechanisms well enough ... so my observation is fairly high-level ... in one case, the heartbeats happen, so it seems awake .. in the other case, it goes completely quiet | 16:59 |
imacdonn | so if it's wrong, the whole oslo_messaging model seems inappropriate ..... ??? | 17:00 |
*** derekh has quit IRC | 17:00 | |
imacdonn | cos it does heartbeats quite intentionally, AFAIK ... | 17:00 |
dansmith | well, o.msg _was_ developed with eventlet servers in mind | 17:00 |
*** jmlowe has joined #openstack-nova | 17:01 | |
*** dims has joined #openstack-nova | 17:03 | |
*** lbragstad has quit IRC | 17:05 | |
dansmith | so here's a random person asking a similar question: https://github.com/benoitc/gunicorn/issues/1924 | 17:06 |
dansmith | not much response there, except that it basically says "don't spawn worker threads from your wsgi app" | 17:06 |
dansmith | check this out: https://docs.newrelic.com/docs/agents/python-agent/web-frameworks-servers/python-agent-uwsgi-web-server | 17:08 |
dansmith | "By default uWSGI does not enable threading support within the Python interpreter core. This means it is not possible to create background threads from Python code. " | 17:09 |
dansmith | ...which might mean that disabling eventlet and letting threading be unpatched just gets you synchronous behavior | 17:09 |
*** tssurya has quit IRC | 17:10 | |
dansmith | I wonder if using --enable-threads but leaving the monkeypatching in place would allow the eventlet background threads to run? | 17:11 |
imacdonn | "By default the Python plugin does not initialize the GIL. This means your app-generated threads will not run. If you need threads, remember to enable them with enable-threads. Running uWSGI in multithreading mode (with the threads options) will automatically enable threading support. This “strange” default behaviour is for performance reasons, no shame in that." | 17:12 |
imacdonn | ^ from https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html | 17:12 |
dansmith | yeah, just saw that.. so... do you have threads enabled? | 17:12 |
*** dims has quit IRC | 17:12 | |
imacdonn | checking ... probably not | 17:12 |
*** lbragstad has joined #openstack-nova | 17:13 | |
*** panda|rover is now known as panda|rover|off | 17:13 | |
*** dims has joined #openstack-nova | 17:13 | |
dansmith | if I were you, I'd really want to find out why devstack recovers properly and yours does not though | 17:14 |
imacdonn | or... | 17:14 |
imacdonn | # grep threads /etc/nova/nova-api-uwsgi.ini | 17:14 |
imacdonn | enable-threads = true | 17:14 |
imacdonn | # | 17:14 |
imacdonn | (this is on devstack) | 17:14 |
imacdonn | thinking about that some more ... I think if enable-threads was not on, the heartbeat thread would not work in the non-patched case (?) | 17:15 |
imacdonn | guess I could attempt to test that | 17:15 |
*** ivve has joined #openstack-nova | 17:16 | |
imacdonn | confirmed that I do have enable-threads in my "real" (non-devstack) config too | 17:16 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Fix some issues with the newton release notes https://review.opendev.org/659623 | 17:18 |
dansmith | so, | 17:18 |
dansmith | thinking about this some more | 17:19 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Use new hotness for upper-constraints https://review.opendev.org/659624 | 17:19 |
dansmith | I'm not sure how eventlet could really ever work in the wsgi mode, because background tasks really require the main thread to block on the event loop | 17:19 |
dansmith | I'm struggling to find where the entry point is for nova-api in wsgi mode... | 17:19 |
imacdonn | that seems to be the consensus ... eventlet and wsgi containers seem fundamentally incompatible | 17:21 |
dansmith | well, | 17:21 |
dansmith | not really, | 17:21 |
dansmith | just wsgi+eventlet with a background thread | 17:21 |
imacdonn | ok | 17:21 |
imacdonn | FWIW, with enable-threads turned off, and monkey patching still removed, it looks the same - no heartbeats after the API request that opens an RPC connection | 17:23 |
dansmith | yeah, not running the event loop means we'll just never do that thing | 17:23 |
*** ttsiouts has quit IRC | 17:24 | |
dansmith | tbh, it kinda seems like what we should be doing is splitting nova-api into two pieces in a wsgi world | 17:25 |
dansmith | which is, of course, not how it was designed | 17:25 |
imacdonn | someone else suggested that | 17:25 |
dansmith | mnaser: you around? | 17:27 |
dansmith | imacdonn: how instantaneous is the reconnect to rabbit on devstack? | 17:28 |
imacdonn | no perceptible delay | 17:28 |
mnaser | dansmith: hiya | 17:29 |
dansmith | mnaser: you are on stein and running nova-api in wsgi something right? | 17:29 |
*** _alastor_ has quit IRC | 17:29 | |
mnaser | Correct. Uwsgi | 17:29 |
dansmith | mnaser: and you don't see this rabbit heartbeat timeout freezeup thing correct? | 17:30 |
*** xek__ has joined #openstack-nova | 17:30 | |
mnaser | I have not. I thought the ML post thread concluded it was mod_wsgi related ? | 17:30 |
dansmith | mnaser: nope, both apparently | 17:30 |
*** _alastor_ has joined #openstack-nova | 17:31 | |
dansmith | mnaser: imacdonn says it's reproducible in devstack too, but it just reconnects instantaneously and thus isn't a problem (which is what ideal behavior should be) | 17:31 |
mnaser | Hmm, I mean, maybe it could be? I’ve not seen any issues but maybe our APIs get hit often enough? | 17:31 |
dansmith | i.e. not requiring the wsgi app to run a daemon thread to maintain a connection | 17:31 |
mnaser | Let me grep some logs | 17:32 |
dansmith | mnaser: possible I guess. is there any way you could take one worker out of LB rotation and see if it hangs up after a couple minutes if you hit it directly? | 17:32 |
mnaser | What string should I look for? | 17:32 |
dansmith | imacdonn: ^ | 17:32 |
imacdonn | QRX ... reproducing again | 17:32 |
imacdonn | erm, I mean .. standby :) | 17:32 |
dansmith | imacdonn: ah, so you are the ham I keep finding in my googling :D | 17:32 |
*** xek_ has quit IRC | 17:32 | |
imacdonn | even an occasional chirp user ;) | 17:33 |
dansmith | hah nice | 17:33 |
mnaser | gosh the latency is awful from here | 17:34 |
mnaser | oh actually | 17:34 |
mnaser | I have a quiet stein cloud here | 17:34 |
*** dklyle has quit IRC | 17:34 | |
dansmith | imacdonn: is it possible that mod_wsgi is the reason you don't get a reconnect and that uwsgi, despite pausing, makes that just work? | 17:35 |
imacdonn | mnaser: one thing you could look for is, in your rabbitmq server log: | 17:35 |
imacdonn | =ERROR REPORT==== 16-May-2019::16:45:39 === | 17:35 |
imacdonn | closing AMQP connection <0.4566.0> (x.x.x.x:34178 -> x.x.x.x:5672 - uwsgi:47892:8bc2c402-d1d4-4762-9b83-29a7b7f419fd): | 17:35 |
imacdonn | missed heartbeats from client, timeout: 60s | 17:35 |
imacdonn | dansmith: no, because I'm using uWSGI "for real" too | 17:35 |
dansmith | okay | 17:35 |
-openstackstatus- NOTICE: Gerrit is being restarted to add gitweb links back to Gerrit. Sorry for the noise. | 17:37 | |
mnaser | so is it only something logged inside RabbitMQ? | 17:37 |
*** sean-k-mooney has quit IRC | 17:39 | |
*** pcaruana has joined #openstack-nova | 17:39 | |
imacdonn | trying to get the nova-api side logs .. I'll have to clean and pastebin them | 17:39 |
*** sean-k-mooney has joined #openstack-nova | 17:41 | |
mnaser | im seeing this on a lightly loaded stein cluster 'missed heartbeats from client, timeout: 60s' | 17:42 |
mnaser | gonna see what nova logs say | 17:42 |
imacdonn | is there a handy place I can upload 10k lines of log to ? | 17:43 |
mnaser | 2019-05-16 17:38:53.643 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer | 17:44 |
dansmith | mnaser: but sounds like it's recovering when a new call comes in such that it's not noticeable? | 17:45 |
imacdonn | mnaser: yes! that's one of symptoms | 17:45 |
dansmith | bnemec: do you know if there's a way to turn off the heartbeat thread for rabbit? | 17:45 |
mnaser | but is there a problem however? | 17:45 |
dansmith | bnemec: for o.msg's rabbit driver I mean | 17:45 |
mnaser | like n-api seems to continue to respond fine afterwards | 17:45 |
mnaser | 2019-05-16 17:38:53.644 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer | 17:46 |
bnemec | There used to be. It was considered experimental for a couple of cycles. Let me look. | 17:46 |
dansmith | mnaser: yeah, so that's the behavior we would expect | 17:46 |
mnaser | but at a 17:38:54 a GET /v2.1/servers/uuid worked fine | 17:46 |
dansmith | dunno why imacdonn isn't getting that proper reconnect behavior | 17:46 |
dansmith | turning off the heartbeat thread for the wsgi workers would eliminate the log noise I think | 17:46 |
*** udesale has quit IRC | 17:46 | |
mnaser | rabbitmq heartbeats are super unreliable in my experience | 17:47 |
bnemec | dansmith: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L145 | 17:47 |
mnaser | there used to be a nice article about configuring timeouts | 17:47 |
efried | aspiers: not sure I ever looked at that. | 17:47 |
imacdonn | yeah, perhaps I need to look closer at that ... but still, I think that heartbeats are there for a reason, and letting connections rot and "fail by design" is kinda icky | 17:47 |
mnaser | https://medium.com/@george.shuklin/rabbit-heartbeat-timeouts-in-openstack-fa5875e0309a | 17:47 |
mnaser | dunno how relevant it is but | 17:48 |
mnaser | a good read none the less | 17:48 |
dansmith | imacdonn: yeah, I just don't think I agree :) | 17:48 |
imacdonn | if nova-api isn't going to use persistent AMQP connections, it should gracefully close the connection before it goes to sleep, IMO | 17:48 |
mnaser | my question is: is there a problem? are your requests forever 500-ing after? | 17:49 |
dansmith | nova-api doesn't go to sleep | 17:49 |
melwitt | what's the behavior when it doesn't properly reconnect? nova-api responds with a failure? | 17:49 |
mnaser | ^^ | 17:49 |
imacdonn | yeah, I get some sort of 50x error | 17:49 |
dansmith | presumably because something is getting stuck and not letting the connections close off, which is why I want to know why imacdonn is different from the rest of the world | 17:50 |
mnaser | if you get a 50x that means you have a traceback, right? | 17:50 |
imacdonn | dansmith: hey now .. there was at least one other person experiencing the failures | 17:51 |
mnaser | should you have a traceback if you're getting a 500? | 17:51 |
dansmith | imacdonn: I meant your deployment, not you as a person :) | 17:51 |
imacdonn | well, yeah, but the other person was using a different deployment, and mod_wsgi | 17:51 |
* mnaser still asks if there is a traceback | 17:52 | |
dansmith | imacdonn: ^ | 17:52 |
efried | mriedem: Were you going to put up a null nova change depending on https://review.opendev.org/659611 to make sure the ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa job passes? | 17:53 |
melwitt | I had thought someone else faced this behavior too, but looking at the bug report, they say it reconnects https://bugs.launchpad.net/tripleo/+bug/1827744 | 17:53 |
openstack | Launchpad bug 1827744 in tripleo "Running eventlet.monkey_patch in nova_api breaks the AMQP heartbeat thread" [High,In progress] - Assigned to Damien Ciabrini (dciabrin) | 17:53 |
efried | or does ironic run that job too? | 17:53 |
imacdonn | I'll have to rebuild my "real" environment with uWSGI and reproduce again, if we need more context around the failure to reconnect ... I'd already started down the path of reverting to running nova-api standalone, since there didn't seem to be a path forward | 17:53 |
imacdonn | the "other person" is here: https://bugs.launchpad.net/nova/+bug/1825584/comments/7 | 17:54 |
openstack | Launchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,New] | 17:54 |
dansmith | imacdonn: can you also try disabling rabbit heartbeats on a wsgi api node? just to see if that eliminates the log noise | 17:54 |
dansmith | imacdonn: even if on the devstack | 17:54 |
melwitt | imacdonn: that's the bug I linked, I think, and the proposed patch is to stop monkey patching nova-api | 17:54 |
imacdonn | what do you mean by "log noise"? or are you just asking me to test if it's possible to disable heartbeats at all ? | 17:54 |
melwitt | but in that bug report, he says that rabbit reconnects | 17:55 |
imacdonn | that's my bug report, but he commented on it | 17:55 |
melwitt | no, the bug I linked you | 17:55 |
dansmith | imacdonn: I mean the log noise where nova-api complains that the connection was reset before reconnecting (in mnaser's case) | 17:55 |
melwitt | imacdonn: https://bugs.launchpad.net/tripleo/+bug/1827744 | 17:55 |
openstack | Launchpad bug 1827744 in tripleo "Running eventlet.monkey_patch in nova_api breaks the AMQP heartbeat thread" [High,In progress] - Assigned to Damien Ciabrini (dciabrin) | 17:55 |
dansmith | i.e. this: [10:46:02] <mnaser>2019-05-16 17:38:53.644 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer | 17:55 |
sean-k-mooney | actually in the api we don't really care about heartbeats to rabbitmq, right | 17:56 |
dansmith | sean-k-mooney: I've been shouting that for hours, yes:) | 17:57 |
dansmith | there's really no reason to maintain a persistent connection to rabbit from an idle api worker | 17:57 |
sean-k-mooney | actually i was going to say we never actually listen for anything from the api, but i guess we do if we make a blocking call | 17:57 |
imacdonn | melwitt: ah, I see .. OK ... yeah, maybe it was non-fatal in his case | 17:57 |
mnaser | I think the connection reset by peer is the real issue, I think the warning is log noise | 17:57 |
dansmith | sean-k-mooney: only in the reply case, which isn't a problem here | 17:57 |
sean-k-mooney | ya | 17:57 |
dansmith | mnaser: the connection reset by peer is because we're not running the heartbeat thread, and thus is expected | 17:58 |
sean-k-mooney | so even if the connection drops due to inactivity, in the normal case we just fire and forget | 17:58 |
dansmith | mnaser: it's the "this is dead let me reconnect immediately" signal, which seems to be working fine for you | 17:58 |
dansmith | sean-k-mooney: right | 17:58 |
mnaser | I'm in china and it's 2am but for the last time I'll say it | 17:58 |
mnaser | if you're getting 500s, it probably threw a traceback, and that'll probably contain something useful | 17:59 |
mnaser | (which will be the real issue) | 17:59 |
dansmith | mnaser: imacdonn said he needs to rebuild his env to get that traceback | 17:59 |
imacdonn | ^ that | 17:59 |
mnaser | aaaha | 17:59 |
mnaser | okay, well in that case, good luck | 17:59 |
mnaser | fwiw let me share the OSA config | 17:59 |
efried | mriedem: I guess it does run it. It failed. And looking closer, the 'contains' match was working properly. The problem is further up: the inventory is never getting set. | 17:59 |
dansmith | the other thing I'm interested in is, if you just don't patch, and everything seems to work, is that just because eventlet is being synchronous, or because eventlet primitives are spawning real threads? | 18:00 |
mnaser | https://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/templates/nova-uwsgi.ini.j2 | 18:01 |
mnaser | this is our uwsgi config | 18:01 |
* mnaser & | 18:01 | |
melwitt | I'm also wondering about that | 18:01 |
dansmith | I think that if disabling the heartbeat thread removes the log noise and we reconnect as we would expect for all cases except the unexplained ones, that'd be my preference as a path forward | 18:02 |
*** nicolasbock has quit IRC | 18:02 | |
dansmith | because changing our threading model for some of nova, and maintaining a thread off each wsgi worker just to hold a useless connection to rabbit does not seem like an improvement to me | 18:02 |
mriedem | efried: ack ok just getting back, haven't looked | 18:03 |
efried | mriedem: Left comment on https://review.opendev.org/#/c/659611/ | 18:04 |
imacdonn | dansmith: not sure I like the idea of turning off heartbeat .. that could have side-effects ... like leaving stale connections on the rabbitmq-server side for ever (??) | 18:04 |
dansmith | imacdonn: show us that it causes negative effects | 18:04 |
imacdonn | not sure that I can come up with every possible scenario off the top of my head | 18:05 |
dansmith | the other things we could do are: | 18:05 |
dansmith | 1. establish and drop connections before/after request | 18:06 |
dansmith | 2. leave heartbeats in place, as they are today so that rabbit will close the stale connections, which seems to be working for most people, | 18:06 |
efried | mriedem: n-cpu thinks the inventory is being set: http://logs.openstack.org/11/659611/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/9c644a2/controller/logs/screen-n-cpu.txt.gz#_May_16_17_17_55_788562 | 18:06 |
sean-k-mooney | if i do | 18:06 |
sean-k-mooney | >>> def func(): | 18:06 |
dansmith | 3. change our threading model for part of nova to run a heartbeat thread | 18:06 |
sean-k-mooney | ... print("test") | 18:06 |
sean-k-mooney | ... | 18:06 |
sean-k-mooney | >>> eventlet.greenthread.spawn(func) | 18:06 |
sean-k-mooney | it never prints anything | 18:06 |
dansmith | 4. split nova-api into the wsgi request processing side and the persistent bit | 18:06 |
sean-k-mooney | so if you don't monkeypatch it looks like spawn does nothing but queue the function to run | 18:07 |
sean-k-mooney | and never run it | 18:07 |
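For readers following along, a minimal runnable restatement of the experiment above (assuming eventlet is installed and nothing has been monkey patched):

    import eventlet

    def func():
        print("test")

    gt = eventlet.greenthread.spawn(func)
    # nothing is printed yet: spawn() only schedules the greenthread, and with
    # no monkey patching there is no yield point that lets the eventlet hub run it
    gt.wait()  # switching to the hub here finally runs func() and prints "test"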
dansmith | sean-k-mooney: that makes me wonder how disabling monkeypatching does anything | 18:07 |
sean-k-mooney | dansmith: as long as you don't call the scatter gather code it should appear to work | 18:07 |
dansmith | sean-k-mooney: right, but we do that for instance list, and they say that disabling monkeypatching just makes everything work normally, right? | 18:08 |
sean-k-mooney | that is the only part of the api that actually uses eventlet | 18:08 |
efried | mriedem: oh, reserved is 1 and never gets set to 0 | 18:08 |
mriedem | efried: hmm, i wonder if it thinks the node isn't available | 18:08 |
sean-k-mooney | dansmith: maybe it's different on python 2 | 18:08 |
mriedem | b/c i think that's when the driver sets reserved=1 | 18:08 |
sean-k-mooney | i just checked with py36 | 18:09 |
efried | yup | 18:09 |
mriedem | Node 7c5e9503-1e20-42bf-9ae7-d14cef0f38a9 is not ready for a deployment, reporting resources as reserved for it. Node's provision state is cleaning, power state is power off and maintenance is False. | 18:09 |
efried | just so | 18:09 |
mriedem | yeah so the node is dead | 18:10 |
mriedem | http://logs.openstack.org/11/659611/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/9c644a2/controller/logs/screen-ir-cond.txt.gz#_May_16_17_28_09_311564 | 18:10 |
imacdonn | sean-k-mooney dansmith: side-note ... my "real" deployment is py2, devstack is py3 ... may be a factor in the failure to reconnect | 18:10 |
* efried edits story text | 18:13 | |
mriedem | yeah i just abandoned the change | 18:13 |
sean-k-mooney | dansmith: if you save the greenthread object and then call wait() it runs all the queued greenthreads | 18:15 |
dansmith | sean-k-mooney: right that's the main loop part we will never do in wsgi mode | 18:16 |
sean-k-mooney | so if i do gt = eventlet.greenthread.spawn(func) | 18:16 |
sean-k-mooney | it prints after i call gt.wait() | 18:16 |
sean-k-mooney | don't we wait in the scatter gather however | 18:16 |
dansmith | oh, sorry your point is maybe that's why it's working if they disable? | 18:17 |
dansmith | yeah, so we do that in sequence, which means it becomes serialized | 18:17 |
dansmith | so they just hit cell0, then cell1, etc | 18:17 |
dansmith | that's the behavior I had been assuming for non-patched, | 18:18 |
dansmith | I just wasn't sure if in reality the lower layers would actually use threading.Thread() unpatched for the work | 18:19 |
sean-k-mooney | well i think when we call wait it is entering a cooperative green thread mode so we still get concurrency | 18:19 |
dansmith | which hmm, I guess could still be | 18:19 |
sean-k-mooney | until that greenthread finishes at least | 18:19 |
sean-k-mooney | and since we are doing it in a for loop it will complete the scatter gather, as we will wait on all the greenthreads we queued | 18:19 |
sean-k-mooney | so we should not need to monkey patch | 18:20 |
*** ralonsoh has quit IRC | 18:20 | |
dansmith | well, it depends on whether it's spawning a real thread and we're just waiting on it's future, or if we pass through and just exec the thread function during the wait | 18:20 |
sean-k-mooney | let me try a slightly less dumb test function that does a green thread sleep and increments a counter, and see if it's serialised or random | 18:21 |
dansmith | sean-k-mooney: >>> def func2(): | 18:21 |
dansmith | ... print(len(threading.enumerate())) | 18:21 |
dansmith | ... | 18:21 |
dansmith | >>> t = eventlet.greenthread.spawn(func2) | 18:21 |
dansmith | >>> t.wait() | 18:21 |
dansmith | test | 18:21 |
dansmith | 1 | 18:21 |
dansmith | sean-k-mooney: only one thread running from inside the thread function | 18:21 |
dansmith | so it's sequential | 18:21 |
dansmith | the test came from another thread I spawned first (like your example) which hadn't been waited on | 18:22 |
dansmith | normally enumerate() returns one for the main thread all the time, | 18:22 |
dansmith | so 1 means it didn't spawn something else for func2() | 18:23 |
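A runnable version of that check (same assumptions as above, eventlet present and unpatched); threading.enumerate() reporting a single OS thread from inside the greenthread is the point being made here, namely that greenthreads are not real threads:

    import threading
    import eventlet

    def func2():
        # unpatched threading module, so this counts real OS threads
        print(len(threading.enumerate()))

    t = eventlet.greenthread.spawn(func2)
    t.wait()  # prints 1: only MainThread exists, no extra thread was spawned for func2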
*** ccamacho has joined #openstack-nova | 18:23 | |
dansmith | so one thing we could do in the interim is, | 18:24 |
dansmith | provide a knob to let them avoid the monkeypatch without having to actually patch the code, if they prefer synchronous behavior | 18:24 |
dansmith | that isn't too terrible for people with just the default two cells, and the rest of the people who don't suffer from this can continue to do what we do today | 18:25 |
melwitt | I think that's already available in the form of an env variable. which is what this tripleo patch is proposing to turn off monkey patching https://review.opendev.org/#/c/657168/1/deployment/nova/nova-api-container-puppet.yaml | 18:26 |
dansmith | melwitt: ah even better :) | 18:26 |
sean-k-mooney | dansmith: it is concurrent however | 18:27 |
sean-k-mooney | if i do | 18:27 |
sean-k-mooney | >>> def func(): | 18:27 |
sean-k-mooney | ... eventlet.greenthread.sleep(10) | 18:27 |
sean-k-mooney | ... global i | 18:27 |
sean-k-mooney | ... i+=1 | 18:27 |
sean-k-mooney | ... print(i) | 18:27 |
sean-k-mooney | ... | 18:27 |
sean-k-mooney | >>> i = 0 | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt.wait() | 18:27 |
sean-k-mooney | it does not take 40 seconds to print | 18:27 |
dansmith | sean-k-mooney: because you're using eventlet's sleep right? | 18:27 |
sean-k-mooney | yes | 18:27 |
imacdonn | mnaser (if still awake?), dansmith, sean-k-mooney: log from the failure-to-reconnect case: http://paste.openstack.org/show/751486/ | 18:28 |
dansmith | that doesn't mean it's concurrent, that means it's warping time for you :) | 18:28 |
mnaser | Looks like conductor timing out | 18:31 |
mnaser | I’ve seen really weird behaviour that usually gets resolved by deleting every single queue in the vhost | 18:32 |
sean-k-mooney | dansmith: well in any case i think we now know why it's working if you don't monkeypatch | 18:33 |
dansmith | yup, as expected | 18:33 |
sean-k-mooney | whether it's concurrent or fully synchronous i don't know | 18:33 |
dansmith | sean-k-mooney: run this and tell me what you think :) https://pastebin.com/XHV9t9cM | 18:35 |
imacdonn | mnaser: this happens when the monkey patching is used in combination with a wsgi container ... and doesn't happen otherwise | 18:36 |
dansmith | sean-k-mooney: one single wait, but it clearly runs them in order, and blocking calls block everything | 18:36 |
melwitt | imacdonn: not sure what else to try other than a python2 devstack test maybe, to try and pinpoint what is the difference in your env vs the others. mnaser, are you running stein under py3 or py2? | 18:36 |
imacdonn | melwitt: yeah, thought about py2 devstack too | 18:37 |
sean-k-mooney | yes it took 53 seconds | 18:40 |
sean-k-mooney | but https://pastebin.com/QN5L7yDY took 5 | 18:41 |
sean-k-mooney | time.sleep will suspend the thread | 18:41 |
sean-k-mooney | eventlet.greenthread.sleep should yield to the next coroutine | 18:41 |
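A rough timing sketch of that difference (assuming eventlet, still unpatched): eventlet.sleep() yields to the hub so the greenthreads interleave, while time.sleep() blocks the whole process and the greenthreads run one after another:

    import time
    import eventlet

    def green_sleeper():
        eventlet.sleep(1)  # yields to the hub, other greenthreads get to run

    def blocking_sleeper():
        time.sleep(1)      # unpatched: blocks the process, nothing else runs

    for fn in (green_sleeper, blocking_sleeper):
        start = time.time()
        threads = [eventlet.spawn(fn) for _ in range(4)]
        for t in threads:
            t.wait()
        # roughly 1s for green_sleeper (interleaved), 4s for blocking_sleeper (serialized)
        print(fn.__name__, round(time.time() - start, 1))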
dansmith | sean-k-mooney: ah I see what you mean.. we will still get *interleaving* but not actual concurrency | 18:41 |
dansmith | sean-k-mooney: meaning we're not running multiple concurrent threads, but we will interleave their execution when they do stuff that calls into eventlet | 18:42 |
sean-k-mooney | we will get concurrency but not parallelism | 18:42 |
sean-k-mooney | yes | 18:42 |
dansmith | sean-k-mooney: the thing I think you're missing is that if we haven't monkeypatched, then all our IO is blocking and so the first thing will be like time.sleep() and not like eventlet.sleep() | 18:42 |
dansmith | first thing meaning first bit of IO | 18:43 |
sean-k-mooney | ya i was wondering if we could import eventlet.green or something | 18:44 |
sean-k-mooney | https://eventlet.net/doc/patching.html#import-green | 18:44 |
sean-k-mooney | or use a context manager to patch, or just the socket lib | 18:44 |
dansmith | I think that gets us further into the mixing paradigms state and confusion abounds, IMHO | 18:45 |
sean-k-mooney | ya perhaps | 18:45 |
sean-k-mooney | so we originally merged the new scatter gather stuff that used eventlet without monkey patching | 18:45 |
sean-k-mooney | then we noticed we were not monkeypatching and mdbooth fixed that | 18:46 |
dansmith | *for api running under wsgi | 18:46 |
sean-k-mooney | and we started to see these uwsgi errors | 18:46 |
sean-k-mooney | ya | 18:46 |
sean-k-mooney | so setting the env var would allow people to go back to the unmonkeypatched version under wsgi with no code change, but it's no longer parallel | 18:48 |
dansmith | exactly | 18:48 |
sean-k-mooney | although in that case uwsgi/apache would still provide its own concurrency | 18:48 |
dansmith | sean-k-mooney: between requests, but not between cells | 18:49 |
dansmith | cells would still be sequential | 18:49 |
sean-k-mooney | yes | 18:49 |
sean-k-mooney | i'm just thinking about what people can do as a workaround in the short term | 18:49 |
sean-k-mooney | e.g. what can people do that hit this in production | 18:50 |
dansmith | people being the 1.9 people we know that are having this problem right? :) | 18:50 |
*** trident has quit IRC | 18:51 | |
sean-k-mooney | well i think the only reason that we are not seeing this in osp is because the kolla containers use the console-script in the container instead of running under wsgi | 18:51 |
sean-k-mooney | so i think that is why we are not hitting it downstream | 18:51 |
sean-k-mooney | but i don't know what other deployment tools default to | 18:51 |
dansmith | well, we got the vexxhost data that they're reconnecting properly | 18:52 |
sean-k-mooney | with osa | 18:52 |
dansmith | so they're clearly seeing the same threading behavior, but not suffering | 18:52 |
dansmith | i.e. they're getting the best of both worlds | 18:52 |
*** trident has joined #openstack-nova | 18:52 | |
sean-k-mooney | perhaps it's because they are installing the nova api under uwsgi and not mod_wsgi | 18:59 |
sean-k-mooney | imacdonn: what are you running the nova-api with? | 18:59 |
imacdonn | sean-k-mooney: uWSGI ... I've also reproduced the heartbeats not happening with mod_wsgi, but I can't say if I've seen the failure to "establish a new connection and move on" in that case | 19:00 |
sean-k-mooney | we know that uwsgi and mod_wsgi handle the lifetime of their python interpreters differently, so module level variables are reinitialised in uwsgi but are not in mod_wsgi, because mod_wsgi reuses the interpreter when the app is reloaded and uwsgi does not | 19:01 |
sean-k-mooney | imacdonn: if the wsgi app dies and is restarted by uwsgi it will return to a clean state | 19:02 |
sean-k-mooney | in the mod_wsgi case it will reuse the interpreter and maybe that impacts the reconnection | 19:03 |
sean-k-mooney | that is a stretch however | 19:03 |
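For illustration only, a tiny WSGI app (hypothetical, not nova code) showing the kind of module-level state whose lifetime this depends on; whether REQUEST_COUNT survives an app reload is decided by how uwsgi or mod_wsgi manages the interpreter, not by the application itself:

    # module-level state: created when the module is first imported into an
    # interpreter and reused for every request that interpreter serves
    REQUEST_COUNT = 0

    def application(environ, start_response):
        global REQUEST_COUNT
        REQUEST_COUNT += 1
        body = ('%d requests seen by this interpreter\n' % REQUEST_COUNT).encode()
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [body]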
sean-k-mooney | we are seeing it not reconnect on mod_wsgi correct? | 19:03 |
imacdonn | I don't see any evidence of the app dying and being restarted | 19:04 |
melwitt | sean-k-mooney: we're not sure what the common denominator is for not reconnecting and 500ing, I think | 19:04 |
melwitt | I have a python2 devstack, going to check whether it reconnects | 19:05 |
sean-k-mooney | if it was, you would see nova print its config again and all the other start up stuff, so if you are not seeing that it's probably not happening | 19:05 |
sean-k-mooney | ok that is testing with uwsgi then | 19:05 |
imacdonn | yeah .. the 504 error seems to be caused by an RPC timeout following the reconnect ... don't yet know why I'm seeing that where others are not | 19:05 |
sean-k-mooney | do we have any periodic task that could be keeping it alive? | 19:06 |
imacdonn | I can't say I've reproduced the RPC timeout with mod_wsgi - was focused on the absence of heartbeats before | 19:06 |
dansmith | sean-k-mooney: if you read the backscroll, it has been reprod with both | 19:07 |
imacdonn | I'm not able to reproduce the RPC timeout with py2 devstack either | 19:07 |
melwitt | how do you get the connection reset by peer message to show up in nova-api log? just waiting doesn't seem to do it | 19:08 |
dansmith | request after a delay I think | 19:08 |
sean-k-mooney | i thought it was 2 different errors, one on each, but ok, in that case i'll stop looking to see what the difference is between how they both run | 19:08 |
imacdonn | I have an instance already running.. I make a call to get the VNC console URL, wait about 5 mins, then repeat the call to get the console URL | 19:08 |
sean-k-mooney | that should not need to hit rabbit | 19:09 |
sean-k-mooney | that should just hit the db directly no? | 19:09 |
dansmith | sean-k-mooney: eh | 19:09 |
sean-k-mooney | to get the url | 19:10 |
dansmith | no, it calls to the driver | 19:10 |
sean-k-mooney | to get the console it would have to do an rpc | 19:10 |
sean-k-mooney | oh ok, i assumed we would have stored the console url in the db | 19:10 |
*** imacdonn has quit IRC | 19:12 | |
melwitt | it is stored there afterward (base url and token hash, etc) but the driver hands out the connection details initially | 19:13 |
*** imacdonn has joined #openstack-nova | 19:14 | |
melwitt | ok yeah I see now the "Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer" happens on the second request after waiting some minutes. but the request works fine | 19:14 |
melwitt | (with py2 and uwsgi) | 19:15 |
sean-k-mooney | ok but this is what get console is actually doing right https://github.com/openstack/nova/blob/master/nova/console/api.py#L27-L29 | 19:15 |
sean-k-mooney | or rather openstack console url show | 19:17 |
melwitt | I think that's probably the xen-specific service nova-console stephenfin is looking to deprecate | 19:17 |
melwitt | this is get_vnc_console https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3876 | 19:17 |
sean-k-mooney | that is what's called from here https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/consoles.py#L48 | 19:17 |
sean-k-mooney | oh ok, so that one does a downcall to the compute node | 19:18 |
melwitt | I think that's the xen thing that no one except old rax cluster uses | 19:18 |
dansmith | this is one reason I suggested using a different call | 19:19 |
dansmith | because console has all manner of stuff wrapped up in it | 19:19 |
dansmith | I think pause/unpause is a call right? use that | 19:19 |
dansmith | or just boot/delete | 19:19 |
imacdonn | looking at my py2 devstack, it seems the connection reset is observed at a different place | 19:19 |
sean-k-mooney | ya or the server diagnostics | 19:19 |
dansmith | well, boot isn't a call from the api I guess, so pause | 19:19 |
dansmith | sean-k-mooney: or that | 19:19 |
sean-k-mooney | ... of course they are only in the nova client... | 19:20 |
melwitt | imacdonn: in my py2 devstack, I see the connection reset in nova-api log (for get_vnc_console) | 19:20 |
sean-k-mooney | nova diagnostics <server> is definitely a down call to the compute, but nova get-vnc-console should be too | 19:22 |
*** ttsiouts has joined #openstack-nova | 19:22 | |
*** nicolasbock has joined #openstack-nova | 19:22 | |
*** imacdonn has quit IRC | 19:25 | |
*** jmlowe has quit IRC | 19:29 | |
*** xek__ has quit IRC | 19:32 | |
*** xek__ has joined #openstack-nova | 19:32 | |
*** imacdonn has joined #openstack-nova | 19:42 | |
sean-k-mooney | efried: just finished reviewing https://review.opendev.org/#/q/topic:bug/1809095+(status:open+OR+status:merged) | 19:43 |
sean-k-mooney | efried: i'm provisionally +1 on both; i have one question on the second patch but it's an edge case | 19:43 |
*** jmlowe has joined #openstack-nova | 19:44 | |
*** imacdonn_ has joined #openstack-nova | 19:46 | |
*** maciejjozefczyk has quit IRC | 19:49 | |
*** imacdonn has quit IRC | 19:49 | |
*** pcaruana has quit IRC | 19:52 | |
*** ttsiouts has quit IRC | 19:54 | |
imacdonn_ | sean-k-mooney: vnc console does rpc .. you can see it in the trace at http://paste.openstack.org/show/751486/ | 19:55 |
imacdonn_ | (the trace for the RPC timeout) | 19:55 |
sean-k-mooney | those calls go to the compute node? ya | 19:56 |
imacdonn_ | yes | 19:56 |
sean-k-mooney | but it was not clear what api endpoint osc was calling for openstack console url show | 19:56 |
sean-k-mooney | if you are using the nova client's get-vnc-console that was definitely downcalling | 19:57 |
sean-k-mooney | osc is doing some magic to figure out what console type the vm is using | 19:57 |
sean-k-mooney | as it has a single command that works for all console types | 19:57 |
sean-k-mooney | so i was not sure if the osc command would actually hit the db or the compute node | 19:58 |
imacdonn_ | yeah, def does ... that stack trace resulted from 'openstack console url show ...' | 19:59 |
sean-k-mooney | it's kind of strange | 20:00 |
sean-k-mooney | we can see the heartbeat_tick printed | 20:00 |
imacdonn_ | yeah, but only once | 20:01 |
imacdonn_ | it should be periodic from that point on | 20:01 |
sean-k-mooney | no, it's there several times | 20:01 |
sean-k-mooney | look at lines 140 to 153 | 20:01 |
imacdonn_ | for the same thread? | 20:01 |
imacdonn_ | in any case, in the non-patched case, it's continuous .. in the patched case, it goes quiet | 20:01 |
*** ccamacho has quit IRC | 20:02 | |
imacdonn_ | those are all within the same second | 20:03 |
imacdonn_ | I *may* have just found what's different about my env vs. others and devstack | 20:04 |
*** whoami-rajat has quit IRC | 20:05 | |
*** imacdonn_ is now known as imacdonn | 20:07 | |
sean-k-mooney | have you adjusted any of the messaging timeouts? | 20:07 |
imacdonn | no, but in my uwsgi config, had "threads = 8" ... I've removed that now, and it seems that the reconnect is working | 20:09 |
sean-k-mooney | this is the osa one https://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/templates/nova-uwsgi.ini.j2 | 20:10 |
imacdonn | I'm thinking (loosely) that maybe in the threads=X case, the thread making the RPC call has fired before the thread handling heartbeats has noticed the connection failure and reestablished | 20:10 |
sean-k-mooney | which defaults to 1 | 20:10 |
imacdonn | yeah. devstack doesn't set it at all | 20:11 |
imacdonn | I was thinking to figure out what uwsgi does by default .. guessing 1 | 20:11 |
imacdonn | I was *trying.... | 20:11 |
sean-k-mooney | i'm looking for that too now :) | 20:11 |
efried | sean-k-mooney: thanks for reviewing. | 20:20 |
sean-k-mooney | imacdonn: finally https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/init.c#L96 it defaults to 1 | 20:21 |
sean-k-mooney | imacdonn: the default is not in the docs | 20:22 |
imacdonn | hah, thanks | 20:23 |
sean-k-mooney | for c code uwsgi is actually not too hard to read, but finding stuff is a pain | 20:23 |
sean-k-mooney | imacdonn: so i think we know why vexxhost are fine | 20:24 |
sean-k-mooney | osa explicitly sets it to 1 | 20:24 |
imacdonn | yeah, I think so ... the reconnect seems to be consistently working since I've changed that | 20:25 |
imacdonn | so now I wonder if I'm giving anything up by doing that, in terms of performance .... haven't really found any good guidance on processes/threads sizing for these things | 20:26 |
sean-k-mooney | well osa is running with 64 processes and 1 thread per process i think | 20:26 |
sean-k-mooney | sorry a max of 16 processes | 20:27 |
sean-k-mooney | each with 1 thread | 20:27 |
dansmith | is this threads per worker process? | 20:29 |
dansmith | if so, it could be that multiple threads per worker try to share a single connection pool, | 20:29 |
sean-k-mooney | yes i believe so | 20:29 |
dansmith | but maybe one of them gets responsibility for the eventlet heartbeat thread, | 20:29 |
dansmith | so if that thread isn't the first scheduled after a period of inactivity, then heartbeat isn't run, we try to use the connection and are stuck? | 20:30 |
dansmith | something like that? | 20:30 |
imacdonn | yes, "something like that" is what it feels like | 20:30 |
sean-k-mooney | well i'm not sure, if we have processes=2 threads=2, whether that will result in 2 or 4 python interpreters | 20:30 |
sean-k-mooney | e.g. are the threads os threads or python threads | 20:31 |
melwitt | so heartbeat threads is the one that reconnects things if detected to not be open | 20:31 |
dansmith | I think that would mean two interpreters, two threads each | 20:31 |
melwitt | *thread | 20:31 |
dansmith | melwitt: no, | 20:31 |
sean-k-mooney | right so the gil would mean that only one of the two threads would actually be running in each python interpreter | 20:31 |
dansmith | I think any of them would do it, but running or not running the heartbeat but definitely running the other threads is where I can see it getting wonky | 20:32 |
*** tbachman has quit IRC | 20:32 | |
sean-k-mooney | so maybe it is starving the heartbeat thread as you suggest | 20:32 |
melwitt | I guess I'm confused. so far, I thought the heart beat thread doesn't affect anything other than to log a message saying "connection reset" if it's been idle > 60s. how does its starvation prevent reconnection? | 20:34 |
dansmith | no, I think the heartbeat thread is just the first thing scheduled after a period of inactivity, so it's the first thing that hits the dead connection | 20:36 |
imacdonn | I don't think reconnection is being prevented .. you can see it happen in the log at http://paste.openstack.org/show/751486/ ... but I think it might matter *when* it happens | 20:36 |
imacdonn | (line 227 in the log shows reconnected) | 20:37 |
melwitt | oh yeah, I see. so why MessagingTimeout? sorry, I just don't get this | 20:38 |
imacdonn | maybe the RPC request got sent to a dead connection, somehow?? | 20:39 |
*** dave-mccowan has joined #openstack-nova | 20:39 | |
sean-k-mooney | the "best" respocne in https://www.reddit.com/r/Python/comments/4s40ge/understanding_uwsgi_threads_processes_and_gil/ seam to gel with how i tought this would be working | 20:39 |
sean-k-mooney | if we are initalising the connection pool as a module vaiable we are sharing it between the threads and all the therads are being run sequtually under the gil | 20:40 |
sean-k-mooney | so with threads=8 you had 8 instance of the api all sharing the same connection pool | 20:40 |
sean-k-mooney | there is a follow futher down where there is a code wakthorugh which explaing the init flow | 20:43 |
*** mchlumsky has quit IRC | 20:47 | |
artom | efried, you added me to that review but I think you'll get through them before I get a chance to even start | 20:50 |
artom | We're all putting out CI fires with our upcoming release | 20:51 |
artom | (Maybe I shouldn't say that in public?) | 20:51 |
efried | artom: no worries, I noticed you were on the one above it, which is the important one, but that one's getting a -1 from me in a few minutes here, so you got time :) | 20:51 |
artom | https://photos.app.goo.gl/fRQoJpE5WnqYeP6E8 was my desk not long ago. Yes, I needed both those drinks to get through the bugs | 20:52 |
artom | ... and I need to go pick kids up | 20:53 |
artom | o/ | 20:53 |
sean-k-mooney | dansmith: imacdonn so i think this is the issue https://github.com/openstack/nova/blob/760fc2de32057817bafc0a1512a735dab1c7c68a/nova/rpc.py#L69 | 20:54 |
sean-k-mooney | we initialize the transport as a module level variable, so that is shared across all wsgi app instances in the same process | 20:55 |
sean-k-mooney | that end up calling https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/transport.py#L192 | 20:55 |
dansmith | sean-k-mooney: well, that's the global thing that does the connectioning to rabbit, yeah | 20:55 |
sean-k-mooney | which calls the driver manager to construct the rabbit driver instance, which has a connection pool internally | 20:56 |
dansmith | yup | 20:56 |
sean-k-mooney | so the connection pool becomes global too | 20:56 |
dansmith | right | 20:56 |
*** tbachman has joined #openstack-nova | 20:57 | |
dansmith | sean-k-mooney: this is only part of the equation by the way | 20:57 |
sean-k-mooney | and since we have multiple python threads and the gil limits their execution, we could starve the heartbeat | 20:57 |
dansmith | sean-k-mooney: because I think that TRANSPORT is what we use for initializing a connection which we would use if we were going to start a server and listen | 20:57 |
*** artom has quit IRC | 20:57 | |
dansmith | sean-k-mooney: but connections to the *cell* rpc brokers happens in nova.context, because we connect per cell once we read that out | 20:57 |
dansmith | sean-k-mooney: it's effectively the same situation though, I imagine they both suffer from the same behavior | 20:58 |
sean-k-mooney | ripping out those globals would probably be a pain | 20:59 |
*** trident has quit IRC | 20:59 | |
dansmith | sean-k-mooney: https://github.com/openstack/nova/blob/760fc2de32057817bafc0a1512a735dab1c7c68a/nova/context.py#L376 | 20:59 |
dansmith | sean-k-mooney: having implemented multi-cell tests, I can assure you it is | 20:59 |
*** cmart has joined #openstack-nova | 21:00 | |
*** trident has joined #openstack-nova | 21:00 | |
dansmith | but also, it's the thing that avoids us having to build that stuff up a hundred times, even for a single request where we make a couple rpc calls | 21:00 |
sean-k-mooney | and yep we are storing the connection in a module level CELL_CACHE | 21:00 |
dansmith | right | 21:00 |
sean-k-mooney | so yep it's the same issue | 21:00 |
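A stripped-down sketch of the pattern being pointed at here (hypothetical names, not the actual nova code): a module-level transport plus a per-cell cache, created once and then shared by every request thread running in the same process:

    import threading

    _TRANSPORT = None   # stands in for the module-level transport global
    _CELL_CACHE = {}    # stands in for the per-cell connection cache
    _LOCK = threading.Lock()

    def get_transport(factory):
        """Create the process-wide transport on first use, then reuse it."""
        global _TRANSPORT
        with _LOCK:
            if _TRANSPORT is None:
                _TRANSPORT = factory()
            return _TRANSPORT

    def get_cell_connection(cell_uuid, factory):
        """Cache one connection per cell; every thread shares these objects."""
        with _LOCK:
            if cell_uuid not in _CELL_CACHE:
                _CELL_CACHE[cell_uuid] = factory()
            return _CELL_CACHE[cell_uuid]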
melwitt | so if the heartbeat thread gets starved, then connections get closed down, and then if reconnect in another thread races with an API request coming in and the API request hits before the reconnect, it will get the MessagingTimeout? I'm looking through the oslo.messaging code and from what I can tell, the reconnect is done on-the-fly when a message is attempted to be sent, so I don't understand how a different thread is doing the | 21:01 |
melwitt | reconnect | 21:01 |
*** ttsiouts has joined #openstack-nova | 21:01 | |
cmart | Hello -- can someone help me understand the process by which ephemeral disks get a filesystem label? It's usually `ephemeral0` but I have an instance created today which has `ephemeral0_160_40` | 21:02 |
melwitt | but if that were true that oslo.messaging automatically tries to reconnect, then the API request shouldn't fail, it would try to reconnect. so, nevermind, I don't know what I'm talking about | 21:02 |
sean-k-mooney | melwitt: well that's probably the problem, it likely is not auto reconnecting on making a request | 21:02 |
sean-k-mooney | we probably have a periodic task or something that is doing the reconnection | 21:03 |
sean-k-mooney | that is just a guess | 21:03 |
sean-k-mooney | i haven't looked at this that closely in the past | 21:03 |
melwitt | yeah. I just can't tell what triggers a reconnect. in https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py I see a bunch of stuff where "recoverable error" will try to reconnect | 21:03 |
sean-k-mooney | are we using the amqp driver or the rabbitmq one | 21:04 |
melwitt | sorry, wrong link https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py | 21:04 |
melwitt | that other link was me trying to find where/how it raises MessagingTimeout | 21:04 |
*** tbachman has quit IRC | 21:05 | |
sean-k-mooney | hum i wonder if setting https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L145-L149 to 0 would help? | 21:05 |
dansmith | sean-k-mooney: that was mentioned earlier | 21:06 |
sean-k-mooney | oh ok, i thought you were asking if there was a way to disable it earlier | 21:06 |
melwitt | that is, you think it might allow threads=8 config to reconnect? | 21:06 |
dansmith | I said it would be good to test with it, *but* I think the downside of it is that the rabbit server side won't terminate the connection because of lack of heartbeats which means more likelihood of just having stale connections that have been forgotten by the conntracks in between | 21:06 |
sean-k-mooney | i think it might not close the connection | 21:06 |
melwitt | oh ok | 21:07 |
sean-k-mooney | e.g. it would only close if the tcp socket closed | 21:07 |
melwitt | yeah, I see | 21:07 |
imacdonn | pfft, it's not like rabbitmq-server is ever going to run out of connection............... | 21:07 |
dansmith | problem is, stateful firewalls in between client and server may do that without you noticing which will require some delay before you realize the socket is dead | 21:07 |
sean-k-mooney | right, which is probably why the heartbeat is there in the first place | 21:08 |
melwitt | yeah | 21:08 |
imacdonn | back in the icehouse days, I had lots of problems trying to do AMQP through a firewall, due to lack of heartbeat support | 21:08 |
dansmith | right and why I think it's fine to let it be there, and if we wake up within a minute, we're good and if not, then we've got an immediate RST waiting for us so we know to reconnect | 21:08 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Restore connection_info after live migration rollback https://review.opendev.org/551349 | 21:09 |
sean-k-mooney | melwitt: this is where the reconnect happens i think | 21:12 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L897-L928 | 21:12 |
sean-k-mooney | we call self.ensure_connection() if there was an exception when we called self._heartbeat_check() | 21:12 |
sean-k-mooney | which is a thread we initialize in the driver, which is a global, so there is again only 1 per process | 21:13 |
sean-k-mooney | so with threads=8 there is a 1 in 9 chance that it will get scheduled to run | 21:14 |
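Roughly what that single per-process heartbeat worker amounts to, as a sketch built on kombu's public connection API rather than the oslo.messaging source: one loop that periodically checks the broker connection and re-establishes it when the check fails. If this one greenthread never gets scheduled, nothing notices the dead socket until a publish trips over it.

    import eventlet

    def heartbeat_loop(connection, interval=15):
        # one of these per process, shared by every API thread in that process
        while True:
            try:
                connection.heartbeat_check()    # raises if the broker connection died
            except Exception:
                connection.ensure_connection()  # reconnect and carry on
            eventlet.sleep(interval)

    # started once when the driver is created, e.g.:
    # eventlet.spawn(heartbeat_loop, some_kombu_connection)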
melwitt | so since it does the reconnecting, if it doesn't run first, other thread will hit a closed connection and get MessagingTimeout | 21:17 |
*** slaweq has quit IRC | 21:18 | |
sean-k-mooney | well the messaging timeout comes from the fact that we did a request and then stopped listening because the connection closed | 21:18 |
sean-k-mooney | and we did not reconnect before the timeout expired | 21:18 |
sean-k-mooney | so the send succeeded | 21:18 |
sean-k-mooney | then it closed and they never got the response | 21:18 |
melwitt | ok | 21:18 |
sean-k-mooney | which likely means when we reconnect it will be sitting on the queue | 21:19 |
sean-k-mooney | things like the instance list would likely be fine in that case | 21:21 |
sean-k-mooney | since the api is hitting all the cell dbs but not doing an rpc right | 21:21 |
sean-k-mooney | but things like get_vnc_console would fail | 21:21 |
efried | cmart: what does your [DEFAULT]virt_mkfs option look like? | 21:21 |
sean-k-mooney | things like boot would likely be fine too since that's fire and forget | 21:23 |
imacdonn | so now I'm wondering if the connection reestablishment is working "by accident" with threads=1, or if I can depend on it........ | 21:23 |
cmart | efried I haven't defined it (on Queens, Ubuntu 16.04) so probably the default | 21:23 |
sean-k-mooney | well with threads=1 there is a 50% chance the connection thread will run | 21:24 |
sean-k-mooney | since there is the main thread + the heartbeat thread | 21:24 |
melwitt | :\ | 21:26 |
sean-k-mooney | i did not say it would definitely work :P | 21:27 |
dansmith | sean-k-mooney: there's only one thread, | 21:27 |
sean-k-mooney | but its better then 1 in 9 | 21:27 |
imacdonn | so if I set threads=0, will it be 100% ? :) | 21:27 |
melwitt | lol | 21:27 |
sean-k-mooney | dansmith: the rabbitmq driver spawns a thread to run the heartbeat check | 21:28 |
dansmith | and I think that if we're monkeypatched, any io we do will always guarantee that we run the waiting greenthreads, so we'd consume the dead connection with the heartbeat right? | 21:28 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L884 | 21:28 |
dansmith | sean-k-mooney: green thread | 21:28 |
sean-k-mooney | ah you're right, if it's monkey patched then it's a greenthread | 21:28 |
dansmith | right, | 21:29 |
dansmith | so we'll always run that stuff on first io | 21:29 |
sean-k-mooney | i guess it depends on how many io operations were pending or had yielded | 21:29 |
dansmith | I think if it was 50% we'd have heard about it | 21:29 |
dansmith | howso? | 21:29 |
efried | cmart: It looks like that name is being assigned in this neighborhood: https://opendev.org/openstack/nova/src/branch/stable/queens/nova/virt/libvirt/driver.py#L3510 | 21:30 |
sean-k-mooney | well in the scatter gather case at least we are queueing a bunch of green threads to execute then waiting | 21:30 |
dansmith | the heartbeat gthread is always schedulable work on any io switch if we've not run it in t/2 or whatever it's period is | 21:30 |
dansmith | sean-k-mooney: right but on the first io that we do, to even start that, we'll run all the greenthreads pending, which is at least the heartbeat gthread, before we do much of anything else | 21:31 |
efried | cmart: that line or L3533 | 21:31 |
sean-k-mooney | ya we should | 21:32 |
*** hongbin has joined #openstack-nova | 21:32 | |
cmart | efried thank you for the orientation | 21:33 |
*** xek has joined #openstack-nova | 21:33 | |
efried | cmart: Good luck :) | 21:33 |
*** ttsiouts has quit IRC | 21:33 | |
*** xek__ has quit IRC | 21:35 | |
openstackgerrit | Merged openstack/nova master: Remove fake_libvirt_utils from connection tests. https://review.opendev.org/642557 | 21:38 |
imacdonn | I wonder if part of the issue is that the AMQP message is not being resent in the case where the connection reset is detected during publish, vs. during heartbeat | 21:39 |
melwitt | I wonder that too | 21:40 |
*** JamesBenson has quit IRC | 21:40 | |
dansmith | imacdonn: meaning if we hit it on publish, we reconnect, but didn't actually send, think we did, and wait for it until timeout? | 21:40 |
dansmith | if so that would be a giant o.msg bug | 21:40 |
imacdonn | yeah, smth like that | 21:41 |
melwitt | but I think we know it won't reconnect on publish right | 21:41 |
melwitt | otherwise the 500 wouldn't happen | 21:41 |
dansmith | no, | 21:41 |
dansmith | the 500 is from timeout right? | 21:41 |
dansmith | which means it's waiting for a reply | 21:42 |
melwitt | yeah | 21:42 |
imacdonn | 504 from timeout, yeah | 21:42 |
imacdonn | look at http://paste.openstack.org/show/751486/ - the reset follows the publish failure, not a heartbeat failure | 21:42 |
dansmith | which can happen if we effectively connect to rabbit, subscribe to a reply queue, but never sent the thing and thus time out waiting for a reply to nothing | 21:42 |
imacdonn | the reconnect, I mean | 21:42 |
sean-k-mooney | this is meant to handle the resend i think https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/impl_rabbit.py#L1088-L1101 | 21:43 |
*** JamesBenson has joined #openstack-nova | 21:43 | |
sean-k-mooney | for a direct send we do not pass retry | 21:43 |
imacdonn | I don't see any "Failed to publish message to topic" logged | 21:44 |
melwitt | yeah that's what confuses me, if publish does a reconnect, why are we left hanging for a timeout? unless we never send a thing like you both mentioned | 21:44 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/impl_rabbit.py#L1204-L1213 | 21:44 |
*** JamesBen_ has joined #openstack-nova | 21:44 | |
dansmith | melwitt: it would be if the act of sending triggered the "huh, this is dead, lemme reconnect" but was not followed by a "okay, now that I have reconnected, let me send that thing I had" | 21:45 |
melwitt | aye | 21:45 |
dansmith | because socket.write() is where you find out it's dead | 21:45 |
dansmith | which is how heartbeat is finding out | 21:45 |
dansmith | it gets scheduled, and does a "shit I'm late.. socket.write(heartbeat mofo!)" | 21:45 |
melwitt | right | 21:46 |
*** JamesBenson has quit IRC | 21:47 | |
*** xek_ has joined #openstack-nova | 21:47 | |
*** xek has quit IRC | 21:47 | |
*** JamesBen_ has quit IRC | 21:49 | |
melwitt | sean-k-mooney: yeah, I don't know the difference between direct_send and topic_send. though topic_send sounds more like what's probably being used | 21:51 |
melwitt | (based on the fact that our message queues all have topics, I would think that means those are topic_send, but I don't actually know) | 21:52 |
*** mriedem has quit IRC | 21:53 | |
sean-k-mooney | so we are invoking the get_vnc_console as a "call" | 21:53 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/rpc/client.py#L157-L182 | 21:54 |
sean-k-mooney | which i'm still tracing to see if it's a topic send or a direct send | 21:54 |
sean-k-mooney | that is where we do the actual call on the nova side https://github.com/openstack/nova/blob/eae1f2257a9ee6e851182bf949568b1cfe2af763/nova/compute/rpcapi.py#L635-L640 | 21:55 |
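For context, a hedged sketch (not nova's code, and the method names are made up) of the oslo.messaging client pattern in play here, assuming cfg.CONF already carries a valid transport_url: call() blocks on a reply queue and raises MessagingTimeout after rpc_response_timeout (60s by default), while cast() is fire-and-forget:

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='compute', version='5.0')
    client = oslo_messaging.RPCClient(transport, target)

    cctxt = client.prepare(server='some-compute-host', timeout=60)
    try:
        # waits for the compute service to reply; this is where the 504s come from
        cctxt.call({}, 'some_method_that_expects_a_reply')
    except oslo_messaging.MessagingTimeout:
        print('no reply within 60s')

    # a cast never waits, so a message lost to a dead connection goes unnoticed
    cctxt.cast({}, 'some_fire_and_forget_method')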
*** xek has joined #openstack-nova | 21:56 | |
sean-k-mooney | im getting flashbacks to the openstack bootstrap podcast looking at this code | 21:56 |
*** xek_ has quit IRC | 21:56 | |
*** xek has quit IRC | 21:58 | |
*** xek has joined #openstack-nova | 21:59 | |
sean-k-mooney | ok interesting, so | 22:01 |
sean-k-mooney | _send_reply uses a direct send | 22:01 |
sean-k-mooney | and that is used for this heartbeat https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L174 | 22:02 |
sean-k-mooney | but this is the send function that matters https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L581 | 22:03 |
sean-k-mooney | ok so ya, it's a topic_send https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L628 | 22:07 |
sean-k-mooney | which means the retry behavior is controlled by the context object | 22:08 |
*** mrch_ has joined #openstack-nova | 22:25 | |
openstackgerrit | Eric Fried proposed openstack/nova master: DNM: Revert "Use external placement in functional tests" https://review.opendev.org/659680 | 22:26 |
sean-k-mooney | ok dumb question, but where does the context object we use to send the rpc call come from | 22:29 |
sean-k-mooney | context = req.environ['nova.context'] | 22:31 |
sean-k-mooney | so what sets that... | 22:31 |
sean-k-mooney | this https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/api/auth.py#L95-L105 ? | 22:32 |
sean-k-mooney | no that has to be the wrong context, that's the keystone auth context | 22:36 |
*** hongbin has quit IRC | 22:39 | |
sean-k-mooney | ok i give up, i can't tell if we actually set retry to true | 22:39 |
imacdonn | I suppose I could add debug logging to find out | 22:40 |
sean-k-mooney | what i can tell is we don't set it explicitly when we call prepare, in which case retry is checked in the context | 22:40 |
melwitt | I think the keystone middleware does that https://docs.openstack.org/keystonemiddleware/latest/middlewarearchitecture.html | 22:41 |
sean-k-mooney | i can see the keystone context being constructed here | 22:42 |
sean-k-mooney | https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/api/auth.py#L95-L106 | 22:42 |
sean-k-mooney | and i think that is the context that is passed here as ctxt https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/compute/rpcapi.py#L593-L598 | 22:42 |
openstackgerrit | Eric Fried proposed openstack/nova stable/stein: DNM: Revert "Use external placement in functional tests" https://review.opendev.org/659682 | 22:42 |
melwitt | yeah, and it's the keystone middleware that sets the headers that are used to construct the context you show | 22:43 |
sean-k-mooney | but when we construct the rpc client we don't set retry to true as far as i can tell | 22:44 |
*** mlavalle has quit IRC | 22:44 | |
openstackgerrit | Eric Fried proposed openstack/nova stable/stein: Skip _exclude_nested_providers() if not nested https://review.opendev.org/659206 | 22:45 |
*** UdayTKumar has quit IRC | 22:48 | |
melwitt | sean-k-mooney: looks like it comes from here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/rpc/client.py#L342-L343 | 22:49 |
melwitt | (maybe, I didn't trace it) | 22:49 |
sean-k-mooney | ok i found that like 30 mins ago but i misinterpreted the default of None as don't retry | 22:50 |
melwitt | so transport._send() will be done with forever retry | 22:50 |
sean-k-mooney | ok so it should retry forever | 22:50 |
sean-k-mooney | ya | 22:50 |
sean-k-mooney | i should have read the comment | 22:51 |
sean-k-mooney | ok so this https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L666 is where the retry is implemented | 22:53 |
*** tkajinam has joined #openstack-nova | 22:55 | |
*** macza has quit IRC | 22:57 | |
sean-k-mooney | ok so i have traced that fully through nova into oslo.messaging, and unless there is a bug in the auto retry support in the kombu lib (which i doubt) the issue is not that we think we sent the message on publish | 23:00 |
imacdonn | seems that that retries ensuring that a connection exists, but doesn't retry sending a message ? | 23:00 |
sean-k-mooney | it does, it delegates it to kombu | 23:00 |
melwitt | yeah, I saw that too, just passes everything (including the message) to kombu's autoretry method | 23:01 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L745-L752 | 23:01 |
melwitt | so you would think after reconnecting, it would send the message | 23:01 |
melwitt | I guess I'm wondering now if what's in these callbacks are what controls how it behaves, whether it would send the message after reconnecting? | 23:03 |
*** xek has quit IRC | 23:03 | |
sean-k-mooney | i assume that is what on_revive=on_reconnection is doing? | 23:04 |
imacdonn | I'm missing something .. where is the message passed to the autoretry method ? | 23:04 |
*** luksky has quit IRC | 23:04 | |
*** rcernin has joined #openstack-nova | 23:04 | |
sean-k-mooney | the execute_method is a functools.partial thing | 23:05 |
*** _erlon_ has quit IRC | 23:05 | |
*** slaweq has joined #openstack-nova | 23:05 | |
melwitt | imacdonn: here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1098-L1101 | 23:05 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1098 | 23:05 |
sean-k-mooney | so the execute_method has the msg and all the args curried in the partial callable | 23:06 |
*** ttsiouts has joined #openstack-nova | 23:06 | |
melwitt | yeah, so the on_reconnection method just reconfigures some things | 23:07 |
sean-k-mooney | yes, but that is called when the connection is revived; the auto retry i think would retry? | 23:07 |
melwitt | unfortunately it looks like we're going to have to look at what autoretry source does | 23:07 |
openstackgerrit | Merged openstack/nova master: Test macvtap port with resource request https://review.opendev.org/641309 | 23:08 |
sean-k-mooney | unless you think we are ending up here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L760-L771 | 23:08 |
melwitt | you would think so but symptom suggests it's not | 23:08 |
sean-k-mooney | well we set the retry limit to float("inf") | 23:08 |
sean-k-mooney | so it could take a while to hit that :) | 23:09 |
melwitt | no, I don't think we're ending up there (we haven't seen that log) | 23:09 |
*** slaweq has quit IRC | 23:10 | |
melwitt | we see a reconnect message, and then we see a timeout fire some milliseconds later http://paste.openstack.org/show/751486/ | 23:10 |
*** ttsiouts has quit IRC | 23:11 | |
imacdonn | the timeout is one minute later | 23:11 |
*** slaweq has joined #openstack-nova | 23:11 | |
melwitt | oh guh, sorry | 23:11 |
imacdonn | ;) | 23:11 |
melwitt | I see now | 23:11 |
*** cmart has quit IRC | 23:15 | |
*** slaweq has quit IRC | 23:16 | |
sean-k-mooney | that's being raised from https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1125 from inside the partial function in the auto retry thing | 23:17 |
sean-k-mooney | actually it's from the publish line, but i'm looking at master on github | 23:18 |
*** JamesBenson has joined #openstack-nova | 23:18 | |
sean-k-mooney | so ya after the first trace which is the publish failing we see the reconnect | 23:19 |
sean-k-mooney | and then i think the timeout for sending that message has expired after we reconnect | 23:20 |
sean-k-mooney | we started to send it at 2019-05-16 18:24:09.089 | 23:20 |
*** dklyle has joined #openstack-nova | 23:21 | |
sean-k-mooney | but that doesn't make sense | 23:22 |
sean-k-mooney | the connection is reestablished at 2019-05-16 18:24:10.109 | 23:22 |
sean-k-mooney | and then there are no log messages until the timeout 1 minute later | 23:22 |
melwitt | right | 23:22 |
sean-k-mooney | but the rpc timeout should be longer than 60 seconds right? | 23:23 |
melwitt | I think it probably defaults to 60s | 23:23 |
*** JamesBenson has quit IRC | 23:23 | |
melwitt | I can't remember where it is, lemme see | 23:23 |
imacdonn | 2019-05-16 18:16:05.973 23144 DEBUG nova.api.openstack.wsgi_app [-] long_rpc_timeout = 1800 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2571 | 23:23 |
imacdonn | wait that's not it | 23:24 |
melwitt | default to 60 https://github.com/openstack/oslo.messaging/blob/7102eddb1fd799d949ff5dca8aa2637794bb8a43/oslo_messaging/rpc/client.py#L41 | 23:24 |
sean-k-mooney | the long_rpc_timeout is not used for every rpc | 23:24 |
melwitt | rpc_response_timeout is the one | 23:24 |
sean-k-mooney | so this is what kombu is doing https://github.com/celery/kombu/blob/master/kombu/connection.py#L500-L547 | 23:26 |
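For reference, the kombu-level mechanism oslo.messaging is leaning on looks roughly like this (a sketch assuming a local broker; Connection.ensure() is the documented kombu API that revives a dropped connection and re-runs the wrapped callable, i.e. re-sends the message, on recoverable errors):

    import kombu

    conn = kombu.Connection('amqp://guest:guest@localhost:5672//')
    producer = kombu.Producer(conn)

    def errback(exc, interval):
        print('publish failed (%s), retrying in %s seconds' % (exc, interval))

    # ensure() returns a wrapper that reconnects and calls producer.publish
    # again on recoverable connection errors, up to max_retries times
    safe_publish = conn.ensure(producer, producer.publish,
                               errback=errback, max_retries=3)
    safe_publish({'ping': 'pong'}, routing_key='some_queue')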
openstackgerrit | Merged openstack/nova master: Fix some issues with the newton release notes https://review.opendev.org/659623 | 23:33 |
imacdonn | attempting repro with kombu debug logging | 23:34 |
*** dklyle has quit IRC | 23:35 | |
melwitt | good call, because looking at the code it looks like it should be doing the right thing. no clue what's happening | 23:38 |
imacdonn | weird .. not getting the connection resets now | 23:42 |
sean-k-mooney | does it not take time for it to happen | 23:42 |
sean-k-mooney | e.g a few minutes | 23:42 |
melwitt | did you remember to set threads=8? | 23:42 |
imacdonn | yeah, I waited 5 mins the first try, then a little over 3 mins second try | 23:43 |
imacdonn | yes, threads=8, but I should see the connection reset either way | 23:43 |
melwitt | true | 23:43 |
sean-k-mooney | ok it's technically friday here so i'm going to have pizza and then sleep. if ye find anything let me know, but for now i guess we should update the bug and tell people to run with threads=1 | 23:45 |
sean-k-mooney | or maybe with debug logging... | 23:45 |
*** artom has joined #openstack-nova | 23:45 | |
melwitt | pizza, sounds like a solid plan | 23:45 |
imacdonn | Thanks, sean-k-mooney - pizza/sleep well! | 23:46 |
sean-k-mooney | i actually had a c++ bug where debug builds did not have the bug, because it was an uninitialised variable and in debug mode the compiler was "kind" enough to initialise it to 0 | 23:46 |
imacdonn | ahh, got the reset now ... paste in a few mins | 23:48 |
imacdonn | http://paste.openstack.org/show/AZQET5IOgntK785RHtxh/ | 23:51 |
melwitt | weird, it says basic_publish_confirm but then one minute later timeout happens | 23:57 |
melwitt | so confused. | 23:58 |