*** hongbin has quit IRC | 00:01 | |
*** BjoernT has quit IRC | 00:02 | |
*** ttsiouts has joined #openstack-nova | 00:08 | |
*** artom has joined #openstack-nova | 00:15 | |
*** gyee has quit IRC | 00:36 | |
*** xek has quit IRC | 00:37 | |
*** dpawlik has joined #openstack-nova | 00:38 | |
*** ttsiouts has quit IRC | 00:41 | |
openstackgerrit | Brin Zhang proposed openstack/nova master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659201 | 00:42 |
*** dpawlik has quit IRC | 00:42 | |
*** ricolin has joined #openstack-nova | 00:50 | |
openstackgerrit | Brin Zhang proposed openstack/nova-specs master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659205 | 00:53 |
*** dpawlik has joined #openstack-nova | 00:54 | |
*** awalende has joined #openstack-nova | 00:57 | |
*** dpawlik has quit IRC | 00:59 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova stable/stein: Skip _exclude_nested_providers() if not nested https://review.opendev.org/659206 | 00:59 |
*** d34dh0r53 has joined #openstack-nova | 01:01 | |
*** awalende has quit IRC | 01:02 | |
*** bbowen has quit IRC | 01:27 | |
*** nicolasbock has quit IRC | 01:36 | |
*** lbragstad has quit IRC | 01:42 | |
*** guozijn has joined #openstack-nova | 01:48 | |
openstackgerrit | Tetsuro Nakamura proposed openstack/nova stable/rocky: Skip _exclude_nested_providers() if not nested https://review.opendev.org/659207 | 01:51 |
*** lbragstad has joined #openstack-nova | 01:51 | |
*** BjoernT has joined #openstack-nova | 02:00 | |
*** whoami-rajat has joined #openstack-nova | 02:06 | |
*** cfriesen has quit IRC | 02:10 | |
*** guozijn_ has joined #openstack-nova | 02:11 | |
*** guozijn has quit IRC | 02:13 | |
*** guozijn_ is now known as guozijn | 02:13 | |
*** dklyle has joined #openstack-nova | 02:14 | |
*** BjoernT has quit IRC | 02:17 | |
*** BjoernT has joined #openstack-nova | 02:20 | |
openstackgerrit | Boxiang Zhu proposed openstack/nova master: Fix failure to boot instances with qcow2 format images https://review.opendev.org/640271 | 02:22 |
*** JamesBenson has joined #openstack-nova | 02:26 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Update usage in RT.drop_move_claim during confirm resize https://review.opendev.org/641806 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Robustify attachment tracking in CinderFixtureNewAttachFlow https://review.opendev.org/658904 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add Migration.cross_cell_move and get_by_uuid https://review.opendev.org/614012 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add InstanceAction/Event create() method https://review.opendev.org/614036 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add Instance.hidden field https://review.opendev.org/631123 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add TargetDBSetupTask https://review.opendev.org/627892 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add CrossCellMigrationTask https://review.opendev.org/631581 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Execute TargetDBSetupTask https://review.opendev.org/633853 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add prep_snapshot_based_resize_at_dest compute method https://review.opendev.org/633293 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add PrepResizeAtDestTask https://review.opendev.org/627890 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add prep_snapshot_based_resize_at_source compute method https://review.opendev.org/634832 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add nova.compute.utils.delete_image https://review.opendev.org/637605 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add PrepResizeAtSourceTask https://review.opendev.org/627891 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Refactor ComputeManager.remove_volume_connection https://review.opendev.org/642183 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add power_on kwarg to ComputeDriver.spawn() method https://review.opendev.org/642590 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add finish_snapshot_based_resize_at_dest compute method https://review.opendev.org/635080 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add FinishResizeAtDestTask https://review.opendev.org/635646 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add Destination.allow_cross_cell_move field https://review.opendev.org/614035 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Support cross-cell moves in external_instance_event https://review.opendev.org/658478 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add cross-cell resize policy rule and enable in API https://review.opendev.org/638269 | 02:26 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Enable cross-cell resize in the nova-multi-cell job https://review.opendev.org/656656 | 02:27 |
openstackgerrit | Merged openstack/nova master: Fix retry of instance_update_and_get_original https://review.opendev.org/658845 | 02:32 |
guilhermesp | hello! I'm currently running an upgrade N -> O. There is something weird going on when I run nova-manage cell_v2 map_instances <id>. It seems that nova_api.instance_mappings stays stuck for a while, incrementing slowly one by one, then after a huge amount of time it starts to re-map the instances in a faster way | 02:33 |
guilhermesp | the deployment has 160k rows in nova.instances | 02:33 |
guilhermesp | we currently re-mapped 80K instances, but sometimes the command finishes and there are still rows to remap | 02:34 |
guilhermesp | the active instances of the deployment are currently given 404, which means they don't reside in the cell newly created during the deployment | 02:34 |
*** BjoernT has quit IRC | 02:35 | |
guilhermesp | we are sometimes adding --max-count to the command, like 50k, 100k, and trying to track the remapping by watching the rows in nova_api.instance_mappings | 02:35 |
guilhermesp | does anyone know what could be the trick in that scenario | 02:35 |
openstackgerrit | Merged openstack/nova master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659201 | 02:35 |
guilhermesp | we have a staging env where this didn't happen | 02:36 |
guilhermesp | but we used to have around 1.8k rows in nova.instances | 02:36 |
guilhermesp | maybe the amount of records in the other env could be a possible reason? | 02:37 |
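(For reference, a minimal sketch of batching map_instances for a large nova.instances table. It assumes the documented return-code convention for nova-manage cell_v2 map_instances, i.e. exit 1 while more instances remain and exit 0 once everything is mapped; the cell UUID and batch size are placeholders.)

    # hypothetical cell UUID; substitute the one created during the upgrade
    CELL_UUID=<cell-uuid>
    # loop until map_instances reports there is nothing left to map
    while true; do
        nova-manage cell_v2 map_instances --cell_uuid "$CELL_UUID" --max-count 50000
        rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "all instances mapped"
            break
        elif [ "$rc" -ne 1 ]; then
            echo "map_instances exited with $rc" >&2
            break
        fi
    done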
*** ttsiouts has joined #openstack-nova | 02:38 | |
*** lbragstad has quit IRC | 02:40 | |
*** lbragstad has joined #openstack-nova | 02:43 | |
*** dpawlik has joined #openstack-nova | 02:55 | |
*** dpawlik has quit IRC | 03:00 | |
*** dpawlik has joined #openstack-nova | 03:11 | |
*** ttsiouts has quit IRC | 03:11 | |
*** hongbin has joined #openstack-nova | 03:14 | |
*** dpawlik has quit IRC | 03:15 | |
*** licanwei has joined #openstack-nova | 03:26 | |
*** udesale has joined #openstack-nova | 03:53 | |
*** hongbin has quit IRC | 03:55 | |
*** JamesBenson has quit IRC | 04:01 | |
*** janki has joined #openstack-nova | 04:43 | |
*** bhagyashris has joined #openstack-nova | 04:46 | |
*** guozijn has quit IRC | 04:47 | |
bhagyashris | sean-k-mooney: Hi, | 04:47 |
*** _d34dh0r53_ has joined #openstack-nova | 04:55 | |
*** guozijn has joined #openstack-nova | 04:56 | |
*** lbragstad has quit IRC | 04:59 | |
*** _d34dh0r53_ has quit IRC | 05:08 | |
*** ttsiouts has joined #openstack-nova | 05:09 | |
*** _d34dh0r53_ has joined #openstack-nova | 05:11 | |
*** dpawlik has joined #openstack-nova | 05:12 | |
*** dpawlik has quit IRC | 05:17 | |
*** ratailor has joined #openstack-nova | 05:21 | |
*** dpawlik has joined #openstack-nova | 05:28 | |
*** Luzi has joined #openstack-nova | 05:40 | |
*** ttsiouts has quit IRC | 05:41 | |
*** ivve has quit IRC | 05:41 | |
*** hamzy has quit IRC | 05:44 | |
*** udesale has quit IRC | 06:05 | |
*** udesale has joined #openstack-nova | 06:06 | |
*** hamzy has joined #openstack-nova | 06:21 | |
*** ccamacho has joined #openstack-nova | 06:31 | |
*** trondham has joined #openstack-nova | 06:33 | |
*** pmannidi has joined #openstack-nova | 06:34 | |
*** udesale has quit IRC | 06:51 | |
*** udesale has joined #openstack-nova | 06:52 | |
*** xek has joined #openstack-nova | 06:54 | |
*** ivve has joined #openstack-nova | 06:58 | |
*** rcernin has quit IRC | 07:01 | |
*** udesale has quit IRC | 07:02 | |
*** openstackgerrit has quit IRC | 07:03 | |
*** udesale has joined #openstack-nova | 07:03 | |
*** tesseract has joined #openstack-nova | 07:05 | |
*** jaosorior has quit IRC | 07:06 | |
*** pcaruana has joined #openstack-nova | 07:13 | |
*** guozijn_ has joined #openstack-nova | 07:15 | |
*** rpittau|afk is now known as rpittau | 07:17 | |
*** guozijn has quit IRC | 07:18 | |
*** guozijn_ is now known as guozijn | 07:18 | |
*** openstackgerrit has joined #openstack-nova | 07:20 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova stable/stein: Fix assert methods in unit tests https://review.opendev.org/657708 | 07:20 |
*** xek has quit IRC | 07:21 | |
*** helenafm has joined #openstack-nova | 07:21 | |
openstackgerrit | Takashi NATSUME proposed openstack/nova stable/rocky: Fix assert methods in unit tests https://review.opendev.org/657709 | 07:21 |
*** ralonsoh has joined #openstack-nova | 07:23 | |
openstackgerrit | zhangboye proposed openstack/nova-specs master: Cap sphinx for py2 to match global requirements https://review.opendev.org/659470 | 07:24 |
kashyap | weshay: Was asleep when you pinged. | 07:26 |
* kashyap reads the scroll | 07:28 | |
*** tssurya has joined #openstack-nova | 07:31 | |
*** udesale has quit IRC | 07:36 | |
kashyap | weshay: I see that melwitt (thanks) has already investigated the issue and posted a patch to skip logging the trace: https://review.opendev.org/#/c/659374/ | 07:37 |
*** udesale has joined #openstack-nova | 07:37 | |
*** ttsiouts has joined #openstack-nova | 07:39 | |
*** slaweq has joined #openstack-nova | 07:40 | |
*** awalende has joined #openstack-nova | 07:49 | |
*** tesseract has quit IRC | 07:52 | |
*** tesseract has joined #openstack-nova | 07:52 | |
*** ttsiouts has quit IRC | 07:57 | |
openstackgerrit | Gorka Eguileor proposed openstack/nova master: Fix doc requirements https://review.opendev.org/659484 | 08:05 |
*** maciejjozefczyk has joined #openstack-nova | 08:07 | |
openstackgerrit | Gorka Eguileor proposed openstack/nova master: Use os-brick locking for volume attach and detach https://review.opendev.org/614190 | 08:07 |
*** awalende_ has joined #openstack-nova | 08:15 | |
*** ttsiouts has joined #openstack-nova | 08:16 | |
*** awalend__ has joined #openstack-nova | 08:18 | |
*** awalende has quit IRC | 08:19 | |
*** awalende_ has quit IRC | 08:20 | |
*** awalende has joined #openstack-nova | 08:21 | |
*** awalend__ has quit IRC | 08:24 | |
*** cdent has joined #openstack-nova | 08:25 | |
*** brinzh has joined #openstack-nova | 08:26 | |
*** brinzhang has quit IRC | 08:29 | |
*** tkajinam has quit IRC | 08:32 | |
*** derekh has joined #openstack-nova | 08:38 | |
*** dtantsur|afk is now known as dtantsur | 08:50 | |
*** sridharg has joined #openstack-nova | 08:53 | |
*** rabel has quit IRC | 09:03 | |
openstackgerrit | Arnaud Morin proposed openstack/nova master: Always Set dhcp_server in network_info https://review.opendev.org/658362 | 09:10 |
*** panda is now known as panda|rover | 09:17 | |
*** brinzh has quit IRC | 09:19 | |
*** ttsiouts has quit IRC | 09:20 | |
*** brinzh has joined #openstack-nova | 09:20 | |
bhagyashris | stephenfin: Hi, just want to ask about reserved_host_cpus: if we set the reserved value of cpus to 0 for both PCPU and VCPU, how will the reservation be maintained? I mean there should be some cpus reserved for host processes | 09:28 |
sean-k-mooney | bhagyashris: that is achieved by not listing them in the cpu_shared_set or the cpu_dedicated_set | 09:32 |
sean-k-mooney | so on an 8 core system you can do cpu_shared_set=2,3 cpu_dedicated_set=4-7 and that leaves cores 0 and 1 for the host | 09:33 |
sean-k-mooney | this is how you reserve cores for the host using vcpu_pin_set today | 09:34 |
bhagyashris | sean-k-mooney: that means the operator should keep in mind that some cpus should be left for host processes while setting the cpu_dedicated_set and cpu_shared_set | 09:34 |
sean-k-mooney | yes | 09:34 |
bhagyashris | okay | 09:34 |
bhagyashris | sean-k-mooney: thank you for clearing my doubt. | 09:34 |
sean-k-mooney | although today reserved_host_cpus defaults to 0 and, by default, vcpu_pin_set if unset enables all host cpus | 09:35 |
sean-k-mooney | so if they dont do that it wont be a change in behavior | 09:35 |
bhagyashris | sean-k-mooney: okay, but suppose the operator assigns all the cpus to the dedicated set, then what will happen in that case? host processes won't have any cpu to run on. | 09:38 |
sean-k-mooney | the host will steal cpu time from the pinned instances | 09:38 |
sean-k-mooney | but again that is what happens today if you dont omit cores from the vcpu_pin_set for the host to use | 09:39 |
*** takashin has left #openstack-nova | 09:40 | |
bhagyashris | okay, so should that case be handled by code or by the operator then | 09:41 |
sean-k-mooney | by the operator | 09:41 |
bhagyashris | okay | 09:41 |
bhagyashris | Thank you :) | 09:42 |
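(For reference, a minimal nova.conf sketch of the split sean-k-mooney describes above for an 8-core host; the option names follow the in-progress cpu-resources work being discussed, so treat it as illustrative rather than a supported configuration on current releases.)

    [compute]
    # cores 2-3 back the shared (VCPU) inventory
    cpu_shared_set = 2,3
    # cores 4-7 back the dedicated/pinned (PCPU) inventory
    cpu_dedicated_set = 4-7
    # cores 0 and 1 appear in neither option, so they remain reserved for host processes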
*** bhagyashris has quit IRC | 09:43 | |
*** pmannidi has quit IRC | 10:05 | |
openstackgerrit | Arnaud Morin proposed openstack/nova master: Always Set dhcp_server in network_info https://review.opendev.org/658362 | 10:08 |
*** awalende_ has joined #openstack-nova | 10:11 | |
*** jaosorior has joined #openstack-nova | 10:12 | |
*** awalende has quit IRC | 10:12 | |
*** mdbooth_ is now known as mdbooth | 10:24 | |
mdbooth | Any reason https://review.opendev.org/658845 shouldn't be backported to stable? | 10:25 |
* mdbooth has posted the backports to Queens and they're all clean | 10:25 | |
*** pcaruana has quit IRC | 10:27 | |
*** xek has joined #openstack-nova | 10:27 | |
*** mmethot has quit IRC | 10:29 | |
*** guozijn has quit IRC | 10:37 | |
*** ricolin has quit IRC | 10:39 | |
*** davidsha has joined #openstack-nova | 10:40 | |
*** tbachman has quit IRC | 10:50 | |
*** luksky has joined #openstack-nova | 10:52 | |
*** nicolasbock has joined #openstack-nova | 10:57 | |
*** panda|rover is now known as panda|rover|eat | 11:04 | |
*** bbowen has joined #openstack-nova | 11:11 | |
*** udesale has quit IRC | 11:15 | |
*** ttsiouts has joined #openstack-nova | 11:17 | |
*** mmethot has joined #openstack-nova | 11:19 | |
*** guozijn has joined #openstack-nova | 11:24 | |
*** yan0s has joined #openstack-nova | 11:25 | |
*** guozijn has quit IRC | 11:33 | |
*** pcaruana has joined #openstack-nova | 11:37 | |
*** dtantsur is now known as dtantsur|brb | 11:46 | |
*** ttsiouts has quit IRC | 11:49 | |
*** ttsiouts has joined #openstack-nova | 11:49 | |
*** florius has joined #openstack-nova | 11:51 | |
*** rabel has joined #openstack-nova | 11:58 | |
rabel | hi. there seems to be a problem with api microversions. i have an openstack rocky (deployed with packstack), but my openstackclient (newest version, directly from git) tells me that the compute api version is 2.1 | 11:59 |
*** florius has left #openstack-nova | 12:01 | |
artom | rabel, it's a known osc behaviour/bug. While novaclient autonegotiates to the greatest common supported microversion, osc defaults to 2.1, and you have to manually specify which microversion you want | 12:02 |
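(For example, a hedged illustration of the difference; the microversion value below is arbitrary and must be one the cloud actually supports.)

    # novaclient negotiates the microversion automatically
    nova list
    # osc stays at 2.1 unless told otherwise, so pass the version explicitly
    openstack --os-compute-api-version 2.60 server list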
*** tbachman has joined #openstack-nova | 12:06 | |
*** ttsiouts has quit IRC | 12:06 | |
*** panda|rover|eat is now known as panda | 12:06 | |
*** tbachman has quit IRC | 12:10 | |
*** panda is now known as panda|rover | 12:11 | |
rabel | artom thanks | 12:13 |
*** tbachman has joined #openstack-nova | 12:13 | |
*** awalende has joined #openstack-nova | 12:27 | |
*** janki has quit IRC | 12:28 | |
*** awalende_ has quit IRC | 12:29 | |
*** ttsiouts has joined #openstack-nova | 12:38 | |
*** tetsuro has joined #openstack-nova | 12:41 | |
*** tetsuro has quit IRC | 12:41 | |
*** ttsiouts has quit IRC | 12:43 | |
alex_xu | efried: I think the in_tree take care of all the cases for https://review.opendev.org/#/c/645520/ also | 12:56 |
*** ccamacho has quit IRC | 12:56 | |
*** ccamacho has joined #openstack-nova | 12:57 | |
*** ratailor has quit IRC | 12:58 | |
*** mchlumsky has joined #openstack-nova | 12:59 | |
*** udesale has joined #openstack-nova | 13:02 | |
*** mchlumsky has quit IRC | 13:06 | |
*** mchlumsky has joined #openstack-nova | 13:08 | |
efried | alex_xu: Okay. I'll take a closer look at that patch soon. | 13:08 |
alex_xu | efried: thx | 13:08 |
*** lbragstad has joined #openstack-nova | 13:09 | |
*** itlinux has quit IRC | 13:10 | |
*** ttsiouts has joined #openstack-nova | 13:12 | |
*** mriedem has joined #openstack-nova | 13:15 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Stop logging traceback when skipping quiesce https://review.opendev.org/659374 | 13:16 |
*** dtantsur|brb is now known as dtantsur | 13:18 | |
*** artom has quit IRC | 13:18 | |
*** ttsiouts has quit IRC | 13:19 | |
*** ttsiouts has joined #openstack-nova | 13:20 | |
*** tbachman has quit IRC | 13:20 | |
*** wwriverrat has quit IRC | 13:22 | |
mriedem | dansmith: random idea, but with something like https://bugs.launchpad.net/nova/+bug/1829349 where the RT is constantly failing b/c of conflicting pinned CPUs on the same host, we could leverage the build failure stats thing and if the RT fails after x hits or whatever we try to take that host out of scheduling | 13:23 |
openstack | Launchpad bug 1829349 in OpenStack Compute (nova) "Resource Tracker failed to update usage in case numa topology conflict happen" [Undecided,New] | 13:23 |
mriedem | they did the naughty and enabled live migration for numa instances, 2 landed on the same host with conflicting pinned CPUs, and now the RT constantly fails but the scheduler continues to pick the host b/c it looks free since its usage isn't getting reported | 13:24 |
dansmith | oye | 13:25 |
dansmith | isn't the build failure count going up in that case anyway? | 13:25 |
mriedem | maybe, i can ask in the bug report | 13:25 |
sean-k-mooney | mriedem: did we fix the typo so that that config option works yet | 13:26 |
* sean-k-mooney goes to check | 13:26 | |
mriedem | the qemu vs QEMU thing? | 13:27 |
mriedem | it wouldn't matter, | 13:28 |
*** xek_ has joined #openstack-nova | 13:28 | |
mriedem | because in this case they enabled the workaround to allow live migrating instances with numa | 13:28 |
sean-k-mooney | ya we were checking for hypervisor = kvm but it's always qemu | 13:28 |
sean-k-mooney | yes we fixed it https://review.opendev.org/#/c/635350/ | 13:29 |
sean-k-mooney | we never actually released with it broken so we should be good | 13:29 |
mriedem | well, this person is on rocky | 13:29 |
mriedem | but yes https://review.opendev.org/#/q/I5127227a1e3d76dd413a820b048547ba578aff6b | 13:29 |
*** xek has quit IRC | 13:30 | |
*** tbachman has joined #openstack-nova | 13:31 | |
sean-k-mooney | mriedem: i would have hoped the numa topology filter would have prevented new vms from being scheduled, by the way, if there was no space | 13:31 |
sean-k-mooney | the one that was migrated would not be fixed but ya | 13:31 |
aarents | Hi, mdbooth I updated https://review.opendev.org/#/c/659054/ if you want to have a look, maybe also mriedem ? | 13:32 |
sean-k-mooney | which reminds me i have a meeting, then i need to update the docs for the sriov migration | 13:32 |
mdbooth | aarents: Thanks, still +1. | 13:34 |
adrianc | sean-k-mooney: Hi, id like to see if we can converge on the FUP for SR-IOV LM, would be great if you can take a quick look :) | 13:34 |
*** brinzh has quit IRC | 13:41 | |
sean-k-mooney | adrianc: sure i think this was more or less fine so ill take a look at the latest version | 13:41 |
adrianc | sean-k-mooney, i did not upload a new PS as i wanted to reach an agreement on the property thing, anyway... you will see | 13:44 |
*** Luzi has quit IRC | 13:45 | |
*** takashin has joined #openstack-nova | 13:49 | |
*** mgariepy has quit IRC | 13:55 | |
efried | Nova meeting in 4 minutes in #openstack-meeting | 13:56 |
*** pcaruana has quit IRC | 13:56 | |
*** artom has joined #openstack-nova | 13:57 | |
mriedem | tssurya: i'm going to remove the locked reason stuff from the runways etherpad since the nova and novaclient stuff is done https://etherpad.openstack.org/p/nova-runways-train | 14:02 |
*** JamesBenson has joined #openstack-nova | 14:02 | |
*** mgariepy has joined #openstack-nova | 14:03 | |
*** JamesBenson has quit IRC | 14:07 | |
*** UdayTKumar has joined #openstack-nova | 14:07 | |
efried | sean-k-mooney: Last week's nova meeting we got a request to review | 14:09 |
efried | https://review.opendev.org/#/c/643023/ | 14:09 |
efried | https://review.opendev.org/#/c/643024/ | 14:09 |
efried | which it looked like you were actively looking into. Would you be able to follow up please? | 14:09 |
sean-k-mooney | yes | 14:10 |
tssurya | mriedem: ack | 14:10 |
*** alex_xu has quit IRC | 14:10 | |
efried | thanks sean-k-mooney | 14:10 |
*** JamesBenson has joined #openstack-nova | 14:14 | |
*** JamesBenson has quit IRC | 14:18 | |
sean-k-mooney | adrianc: just replied on https://review.opendev.org/#/c/659101/2/nova/objects/instance_pci_requests.py@47 | 14:18 |
adrianc | thx sean-k-mooney, ill take a look shortly | 14:19 |
sean-k-mooney | im going to go review https://review.opendev.org/#/q/topic:bug/1809095+(status:open+OR+status:merged) before my next meeting | 14:20 |
sean-k-mooney | which is your fix for bug #1809095 | 14:21 |
openstack | bug 1809095 in OpenStack Compute (nova) "Wrong representor port was unplugged from OVS during cold migration" [Medium,In progress] https://launchpad.net/bugs/1809095 - Assigned to Adrian Chiris (adrian.chiris) | 14:21 |
*** mlavalle has joined #openstack-nova | 14:22 | |
*** itlinux has joined #openstack-nova | 14:27 | |
*** itlinux has quit IRC | 14:31 | |
mriedem | btw, i remember mtreinish trying to get glance-api moved to wsgi in pike and it got hung up on threading stuff with eventlet, hence https://review.opendev.org/#/c/549743/ | 14:32 |
dansmith | yup | 14:32 |
mriedem | oh and https://review.opendev.org/#/c/531498/ | 14:34 |
mriedem | ^ taskflow evenlet async stuff + wsgi + glance-api | 14:35 |
*** d34dh0r53 has quit IRC | 14:36 | |
mtreinish | yep, but I couldn't get anyone to ever really review that or explain how to use the async glance APIs. And there is no dsvm coverage for any of it | 14:36 |
*** d34dh0r53 has joined #openstack-nova | 14:37 | |
*** pcaruana has joined #openstack-nova | 14:37 | |
mriedem | mtreinish: for context we're going over similar in nova-api https://review.opendev.org/#/c/650172/ | 14:37 |
cdent | I reckon the root of the problems in both glance and nova is that we are trying to do async work in an environment (an HTTP API) where it doesn't make sense. The traditional python solution to this problem is to dump into an offboard async task queue (celery and the like). The scatter/gather thing is not quite in the same boat as the glance stuff (which is more amenable to the celery thing perhaps) | 14:41 |
*** d34dh0r53 is now known as Guest28904 | 14:41 | |
mtreinish | cdent: yeah, that was the best way to fix it, to just do the async bits in another service | 14:46 |
mtreinish | but I assumed just launching another process or thread to handle it should work as a quick workaround | 14:46 |
cdent | it ought to | 14:46 |
cdent | but I would guess that zombie blocked processes would present the same concern as blocked threads | 14:47 |
cdent | which might suggest: that's the way unix goes, deal... | 14:47 |
*** JamesBenson has joined #openstack-nova | 14:51 | |
sean-k-mooney | i mean we could just not set a limit on the worker threads in the thread pool or make it a config option | 14:53 |
sean-k-mooney | python will grow and shrink the thread pool automatically if you dont specify the number of workers | 14:53 |
dansmith | that's the problem, these are a threadpool, which means they're re-used and capped at some max number, so if enough api requests come in during a cell-down event, you immediately exhaust the pool of workers until things time out | 14:53 |
dansmith | sean-k-mooney: only up to N*CPUs I think I read | 14:54 |
dansmith | where N is like 3 or something | 14:54 |
sean-k-mooney | dansmith: maybe not sure | 14:54 |
sean-k-mooney | but we could leave it as a config option and default it to -1, meaning let python figure it out | 14:55 |
sean-k-mooney | or set it to X if you care and have measured for your system | 14:55 |
mriedem | https://docs.python.org/3.6/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor | 14:55 |
mriedem | "If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor." | 14:55 |
dansmith | right, so with 5 cells, that becomes $CPUs | 14:56 |
dansmith | and everyone has at least two | 14:56 |
cdent | when calculating the number of threads that will be around, another thing to keep in mind is mod_wsgi will often run with multiple processes and multiple threads in those processes. what is the pool associated with? | 14:57 |
cdent | every request or thread or process? | 14:57 |
* cdent has lost track of the actual code, reviews | 14:58 | |
dansmith | idk | 14:58 |
cdent | process | 14:58 |
sean-k-mooney | so we could revert to creating the thread pool per call and set it to one thread per cell | 14:59 |
sean-k-mooney | but the issue with that is (a) it's expensive to spawn threads and (b) we no longer have a limit on them | 15:00 |
*** ratailor has joined #openstack-nova | 15:00 | |
sean-k-mooney | but we will not have the issue of potentially exhausting the thread pool | 15:00 |
sean-k-mooney | cdent: the pool is a module global in the current patch so it would be associated with the process | 15:01 |
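(For illustration, a minimal sketch of the module-global, bounded thread-pool scatter/gather being discussed; the cell objects and query function are hypothetical and this is not the actual patch under review.)

    import concurrent.futures

    # module-global pool, shared per wsgi process like the patch under review;
    # with max_workers=None, Python 3.6 sizes it at 5 * CPU count
    _EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=None)

    def scatter_gather(cells, query_fn, timeout=60):
        """Run query_fn against every cell in parallel and collect the results."""
        futures = {_EXECUTOR.submit(query_fn, cell): cell for cell in cells}
        results = {}
        # as_completed raises TimeoutError if any cell takes longer than timeout
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            cell = futures[future]
            try:
                results[cell] = future.result()
            except Exception as exc:
                # a down cell surfaces here as an exception (or a sentinel in nova)
                results[cell] = exc
        return results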
*** takashin has left #openstack-nova | 15:02 | |
cdent | sean-k-mooney: [t 6MVa] | 15:03 |
purplerbot | <cdent> process [2019-05-16 14:58:44.001687] [n 6MVa] | 15:03 |
*** artom has quit IRC | 15:03 | |
sean-k-mooney | ? | 15:03 |
cdent | [l 6MVa] | 15:04 |
purplerbot | http://p.anticdent.org/6MVa | 15:04 |
cdent | in other words, I figured it out by looking at the code a few minutes ago | 15:04 |
sean-k-mooney | oh right :) | 15:04 |
mriedem | efried: on that ironic job failure, it looks like that job has probably been broken since stein when we started reporting capability traits for nodes http://logs.openstack.org/32/634832/29/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/fba9197/controller/logs/devstacklog.txt.gz#_2019-05-16_03_28_21_590 | 15:06 |
jangutter | cdent: I really thought you were mistyping 6 million volt-ampere. | 15:06 |
mriedem | if i'm reading that correctly it's expecting only the CUSTOM_GOLD trait | 15:06 |
sean-k-mooney | dansmith: what was the behavior that motivated the scatter-gather code in the first place, do you remember? was it to deal with down cells or was it a performance improvement? | 15:06 |
dansmith | sean-k-mooney: it's to avoid a sequential walk of all the cell databases, serializing their latencies | 15:07 |
sean-k-mooney | ah ok so it was to address performance, ok | 15:07 |
cdent | Are there numbers on those latencies? | 15:07 |
dansmith | sean-k-mooney: cern has 70 cells, some of which are remote | 15:07 |
cdent | ah. the remote part | 15:08 |
dansmith | as in, in another country | 15:08 |
efried | mriedem: looks like the fix is just to take out the square brackets, nah? | 15:08 |
sean-k-mooney | ya even if they were all in the same data center it could be a problem as the number of cells grows | 15:08 |
dansmith | sean-k-mooney: especially if one in the middle is timing out, and we wait for that to complete before moving to the next one and get its response | 15:09 |
dansmith | and everyone has two of these at minimum so moving to sequential means 2N latency, which is twice as bad, regardless of what N is | 15:09 |
sean-k-mooney | i was just wondering if the initial approach to solving the problem was painting us into a corner that could be avoided by revisiting the original use case. cool | 15:10 |
sean-k-mooney | effectively what we want to do is queue up a bunch of async requests and wait for them to complete | 15:11 |
dansmith | ..a thing for which eventlet is ideal | 15:11 |
sean-k-mooney | eventlet and thread pools both allow that with different tradeoffs | 15:11 |
*** awalende has quit IRC | 15:11 | |
sean-k-mooney | the issue isn't the use of eventlet itself | 15:11 |
*** awalende has joined #openstack-nova | 15:12 | |
sean-k-mooney | it's that mod_wsgi and uwsgi have both exposed different bugs caused by using eventlet with them | 15:12 |
*** cfriesen has joined #openstack-nova | 15:12 | |
*** ivve has quit IRC | 15:13 | |
sean-k-mooney | does wsgi provide a way to run requests like this in the background that we could leverage, and fall back to eventlet when not running under wsgi? | 15:15 |
*** artom has joined #openstack-nova | 15:15 | |
*** awalende_ has joined #openstack-nova | 15:15 | |
sean-k-mooney | google found me https://pypi.org/project/uwsgi-tasks/ but no idea if that would help | 15:16 |
*** awalende has quit IRC | 15:16 | |
cdent | that could work for the uwsgi situation, but not mod_wsgi, presumably | 15:18 |
*** luksky has quit IRC | 15:19 | |
sean-k-mooney | i dont know, i personally would prefer to try melwitt's approach first and maybe ask cern to try it and get feedback | 15:19 |
*** JamesBenson has quit IRC | 15:19 | |
*** awalende_ has quit IRC | 15:20 | |
*** JamesBenson has joined #openstack-nova | 15:20 | |
cdent | sean-k-mooney: you mean the current code under review? that seems reasonable. I guess the issue is that we don't know if the problem is a problem until it gets hit... | 15:20 |
sean-k-mooney | cdent: right | 15:20 |
cdent | mriedem seemed to think it could be caused with some effort in devstack | 15:21 |
sean-k-mooney | and yes the current code under review | 15:21 |
efried | mriedem: I'm going to propose that fix unless you've arleady got it in the pipe | 15:21 |
mriedem | efried: i don't, i'm trying to recreate with bash locally | 15:23 |
efried | ack | 15:23 |
mriedem | cdent: i know how to manually test multiple cells in a single node devstack and make one or more down, but that doesn't mean i know how to recreate the issue | 15:24 |
mriedem | we haven't hit this problem in the gate as far as i know with nova-api under wsgi + 2 cells | 15:24 |
mriedem | we also now have the nova-multi-cell job which is effectively 3 cells | 15:24 |
mriedem | cell0/1/2 | 15:25 |
melwitt | imacdonn: do you happen to know why/how the uwsgi/mod_wsgi problem doesn't seem to show itself in the gate? | 15:26 |
cdent | as I remember (which may be wrong) the problem with eventlet needs some work, a long pause, then some attempted work to show up. And the expected problem with threads requires connections to databases (or other I/O) that block weirdly and don't time out in natural ways | 15:26 |
mriedem | efried: removing the brackets gives me this: | 15:26 |
mriedem | jq: error (at <stdin>:1): array (["COMPUTE_N...) and string ("CUSTOM_GOLD") cannot have their containment checked | 15:26 |
cdent | melwitt: I reckon it's the lack of pause | 15:26 |
*** ccamacho has quit IRC | 15:26 | |
melwitt | aye | 15:26 |
efried | mriedem: ugh, I guess it would help me to know wtf jq is. | 15:27 |
mriedem | efried: json parser | 15:27 |
mriedem | efried: i considered just re-writing this with osc-placement and grep | 15:27 |
efried | seems like that would be better. | 15:27 |
mriedem | openstack --os-placement-api-version 1.6 resource provider trait list $provider_id | grep $trait | 15:27 |
mriedem | it would definitely be easier to f'ing read | 15:27 |
efried | agreed | 15:27 |
mriedem | i'll push that up | 15:28 |
efried | ight | 15:28 |
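(For illustration, hedged examples of the containment check; the trait list shown is a made-up sample, not the job's actual output.)

    # jq: index() matches an exact element, so this exits 0 when the trait is present
    echo '["COMPUTE_NET_ATTACH_INTERFACE", "CUSTOM_GOLD"]' | jq -e 'index("CUSTOM_GOLD") != null'
    # contains() also works but wants an array argument (and does substring matching on strings)
    echo '["COMPUTE_NET_ATTACH_INTERFACE", "CUSTOM_GOLD"]' | jq -e 'contains(["CUSTOM_GOLD"])'
    # or skip jq entirely and grep the osc-placement output, as suggested above
    openstack --os-placement-api-version 1.6 resource provider trait list $provider_id | grep CUSTOM_GOLD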
efried | mriedem: I made a story | 15:28 |
mriedem | this would also suggest all ironic jobs would be busted since stein and if that were the case i'd think we would have heard about it by now... | 15:28 |
efried | mriedem: https://storyboard.openstack.org/#!/story/2005725 | 15:28 |
efried | you would think, yah | 15:28 |
mriedem | cdent: melwitt: dansmith: you don't think it's as simple as adding an https://eventlet.net/doc/modules/greenthread.html#eventlet.greenthread.sleep between calls in the scatter gather do you? | 15:29 |
imacdonn | melwitt: I don't know enough about what happens in the gate ... but the problem manifests when nova-api does something that requires an API call, then goes quiet for more than a minute or so, then gets another request that requires RPC | 15:30 |
dansmith | the problem is that we're not monkeypatched properly such that background threads aren't being run and thus the heartbeat for the rabbit connection isn't working | 15:30 |
dansmith | which seems like it should manifest in the gate easily if it's fundamental to how we're doing service setup, and which also means other things shouldn't really be working | 15:31 |
dansmith | so I'm kinda suspicious of the whole thing, and don't really want to "solve" it by changing our whole threading model for one service | 15:31 |
dansmith | but I don't really have time to dig deep, as I said | 15:31 |
cdent | dansmith: I thought the "more than a minute or so" was a critical factor as it is resulting in mod_wsgi doing $something to the python process that breaks the monkey patching | 15:32 |
cdent | but I have to admit that finding the sensible thread through all this is hard so I'm not sure if I'm remembering the right stuff | 15:32 |
mriedem | going quiet for a minute or so is something that wouldn't happen in the gate yeah | 15:33 |
imacdonn | cdent: rabbitmq, in a typical configuration, will drop a connection if it has not seen heartbeats for over a minute | 15:33 |
mriedem | tempest is all hammer until it's done and then we teardown | 15:33 |
imacdonn | cdent: that's what leads to a "connection reset by peer" when nova-api wakes up again and tries to use the connection that it thinks is still healthy | 15:33 |
dansmith | so you think we're just not running heartbeat background tasks if no api requests are coming in? | 15:33 |
cdent | it does sound a bit that way | 15:33 |
imacdonn | I have observed that with debug logging | 15:34 |
dansmith | seems like that should be easy to reproduce in a devstack no? | 15:34 |
mriedem | could we maybe tickle this in a post test script after tempest is done, sleep a minute or something, and then try to create or do something to a server, like stop it? | 15:34 |
*** helenafm has quit IRC | 15:35 | |
kashyap | efried: Thanks for representing on #openstack-meeting (still haven't caught up). Stuck in consecutive calls, afraid. | 15:35 |
dansmith | mriedem: or just a devstack | 15:35 |
mriedem | yeah..i will leave devstack running for many more minutes before coming back to it to do something | 15:35 |
*** ttsiouts has quit IRC | 15:35 | |
dansmith | changing the whole threading model for this seems like a big hammer | 15:35 |
mriedem | also note that vexxhost is running stein and hasn't reported this being a problem... | 15:36 |
cdent | mriedem: is your devstack using mod_wsgi, by default it won't be | 15:36 |
mriedem | cdent: oh no i use the default (uwsgi) | 15:36 |
cdent | and it's not clear the problem is the same uwsgi as mod_wsgi | 15:36 |
cdent | it _might_ be | 15:36 |
dansmith | I thought imacdonn said that it repros the same in uwsgi? | 15:36 |
cdent | but since the only people we've heard of this problem from are tripleo, it's not clear | 15:36 |
melwitt | I think imacdonn said it behaves the same uwsgi and mod_wsgi, IIRC | 15:37 |
imacdonn | I have easily reproduced the exact symptoms (including absence of heartbeats with debug logging) with both mod_wsgi and uWSGI | 15:37 |
cdent | ah, okay | 15:37 |
cdent | cool | 15:37 |
imacdonn | and I'm not tripleo, FWIW | 15:38 |
dansmith | ...said with beaming pride :) | 15:38 |
*** _d34dh0r53_ is now known as d34dh0r53 | 15:38 | |
imacdonn | heh | 15:38 |
cdent | so is the other way around the problem to catch the reset by peer and reopen? because expecting an idle web service to make rpc heartbeats is ... odd | 15:38 |
cdent | imacdonn: sorry, like I said, it's been hard to keep track of the details on this one | 15:38 |
dansmith | yeah | 15:38 |
dansmith | so that's the other thing, | 15:38 |
*** ricolin has joined #openstack-nova | 15:38 | |
dansmith | why is the connection not just immediately re-youknow-connected? | 15:38 |
dansmith | because that would be a lot more sane | 15:39 |
* cdent nods | 15:39 | |
*** macza has joined #openstack-nova | 15:39 | |
imacdonn | the whole point of heartbeats is to keep connections alive when there's no traffic | 15:39 |
imacdonn | for example, a firewall idle timeout could bite you otherwise | 15:39 |
cdent | sure, but a web server process may not even exist in some deployment scenarios when there are no requests | 15:39 |
cdent | who is going to send the heartbeat? | 15:40 |
dansmith | imacdonn: but only really for things where you might be expecting to hear something from the other side | 15:40 |
dansmith | imacdonn: the api server receives no messages other than replies, | 15:40 |
imacdonn | the wsgi app container, I guess? and if that doesn't exist, there's no problem ... a new one would start a new connection | 15:40 |
dansmith | so there's really no reason for a connection except when it's processing a reuest | 15:40 |
imacdonn | I guess I might have assumed that setting up a new connection incurs some expense ... so it's better to keep it open once you have it | 15:41 |
dansmith | open connections that aren't used are pretty expensive in rabbit terms ;) | 15:42 |
imacdonn | kindof like the REST concept | 15:42 |
imacdonn | re. "things where you might be expecting to hear something from the other side" - I wonder if there are cases where that could break - maybe an async RPC call that takes a long time to complete ? | 15:43 |
dansmith | imacdonn: in that case you're processing a request still | 15:44 |
dansmith | what I don't understand is, if the wsgi container has spawned a process to do a thing, and has not killed it off during times of inactivity, what is it doing that prevents the normal threading background stuff from running like any other process? | 15:44 |
dansmith | unless we've tickled something in the python threading library before patching or whatever, but if that's the problem that's the kind of thing we need to know | 15:45 |
dansmith | I'd be interested in what the differences are between you and everyone else.. same exact (unmodified) nova, oslo, et al code? same version of deps, etc? | 15:46 |
dansmith | these are the things worth chasing, imho | 15:46 |
imacdonn | do we know of anyone running stein nova-api on wsgi successfully ? | 15:46 |
imacdonn | cos I only know of the other guys who are seeing the same issue as me | 15:47 |
cdent | it's always running on wsgi, whether you're using nova-api script or using mod_wsgi or uwsgi. the distinguishing factor is standalone eventlet server versus everything else | 15:47 |
*** pcaruana has quit IRC | 15:47 | |
imacdonn | dansmith: see also https://bugs.launchpad.net/nova/+bug/1825584/comments/7 - I haven't really tried to digest that yet | 15:48 |
openstack | Launchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,New] | 15:48 |
sean-k-mooney | cdent: is it running on wsgi when using the console script | 15:48 |
sean-k-mooney | i did not think that was the case | 15:48 |
cdent | yes, wsgi is the protocol by which a web request is translated into an environ dict that is passed to a callable handler along with a start_response function | 15:49 |
cdent | it's the reason why it is possible to run what amounts to the same code with the eventlet server based console script and "everything else" | 15:50 |
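(A minimal sketch of the protocol cdent describes: whether it's the eventlet-server console script, uwsgi or mod_wsgi, the server ultimately calls an application object shaped like this.)

    def application(environ, start_response):
        # The WSGI contract: the server passes a per-request environ dict and a
        # start_response callable; the app returns an iterable of body bytes.
        body = b'{"versions": []}'
        start_response('200 OK', [('Content-Type', 'application/json'),
                                  ('Content-Length', str(len(body)))])
        return [body]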
dansmith | imacdonn: okay if it's stopping the interpreter, why does using another real thread change it? does it just pause the main thread's interpreter and release the GIL? | 15:50 |
sean-k-mooney | cdent: i did not know https://github.com/openstack/nova/blob/master/nova/cmd/api.py was implementing the wsgi protocol but it looks like it is | 15:50 |
dansmith | if the wsgi server is pausing the interpreter, I take that to mean it expects the thing to stop doing stuff, and thus, shouldn't really be doing heartbeats? | 15:50 |
cdent | imacdonn: have you tried melwitt's patch? does it "fix it". Because if not, maybe it doesn't... | 15:51 |
*** ttsiouts has joined #openstack-nova | 15:51 | |
cdent | and what we said above about reopening the connection is the real fix | 15:51 |
imacdonn | cdent: depends which patch, I guess .. there was a set of 3 related ones, and the last one removed the monkey patching ... I did some basic testing of that, and it seemed to work OK | 15:52 |
imacdonn | by "that", I mean "all three" | 15:52 |
cdent | imacdonn: this one: https://review.opendev.org/#/c/650172/ | 15:52 |
imacdonn | I think I did (most of) these: https://review.opendev.org/#/q/topic:cell-scatter-gather-futurist+(status:open+OR+status:merged) | 15:53 |
*** wwriverrat has joined #openstack-nova | 15:54 | |
dansmith | so if you disable monkeypatching but still use eventlet, I think everything just becomes synchronous, right? | 15:54 |
dansmith | in that case, the "it works" is probably just because you're doing everything serially, so unless you apply the other patches to make it actually use the native threading, it's not a relevant test, IMHO | 15:55 |
sean-k-mooney | dansmith: if you dont call eventlet directly i think so | 15:55 |
*** ttsiouts has quit IRC | 15:55 | |
sean-k-mooney | im not sure what happens if you do call eventlet spawn et al | 15:55 |
dansmith | again, I think we need to repro this in a clean devstack and/or the gate before we get too much further.. or figure out why we can't. | 15:56 |
sean-k-mooney | you mean the rabbitmq issues | 15:57 |
sean-k-mooney | is the theory that they will start happening if the api is not actively used for a few minutes | 15:58 |
*** rpittau is now known as rpittau|afk | 15:58 | |
sean-k-mooney | e.g. it should break if we stack, go have coffee and come back? | 15:58 |
*** imacdonn has quit IRC | 15:58 | |
bnemec | There's some analysis of what was happening in http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005310.html | 16:01 |
*** imacdonn has joined #openstack-nova | 16:02 | |
imacdonn | I missed the last ~5 mins ... *^&&*$% stupid VPN | 16:02 |
*** sridharg has quit IRC | 16:02 | |
imacdonn | for repro, I think the best way is to enable debug logging for oslo_messaging and observe heartbeats (or lack thereof) | 16:03 |
cdent | dansmith, mriedem I put a comment with links to this discussion and some other summaries on the ThreadPoolExecutor patch | 16:03 |
imacdonn | (after making an API call that requires RPC - like getting the VNC console URL for an instance) | 16:03 |
dansmith | imacdonn: is it specific to vnc console by chance? because many api calls make rpc calls.. pretty much anything other than get/list instances | 16:04 |
fungi | mriedem et al: i found what was wrong with http://status.openstack.org/reviews/ and got content updating again, though i'm at a loss for explanation as to how it broke (one of the dependencies we pip install for reviewday, cheetah, was somehow not installed on the server and all reasonable explanations have been ruled out) | 16:04 |
fungi | anyway, let us know if you're relying on it and see it go stale again somehow | 16:04 |
fungi | maybe i'll be able to track it down that way | 16:05 |
fungi | should update roughly every 30 minutes | 16:05 |
imacdonn | dansmith: I don't believe so .. that's just an easy way to make an RPC call happen | 16:05 |
*** dims has quit IRC | 16:05 | |
*** tesseract has quit IRC | 16:05 | |
*** yan0s has quit IRC | 16:07 | |
dansmith | confirming that would be good, because I think we have mucked with console stuff recently | 16:07 |
dansmith | and would also help narrow down the differences | 16:07 |
*** ratailor has quit IRC | 16:07 | |
imacdonn | I have observed other things failing, but didn't make note of what specifically .. is there a particular call that you would suggest? | 16:09 |
*** awalende has joined #openstack-nova | 16:09 | |
dansmith | I suggest reproducing this with a devstack | 16:09 |
imacdonn | working on that (devstack) now | 16:09 |
mriedem | fungi: thanks | 16:11 |
fungi | any time! | 16:11 |
*** awalende has quit IRC | 16:13 | |
*** yonglihe has quit IRC | 16:14 | |
*** cdent has quit IRC | 16:17 | |
*** dtantsur is now known as dtantsur|afk | 16:19 | |
*** luksky has joined #openstack-nova | 16:21 | |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Support cross-cell moves in external_instance_event https://review.opendev.org/658478 | 16:40 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Add cross-cell resize policy rule and enable in API https://review.opendev.org/638269 | 16:40 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: WIP: Enable cross-cell resize in the nova-multi-cell job https://review.opendev.org/656656 | 16:40 |
aspiers | kashyap, efried: are you good with https://review.opendev.org/#/c/655268/ now? | 16:40 |
*** jmlowe has quit IRC | 16:41 | |
mriedem | efried: i never +2ed this https://review.opendev.org/#/c/659374/ | 16:41 |
mriedem | well, | 16:41 |
mriedem | i did before -1ing it | 16:41 |
*** ricolin has quit IRC | 16:44 | |
mriedem | but now i am :) | 16:44 |
*** ttsiouts has joined #openstack-nova | 16:50 | |
imacdonn | dansmith: I've reproduced the lack of heartbeats on devstack (and confirmed that the heartbeats happen if monkey patching is removed) ... but somehow (so-far) devstack is able to recover from the unexpected loss of amqp connection, where my RDO environment is (sometimes) not | 16:51 |
dansmith | okay so devstack is behaving like we would expect, assuming the pause-the-interpreter wsgi behavior you describe | 16:51 |
imacdonn | so-far, I suppose so | 16:52 |
imacdonn | if the pause is considered "expected" | 16:52 |
*** mrch_ has quit IRC | 16:53 | |
*** bauzas has quit IRC | 16:54 | |
dansmith | if the pause isn't, then the problem is in the configuration of the wsgi container or something else | 16:56 |
*** bauzas has joined #openstack-nova | 16:57 | |
imacdonn | but it only happens when python has been monkeyed-with | 16:57 |
dansmith | mm, I think you're misunderstanding | 16:58 |
dansmith | I think that wsgi is pausing regardless, | 16:58 |
*** davidsha has quit IRC | 16:58 | |
dansmith | and in the eventlet case, there is only one thing to pause, so it pauses | 16:58 |
dansmith | and without eventlet, we spawn a real thread that the wsgi container does not pause (only the main thread) | 16:58 |
dansmith | which is why I say that if the design of the wsgi container is to pause the whole thing between requests, then us subverting that with a real thread seems wrong | 16:59 |
imacdonn | I definitely don't understand the mechanisms well enough ... so my observation is fairly high-level ... in one case, the heartbeats happen, so it seems awake .. in the other case, it goes completely quiet | 16:59 |
imacdonn | so if it's wrong, the whole oslo_messaging model seems inappropriate ..... ??? | 17:00 |
*** derekh has quit IRC | 17:00 | |
imacdonn | cos it does heartbeats quite intentionally, AFAIK ... | 17:00 |
dansmith | well, o.msg _was_ developed with eventlet servers in mind | 17:00 |
*** jmlowe has joined #openstack-nova | 17:01 | |
*** dims has joined #openstack-nova | 17:03 | |
*** lbragstad has quit IRC | 17:05 | |
dansmith | so here's a random person asking a similar question: https://github.com/benoitc/gunicorn/issues/1924 | 17:06 |
dansmith | not much response there, except that it basically says "don't spawn worker threads from your wsgi app" | 17:06 |
dansmith | check this out: https://docs.newrelic.com/docs/agents/python-agent/web-frameworks-servers/python-agent-uwsgi-web-server | 17:08 |
dansmith | "By default uWSGI does not enable threading support within the Python interpreter core. This means it is not possible to create background threads from Python code. " | 17:09 |
dansmith | ...which might mean that disabling eventlet and letting threading be unpatched just gets you synchronous behavior | 17:09 |
*** tssurya has quit IRC | 17:10 | |
dansmith | I wonder if using --enable-threads but leaving the monkeypatching in place would allow the eventlet background threads to run? | 17:11 |
imacdonn | "By default the Python plugin does not initialize the GIL. This means your app-generated threads will not run. If you need threads, remember to enable them with enable-threads. Running uWSGI in multithreading mode (with the threads options) will automatically enable threading support. This “strange” default behaviour is for performance reasons, no shame in that." | 17:12 |
imacdonn | ^ from https://uwsgi-docs.readthedocs.io/en/latest/ThingsToKnow.html | 17:12 |
dansmith | yeah, just saw that.. so... do you have threads enabled? | 17:12 |
*** dims has quit IRC | 17:12 | |
imacdonn | checking ... probably not | 17:12 |
*** lbragstad has joined #openstack-nova | 17:13 | |
*** panda|rover is now known as panda|rover|off | 17:13 | |
*** dims has joined #openstack-nova | 17:13 | |
dansmith | if I were you, I'd really want to find out why devstack recovers properly and yours does not though | 17:14 |
imacdonn | or... | 17:14 |
imacdonn | # grep threads /etc/nova/nova-api-uwsgi.ini | 17:14 |
imacdonn | enable-threads = true | 17:14 |
imacdonn | # | 17:14 |
imacdonn | (this is on devstack) | 17:14 |
imacdonn | thinking about that some more ... I think if enable-threads was not on, the heartbeat thread would not work in the non-patched case (?) | 17:15 |
imacdonn | guess I could attempt to test that | 17:15 |
*** ivve has joined #openstack-nova | 17:16 | |
imacdonn | confirmed that I do have enable-threads in my "real" (non-devstack) config too | 17:16 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Fix some issues with the newton release notes https://review.opendev.org/659623 | 17:18 |
dansmith | so, | 17:18 |
dansmith | thinking about this some more | 17:19 |
openstackgerrit | Stephen Finucane proposed openstack/nova master: Use new hotness for upper-constraints https://review.opendev.org/659624 | 17:19 |
dansmith | I'm not sure how eventlet could really ever work in the wsgi mode, because background tasks really require the main thread to block on the event loop | 17:19 |
dansmith | I'm struggling to find where the entry point is for nova-api in wsgi mode... | 17:19 |
imacdonn | that seems to be the consensus ... eventlet and wsgi containers seem fundamentally incompatible | 17:21 |
dansmith | well, | 17:21 |
dansmith | not really, | 17:21 |
dansmith | just wsgi+eventlet with a background thread | 17:21 |
imacdonn | ok | 17:21 |
imacdonn | FWIW, with enable-threads turned off, and monkey patching still removed, it looks the same - no heartbeats after the API request that opens an RPC connection | 17:23 |
dansmith | yeah, not running the event loop means we'll just never do that thing | 17:23 |
*** ttsiouts has quit IRC | 17:24 | |
dansmith | tbh, it kinda seems like what we should be doing is splitting nova-api into two pieces in a wsgi world | 17:25 |
dansmith | which is, of course, not how it was designed | 17:25 |
imacdonn | someone else suggested that | 17:25 |
dansmith | mnaser: you around? | 17:27 |
dansmith | imacdonn: how instantaneous is the reconnect to rabbit on devstack? | 17:28 |
imacdonn | no perceptible delay | 17:28 |
mnaser | dansmith: hiya | 17:29 |
dansmith | mnaser: you are on stein and running nova-api in wsgi something right? | 17:29 |
*** _alastor_ has quit IRC | 17:29 | |
mnaser | Correct. Uwsgi | 17:29 |
dansmith | mnaser: and you don't see this rabbit heartbeat timeout freezeup thing correct? | 17:30 |
*** xek__ has joined #openstack-nova | 17:30 | |
mnaser | I have not. I thought the ML post thread concluded it was mod_wsgi related ? | 17:30 |
dansmith | mnaser: nope, both apparently | 17:30 |
*** _alastor_ has joined #openstack-nova | 17:31 | |
dansmith | mnaser: imacdonn says it's reproducible in devstack too, but it just reconnects instantaneously and thus isn't a problem (which is what ideal behavior should be) | 17:31 |
mnaser | Hmm, I mean, maybe it could be? I’ve not seen any issues but maybe our APIs get hit often enough? | 17:31 |
dansmith | i.e. not requiring the wsgi app to run a daemon thread to maintain a connection | 17:31 |
mnaser | Let me grep some logs | 17:32 |
dansmith | mnaser: possible I guess. is there any way you could take one worker out of LB rotation and see if it hangs up after a couple minutes if you hit it directly? | 17:32 |
mnaser | What string should I look for? | 17:32 |
dansmith | imacdonn: ^ | 17:32 |
imacdonn | QRX ... reproducing again | 17:32 |
imacdonn | erm, I mean .. standby :) | 17:32 |
dansmith | imacdonn: ah, so you are the ham I keep finding in my googling :D | 17:32 |
*** xek_ has quit IRC | 17:32 | |
imacdonn | even an occasional chirp user ;) | 17:33 |
dansmith | hah nice | 17:33 |
mnaser | gosh the latency is awful from here | 17:34 |
mnaser | oh actually | 17:34 |
mnaser | I have a quiet stein cloud here | 17:34 |
*** dklyle has quit IRC | 17:34 | |
dansmith | imacdonn: is it possible that mod_wsgi is the reason you don't get a reconnect and that uwsgi, despite pausing, makes that just work? | 17:35 |
imacdonn | mnaser: one thing you could look for is, in your rabbitmq server log: | 17:35 |
imacdonn | =ERROR REPORT==== 16-May-2019::16:45:39 === | 17:35 |
imacdonn | closing AMQP connection <0.4566.0> (x.x.x.x:34178 -> x.x.x.x:5672 - uwsgi:47892:8bc2c402-d1d4-4762-9b83-29a7b7f419fd): | 17:35 |
imacdonn | missed heartbeats from client, timeout: 60s | 17:35 |
imacdonn | dansmith: no, because I'm using uWSGI "for real" too | 17:35 |
dansmith | okay | 17:35 |
-openstackstatus- NOTICE: Gerrit is being restarted to add gitweb links back to Gerrit. Sorry for the noise. | 17:37 | |
mnaser | so is it only something logged inside RabbitMQ? | 17:37 |
*** sean-k-mooney has quit IRC | 17:39 | |
*** pcaruana has joined #openstack-nova | 17:39 | |
imacdonn | trying to get the nova-api side logs .. I'll have to clean and pastebin them | 17:39 |
*** sean-k-mooney has joined #openstack-nova | 17:41 | |
mnaser | im seeing this on a lightly loaded stein cluster 'missed heartbeats from client, timeout: 60s' | 17:42 |
mnaser | gonna see what nova logs say | 17:42 |
imacdonn | is there a handy place I can upload 10k lines of log to ? | 17:43 |
mnaser | 2019-05-16 17:38:53.643 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer | 17:44 |
dansmith | mnaser: but sounds like it's recovering when a new call comes in such that it's not noticeable? | 17:45 |
imacdonn | mnaser: yes! that's one of symptoms | 17:45 |
dansmith | bnemec: do you know if there's a way to turn off the heartbeat thread for rabbit? | 17:45 |
mnaser | but is there a problem however? | 17:45 |
dansmith | bnemec: for o.msg's rabbit driver I mean | 17:45 |
mnaser | like n-api seems to continue to respond fine afterwards | 17:45 |
mnaser | 2019-05-16 17:38:53.644 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer | 17:46 |
bnemec | There used to be. It was considered experimental for a couple of cycles. Let me look. | 17:46 |
dansmith | mnaser: yeah, so that's the behavior we would expect | 17:46 |
mnaser | but at a 17:38:54 a GET /v2.1/servers/uuid worked fine | 17:46 |
dansmith | dunno why imacdonn isn't getting that proper reconnect behavior | 17:46 |
dansmith | turning off the heartbeat thread for the wsgi workers would eliminate the log noise I think | 17:46 |
*** udesale has quit IRC | 17:46 | |
mnaser | rabbitmq heartbeats are super unreliable in my experience | 17:47 |
bnemec | dansmith: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L145 | 17:47 |
mnaser | there used to be a nice article about configuring timeouts | 17:47 |
efried | aspiers: not sure I ever looked at that. | 17:47 |
imacdonn | yeah, perhaps I need to look closer at that ... but still, I think that heartbeats are there for a reason, and letting connections rot and "fail by design" is kinda icky | 17:47 |
mnaser | https://medium.com/@george.shuklin/rabbit-heartbeat-timeouts-in-openstack-fa5875e0309a | 17:47 |
mnaser | dunno how relevant it is but | 17:48 |
mnaser | a good read none the less | 17:48 |
dansmith | imacdonn: yeah, I just don't think I agree :) | 17:48 |
imacdonn | if nova-api isn't going to use persistent AMQP connections, it should gracefully close the connection before it goes to sleep, IMO | 17:48 |
mnaser | my question is: is there a problem? are your requests forever 500-ing after? | 17:49 |
dansmith | nova-api doesn't go to sleep | 17:49 |
melwitt | what's the behavior when it doesn't properly reconnect? nova-api responds with a failure? | 17:49 |
mnaser | ^^ | 17:49 |
imacdonn | yeah, I get some sort of 50x error | 17:49 |
dansmith | presumably because something is getting stuck and not letting the connections close off, which is why I want to know why imacdonn is different from the rest of the world | 17:50 |
mnaser | if you get a 50x that means you have a traceback, right? | 17:50 |
imacdonn | dansmith: hey now .. there was at least one other person experiencing the failures | 17:51 |
mnaser | should you have a traceback if you're getting a 500? | 17:51 |
dansmith | imacdonn: I meant your deployment, not you as a person :) | 17:51 |
imacdonn | well, yeah, but the other person was using a different deployment, and mod_wsgi | 17:51 |
* mnaser still asks if there is a traceback | 17:52 | |
dansmith | imacdonn: ^ | 17:52 |
efried | mriedem: Were you going to put up a null nova change depending on https://review.opendev.org/659611 to make sure the ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa job passes? | 17:53 |
melwitt | I had thought someone else faced this behavior too, but looking at the bug report, they say it reconnects https://bugs.launchpad.net/tripleo/+bug/1827744 | 17:53 |
openstack | Launchpad bug 1827744 in tripleo "Running eventlet.monkey_patch in nova_api breaks the AMQP heartbeat thread" [High,In progress] - Assigned to Damien Ciabrini (dciabrin) | 17:53 |
efried | or does ironic run that job too? | 17:53 |
imacdonn | I'll have to rebuild my "real" environment with uWSGI and reproduce again, if we need more context around the failure to reconnect ... I'd already started down the path of reverting to running nova-api standalone, since there didn't seem to be a path forward | 17:53 |
imacdonn | the "other person" is here: https://bugs.launchpad.net/nova/+bug/1825584/comments/7 | 17:54 |
openstack | Launchpad bug 1825584 in OpenStack Compute (nova) "eventlet monkey-patching breaks AMQP heartbeat on uWSGI" [Undecided,New] | 17:54 |
dansmith | imacdonn: can you also try disabling rabbit heartbeats on a wsgi api node? just to see if that eliminates the log noise | 17:54 |
dansmith | imacdonn: even if on the devstack | 17:54 |
melwitt | imacdonn: that's the bug I linked, I think, and the proposed patch is to stop monkey patching nova-api | 17:54 |
imacdonn | what do you mean by "log noise"? or are you just asking me to test if it's possible to disable heartbeats at all ? | 17:54 |
melwitt | but in that bug report, he says that rabbit reconnects | 17:55 |
imacdonn | that's my bug report, but he commented on it | 17:55 |
melwitt | no, the bug I linked you | 17:55 |
dansmith | imacdonn: I mean the log noise where nova-api complains that the connection was reset before reconnecting (in mnaser's case) | 17:55 |
melwitt | imacdonn: https://bugs.launchpad.net/tripleo/+bug/1827744 | 17:55 |
openstack | Launchpad bug 1827744 in tripleo "Running eventlet.monkey_patch in nova_api breaks the AMQP heartbeat thread" [High,In progress] - Assigned to Damien Ciabrini (dciabrin) | 17:55 |
dansmith | i.e. this: [10:46:02] <mnaser>2019-05-16 17:38:53.644 3897 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer | 17:55 |
sean-k-mooney | actually in the api we don't really care about heartbeats to rabbitmq, right | 17:56 |
dansmith | sean-k-mooney: I've been shouting that for hours, yes:) | 17:57 |
dansmith | there's really no reason to maintain a persistent connection to rabbit from an idle api worker | 17:57 |
sean-k-mooney | actually i was going to say we never actually listen for anything from the api, but i guess we do if we make a blocking call | 17:57 |
imacdonn | melwitt: ah, I see .. OK ... yeah, maybe it was non-fatal in his case | 17:57 |
mnaser | I think the connection reset by peer is the real issue, I think the warning is log noise | 17:57 |
dansmith | sean-k-mooney: only in the reply case, which isn't a problem here | 17:57 |
sean-k-mooney | ya | 17:57 |
dansmith | mnaser: the connection reset by peer is because we're not running the heartbeat thread, and thus is expected | 17:58 |
sean-k-mooney | so even if the connection drops due to inactivity, in the normal case we just fire and forget | 17:58 |
dansmith | mnaser: it's the "this is dead let me reconnect immediately" signal, which seems to be working fine for you | 17:58 |
dansmith | sean-k-mooney: right | 17:58 |
mnaser | I'm in china and it's 2am but for the last time I'll say it | 17:58 |
mnaser | if you're getting 500s, it probably threw a traceback, and that'll probably contain something useful | 17:59 |
mnaser | (which will be the real issue) | 17:59 |
dansmith | mnaser: imacdonn said he needs to rebuild his env to get that traceback | 17:59 |
imacdonn | ^ that | 17:59 |
mnaser | aaaha | 17:59 |
mnaser | okay, well in that case, good luck | 17:59 |
mnaser | fwiw let me share the OSA config | 17:59 |
efried | mriedem: I guess it does run it. It failed. And looking closer, the 'contains' match was working properly. The problem is further up: the inventory is never getting set. | 17:59 |
dansmith | the other thing I'm interested in is, if you just don't patch, and everything seems to work, is that just because eventlet is being synchronous, or because eventlet primitives are spawning real threads? | 18:00 |
mnaser | https://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/templates/nova-uwsgi.ini.j2 | 18:01 |
mnaser | this is our uwsgi config | 18:01 |
* mnaser & | 18:01 | |
melwitt | I'm also wondering about that | 18:01 |
dansmith | I think that if disabling the heartbeat thread removes the log noise and we reconnect as we would expect for all cases except the unexplained ones, that'd be my preference as a path forward | 18:02 |
*** nicolasbock has quit IRC | 18:02 | |
dansmith | because changing our threading model for some of nova, and maintaining a thread off each wsgi worker just to hold a useless connection to rabbit does not seem like an improvement to me | 18:02 |
mriedem | efried: ack ok just getting back, haven't looked | 18:03 |
efried | mriedem: Left comment on https://review.opendev.org/#/c/659611/ | 18:04 |
imacdonn | dansmith: not sure I like the idea of turning off heartbeat .. that could have side-effects ... like leaving stale connections on the rabbitmq-server side for ever (??) | 18:04 |
dansmith | imacdonn: show us that it causes negative effects | 18:04 |
imacdonn | not sure that I can come up with every possible scenario off the top of my head | 18:05 |
dansmith | the other things we could do are: | 18:05 |
dansmith | 1. establish and drop connections before/after request | 18:06 |
dansmith | 2. leave heartbeats in place, as they are today so that rabbit will close the stale connections, which seems to be working for most people, | 18:06 |
efried | mriedem: n-cpu thinks the inventory is being set: http://logs.openstack.org/11/659611/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/9c644a2/controller/logs/screen-n-cpu.txt.gz#_May_16_17_17_55_788562 | 18:06 |
sean-k-mooney | if i do | 18:06 |
sean-k-mooney | >>> def func(): | 18:06 |
dansmith | 3. change our threading model for part of nova to run a heartbeat thread | 18:06 |
sean-k-mooney | ... print("test") | 18:06 |
sean-k-mooney | ... | 18:06 |
sean-k-mooney | >>> eventlet.greenthread.spawn(func) | 18:06 |
sean-k-mooney | it never prints anything | 18:06 |
dansmith | 4. split nova-api into the wsgi request processing side and the persistent bit | 18:06 |
sean-k-mooney | so if you don't monkeypatch it looks like spawn does nothing but queue the function to run | 18:07 |
sean-k-mooney | and never run it | 18:07 |
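For readers following along, a minimal runnable restatement of the experiment above (assuming eventlet is installed and nothing has been monkey patched):

    import eventlet

    def func():
        print("test")

    gt = eventlet.greenthread.spawn(func)
    # nothing is printed yet: spawn() only schedules the greenthread, and with
    # no monkey patching there is no yield point that lets the eventlet hub run it
    gt.wait()  # switching to the hub here finally runs func() and prints "test"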
dansmith | sean-k-mooney: that makes me wonder how disabling monkeypatching does anything | 18:07 |
sean-k-mooney | dansmith: as long as you don't call the scatter gather code it should appear to work | 18:07 |
dansmith | sean-k-mooney: right, but we do that for instance list, and they say that disabling monkeypatching just makes everything work normally, right? | 18:08 |
sean-k-mooney | that is the only part of the api that actually uses eventlet | 18:08 |
efried | mriedem: oh, reserved is 1 and never gets set to 0 | 18:08 |
mriedem | efried: hmm, i wonder if it thinks the node isn't available | 18:08 |
sean-k-mooney | dansmith: maybe it's different on python 2 | 18:08 |
mriedem | b/c i think that's when the driver sets reserved=1 | 18:08 |
sean-k-mooney | i just checked with py36 | 18:09 |
efried | yup | 18:09 |
mriedem | Node 7c5e9503-1e20-42bf-9ae7-d14cef0f38a9 is not ready for a deployment, reporting resources as reserved for it. Node's provision state is cleaning, power state is power off and maintenance is False. | 18:09 |
efried | just so | 18:09 |
mriedem | yeah so the node is dead | 18:10 |
mriedem | http://logs.openstack.org/11/659611/1/check/ironic-tempest-ipa-wholedisk-bios-agent_ipmitool-tinyipa/9c644a2/controller/logs/screen-ir-cond.txt.gz#_May_16_17_28_09_311564 | 18:10 |
imacdonn | sean-k-mooney dansmith: side-note ... my "real" deployment is py2, devstack is py3 ... may be a factor in the failure to reconnect | 18:10 |
* efried edits story text | 18:13 | |
mriedem | yeah i just abandoned the change | 18:13 |
sean-k-mooney | dansmith: if you save the greenthread object and then call wait() it runs all the queued greenthreads | 18:15 |
dansmith | sean-k-mooney: right that's the main loop part we will never do in wsgi mode | 18:16 |
sean-k-mooney | so if i do gt = eventlet.greenthread.spawn(func) | 18:16 |
sean-k-mooney | it prints after i call gt.wait() | 18:16 |
sean-k-mooney | don't we wait in the scatter gather however | 18:16 |
dansmith | oh, sorry your point is maybe that's why it's working if they disable? | 18:17 |
dansmith | yeah, so we do that in sequence, which means it becomes serialized | 18:17 |
dansmith | so they just hit cell0, then cell1, etc | 18:17 |
dansmith | that's the behavior I had been assuming for non-patched, | 18:18 |
dansmith | I just wasn't sure if in reality the lower layers would actually use threading.Thread() unpatched for the work | 18:19 |
sean-k-mooney | well i think when we call wait it is entering a cooperative green thread mode so we still get concurrency | 18:19 |
dansmith | which hmm, I guess could still be | 18:19 |
sean-k-mooney | until that greenthread finishes at least | 18:19 |
sean-k-mooney | and since we are doing it in a for loop it will complete the scatter gather, as we will wait on all the greenthreads we queued | 18:19 |
sean-k-mooney | so we should not need to monkey patch | 18:20 |
*** ralonsoh has quit IRC | 18:20 | |
dansmith | well, it depends on whether it's spawning a real thread and we're just waiting on it's future, or if we pass through and just exec the thread function during the wait | 18:20 |
sean-k-mooney | let me try a slightly less dumb test function that does a green thread sleep and increments a counter, and see if it's serialised or random | 18:21 |
dansmith | sean-k-mooney: >>> def func2(): | 18:21 |
dansmith | ... print(len(threading.enumerate())) | 18:21 |
dansmith | ... | 18:21 |
dansmith | >>> t = eventlet.greenthread.spawn(func2) | 18:21 |
dansmith | >>> t.wait() | 18:21 |
dansmith | test | 18:21 |
dansmith | 1 | 18:21 |
dansmith | sean-k-mooney: only one thread running from inside the thread function | 18:21 |
dansmith | so it's sequential | 18:21 |
dansmith | the test came from another thread I spawned first (like your example) which hadn't been waited on | 18:22 |
dansmith | normally enumerate() returns one for the main thread all the time, | 18:22 |
dansmith | so 1 means it didn't spawn something else for func2() | 18:23 |
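A runnable version of that check (same assumptions as above, eventlet present and unpatched); threading.enumerate() reporting a single OS thread from inside the greenthread is the point being made here, namely that greenthreads are not real threads:

    import threading
    import eventlet

    def func2():
        # unpatched threading module, so this counts real OS threads
        print(len(threading.enumerate()))

    t = eventlet.greenthread.spawn(func2)
    t.wait()  # prints 1: only MainThread exists, no extra thread was spawned for func2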
*** ccamacho has joined #openstack-nova | 18:23 | |
dansmith | so one thing we could do in the interim is, | 18:24 |
dansmith | provide a knob to let them avoid the monkeypatch without having to actually patch the code, if they prefer synchronous behavior | 18:24 |
dansmith | that isn't too terrible for people with just the default two cells, and the rest of the people who don't suffer from this can continue to do what we do today | 18:25 |
melwitt | I think that's already available in the form of an env variable. which is what this tripleo patch is proposing to turn off monkey patching https://review.opendev.org/#/c/657168/1/deployment/nova/nova-api-container-puppet.yaml | 18:26 |
dansmith | melwitt: ah even better :) | 18:26 |
sean-k-mooney | dansmith: it is concurrent however | 18:27 |
sean-k-mooney | if i do | 18:27 |
sean-k-mooney | >>> def func(): | 18:27 |
sean-k-mooney | ... eventlet.greenthread.sleep(10) | 18:27 |
sean-k-mooney | ... global i | 18:27 |
sean-k-mooney | ... i+=1 | 18:27 |
sean-k-mooney | ... print(i) | 18:27 |
sean-k-mooney | ... | 18:27 |
sean-k-mooney | >>> i = 0 | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt = eventlet.greenthread.spawn(func) | 18:27 |
sean-k-mooney | >>> gt.wait() | 18:27 |
sean-k-mooney | it does not take 40 seconds to print | 18:27 |
dansmith | sean-k-mooney: because you're using eventlet's sleep right? | 18:27 |
sean-k-mooney | yes | 18:27 |
imacdonn | mnaser (if still awake?), dansmith, sean-k-mooney: log from the failure-to-reconnect case: http://paste.openstack.org/show/751486/ | 18:28 |
dansmith | that doesn't mean it's concurrent, that means it's warping time for you :) | 18:28 |
mnaser | Looks like conductor timing out | 18:31 |
mnaser | I’ve seen really weird behaviour that usually gets resolved by deleting every single queue in the vhost | 18:32 |
sean-k-mooney | dansmith: well in any case i think we now know why it's working if you don't monkeypatch | 18:33 |
dansmith | yup, as expected | 18:33 |
sean-k-mooney | whether it's concurrent or fully synchronous i don't know | 18:33 |
dansmith | sean-k-mooney: run this and tell me what you think :) https://pastebin.com/XHV9t9cM | 18:35 |
imacdonn | mnaser: this happens when the monkey patching is used in combination with a wsgi container ... and doesn't happen otherwise | 18:36 |
dansmith | sean-k-mooney: one single wait, but it clearly runs them in order, and blocking calls block everything | 18:36 |
melwitt | imacdonn: not sure what else to try other than a python2 devstack test maybe, to try and pinpoint what is the difference in your env vs the others. mnaser, are you running stein under py3 or py2? | 18:36 |
imacdonn | melwitt: yeah, thought about py2 devstack too | 18:37 |
sean-k-mooney | yes it took 53 seconds | 18:40 |
sean-k-mooney | but https://pastebin.com/QN5L7yDY took 5 | 18:41 |
sean-k-mooney | time.sleep will suspend the thread | 18:41 |
sean-k-mooney | eventlet.greenthread.sleep should yield to the next coroutine | 18:41 |
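A rough timing sketch of that difference (assuming eventlet, still unpatched): eventlet.sleep() yields to the hub so the greenthreads interleave, while time.sleep() blocks the whole process and the greenthreads run one after another:

    import time
    import eventlet

    def green_sleeper():
        eventlet.sleep(1)  # yields to the hub, other greenthreads get to run

    def blocking_sleeper():
        time.sleep(1)      # unpatched: blocks the process, nothing else runs

    for fn in (green_sleeper, blocking_sleeper):
        start = time.time()
        threads = [eventlet.spawn(fn) for _ in range(4)]
        for t in threads:
            t.wait()
        # roughly 1s for green_sleeper (interleaved), 4s for blocking_sleeper (serialized)
        print(fn.__name__, round(time.time() - start, 1))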
dansmith | sean-k-mooney: ah I see what you mean.. we will still get *interleaving* but not actual concurrency | 18:41 |
dansmith | sean-k-mooney: meaning we're not running multiple concurrent threads, but we will interleave their execution when they do stuff that calls into eventlet | 18:42 |
sean-k-mooney | we will get concurrency but not parallelism | 18:42 |
sean-k-mooney | yes | 18:42 |
dansmith | sean-k-mooney: the thing I think you're missing is that if we haven't monkeypatched, then all our IO is blocking and so the first thing will be like time.sleep() and not like eventlet.sleep() | 18:42 |
dansmith | first thing meaning first bit of IO | 18:43 |
sean-k-mooney | ya i was wondering if we could import eventlet.green or something | 18:44 |
sean-k-mooney | https://eventlet.net/doc/patching.html#import-green | 18:44 |
sean-k-mooney | or use a context manager to patch, or just the socket lib | 18:44 |
dansmith | I think that gets us further into the mixing paradigms state and confusion abounds, IMHO | 18:45 |
sean-k-mooney | ya perhaps | 18:45 |
sean-k-mooney | so we originally merged the new scatter gather stuff that used eventlet without monkey patching | 18:45 |
sean-k-mooney | then we noticed we were not monkeypatching and mdbooth fixed that | 18:46 |
dansmith | *for api running under wsgi | 18:46 |
sean-k-mooney | and we started to see these uwsgi errors | 18:46 |
sean-k-mooney | ya | 18:46 |
sean-k-mooney | so setting the env var would allow people to go back to the unmonkeypatched version under wsgi with no code change, but it's no longer parallel | 18:48 |
dansmith | exactly | 18:48 |
sean-k-mooney | although in that case uwsgi/apache would still provide its own concurrency | 18:48 |
dansmith | sean-k-mooney: between requests, but not between cells | 18:49 |
dansmith | cells would still be sequential | 18:49 |
sean-k-mooney | yes | 18:49 |
sean-k-mooney | i'm just thinking about what people can do as a workaround in the short term | 18:49 |
sean-k-mooney | e.g. what can people do that hit this in production | 18:50 |
dansmith | people being the 1.9 people we know that are having this problem right? :) | 18:50 |
*** trident has quit IRC | 18:51 | |
sean-k-mooney | well i think the only reason that we are not seeing this in osp is because the kolla containers use the console-script in the container instead of running under wsgi | 18:51 |
sean-k-mooney | so i think that is why we are not hitting it downstream | 18:51 |
sean-k-mooney | but i don't know what other deployment tools default to | 18:51 |
dansmith | well, we got the vexxhost data that they're reconnecting properly | 18:52 |
sean-k-mooney | with osa | 18:52 |
dansmith | so they're clearly seeing the same threading behavior, but not suffering | 18:52 |
dansmith | i.e. they're getting the best of both worlds | 18:52 |
*** trident has joined #openstack-nova | 18:52 | |
sean-k-mooney | perhaps it's because they are installing the nova api under uwsgi and not mod_wsgi | 18:59 |
sean-k-mooney | imacdonn: what are you running the nova-api with? | 18:59 |
imacdonn | sean-k-mooney: uWSGI ... I've also reproduced the heartbeats not happening with mod_wsgi, but I can't say if I've seen the failure to "establish a new connection and move on" in that case | 19:00 |
sean-k-mooney | we know that uwsgi and mod_wsgi handle the lifetime of their python interpreters differently, so module level variables are reinitialised in uwsgi but are not in mod_wsgi, because mod_wsgi reuses the interpreter when the app is reloaded and uwsgi does not | 19:01 |
sean-k-mooney | imacdonn: if the wsgi app dies and is restarted by uwsgi it will return to a clean state | 19:02 |
sean-k-mooney | in the mod_wsgi case it will reuse the interpreter and maybe that impacts the reconnection | 19:03 |
sean-k-mooney | that is a stretch however | 19:03 |
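For illustration only, a tiny WSGI app (hypothetical, not nova code) showing the kind of module-level state whose lifetime this depends on; whether REQUEST_COUNT survives an app reload is decided by how uwsgi or mod_wsgi manages the interpreter, not by the application itself:

    # module-level state: created when the module is first imported into an
    # interpreter and reused for every request that interpreter serves
    REQUEST_COUNT = 0

    def application(environ, start_response):
        global REQUEST_COUNT
        REQUEST_COUNT += 1
        body = ('%d requests seen by this interpreter\n' % REQUEST_COUNT).encode()
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [body]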
sean-k-mooney | we are seeing it not reconnect on mod_wsgi correct? | 19:03 |
imacdonn | I don't see any evidence of the app dying and being restarted | 19:04 |
melwitt | sean-k-mooney: we're not sure what the common denominator is for not reconnecting and 500ing, I think | 19:04 |
melwitt | I have a python2 devstack, going to check whether it reconnects | 19:05 |
sean-k-mooney | if it was, you would see nova print its config again and all the other start up stuff, so if you are not seeing that it's probably not happening | 19:05 |
sean-k-mooney | ok that is testing with uwsgi then | 19:05 |
imacdonn | yeah .. the 504 error seems to be caused by an RPC timeout following the reconnect ... don't yet know why I'm seeing that where others are not | 19:05 |
sean-k-mooney | do we have any periodic task that could be keeping it alive? | 19:06 |
imacdonn | I can't say I've reproduced the RPC timeout with mod_wsgi - was focused on the absence of heartbeats before | 19:06 |
dansmith | sean-k-mooney: if you read the backscroll, it has been reprod with both | 19:07 |
imacdonn | I'm not able to reproduce the RPC timeout with py2 devstack either | 19:07 |
melwitt | how do you get the connection reset by peer message to show up in nova-api log? just waiting doesn't seem to do it | 19:08 |
dansmith | request after a delay I think | 19:08 |
sean-k-mooney | i thought it was 2 different errors, one on each, but ok, in that case i'll stop looking to see what the difference is between how they both run | 19:08 |
imacdonn | I have an instance already running.. I make a call to get the VNC console URL, wait about 5 mins, then repeat the call to get the console URL | 19:08 |
sean-k-mooney | that should not need to hit rabbit | 19:09 |
sean-k-mooney | that should just hit the db directly no? | 19:09 |
dansmith | sean-k-mooney: eh | 19:09 |
sean-k-mooney | to get the url | 19:10 |
dansmith | no, it calls to the driver | 19:10 |
sean-k-mooney | to get the console it would have to do an rpc | 19:10 |
sean-k-mooney | oh ok, i assumed we would have stored the console url in the db | 19:10 |
*** imacdonn has quit IRC | 19:12 | |
melwitt | it is stored there afterward (base url and token hash, etc) but the driver hands out the connection details initially | 19:13 |
*** imacdonn has joined #openstack-nova | 19:14 | |
melwitt | ok yeah I see now the "Unexpected error during heartbeart thread processing, retrying...: error: [Errno 104] Connection reset by peer" happens on the second request after waiting some minutes. but the request works fine | 19:14 |
melwitt | (with py2 and uwsgi) | 19:15 |
sean-k-mooney | ok but this is what get console is actually doing right https://github.com/openstack/nova/blob/master/nova/console/api.py#L27-L29 | 19:15 |
sean-k-mooney | or rather openstack console url show | 19:17 |
melwitt | I think that's probably the xen-specific service nova-console stephenfin is looking to deprecate | 19:17 |
melwitt | this is get_vnc_console https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3876 | 19:17 |
sean-k-mooney | that is what's called from here https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/consoles.py#L48 | 19:17 |
sean-k-mooney | oh ok, so that one does a downcall to the compute node | 19:18 |
melwitt | I think that's the xen thing that no one except old rax cluster uses | 19:18 |
dansmith | this is one reason I suggested using a different call | 19:19 |
dansmith | because console has all manner of stuff wrapped up in it | 19:19 |
dansmith | I think pause/unpause is a call right? use that | 19:19 |
dansmith | or just boot/delete | 19:19 |
imacdonn | looking at my py2 devstack, it seems the connection reset is observed at a different place | 19:19 |
sean-k-mooney | ya or the server diagnostics | 19:19 |
dansmith | well, boot isn't a call from the api I guess, so pause | 19:19 |
dansmith | sean-k-mooney: or that | 19:19 |
sean-k-mooney | ... of course they are only in the nova client... | 19:20 |
melwitt | imacdonn: in my py2 devstack, I see the connection reset in nova-api log (for get_vnc_console) | 19:20 |
sean-k-mooney | nova diagnostics <server> is definitely a down call to the compute, but nova get-vnc-console should be too | 19:22 |
*** ttsiouts has joined #openstack-nova | 19:22 | |
*** nicolasbock has joined #openstack-nova | 19:22 | |
*** imacdonn has quit IRC | 19:25 | |
*** jmlowe has quit IRC | 19:29 | |
*** xek__ has quit IRC | 19:32 | |
*** xek__ has joined #openstack-nova | 19:32 | |
*** imacdonn has joined #openstack-nova | 19:42 | |
sean-k-mooney | efried: just finished reviewing https://review.opendev.org/#/q/topic:bug/1809095+(status:open+OR+status:merged) | 19:43 |
sean-k-mooney | efried: i'm provisionally +1 on both; i have one question on the second patch but it's an edge case | 19:43 |
*** jmlowe has joined #openstack-nova | 19:44 | |
*** imacdonn_ has joined #openstack-nova | 19:46 | |
*** maciejjozefczyk has quit IRC | 19:49 | |
*** imacdonn has quit IRC | 19:49 | |
*** pcaruana has quit IRC | 19:52 | |
*** ttsiouts has quit IRC | 19:54 | |
imacdonn_ | sean-k-mooney: vnc console does rpc .. you can see it in the trace at http://paste.openstack.org/show/751486/ | 19:55 |
imacdonn_ | (the trace for the RPC timeout) | 19:55 |
sean-k-mooney | those calls go to the compute node? ya | 19:56 |
imacdonn_ | yes | 19:56 |
sean-k-mooney | but it was not clear what api endpoint osc was calling for openstack console url show | 19:56 |
sean-k-mooney | if you are using the nova client's get-vnc-console that was definitely downcalling | 19:57 |
sean-k-mooney | osc is doing some magic to figure out what console type the vm is using | 19:57 |
sean-k-mooney | as it has a single command that works for all console types | 19:57 |
sean-k-mooney | so i was not sure if the osc command would actually hit the db or the compute node | 19:58 |
imacdonn_ | yeah, def does ... that stack trace resulted from 'openstack console url show ...' | 19:59 |
sean-k-mooney | it's kind of strange | 20:00 |
sean-k-mooney | we can see the heartbeat_tick printed | 20:00 |
imacdonn_ | yeah, but only once | 20:01 |
imacdonn_ | it should be periodic from that point on | 20:01 |
sean-k-mooney | no, it's there several times | 20:01 |
sean-k-mooney | look at lines 140 to 153 | 20:01 |
imacdonn_ | for the same thread? | 20:01 |
imacdonn_ | in any case, in the non-patched case, it's continuous .. in the patched case, it goes quiet | 20:01 |
*** ccamacho has quit IRC | 20:02 | |
imacdonn_ | those are all within the same second | 20:03 |
imacdonn_ | I *may* have just found what's different about my env vs. others and devstack | 20:04 |
*** whoami-rajat has quit IRC | 20:05 | |
*** imacdonn_ is now known as imacdonn | 20:07 | |
sean-k-mooney | have you adjusted any of the messaging timeouts? | 20:07 |
imacdonn | no, but in my uwsgi config, had "threads = 8" ... I've removed that now, and it seems that the reconnect is working | 20:09 |
sean-k-mooney | this is the osa one https://opendev.org/openstack/openstack-ansible-os_nova/src/branch/master/templates/nova-uwsgi.ini.j2 | 20:10 |
imacdonn | I'm thinking (loosely) that maybe in the threads=X case, the thread making the RPC call has fired before the thread handling heartbeats has noticed the connection failure and reestablished | 20:10 |
sean-k-mooney | which defaults to 1 | 20:10 |
imacdonn | yeah. devstack doesn't set it at all | 20:11 |
imacdonn | I was thinking to figure out what uwsgi does by default .. guessing 1 | 20:11 |
imacdonn | I was *trying.... | 20:11 |
sean-k-mooney | i'm looking for that too now :) | 20:11 |
efried | sean-k-mooney: thanks for reviewing. | 20:20 |
sean-k-mooney | imacdonn: finally https://github.com/unbit/uwsgi/blob/3149df02ed443131c54ea6afb29fcbb0ed4d1139/core/init.c#L96 it defaults to 1 | 20:21 |
sean-k-mooney | imacdonn: the default is not in the docs | 20:22 |
imacdonn | hah, thanks | 20:23 |
sean-k-mooney | for c code uwsgi is actually not too hard to read, but finding stuff is a pain | 20:23 |
sean-k-mooney | imacdonn: so i think we know why vexxhost are fine | 20:24 |
sean-k-mooney | osa explicitly sets it to 1 | 20:24 |
imacdonn | yeah, I think so ... the reconnect seems to be consistently working since I've changed that | 20:25 |
imacdonn | so now I wonder if I'm giving anything up by doing that, in terms of performance .... haven't really found any good guidance on processes/threads sizing for these things | 20:26 |
sean-k-mooney | well osa is running with 64 processes and 1 thread per process i think | 20:26 |
sean-k-mooney | sorry a max of 16 processes | 20:27 |
sean-k-mooney | each with 1 thread | 20:27 |
dansmith | is this threads per worker process? | 20:29 |
dansmith | if so, it could be that multiple threads per worker try to share a single connection pool, | 20:29 |
sean-k-mooney | yes i believe so | 20:29 |
dansmith | but maybe one of them gets responsibility for the eventlet heartbeat thread, | 20:29 |
dansmith | so if that thread isn't the first scheduled after a period of inactivity, then heartbeat isn't run, we try to use the connection and are stuck? | 20:30 |
dansmith | something like that? | 20:30 |
imacdonn | yes, "something like that" is what it feels like | 20:30 |
sean-k-mooney | well i'm not sure, if we have processes=2 threads=2, whether that will result in 2 or 4 python interpreters | 20:30 |
sean-k-mooney | e.g. are the threads os threads or python threads | 20:31 |
melwitt | so heartbeat threads is the one that reconnects things if detected to not be open | 20:31 |
dansmith | I think that would mean two interpreters, two threads each | 20:31 |
melwitt | *thread | 20:31 |
dansmith | melwitt: no, | 20:31 |
sean-k-mooney | right so the gil would mean that only one of the two threads would actually be running in each python interpreter | 20:31 |
dansmith | I think any of them would do it, but running or not running the heartbeat but definitely running the other threads is where I can see it getting wonky | 20:32 |
*** tbachman has quit IRC | 20:32 | |
sean-k-mooney | so maybe it is starving the heartbeat thread as you suggest | 20:32 |
melwitt | I guess I'm confused. so far, I thought the heart beat thread doesn't affect anything other than to log a message saying "connection reset" if it's been idle > 60s. how does its starvation prevent reconnection? | 20:34 |
dansmith | no, I think the heartbeat thread is just the first thing scheduled after a period of inactivity, so it's the first thing that hits the dead connection | 20:36 |
imacdonn | I don't think reconnection is being prevented .. you can see it happen in the log at http://paste.openstack.org/show/751486/ ... but I think it might matter *when* it happens | 20:36 |
imacdonn | (line 227 in the log shows reconnected) | 20:37 |
melwitt | oh yeah, I see. so why MessagingTimeout? sorry, I just don't get this | 20:38 |
imacdonn | maybe the RPC request got sent to a dead connection, somehow?? | 20:39 |
*** dave-mccowan has joined #openstack-nova | 20:39 | |
sean-k-mooney | the "best" respocne in https://www.reddit.com/r/Python/comments/4s40ge/understanding_uwsgi_threads_processes_and_gil/ seam to gel with how i tought this would be working | 20:39 |
sean-k-mooney | if we are initalising the connection pool as a module vaiable we are sharing it between the threads and all the therads are being run sequtually under the gil | 20:40 |
sean-k-mooney | so with threads=8 you had 8 instance of the api all sharing the same connection pool | 20:40 |
sean-k-mooney | there is a follow futher down where there is a code wakthorugh which explaing the init flow | 20:43 |
*** mchlumsky has quit IRC | 20:47 | |
artom | efried, you added me to that review but I think you'll get through them before I get a chance to even start | 20:50 |
artom | We're all putting out CI fires with our upcoming release | 20:51 |
artom | (Maybe I shouldn't say that in public?) | 20:51 |
efried | artom: no worries, I noticed you were on the one above it, which is the important one, but that one's getting a -1 from me in a few minutes here, so you got time :) | 20:51 |
artom | https://photos.app.goo.gl/fRQoJpE5WnqYeP6E8 was my desk not long ago. Yes, I needed both those drinks to get through the bugs | 20:52 |
artom | ... and I need to go pick kids up | 20:53 |
artom | o/ | 20:53 |
sean-k-mooney | dansmith: imacdonn so i think this is the issue https://github.com/openstack/nova/blob/760fc2de32057817bafc0a1512a735dab1c7c68a/nova/rpc.py#L69 | 20:54 |
sean-k-mooney | we initialize the transport as a module level variable, so that is shared across all wsgi app instances in the same process | 20:55 |
sean-k-mooney | that end up calling https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/transport.py#L192 | 20:55 |
dansmith | sean-k-mooney: well, that's the global thing that does the connectioning to rabbit, yeah | 20:55 |
sean-k-mooney | which calls the driver manager to construct the rabbit driver instance, which has a connection pool internally | 20:56 |
dansmith | yup | 20:56 |
sean-k-mooney | so the connection pool becomes global too | 20:56 |
dansmith | right | 20:56 |
*** tbachman has joined #openstack-nova | 20:57 | |
dansmith | sean-k-mooney: this is only part of the equation by the way | 20:57 |
sean-k-mooney | and since we have multiple python threads and the gil limits their execution, we could starve the heartbeat | 20:57 |
dansmith | sean-k-mooney: because I think that TRANSPORT is what we use for initializing a connection which we would use if we were going to start a server and listen | 20:57 |
*** artom has quit IRC | 20:57 | |
dansmith | sean-k-mooney: but connections to the *cell* rpc brokers happens in nova.context, because we connect per cell once we read that out | 20:57 |
dansmith | sean-k-mooney: it's effectively the same situation though, I imagine they both suffer from the same behavior | 20:58 |
sean-k-mooney | ripping out those globals would probably be a pain | 20:59 |
*** trident has quit IRC | 20:59 | |
dansmith | sean-k-mooney: https://github.com/openstack/nova/blob/760fc2de32057817bafc0a1512a735dab1c7c68a/nova/context.py#L376 | 20:59 |
dansmith | sean-k-mooney: having implemented multi-cell tests, I can assure you it is | 20:59 |
*** cmart has joined #openstack-nova | 21:00 | |
*** trident has joined #openstack-nova | 21:00 | |
dansmith | but also, it's the thing that avoids us having to build that stuff up a hundred times, even for a single request where we make a couple rpc calls | 21:00 |
sean-k-mooney | and yep we are storing the connection in a module level CELL_CACHE | 21:00 |
dansmith | right | 21:00 |
sean-k-mooney | so yep it's the same issue | 21:00 |
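A stripped-down sketch of the pattern being pointed at here (hypothetical names, not the actual nova code): a module-level transport plus a per-cell cache, created once and then shared by every request thread running in the same process:

    import threading

    _TRANSPORT = None   # stands in for the module-level transport global
    _CELL_CACHE = {}    # stands in for the per-cell connection cache
    _LOCK = threading.Lock()

    def get_transport(factory):
        """Create the process-wide transport on first use, then reuse it."""
        global _TRANSPORT
        with _LOCK:
            if _TRANSPORT is None:
                _TRANSPORT = factory()
            return _TRANSPORT

    def get_cell_connection(cell_uuid, factory):
        """Cache one connection per cell; every thread shares these objects."""
        with _LOCK:
            if cell_uuid not in _CELL_CACHE:
                _CELL_CACHE[cell_uuid] = factory()
            return _CELL_CACHE[cell_uuid]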
melwitt | so if the heartbeat thread gets starved, then connections get closed down, and then if reconnect in another thread races with an API request coming in and the API request hits before the reconnect, it will get the MessagingTimeout? I'm looking through the oslo.messaging code and from what I can tell, the reconnect is done on-the-fly when a message is attempted to be sent, so I don't understand how a different thread is doing the | 21:01 |
melwitt | reconnect | 21:01 |
*** ttsiouts has joined #openstack-nova | 21:01 | |
cmart | Hello -- can someone help me understand the process by which ephemeral disks get a filesystem label? It's usually `ephemeral0` but I have an instance created today which has `ephemeral0_160_40` | 21:02 |
melwitt | but if that were true that oslo.messaging automatically tries to reconnect, then the API request shouldn't fail, it would try to reconnect. so, nevermind, I don't know what I'm talking about | 21:02 |
sean-k-mooney | melwitt: well that's probably the problem, it likely is not auto reconnecting on making a request | 21:02 |
sean-k-mooney | we probably have a periodic task or something that is doing the reconnection | 21:03 |
sean-k-mooney | that is just a guess | 21:03 |
sean-k-mooney | i haven't looked at this that closely in the past | 21:03 |
melwitt | yeah. I just can't tell what triggers a reconnect. in https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py I see a bunch of stuff where "recoverable error" will try to reconnect | 21:03 |
sean-k-mooney | are we using the amqp driver or the rabbitmq one | 21:04 |
melwitt | sorry, wrong link https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py | 21:04 |
melwitt | that other link was me trying to find where/how it raises MessagingTimeout | 21:04 |
*** tbachman has quit IRC | 21:05 | |
sean-k-mooney | hum i wonder if setting https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L145-L149 to 0 would help? | 21:05 |
dansmith | sean-k-mooney: that was mentioned earlier | 21:06 |
sean-k-mooney | oh ok, i thought you were asking if there was a way to disable it earlier | 21:06 |
melwitt | that is, you think it might allow threads=8 config to reconnect? | 21:06 |
dansmith | I said it would be good to test with it, *but* I think the downside of it is that the rabbit server side won't terminate the connection because of lack of heartbeats which means more likelihood of just having stale connections that have been forgotten by the conntracks in between | 21:06 |
sean-k-mooney | i think it might not close the connection | 21:06 |
melwitt | oh ok | 21:07 |
sean-k-mooney | e.g. it would only close if the tcp socket closed | 21:07 |
melwitt | yeah, I see | 21:07 |
imacdonn | pfft, it's not like rabbitmq-server is ever going to run out of connection............... | 21:07 |
dansmith | problem is, stateful firewalls in between client and server may do that without you noticing which will require some delay before you realize the socket is dead | 21:07 |
sean-k-mooney | right, which is probably why the heartbeat is there in the first place | 21:08 |
melwitt | yeah | 21:08 |
imacdonn | back in the icehouse days, I had lots of problems trying to do AMQP through a firewall, due to lack of heartbeat support | 21:08 |
dansmith | right and why I think it's fine to let it be there, and if we wake up within a minute, we're good and if not, then we've got an immediate RST waiting for us so we know to reconnect | 21:08 |
openstackgerrit | Matt Riedemann proposed openstack/nova master: Restore connection_info after live migration rollback https://review.opendev.org/551349 | 21:09 |
sean-k-mooney | melwitt: this is where the reconnect happens i think | 21:12 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L897-L928 | 21:12 |
sean-k-mooney | we call self.ensure_connection() if there was an exception when we called self._heartbeat_check() | 21:12 |
sean-k-mooney | which is a thread we initialize in the driver, which is a global, so there is again only 1 per process | 21:13 |
sean-k-mooney | so with threads=8 there is a 1 in 9 chance that it will get scheduled to run | 21:14 |
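Roughly what that single per-process heartbeat worker amounts to, as a sketch built on kombu's public connection API rather than the oslo.messaging source: one loop that periodically checks the broker connection and re-establishes it when the check fails. If this one greenthread never gets scheduled, nothing notices the dead socket until a publish trips over it.

    import eventlet

    def heartbeat_loop(connection, interval=15):
        # one of these per process, shared by every API thread in that process
        while True:
            try:
                connection.heartbeat_check()    # raises if the broker connection died
            except Exception:
                connection.ensure_connection()  # reconnect and carry on
            eventlet.sleep(interval)

    # started once when the driver is created, e.g.:
    # eventlet.spawn(heartbeat_loop, some_kombu_connection)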
melwitt | so since it does the reconnecting, if it doesn't run first, other thread will hit a closed connection and get MessagingTimeout | 21:17 |
*** slaweq has quit IRC | 21:18 | |
sean-k-mooney | well the messaging timeout comes from the fact that we did a request and then stopped listening because the connection closed | 21:18 |
sean-k-mooney | and we did not reconnect before the timeout expired | 21:18 |
sean-k-mooney | so the send succeeded | 21:18 |
sean-k-mooney | then it closed and they never got the response | 21:18 |
melwitt | ok | 21:18 |
sean-k-mooney | which likely means when we reconnect it will be sitting on the queue | 21:19 |
sean-k-mooney | things like the instance list would likely be fine in that case | 21:21 |
sean-k-mooney | since the api is hitting all the cell dbs but not doing an rpc right | 21:21 |
sean-k-mooney | but things like get_vnc_console would fail | 21:21 |
efried | cmart: what does your [DEFAULT]virt_mkfs option look like? | 21:21 |
sean-k-mooney | things like boot would likely be fine too since that's fire and forget | 21:23 |
imacdonn | so now I'm wondering if the connection reestablishment is working "by accident" with threads=1, or if I can depend on it........ | 21:23 |
cmart | efried I haven't defined it (on Queens, Ubuntu 16.04) so probably the default | 21:23 |
sean-k-mooney | well with threads=1 there is a 50% chance the connection thread will run | 21:24 |
sean-k-mooney | since there is the main thread + the heartbeat thread | 21:24 |
melwitt | :\ | 21:26 |
sean-k-mooney | i did not say it would definitely work :P | 21:27 |
dansmith | sean-k-mooney: there's only one thread, | 21:27 |
sean-k-mooney | but its better then 1 in 9 | 21:27 |
imacdonn | so if I set threads=0, will it be 100% ? :) | 21:27 |
melwitt | lol | 21:27 |
sean-k-mooney | dansmith: the rabbitmq driver spawns a thread to run the heartbeat check | 21:28 |
dansmith | and I think that if we're monkeypatched, any io we do will always guarantee that we run the waiting greenthreads, so we'd consume the dead connection with the heartbeat right? | 21:28 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/9.5.0/oslo_messaging/_drivers/impl_rabbit.py#L884 | 21:28 |
dansmith | sean-k-mooney: green thread | 21:28 |
sean-k-mooney | ah you're right, if it's monkey patched then it's a greenthread | 21:28 |
dansmith | right, | 21:29 |
dansmith | so we'll always run that stuff on first io | 21:29 |
sean-k-mooney | i guess it depends on how many io operations were pending or had yielded | 21:29 |
dansmith | I think if it was 50% we'd have heard about it | 21:29 |
dansmith | howso? | 21:29 |
efried | cmart: It looks like that name is being assigned in this neighborhood: https://opendev.org/openstack/nova/src/branch/stable/queens/nova/virt/libvirt/driver.py#L3510 | 21:30 |
sean-k-mooney | well in the scatter gather case at least we are queueing a bunch of green threads to execute then waiting | 21:30 |
dansmith | the heartbeat gthread is always schedulable work on any io switch if we've not run it in t/2 or whatever it's period is | 21:30 |
dansmith | sean-k-mooney: right but on the first io that we do, to even start that, we'll run all the greenthreads pending, which is at least the heartbeat gthread, before we do much of anything else | 21:31 |
efried | cmart: that line or L3533 | 21:31 |
sean-k-mooney | ya we should | 21:32 |
*** hongbin has joined #openstack-nova | 21:32 | |
cmart | efried thank you for the orientation | 21:33 |
*** xek has joined #openstack-nova | 21:33 | |
efried | cmart: Good luck :) | 21:33 |
*** ttsiouts has quit IRC | 21:33 | |
*** xek__ has quit IRC | 21:35 | |
openstackgerrit | Merged openstack/nova master: Remove fake_libvirt_utils from connection tests. https://review.opendev.org/642557 | 21:38 |
imacdonn | I wonder if part of the issue is that the AMQP message is not being resent in the case where the connection reset is detected during publish, vs. during heartbeat | 21:39 |
melwitt | I wonder that too | 21:40 |
*** JamesBenson has quit IRC | 21:40 | |
dansmith | imacdonn: meaning if we hit it on publish, we reconnect, but didn't actually send, think we did, and wait for it until timeout? | 21:40 |
dansmith | if so that would be a giant o.msg bug | 21:40 |
imacdonn | yeah, smth like that | 21:41 |
melwitt | but I think we know it won't reconnect on publish right | 21:41 |
melwitt | otherwise the 500 wouldn't happen | 21:41 |
dansmith | no, | 21:41 |
dansmith | the 500 is from timeout right? | 21:41 |
dansmith | which means it's waiting for a reply | 21:42 |
melwitt | yeah | 21:42 |
imacdonn | 504 from timeout, yeah | 21:42 |
imacdonn | look at http://paste.openstack.org/show/751486/ - the reset follows the publish failure, not a heartbeat failure | 21:42 |
dansmith | which can happen if we effectively connect to rabbit, subscribe to a reply queue, but never sent the thing and thus time out waiting for a reply to nothing | 21:42 |
imacdonn | the reconnect, I mean | 21:42 |
sean-k-mooney | this is meant to handle the resend i think https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/impl_rabbit.py#L1088-L1101 | 21:43 |
*** JamesBenson has joined #openstack-nova | 21:43 | |
sean-k-mooney | for a direct send we do not pass retry | 21:43 |
imacdonn | I don't see any "Failed to publish message to topic" logged | 21:44 |
melwitt | yeah that's what confuses me, if publish does a reconnect, why are we left hanging for a timeout? unless we never send a thing like you both mentioned | 21:44 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/impl_rabbit.py#L1204-L1213 | 21:44 |
*** JamesBen_ has joined #openstack-nova | 21:44 | |
dansmith | melwitt: it would be if the act of sending triggered the "huh, this is dead, lemme reconnect" but was not followed by a "okay, now that I have reconnected, let me send that thing I had" | 21:45 |
melwitt | aye | 21:45 |
dansmith | because socket.write() is where you find out it's dead | 21:45 |
dansmith | which is how heartbeat is finding out | 21:45 |
dansmith | it gets scheduled, and does a "shit I'm late.. socket.write(heartbeat mofo!)" | 21:45 |
melwitt | right | 21:46 |
*** JamesBenson has quit IRC | 21:47 | |
*** xek_ has joined #openstack-nova | 21:47 | |
*** xek has quit IRC | 21:47 | |
*** JamesBen_ has quit IRC | 21:49 | |
melwitt | sean-k-mooney: yeah, I don't know the difference between direct_send and topic_send. though topic_send sounds more like what's probably being used | 21:51 |
melwitt | (based on the fact that our message queues all have topics, I would think that means those are topic_send, but I don't actually know) | 21:52 |
*** mriedem has quit IRC | 21:53 | |
sean-k-mooney | so we are invoking the get_vnc_console as a "call" | 21:53 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/rpc/client.py#L157-L182 | 21:54 |
sean-k-mooney | which i'm still tracing to see if it's a topic send or a direct send | 21:54 |
sean-k-mooney | that is where we do the actual call on the nova side https://github.com/openstack/nova/blob/eae1f2257a9ee6e851182bf949568b1cfe2af763/nova/compute/rpcapi.py#L635-L640 | 21:55 |
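For context, a hedged sketch (not nova's code, and the method names are made up) of the oslo.messaging client pattern in play here, assuming cfg.CONF already carries a valid transport_url: call() blocks on a reply queue and raises MessagingTimeout after rpc_response_timeout (60s by default), while cast() is fire-and-forget:

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='compute', version='5.0')
    client = oslo_messaging.RPCClient(transport, target)

    cctxt = client.prepare(server='some-compute-host', timeout=60)
    try:
        # waits for the compute service to reply; this is where the 504s come from
        cctxt.call({}, 'some_method_that_expects_a_reply')
    except oslo_messaging.MessagingTimeout:
        print('no reply within 60s')

    # a cast never waits, so a message lost to a dead connection goes unnoticed
    cctxt.cast({}, 'some_fire_and_forget_method')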
*** xek has joined #openstack-nova | 21:56 | |
sean-k-mooney | im getting flashbacks to the openstack bootstrap podcast looking at this code | 21:56 |
*** xek_ has quit IRC | 21:56 | |
*** xek has quit IRC | 21:58 | |
*** xek has joined #openstack-nova | 21:59 | |
sean-k-mooney | ok interesting, so | 22:01 |
sean-k-mooney | _send_reply uses a direct send | 22:01 |
sean-k-mooney | and that is used for this heartbeat https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L174 | 22:02 |
sean-k-mooney | but this is the send function that matters https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L581 | 22:03 |
sean-k-mooney | ok so ya, it's a topic_send https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/_drivers/amqpdriver.py#L628 | 22:07 |
sean-k-mooney | which means the retry behavior is controlled by the context object | 22:08 |
*** mrch_ has joined #openstack-nova | 22:25 | |
openstackgerrit | Eric Fried proposed openstack/nova master: DNM: Revert "Use external placement in functional tests" https://review.opendev.org/659680 | 22:26 |
sean-k-mooney | ok dumb question, but where does the context object we use to send the rpc call come from | 22:29 |
sean-k-mooney | context = req.environ['nova.context'] | 22:31 |
sean-k-mooney | so what sets that... | 22:31 |
sean-k-mooney | this https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/api/auth.py#L95-L105 ? | 22:32 |
sean-k-mooney | no that has to be the wrong context, that's the keystone auth context | 22:36 |
*** hongbin has quit IRC | 22:39 | |
sean-k-mooney | ok i give up, i can't tell if we actually set retry to true | 22:39 |
imacdonn | I suppose I could add debug logging to find out | 22:40 |
sean-k-mooney | what i can tell is we don't set it explicitly when we call prepare, in which case retry is checked in the context | 22:40 |
melwitt | I think the keystone middleware does that https://docs.openstack.org/keystonemiddleware/latest/middlewarearchitecture.html | 22:41 |
sean-k-mooney | i can see the keystone context being constructed here | 22:42 |
sean-k-mooney | https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/api/auth.py#L95-L106 | 22:42 |
sean-k-mooney | and i think that is the context that is passed here as ctxt https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0299/nova/compute/rpcapi.py#L593-L598 | 22:42 |
openstackgerrit | Eric Fried proposed openstack/nova stable/stein: DNM: Revert "Use external placement in functional tests" https://review.opendev.org/659682 | 22:42 |
melwitt | yeah, and it's the keystone middleware that sets the headers that are used to construct the context you show | 22:43 |
sean-k-mooney | but when we construct the rpc client we don't set retry to true as far as i can tell | 22:44 |
*** mlavalle has quit IRC | 22:44 | |
openstackgerrit | Eric Fried proposed openstack/nova stable/stein: Skip _exclude_nested_providers() if not nested https://review.opendev.org/659206 | 22:45 |
*** UdayTKumar has quit IRC | 22:48 | |
melwitt | sean-k-mooney: looks like it comes from here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/rpc/client.py#L342-L343 | 22:49 |
melwitt | (maybe, I didn't trace it) | 22:49 |
sean-k-mooney | ok i found that like 30 mins ago but i misinterpreted the default of None as don't retry | 22:50 |
melwitt | so transport._send() will be done with forever retry | 22:50 |
sean-k-mooney | ok so it should retry forever | 22:50 |
sean-k-mooney | ya | 22:50 |
sean-k-mooney | i should have read the comment | 22:51 |
sean-k-mooney | ok so this https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L666 is where the retry is implemented | 22:53 |
*** tkajinam has joined #openstack-nova | 22:55 | |
*** macza has quit IRC | 22:57 | |
sean-k-mooney | ok so i have traced that fully through nova into oslo.messaging, and unless there is a bug in the auto retry support in the kombu lib (which i doubt) the issue is not that we think we sent the message on publish | 23:00 |
imacdonn | seems that that retries ensuring that a connection exists, but doesn't retry sending a message ? | 23:00 |
sean-k-mooney | it does, it delegates it to kombu | 23:00 |
melwitt | yeah, I saw that too, just passes everything (including the message) to kombu's autoretry method | 23:01 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L745-L752 | 23:01 |
melwitt | so you would think after reconnecting, it would send the message | 23:01 |
melwitt | I guess I'm wondering now if what's in these callbacks are what controls how it behaves, whether it would send the message after reconnecting? | 23:03 |
*** xek has quit IRC | 23:03 | |
sean-k-mooney | i assume that is what on_revive=on_reconnection is doing? | 23:04 |
imacdonn | I'm missing something .. where is the message passed to the autoretry method ? | 23:04 |
*** luksky has quit IRC | 23:04 | |
*** rcernin has joined #openstack-nova | 23:04 | |
sean-k-mooney | the execute_method is a functools.partial thing | 23:05 |
*** _erlon_ has quit IRC | 23:05 | |
*** slaweq has joined #openstack-nova | 23:05 | |
melwitt | imacdonn: here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1098-L1101 | 23:05 |
sean-k-mooney | https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1098 | 23:05 |
sean-k-mooney | so the execute_method has the msg and all the args curried in the partial callable | 23:06 |
*** ttsiouts has joined #openstack-nova | 23:06 | |
melwitt | yeah, so the on_reconnection method just reconfigures some things | 23:07 |
sean-k-mooney | yes, but that is called when the connection is revived; the auto retry i think would retry? | 23:07 |
melwitt | unfortunately it looks like we're going to have to look at what autoretry source does | 23:07 |
openstackgerrit | Merged openstack/nova master: Test macvtap port with resource request https://review.opendev.org/641309 | 23:08 |
sean-k-mooney | unless you think we are ending up here https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L760-L771 | 23:08 |
melwitt | you would think so but symptom suggests it's not | 23:08 |
sean-k-mooney | well we set the retry limit to float("inf") | 23:08 |
sean-k-mooney | so it could take a while to hit that :) | 23:09 |
melwitt | no, I don't think we're ending up there (we haven't seen that log) | 23:09 |
*** slaweq has quit IRC | 23:10 | |
melwitt | we see a reconnect message, and then we see a timeout fire some milliseconds later http://paste.openstack.org/show/751486/ | 23:10 |
*** ttsiouts has quit IRC | 23:11 | |
imacdonn | the timeout is one minute later | 23:11 |
*** slaweq has joined #openstack-nova | 23:11 | |
melwitt | oh guh, sorry | 23:11 |
imacdonn | ;) | 23:11 |
melwitt | I see now | 23:11 |
*** cmart has quit IRC | 23:15 | |
*** slaweq has quit IRC | 23:16 | |
sean-k-mooney | that's being raised from https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1125 from inside the partial function in the auto retry thing | 23:17 |
sean-k-mooney | actually it's from the publish line, but i'm looking at master on github | 23:18 |
*** JamesBenson has joined #openstack-nova | 23:18 | |
sean-k-mooney | so ya after the first trace which is the publish failing we see the reconnect | 23:19 |
sean-k-mooney | and then i think the timeout for sending that message has expired after we reconnect | 23:20 |
sean-k-mooney | we started to send it at 2019-05-16 18:24:09.089 | 23:20 |
*** dklyle has joined #openstack-nova | 23:21 | |
sean-k-mooney | but that doesn't make sense | 23:22 |
sean-k-mooney | the connection is reestablished at 2019-05-16 18:24:10.109 | 23:22 |
sean-k-mooney | and then there are no log messages until the timeout 1 minute later | 23:22 |
melwitt | right | 23:22 |
sean-k-mooney | but the rpc timeout should be longer than 60 seconds right? | 23:23 |
melwitt | I think it probably defaults to 60s | 23:23 |
*** JamesBenson has quit IRC | 23:23 | |
melwitt | I can't remember where it is, lemme see | 23:23 |
imacdonn | 2019-05-16 18:16:05.973 23144 DEBUG nova.api.openstack.wsgi_app [-] long_rpc_timeout = 1800 log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:2571 | 23:23 |
imacdonn | wait that's not it | 23:24 |
melwitt | default to 60 https://github.com/openstack/oslo.messaging/blob/7102eddb1fd799d949ff5dca8aa2637794bb8a43/oslo_messaging/rpc/client.py#L41 | 23:24 |
sean-k-mooney | the long_rpc_timeout is not used for every rpc | 23:24 |
melwitt | rpc_response_timeout is the one | 23:24 |
sean-k-mooney | so this is what kombu is doing https://github.com/celery/kombu/blob/master/kombu/connection.py#L500-L547 | 23:26 |
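For reference, the kombu-level mechanism oslo.messaging is leaning on looks roughly like this (a sketch assuming a local broker; Connection.ensure() is the documented kombu API that revives a dropped connection and re-runs the wrapped callable, i.e. re-sends the message, on recoverable errors):

    import kombu

    conn = kombu.Connection('amqp://guest:guest@localhost:5672//')
    producer = kombu.Producer(conn)

    def errback(exc, interval):
        print('publish failed (%s), retrying in %s seconds' % (exc, interval))

    # ensure() returns a wrapper that reconnects and calls producer.publish
    # again on recoverable connection errors, up to max_retries times
    safe_publish = conn.ensure(producer, producer.publish,
                               errback=errback, max_retries=3)
    safe_publish({'ping': 'pong'}, routing_key='some_queue')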
openstackgerrit | Merged openstack/nova master: Fix some issues with the newton release notes https://review.opendev.org/659623 | 23:33 |
imacdonn | attempting repro with kombu debug logging | 23:34 |
*** dklyle has quit IRC | 23:35 | |
melwitt | good call, because looking at the code it looks like it should be doing the right thing. no clue what's happening | 23:38 |
imacdonn | weird .. not getting the connection resets now | 23:42 |
sean-k-mooney | does it not take time for it to happen | 23:42 |
sean-k-mooney | e.g a few minutes | 23:42 |
melwitt | did you remember to set threads=8? | 23:42 |
imacdonn | yeah, I waited 5 mins the first try, then a little over 3 mins second try | 23:43 |
imacdonn | yes, threads=8, but I should see the connection reset either way | 23:43 |
melwitt | true | 23:43 |
sean-k-mooney | ok it's technically friday here so i'm going to have pizza and then sleep. if ye find anything let me know, but for now i guess we should update the bug and tell people to run with threads=1 | 23:45 |
sean-k-mooney | or maybe with debug logging... | 23:45 |
*** artom has joined #openstack-nova | 23:45 | |
melwitt | pizza, sounds like a solid plan | 23:45 |
imacdonn | Thanks, sean-k-mooney - pizza/sleep well! | 23:46 |
sean-k-mooney | i actually had a c++ bug where debug builds did not have the bug, because it was an uninitialised variable and in debug mode the compiler was "kind" enough to initialise it to 0 | 23:46 |
imacdonn | ahh, got the reset now ... paste in a few mins | 23:48 |
imacdonn | http://paste.openstack.org/show/AZQET5IOgntK785RHtxh/ | 23:51 |
melwitt | weird, it says basic_publish_confirm but then one minute later timeout happens | 23:57 |
melwitt | so confused. | 23:58 |