Friday, 2026-01-30

*** mhen_ is now known as mhen		02:37
opendevreview	Ghanshyam proposed openstack/nova master: Add a new field 'topic_alt' in service: object, DB, notification payload https://review.opendev.org/c/openstack/nova/+/975240	03:51
opendevreview	Ghanshyam proposed openstack/nova master: Add a new field 'topic_alt' in service: object, DB, notification payload https://review.opendev.org/c/openstack/nova/+/975240	03:54
opendevreview	Ghanshyam proposed openstack/nova master: Add 2nd RPC server for compute service https://review.opendev.org/c/openstack/nova/+/975242	04:27
opendevreview	Ghanshyam proposed openstack/nova master: Add 2nd RPC server for compute service https://review.opendev.org/c/openstack/nova/+/975242	04:28
gmaan	gibi: bauzas: these two change (adding 2nd RPC server) for graceful shutdown series are ready for review: https://review.opendev.org/c/openstack/nova/+/975240 https://review.opendev.org/c/openstack/nova/+/975242	04:41
opendevreview	huanhongda proposed openstack/nova master: Live migration with iothreads https://review.opendev.org/c/openstack/nova/+/975000	06:40
opendevreview	huanhongda proposed openstack/nova master: Live migration with iothreads https://review.opendev.org/c/openstack/nova/+/975000	06:52
bauzas	gmaan: saw those, will try to do review them today but my day is already packed with review needs	08:43
opendevreview	huanhongda proposed openstack/nova master: Live migration with iothreads https://review.opendev.org/c/openstack/nova/+/975000	09:39
gibi	gmaan: did my first review round on the second RPC patches and left comments	10:36
gibi	sean-k-mooney: heh I think I see how the oslo.vmware job could run out of threads in the default pool. If there are at least two differnt type of task is using the default pool during a single VM boot like buil_and_run_instance and then later that task spawns _allocate_network_async then if the pool size is e.g. 5 and I send in 10 parallel VM boot requests. Then the first 5 VM boot fills the pool and	15:02
gibi	then none of them can move forward as the _allocate_network_async task for each of them is also queued up waiting for executor.	15:02
sean-k-mooney	ah i see	15:04
sean-k-mooney	i guess im not sure why buil_and_run_instance or _allocate_network_async woudl be dispatced to the defautl thread pool in the first place	15:04
sean-k-mooney	the should both be on the rpc trhead pool.	15:05
sean-k-mooney	if we had to use utils.spawn inside if for some async work i can see it having contention but it shoudl not deadlock	15:06
sean-k-mooney	oh	15:06
sean-k-mooney	no it could	15:06
gibi	for some historical reasons build_and_run is moved to the default pool from the RPC pool. And then that task utils.spawns the network task	15:07
sean-k-mooney	i guess a task that is on a executor shoudl never dispatch an other task to the same executor	15:07
sean-k-mooney	if it does it could alwasy starve	15:07
gibi	yeah	15:07
sean-k-mooney	ok wll i think the network task shoudl jsut move to the io thread pool	15:07
sean-k-mooney	at least for the short term	15:08
sean-k-mooney	rather the the default	15:08
gibi	yeah. Long term I need to think how can we prevent a new task being introduced and spawned to the same pool as the parent task	15:08
sean-k-mooney	so we are not seeing this in the libvirt driver because this behvior is vmware dirver specific?	15:09
gibi	nope I see it in the libvirt dirver too	15:10
gibi	now that I know what to look for	15:10
gibi	both task are from the compute manager so the problem is virt driver agnostic	15:10
gibi	it seems tempest in our CI never tried to spawn more than 2-3 VMs in parallel but oslo.vmaware CI does so	15:11
sean-k-mooney	ack in that case for now we coudl use the io thread pool for all network or file operations and the default for normal function calls?	15:11
gibi	I will take a closer look at all utils.spawn call to see how much contention we have	15:11
gibi	but yes in general we need to separate out these tasks to different pools	15:11
sean-k-mooney	not a now descison but we might actully want a "packaged task" construct in the future where we can package up a closure of the task to be executeded (with cansletion if need) and metadata to tack/prevent this	15:12
sean-k-mooney	i.e. so we know what executor we are on and on what executor we can spawn the task	15:13
gibi	yeah a generic solution probably needs some tracking metadata	15:14
sean-k-mooney	for now we could add a simple execotre fild to the nova context object which is None by defualt but set to a link to the execotre by spawn	15:14
sean-k-mooney	then spawn could check if we are ever tying to spawn a task on teh same executor and raise	15:14
gibi	yeah context is not a bad idea	15:14
gibi	thanks	15:14
sean-k-mooney	the reason i said we might need a packaged task in the future is i dont know if we have the context as an arguemt every where we are usign spawn but we might	15:15
sean-k-mooney	i.e. if we are currently spawning a fucntion that does not take the context as its first argument then we woudl either need to add it or find a diffent solution	15:16
sean-k-mooney	but thats what i woudl try first	15:16
gibi	context should be in a thread local store, I can check if we can grab it from there	15:16
sean-k-mooney	oh you might be right i vagely remoemebr ther ebeing a psudo signelton function to get it at module scope	15:18
gibi	btw, this logic to move the build_and_run_instance from the RPC to the default pool might not be a valid strategy any more for native threading. In eventlet we had 60 RPC worker but potentially unlimited worker in the default pool. So the default pool was bigger. But in native threading the default pool is smaller than the RPC worker pool.	15:18
gibi	(by default)	15:20
sean-k-mooney	ya im not sure it really requried (or why it was done)	15:20
sean-k-mooney	just runign it on the rpc thread i think should be ok	15:20
sean-k-mooney	even if it was the same size i dont think we shoudl design the system so that with the incorrect config value you can deadlock	15:21
sean-k-mooney	so threading may be highlighting the problem but i would say it was still a bug even with eventlet	15:21
gibi	I think it originally was done to avoid build requests to hold up the RPC pool and prevent other lifecycle operations to progress in the meantime	15:23
gibi	https://github.com/openstack/nova/blob/59a7093915298973c72b6d1749a6acd27e0045a9/nova/compute/manager.py#L2452-L2460	15:23
gibi	maybe we need a separete build_and_run_instance pool then	15:24
gibi	to limit the parallel boot requests	15:24
sean-k-mooney	well we have a lock for that	15:24
sean-k-mooney	btu we could	15:24
sean-k-mooney	its technially the build semephore not a lock	15:25
sean-k-mooney	but we could rate limit by the executor worker count instead	15:25
gibi	yeah we can move that semaphor to be implemented by the pool size	15:25
gibi	we discussed before that we want this as this allow shutting down the pool during graceful shutdown	15:26
gibi	then we went to shutting down the whole RPC pool instead	15:26
sean-k-mooney	so you would shutdown the build pool which would block until done but now allow enquing new builds	15:27
gibi	yes	15:27
sean-k-mooney	i assume calling submit on an executore that you have called shutdown on will raise?	15:27
gibi	yepp	15:27
gibi	it is a bit moot point as with the current graceful shutdown plans we will shut down the whole RPC topic so no extra request will come in anyhow	15:28
sean-k-mooney	ok so that could allow us to catch that and gracefully raise a buidl exepction allowing a requeue	15:28
sean-k-mooney	i need to think about this a bit but it almost sound like what we really want instead of multipe execotres is an execotr that under stand task priorities	15:29
sean-k-mooney	i.e. submit the build with the lowest peririty then if that submit a new request iw would submit it with context.priority+1	15:30
sean-k-mooney	well no	15:31
sean-k-mooney	that only works with corutiens that can be suspended	15:31
sean-k-mooney	because we need to be able to premet the lower priority task	15:31
gibi	yeah I vote for the thinking more approach :) it is friday afternoon	15:32
sean-k-mooney	gibi: so im not really concerd with the idea of having 1 or 2 or even 3 pools (default, io and rpc)	15:32
sean-k-mooney	but if we need like 5 pools i woudl be a bit concerned	15:32
gibi	yeah I understand that fear	15:32
sean-k-mooney	am i didnt get a chance to look at your event handeling patch by the way	15:33
sean-k-mooney	its a public holdiay on monday so it will likely be tuesday before i have time	15:33
gibi	no worries	15:33
gibi	I have now plenty of thinking work	15:34
gibi	enjoy your day off	15:34
gibi	I summarized this discussion back to gerrit https://review.opendev.org/c/openstack/nova/+/965467/44#message-191254e8132d5162ced2217faeb06910e85527d5	15:48
gmaan	bauzas: gibi thnaks	15:51
sean-k-mooney	gibi: your probaly done for today but https://review.opendev.org/c/openstack/nova/+/975171/1 is the repoducer for the iothread live migration issue if you have time or anyone else. sylvain is +2 and the fix patch is approved	18:25
sean-k-mooney	gmaan: if you want me to rework https://review.opendev.org/c/openstack/nova/+/973438/comment/ff4bb8ea_53533e58/ i can but it will be a tuesday task	18:25
sean-k-mooney	im going to wrap up for the long weekend shortly	18:26
gmaan	sean-k-mooney: checking your reply	18:28
gmaan	sean-k-mooney: I think catching exception and reseting the internal state is more safe and better	18:42
gmaan	i know release_write_lock does not raise exception other then other writter try to unlock them but 1. its 3rd party so changes can happen there 2. resting internal state before actually releasing lock seems odd to m,e	18:43
gmaan	gibi: let's discuss on Monday about the DB/object changes for new RPC topic. I was thinking the same as you commented but proposed the changes because old field 'topic' was in DB/object. I will comment in review too.	18:46
gmaan	dansmith: maybe you remember if any rational or it is from starting, we have RPC 'topic' field in DB/Object but it is not user facing or changable things - https://github.com/openstack/nova/blob/59a7093915298973c72b6d1749a6acd27e0045a9/nova/objects/service.py#L328	18:49
gmaan	as I am adding a new RPC server for graceful shutdown, I proposed to add that new topic field 'topic_alt' in DB/object - https://review.opendev.org/c/openstack/nova/+/975240/2	18:49
sean-k-mooney	gmaan: ok ill do that quickly then before i leave	18:50
dansmith	I don't remember specifically, I suspect it's a holdover from much longer ago where "everything was pluggable"	18:50
gmaan	ohk, In that case, we can leave this old field but not add the new topic field as no use case as of now	18:50
gmaan	sean-k-mooney: ack	18:50
dansmith	yeah doesn't appear to be used anywhere	18:51
gmaan	k, will abandon my change, if we need it can be added in future anyways	18:52
dansmith	ah, looks like it may have been a nova-network thing	18:52
dansmith	like, to get the network manager for a host, which may be shared	18:52
dansmith	gmaan: https://github.com/openstack/nova/blob/icehouse-eol/nova/network/manager.py#L249	18:53
dansmith	that's a deep cut :)	18:53
gmaan	ohk, I see that. I tried to find history in but did not go that old. :) thanks	18:55
dansmith	objects were maturing around icehouse so I usually use that as the point to look for pre-object histopry :)	18:55
gmaan	i remember icehouse that was the release I started my upstream work :)	18:56
dansmith	you're such a newb ;P	18:56
gmaan	:)	18:57
opendevreview	sean mooney proposed openstack/nova master: FairLockGuard: Support cross-thread sharing and nesting https://review.opendev.org/c/openstack/nova/+/973438	19:19
opendevreview	Ghanshyam proposed openstack/nova master: Add 2nd RPC server for compute service https://review.opendev.org/c/openstack/nova/+/975242	19:41
opendevreview	Ghanshyam proposed openstack/nova master: Add 2nd RPC server for compute service https://review.opendev.org/c/openstack/nova/+/975242	19:43

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!