Friday, 2026-01-30

*** mhen_ is now known as mhen02:37
opendevreviewGhanshyam proposed openstack/nova master: Add a new field 'topic_alt' in service: object, DB, notification payload  https://review.opendev.org/c/openstack/nova/+/97524003:51
opendevreviewGhanshyam proposed openstack/nova master: Add a new field 'topic_alt' in service: object, DB, notification payload  https://review.opendev.org/c/openstack/nova/+/97524003:54
opendevreviewGhanshyam proposed openstack/nova master: Add 2nd RPC server for compute service  https://review.opendev.org/c/openstack/nova/+/97524204:27
opendevreviewGhanshyam proposed openstack/nova master: Add 2nd RPC server for compute service  https://review.opendev.org/c/openstack/nova/+/97524204:28
gmaangibi: bauzas: these two change (adding 2nd RPC server) for graceful shutdown series are ready for review: https://review.opendev.org/c/openstack/nova/+/975240  https://review.opendev.org/c/openstack/nova/+/975242 04:41
opendevreviewhuanhongda proposed openstack/nova master: Live migration with iothreads  https://review.opendev.org/c/openstack/nova/+/97500006:40
opendevreviewhuanhongda proposed openstack/nova master: Live migration with iothreads  https://review.opendev.org/c/openstack/nova/+/97500006:52
bauzasgmaan: saw those, will try to do review them today but my day is already packed with review needs08:43
opendevreviewhuanhongda proposed openstack/nova master: Live migration with iothreads  https://review.opendev.org/c/openstack/nova/+/97500009:39
gibigmaan: did my first review round on the second RPC patches and left comments10:36
gibisean-k-mooney: heh I think I see how the oslo.vmware job could run out of threads in the default pool. If there are at least two differnt type of task is using the default pool during a single VM boot like buil_and_run_instance and then later that task spawns _allocate_network_async then if the pool size is e.g. 5 and I send in 10 parallel VM boot requests. Then the first 5 VM boot fills the pool and 15:02
gibithen none of them can move forward as the _allocate_network_async task for each of them is also queued up waiting for executor. 15:02
sean-k-mooneyah i see15:04
sean-k-mooneyi guess im not sure why buil_and_run_instance or _allocate_network_async woudl be dispatced to the defautl thread pool in the first place15:04
sean-k-mooneythe should both be on the rpc trhead pool. 15:05
sean-k-mooneyif we had to use utils.spawn inside if for some async work i can see it having contention but it shoudl not deadlock15:06
sean-k-mooneyoh15:06
sean-k-mooneyno it could15:06
gibifor some historical reasons build_and_run is moved to the default pool from the RPC pool. And then that task utils.spawns the network task15:07
sean-k-mooneyi guess a task that is on a executor shoudl never dispatch an other task to the same executor15:07
sean-k-mooneyif it does it could alwasy starve15:07
gibiyeah15:07
sean-k-mooneyok wll i think the network task shoudl jsut move to the io thread pool15:07
sean-k-mooneyat least for the short term15:08
sean-k-mooneyrather the the default15:08
gibiyeah. Long term I need to think how can we prevent a new task being introduced and spawned to the same pool as the parent task15:08
sean-k-mooneyso we are not seeing this in the libvirt driver because this behvior is vmware dirver specific?15:09
gibinope I see it in the libvirt dirver too15:10
gibinow that I know what to look for15:10
gibiboth task are from the compute manager so the problem is virt driver agnostic15:10
gibiit seems tempest in our CI never tried to spawn more than 2-3 VMs in parallel but oslo.vmaware CI does so15:11
sean-k-mooneyack in that case for now we coudl use the io thread pool for all network or file operations and the default for normal function calls?15:11
gibiI will take a closer look at all utils.spawn call to see how much contention we have15:11
gibibut yes in general we need to separate out these tasks to different pools15:11
sean-k-mooneynot a now descison but we might actully want a "packaged task" construct in the future where we can package up a closure of the task to be executeded (with cansletion if need) and metadata to tack/prevent this15:12
sean-k-mooneyi.e. so we know what executor we are on and on what executor we can spawn the task15:13
gibiyeah a generic solution probably needs some tracking metadata15:14
sean-k-mooneyfor now we could add a simple execotre fild to the nova context object which is None by defualt but set to a link to the execotre by spawn15:14
sean-k-mooneythen spawn could check if we are ever tying to spawn a task on teh same executor and raise15:14
gibiyeah context is not a bad idea15:14
gibithanks15:14
sean-k-mooneythe reason i said we might need a packaged task in the future is i dont know if we have the context as an arguemt every where we are usign spawn but we might 15:15
sean-k-mooneyi.e. if we are currently spawning a fucntion that does not take the context as its first argument then we woudl either need to add it or find a diffent solution15:16
sean-k-mooneybut thats what i woudl try first15:16
gibicontext should be in a thread local store, I can check if we can grab it from there15:16
sean-k-mooneyoh you might be right i vagely remoemebr ther ebeing a psudo signelton function to get it at module scope15:18
gibibtw, this logic to move the build_and_run_instance from the RPC to the default pool might not be a valid strategy any more for native threading. In eventlet we had 60 RPC worker but potentially unlimited worker in the default pool. So the default pool was bigger. But in native threading the default pool is smaller than the RPC worker pool.15:18
gibi(by default)15:20
sean-k-mooneyya im not sure it really requried (or why it was done)15:20
sean-k-mooneyjust runign it on the rpc thread i think should be ok15:20
sean-k-mooneyeven if it was the same size i dont think we shoudl design the system so that with the incorrect config value you can deadlock15:21
sean-k-mooneyso threading may be highlighting the problem but i would say it was still a bug even with eventlet15:21
gibiI think it originally was done to avoid build requests to hold up the RPC pool and prevent other lifecycle operations to progress in the meantime15:23
gibihttps://github.com/openstack/nova/blob/59a7093915298973c72b6d1749a6acd27e0045a9/nova/compute/manager.py#L2452-L246015:23
gibimaybe we need a separete build_and_run_instance pool then15:24
gibito limit the parallel boot requests 15:24
sean-k-mooney well we have a lock for that15:24
sean-k-mooneybtu we could15:24
sean-k-mooneyits technially the build semephore not a lock15:25
sean-k-mooneybut we could rate limit by the executor worker count instead15:25
gibiyeah we can move that semaphor to be implemented by the pool size15:25
gibiwe discussed before that we want this as this allow shutting down the pool during graceful shutdown15:26
gibithen we went to shutting down the whole RPC pool instead15:26
sean-k-mooneyso you would shutdown the build pool which would block until done but now allow enquing new builds15:27
gibiyes15:27
sean-k-mooneyi assume calling submit on an executore that you have called shutdown on will raise?15:27
gibiyepp15:27
gibiit is a bit moot point as with the current graceful shutdown plans we will shut down the whole RPC topic so no extra request will come in anyhow15:28
sean-k-mooneyok so that could allow us to catch that and gracefully raise a buidl exepction allowing a requeue15:28
sean-k-mooneyi need to think about this a bit but it almost sound like what we really want instead of multipe execotres is an execotr that under stand task priorities15:29
sean-k-mooneyi.e. submit the build with the lowest peririty then if that submit a new request iw would submit it with context.priority+115:30
sean-k-mooneywell no15:31
sean-k-mooneythat only works with corutiens that can be suspended15:31
sean-k-mooneybecause we need to be able to premet the lower priority task15:31
gibiyeah I vote for the thinking more approach :) it is friday afternoon15:32
sean-k-mooneygibi: so im not really concerd with the idea of having 1 or 2 or even 3 pools (default, io and rpc)15:32
sean-k-mooneybut if we need like 5 pools i woudl be a bit concerned15:32
gibiyeah I understand that fear15:32
sean-k-mooneyam i didnt get a chance to look at your event handeling patch by the way15:33
sean-k-mooneyits a public holdiay on monday so it will likely be tuesday before i have time15:33
gibino worries15:33
gibiI have now plenty of thinking work15:34
gibienjoy your day off15:34
gibiI summarized this discussion back to gerrit https://review.opendev.org/c/openstack/nova/+/965467/44#message-191254e8132d5162ced2217faeb06910e85527d515:48
gmaanbauzas: gibi thnaks15:51
sean-k-mooneygibi: your probaly done for today but https://review.opendev.org/c/openstack/nova/+/975171/1 is the repoducer for the iothread live migration issue if you have time or anyone else. sylvain is +2 and the fix patch is approved18:25
sean-k-mooneygmaan: if you want me to rework https://review.opendev.org/c/openstack/nova/+/973438/comment/ff4bb8ea_53533e58/ i can but it will be a tuesday task18:25
sean-k-mooneyim going to wrap up for the long weekend shortly18:26
gmaansean-k-mooney: checking your reply18:28
gmaansean-k-mooney: I think catching exception and reseting the internal state is more safe and better18:42
gmaani know release_write_lock does not raise exception other then other writter try to unlock them but 1. its 3rd party so changes can happen there 2. resting internal state before actually releasing lock seems odd to m,e18:43
gmaangibi: let's discuss on Monday about the DB/object changes for new RPC topic. I was thinking the same as you commented but proposed the changes because old field 'topic' was in DB/object. I will comment in review too.18:46
gmaandansmith: maybe you remember if any rational or it is from starting, we have RPC 'topic' field in DB/Object but it is not user facing or changable things - https://github.com/openstack/nova/blob/59a7093915298973c72b6d1749a6acd27e0045a9/nova/objects/service.py#L32818:49
gmaanas I am adding a new RPC server for graceful shutdown, I proposed to add that new topic field 'topic_alt' in DB/object - https://review.opendev.org/c/openstack/nova/+/975240/218:49
sean-k-mooneygmaan: ok ill do that quickly then before i leave18:50
dansmithI don't remember specifically, I suspect it's a holdover from much longer ago where "everything was pluggable"18:50
gmaanohk, In that case, we can leave this old field but not add the new topic field as no use case as of now18:50
gmaansean-k-mooney: ack18:50
dansmithyeah doesn't appear to be used anywhere18:51
gmaank, will abandon my change, if we need it can be added in future anyways18:52
dansmithah, looks like it may have been a nova-network thing18:52
dansmithlike, to get the network manager for a host, which may be shared18:52
dansmithgmaan: https://github.com/openstack/nova/blob/icehouse-eol/nova/network/manager.py#L24918:53
dansmiththat's a deep cut :)18:53
gmaanohk, I see that. I tried to find history in but did not go that old. :) thanks18:55
dansmithobjects were maturing around icehouse so I usually use that as the point to look for pre-object histopry :)18:55
gmaani remember icehouse that was the release I started my upstream work :)18:56
dansmithyou're such a newb ;P18:56
gmaan:)18:57
opendevreviewsean mooney proposed openstack/nova master: FairLockGuard: Support cross-thread sharing and nesting  https://review.opendev.org/c/openstack/nova/+/97343819:19
opendevreviewGhanshyam proposed openstack/nova master: Add 2nd RPC server for compute service  https://review.opendev.org/c/openstack/nova/+/97524219:41
opendevreviewGhanshyam proposed openstack/nova master: Add 2nd RPC server for compute service  https://review.opendev.org/c/openstack/nova/+/97524219:43

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!