Wednesday, 2025-11-19

opendevreviewchandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/96752205:21
opendevreviewchandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/96752206:45
opendevreviewsean mooney proposed openstack/watcher master: [DB] adopt the AdHocDbFixture from oslo  https://review.opendev.org/c/openstack/watcher/+/96739008:47
opendevreviewchandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/96752209:02
opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Add test for skipped actions  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/96686010:08
opendevreviewMerged openstack/watcher master: [DB] adopt the AdHocDbFixture from oslo  https://review.opendev.org/c/openstack/watcher/+/96739010:21
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable defautl  https://review.opendev.org/c/openstack/watcher/+/96769313:53
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable defautl  https://review.opendev.org/c/openstack/watcher/+/96769313:55
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default  https://review.opendev.org/c/openstack/watcher/+/96769313:55
amoralejsean-k-mooney, dviroel wrt ^ rethinking about that. Given that nova has its own timeout based on vm size according to https://docs.openstack.org/nova/pike/admin/configuring-migrations.html#advanced-configuration-for-kvm-and-qemu what if watcher do not timeout it at all and keeps waiting until nova succeeds or aborts ?14:05
amoralej"The timeout in seconds is the instance size multiplied by the configurable parameter live_migration_completion_timeout, whose default is 800. For example, shared-storage live migration of an instance with 8GiB memory will time out after 6400 seconds."14:06
sean-k-mooneyamoralej: so the timeout on wathcer side shoudl eb a hard limit for pathalogical behavior14:07
sean-k-mooneyso basically if we have one at all14:07
amoralejwhat a pathalogical default may be?14:07
sean-k-mooneythen it should be soemthign liek 1 hour or 1 day kind of thing14:07
amoralejyeah, that's what i was thinking about ...14:07
sean-k-mooneybasiclly it shoudl be soemthign that is much longer then we would expect for a typeical action to complete14:08
dviroelwe need to have a timeout from our side, we cannot depend or other services to timeout our op14:08
dviroels/or other/on other14:09
amoralejso, I will increase the default to something higher14:09
dviroelyeah, that's  a good option I think14:09
amoralejso in nova, with default 800s/ramGB, a 32GBs would be something like 7hr, i can do 8hr as default timeout14:10
sean-k-mooneythe problem we have currently with this approch14:12
sean-k-mooneyis it will cause teh action plan to pause on that action for up to 8 hours and tie up the applier14:12
amoralejright, an option would be to set workers=2 in the applier so that, we at least have another worker able to run an AP14:16
amoralejand hopefully, we'll soon have the capability to run multiple appliers better :)14:17
amoralej(i mean with at least automatic re-scheduling or canceling existing APs when an applier dies14:17
sean-k-mooneyso we can have multipel action plans executing because we cna have 1 per RPC thread14:18
amoralejanyway, I'm open to discuss what the default value should be14:18
sean-k-mooneya better approch woudl be to split the startign of the live migration and the waiting for ti to complete into seperate actions14:19
sean-k-mooneylong term we woudl want to fixn the threaded impletion to supprot parelel executiorn or do the refactor to dispatch actions to teh applier not action plans via rpc14:19
sean-k-mooneyfor what its worth 1 hour for a live migraiton is already exceesive14:21
sean-k-mooneyit can take 10s of minutes but it should take form second to maybe 15 minutes 14:21
sean-k-mooneyits a long tail problem, some vms under a lot of load will take much longer but most shoudl complete orders of magnitude faster14:22
amoralejso 15mins may be a good value?14:23
amoralejfeel free to reply in https://review.opendev.org/c/openstack/watcher/+/96769314:23
sean-k-mooneyit could yes. as its a config option you can tune it based on your deployment14:24
amoralejbtw, i extended polling interval to 5s14:24
sean-k-mooneythat shoudl likely be configurable too and i also dont think it shoudl be fixed long term14:24
amoraleji can also add a parameter migrate_polling_time in this patch14:26
sean-k-mooneyim debating if it shoudl be migrate sepcific14:26
sean-k-mooneyor if we shoudl have  action_polling and action_timeout config that woudl be used for each action14:27
sean-k-mooneywe coudl have both14:27
sean-k-mooneylong term im also thinkign we will want something like this https://paste.opendev.org/show/bLqoRAuxohGyMM0H0w1l/14:27
amoralejactually, we have a resize action that may want to use similar timeout14:27
amoralejbut i understand others may have very different timeouts14:27
sean-k-mooneyya so im torn between per action timeouts/intervals vs a default with sepcific overried only for the subset that need that14:28
amoraleji.e. change_nova_service_state whould not need a high timeout14:28
amoralejvolume_migrate may need higher one14:28
sean-k-mooneyright exactly14:29
sean-k-mooneyin general i woudl perfer if tiemout where enforece by nova or cinder not watcher14:29
amoralejyeah, that was my original doubt :)14:30
sean-k-mooneybut if we have them they shoudl be for the pathalogica case where we go "somethign seame to be wrong" with that said we cant abort most oeratiosn althogh we can fo rlive migraton14:30
dviroelwe could have both timeouts, one specific per action, and a global one as fallback, if the user don't provide a specific timeout we fallback to the global one14:36
sean-k-mooneyyep that was my intial tought14:36
amoralejnote that, at some point, we could do per-audit timeouts14:37
sean-k-mooneyhave a default that is used for all action and for the ones that might wnat ad diffent value have a sepreate config option that defautl to the global one14:37
amoralejthat was my initial idea as next improvement14:37
sean-k-mooneyamoralej: we could but im not conviced we shoudl14:37
sean-k-mooneythere are 2 diffent admins here.14:37
sean-k-mooneythe infra admin has knowlage of host the cloud is deployed and the permance of it14:38
sean-k-mooneythe openstack admin has api admin access but may not be involved in managing the deployment and configuration of the cloud14:38
sean-k-mooneyso we cant really assume that the person who created the audit 14:38
sean-k-mooneyhwas deep knowlwage of the cloud14:39
amoralejyes, but, i.e. they know the size and workload activity, which is also a relevant factor14:39
sean-k-mooneynow we can allwo them to provide timeout if they do but it shoudl nto be the primary way to do that14:39
amoraleji mean the openstack admin14:39
sean-k-mooneywell the createor of the audit wotn have knowlage fo the workload in general14:39
sean-k-mooneyeither the esize or the activity14:40
sean-k-mooneyif this is a public cloud which is what we desgin for by default14:40
amoralejyeah, right14:40
sean-k-mooneywe wont have any knoldage of what the end user is runnign in the vm14:40
sean-k-mooneywe can provide a way eventually to do this in the audit or action plan for priviate cloud usecase where the admin is alos the ownwer of the workload14:41
sean-k-mooneybut that proably not the normal case14:41
amoralejso, if we implement two timeouts, action_timeout (global) + migrate_timeout (for migrate actions). How we would implement the global one? I understand that would not be part of the action itself, but something in the method calling execute() ?14:45
sean-k-mooneywe coudl have action have a get_timeout and get_poll_interval function that will read form the relevent config14:47
sean-k-mooneythere are other ways to do it too by having a map form class to config option or simialr14:48
amoralejbut the logic to implement the timeout would be inside each action execution code iuuc14:49
amoralejnot as something "external" checking the execution time on any kind of action14:50
sean-k-mooneyi need to look at the code i tought you just implied the opicite14:50
sean-k-mooneyif tis in the action execute fucntion it much simpler14:50
sean-k-mooneywe define the action to look at its action specific config option14:51
sean-k-mooneyand we set the defautl fo that option to the gloabl one14:51
sean-k-mooneyso if you dont explciyt defien it it will defautl to the global one14:51
dviroelamoralej: i think that each action would need to be aware of the timeout and end its processing, we don't want to have a thread kill outside anymore14:52
amoraleji can easily implement that if in the migrate in that patch, but if we want to apply at least the global value to all the actions, we'll need to patch all actions14:52
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L25-L5214:53
amoralejand again, i still have the doubt about how to size timeouts for volume_migrations specially14:53
sean-k-mooneyamoralej: yes we woudl have ot modify all the actions14:53
sean-k-mooneywe don thave to do that in one patch however14:53
dviroeli think that all other action are out of the scope for the LP bug scope14:53
sean-k-mooneyright14:54
sean-k-mooneyso for most action i dont think we have a timeout14:54
dviroelI think that we could really backport this server migration timeout fix14:54
dviroelbut it depends how it is implemented14:54
sean-k-mooneyso we can backport it if its just a straight config option for this one case14:55
sean-k-mooneyso we can start with that 14:55
sean-k-mooneybut we shoudl fix this later for all actions14:55
dviroelack14:55
sean-k-mooneyso inially patch can jsut be 2 now migrate sepcific cofnig options14:55
sean-k-mooneys/now/new/14:56
amoralej2 migrate config options you mean migrate_timeout + migrate_polling_interval or migrate_timeout + action_timeout ?14:57
amoralejfor the initial patch, i mean14:58
amoraleji can implement any of those options in https://review.opendev.org/c/openstack/watcher/+/96769314:58
sean-k-mooneytimeout and interval14:58
amoralejack, i think that's a good option14:59
amoralejand global timeout would come as follow-up with the rest of actions14:59
amoralejlemme prepare the patch14:59
sean-k-mooneythat a "stop the bleeding" approch to make live migration actully practical14:59
amoralejwfm14:59
sean-k-mooneywe can then design a better long term solution14:59
dviroelagree14:59
sean-k-mooneyperhasp as part of the scaling work or seperatly14:59
amoralejanyway, i think i will increase the default timeout to something like 1hr at least14:59
sean-k-mooneyi woudl start with 900s and 1s personally but we coudl try an hour15:00
sean-k-mooney1 second might be a bit agressive15:01
amoraleji think so, tbh15:01
sean-k-mooneyin ci we will want to tune this to like 10s and  0.2 seconds15:01
sean-k-mooneyso they shoudl be floats15:01
sean-k-mooneymaybe 1 second min and keeping it an int may be ok15:02
sean-k-mooneyfor ci15:02
amoraleji think under-second is too small even for CI :)15:02
amoralejwe are not running such a big number of migrations that half a second per migration makes big difference ...15:03
sean-k-mooneynot in the tempest tests15:03
sean-k-mooneyi dont want this to be an eveutal problem in the fuctional tests15:03
sean-k-mooneybtu we can alwasy go form a int to a float later15:03
sean-k-mooneythat a backward compatible change15:04
sean-k-mooneyso stick with int for now15:04
sean-k-mooneywe can revisti if ti becomes a problem later15:04
sean-k-mooneyi was prematurly optimizing15:04
amoralejok15:05
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default  https://review.opendev.org/c/openstack/watcher/+/96769315:59
amoralejsean-k-mooney dviroel ^ finally i have implemented pollint timeout as float, i'm setting it to 1.5 in one of the unit tests15:59
amoralejno timeout, polling time, i meant16:00
amoralejthe case for functional tests convinced me :)16:00
sean-k-mooneyso you didnt fix https://review.opendev.org/c/openstack/watcher/+/967693/comment/b82a145e_9bd7118b/16:03
sean-k-mooneydecrementing rety is incorrect16:04
sean-k-mooneyits not a rety if you are initallising it to the tiem out value16:04
sean-k-mooneyso either you need to change thename16:04
sean-k-mooneyor you need to change the approch 16:04
amoraleji was just replying to that :)16:05
amoralejyeah, retry name seems incorrect16:05
amoralejbut i think logic is correct16:05
sean-k-mooneycall it timeout and im fine with it16:05
sean-k-mooneywell16:05
amoralejyep16:05
sean-k-mooneyya no negitive values are fine16:06
sean-k-mooneyyou dont >0 in the while16:06
sean-k-mooneyyou shhoudl not be using getatrr ther but the exsting host was so i gues sthat is not your bug16:06
sean-k-mooneywe can fix that seperatly16:06
amoralejyeah, i used what was used before, i didn't want to touch that, tbh16:07
sean-k-mooneyya its fien for now 16:11
sean-k-mooneywe shoudl clean up that code evenutlly but its not part of this bug16:12
sean-k-mooneyits just not following good python codeing style16:12
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default  https://review.opendev.org/c/openstack/watcher/+/96769316:13
opendevreviewDouglas Viroel proposed openstack/watcher master: Adds support for threading mode in applier  https://review.opendev.org/c/openstack/watcher/+/96622620:29
opendevreviewDouglas Viroel proposed openstack/watcher master: Disable eventlet patching for api service in threading job  https://review.opendev.org/c/openstack/watcher/+/96776920:36

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!