Wednesday, 2025-11-19

opendevreview	chandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/967522	05:21
opendevreview	chandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/967522	06:45
opendevreview	sean mooney proposed openstack/watcher master: [DB] adopt the AdHocDbFixture from oslo https://review.opendev.org/c/openstack/watcher/+/967390	08:47
opendevreview	chandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/967522	09:02
opendevreview	David proposed openstack/watcher-tempest-plugin master: Add test for skipped actions https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/966860	10:08
opendevreview	Merged openstack/watcher master: [DB] adopt the AdHocDbFixture from oslo https://review.opendev.org/c/openstack/watcher/+/967390	10:21
opendevreview	Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable defautl https://review.opendev.org/c/openstack/watcher/+/967693	13:53
opendevreview	Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable defautl https://review.opendev.org/c/openstack/watcher/+/967693	13:55
opendevreview	Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693	13:55
amoralej	sean-k-mooney, dviroel wrt ^ rethinking about that. Given that nova has its own timeout based on vm size according to https://docs.openstack.org/nova/pike/admin/configuring-migrations.html#advanced-configuration-for-kvm-and-qemu what if watcher do not timeout it at all and keeps waiting until nova succeeds or aborts ?	14:05
amoralej	"The timeout in seconds is the instance size multiplied by the configurable parameter live_migration_completion_timeout, whose default is 800. For example, shared-storage live migration of an instance with 8GiB memory will time out after 6400 seconds."	14:06
sean-k-mooney	amoralej: so the timeout on wathcer side shoudl eb a hard limit for pathalogical behavior	14:07
sean-k-mooney	so basically if we have one at all	14:07
amoralej	what a pathalogical default may be?	14:07
sean-k-mooney	then it should be soemthign liek 1 hour or 1 day kind of thing	14:07
amoralej	yeah, that's what i was thinking about ...	14:07
sean-k-mooney	basiclly it shoudl be soemthign that is much longer then we would expect for a typeical action to complete	14:08
dviroel	we need to have a timeout from our side, we cannot depend or other services to timeout our op	14:08
dviroel	s/or other/on other	14:09
amoralej	so, I will increase the default to something higher	14:09
dviroel	yeah, that's a good option I think	14:09
amoralej	so in nova, with default 800s/ramGB, a 32GBs would be something like 7hr, i can do 8hr as default timeout	14:10
sean-k-mooney	the problem we have currently with this approch	14:12
sean-k-mooney	is it will cause teh action plan to pause on that action for up to 8 hours and tie up the applier	14:12
amoralej	right, an option would be to set workers=2 in the applier so that, we at least have another worker able to run an AP	14:16
amoralej	and hopefully, we'll soon have the capability to run multiple appliers better :)	14:17
amoralej	(i mean with at least automatic re-scheduling or canceling existing APs when an applier dies	14:17
sean-k-mooney	so we can have multipel action plans executing because we cna have 1 per RPC thread	14:18
amoralej	anyway, I'm open to discuss what the default value should be	14:18
sean-k-mooney	a better approch woudl be to split the startign of the live migration and the waiting for ti to complete into seperate actions	14:19
sean-k-mooney	long term we woudl want to fixn the threaded impletion to supprot parelel executiorn or do the refactor to dispatch actions to teh applier not action plans via rpc	14:19
sean-k-mooney	for what its worth 1 hour for a live migraiton is already exceesive	14:21
sean-k-mooney	it can take 10s of minutes but it should take form second to maybe 15 minutes	14:21
sean-k-mooney	its a long tail problem, some vms under a lot of load will take much longer but most shoudl complete orders of magnitude faster	14:22
amoralej	so 15mins may be a good value?	14:23
amoralej	feel free to reply in https://review.opendev.org/c/openstack/watcher/+/967693	14:23
sean-k-mooney	it could yes. as its a config option you can tune it based on your deployment	14:24
amoralej	btw, i extended polling interval to 5s	14:24
sean-k-mooney	that shoudl likely be configurable too and i also dont think it shoudl be fixed long term	14:24
amoralej	i can also add a parameter migrate_polling_time in this patch	14:26
sean-k-mooney	im debating if it shoudl be migrate sepcific	14:26
sean-k-mooney	or if we shoudl have action_polling and action_timeout config that woudl be used for each action	14:27
sean-k-mooney	we coudl have both	14:27
sean-k-mooney	long term im also thinkign we will want something like this https://paste.opendev.org/show/bLqoRAuxohGyMM0H0w1l/	14:27
amoralej	actually, we have a resize action that may want to use similar timeout	14:27
amoralej	but i understand others may have very different timeouts	14:27
sean-k-mooney	ya so im torn between per action timeouts/intervals vs a default with sepcific overried only for the subset that need that	14:28
amoralej	i.e. change_nova_service_state whould not need a high timeout	14:28
amoralej	volume_migrate may need higher one	14:28
sean-k-mooney	right exactly	14:29
sean-k-mooney	in general i woudl perfer if tiemout where enforece by nova or cinder not watcher	14:29
amoralej	yeah, that was my original doubt :)	14:30
sean-k-mooney	but if we have them they shoudl be for the pathalogica case where we go "somethign seame to be wrong" with that said we cant abort most oeratiosn althogh we can fo rlive migraton	14:30
dviroel	we could have both timeouts, one specific per action, and a global one as fallback, if the user don't provide a specific timeout we fallback to the global one	14:36
sean-k-mooney	yep that was my intial tought	14:36
amoralej	note that, at some point, we could do per-audit timeouts	14:37
sean-k-mooney	have a default that is used for all action and for the ones that might wnat ad diffent value have a sepreate config option that defautl to the global one	14:37
amoralej	that was my initial idea as next improvement	14:37
sean-k-mooney	amoralej: we could but im not conviced we shoudl	14:37
sean-k-mooney	there are 2 diffent admins here.	14:37
sean-k-mooney	the infra admin has knowlage of host the cloud is deployed and the permance of it	14:38
sean-k-mooney	the openstack admin has api admin access but may not be involved in managing the deployment and configuration of the cloud	14:38
sean-k-mooney	so we cant really assume that the person who created the audit	14:38
sean-k-mooney	hwas deep knowlwage of the cloud	14:39
amoralej	yes, but, i.e. they know the size and workload activity, which is also a relevant factor	14:39
sean-k-mooney	now we can allwo them to provide timeout if they do but it shoudl nto be the primary way to do that	14:39
amoralej	i mean the openstack admin	14:39
sean-k-mooney	well the createor of the audit wotn have knowlage fo the workload in general	14:39
sean-k-mooney	either the esize or the activity	14:40
sean-k-mooney	if this is a public cloud which is what we desgin for by default	14:40
amoralej	yeah, right	14:40
sean-k-mooney	we wont have any knoldage of what the end user is runnign in the vm	14:40
sean-k-mooney	we can provide a way eventually to do this in the audit or action plan for priviate cloud usecase where the admin is alos the ownwer of the workload	14:41
sean-k-mooney	but that proably not the normal case	14:41
amoralej	so, if we implement two timeouts, action_timeout (global) + migrate_timeout (for migrate actions). How we would implement the global one? I understand that would not be part of the action itself, but something in the method calling execute() ?	14:45
sean-k-mooney	we coudl have action have a get_timeout and get_poll_interval function that will read form the relevent config	14:47
sean-k-mooney	there are other ways to do it too by having a map form class to config option or simialr	14:48
amoralej	but the logic to implement the timeout would be inside each action execution code iuuc	14:49
amoralej	not as something "external" checking the execution time on any kind of action	14:50
sean-k-mooney	i need to look at the code i tought you just implied the opicite	14:50
sean-k-mooney	if tis in the action execute fucntion it much simpler	14:50
sean-k-mooney	we define the action to look at its action specific config option	14:51
sean-k-mooney	and we set the defautl fo that option to the gloabl one	14:51
sean-k-mooney	so if you dont explciyt defien it it will defautl to the global one	14:51
dviroel	amoralej: i think that each action would need to be aware of the timeout and end its processing, we don't want to have a thread kill outside anymore	14:52
amoralej	i can easily implement that if in the migrate in that patch, but if we want to apply at least the global value to all the actions, we'll need to patch all actions	14:52
sean-k-mooney	https://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L25-L52	14:53
amoralej	and again, i still have the doubt about how to size timeouts for volume_migrations specially	14:53
sean-k-mooney	amoralej: yes we woudl have ot modify all the actions	14:53
sean-k-mooney	we don thave to do that in one patch however	14:53
dviroel	i think that all other action are out of the scope for the LP bug scope	14:53
sean-k-mooney	right	14:54
sean-k-mooney	so for most action i dont think we have a timeout	14:54
dviroel	I think that we could really backport this server migration timeout fix	14:54
dviroel	but it depends how it is implemented	14:54
sean-k-mooney	so we can backport it if its just a straight config option for this one case	14:55
sean-k-mooney	so we can start with that	14:55
sean-k-mooney	but we shoudl fix this later for all actions	14:55
dviroel	ack	14:55
sean-k-mooney	so inially patch can jsut be 2 now migrate sepcific cofnig options	14:55
sean-k-mooney	s/now/new/	14:56
amoralej	2 migrate config options you mean migrate_timeout + migrate_polling_interval or migrate_timeout + action_timeout ?	14:57
amoralej	for the initial patch, i mean	14:58
amoralej	i can implement any of those options in https://review.opendev.org/c/openstack/watcher/+/967693	14:58
sean-k-mooney	timeout and interval	14:58
amoralej	ack, i think that's a good option	14:59
amoralej	and global timeout would come as follow-up with the rest of actions	14:59
amoralej	lemme prepare the patch	14:59
sean-k-mooney	that a "stop the bleeding" approch to make live migration actully practical	14:59
amoralej	wfm	14:59
sean-k-mooney	we can then design a better long term solution	14:59
dviroel	agree	14:59
sean-k-mooney	perhasp as part of the scaling work or seperatly	14:59
amoralej	anyway, i think i will increase the default timeout to something like 1hr at least	14:59
sean-k-mooney	i woudl start with 900s and 1s personally but we coudl try an hour	15:00
sean-k-mooney	1 second might be a bit agressive	15:01
amoralej	i think so, tbh	15:01
sean-k-mooney	in ci we will want to tune this to like 10s and 0.2 seconds	15:01
sean-k-mooney	so they shoudl be floats	15:01
sean-k-mooney	maybe 1 second min and keeping it an int may be ok	15:02
sean-k-mooney	for ci	15:02
amoralej	i think under-second is too small even for CI :)	15:02
amoralej	we are not running such a big number of migrations that half a second per migration makes big difference ...	15:03
sean-k-mooney	not in the tempest tests	15:03
sean-k-mooney	i dont want this to be an eveutal problem in the fuctional tests	15:03
sean-k-mooney	btu we can alwasy go form a int to a float later	15:03
sean-k-mooney	that a backward compatible change	15:04
sean-k-mooney	so stick with int for now	15:04
sean-k-mooney	we can revisti if ti becomes a problem later	15:04
sean-k-mooney	i was prematurly optimizing	15:04
amoralej	ok	15:05
opendevreview	Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693	15:59
amoralej	sean-k-mooney dviroel ^ finally i have implemented pollint timeout as float, i'm setting it to 1.5 in one of the unit tests	15:59
amoralej	no timeout, polling time, i meant	16:00
amoralej	the case for functional tests convinced me :)	16:00
sean-k-mooney	so you didnt fix https://review.opendev.org/c/openstack/watcher/+/967693/comment/b82a145e_9bd7118b/	16:03
sean-k-mooney	decrementing rety is incorrect	16:04
sean-k-mooney	its not a rety if you are initallising it to the tiem out value	16:04
sean-k-mooney	so either you need to change thename	16:04
sean-k-mooney	or you need to change the approch	16:04
amoralej	i was just replying to that :)	16:05
amoralej	yeah, retry name seems incorrect	16:05
amoralej	but i think logic is correct	16:05
sean-k-mooney	call it timeout and im fine with it	16:05
sean-k-mooney	well	16:05
amoralej	yep	16:05
sean-k-mooney	ya no negitive values are fine	16:06
sean-k-mooney	you dont >0 in the while	16:06
sean-k-mooney	you shhoudl not be using getatrr ther but the exsting host was so i gues sthat is not your bug	16:06
sean-k-mooney	we can fix that seperatly	16:06
amoralej	yeah, i used what was used before, i didn't want to touch that, tbh	16:07
sean-k-mooney	ya its fien for now	16:11
sean-k-mooney	we shoudl clean up that code evenutlly but its not part of this bug	16:12
sean-k-mooney	its just not following good python codeing style	16:12
opendevreview	Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693	16:13
opendevreview	Douglas Viroel proposed openstack/watcher master: Adds support for threading mode in applier https://review.opendev.org/c/openstack/watcher/+/966226	20:29
opendevreview	Douglas Viroel proposed openstack/watcher master: Disable eventlet patching for api service in threading job https://review.opendev.org/c/openstack/watcher/+/967769	20:36

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!