Thursday, 2025-11-27

opendevreviewAlfredo Moralejo proposed openstack/watcher master: Skip migrate actions in pre_condition phase  https://review.opendev.org/c/openstack/watcher/+/966699  07:29
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Make VM resize timeout configurable with migration defaults  https://review.opendev.org/c/openstack/watcher/+/968610  08:45
opendevreviewJoan Gilabert proposed openstack/watcher-specs master: Add specification for migrating Watcher to OpenStackSDK  https://review.opendev.org/c/openstack/watcher-specs/+/968023  10:18
dviroelhi all o/, watcher meeting will start in 10m  11:50
dviroel#startmeeting watcher12:00
opendevmeetMeeting started Thu Nov 27 12:00:40 2025 UTC and is due to finish in 60 minutes.  The chair is dviroel. Information about MeetBot at http://wiki.debian.org/MeetBot.12:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.12:00
opendevmeetThe meeting name has been set to 'watcher'12:00
dviroelhi all o/, who is around today?12:00
jgilabero/12:01
chandankumaro/12:01
amoralejo/12:01
morenodo/12:02
dviroelcourtesy ping list: sean-k-mooney12:02
dviroelty for joining o/12:02
dviroellets start with today's meeting agenda12:02
dviroel#link https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L26 (Meeting agenda)12:03
dviroelfeel free to add your own topics to the agenda12:03
dviroelwe already have a few there12:03
dviroel#topic Eventlet removal12:04
dviroeli am missing some updates in the eventlet removal effort since vPTG12:04
dviroelin the vPTG we had some discussions about next steps (applier mainly)12:04
dviroeland we have a patch ready for review in the applier12:05
dviroel#link https://review.opendev.org/c/openstack/watcher/+/966226 (Adds support for threading mode in applier)12:05
dviroelthis one was already reviewed by most of you12:05
dviroelthe tl;dr; in this one is12:06
dviroelwe found many issues with concurrent database access from multiple threads using taskflow parallel engine, and12:06
sean-k-mooneyo/12:06
dviroelwe had to switch to the serial engine when using native threads12:06
dviroelwe will have this limitation on how many concurrent actions can be executed for now12:07
dviroelbut this limitation will be treated in a different patch and launchpad bug, once we merge this first one12:07
sean-k-mooneyactually12:08
sean-k-mooneyone thing I want to test is if we can go back to parallel now that I changed how the db fixtures work12:08
sean-k-mooneyI don't think we will get lucky enough for that to fix it but it's worth a try12:08
dviroeli think that I tried that locally, and hit the same issue12:09
sean-k-mooneyack12:09
dviroelbut I can try again to make sure12:09
dviroelthanks sean-k-mooney, who helped in debugging and proposing a solution for this problem12:10
dviroelwe had lots of discussions here in the irc during that time12:10
dviroelcontinuing with this patch12:11
sean-k-mooneyI think the serial approach is fine for now but we will want to address that when looking to remove the ability to run in eventlet mode next cycle12:11
sean-k-mooneyunless we make progress on the other refactors for scaling12:11
sean-k-mooneyi.e. distributing actions, not action plans, to appliers, or the other enhancements that may make this a non-issue12:12
dviroel+1 - it can be treated as scaling debt for native thread mode12:12
amoralej+112:12
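The taskflow issue discussed above belongs to a general class of problem: many DB handles are not safe to share across native threads. A minimal stdlib-only illustration of that class of failure (this is not watcher or taskflow code; sqlite3 just makes the hazard easy to show, since its connections are bound to their creating thread by default):

```python
import sqlite3
import threading

# Illustrative only: watcher's actual issue involves taskflow's parallel
# engine plus its DB layer, not sqlite3 directly. This simply shows why
# sharing one DB connection across worker threads can fail outright.
conn = sqlite3.connect(":memory:")
errors = []

def worker():
    try:
        conn.execute("SELECT 1")  # connection was created in the main thread
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)

t = threading.Thread(target=worker)
t.start()
t.join()
print(len(errors))  # → 1: the cross-thread access was rejected
```

Giving each thread its own connection/session (or, for sqlite3, passing `check_same_thread=False` and serializing access yourself) avoids the error; falling back to the serial engine sidesteps that rework for now.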
dviroelnote that in 966226 we also change some ci jobs12:13
dviroelopenstack-tox-py312-threading  will now run all unit tests in threading mode and12:13
dviroelwatcher-prometheus-integration-threading will disable eventlet monkey patching for the applier too12:13
dviroelplease take a moment to review this one when possible12:13
dviroelnext patch in the eventlet series:12:13
dviroelanother small change in CI, to disable eventlet patching in the api12:14
dviroel#link https://review.opendev.org/c/openstack/watcher/+/967769 (Disable eventlet patching for api service in threading job)12:14
dviroelso we should end with threading job running with all native thread soon12:14
dviroelIt is a small patch to review/approve, PTAL when free12:14
dviroelfinally, there is one more12:15
dviroelit's a leftover from the decision-engine work12:15
dviroelit's marked as WIP for now12:15
dviroelit's the cluster data model collector timeout for native thread mode12:16
dviroeltoday it only works in eventlet mode (and maybe in a contradictory way, using collector's 'period' as timeout)12:16
dviroeli sent a first proposal yesterday, to include a new timeout for some collector operations, and to remove the eventlet timeout from the code:12:17
dviroel#link https://review.opendev.org/c/openstack/watcher/+/968568 (Remove eventlet-based timeout in CDM collectors)12:17
amoralejI think the use of period as timeout was on purpose to make sure two collector syncs don't overlap12:17
amoralejthat was my understanding, although dunno if that's correct12:18
dviroelamoralej: right, but still could be a different config12:18
dviroelI'm working on some unit  testing and testing it locally too12:19
amoralejyes, but then we should make sure that timeout is < period12:19
dviroel+112:19
amoralejunless it ensures that a new collection is not executed until the previous one finishes12:20
amoralejwhich may be the case, i'm not sure12:20
dviroelyes, the scheduler will not start a new one if one is still running.12:20
dviroelapscheduler logs when this happens12:21
amoralejack, then the timeout may technically even be > the period (although it may not make much sense)12:21
dviroelbut it makes sense to compare these configs and raise a warning at least12:21
dviroelyeah, it will just skip that time, and try again in the next period12:22
dviroelok, so if you folks want to take a look and comment, it will be great12:23
dviroelbut we can circle back to this CDM change next week in more detail I think12:23
dviroelthat's what I have in eventlet topic for now12:23
dviroellets move to the next one then12:24
* dviroel waits 1 min12:24
dviroel#topic new blueprint for automatic skip actions on pre_condition (second pass)12:25
dviroelhey amoralej o/12:25
amoralejthis is follow up of previous discussion12:25
dviroelack, from last week12:25
amoralejyes12:25
amoraleji updated the blueprint as discussed https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition  12:26
amoralejI added some more high level details of the conditions covered and added a sentence about the documentation12:26
amoralejIf it's fine for you, i'd like to get it approved12:27
dviroeloh ok, so we should review that and approve12:28
jgilaberI think it captures well what we've discussed previously12:28
sean-k-mooneyamoralej: so my only real comment is I asked for the conditions to be listed in the blueprint rather than in the review12:29
sean-k-mooneywe can do it in the review but the reason for a blueprint and spec is to agree on that before the implementation is done12:30
amoralejI added12:30
amoralejThe high level conditions to be checked for each action are:12:30
amoralej- The instance or volume to be acted on does not exist.12:30
amoralej- In a migration action, the instance is not running in source_node (tbd if that should be skipping or failing)12:30
amoralej- Destination pool or host, if explicitly indicated, does not exist or is disabled (that will lead to a FAILED action)12:30
sean-k-mooneyi think we have enough to move forward with12:30
amoralej- Required element statuses for the action are met, i.e. an instance is ACTIVE before starting a live migration. Otherwise the action wil12:30
amoralejyep12:30
sean-k-mooneyright, I wanted you to say if it would be skipped or failed for each condition as well12:31
sean-k-mooneybut since we agreed to add action docs12:31
sean-k-mooneywe can just capture it there 12:31
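The conditions listed above boil down to a skip-vs-fail decision per action in the pre_condition phase. A hypothetical sketch of that decision (function name, arguments, and return values are illustrative, not the blueprint's actual API):

```python
# Hypothetical sketch of the blueprint's skip-vs-fail decision; the real
# checks would live per-action in watcher's pre_condition phase.
def pre_condition_result(resource_exists, resource_status,
                         required_status="ACTIVE"):
    if not resource_exists:
        return "SKIPPED"   # instance/volume is gone: nothing to do
    if resource_status != required_status:
        return "FAILED"    # e.g. not ACTIVE before a live migration
    return "OK"            # proceed with the action

print(pre_condition_result(False, "ACTIVE"))  # → SKIPPED
```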
amoralejbtw, I've sent a review https://review.opendev.org/c/openstack/watcher/+/968025 to add actions documentation12:31
sean-k-mooneyack, we should complete that first as a prerequisite for the auto skipping12:32
sean-k-mooneyare there any objections to approving this?12:32
sean-k-mooneyif not I'll give it another 30 seconds and make the changes in launchpad12:33
dviroellgtm12:33
jgilaber+112:33
sean-k-mooneycool ill do that now and we can move on12:33
amoralejThanks12:34
dviroelthanks amoralej for the updates12:34
dviroelnext topic then amoralej ?12:35
amoralejyes12:35
dviroel#topic new blueprint for Applier service failure management12:35
amoralej#link https://blueprints.launchpad.net/watcher/+spec/monitor-failed-appliers  12:35
amoralejthis also comes from PTG discussion12:35
amoralejit's about implementing applier monitoring to manage the case when appliers are detected as FAILED12:36
amoralejthe behaviour we agreed on PTG is:12:36
amoralej  - ActionPlans in ONGOING state will be cancelled and a message will be added to the status_message field of the AP.12:36
amoralej  - ActionPlans in PENDING state will be unassigned (hostname field will be emptied) and a new launch_action_plan RPC message will be sent. That way, any available applier will pick it up and execute it.12:36
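A compact sketch of the two agreed behaviors. Function and field names here are hypothetical, not watcher's actual objects or RPC API:

```python
# Hypothetical sketch of the agreed failure handling: cancel ONGOING
# plans with a status message, unassign PENDING ones and re-announce them.
def handle_failed_applier(action_plans, failed_host, send_launch_rpc):
    for ap in action_plans:
        if ap["hostname"] != failed_host:
            continue
        if ap["state"] == "ONGOING":
            ap["state"] = "CANCELLED"
            ap["status_message"] = ("applier %s detected as failed"
                                    % failed_host)
        elif ap["state"] == "PENDING":
            ap["hostname"] = None  # any available applier can pick it up
            send_launch_rpc(ap)
    return action_plans
```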
sean-k-mooney+1 that matches what I recall12:38
sean-k-mooneythis personally feels more like it should have a spec, to be honest12:38
sean-k-mooneyhave we thought about the upgrade impact here?12:39
sean-k-mooneywe are not changing the rpc interface I guess12:39
amoralejno, we are not12:39
sean-k-mooneybut we are changing the semantics of how it works12:39
amoralejin which sense?12:40
sean-k-mooneywe are changing the semantics of the cancellation12:40
dviroelI'm still thinking about the action plans going from ONGOING to CANCELLED. We will need to do something with the Actions too, right?12:40
amoralejI don't think so12:40
amoralejnote that, that transition already exist12:40
sean-k-mooneyso we are talking about a case where the applier dies12:41
amoralejactually, that behavior occurs currently when an applier is started and it detects previous actionplans were ONGOING on it12:41
sean-k-mooneyso the current applier that was processing it won't be running, so the action will be left in ongoing12:41
sean-k-mooneyamoralej: right 12:41
sean-k-mooneythat's what I meant about changing the semantics12:41
sean-k-mooneynow you're suggesting that we cancel the ongoing action plan12:42
dviroelright, because a Cancel today is handled by the currently running threads in the applier. So if another applier starts and cancels the AP, I'm not sure what happens with the Actions12:42
dviroeli don't really remember12:42
sean-k-mooneybut the current action would not be canceled12:42
jgilaberso currently the action plan would be left ongoing indefinitely?12:43
sean-k-mooneywe likely need to move the action to failed as part of canceling the action plan12:43
sean-k-mooneyjgilaber: yes12:43
dviroelthis ^12:43
sean-k-mooneyuntil the applier is restarted12:43
amoralejyes ^12:43
sean-k-mooneywhich we should not rely on12:43
amoralejexactly, because it may affect scale-down scenarios, i.e.12:44
dviroelack12:44
amoralejhttps://github.com/openstack/watcher/blob/master/watcher/applier/sync.py#L53-L74  12:44
sean-k-mooneyso the reason I'm asking about upgrade is we need to be sure this works if you have not upgraded all instances of the applier to the same version12:44
amoralejthat's what is used today12:44
sean-k-mooneyi.e. if we have 2025.1 appliers they still need to work with master decision engines or apis12:45
amoralejbtw, it cancels actionplans in ONGOING and actions in PENDING or ONGOING12:45
sean-k-mooneyack12:45
sean-k-mooneyso is your proposal to invoke the same behavior12:46
amoralejyes12:46
amoralejfor ONGOING12:46
sean-k-mooneyfrom the elected leader12:46
sean-k-mooneyof the appliers that is doing the monitoring12:46
sean-k-mooneyor will this be done by the decision engine?12:47
amoralejthat's my doubt12:47
sean-k-mooneyor if we add a watcher conductor in the future, will it be done by that12:47
amoralejif we should add this to the existing monitor in the decision engine or add a new one12:47
amoralejat this point, i don't want to add a new conductor service only for this, tbh12:47
amoralejas this would be a big impact for deployment12:48
sean-k-mooneywell I think we are building up more and more use cases for the conductor12:48
amoralejwhich, i think it's too much for this12:48
sean-k-mooneywe may be able to punt on adding it for this cycle12:48
jgilaberI think it should either be in the applier or a separate service, I would not want to couple decision engines and applier like this12:48
sean-k-mooneybut I think we will have to seriously consider making that change next cycle12:48
amoralejthe other option, would be to add a new service to the applier, similar to what we do in the decision_engine12:49
sean-k-mooneyjgilaber: ya same12:49
sean-k-mooneyamoralej: yes, that's what we mean by in the applier: mirroring the approach in the decision engine and the same leader election approach12:49
amoraleji may create a base service class with common code (leader election, get_service_status, etc.)12:49
amoralejwfm12:49
dviroel+1 12:49
sean-k-mooneyya, sharing that makes sense. at least in the short term12:50
jgilaber+1 for the reuse12:50
sean-k-mooneyamoralej: on the testing side, how are you planning to test this12:50
sean-k-mooneyI don't think this should be tested in tempest12:50
amoralejyeah12:50
sean-k-mooneythis is a case where, if we had the functional testing enhancements, it would be very useful12:50
amoralejso, we can create two appliers as we do with decision_engine, to at least see that both are started and leader election works12:51
sean-k-mooneyamoralej: yes, we can enable multiple appliers in the jobs12:51
amoralejtesting real cases of failing over, etc. requires other kinds of testing than tempest, i agree12:51
sean-k-mooneyI just don't want to use a post playbook or tempest test to kill one and see that it falls over12:51
sean-k-mooney*fails over12:51
sean-k-mooneytempest is a very poor choice for this type of fault injection / disaster recovery type testing12:52
sean-k-mooneybut I do want us to test it in a functional test if we can, or a unit test failing that12:53
amoralejyou can see the kind of unit testing I'm doing in https://review.opendev.org/c/openstack/watcher/+/963252/ for decision-engine12:54
amoralejI'd do similar one for the applier12:54
amoralejuntil we can do something better12:54
amoralejwith functional testing12:54
sean-k-mooneyack, if you factor out the common code this would also be shared12:54
sean-k-mooneywe can test both explicitly too12:54
sean-k-mooneybut the leader election part can be common12:54
amoralejyep12:55
sean-k-mooneythe specifics of canceling the action/action plan will be different12:55
amoralejexactly12:55
amoralejso, based on that patch, the monitor_services_status method would be specific12:55
amoralejall the rest would be shared i think12:55
amoralejI need to double check, anyway12:56
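A sketch of the factoring discussed above: the shared monitoring loop and leader-election hook in a base class, with the service-specific reaction (here called `monitor_services_status`, after the method mentioned for the decision-engine patch) left to each subclass. Everything beyond that method name is illustrative, not actual watcher code:

```python
import abc

# Illustrative sketch of the shared base class idea. The common loop
# detects FAILED peers and, when this instance is the elected leader,
# delegates to the service-specific handler.
class ServiceMonitorBase(abc.ABC):
    def check(self, services):
        failed = [s for s in services if s["status"] == "FAILED"]
        if failed and self.is_leader():
            self.monitor_services_status(failed)
        return failed

    def is_leader(self):
        # shared leader-election logic would live here; stubbed out
        return True

    @abc.abstractmethod
    def monitor_services_status(self, failed_services):
        """Service-specific reaction, e.g. cancel/reassign action plans."""
```

The decision engine and the applier would each subclass this, which also lets the common leader-election piece be unit-tested once, as suggested.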
dviroelack, so we keep the blueprint or should we consider a spec for this effort?12:56
amoralejmy inclination toward a blueprint was based on the fact that it's actually not changing apis, config options or deployment factors12:57
sean-k-mooneyone of my rules of thumb is the bus factor12:57
jgilaberjudging by the discussion here it looks like the change would be fairly contained in one place, so I would be ok without a spec12:57
sean-k-mooneyare we confident, based on the description, that if alfredo disappears tomorrow we could complete the work12:57
sean-k-mooneythis is probably pushing what can be done without one, but it seems the overall consensus is to proceed with it as specless for now12:58
sean-k-mooneyso i guess if we agree we can approve12:58
sean-k-mooneygoing once12:59
dviroel+1 to continue with the bp for now12:59
sean-k-mooneytwice12:59
jgilaber+112:59
sean-k-mooneycool12:59
* dviroel time check13:00
dviroelamoralej: let's circle back to this BP in following meetings if needed13:00
amoralejack, thanks13:00
dviroeljgilaber: want to quickly cover your spec or move to next week?13:01
jgilaberwe can move the discussion to next week and just announce that I pushed a spec for the change to use openstacksdk13:01
dviroelwe can just call for reviews for now then13:01
dviroelack13:02
dviroel#topic Reviews13:02
dviroelnot going to cover all of them13:02
dviroeljust asking folks to review the ones that are listed there13:02
dviroelalso please consider adding them to the status etherpad13:02
dviroel#link https://etherpad.opendev.org/p/watcher-2026.1-status  13:02
dviroelyou can better organise your patches there if needed13:03
dviroeleasy for review to follow up13:03
dviroels/review/reviewers13:03
dviroeljgilaber: going to move your LP to next week too13:04
jgilabersounds good, thanks13:04
dviroelotherwise we can continue discussing it here after the meeting13:04
dviroel#topic Volunteers to chair next meeting13:04
dviroelthanks jgilaber for volunteering again o/13:04
dviroellet's wrap up for today13:04
dviroelwe will meet again next week13:04
dviroelthank you all for participating13:05
dviroel#endmeeting13:05
opendevmeetMeeting ended Thu Nov 27 13:05:05 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)13:05
opendevmeetMinutes:        https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.html13:05
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.txt13:05
opendevmeetLog:            https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.log.html13:05
jgilaberthanks dviroel!13:05
dviroelthanks folks o/13:05
morenodthanks dviroel++13:05
amoralejthanks dviroel++13:05
chandankumardviroel++ thanks!13:05
opendevreviewDavid proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller)  https://review.opendev.org/c/openstack/watcher/+/967331  13:09
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Skip migrate actions in pre_condition phase  https://review.opendev.org/c/openstack/watcher/+/966699  13:11
opendevreviewchandan kumar proposed openstack/watcher-tempest-plugin master: Remove deprecated client_functional tests  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968651  13:26
opendevreviewDouglas Viroel proposed openstack/watcher-tempest-plugin master: Consolidate and improve Zuul CI job definitions  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968247  14:00
opendevreviewDavid proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller)  https://review.opendev.org/c/openstack/watcher/+/967331  14:02
opendevreviewDavid proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller)  https://review.opendev.org/c/openstack/watcher/+/967331  15:53
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Make VM migrations timeout configurable and apply reasonable defaults  https://review.opendev.org/c/openstack/watcher/+/967693  16:35
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Make VM resize timeout configurable with migration defaults  https://review.opendev.org/c/openstack/watcher/+/968610  16:35
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add documentation section for actions  https://review.opendev.org/c/openstack/watcher/+/968025  17:01
opendevreviewDouglas Viroel proposed openstack/watcher master: Remove eventlet-based timeout in CDM collectors  https://review.opendev.org/c/openstack/watcher/+/968568  18:54
opendevreviewDouglas Viroel proposed openstack/watcher-tempest-plugin master: Wait action plan to finish before asserting state  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968750  19:57

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!