Thursday, 2025-11-27

opendevreviewAlfredo Moralejo proposed openstack/watcher master: Skip migrate actions in pre_condition phase  https://review.opendev.org/c/openstack/watcher/+/966699  07:29
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Make VM resize timeout configurable with migration defaults  https://review.opendev.org/c/openstack/watcher/+/968610  08:45
opendevreviewJoan Gilabert proposed openstack/watcher-specs master: Add specification for migrating Watcher to OpenStackSDK  https://review.opendev.org/c/openstack/watcher-specs/+/968023  10:18
dviroelhi all o/, watcher meeting will start in 10m  11:50
dviroel#startmeeting watcher12:00
opendevmeetMeeting started Thu Nov 27 12:00:40 2025 UTC and is due to finish in 60 minutes.  The chair is dviroel. Information about MeetBot at http://wiki.debian.org/MeetBot.12:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.12:00
opendevmeetThe meeting name has been set to 'watcher'12:00
dviroelhi all o/, who is around today?12:00
jgilabero/12:01
chandankumaro/12:01
amoralejo/12:01
morenodo/12:02
dviroelcourtesy ping list: sean-k-mooney12:02
dviroelty for joining o/12:02
dviroellets start with today's meeting agenda12:02
dviroel#link https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L26 (Meeting agenda)12:03
dviroelfeel free to add your own topics to the agenda12:03
dviroelwe already have a few there12:03
dviroel#topic Eventlet removal12:04
dviroeli am missing some updates in the eventlet removal effort since vPTG12:04
dviroelin the vPTG we had some discussions about next steps (applier mainly)12:04
dviroeland we have a patch ready for review in the applier12:05
dviroel#link https://review.opendev.org/c/openstack/watcher/+/966226 (Adds support for threading mode in applier)12:05
dviroelthis one was already reviewed by most of you12:05
dviroelthe tl;dr; in this one is12:06
dviroelwe found many issues with concurrent database access from multiple threads using taskflow parallel engine, and12:06
sean-k-mooneyo/12:06
dviroelwe had to switch to the serial engine when using native threads12:06
dviroelwe will have this limitation on how many concurrent actions can be executed for now12:07
dviroelbut this limitation will be treated in a different patch and launchpad bug, once we merge this first one12:07
sean-k-mooneyactually12:08
sean-k-mooneyone thing I want to test is if we can go back to parallel now that I changed how the db fixtures work12:08
sean-k-mooneyI don't think we will get lucky enough for that to fix it but it's worth a try12:08
dviroeli think that I tried that locally, and hit the same issue12:09
sean-k-mooneyack12:09
dviroelbut I can try again to make sure12:09
dviroelthanks sean-k-mooney, who helped in debugging and proposing a solution for this problem12:10
dviroelwe had lots of discussions here in the irc during that time12:10
dviroelcontinuing with this patch12:11
sean-k-mooneyI think the serial approach is fine for now but we will want to address that when looking to remove the ability to run in eventlet mode next cycle12:11
sean-k-mooneyunless we make progress on the other refactors for scaling12:11
sean-k-mooneyi.e. distributing actions, not action plans, to appliers, or the other enhancements that may make this a non-issue12:12
dviroel+1 - it can be treated as scaling debt for native thread mode12:12
amoralej+112:12
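The taskflow issue discussed above belongs to a general class of problem: many DB handles are not safe to share across native threads. A minimal stdlib-only illustration of that class of failure (this is not watcher or taskflow code; sqlite3 just makes the hazard easy to show, since its connections are bound to their creating thread by default):

```python
import sqlite3
import threading

# Illustrative only: watcher's actual issue involves taskflow's parallel
# engine plus its DB layer, not sqlite3 directly. This simply shows why
# sharing one DB connection across worker threads can fail outright.
conn = sqlite3.connect(":memory:")
errors = []

def worker():
    try:
        conn.execute("SELECT 1")  # connection was created in the main thread
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)

t = threading.Thread(target=worker)
t.start()
t.join()
print(len(errors))  # → 1: the cross-thread access was rejected
```

Giving each thread its own connection/session (or, for sqlite3, passing `check_same_thread=False` and serializing access yourself) avoids the error; falling back to the serial engine sidesteps that rework for now.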
dviroelnote that in 966226 we also change some ci jobs12:13
dviroelopenstack-tox-py312-threading  will now run all unit tests in threading mode and12:13
dviroelwatcher-prometheus-integration-threading will disable eventlet monkey patching for the applier too12:13
dviroelplease take a moment to review this one when possible12:13
dviroelnext patch in the eventlet series:12:13
dviroelanother small change in CI, to disable eventlet patching in the api12:14
dviroel#link https://review.opendev.org/c/openstack/watcher/+/967769 (Disable eventlet patching for api service in threading job)12:14
dviroelso we should end with threading job running with all native thread soon12:14
dviroelIt is a small patch to review/approve, PTAL when free12:14
dviroelfinally, there is one more12:15
dviroelit's a leftover from the decision-engine work12:15
dviroelit's marked as WIP for now12:15
dviroelit's the cluster data model collector timeout for native thread mode12:16
dviroeltoday it only works in eventlet mode (and maybe in a contradictory way, using collector's 'period' as timeout)12:16
dviroeli sent a first proposal yesterday, to include a new timeout for some collector operations, and to remove the eventlet timeout from the code:12:17
dviroel#link https://review.opendev.org/c/openstack/watcher/+/968568 (Remove eventlet-based timeout in CDM collectors)12:17
amoralejI think the use of period as timeout was on purpose to make sure two collector syncs don't overlap12:17
amoralejthat was my understanding, although dunno if that's correct12:18
dviroelamoralej: right, but still could be a different config12:18
dviroelI'm working on some unit  testing and testing it locally too12:19
amoralejyes, but then we should make sure that timeout is < period12:19
dviroel+112:19
amoralejunless it ensures that a new collection is not executed until the previous one finishes12:20
amoralejwhich may be the case, i'm not sure12:20
dviroelyes, the scheduler will not start a new one if one is still running.12:20
dviroelapscheduler logs when this happens12:21
amoralejack, then the timeout may technically even be > the period (although it may not make much sense)12:21
dviroelbut it makes sense to compare these configs and raise a warning at least12:21
dviroelyeah, it will just skip that time, and try again in the next period12:22
dviroelok, so if you folks want to take a look and comment, it will be great12:23
dviroelbut we can circle back to this CDM change next week in more detail I think12:23
dviroelthat's what I have in eventlet topic for now12:23
dviroellets move to the next one then12:24
* dviroel waits 1 min12:24
dviroel#topic new blueprint for automatic skip actions on pre_condition (second pass)12:25
dviroelhey amoralej o/12:25
amoralejthis is follow up of previous discussion12:25
dviroelack, from last week12:25
amoralejyes12:25
amoraleji updated the blueprint as discussed https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition  12:26
amoralejI added some more high level details of the conditions covered and added a sentence about the documentation12:26
amoralejIf it's fine for you, i'd like to get it approved12:27
dviroeloh ok, so we should review that and approve12:28
jgilaberI think it captures well what we've discussed previously12:28
sean-k-mooneyamoralej: so my only real comment is I asked for the conditions to be listed in the blueprint rather than in the review12:29
sean-k-mooneywe can do it in the review but the reason for a blueprint and spec is to agree on that before the implementation is done12:30
amoralejI added12:30
amoralejThe high level conditions to be checked for each action are:12:30
amoralej- The instance or volume to be acted on does not exist.12:30
amoralej- In a migration action, the instance is not running in source_node (tbd if that should be skipping or failing)12:30
amoralej- Destination pool or host, if explicitly indicated, does not exist or is disabled (that will lead to a FAILED action)12:30
sean-k-mooneyi think we have enough to move forward with12:30
amoralej- Required element statuses for the action are met, i.e. an instance is ACTIVE before starting a live migration. Otherwise the action wil12:30
amoralejyep12:30
sean-k-mooneyright, I wanted you to say if it would be skipped or failed for each condition as well12:31
sean-k-mooneybut since we agreed to add action docs12:31
sean-k-mooneywe can just capture it there 12:31
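The conditions listed above boil down to a skip-vs-fail decision per action in the pre_condition phase. A hypothetical sketch of that decision (function name, arguments, and return values are illustrative, not the blueprint's actual API):

```python
# Hypothetical sketch of the blueprint's skip-vs-fail decision; the real
# checks would live per-action in watcher's pre_condition phase.
def pre_condition_result(resource_exists, resource_status,
                         required_status="ACTIVE"):
    if not resource_exists:
        return "SKIPPED"   # instance/volume is gone: nothing to do
    if resource_status != required_status:
        return "FAILED"    # e.g. not ACTIVE before a live migration
    return "OK"            # proceed with the action

print(pre_condition_result(False, "ACTIVE"))  # → SKIPPED
```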
amoralejbtw, I've sent a review https://review.opendev.org/c/openstack/watcher/+/968025 to add actions documentation12:31
sean-k-mooneyack, we should complete that first as a prerequisite for the auto skipping12:32
sean-k-mooneyare there any objections to approving this?12:32
sean-k-mooneyif not I'll give it another 30 seconds and make the changes in launchpad12:33
dviroellgtm12:33
jgilaber+112:33
sean-k-mooneycool ill do that now and we can move on12:33
amoralejThanks12:34
dviroelthanks amoralej for the updates12:34
dviroelnext topic then amoralej ?12:35
amoralejyes12:35
dviroel#topic new blueprint for Applier service failure management12:35
amoralej#link https://blueprints.launchpad.net/watcher/+spec/monitor-failed-appliers  12:35
amoralejthis also comes from PTG discussion12:35
amoralejit's about implementing applier monitoring to manage the case when appliers are detected as FAILED12:36
amoralejthe behaviour we agreed on PTG is:12:36
amoralej  - ActionPlans in ONGOING state will be cancelled and a message will be added to the status_message field of the AP.12:36
amoralej  - ActionPlans in PENDING state will be unassigned (hostname field will be emptied) and a new launch_action_plan RPC message will be sent. That way, any available applier will pick it up and execute it.12:36
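A compact sketch of the two agreed behaviors. Function and field names here are hypothetical, not watcher's actual objects or RPC API:

```python
# Hypothetical sketch of the agreed failure handling: cancel ONGOING
# plans with a status message, unassign PENDING ones and re-announce them.
def handle_failed_applier(action_plans, failed_host, send_launch_rpc):
    for ap in action_plans:
        if ap["hostname"] != failed_host:
            continue
        if ap["state"] == "ONGOING":
            ap["state"] = "CANCELLED"
            ap["status_message"] = ("applier %s detected as failed"
                                    % failed_host)
        elif ap["state"] == "PENDING":
            ap["hostname"] = None  # any available applier can pick it up
            send_launch_rpc(ap)
    return action_plans
```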
sean-k-mooney+1 that matches what I recall12:38
sean-k-mooneythis personally feels more like it should have a spec, to be honest12:38
sean-k-mooneyhave we thought about the upgrade impact here?12:39
sean-k-mooneywe are not changing the rpc interface I guess12:39
amoralejno, we are not12:39
sean-k-mooneybut we are changing the semantics of how it works12:39
amoralejin which sense?12:40
sean-k-mooneywe are changing the semantics of the cancellation12:40
dviroelI'm still thinking about the action plans going from ONGOING to CANCELLED. We will need to do something with the Actions too, right?12:40
amoralejI don't think so12:40
amoralejnote that, that transition already exist12:40
sean-k-mooneyso we are talking about a case where the applier dies12:41
amoralejactually, that behavior occurs currently when an applier is started and it detects previous actionplans were ONGOING on it12:41
sean-k-mooneyso the current applier that was processing it won't be running, so the action will be left in ongoing12:41
sean-k-mooneyamoralej: right 12:41
sean-k-mooneythat's what I meant about changing the semantics12:41
sean-k-mooneynow you're suggesting that we cancel the ongoing action plan12:42
dviroelright, because a Cancel today is handled by the currently running threads in the applier. So if another applier starts and cancels the AP, I'm not sure what happens with the Actions12:42
dviroeli don't really remember12:42
sean-k-mooneybut the current action would not be canceled12:42
jgilaberso currently the action plan would be left ongoing indefinitely?12:43
sean-k-mooneywe likely need to move the action to failed as part of canceling the action plan12:43
sean-k-mooneyjgilaber: yes12:43
dviroelthis ^12:43
sean-k-mooneyuntil the applier is restarted12:43
amoralejyes ^12:43
sean-k-mooneywhich we should not rely on12:43
amoralejexactly, because it may affect scale-down scenarios, i.e.12:44
dviroelack12:44
amoralejhttps://github.com/openstack/watcher/blob/master/watcher/applier/sync.py#L53-L74  12:44
sean-k-mooneyso the reason I'm asking about upgrade is we need to be sure this works if you have not upgraded all instances of the applier to the same version12:44
amoralejthat's what is used today12:44
sean-k-mooneyi.e. if we have 2025.1 appliers they still need to work with master decision engines or apis12:45
amoralejbtw, it cancels actionplans in ONGOING and actions in PENDING or ONGOING12:45
sean-k-mooneyack12:45
sean-k-mooneyso is your proposal to invoke the same behavior12:46
amoralejyes12:46
amoralejfor ONGOING12:46
sean-k-mooneyfrom the elected leader12:46
sean-k-mooneyof the appliers that is doing the monitoring12:46
sean-k-mooneyor will this be done by the decision engine?12:47
amoralejthat's my doubt12:47
sean-k-mooneyor if we add a watcher conductor in the future, will it be done by that12:47
amoralejif we should add this to the existing monitor in the decision engine or add a new one12:47
amoralejat this point, i don't want to add a new conductor service only for this, tbh12:47
amoralejas this would be a big impact for deployment12:48
sean-k-mooneywell I think we are building up more and more use cases for the conductor12:48
amoralejwhich, i think it's too much for this12:48
sean-k-mooneywe may be able to punt on adding it for this cycle12:48
jgilaberI think it should either be in the applier or a separate service, I would not want to couple decision engines and applier like this12:48
sean-k-mooneybut I think we will have to seriously consider making that change next cycle12:48
amoralejthe other option, would be to add a new service to the applier, similar to what we do in the decision_engine12:49
sean-k-mooneyjgilaber: ya same12:49
sean-k-mooneyamoralej: yes, that's what we mean by in the applier: mirroring the approach in the decision engine and the same leader election approach12:49
amoraleji may create a base service class with common code (leader election, get_service_status, etc.)12:49
amoralejwfm12:49
dviroel+1 12:49
sean-k-mooneyya, sharing that makes sense. at least in the short term12:50
jgilaber+1 for the reuse12:50
sean-k-mooneyamoralej: on the testing side, how are you planning to test this12:50
sean-k-mooneyI don't think this should be tested in tempest12:50
amoralejyeah12:50
sean-k-mooneythis is a case where, if we had the functional testing enhancements, it would be very useful12:50
amoralejso, we can create two appliers as we do with decision_engine, to at least see that both are started and leader election works12:51
sean-k-mooneyamoralej: yes, we can enable multiple appliers in the jobs12:51
amoralejtesting real cases of failing over, etc. requires other kinds of testing than tempest, i agree12:51
sean-k-mooneyI just don't want to use a post playbook or tempest test to kill one and see that it falls over12:51
sean-k-mooney*fails over12:51
sean-k-mooneytempest is a very poor choice for this type of fault injection / disaster recovery type testing12:52
sean-k-mooneybut I do want us to test it in a functional test if we can, or a unit test failing that12:53
amoralejyou can see the kind of unit testing I'm doing in https://review.opendev.org/c/openstack/watcher/+/963252/ for decision-engine12:54
amoralejI'd do similar one for the applier12:54
amoralejuntil we can do something better12:54
amoralejwith functional testing12:54
sean-k-mooneyack, if you factor out the common code this would also be shared12:54
sean-k-mooneywe can test both explicitly too12:54
sean-k-mooneybut the leader election part can be common12:54
amoralejyep12:55
sean-k-mooneythe specifics of canceling the action/action plan will be different12:55
amoralejexactly12:55
amoralejso, based on that patch, the monitor_services_status method would be specific12:55
amoralejall the rest would be shared i think12:55
amoralejI need to double check, anyway12:56
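A sketch of the factoring discussed above: the shared monitoring loop and leader-election hook in a base class, with the service-specific reaction (here called `monitor_services_status`, after the method mentioned for the decision-engine patch) left to each subclass. Everything beyond that method name is illustrative, not actual watcher code:

```python
import abc

# Illustrative sketch of the shared base class idea. The common loop
# detects FAILED peers and, when this instance is the elected leader,
# delegates to the service-specific handler.
class ServiceMonitorBase(abc.ABC):
    def check(self, services):
        failed = [s for s in services if s["status"] == "FAILED"]
        if failed and self.is_leader():
            self.monitor_services_status(failed)
        return failed

    def is_leader(self):
        # shared leader-election logic would live here; stubbed out
        return True

    @abc.abstractmethod
    def monitor_services_status(self, failed_services):
        """Service-specific reaction, e.g. cancel/reassign action plans."""
```

The decision engine and the applier would each subclass this, which also lets the common leader-election piece be unit-tested once, as suggested.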
dviroelack, so we keep the blueprint or should we consider a spec for this effort?12:56
amoralejmy inclination toward a blueprint was based on the fact that it's actually not changing apis, config options or deployment factors12:57
sean-k-mooneyone of my rules of thumb is the bus factor12:57
jgilaberjudging by the discussion here it looks like the change would be fairly contained in one place, so I would be ok without a spec12:57
sean-k-mooneyare we confident, based on the description, that if alfredo disappears tomorrow we could complete the work12:57
sean-k-mooneythis is probably pushing what can be done without one, but it seems the overall consensus is to proceed with it as specless for now12:58
sean-k-mooneyso i guess if we agree we can approve12:58
sean-k-mooneygoing once12:59
dviroel+1 to continue with the bp for now12:59
sean-k-mooneytwice12:59
jgilaber+112:59
sean-k-mooneycool12:59
* dviroel time check13:00
dviroelamoralej: let's circle back to this BP in following meetings if needed13:00
amoralejack, thanks13:00
dviroeljgilaber: want to quickly cover your spec or move to next week?13:01
jgilaberwe can move the discussion to next week and just announce that I pushed a spec for the change to use openstacksdk13:01
dviroelwe can just call for reviews for now then13:01
dviroelack13:02
dviroel#topic Reviews13:02
dviroelnot going to cover all of them13:02
dviroeljust asking folks to review the ones that are listed there13:02
dviroelalso please consider adding them to the status etherpad13:02
dviroel#link https://etherpad.opendev.org/p/watcher-2026.1-status  13:02
dviroelyou can better organise your patches there if needed13:03
dviroeleasy for review to follow up13:03
dviroels/review/reviewers13:03
dviroeljgilaber: going to move your LP to next week too13:04
jgilabersounds good, thanks13:04
dviroelotherwise we can continue discussing it here after the meeting13:04
dviroel#topic Volunteers to chair next meeting13:04
dviroelthanks jgilaber for volunteering again o/13:04
dviroellet's wrap up for today13:04
dviroelwe will meet again next week13:04
dviroelthank you all for participating13:05
dviroel#endmeeting13:05
opendevmeetMeeting ended Thu Nov 27 13:05:05 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)13:05
opendevmeetMinutes:        https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.html13:05
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.txt13:05
opendevmeetLog:            https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.log.html13:05
jgilaberthanks dviroel!13:05
dviroelthanks folks o/13:05
morenodthanks dviroel++13:05
amoralejthanks dviroel++13:05
chandankumardviroel++ thanks!13:05
opendevreviewDavid proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller)  https://review.opendev.org/c/openstack/watcher/+/967331  13:09
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Skip migrate actions in pre_condition phase  https://review.opendev.org/c/openstack/watcher/+/966699  13:11
opendevreviewchandan kumar proposed openstack/watcher-tempest-plugin master: Remove deprecated client_functional tests  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968651  13:26
opendevreviewDouglas Viroel proposed openstack/watcher-tempest-plugin master: Consolidate and improve Zuul CI job definitions  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968247  14:00
opendevreviewDavid proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller)  https://review.opendev.org/c/openstack/watcher/+/967331  14:02
opendevreviewDavid proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller)  https://review.opendev.org/c/openstack/watcher/+/967331  15:53
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Make VM migrations timeout configurable and apply reasonable defaults  https://review.opendev.org/c/openstack/watcher/+/967693  16:35
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Make VM resize timeout configurable with migration defaults  https://review.opendev.org/c/openstack/watcher/+/968610  16:35
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add documentation section for actions  https://review.opendev.org/c/openstack/watcher/+/968025  17:01
opendevreviewDouglas Viroel proposed openstack/watcher master: Remove eventlet-based timeout in CDM collectors  https://review.opendev.org/c/openstack/watcher/+/968568  18:54
opendevreviewDouglas Viroel proposed openstack/watcher-tempest-plugin master: Wait action plan to finish before asserting state  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968750  19:57

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!