Thursday, 2025-11-20

rlandyhello - IRC meeting will be in about 90 mins ... please add any topics and reviews to https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L3910:15
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default  https://review.opendev.org/c/openstack/watcher/+/96769310:23
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default  https://review.opendev.org/c/openstack/watcher/+/96769310:25
rlandy#startmeeting IRC Watcher meeting 20 November 202512:00
opendevmeetMeeting started Thu Nov 20 12:00:45 2025 UTC and is due to finish in 60 minutes.  The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot.12:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.12:00
opendevmeetThe meeting name has been set to 'irc_watcher_meeting_20_november_2025'12:00
rlandyhello all ...12:00
rlandywho is around?12:00
rlandycourtesy ping list: dviroel amoralej jgilaber sean-k-mooney chandankumar morenod12:01
amoralejo/12:01
jgilabero/12:01
sean-k-mooneyo/12:01
rlandylet's get started ... 12:02
rlandy#topic:  new blueprint for automatic skip actions on pre_condition12:03
rlandy#link: https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition12:03
amoralejthat's mine12:03
rlandy#link https://review.opendev.org/c/openstack/watcher/+/96669912:03
rlandypls go ahead amoralej 12:03
chandankumaro/12:03
amoralejso, I started working on it for the migrate action in https://review.opendev.org/c/openstack/watcher/+/96669912:03
amoralejso, I'd like to discuss about the blueprint itself and check if it can get approved as-is or I need something else12:04
sean-k-mooneyso before we approve it we shoudl really agree on the scope12:04
amoralejgiven that it does not imply change in API, etc... I have not created spec so far12:05
sean-k-mooneyi.e. what action will this chagne and what will the be pre-codnions for each12:05
amoralejfor the actions, all where skipping condition is relevant, migrate, change_nova_service_state, volume_migrate, stop12:06
amoralejand resize i guess (although i don't know if something is using resize, tbh)12:06
sean-k-mooneyso the description shoudl be updated with a list of the actions and for each action ideally a high level list of the precondtoins12:06
sean-k-mooneyi.e. the vm/volume must exist12:07
sean-k-mooneythe souce/dest shoudl exist (state when it is an error vs skipped12:07
amoralejok, about skipping conditions I think keeping a high level is good, but i guess we will need to discuss details for each one on each review12:07
sean-k-mooneyover all it does not need to be very long but it shoudl be more detailed then it is currently12:07
sean-k-mooneyyes, we can defer the detials of the exact behavior to the review12:08
sean-k-mooneybut we need to capture the overall scope fo the changes beter12:08
amoralejsure12:09
sean-k-mooneyif the description was 2 -2.5 times its curentl lenght that would proably caputre all that is need12:09
amoralejI get the point12:10
sean-k-mooneyi have targeted it to 2026.1 and set the defintion to dicssion12:10
amoralejIn the same reviews, can we also managed conditions where the action should be FAILED?12:10
amoralejin pre_condition, i mean12:10
sean-k-mooneywe can yes12:11
sean-k-mooneywe proably shoudl have 1 commit per action  but if you want to do the skip cases first and the fail cases after thats ok too12:11
sean-k-mooneydoes anyone have concerns with tthe over all intent or direction?12:11
amoralejUnless you requires it, imo one review per action would be fine12:12
amoralejor it gets too complicated12:12
sean-k-mooneydo we feel comfortable in general with a specless blueprint or would folks like a spec?12:12
jgilaber+1, unless we need large changes for some action one review per action seems fine12:12
sean-k-mooneythen lets loop back to it next week and if we are happy we can approve it12:12
jgilaberalso +1 to specless blueprint we discussed the skip feature well enough last cycle12:13
amoralejwrt test coverage12:13
amoralejwe want to cover that with tempest testing?12:13
sean-k-mooneyit sort of depends. we may want 1 test im not sure we need to test all the skip condtion in tempest12:14
amoralejmorenod already created a wip review for the migrate changes using the actuator strategy with a list of actions where we can simulate12:14
sean-k-mooneyi woudl hope we can comprehelsive test this with unit/funcitonal tests12:14
morenod#link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/96686012:14
amoralejyeah, i have doubts if we will be able to simulate all, but for some cases i think it's doable and good12:15
amoralejso probably we need to check on each case12:15
sean-k-mooneyim kind of concerned by that test12:15
amoralejwhat is your concern?12:15
sean-k-mooneywell it looks liek we have several bugs12:15
sean-k-mooneyfor one it shoudl not be possibel to generate a action plan that refence a non existing instnace12:16
sean-k-mooneyor a souce nod that does not exist12:16
sean-k-mooneynow the instnace could be on a diffent host the the refeence source host12:17
sean-k-mooneythat valid12:17
amoralejan instance may be removed after the action plan is created12:17
sean-k-mooneycorrect bu tin this case it never existd12:18
amoralejnote that the actuator strategy is just a test-only strategy which is proxy to create any action12:18
amoralejthere is no logic at all in the strategy, but it is usefull to do this kind of simulations to test action logic imo12:18
sean-k-mooneyyes but it shoudl not allow you to put in a fake instance id12:18
sean-k-mooneyit shoudl assort that the inste exists in the model or nova12:19
sean-k-mooneyso that to me is a very big input validation problem12:19
amoralejbut then, the actuator strategy need to have specific per-action logic12:19
sean-k-mooneywe cant assuem someone wont use thsi in production12:19
amoralejthen, we should document that :)12:19
sean-k-mooneyso we have to validate all actions it create agaisnt the model 12:19
sean-k-mooneywell this is borderlien a secuirty problem12:20
sean-k-mooneyso i dont think documentaiton is enough here12:20
amoralejwe could create the vm -> actuator with the action to get the AP -> remove the vm12:20
sean-k-mooney^ would be valid12:20
amoralejactually, that's something i've though in the past12:20
amoralejhiding strategies12:20
amoralejit'd be good to disable / hide test-only strategies and goals12:21
sean-k-mooneyyes it would or provide a way to jsut filter them in the config in general12:21
sean-k-mooneywe can have an enabeld_stragey or enabled_goal config opiton12:21
sean-k-mooneynova and cidner both do that for there schduler filters and neutorn does it for there extentions12:22
amoralejimo, for testing is good to have an strategy that simply copies actions, but i agree it'd be good to have a mechanism to make sure those are not used in production12:22
sean-k-mooneyso going back to testing12:22
amoralejsame for dummy, nop12:22
amoralejyeah, let's go back12:23
sean-k-mooneyif we emulate a reals env by creating the 2 vms, creating the action plan and deleteing one then movign the other12:23
sean-k-mooneythis would be a vaild test again12:23
sean-k-mooneybut i think we need a bug for this lack of input validation and it should returnn a 400 in the future12:24
amoralejI disagree but that's probably another discussion :)12:24
amoraleji think it's good to have that option (running actions with fake data) only for testing12:25
sean-k-mooneyok but im really sondering if i shoudl be opening a security bug or not.12:25
sean-k-mooneyraw passhtough of actions without verifcaiton is not ok in my opipion12:25
sean-k-mooneywe can dicuss that seperatly12:26
amoralejyes, +112:26
amoralejso, let's say we will cover tempest where reasonable, and will discuss on each review12:26
amoralejin parallel, we can create a bug for the actuator and discuss12:27
sean-k-mooney+112:27
amoralejlast discussion, wrt the specific cases where we should SKIP or FAIL for the migrate I left a comment in https://review.opendev.org/c/openstack/watcher/+/966699 where we can discuss12:28
amoralejunless you want to discuss it now12:28
rlandythere are no other topics or reviews (juts bug triage) so up to you12:29
amoralejactually, I think we kind of agree in everything other that the case where source_node != current_node12:29
sean-k-mooneyeither works for me12:29
amoralejI understand that instance do not exist -> SKIPPED is clear12:30
amoralejdestination_node does not exist or is disabled -> FAILED12:30
amoralejInstance status != ACTIVE in migration_type = Live -> FAILED12:30
sean-k-mooneyto me i see the following  options mark it as failed ignore it and move it to the selecte destiaon or skip if tis already on the selected destionat12:30
sean-k-mooneyamoralej: basiclly im not sure you shoud leb looking at the souce node when moving it other then to confirm it not currently on the destination node12:32
amoralejMy logic to incline to SKIP is that in most (or all) cases, the instance is moved because of the host where it is running, i.e. host_maintenance, zone_migrations or consolidating strategies are migrating the instances because of the host where they are running. 12:32
sean-k-mooneyif it moved out of band to the destiation node then its valid to skip12:32
amoralejthere is no point in doing a migration for a host_maintenance i.e. if the vm is not in that host12:33
amoralejand similar to zone_migration, even for consolidation case12:33
sean-k-mooneyso that only appleis to the zonew or host maintaince case12:33
amoralejor consolidation, intended to empty nodes12:34
sean-k-mooneybut no the vm consolidation or workload stablization case12:34
amoralejeven for workload_stabilization, we are moving because source_node it's beyond some threshold12:35
amoralejalthough that's te most complicated case, tbh12:35
amoralejsomething we may do, but that would increase the change is to allow to passed souce_node as None and do not check at all in that case12:36
amoralejso, each strategy would decide if want to check or not12:36
amoralejand if source_node != None and != current_node -> SKIPPED12:37
amoralejI'm not sure, tbh12:37
amoralejany opinions about it?12:37
sean-k-mooneyya im sort of coming around to your argumetn for skip instead of failed12:38
sean-k-mooneyyour effectivly arguing that the data tha twas used to generate the action plan was out of date12:38
amoralejexactly12:39
sean-k-mooneyso we shoudl disregard any actions related to that workload as a result12:39
sean-k-mooneyof couse if the vm is on a diffent host in the zone being drained12:39
sean-k-mooneyro a diffent host that is under mantiance then the goal will not hav ebeen achived12:39
sean-k-mooneyso fi you have any skipped action you may need to run the audit and action plan a second time12:40
sean-k-mooneyor many times until all succeed12:40
amoralejcorrect, but that goes beyond this change12:40
sean-k-mooneythat is why i was orginlly saying the action plan shoudl be failed12:40
sean-k-mooneynot skipped12:40
sean-k-mooney(an the action shoudl not be skipped but marked failed)12:40
amoralejactually, something that may be good in some cases is to recalculate the real efficiency indicators achieved by the execution of an AP12:43
amoralejat the end12:43
amoralejbut that's beyond the scope of this12:43
sean-k-mooneyso lets bring this back to the review12:43
amoralejyes12:43
sean-k-mooneyoen thing we will need to do is docuemnt this in the action docs12:43
sean-k-mooney(and create those if we dont already have them)12:44
amoralejyes, actions have no document at all12:44
amoralejon the PTG we aggreed to create a new sections for actions12:44
amoralejbtw, we may document which ones are test-only12:44
sean-k-mooneywe can yes12:44
amoralejand on each action document the skipping conditions12:44
sean-k-mooneyas i said that is not enough to adresss my concern unless we also disbale them by default and requrie cofig option ot opt in12:45
amoralejbut i don't want to mix adding a brand new documentation section with the implementation of the skipped12:45
sean-k-mooneyi do12:45
sean-k-mooneyyour change shoudl be complete12:46
sean-k-mooneyso eithe we can do the new docs first12:46
sean-k-mooneyor you shoudl do it as part of the change12:46
amoralejthat's likely the best option12:46
amoraleji mean, adding new docs first12:46
sean-k-mooneyack that also works for me12:46
sean-k-mooneysince we do not have a spec for this12:46
amoralejrequires blueprint ?12:46
sean-k-mooneywe need to have the docs 12:46
sean-k-mooneywe can do it as part fo this blueprint i think12:46
amoralejok, i will mention it in the description12:47
sean-k-mooneyalthough its gettign closer to a spec at that poitn but ya it makes sense to do this toghter as part fo the over all feature delviery12:47
sean-k-mooneyanyone have concerns with ^ plan12:47
jgilaber+1 for me to include the docs12:48
sean-k-mooney1 add simpel docs for each supproted action today, 2 enhacn the docs and each action with pre-condtions in seperate commits12:48
rlandysounds fine12:49
amoralejack, I will rewrite the description of the blueprint12:49
rlandyamoralej, did we cover all the questions and issues on this topic?12:49
amoralejyes, i think so12:49
rlandyok - any further questions/comments from anyone else?12:50
rlandymoving on ...12:51
rlandythere were no review added to  the meeting agenda12:51
rlandyso we can move to bug triage12:51
rlandy#topic: Bug triage12:51
rlandy#link: https://bugs.launchpad.net/watcher/+bug/213104312:51
rlandyopened on 2025-11-1012:52
rlandywatcher decision engine fails when evaluating previously migrated instance12:52
rlandybug logged with:12:52
rlandyjammy + caracal12:52
rlandyopenstack 2024.1 charms12:52
rlandynote attached path ... "Attached is a patch that allows the decision engine to continue when it finds the stale or missing data. Testing with this patch in the same environment has results in over 150 successful action plans with no decision engine failures."12:53
sean-k-mooneythis is realted to the other topic12:53
sean-k-mooneyalthoug hits a slighlty diffent problem statement12:54
sean-k-mooneyteh decsion engin shoudl not crash and shoudl handel the instnace not bein gin the datamodel12:54
rlandywill the problem be address with current work in progress?12:55
sean-k-mooneybut it shoudl not geenrate an action for that instance if we do not have data for it and the stragy required data12:55
sean-k-mooneylikely not no12:55
amoralejso the case is a VM is deleted in the middle of an audit execution?12:55
sean-k-mooneythis is before the action case12:55
sean-k-mooneyno the vm is moved and there is no metrics for the host currently12:56
jgilaberthis could also be because the datamodel is outdated12:56
sean-k-mooneyyes12:56
jgilaberin the bug report it mentions the instance being migrated by another action plan12:56
jgilaberthis is a problem we've seen already and solved by enabling notifications12:56
jgilaberalthough the attached patch since like a valid improvement to the strategy12:57
sean-k-mooneyi think we aslo may have fixed this key error in a newer release12:57
sean-k-mooneyat least a verian of it12:57
amoralejI'm not sure what workload_cache has tbh12:57
jgilaberI think that was in another strategy12:58
jgilaberI'm looking at the code in master and there is no mention of KeyError https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_balance.py12:58
amoralejnote it's running it with --interval 60 so, likely the model is not being refreshed between executions12:58
sean-k-mooneywe may have fixed it for the workload stablisation stratgy or anothe rone12:58
sean-k-mooneywe need to confirm if this happens on master or not12:59
rlandyack12:59
amoralejwe may suggest to enable notifications too as jgilaber mentioned12:59
sean-k-mooneyamoralej: right that hsoudl not resutl in a key error12:59
sean-k-mooneywe shoudl be robust to that12:59
amoralejright12:59
sean-k-mooneythe patch https://launchpadlibrarian.net/830569850/decision-engine.patch13:00
sean-k-mooneyjhust ignores the instance and continue the loop13:00
amoralejbut even if we do improve handling that error, it may create the plan based on incomplete data13:00
sean-k-mooneytaht might be ok but we need to repoduce this in a test and then have test coverage for it when fixed 13:00
sean-k-mooneyamoralej: well if we require metrics as aprt of the stragy and we dont have them for a workload then the workload shoudl be expclded form teh action plan13:01
amoralejactually, we may reproduce it13:01
sean-k-mooneyor we shoudl fail to produce one13:01
amoralejI'd ask about the period parameters they are using and if they have notifications13:02
sean-k-mooneythey are likely not using it that short13:02
sean-k-mooneyi think that is just to help use repoduce it13:02
sean-k-mooneywatcher shoudl nto fail without notificaiton even with short periods for continus audits13:03
sean-k-mooneyas an api use you cant know if notificaoitn are enabel or the period of the collectors13:03
rlandywhat are our next steps here? ... status: triaged? importance: high? ask reporter if they can reproduce in master (without notifications)?13:04
amoralejyes, so i see two different issues, 1 - Watcher whould be more robust on those cases 2 - What are good configuration to get the best audit results13:04
sean-k-mooneyset it to incomplete and either ask if they see this also on master or we will have to do that13:05
amoraleji.e. if no notification, continuous audits with interval < nova_colletor.period will easily lead to incomplete plans13:05
sean-k-mooneyyep13:05
sean-k-mooneybut that is ok13:05
rlandyok .. marking bug 13:05
amoralejyou want me to reply in the case?13:05
rlandyif you would like ... other I will do it after the call13:06
rlandynext ...13:06
sean-k-mooneyamoralej: sure, we have basically 2 option allow it to proceed and log a warning as they proppose13:06
sean-k-mooneyor we need to reject the audit create with a 400 becuase the notificaotn are disabeld and the period is less the nthe collcotor13:06
sean-k-mooneyi dont like the second option13:07
sean-k-mooneybecaus that is leaking privaldge infrastuctre information13:07
sean-k-mooneyi.e. the config option values13:07
amoralejyes, failing is likely too much13:08
jgilaberyes, failing the audit seems overkill in this scenario imo13:08
rlandysorry folks - we're over time - so just touching on the last bug ... 13:09
rlandy#link: https://bugs.launchpad.net/watcher/+bug/212969213:09
rlandythis one jgilaber entered13:10
rlandyas is "in progress"13:10
rlandyjgilaber, any discussion needed here?13:10
jgilaberthis bug surfaced after I implemented support for the 'src_type' parameter13:11
jgilaberI have a patch for it https://review.opendev.org/c/openstack/watcher/+/96471813:11
jgilaberin short the zone migration strategy might not select the correct destination because the get_dst_pool_and_type method does not take into account the src_type13:12
jgilabergiven that we're out of time we can discuss it another time13:12
rlandyjgilaber, ok - sure - pls bring it back next week if needed13:13
rlandythank you all for attending13:13
jgilaberthank rlandy!13:13
rlandydviroel volunteered to chair next meeting13:13
rlandy#endmeeting13:13
opendevmeetMeeting ended Thu Nov 20 13:13:32 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)13:13
opendevmeetMinutes:        https://meetings.opendev.org/meetings/irc_watcher_meeting_20_november_2025/2025/irc_watcher_meeting_20_november_2025.2025-11-20-12.00.html13:13
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/irc_watcher_meeting_20_november_2025/2025/irc_watcher_meeting_20_november_2025.2025-11-20-12.00.txt13:13
opendevmeetLog:            https://meetings.opendev.org/meetings/irc_watcher_meeting_20_november_2025/2025/irc_watcher_meeting_20_november_2025.2025-11-20-12.00.log.html13:13
opendevreviewchandan kumar proposed openstack/watcher-dashboard master: Add skip action feature with microversion support  https://review.opendev.org/c/openstack/watcher-dashboard/+/95820915:41
opendevreviewAlfredo Moralejo proposed openstack/watcher master: APISchedulingService migrate audits also on first discovery of services  https://review.opendev.org/c/openstack/watcher/+/96398115:45
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Move decision-engine monitoring service to the decision-engine  https://review.opendev.org/c/openstack/watcher/+/96325215:45

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!