#openstack-watcher log

12:00:45 <rlandy> #startmeeting IRC Watcher meeting 20 November 2025
12:00:45 <opendevmeet> Meeting started Thu Nov 20 12:00:45 2025 UTC and is due to finish in 60 minutes.  The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:45 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:45 <opendevmeet> The meeting name has been set to 'irc_watcher_meeting_20_november_2025'
12:00:51 <rlandy> hello all ...
12:00:56 <rlandy> who is around?
12:01:25 <rlandy> courtesy ping list: dviroel amoralej jgilaber sean-k-mooney chandankumar morenod
12:01:30 <amoralej> o/
12:01:31 <jgilaber> o/
12:01:44 <sean-k-mooney> o/
12:02:50 <rlandy> let's get started ...
12:03:04 <rlandy> #topic:  new blueprint for automatic skip actions on pre_condition
12:03:16 <rlandy> #link: https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition
12:03:26 <amoralej> that's mine
12:03:27 <rlandy> #link https://review.opendev.org/c/openstack/watcher/+/966699
12:03:33 <rlandy> pls go ahead amoralej
12:03:43 <chandankumar> o/
12:03:54 <amoralej> so, I started working on it for the migrate action in https://review.opendev.org/c/openstack/watcher/+/966699
12:04:28 <amoralej> so, I'd like to discuss about the blueprint itself and check if it can get approved as-is or I need something else
12:04:49 <sean-k-mooney> so before we approve it we shoudl really agree on the scope
12:05:00 <amoralej> given that it does not imply change in API, etc... I have not created spec so far
12:05:07 <sean-k-mooney> i.e. what action will this chagne and what will the be pre-codnions for each
12:06:04 <amoralej> for the actions, all where skipping condition is relevant, migrate, change_nova_service_state, volume_migrate, stop
12:06:19 <amoralej> and resize i guess (although i don't know if something is using resize, tbh)
12:06:58 <sean-k-mooney> so the description shoudl be updated with a list of the actions and for each action ideally a high level list of the precondtoins
12:07:11 <sean-k-mooney> i.e. the vm/volume must exist
12:07:33 <sean-k-mooney> the souce/dest shoudl exist (state when it is an error vs skipped
12:07:51 <amoralej> ok, about skipping conditions I think keeping a high level is good, but i guess we will need to discuss details for each one on each review
12:07:52 <sean-k-mooney> over all it does not need to be very long but it shoudl be more detailed then it is currently
12:08:19 <sean-k-mooney> yes, we can defer the detials of the exact behavior to the review
12:08:31 <sean-k-mooney> but we need to capture the overall scope fo the changes beter
12:09:27 <amoralej> sure
12:09:37 <sean-k-mooney> if the description was 2 -2.5 times its curentl lenght that would proably caputre all that is need
12:10:15 <amoralej> I get the point
12:10:28 <sean-k-mooney> i have targeted it to 2026.1 and set the defintion to dicssion
12:10:48 <amoralej> In the same reviews, can we also managed conditions where the action should be FAILED?
12:10:52 <amoralej> in pre_condition, i mean
12:11:02 <sean-k-mooney> we can yes
12:11:31 <sean-k-mooney> we proably shoudl have 1 commit per action  but if you want to do the skip cases first and the fail cases after thats ok too
12:11:55 <sean-k-mooney> does anyone have concerns with tthe over all intent or direction?
12:12:03 <amoralej> Unless you requires it, imo one review per action would be fine
12:12:11 <amoralej> or it gets too complicated
12:12:15 <sean-k-mooney> do we feel comfortable in general with a specless blueprint or would folks like a spec?
12:12:33 <jgilaber> +1, unless we need large changes for some action one review per action seems fine
12:12:57 <sean-k-mooney> then lets loop back to it next week and if we are happy we can approve it
12:13:01 <jgilaber> also +1 to specless blueprint we discussed the skip feature well enough last cycle
12:13:21 <amoralej> wrt test coverage
12:13:49 <amoralej> we want to cover that with tempest testing?
12:14:29 <sean-k-mooney> it sort of depends. we may want 1 test im not sure we need to test all the skip condtion in tempest
12:14:30 <amoralej> morenod already created a wip review for the migrate changes using the actuator strategy with a list of actions where we can simulate
12:14:42 <sean-k-mooney> i woudl hope we can comprehelsive test this with unit/funcitonal tests
12:14:50 <morenod> #link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/966860
12:15:03 <amoralej> yeah, i have doubts if we will be able to simulate all, but for some cases i think it's doable and good
12:15:11 <amoralej> so probably we need to check on each case
12:15:34 <sean-k-mooney> im kind of concerned by that test
12:15:46 <amoralej> what is your concern?
12:15:56 <sean-k-mooney> well it looks liek we have several bugs
12:16:18 <sean-k-mooney> for one it shoudl not be possibel to generate a action plan that refence a non existing instnace
12:16:39 <sean-k-mooney> or a souce nod that does not exist
12:17:09 <sean-k-mooney> now the instnace could be on a diffent host the the refeence source host
12:17:12 <sean-k-mooney> that valid
12:17:43 <amoralej> an instance may be removed after the action plan is created
12:18:03 <sean-k-mooney> correct bu tin this case it never existd
12:18:06 <amoralej> note that the actuator strategy is just a test-only strategy which is proxy to create any action
12:18:46 <amoralej> there is no logic at all in the strategy, but it is usefull to do this kind of simulations to test action logic imo
12:18:47 <sean-k-mooney> yes but it shoudl not allow you to put in a fake instance id
12:19:02 <sean-k-mooney> it shoudl assort that the inste exists in the model or nova
12:19:30 <sean-k-mooney> so that to me is a very big input validation problem
12:19:33 <amoralej> but then, the actuator strategy need to have specific per-action logic
12:19:41 <sean-k-mooney> we cant assuem someone wont use thsi in production
12:19:54 <amoralej> then, we should document that :)
12:19:55 <sean-k-mooney> so we have to validate all actions it create agaisnt the model
12:20:12 <sean-k-mooney> well this is borderlien a secuirty problem
12:20:19 <sean-k-mooney> so i dont think documentaiton is enough here
12:20:22 <amoralej> we could create the vm -> actuator with the action to get the AP -> remove the vm
12:20:36 <sean-k-mooney> ^ would be valid
12:20:37 <amoralej> actually, that's something i've though in the past
12:20:46 <amoralej> hiding strategies
12:21:07 <amoralej> it'd be good to disable / hide test-only strategies and goals
12:21:33 <sean-k-mooney> yes it would or provide a way to jsut filter them in the config in general
12:21:49 <sean-k-mooney> we can have an enabeld_stragey or enabled_goal config opiton
12:22:10 <sean-k-mooney> nova and cidner both do that for there schduler filters and neutorn does it for there extentions
12:22:18 <amoralej> imo, for testing is good to have an strategy that simply copies actions, but i agree it'd be good to have a mechanism to make sure those are not used in production
12:22:47 <sean-k-mooney> so going back to testing
12:22:49 <amoralej> same for dummy, nop
12:23:02 <amoralej> yeah, let's go back
12:23:32 <sean-k-mooney> if we emulate a reals env by creating the 2 vms, creating the action plan and deleteing one then movign the other
12:23:41 <sean-k-mooney> this would be a vaild test again
12:24:15 <sean-k-mooney> but i think we need a bug for this lack of input validation and it should returnn a 400 in the future
12:24:42 <amoralej> I disagree but that's probably another discussion :)
12:25:03 <amoralej> i think it's good to have that option (running actions with fake data) only for testing
12:25:05 <sean-k-mooney> ok but im really sondering if i shoudl be opening a security bug or not.
12:25:46 <sean-k-mooney> raw passhtough of actions without verifcaiton is not ok in my opipion
12:26:01 <sean-k-mooney> we can dicuss that seperatly
12:26:07 <amoralej> yes, +1
12:26:34 <amoralej> so, let's say we will cover tempest where reasonable, and will discuss on each review
12:27:00 <amoralej> in parallel, we can create a bug for the actuator and discuss
12:27:22 <sean-k-mooney> +1
12:28:16 <amoralej> last discussion, wrt the specific cases where we should SKIP or FAIL for the migrate I left a comment in https://review.opendev.org/c/openstack/watcher/+/966699 where we can discuss
12:28:26 <amoralej> unless you want to discuss it now
12:29:11 <rlandy> there are no other topics or reviews (juts bug triage) so up to you
12:29:29 <amoralej> actually, I think we kind of agree in everything other that the case where source_node != current_node
12:29:30 <sean-k-mooney> either works for me
12:30:04 <amoralej> I understand that instance do not exist -> SKIPPED is clear
12:30:19 <amoralej> destination_node does not exist or is disabled -> FAILED
12:30:35 <amoralej> Instance status != ACTIVE in migration_type = Live -> FAILED
12:30:40 <sean-k-mooney> to me i see the following  options mark it as failed ignore it and move it to the selecte destiaon or skip if tis already on the selected destionat
12:32:23 <sean-k-mooney> amoralej: basiclly im not sure you shoud leb looking at the souce node when moving it other then to confirm it not currently on the destination node
12:32:45 <amoralej> My logic to incline to SKIP is that in most (or all) cases, the instance is moved because of the host where it is running, i.e. host_maintenance, zone_migrations or consolidating strategies are migrating the instances because of the host where they are running.
12:32:48 <sean-k-mooney> if it moved out of band to the destiation node then its valid to skip
12:33:13 <amoralej> there is no point in doing a migration for a host_maintenance i.e. if the vm is not in that host
12:33:40 <amoralej> and similar to zone_migration, even for consolidation case
12:33:57 <sean-k-mooney> so that only appleis to the zonew or host maintaince case
12:34:11 <amoralej> or consolidation, intended to empty nodes
12:34:13 <sean-k-mooney> but no the vm consolidation or workload stablization case
12:35:18 <amoralej> even for workload_stabilization, we are moving because source_node it's beyond some threshold
12:35:41 <amoralej> although that's te most complicated case, tbh
12:36:14 <amoralej> something we may do, but that would increase the change is to allow to passed souce_node as None and do not check at all in that case
12:36:30 <amoralej> so, each strategy would decide if want to check or not
12:37:37 <amoralej> and if source_node != None and != current_node -> SKIPPED
12:37:42 <amoralej> I'm not sure, tbh
12:37:54 <amoralej> any opinions about it?
12:38:09 <sean-k-mooney> ya im sort of coming around to your argumetn for skip instead of failed
12:38:43 <sean-k-mooney> your effectivly arguing that the data tha twas used to generate the action plan was out of date
12:39:00 <amoralej> exactly
12:39:01 <sean-k-mooney> so we shoudl disregard any actions related to that workload as a result
12:39:27 <sean-k-mooney> of couse if the vm is on a diffent host in the zone being drained
12:39:44 <sean-k-mooney> ro a diffent host that is under mantiance then the goal will not hav ebeen achived
12:40:00 <sean-k-mooney> so fi you have any skipped action you may need to run the audit and action plan a second time
12:40:08 <sean-k-mooney> or many times until all succeed
12:40:09 <amoralej> correct, but that goes beyond this change
12:40:24 <sean-k-mooney> that is why i was orginlly saying the action plan shoudl be failed
12:40:33 <sean-k-mooney> not skipped
12:40:46 <sean-k-mooney> (an the action shoudl not be skipped but marked failed)
12:43:04 <amoralej> actually, something that may be good in some cases is to recalculate the real efficiency indicators achieved by the execution of an AP
12:43:09 <amoralej> at the end
12:43:21 <amoralej> but that's beyond the scope of this
12:43:37 <sean-k-mooney> so lets bring this back to the review
12:43:41 <amoralej> yes
12:43:52 <sean-k-mooney> oen thing we will need to do is docuemnt this in the action docs
12:44:00 <sean-k-mooney> (and create those if we dont already have them)
12:44:04 <amoralej> yes, actions have no document at all
12:44:16 <amoralej> on the PTG we aggreed to create a new sections for actions
12:44:28 <amoralej> btw, we may document which ones are test-only
12:44:44 <sean-k-mooney> we can yes
12:44:46 <amoralej> and on each action document the skipping conditions
12:45:11 <sean-k-mooney> as i said that is not enough to adresss my concern unless we also disbale them by default and requrie cofig option ot opt in
12:45:17 <amoralej> but i don't want to mix adding a brand new documentation section with the implementation of the skipped
12:45:53 <sean-k-mooney> i do
12:46:02 <sean-k-mooney> your change shoudl be complete
12:46:09 <sean-k-mooney> so eithe we can do the new docs first
12:46:16 <sean-k-mooney> or you shoudl do it as part of the change
12:46:17 <amoralej> that's likely the best option
12:46:25 <amoralej> i mean, adding new docs first
12:46:26 <sean-k-mooney> ack that also works for me
12:46:37 <sean-k-mooney> since we do not have a spec for this
12:46:40 <amoralej> requires blueprint ?
12:46:41 <sean-k-mooney> we need to have the docs
12:46:51 <sean-k-mooney> we can do it as part fo this blueprint i think
12:47:03 <amoralej> ok, i will mention it in the description
12:47:25 <sean-k-mooney> although its gettign closer to a spec at that poitn but ya it makes sense to do this toghter as part fo the over all feature delviery
12:47:59 <sean-k-mooney> anyone have concerns with ^ plan
12:48:41 <jgilaber> +1 for me to include the docs
12:48:47 <sean-k-mooney> 1 add simpel docs for each supproted action today, 2 enhacn the docs and each action with pre-condtions in seperate commits
12:49:09 <rlandy> sounds fine
12:49:46 <amoralej> ack, I will rewrite the description of the blueprint
12:49:47 <rlandy> amoralej, did we cover all the questions and issues on this topic?
12:49:52 <amoralej> yes, i think so
12:50:21 <rlandy> ok - any further questions/comments from anyone else?
12:51:03 <rlandy> moving on ...
12:51:17 <rlandy> there were no review added to  the meeting agenda
12:51:26 <rlandy> so we can move to bug triage
12:51:34 <rlandy> #topic: Bug triage
12:51:44 <rlandy> #link: https://bugs.launchpad.net/watcher/+bug/2131043
12:52:04 <rlandy> opened on 2025-11-10
12:52:11 <rlandy> watcher decision engine fails when evaluating previously migrated instance
12:52:32 <rlandy> bug logged with:
12:52:35 <rlandy> jammy + caracal
12:52:35 <rlandy> openstack 2024.1 charms
12:53:29 <rlandy> note attached path ... "Attached is a patch that allows the decision engine to continue when it finds the stale or missing data. Testing with this patch in the same environment has results in over 150 successful action plans with no decision engine failures."
12:53:52 <sean-k-mooney> this is realted to the other topic
12:54:13 <sean-k-mooney> althoug hits a slighlty diffent problem statement
12:54:37 <sean-k-mooney> teh decsion engin shoudl not crash and shoudl handel the instnace not bein gin the datamodel
12:55:14 <rlandy> will the problem be address with current work in progress?
12:55:29 <sean-k-mooney> but it shoudl not geenrate an action for that instance if we do not have data for it and the stragy required data
12:55:45 <sean-k-mooney> likely not no
12:55:52 <amoralej> so the case is a VM is deleted in the middle of an audit execution?
12:55:54 <sean-k-mooney> this is before the action case
12:56:09 <sean-k-mooney> no the vm is moved and there is no metrics for the host currently
12:56:26 <jgilaber> this could also be because the datamodel is outdated
12:56:36 <sean-k-mooney> yes
12:56:41 <jgilaber> in the bug report it mentions the instance being migrated by another action plan
12:56:55 <jgilaber> this is a problem we've seen already and solved by enabling notifications
12:57:11 <jgilaber> although the attached patch since like a valid improvement to the strategy
12:57:30 <sean-k-mooney> i think we aslo may have fixed this key error in a newer release
12:57:36 <sean-k-mooney> at least a verian of it
12:57:46 <amoralej> I'm not sure what workload_cache has tbh
12:58:04 <jgilaber> I think that was in another strategy
12:58:31 <jgilaber> I'm looking at the code in master and there is no mention of KeyError https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_balance.py
12:58:41 <amoralej> note it's running it with --interval 60 so, likely the model is not being refreshed between executions
12:58:52 <sean-k-mooney> we may have fixed it for the workload stablisation stratgy or anothe rone
12:59:02 <sean-k-mooney> we need to confirm if this happens on master or not
12:59:13 <rlandy> ack
12:59:26 <amoralej> we may suggest to enable notifications too as jgilaber mentioned
12:59:32 <sean-k-mooney> amoralej: right that hsoudl not resutl in a key error
12:59:38 <sean-k-mooney> we shoudl be robust to that
12:59:45 <amoralej> right
13:00:10 <sean-k-mooney> the patch https://launchpadlibrarian.net/830569850/decision-engine.patch
13:00:19 <sean-k-mooney> jhust ignores the instance and continue the loop
13:00:30 <amoralej> but even if we do improve handling that error, it may create the plan based on incomplete data
13:00:51 <sean-k-mooney> taht might be ok but we need to repoduce this in a test and then have test coverage for it when fixed
13:01:32 <sean-k-mooney> amoralej: well if we require metrics as aprt of the stragy and we dont have them for a workload then the workload shoudl be expclded form teh action plan
13:01:33 <amoralej> actually, we may reproduce it
13:01:39 <sean-k-mooney> or we shoudl fail to produce one
13:02:20 <amoralej> I'd ask about the period parameters they are using and if they have notifications
13:02:36 <sean-k-mooney> they are likely not using it that short
13:02:43 <sean-k-mooney> i think that is just to help use repoduce it
13:03:18 <sean-k-mooney> watcher shoudl nto fail without notificaiton even with short periods for continus audits
13:03:58 <sean-k-mooney> as an api use you cant know if notificaoitn are enabel or the period of the collectors
13:04:17 <rlandy> what are our next steps here? ... status: triaged? importance: high? ask reporter if they can reproduce in master (without notifications)?
13:04:18 <amoralej> yes, so i see two different issues, 1 - Watcher whould be more robust on those cases 2 - What are good configuration to get the best audit results
13:05:09 <sean-k-mooney> set it to incomplete and either ask if they see this also on master or we will have to do that
13:05:18 <amoralej> i.e. if no notification, continuous audits with interval < nova_colletor.period will easily lead to incomplete plans
13:05:30 <sean-k-mooney> yep
13:05:37 <sean-k-mooney> but that is ok
13:05:47 <rlandy> ok .. marking bug
13:05:49 <amoralej> you want me to reply in the case?
13:06:07 <rlandy> if you would like ... other I will do it after the call
13:06:12 <rlandy> next ...
13:06:17 <sean-k-mooney> amoralej: sure, we have basically 2 option allow it to proceed and log a warning as they proppose
13:06:55 <sean-k-mooney> or we need to reject the audit create with a 400 becuase the notificaotn are disabeld and the period is less the nthe collcotor
13:07:00 <sean-k-mooney> i dont like the second option
13:07:11 <sean-k-mooney> becaus that is leaking privaldge infrastuctre information
13:07:18 <sean-k-mooney> i.e. the config option values
13:08:05 <amoralej> yes, failing is likely too much
13:08:33 <jgilaber> yes, failing the audit seems overkill in this scenario imo
13:09:36 <rlandy> sorry folks - we're over time - so just touching on the last bug ...
13:09:45 <rlandy> #link: https://bugs.launchpad.net/watcher/+bug/2129692
13:10:03 <rlandy> this one jgilaber entered
13:10:10 <rlandy> as is "in progress"
13:10:25 <rlandy> jgilaber, any discussion needed here?
13:11:16 <jgilaber> this bug surfaced after I implemented support for the 'src_type' parameter
13:11:30 <jgilaber> I have a patch for it https://review.opendev.org/c/openstack/watcher/+/964718
13:12:16 <jgilaber> in short the zone migration strategy might not select the correct destination because the get_dst_pool_and_type method does not take into account the src_type
13:12:31 <jgilaber> given that we're out of time we can discuss it another time
13:13:01 <rlandy> jgilaber, ok - sure - pls bring it back next week if needed
13:13:09 <rlandy> thank you all for attending
13:13:16 <jgilaber> thank rlandy!
13:13:20 <rlandy> dviroel volunteered to chair next meeting
13:13:32 <rlandy> #endmeeting