12:00:45 <rlandy> #startmeeting IRC Watcher meeting 20 November 2025 12:00:45 <opendevmeet> Meeting started Thu Nov 20 12:00:45 2025 UTC and is due to finish in 60 minutes. The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot. 12:00:45 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. 12:00:45 <opendevmeet> The meeting name has been set to 'irc_watcher_meeting_20_november_2025' 12:00:51 <rlandy> hello all ... 12:00:56 <rlandy> who is around? 12:01:25 <rlandy> courtesy ping list: dviroel amoralej jgilaber sean-k-mooney chandankumar morenod 12:01:30 <amoralej> o/ 12:01:31 <jgilaber> o/ 12:01:44 <sean-k-mooney> o/ 12:02:50 <rlandy> let's get started ... 12:03:04 <rlandy> #topic: new blueprint for automatic skip actions on pre_condition 12:03:16 <rlandy> #link: https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition 12:03:26 <amoralej> that's mine 12:03:27 <rlandy> #link https://review.opendev.org/c/openstack/watcher/+/966699 12:03:33 <rlandy> pls go ahead amoralej 12:03:43 <chandankumar> o/ 12:03:54 <amoralej> so, I started working on it for the migrate action in https://review.opendev.org/c/openstack/watcher/+/966699 12:04:28 <amoralej> so, I'd like to discuss about the blueprint itself and check if it can get approved as-is or I need something else 12:04:49 <sean-k-mooney> so before we approve it we shoudl really agree on the scope 12:05:00 <amoralej> given that it does not imply change in API, etc... I have not created spec so far 12:05:07 <sean-k-mooney> i.e. what action will this chagne and what will the be pre-codnions for each 12:06:04 <amoralej> for the actions, all where skipping condition is relevant, migrate, change_nova_service_state, volume_migrate, stop 12:06:19 <amoralej> and resize i guess (although i don't know if something is using resize, tbh) 12:06:58 <sean-k-mooney> so the description shoudl be updated with a list of the actions and for each action ideally a high level list of the precondtoins 12:07:11 <sean-k-mooney> i.e. the vm/volume must exist 12:07:33 <sean-k-mooney> the souce/dest shoudl exist (state when it is an error vs skipped 12:07:51 <amoralej> ok, about skipping conditions I think keeping a high level is good, but i guess we will need to discuss details for each one on each review 12:07:52 <sean-k-mooney> over all it does not need to be very long but it shoudl be more detailed then it is currently 12:08:19 <sean-k-mooney> yes, we can defer the detials of the exact behavior to the review 12:08:31 <sean-k-mooney> but we need to capture the overall scope fo the changes beter 12:09:27 <amoralej> sure 12:09:37 <sean-k-mooney> if the description was 2 -2.5 times its curentl lenght that would proably caputre all that is need 12:10:15 <amoralej> I get the point 12:10:28 <sean-k-mooney> i have targeted it to 2026.1 and set the defintion to dicssion 12:10:48 <amoralej> In the same reviews, can we also managed conditions where the action should be FAILED? 12:10:52 <amoralej> in pre_condition, i mean 12:11:02 <sean-k-mooney> we can yes 12:11:31 <sean-k-mooney> we proably shoudl have 1 commit per action but if you want to do the skip cases first and the fail cases after thats ok too 12:11:55 <sean-k-mooney> does anyone have concerns with tthe over all intent or direction? 12:12:03 <amoralej> Unless you requires it, imo one review per action would be fine 12:12:11 <amoralej> or it gets too complicated 12:12:15 <sean-k-mooney> do we feel comfortable in general with a specless blueprint or would folks like a spec? 12:12:33 <jgilaber> +1, unless we need large changes for some action one review per action seems fine 12:12:57 <sean-k-mooney> then lets loop back to it next week and if we are happy we can approve it 12:13:01 <jgilaber> also +1 to specless blueprint we discussed the skip feature well enough last cycle 12:13:21 <amoralej> wrt test coverage 12:13:49 <amoralej> we want to cover that with tempest testing? 12:14:29 <sean-k-mooney> it sort of depends. we may want 1 test im not sure we need to test all the skip condtion in tempest 12:14:30 <amoralej> morenod already created a wip review for the migrate changes using the actuator strategy with a list of actions where we can simulate 12:14:42 <sean-k-mooney> i woudl hope we can comprehelsive test this with unit/funcitonal tests 12:14:50 <morenod> #link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/966860 12:15:03 <amoralej> yeah, i have doubts if we will be able to simulate all, but for some cases i think it's doable and good 12:15:11 <amoralej> so probably we need to check on each case 12:15:34 <sean-k-mooney> im kind of concerned by that test 12:15:46 <amoralej> what is your concern? 12:15:56 <sean-k-mooney> well it looks liek we have several bugs 12:16:18 <sean-k-mooney> for one it shoudl not be possibel to generate a action plan that refence a non existing instnace 12:16:39 <sean-k-mooney> or a souce nod that does not exist 12:17:09 <sean-k-mooney> now the instnace could be on a diffent host the the refeence source host 12:17:12 <sean-k-mooney> that valid 12:17:43 <amoralej> an instance may be removed after the action plan is created 12:18:03 <sean-k-mooney> correct bu tin this case it never existd 12:18:06 <amoralej> note that the actuator strategy is just a test-only strategy which is proxy to create any action 12:18:46 <amoralej> there is no logic at all in the strategy, but it is usefull to do this kind of simulations to test action logic imo 12:18:47 <sean-k-mooney> yes but it shoudl not allow you to put in a fake instance id 12:19:02 <sean-k-mooney> it shoudl assort that the inste exists in the model or nova 12:19:30 <sean-k-mooney> so that to me is a very big input validation problem 12:19:33 <amoralej> but then, the actuator strategy need to have specific per-action logic 12:19:41 <sean-k-mooney> we cant assuem someone wont use thsi in production 12:19:54 <amoralej> then, we should document that :) 12:19:55 <sean-k-mooney> so we have to validate all actions it create agaisnt the model 12:20:12 <sean-k-mooney> well this is borderlien a secuirty problem 12:20:19 <sean-k-mooney> so i dont think documentaiton is enough here 12:20:22 <amoralej> we could create the vm -> actuator with the action to get the AP -> remove the vm 12:20:36 <sean-k-mooney> ^ would be valid 12:20:37 <amoralej> actually, that's something i've though in the past 12:20:46 <amoralej> hiding strategies 12:21:07 <amoralej> it'd be good to disable / hide test-only strategies and goals 12:21:33 <sean-k-mooney> yes it would or provide a way to jsut filter them in the config in general 12:21:49 <sean-k-mooney> we can have an enabeld_stragey or enabled_goal config opiton 12:22:10 <sean-k-mooney> nova and cidner both do that for there schduler filters and neutorn does it for there extentions 12:22:18 <amoralej> imo, for testing is good to have an strategy that simply copies actions, but i agree it'd be good to have a mechanism to make sure those are not used in production 12:22:47 <sean-k-mooney> so going back to testing 12:22:49 <amoralej> same for dummy, nop 12:23:02 <amoralej> yeah, let's go back 12:23:32 <sean-k-mooney> if we emulate a reals env by creating the 2 vms, creating the action plan and deleteing one then movign the other 12:23:41 <sean-k-mooney> this would be a vaild test again 12:24:15 <sean-k-mooney> but i think we need a bug for this lack of input validation and it should returnn a 400 in the future 12:24:42 <amoralej> I disagree but that's probably another discussion :) 12:25:03 <amoralej> i think it's good to have that option (running actions with fake data) only for testing 12:25:05 <sean-k-mooney> ok but im really sondering if i shoudl be opening a security bug or not. 12:25:46 <sean-k-mooney> raw passhtough of actions without verifcaiton is not ok in my opipion 12:26:01 <sean-k-mooney> we can dicuss that seperatly 12:26:07 <amoralej> yes, +1 12:26:34 <amoralej> so, let's say we will cover tempest where reasonable, and will discuss on each review 12:27:00 <amoralej> in parallel, we can create a bug for the actuator and discuss 12:27:22 <sean-k-mooney> +1 12:28:16 <amoralej> last discussion, wrt the specific cases where we should SKIP or FAIL for the migrate I left a comment in https://review.opendev.org/c/openstack/watcher/+/966699 where we can discuss 12:28:26 <amoralej> unless you want to discuss it now 12:29:11 <rlandy> there are no other topics or reviews (juts bug triage) so up to you 12:29:29 <amoralej> actually, I think we kind of agree in everything other that the case where source_node != current_node 12:29:30 <sean-k-mooney> either works for me 12:30:04 <amoralej> I understand that instance do not exist -> SKIPPED is clear 12:30:19 <amoralej> destination_node does not exist or is disabled -> FAILED 12:30:35 <amoralej> Instance status != ACTIVE in migration_type = Live -> FAILED 12:30:40 <sean-k-mooney> to me i see the following options mark it as failed ignore it and move it to the selecte destiaon or skip if tis already on the selected destionat 12:32:23 <sean-k-mooney> amoralej: basiclly im not sure you shoud leb looking at the souce node when moving it other then to confirm it not currently on the destination node 12:32:45 <amoralej> My logic to incline to SKIP is that in most (or all) cases, the instance is moved because of the host where it is running, i.e. host_maintenance, zone_migrations or consolidating strategies are migrating the instances because of the host where they are running. 12:32:48 <sean-k-mooney> if it moved out of band to the destiation node then its valid to skip 12:33:13 <amoralej> there is no point in doing a migration for a host_maintenance i.e. if the vm is not in that host 12:33:40 <amoralej> and similar to zone_migration, even for consolidation case 12:33:57 <sean-k-mooney> so that only appleis to the zonew or host maintaince case 12:34:11 <amoralej> or consolidation, intended to empty nodes 12:34:13 <sean-k-mooney> but no the vm consolidation or workload stablization case 12:35:18 <amoralej> even for workload_stabilization, we are moving because source_node it's beyond some threshold 12:35:41 <amoralej> although that's te most complicated case, tbh 12:36:14 <amoralej> something we may do, but that would increase the change is to allow to passed souce_node as None and do not check at all in that case 12:36:30 <amoralej> so, each strategy would decide if want to check or not 12:37:37 <amoralej> and if source_node != None and != current_node -> SKIPPED 12:37:42 <amoralej> I'm not sure, tbh 12:37:54 <amoralej> any opinions about it? 12:38:09 <sean-k-mooney> ya im sort of coming around to your argumetn for skip instead of failed 12:38:43 <sean-k-mooney> your effectivly arguing that the data tha twas used to generate the action plan was out of date 12:39:00 <amoralej> exactly 12:39:01 <sean-k-mooney> so we shoudl disregard any actions related to that workload as a result 12:39:27 <sean-k-mooney> of couse if the vm is on a diffent host in the zone being drained 12:39:44 <sean-k-mooney> ro a diffent host that is under mantiance then the goal will not hav ebeen achived 12:40:00 <sean-k-mooney> so fi you have any skipped action you may need to run the audit and action plan a second time 12:40:08 <sean-k-mooney> or many times until all succeed 12:40:09 <amoralej> correct, but that goes beyond this change 12:40:24 <sean-k-mooney> that is why i was orginlly saying the action plan shoudl be failed 12:40:33 <sean-k-mooney> not skipped 12:40:46 <sean-k-mooney> (an the action shoudl not be skipped but marked failed) 12:43:04 <amoralej> actually, something that may be good in some cases is to recalculate the real efficiency indicators achieved by the execution of an AP 12:43:09 <amoralej> at the end 12:43:21 <amoralej> but that's beyond the scope of this 12:43:37 <sean-k-mooney> so lets bring this back to the review 12:43:41 <amoralej> yes 12:43:52 <sean-k-mooney> oen thing we will need to do is docuemnt this in the action docs 12:44:00 <sean-k-mooney> (and create those if we dont already have them) 12:44:04 <amoralej> yes, actions have no document at all 12:44:16 <amoralej> on the PTG we aggreed to create a new sections for actions 12:44:28 <amoralej> btw, we may document which ones are test-only 12:44:44 <sean-k-mooney> we can yes 12:44:46 <amoralej> and on each action document the skipping conditions 12:45:11 <sean-k-mooney> as i said that is not enough to adresss my concern unless we also disbale them by default and requrie cofig option ot opt in 12:45:17 <amoralej> but i don't want to mix adding a brand new documentation section with the implementation of the skipped 12:45:53 <sean-k-mooney> i do 12:46:02 <sean-k-mooney> your change shoudl be complete 12:46:09 <sean-k-mooney> so eithe we can do the new docs first 12:46:16 <sean-k-mooney> or you shoudl do it as part of the change 12:46:17 <amoralej> that's likely the best option 12:46:25 <amoralej> i mean, adding new docs first 12:46:26 <sean-k-mooney> ack that also works for me 12:46:37 <sean-k-mooney> since we do not have a spec for this 12:46:40 <amoralej> requires blueprint ? 12:46:41 <sean-k-mooney> we need to have the docs 12:46:51 <sean-k-mooney> we can do it as part fo this blueprint i think 12:47:03 <amoralej> ok, i will mention it in the description 12:47:25 <sean-k-mooney> although its gettign closer to a spec at that poitn but ya it makes sense to do this toghter as part fo the over all feature delviery 12:47:59 <sean-k-mooney> anyone have concerns with ^ plan 12:48:41 <jgilaber> +1 for me to include the docs 12:48:47 <sean-k-mooney> 1 add simpel docs for each supproted action today, 2 enhacn the docs and each action with pre-condtions in seperate commits 12:49:09 <rlandy> sounds fine 12:49:46 <amoralej> ack, I will rewrite the description of the blueprint 12:49:47 <rlandy> amoralej, did we cover all the questions and issues on this topic? 12:49:52 <amoralej> yes, i think so 12:50:21 <rlandy> ok - any further questions/comments from anyone else? 12:51:03 <rlandy> moving on ... 12:51:17 <rlandy> there were no review added to the meeting agenda 12:51:26 <rlandy> so we can move to bug triage 12:51:34 <rlandy> #topic: Bug triage 12:51:44 <rlandy> #link: https://bugs.launchpad.net/watcher/+bug/2131043 12:52:04 <rlandy> opened on 2025-11-10 12:52:11 <rlandy> watcher decision engine fails when evaluating previously migrated instance 12:52:32 <rlandy> bug logged with: 12:52:35 <rlandy> jammy + caracal 12:52:35 <rlandy> openstack 2024.1 charms 12:53:29 <rlandy> note attached path ... "Attached is a patch that allows the decision engine to continue when it finds the stale or missing data. Testing with this patch in the same environment has results in over 150 successful action plans with no decision engine failures." 12:53:52 <sean-k-mooney> this is realted to the other topic 12:54:13 <sean-k-mooney> althoug hits a slighlty diffent problem statement 12:54:37 <sean-k-mooney> teh decsion engin shoudl not crash and shoudl handel the instnace not bein gin the datamodel 12:55:14 <rlandy> will the problem be address with current work in progress? 12:55:29 <sean-k-mooney> but it shoudl not geenrate an action for that instance if we do not have data for it and the stragy required data 12:55:45 <sean-k-mooney> likely not no 12:55:52 <amoralej> so the case is a VM is deleted in the middle of an audit execution? 12:55:54 <sean-k-mooney> this is before the action case 12:56:09 <sean-k-mooney> no the vm is moved and there is no metrics for the host currently 12:56:26 <jgilaber> this could also be because the datamodel is outdated 12:56:36 <sean-k-mooney> yes 12:56:41 <jgilaber> in the bug report it mentions the instance being migrated by another action plan 12:56:55 <jgilaber> this is a problem we've seen already and solved by enabling notifications 12:57:11 <jgilaber> although the attached patch since like a valid improvement to the strategy 12:57:30 <sean-k-mooney> i think we aslo may have fixed this key error in a newer release 12:57:36 <sean-k-mooney> at least a verian of it 12:57:46 <amoralej> I'm not sure what workload_cache has tbh 12:58:04 <jgilaber> I think that was in another strategy 12:58:31 <jgilaber> I'm looking at the code in master and there is no mention of KeyError https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_balance.py 12:58:41 <amoralej> note it's running it with --interval 60 so, likely the model is not being refreshed between executions 12:58:52 <sean-k-mooney> we may have fixed it for the workload stablisation stratgy or anothe rone 12:59:02 <sean-k-mooney> we need to confirm if this happens on master or not 12:59:13 <rlandy> ack 12:59:26 <amoralej> we may suggest to enable notifications too as jgilaber mentioned 12:59:32 <sean-k-mooney> amoralej: right that hsoudl not resutl in a key error 12:59:38 <sean-k-mooney> we shoudl be robust to that 12:59:45 <amoralej> right 13:00:10 <sean-k-mooney> the patch https://launchpadlibrarian.net/830569850/decision-engine.patch 13:00:19 <sean-k-mooney> jhust ignores the instance and continue the loop 13:00:30 <amoralej> but even if we do improve handling that error, it may create the plan based on incomplete data 13:00:51 <sean-k-mooney> taht might be ok but we need to repoduce this in a test and then have test coverage for it when fixed 13:01:32 <sean-k-mooney> amoralej: well if we require metrics as aprt of the stragy and we dont have them for a workload then the workload shoudl be expclded form teh action plan 13:01:33 <amoralej> actually, we may reproduce it 13:01:39 <sean-k-mooney> or we shoudl fail to produce one 13:02:20 <amoralej> I'd ask about the period parameters they are using and if they have notifications 13:02:36 <sean-k-mooney> they are likely not using it that short 13:02:43 <sean-k-mooney> i think that is just to help use repoduce it 13:03:18 <sean-k-mooney> watcher shoudl nto fail without notificaiton even with short periods for continus audits 13:03:58 <sean-k-mooney> as an api use you cant know if notificaoitn are enabel or the period of the collectors 13:04:17 <rlandy> what are our next steps here? ... status: triaged? importance: high? ask reporter if they can reproduce in master (without notifications)? 13:04:18 <amoralej> yes, so i see two different issues, 1 - Watcher whould be more robust on those cases 2 - What are good configuration to get the best audit results 13:05:09 <sean-k-mooney> set it to incomplete and either ask if they see this also on master or we will have to do that 13:05:18 <amoralej> i.e. if no notification, continuous audits with interval < nova_colletor.period will easily lead to incomplete plans 13:05:30 <sean-k-mooney> yep 13:05:37 <sean-k-mooney> but that is ok 13:05:47 <rlandy> ok .. marking bug 13:05:49 <amoralej> you want me to reply in the case? 13:06:07 <rlandy> if you would like ... other I will do it after the call 13:06:12 <rlandy> next ... 13:06:17 <sean-k-mooney> amoralej: sure, we have basically 2 option allow it to proceed and log a warning as they proppose 13:06:55 <sean-k-mooney> or we need to reject the audit create with a 400 becuase the notificaotn are disabeld and the period is less the nthe collcotor 13:07:00 <sean-k-mooney> i dont like the second option 13:07:11 <sean-k-mooney> becaus that is leaking privaldge infrastuctre information 13:07:18 <sean-k-mooney> i.e. the config option values 13:08:05 <amoralej> yes, failing is likely too much 13:08:33 <jgilaber> yes, failing the audit seems overkill in this scenario imo 13:09:36 <rlandy> sorry folks - we're over time - so just touching on the last bug ... 13:09:45 <rlandy> #link: https://bugs.launchpad.net/watcher/+bug/2129692 13:10:03 <rlandy> this one jgilaber entered 13:10:10 <rlandy> as is "in progress" 13:10:25 <rlandy> jgilaber, any discussion needed here? 13:11:16 <jgilaber> this bug surfaced after I implemented support for the 'src_type' parameter 13:11:30 <jgilaber> I have a patch for it https://review.opendev.org/c/openstack/watcher/+/964718 13:12:16 <jgilaber> in short the zone migration strategy might not select the correct destination because the get_dst_pool_and_type method does not take into account the src_type 13:12:31 <jgilaber> given that we're out of time we can discuss it another time 13:13:01 <rlandy> jgilaber, ok - sure - pls bring it back next week if needed 13:13:09 <rlandy> thank you all for attending 13:13:16 <jgilaber> thank rlandy! 13:13:20 <rlandy> dviroel volunteered to chair next meeting 13:13:32 <rlandy> #endmeeting