| rlandy | hello - IRC meeting will be in about 90 mins ... please add any topics and reviews to https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L39 | 10:15 |
|---|---|---|
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693 | 10:23 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693 | 10:25 |
| rlandy | #startmeeting IRC Watcher meeting 20 November 2025 | 12:00 |
| opendevmeet | Meeting started Thu Nov 20 12:00:45 2025 UTC and is due to finish in 60 minutes. The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot. | 12:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 12:00 |
| opendevmeet | The meeting name has been set to 'irc_watcher_meeting_20_november_2025' | 12:00 |
| rlandy | hello all ... | 12:00 |
| rlandy | who is around? | 12:00 |
| rlandy | courtesy ping list: dviroel amoralej jgilaber sean-k-mooney chandankumar morenod | 12:01 |
| amoralej | o/ | 12:01 |
| jgilaber | o/ | 12:01 |
| sean-k-mooney | o/ | 12:01 |
| rlandy | let's get started ... | 12:02 |
| rlandy | #topic: new blueprint for automatic skip actions on pre_condition | 12:03 |
| rlandy | #link: https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition | 12:03 |
| amoralej | that's mine | 12:03 |
| rlandy | #link https://review.opendev.org/c/openstack/watcher/+/966699 | 12:03 |
| rlandy | pls go ahead amoralej | 12:03 |
| chandankumar | o/ | 12:03 |
| amoralej | so, I started working on it for the migrate action in https://review.opendev.org/c/openstack/watcher/+/966699 | 12:03 |
| amoralej | so, I'd like to discuss about the blueprint itself and check if it can get approved as-is or I need something else | 12:04 |
| sean-k-mooney | so before we approve it we shoudl really agree on the scope | 12:04 |
| amoralej | given that it does not imply change in API, etc... I have not created spec so far | 12:05 |
| sean-k-mooney | i.e. what action will this chagne and what will the be pre-codnions for each | 12:05 |
| amoralej | for the actions, all where skipping condition is relevant, migrate, change_nova_service_state, volume_migrate, stop | 12:06 |
| amoralej | and resize i guess (although i don't know if something is using resize, tbh) | 12:06 |
| sean-k-mooney | so the description shoudl be updated with a list of the actions and for each action ideally a high level list of the precondtoins | 12:06 |
| sean-k-mooney | i.e. the vm/volume must exist | 12:07 |
| sean-k-mooney | the souce/dest shoudl exist (state when it is an error vs skipped | 12:07 |
| amoralej | ok, about skipping conditions I think keeping a high level is good, but i guess we will need to discuss details for each one on each review | 12:07 |
| sean-k-mooney | over all it does not need to be very long but it shoudl be more detailed then it is currently | 12:07 |
| sean-k-mooney | yes, we can defer the detials of the exact behavior to the review | 12:08 |
| sean-k-mooney | but we need to capture the overall scope fo the changes beter | 12:08 |
| amoralej | sure | 12:09 |
| sean-k-mooney | if the description was 2 -2.5 times its curentl lenght that would proably caputre all that is need | 12:09 |
| amoralej | I get the point | 12:10 |
| sean-k-mooney | i have targeted it to 2026.1 and set the defintion to dicssion | 12:10 |
| amoralej | In the same reviews, can we also managed conditions where the action should be FAILED? | 12:10 |
| amoralej | in pre_condition, i mean | 12:10 |
| sean-k-mooney | we can yes | 12:11 |
| sean-k-mooney | we proably shoudl have 1 commit per action but if you want to do the skip cases first and the fail cases after thats ok too | 12:11 |
| sean-k-mooney | does anyone have concerns with tthe over all intent or direction? | 12:11 |
| amoralej | Unless you requires it, imo one review per action would be fine | 12:12 |
| amoralej | or it gets too complicated | 12:12 |
| sean-k-mooney | do we feel comfortable in general with a specless blueprint or would folks like a spec? | 12:12 |
| jgilaber | +1, unless we need large changes for some action one review per action seems fine | 12:12 |
| sean-k-mooney | then lets loop back to it next week and if we are happy we can approve it | 12:12 |
| jgilaber | also +1 to specless blueprint we discussed the skip feature well enough last cycle | 12:13 |
| amoralej | wrt test coverage | 12:13 |
| amoralej | we want to cover that with tempest testing? | 12:13 |
| sean-k-mooney | it sort of depends. we may want 1 test im not sure we need to test all the skip condtion in tempest | 12:14 |
| amoralej | morenod already created a wip review for the migrate changes using the actuator strategy with a list of actions where we can simulate | 12:14 |
| sean-k-mooney | i woudl hope we can comprehelsive test this with unit/funcitonal tests | 12:14 |
| morenod | #link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/966860 | 12:14 |
| amoralej | yeah, i have doubts if we will be able to simulate all, but for some cases i think it's doable and good | 12:15 |
| amoralej | so probably we need to check on each case | 12:15 |
| sean-k-mooney | im kind of concerned by that test | 12:15 |
| amoralej | what is your concern? | 12:15 |
| sean-k-mooney | well it looks liek we have several bugs | 12:15 |
| sean-k-mooney | for one it shoudl not be possibel to generate a action plan that refence a non existing instnace | 12:16 |
| sean-k-mooney | or a souce nod that does not exist | 12:16 |
| sean-k-mooney | now the instnace could be on a diffent host the the refeence source host | 12:17 |
| sean-k-mooney | that valid | 12:17 |
| amoralej | an instance may be removed after the action plan is created | 12:17 |
| sean-k-mooney | correct bu tin this case it never existd | 12:18 |
| amoralej | note that the actuator strategy is just a test-only strategy which is proxy to create any action | 12:18 |
| amoralej | there is no logic at all in the strategy, but it is usefull to do this kind of simulations to test action logic imo | 12:18 |
| sean-k-mooney | yes but it shoudl not allow you to put in a fake instance id | 12:18 |
| sean-k-mooney | it shoudl assort that the inste exists in the model or nova | 12:19 |
| sean-k-mooney | so that to me is a very big input validation problem | 12:19 |
| amoralej | but then, the actuator strategy need to have specific per-action logic | 12:19 |
| sean-k-mooney | we cant assuem someone wont use thsi in production | 12:19 |
| amoralej | then, we should document that :) | 12:19 |
| sean-k-mooney | so we have to validate all actions it create agaisnt the model | 12:19 |
| sean-k-mooney | well this is borderlien a secuirty problem | 12:20 |
| sean-k-mooney | so i dont think documentaiton is enough here | 12:20 |
| amoralej | we could create the vm -> actuator with the action to get the AP -> remove the vm | 12:20 |
| sean-k-mooney | ^ would be valid | 12:20 |
| amoralej | actually, that's something i've though in the past | 12:20 |
| amoralej | hiding strategies | 12:20 |
| amoralej | it'd be good to disable / hide test-only strategies and goals | 12:21 |
| sean-k-mooney | yes it would or provide a way to jsut filter them in the config in general | 12:21 |
| sean-k-mooney | we can have an enabeld_stragey or enabled_goal config opiton | 12:21 |
| sean-k-mooney | nova and cidner both do that for there schduler filters and neutorn does it for there extentions | 12:22 |
| amoralej | imo, for testing is good to have an strategy that simply copies actions, but i agree it'd be good to have a mechanism to make sure those are not used in production | 12:22 |
| sean-k-mooney | so going back to testing | 12:22 |
| amoralej | same for dummy, nop | 12:22 |
| amoralej | yeah, let's go back | 12:23 |
| sean-k-mooney | if we emulate a reals env by creating the 2 vms, creating the action plan and deleteing one then movign the other | 12:23 |
| sean-k-mooney | this would be a vaild test again | 12:23 |
| sean-k-mooney | but i think we need a bug for this lack of input validation and it should returnn a 400 in the future | 12:24 |
| amoralej | I disagree but that's probably another discussion :) | 12:24 |
| amoralej | i think it's good to have that option (running actions with fake data) only for testing | 12:25 |
| sean-k-mooney | ok but im really sondering if i shoudl be opening a security bug or not. | 12:25 |
| sean-k-mooney | raw passhtough of actions without verifcaiton is not ok in my opipion | 12:25 |
| sean-k-mooney | we can dicuss that seperatly | 12:26 |
| amoralej | yes, +1 | 12:26 |
| amoralej | so, let's say we will cover tempest where reasonable, and will discuss on each review | 12:26 |
| amoralej | in parallel, we can create a bug for the actuator and discuss | 12:27 |
| sean-k-mooney | +1 | 12:27 |
| amoralej | last discussion, wrt the specific cases where we should SKIP or FAIL for the migrate I left a comment in https://review.opendev.org/c/openstack/watcher/+/966699 where we can discuss | 12:28 |
| amoralej | unless you want to discuss it now | 12:28 |
| rlandy | there are no other topics or reviews (juts bug triage) so up to you | 12:29 |
| amoralej | actually, I think we kind of agree in everything other that the case where source_node != current_node | 12:29 |
| sean-k-mooney | either works for me | 12:29 |
| amoralej | I understand that instance do not exist -> SKIPPED is clear | 12:30 |
| amoralej | destination_node does not exist or is disabled -> FAILED | 12:30 |
| amoralej | Instance status != ACTIVE in migration_type = Live -> FAILED | 12:30 |
| sean-k-mooney | to me i see the following options mark it as failed ignore it and move it to the selecte destiaon or skip if tis already on the selected destionat | 12:30 |
| sean-k-mooney | amoralej: basiclly im not sure you shoud leb looking at the souce node when moving it other then to confirm it not currently on the destination node | 12:32 |
| amoralej | My logic to incline to SKIP is that in most (or all) cases, the instance is moved because of the host where it is running, i.e. host_maintenance, zone_migrations or consolidating strategies are migrating the instances because of the host where they are running. | 12:32 |
| sean-k-mooney | if it moved out of band to the destiation node then its valid to skip | 12:32 |
| amoralej | there is no point in doing a migration for a host_maintenance i.e. if the vm is not in that host | 12:33 |
| amoralej | and similar to zone_migration, even for consolidation case | 12:33 |
| sean-k-mooney | so that only appleis to the zonew or host maintaince case | 12:33 |
| amoralej | or consolidation, intended to empty nodes | 12:34 |
| sean-k-mooney | but no the vm consolidation or workload stablization case | 12:34 |
| amoralej | even for workload_stabilization, we are moving because source_node it's beyond some threshold | 12:35 |
| amoralej | although that's te most complicated case, tbh | 12:35 |
| amoralej | something we may do, but that would increase the change is to allow to passed souce_node as None and do not check at all in that case | 12:36 |
| amoralej | so, each strategy would decide if want to check or not | 12:36 |
| amoralej | and if source_node != None and != current_node -> SKIPPED | 12:37 |
| amoralej | I'm not sure, tbh | 12:37 |
| amoralej | any opinions about it? | 12:37 |
| sean-k-mooney | ya im sort of coming around to your argumetn for skip instead of failed | 12:38 |
| sean-k-mooney | your effectivly arguing that the data tha twas used to generate the action plan was out of date | 12:38 |
| amoralej | exactly | 12:39 |
| sean-k-mooney | so we shoudl disregard any actions related to that workload as a result | 12:39 |
| sean-k-mooney | of couse if the vm is on a diffent host in the zone being drained | 12:39 |
| sean-k-mooney | ro a diffent host that is under mantiance then the goal will not hav ebeen achived | 12:39 |
| sean-k-mooney | so fi you have any skipped action you may need to run the audit and action plan a second time | 12:40 |
| sean-k-mooney | or many times until all succeed | 12:40 |
| amoralej | correct, but that goes beyond this change | 12:40 |
| sean-k-mooney | that is why i was orginlly saying the action plan shoudl be failed | 12:40 |
| sean-k-mooney | not skipped | 12:40 |
| sean-k-mooney | (an the action shoudl not be skipped but marked failed) | 12:40 |
| amoralej | actually, something that may be good in some cases is to recalculate the real efficiency indicators achieved by the execution of an AP | 12:43 |
| amoralej | at the end | 12:43 |
| amoralej | but that's beyond the scope of this | 12:43 |
| sean-k-mooney | so lets bring this back to the review | 12:43 |
| amoralej | yes | 12:43 |
| sean-k-mooney | oen thing we will need to do is docuemnt this in the action docs | 12:43 |
| sean-k-mooney | (and create those if we dont already have them) | 12:44 |
| amoralej | yes, actions have no document at all | 12:44 |
| amoralej | on the PTG we aggreed to create a new sections for actions | 12:44 |
| amoralej | btw, we may document which ones are test-only | 12:44 |
| sean-k-mooney | we can yes | 12:44 |
| amoralej | and on each action document the skipping conditions | 12:44 |
| sean-k-mooney | as i said that is not enough to adresss my concern unless we also disbale them by default and requrie cofig option ot opt in | 12:45 |
| amoralej | but i don't want to mix adding a brand new documentation section with the implementation of the skipped | 12:45 |
| sean-k-mooney | i do | 12:45 |
| sean-k-mooney | your change shoudl be complete | 12:46 |
| sean-k-mooney | so eithe we can do the new docs first | 12:46 |
| sean-k-mooney | or you shoudl do it as part of the change | 12:46 |
| amoralej | that's likely the best option | 12:46 |
| amoralej | i mean, adding new docs first | 12:46 |
| sean-k-mooney | ack that also works for me | 12:46 |
| sean-k-mooney | since we do not have a spec for this | 12:46 |
| amoralej | requires blueprint ? | 12:46 |
| sean-k-mooney | we need to have the docs | 12:46 |
| sean-k-mooney | we can do it as part fo this blueprint i think | 12:46 |
| amoralej | ok, i will mention it in the description | 12:47 |
| sean-k-mooney | although its gettign closer to a spec at that poitn but ya it makes sense to do this toghter as part fo the over all feature delviery | 12:47 |
| sean-k-mooney | anyone have concerns with ^ plan | 12:47 |
| jgilaber | +1 for me to include the docs | 12:48 |
| sean-k-mooney | 1 add simpel docs for each supproted action today, 2 enhacn the docs and each action with pre-condtions in seperate commits | 12:48 |
| rlandy | sounds fine | 12:49 |
| amoralej | ack, I will rewrite the description of the blueprint | 12:49 |
| rlandy | amoralej, did we cover all the questions and issues on this topic? | 12:49 |
| amoralej | yes, i think so | 12:49 |
| rlandy | ok - any further questions/comments from anyone else? | 12:50 |
| rlandy | moving on ... | 12:51 |
| rlandy | there were no review added to the meeting agenda | 12:51 |
| rlandy | so we can move to bug triage | 12:51 |
| rlandy | #topic: Bug triage | 12:51 |
| rlandy | #link: https://bugs.launchpad.net/watcher/+bug/2131043 | 12:51 |
| rlandy | opened on 2025-11-10 | 12:52 |
| rlandy | watcher decision engine fails when evaluating previously migrated instance | 12:52 |
| rlandy | bug logged with: | 12:52 |
| rlandy | jammy + caracal | 12:52 |
| rlandy | openstack 2024.1 charms | 12:52 |
| rlandy | note attached path ... "Attached is a patch that allows the decision engine to continue when it finds the stale or missing data. Testing with this patch in the same environment has results in over 150 successful action plans with no decision engine failures." | 12:53 |
| sean-k-mooney | this is realted to the other topic | 12:53 |
| sean-k-mooney | althoug hits a slighlty diffent problem statement | 12:54 |
| sean-k-mooney | teh decsion engin shoudl not crash and shoudl handel the instnace not bein gin the datamodel | 12:54 |
| rlandy | will the problem be address with current work in progress? | 12:55 |
| sean-k-mooney | but it shoudl not geenrate an action for that instance if we do not have data for it and the stragy required data | 12:55 |
| sean-k-mooney | likely not no | 12:55 |
| amoralej | so the case is a VM is deleted in the middle of an audit execution? | 12:55 |
| sean-k-mooney | this is before the action case | 12:55 |
| sean-k-mooney | no the vm is moved and there is no metrics for the host currently | 12:56 |
| jgilaber | this could also be because the datamodel is outdated | 12:56 |
| sean-k-mooney | yes | 12:56 |
| jgilaber | in the bug report it mentions the instance being migrated by another action plan | 12:56 |
| jgilaber | this is a problem we've seen already and solved by enabling notifications | 12:56 |
| jgilaber | although the attached patch since like a valid improvement to the strategy | 12:57 |
| sean-k-mooney | i think we aslo may have fixed this key error in a newer release | 12:57 |
| sean-k-mooney | at least a verian of it | 12:57 |
| amoralej | I'm not sure what workload_cache has tbh | 12:57 |
| jgilaber | I think that was in another strategy | 12:58 |
| jgilaber | I'm looking at the code in master and there is no mention of KeyError https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_balance.py | 12:58 |
| amoralej | note it's running it with --interval 60 so, likely the model is not being refreshed between executions | 12:58 |
| sean-k-mooney | we may have fixed it for the workload stablisation stratgy or anothe rone | 12:58 |
| sean-k-mooney | we need to confirm if this happens on master or not | 12:59 |
| rlandy | ack | 12:59 |
| amoralej | we may suggest to enable notifications too as jgilaber mentioned | 12:59 |
| sean-k-mooney | amoralej: right that hsoudl not resutl in a key error | 12:59 |
| sean-k-mooney | we shoudl be robust to that | 12:59 |
| amoralej | right | 12:59 |
| sean-k-mooney | the patch https://launchpadlibrarian.net/830569850/decision-engine.patch | 13:00 |
| sean-k-mooney | jhust ignores the instance and continue the loop | 13:00 |
| amoralej | but even if we do improve handling that error, it may create the plan based on incomplete data | 13:00 |
| sean-k-mooney | taht might be ok but we need to repoduce this in a test and then have test coverage for it when fixed | 13:00 |
| sean-k-mooney | amoralej: well if we require metrics as aprt of the stragy and we dont have them for a workload then the workload shoudl be expclded form teh action plan | 13:01 |
| amoralej | actually, we may reproduce it | 13:01 |
| sean-k-mooney | or we shoudl fail to produce one | 13:01 |
| amoralej | I'd ask about the period parameters they are using and if they have notifications | 13:02 |
| sean-k-mooney | they are likely not using it that short | 13:02 |
| sean-k-mooney | i think that is just to help use repoduce it | 13:02 |
| sean-k-mooney | watcher shoudl nto fail without notificaiton even with short periods for continus audits | 13:03 |
| sean-k-mooney | as an api use you cant know if notificaoitn are enabel or the period of the collectors | 13:03 |
| rlandy | what are our next steps here? ... status: triaged? importance: high? ask reporter if they can reproduce in master (without notifications)? | 13:04 |
| amoralej | yes, so i see two different issues, 1 - Watcher whould be more robust on those cases 2 - What are good configuration to get the best audit results | 13:04 |
| sean-k-mooney | set it to incomplete and either ask if they see this also on master or we will have to do that | 13:05 |
| amoralej | i.e. if no notification, continuous audits with interval < nova_colletor.period will easily lead to incomplete plans | 13:05 |
| sean-k-mooney | yep | 13:05 |
| sean-k-mooney | but that is ok | 13:05 |
| rlandy | ok .. marking bug | 13:05 |
| amoralej | you want me to reply in the case? | 13:05 |
| rlandy | if you would like ... other I will do it after the call | 13:06 |
| rlandy | next ... | 13:06 |
| sean-k-mooney | amoralej: sure, we have basically 2 option allow it to proceed and log a warning as they proppose | 13:06 |
| sean-k-mooney | or we need to reject the audit create with a 400 becuase the notificaotn are disabeld and the period is less the nthe collcotor | 13:06 |
| sean-k-mooney | i dont like the second option | 13:07 |
| sean-k-mooney | becaus that is leaking privaldge infrastuctre information | 13:07 |
| sean-k-mooney | i.e. the config option values | 13:07 |
| amoralej | yes, failing is likely too much | 13:08 |
| jgilaber | yes, failing the audit seems overkill in this scenario imo | 13:08 |
| rlandy | sorry folks - we're over time - so just touching on the last bug ... | 13:09 |
| rlandy | #link: https://bugs.launchpad.net/watcher/+bug/2129692 | 13:09 |
| rlandy | this one jgilaber entered | 13:10 |
| rlandy | as is "in progress" | 13:10 |
| rlandy | jgilaber, any discussion needed here? | 13:10 |
| jgilaber | this bug surfaced after I implemented support for the 'src_type' parameter | 13:11 |
| jgilaber | I have a patch for it https://review.opendev.org/c/openstack/watcher/+/964718 | 13:11 |
| jgilaber | in short the zone migration strategy might not select the correct destination because the get_dst_pool_and_type method does not take into account the src_type | 13:12 |
| jgilaber | given that we're out of time we can discuss it another time | 13:12 |
| rlandy | jgilaber, ok - sure - pls bring it back next week if needed | 13:13 |
| rlandy | thank you all for attending | 13:13 |
| jgilaber | thank rlandy! | 13:13 |
| rlandy | dviroel volunteered to chair next meeting | 13:13 |
| rlandy | #endmeeting | 13:13 |
| opendevmeet | Meeting ended Thu Nov 20 13:13:32 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 13:13 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/irc_watcher_meeting_20_november_2025/2025/irc_watcher_meeting_20_november_2025.2025-11-20-12.00.html | 13:13 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/irc_watcher_meeting_20_november_2025/2025/irc_watcher_meeting_20_november_2025.2025-11-20-12.00.txt | 13:13 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/irc_watcher_meeting_20_november_2025/2025/irc_watcher_meeting_20_november_2025.2025-11-20-12.00.log.html | 13:13 |
| opendevreview | chandan kumar proposed openstack/watcher-dashboard master: Add skip action feature with microversion support https://review.opendev.org/c/openstack/watcher-dashboard/+/958209 | 15:41 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: APISchedulingService migrate audits also on first discovery of services https://review.opendev.org/c/openstack/watcher/+/963981 | 15:45 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Move decision-engine monitoring service to the decision-engine https://review.opendev.org/c/openstack/watcher/+/963252 | 15:45 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!