| opendevreview | Merged openstack/watcher master: Freeze data_model API response fields https://review.opendev.org/c/openstack/watcher/+/986777 | 10:25 |
|---|---|---|
| opendevreview | Joan Gilabert proposed openstack/watcher master: Remove nova_helper retries for openstacksdk params https://review.opendev.org/c/openstack/watcher/+/975498 | 10:47 |
| amoralej_ | watcher meeting is starting in a while, remember to add your topics into https://etherpad.opendev.org/p/openstack-watcher-irc-meeting | 11:59 |
| amoralej_ | #startmeeting Watcher 21 May 2026 | 12:00 |
| opendevmeet | Meeting started Thu May 21 12:00:29 2026 UTC and is due to finish in 60 minutes. The chair is amoralej_. Information about MeetBot at http://wiki.debian.org/MeetBot. | 12:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 12:00 |
| opendevmeet | The meeting name has been set to 'watcher_21_may_2026' | 12:00 |
| amoralej_ | who is around today? | 12:00 |
| morenod | o/ | 12:01 |
| dviroel | o/ | 12:01 |
| amoralej_ | courtesy ping jgilaber sean-k-mooney[m] rlandy | 12:01 |
| rlandy | o/ | 12:02 |
| jgilaber | o/ | 12:02 |
| amoralej_ | you still can add last minute topics to https://etherpad.opendev.org/p/openstack-watcher-irc-meeting | 12:02 |
| amoralej_ | let's start with the first topic | 12:03 |
| amoralej_ | #topic https://review.opendev.org/c/openstack/watcher/+/975498 default behaviour for retries in nova helper | 12:03 |
| amoralej_ | this is from jgilaber | 12:04 |
| amoralej_ | the stage is yours | 12:04 |
| jgilaber | thanks | 12:04 |
| jgilaber | I wanted to discuss how to handle the defaults for the retry parameters | 12:04 |
| jgilaber | in the linked patch I deprecate the existing http_retry* parameters from the [nova] group | 12:05 |
| jgilaber | in favour of the keystoneauth ones that will be used in for all services | 12:05 |
| jgilaber | but the default behaviour is different | 12:05 |
| jgilaber | before this patch we had 3 retries with 0.5 delay | 12:06 |
| jgilaber | but for the new parameters is no retries, exponential backoff for the delay | 12:06 |
| amoralej_ | for how many retries? | 12:06 |
| jgilaber | by default none | 12:06 |
| amoralej_ | what means exponentical backoff if there are no retries? | 12:07 |
| jgilaber | I assume no retries | 12:07 |
| jgilaber | and then the backoff is used if some retries are configured | 12:07 |
| jgilaber | #link https://docs.openstack.org/watcher/latest/configuration/watcher.html#keystone.connect_retry_delay | 12:08 |
| jgilaber | this is the docs for the new parameters | 12:08 |
| dviroel | "maximum of 60 seconds" | 12:08 |
| dviroel | when set to exponential | 12:09 |
| amoralej_ | yeah, so i understand the default connect_retry_delay is not relevant if connect_retries is not set | 12:10 |
| amoralej_ | we should have some retries by default | 12:10 |
| amoralej_ | i wouldn't mind too much if the behavior is exactly the same, but we need some retries by default | 12:11 |
| jgilaber | this is I believe the relevant code https://github.com/openstack/keystoneauth/blob/92dade5b56767908b62cf516667b93278c59dd84/keystoneauth1/session.py#L1184 | 12:13 |
| amoralej_ | would it be possible to keep the existing parameters and use their values to initialize the connection with the keystoneauth params ? | 12:13 |
| jgilaber | that is an option | 12:13 |
| jgilaber | we could do that and add the same parameters for all services | 12:13 |
| amoralej_ | yep | 12:14 |
| jgilaber | that seems to be the approach taken by nova | 12:15 |
| jgilaber | e.g https://github.com/openstack/nova/blob/6901f0cdddc199fe23aef0cd34f86ae2066ead30/nova/conf/cinder.py#L72 and https://github.com/openstack/nova/blob/6901f0cdddc199fe23aef0cd34f86ae2066ead30/nova/conf/glance.py#L54 | 12:15 |
| amoralej_ | looks reasonable to me | 12:16 |
| jgilaber | ack, I'll update the patch with this approach and we can discuss further there | 12:16 |
| amoralej_ | we'd maintain our capacity of setting our default behaviour while using the logic of the sdk for the actual retries | 12:17 |
| amoralej_ | you'd override values in services from watcher_clients section ? | 12:17 |
| amoralej_ | no need probably | 12:18 |
| jgilaber | this would be in each service section | 12:18 |
| amoralej_ | wfm | 12:18 |
| jgilaber | no point in adding this parameter to the watcher_clients imo if that is what you're asking | 12:18 |
| amoralej_ | yes | 12:19 |
| amoralej_ | +1 | 12:19 |
| amoralej_ | so, we can move on to next topic ? | 12:19 |
| jgilaber | yes, we can move on, thanks for the discussion | 12:19 |
| amoralej_ | dviroel, you want to discuss now your bugs? | 12:20 |
| dviroel | yeah, we can jump to bugs topic | 12:20 |
| dviroel | the first one in the list is actually morenod bug | 12:21 |
| amoralej_ | ok, let's move there, i also missed jgilaber review request | 12:21 |
| dviroel | oh right | 12:21 |
| amoralej_ | #topic Update docs for removed integrations in stable/2026.1 https://review.opendev.org/c/openstack/watcher/+/989342 | 12:21 |
| amoralej_ | that came first | 12:21 |
| dviroel | i already voted in this one ^ | 12:21 |
| jgilaber | np, it'll be quick | 12:21 |
| jgilaber | it's just a small change removing from the docs the glance and neutron integrations that we deleted | 12:22 |
| jgilaber | I backported it to 2026.1 which is when we actually deleted them IIRC | 12:22 |
| dviroel | they were deleted in gazpacho yeah +1 | 12:22 |
| amoralej_ | merging, yep | 12:23 |
| jgilaber | I don't expect the patch to be controversial :) so I think we can move on | 12:23 |
| dviroel | thanks jgilaber | 12:23 |
| amoralej_ | good to split this change into its own patch to backport | 12:23 |
| amoralej_ | let's move to the bugs then | 12:23 |
| amoralej_ | #topic https://bugs.launchpad.net/watcher/+bug/2151776 audit_scope parameter cannot be applied on some strategies | 12:24 |
| morenod | ey | 12:24 |
| amoralej_ | that's the one from morenod | 12:24 |
| morenod | so working on the tests for audit scope, I saw this | 12:24 |
| dviroel | so instances are not really excluded from the model | 12:25 |
| dviroel | they are labeled as nstance.watcher_exclude | 12:25 |
| amoralej_ | when the host is not excluded | 12:25 |
| dviroel | correct | 12:25 |
| dviroel | some strategies may need to know that this instance exists, but should not be considered as part of the solution | 12:26 |
| amoralej_ | what would be good is to describe in the bug the problematic behavior instead of the code itself, imo | 12:26 |
| morenod | ok, I will update the description | 12:27 |
| amoralej_ | i.e. audit with strategy X and scope Y, migrate vms which should not be migrated, or something like that | 12:27 |
| dviroel | so one follow up to this is, based on the current implementation of watcher exclude instance, which strategies are not working properly | 12:27 |
| dviroel | all of them need to know that? not sure | 12:28 |
| morenod | host maintenance, node resource consolidation and zone migration are not working | 12:28 |
| morenod | there are another ones that do not use, like noisy_neighbor, but I think they are out of scope at this moment | 12:29 |
| morenod | most important ones are host_maintenance, node_resource_consolidation and zone_migration | 12:29 |
| dviroel | yeah, noisy neighbor is deprecated for removal for instance | 12:29 |
| dviroel | another one like saving_energy | 12:30 |
| dviroel | it does other checks | 12:30 |
| dviroel | "hypervisor_node['running_vms'] == 0" | 12:30 |
| dviroel | to known if a node can be powered off | 12:30 |
| dviroel | in this case ^ at least some documentation is needed I think, so user is aware that a excluded node will not work for saving energy strategy | 12:31 |
| dviroel | it is not expected to work | 12:31 |
| amoralej_ | yep, i feel changing that behavior will be problematic | 12:31 |
| dviroel | i think that we can originally fix the broken strategies | 12:32 |
| amoralej_ | i mean, keeping somehow all the vms in the model if the host is not excluded | 12:32 |
| amoralej_ | i agree | 12:32 |
| dviroel | document the ones that don't have any usage of the excluded instance | 12:32 |
| dviroel | and discuss a better approach for it as a enhancement | 12:32 |
| jgilaber | +1 to document carefully | 12:32 |
| opendevreview | Merged openstack/watcher stable/2026.1: Update configuration and integrations docs https://review.opendev.org/c/openstack/watcher/+/989342 | 12:33 |
| jgilaber | I think users might interpret different potential behaviours for zone migration and host_maintenance | 12:33 |
| dviroel | yeah | 12:33 |
| jgilaber | for host maintenance it makes sense to think this exclusion would not apply | 12:33 |
| dviroel | we don't know how these strategies are handling these scope restrictions | 12:34 |
| dviroel | user would be lost | 12:34 |
| jgilaber | in zone migration there could contradictions between the audit input parameters and the scop | 12:34 |
| amoralej_ | yes | 12:35 |
| jgilaber | so we'll have to decide and document which one would prevail | 12:35 |
| amoralej_ | it's important to understand and document | 12:35 |
| dviroel | this bug will track many fixes i think, we may end up creating more? | 12:36 |
| amoralej_ | so, next step would be to get the bug documented more clearly and then, i'd say create one bug per strategy and discuss each case by its own | 12:36 |
| dviroel | ack, some strategies will be code + docs, others will be doc only | 12:37 |
| dviroel | so we can create the bugs in the process of fixing them | 12:37 |
| dviroel | lets keep this one as central discussions | 12:38 |
| amoralej_ | ok | 12:38 |
| dviroel | can we set a priority to this one | 12:38 |
| amoralej_ | high? | 12:38 |
| dviroel | I think that we should fix this soon, at least some strategies | 12:38 |
| dviroel | yeah, which translate for High | 12:38 |
| amoralej_ | I agree | 12:38 |
| amoralej_ | at least the more obvious cases | 12:39 |
| dviroel | jgilaber morenod what do you think? | 12:39 |
| morenod | yeah, to continue validating the feature, we need to fix it asap, so high is ok for me | 12:39 |
| jgilaber | +1 to high, this could be disruptive | 12:40 |
| morenod | I will create one bug per strategy | 12:40 |
| amoralej_ | set | 12:40 |
| dviroel | jgilaber: already updated the bug | 12:40 |
| amoralej_ | thanks morenod | 12:40 |
| amoralej_ | next bug? | 12:40 |
| dviroel | +1 | 12:41 |
| amoralej_ | #topic NEW https://bugs.launchpad.net/watcher/+bug/2152254 reentrant deadlock in ModelRoot when calling map_instance() with string UUID arguments | 12:41 |
| dviroel | ok | 12:41 |
| dviroel | a CDM bug that I reported recently | 12:41 |
| dviroel | that was identified when debuging issues with cdm locking implementation | 12:42 |
| dviroel | this bus exists in the code, but none of its function call is using string UUID, so we don't hit the deadlock in watcher today | 12:43 |
| dviroel | but the buggy code it there | 12:43 |
| amoralej_ | thanks for that, that's clearly a bug we need to fix | 12:43 |
| jgilaber | I haven't yet looked at the patches for the locking, but this also seems like a high priority bug for me | 12:43 |
| amoralej_ | and you already provided the fix, so great :) | 12:43 |
| dviroel | right the fix should be this one | 12:44 |
| dviroel | #link https://review.opendev.org/c/openstack/watcher/+/987540 | 12:44 |
| dviroel | to replace the current locking method | 12:44 |
| dviroel | this adds a per model instance lock, which is reentrant lock | 12:45 |
| amoralej_ | this particular bug could be medium, as does not affect watcher because of uuid is not used imo | 12:45 |
| amoralej_ | i reset my vote from ^ given that you rebased it on top of the fix for the other issue, i'll review later | 12:46 |
| dviroel | right, i don't think that other external plugins may also use the affected methods | 12:46 |
| amoralej_ | bug high / medium, anyway we want to merge the fix soon for the perf too, so... | 12:46 |
| dviroel | but yeah, not sure. | 12:46 |
| amoralej_ | we can set high, no problem | 12:46 |
| dviroel | ack | 12:47 |
| dviroel | agrre | 12:47 |
| dviroel | so when proposing this fix | 12:47 |
| amoralej_ | #topic https://bugs.launchpad.net/watcher/+bug/2152645 Decision engine loses notification updates during cluster data model synchronization | 12:47 |
| dviroel | amoralej_: noticed that we were hitting other issue in ci | 12:47 |
| dviroel | that decision engine loses notification updates during cluster data model synchronization | 12:48 |
| dviroel | :) | 12:48 |
| amoralej_ | yep | 12:48 |
| dviroel | that is miore likely to happen when working with big environments | 12:48 |
| dviroel | the collector periodically builds a new model and replace the old one | 12:49 |
| dviroel | some notification may be lost during this process | 12:49 |
| dviroel | there is a example scenario in the LP report | 12:49 |
| dviroel | there are 2 possible solutions identified | 12:50 |
| dviroel | the first one would be a simple sync lock between these two threads | 12:50 |
| dviroel | the collector and the notification ones | 12:50 |
| dviroel | the trade off could be the collector taking too long to return the lock | 12:51 |
| dviroel | and we also loose some notifications due to timeouts/queue limits | 12:51 |
| dviroel | which I couldn't validate the impact ^ | 12:51 |
| dviroel | the other option would be to, while collector is updating the model, we cache the notifications to process them later | 12:52 |
| dviroel | we could got with both approaches, the first one is better than current code | 12:52 |
| dviroel | and propose the second one as follow-up and to get more feedbacks | 12:53 |
| dviroel | so that would not block us on fixing the first bug too | 12:53 |
| amoralej_ | one question, so, iiuc the threads for running the model via collector and the one for the notifications are dedicated to that, right? | 12:53 |
| dviroel | i will ping sean-k-mooney[m] here so they can also get some feedback on this apprach when they have some time... | 12:54 |
| amoralej_ | i mean, no impact on audit executions, right? | 12:54 |
| dviroel | they are dedicated yes | 12:54 |
| dviroel | in the audit itself not | 12:54 |
| dviroel | since the audit gets a copy from current model | 12:55 |
| amoralej_ | about "loose some notifications due to timeouts/queue limits", i'm not sure in which cases we may have problems | 12:55 |
| dviroel | and the lock for getting this model is a different one | 12:55 |
| dviroel | what would be happening is that, with the new lock, once the collector is running, we will not receive updates in the current/old model anymore, since the notification thread will be waiting for the sync_lock | 12:56 |
| amoralej_ | but that's not a problem by itself, as soon as the changes are applied after the collection is done | 12:57 |
| dviroel | yeah, this means that during this time, the audit will be getting a model that was not updated with recent notifications | 12:57 |
| dviroel | this ^ can be fixing with the notitication cache and replay of them, which is the second approach | 12:57 |
| dviroel | s/fixing/fixed | 12:58 |
| amoralej_ | but in the first approach, would also apply the notifications later, right? | 12:58 |
| dviroel | later, after the collector finishes | 12:58 |
| dviroel | yeah | 12:58 |
| amoralej_ | exactly | 12:58 |
| dviroel | if the collector takes 2 min to complete, the notifications during this 2 min will be held | 12:59 |
| dviroel | and today they may be lost | 12:59 |
| dviroel | without any fix i mean | 12:59 |
| amoralej_ | in which case lost ? notifications should remain in the notification topic | 12:59 |
| amoralej_ | actually, iirc notification have no (or very long) ttl | 12:59 |
| dviroel | the lost scenario is today, without any fix I mean | 13:00 |
| amoralej_ | ah, sorry, right | 13:00 |
| amoralej_ | sure, we need to fix that | 13:00 |
| amoralej_ | i was referring to option 1 vs 2 | 13:01 |
| dviroel | so my proposal for now is with 1) and furhter discuss the need of 2 | 13:01 |
| amoralej_ | makes sense | 13:01 |
| dviroel | we can sync more about that in the patch itself | 13:01 |
| amoralej_ | i went directly to the "further discuss the need of 2" sorry :) | 13:01 |
| dviroel | since it is already proposed | 13:01 |
| amoralej_ | i agree | 13:01 |
| dviroel | #link https://review.opendev.org/c/openstack/watcher/+/989396 | 13:01 |
| dviroel | ptal when have some time | 13:02 |
| dviroel | we need to set a priority | 13:02 |
| amoralej_ | i'd put high | 13:02 |
| amoralej_ | given that it's also a dependency to merge the rest of locking changes | 13:02 |
| dviroel | ack | 13:02 |
| dviroel | depending on the collector period, this could be fixed only in hours, so the model can be outdated for a long time | 13:03 |
| dviroel | or only when dec-eng is restarted | 13:03 |
| dviroel | set to Hight them | 13:03 |
| amoralej_ | yes, worst case, the following periodic collector would fix it | 13:03 |
| amoralej_ | we can discuss about the need of 2 in future meetings | 13:04 |
| dviroel | ack | 13:04 |
| dviroel | that's it | 13:04 |
| amoralej_ | thanks for proposing the fix so fast dviroel | 13:04 |
| amoralej_ | we are out of topics and out of time :) | 13:04 |
| amoralej_ | dviroel is volunteer to chair next meeting, thanks! | 13:04 |
| dviroel | i can chair next week yeah | 13:05 |
| dviroel | since i will be out the following one | 13:05 |
| amoralej_ | #action dviroel to chair next meeting | 13:05 |
| amoralej_ | unless there is some last minute topic to discuss, we can close the meeting | 13:05 |
| amoralej_ | thank you all for joining! | 13:06 |
| dviroel | thanks amoralej_ o/ | 13:06 |
| amoralej_ | #endmeeting | 13:06 |
| opendevmeet | Meeting ended Thu May 21 13:06:19 2026 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 13:06 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/watcher_21_may_2026/2026/watcher_21_may_2026.2026-05-21-12.00.html | 13:06 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/watcher_21_may_2026/2026/watcher_21_may_2026.2026-05-21-12.00.txt | 13:06 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/watcher_21_may_2026/2026/watcher_21_may_2026.2026-05-21-12.00.log.html | 13:06 |
| jgilaber | thanks amoralej_ ! | 13:06 |
| *** haleyb is now known as haleyb|out | 21:48 | |
Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!