Thursday, 2026-05-21

opendevreviewMerged openstack/watcher master: Freeze data_model API response fields  https://review.opendev.org/c/openstack/watcher/+/98677710:25
opendevreviewJoan Gilabert proposed openstack/watcher master: Remove nova_helper retries for openstacksdk params  https://review.opendev.org/c/openstack/watcher/+/97549810:47
amoralej_watcher meeting is starting in a while, remember to add your topics into https://etherpad.opendev.org/p/openstack-watcher-irc-meeting11:59
amoralej_#startmeeting Watcher 21 May 202612:00
opendevmeetMeeting started Thu May 21 12:00:29 2026 UTC and is due to finish in 60 minutes.  The chair is amoralej_. Information about MeetBot at http://wiki.debian.org/MeetBot.12:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.12:00
opendevmeetThe meeting name has been set to 'watcher_21_may_2026'12:00
amoralej_who is around today?12:00
morenodo/12:01
dviroelo/12:01
amoralej_courtesy ping jgilaber sean-k-mooney[m] rlandy 12:01
rlandyo/12:02
jgilabero/12:02
amoralej_you still can add last minute topics to https://etherpad.opendev.org/p/openstack-watcher-irc-meeting12:02
amoralej_let's start with the first topic12:03
amoralej_#topic https://review.opendev.org/c/openstack/watcher/+/975498 default behaviour for retries in nova helper12:03
amoralej_this is from jgilaber 12:04
amoralej_the stage is yours12:04
jgilaberthanks12:04
jgilaberI wanted to discuss how to handle the defaults for the retry parameters12:04
jgilaberin the linked patch I deprecate the existing http_retry* parameters from the [nova] group12:05
jgilaberin favour of the keystoneauth ones that will be used in for all services12:05
jgilaberbut the default behaviour is different12:05
jgilaberbefore this patch we had 3 retries with 0.5 delay12:06
jgilaberbut for the new parameters is no retries, exponential backoff for the delay12:06
amoralej_for how many retries?12:06
jgilaberby default none12:06
amoralej_what means exponentical backoff if there are no retries?12:07
jgilaberI assume no retries12:07
jgilaberand then the backoff is used if some retries are configured12:07
jgilaber#link https://docs.openstack.org/watcher/latest/configuration/watcher.html#keystone.connect_retry_delay12:08
jgilaberthis is the docs for the new parameters12:08
dviroel"maximum of 60 seconds"12:08
dviroelwhen set to exponential12:09
amoralej_yeah, so i understand the default connect_retry_delay is not relevant if connect_retries is not set12:10
amoralej_we should have some retries by default12:10
amoralej_i wouldn't mind too much if the behavior is exactly the same, but we need some retries by default12:11
jgilaberthis is I believe the relevant code https://github.com/openstack/keystoneauth/blob/92dade5b56767908b62cf516667b93278c59dd84/keystoneauth1/session.py#L118412:13
amoralej_would it be possible to keep the existing parameters and use their values to initialize the connection with the keystoneauth params ?12:13
jgilaberthat is an option12:13
jgilaberwe could do that and add the same parameters for all services12:13
amoralej_yep12:14
jgilaberthat seems to be the approach taken by nova12:15
jgilabere.g https://github.com/openstack/nova/blob/6901f0cdddc199fe23aef0cd34f86ae2066ead30/nova/conf/cinder.py#L72 and https://github.com/openstack/nova/blob/6901f0cdddc199fe23aef0cd34f86ae2066ead30/nova/conf/glance.py#L5412:15
amoralej_looks reasonable to me12:16
jgilaberack, I'll update the patch with this approach and we can discuss further there12:16
amoralej_we'd maintain our capacity of setting our default behaviour while using the logic of the sdk for the actual retries12:17
amoralej_you'd override values in services from watcher_clients section ?12:17
amoralej_no need probably12:18
jgilaberthis would be in each service section12:18
amoralej_wfm12:18
jgilaberno point in adding this parameter to the watcher_clients imo if that is what you're asking12:18
amoralej_yes12:19
amoralej_+112:19
amoralej_so, we can move on to next topic ?12:19
jgilaberyes, we can move on, thanks for the discussion12:19
amoralej_dviroel, you want to discuss now your bugs?12:20
dviroelyeah, we can jump to bugs topic12:20
dviroelthe first one in the list is actually morenod bug12:21
amoralej_ok, let's move there, i also missed jgilaber review request12:21
dviroeloh right12:21
amoralej_#topic Update docs for removed integrations in stable/2026.1 https://review.opendev.org/c/openstack/watcher/+/98934212:21
amoralej_that came first12:21
dviroeli already voted in this one ^12:21
jgilabernp, it'll be quick12:21
jgilaberit's just a small change removing from the docs the glance and neutron integrations that we deleted12:22
jgilaberI backported it to 2026.1 which is when we actually deleted them IIRC12:22
dviroelthey were deleted in gazpacho yeah +112:22
amoralej_merging, yep12:23
jgilaberI don't expect the patch to be controversial :) so I think we can move on12:23
dviroelthanks jgilaber 12:23
amoralej_good to split this change into its own patch to backport12:23
amoralej_let's move to the bugs then12:23
amoralej_#topic https://bugs.launchpad.net/watcher/+bug/2151776 audit_scope parameter cannot be applied on some strategies12:24
morenodey12:24
amoralej_that's the one from morenod 12:24
morenodso working on the tests for audit scope, I saw this12:24
dviroelso instances are not really excluded from the model12:25
dviroelthey are labeled as nstance.watcher_exclude12:25
amoralej_when the host is not excluded12:25
dviroelcorrect12:25
dviroelsome strategies may need to know that this instance exists, but should not be considered as part of the solution12:26
amoralej_what would be good is to describe in the bug the problematic behavior instead of the code itself, imo12:26
morenodok, I will update the description12:27
amoralej_i.e. audit with strategy X and scope Y, migrate vms which should not be migrated, or something like that12:27
dviroelso one follow up to this is, based on the current implementation of watcher exclude instance, which strategies are not working properly12:27
dviroelall of them need to know that? not sure12:28
morenodhost maintenance, node resource consolidation and zone migration are not working12:28
morenodthere are another ones that do not use, like noisy_neighbor, but I think they are out of scope at this moment12:29
morenodmost important ones are host_maintenance, node_resource_consolidation and zone_migration12:29
dviroelyeah, noisy neighbor is deprecated for removal for instance12:29
dviroelanother one like saving_energy12:30
dviroelit does other checks12:30
dviroel"hypervisor_node['running_vms'] == 0"12:30
dviroelto known if a node can be powered off12:30
dviroelin this case ^ at least some documentation is needed I think, so user is aware that a excluded node will not work for saving energy strategy12:31
dviroelit is not expected to work12:31
amoralej_yep, i feel changing that behavior will be problematic12:31
dviroeli think that we can originally fix the broken strategies12:32
amoralej_i mean, keeping somehow all the vms in the model if the host is not excluded12:32
amoralej_i agree12:32
dviroeldocument the ones that don't have any usage of the excluded instance12:32
dviroeland discuss a better approach for it as a enhancement12:32
jgilaber+1 to document carefully12:32
opendevreviewMerged openstack/watcher stable/2026.1: Update configuration and integrations docs  https://review.opendev.org/c/openstack/watcher/+/98934212:33
jgilaberI think users might interpret different potential behaviours for zone migration and host_maintenance12:33
dviroelyeah12:33
jgilaberfor host maintenance it makes sense to think this exclusion would not apply12:33
dviroelwe don't know how these strategies are handling these scope restrictions12:34
dviroeluser would be lost12:34
jgilaberin zone migration there could contradictions between the audit input parameters and the scop12:34
amoralej_yes12:35
jgilaberso we'll have to decide and document which one would prevail12:35
amoralej_it's important to understand and document12:35
dviroelthis bug will track many fixes i think, we may end up creating more?12:36
amoralej_so, next step would be to get the bug documented more clearly and then, i'd say create one bug per strategy and discuss each case by its own12:36
dviroelack, some strategies will be code + docs, others will be doc only12:37
dviroelso we can create the bugs in the process of fixing them12:37
dviroellets keep this one as central discussions12:38
amoralej_ok12:38
dviroelcan we set a priority to this one12:38
amoralej_high?12:38
dviroelI think that we should fix this soon, at least some strategies12:38
dviroelyeah, which translate for High12:38
amoralej_I agree12:38
amoralej_at least the more obvious cases12:39
dviroeljgilaber morenod what do you think?12:39
morenodyeah, to continue validating the feature, we need to fix it asap, so high is ok for me12:39
jgilaber+1 to high, this could be disruptive12:40
morenodI will create one bug per strategy12:40
amoralej_set12:40
dviroeljgilaber: already updated the bug12:40
amoralej_thanks morenod 12:40
amoralej_next bug?12:40
dviroel+112:41
amoralej_#topic NEW https://bugs.launchpad.net/watcher/+bug/2152254 reentrant deadlock in ModelRoot when calling map_instance() with string UUID arguments12:41
dviroelok12:41
dviroela CDM bug that I reported recently12:41
dviroelthat was identified when debuging issues with cdm locking implementation12:42
dviroelthis bus exists in the code, but none of its function call is using string UUID, so we don't hit the deadlock in watcher today12:43
dviroelbut the buggy code it there12:43
amoralej_thanks for that, that's clearly a bug we need to fix12:43
jgilaberI haven't yet looked at the patches for the locking, but this also seems like a high priority bug for me12:43
amoralej_and you already provided the fix, so great :) 12:43
dviroelright the fix should be this one12:44
dviroel#link https://review.opendev.org/c/openstack/watcher/+/98754012:44
dviroelto replace the current locking method12:44
dviroelthis adds a per model instance lock, which is reentrant lock12:45
amoralej_this particular bug could be medium, as does not affect watcher because of uuid is not used imo12:45
amoralej_i reset my vote from ^ given that you rebased it on top of the fix for the other issue, i'll review later12:46
dviroelright, i don't think that other external plugins may also use the affected methods12:46
amoralej_bug high / medium, anyway we want to merge the fix soon for the perf too, so...12:46
dviroelbut yeah, not sure.12:46
amoralej_we can set high, no problem12:46
dviroelack12:47
dviroelagrre12:47
dviroelso when proposing this fix12:47
amoralej_#topic https://bugs.launchpad.net/watcher/+bug/2152645 Decision engine loses notification updates during cluster data model synchronization12:47
dviroelamoralej_: noticed that we were hitting other issue in ci12:47
dviroelthat  decision engine loses notification updates during cluster data model synchronization12:48
dviroel:) 12:48
amoralej_yep12:48
dviroelthat is miore likely to happen when working with big environments12:48
dviroelthe collector periodically builds a new model and replace the old one12:49
dviroelsome notification may be lost during this process12:49
dviroelthere is a example scenario in the LP report12:49
dviroelthere are 2 possible solutions identified12:50
dviroelthe first one would be a simple sync lock between these two threads12:50
dviroelthe collector and the notification ones12:50
dviroelthe trade off could be the collector taking too long to return the lock12:51
dviroeland we also loose some notifications due to timeouts/queue limits12:51
dviroelwhich I couldn't validate the impact ^12:51
dviroelthe other option would be to, while collector is updating the model, we cache the notifications to process them later12:52
dviroelwe could got with both approaches, the first one is better than current code12:52
dviroeland propose the second one as follow-up and to get more feedbacks12:53
dviroelso that would not block us on fixing the first bug too12:53
amoralej_one question, so, iiuc the threads for running the model via collector and the one for the notifications are dedicated to that, right?12:53
dviroeli will ping sean-k-mooney[m] here so they can also get some feedback on this apprach when they have some time...12:54
amoralej_i mean, no impact on audit executions, right?12:54
dviroelthey are dedicated yes12:54
dviroelin the audit itself not12:54
dviroelsince the audit gets a copy from current model12:55
amoralej_about "loose some notifications due to timeouts/queue limits", i'm not sure in which cases we may have problems12:55
dviroeland the lock for getting this model is a different one12:55
dviroelwhat would be happening is that, with the new lock, once the collector is running, we will not receive updates in the current/old model anymore, since the notification thread will be waiting for the sync_lock12:56
amoralej_but that's not a problem by itself, as soon as the changes are applied after the collection is done12:57
dviroelyeah, this means that during this time, the audit will be getting a model that was not updated with recent notifications12:57
dviroelthis ^ can be fixing with the notitication cache and replay of them, which is the second approach12:57
dviroels/fixing/fixed12:58
amoralej_but in the first approach, would also apply the notifications later, right?12:58
dviroellater, after the collector finishes12:58
dviroelyeah12:58
amoralej_exactly12:58
dviroelif the collector takes 2  min to complete, the notifications during this 2 min will be held12:59
dviroeland today they may be lost 12:59
dviroelwithout any fix i mean12:59
amoralej_in which case lost ? notifications should remain in the notification topic12:59
amoralej_actually, iirc notification have no (or very long) ttl12:59
dviroelthe lost scenario is today, without any fix I mean13:00
amoralej_ah, sorry, right13:00
amoralej_sure, we need to fix that13:00
amoralej_i was referring to option 1 vs 213:01
dviroelso my proposal for now is with 1) and furhter discuss the need of 213:01
amoralej_makes sense13:01
dviroelwe can sync more about that in the patch itself13:01
amoralej_i went directly to the "further discuss the need of 2" sorry :) 13:01
dviroelsince it is already proposed13:01
amoralej_i agree13:01
dviroel#link https://review.opendev.org/c/openstack/watcher/+/98939613:01
dviroelptal when have some time13:02
dviroelwe need to set a priority 13:02
amoralej_i'd put high13:02
amoralej_given that it's also a dependency to merge the  rest of locking changes13:02
dviroelack13:02
dviroeldepending on the collector period, this could be fixed only in hours, so the model can be outdated for a long time13:03
dviroelor only when dec-eng is restarted13:03
dviroelset to Hight them13:03
amoralej_yes, worst case, the following periodic collector would fix it13:03
amoralej_we can discuss about the need of 2 in future meetings13:04
dviroelack13:04
dviroelthat's it13:04
amoralej_thanks for proposing the fix so fast dviroel 13:04
amoralej_we are out of topics and out of time :)13:04
amoralej_dviroel is volunteer to chair next meeting, thanks!13:04
dviroeli can chair next week yeah13:05
dviroelsince i will be out the following one13:05
amoralej_#action dviroel to chair next meeting13:05
amoralej_unless there is some last minute topic to discuss, we can close the meeting13:05
amoralej_thank you all for joining!13:06
dviroelthanks amoralej_ o/13:06
amoralej_#endmeeting13:06
opendevmeetMeeting ended Thu May 21 13:06:19 2026 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)13:06
opendevmeetMinutes:        https://meetings.opendev.org/meetings/watcher_21_may_2026/2026/watcher_21_may_2026.2026-05-21-12.00.html13:06
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/watcher_21_may_2026/2026/watcher_21_may_2026.2026-05-21-12.00.txt13:06
opendevmeetLog:            https://meetings.opendev.org/meetings/watcher_21_may_2026/2026/watcher_21_may_2026.2026-05-21-12.00.log.html13:06
jgilaberthanks amoralej_ !13:06
*** haleyb is now known as haleyb|out21:48

Generated by irclog2html.py 4.1.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!