12:00:33 <dviroel> #startmeeting watcher
12:00:33 <opendevmeet> Meeting started Thu Apr 24 12:00:33 2025 UTC and is due to finish in 60 minutes.  The chair is dviroel. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:33 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:33 <opendevmeet> The meeting name has been set to 'watcher'
12:01:02 <dviroel> hi o/ - who's around today?
12:01:10 <rlandy> o/
12:01:16 <mtembo> Hello o/
12:01:23 <jgilaber> hello o/
12:02:59 <dviroel> here is our meeting agenda for today
12:03:06 <dviroel> #link https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L33 (Meeting agenda)
12:03:18 <amoralej> o/
12:03:23 <dviroel> please, feel free to add your own topics to the agenda
12:03:49 <dviroel> there is a topic to place your changes that require attention from reviewers
12:04:13 <dviroel> there is also a topic for bugs, if you want to discuss any, please add to the end of the list too
12:05:10 <dviroel> #topic Courtesy ping
12:05:15 <dviroel> I added this one
12:05:30 <dviroel> just want to propose the courtesy ping list idea
12:05:47 <dviroel> it has been part of the manila meetings for a long time already, and imho it is useful
12:06:10 <dviroel> we keep a list of irc nicks, at the top of the schedule, that want to receive a courtesy ping when the meeting starts, as a reminder
12:06:46 <dviroel> if you want to receive the ping, just add your nick there, you can also remove it anytime
12:07:18 <dviroel> the chair will only need to copy and paste the list, when the irc meeting starts
12:07:36 <jgilaber> +1 sounds useful
12:07:53 <dviroel> so you don't miss anything :)
12:09:36 <dviroel> alright then, people can just add/remove their nicks as they want there
12:09:38 <dviroel> tks
12:09:40 <dviroel> next one
12:09:57 <dviroel> #topic Reviews that need attention
12:10:18 <dviroel> the first 2 are specs ready to review
12:10:35 <dviroel> which I already brought up in the last meeting
12:10:42 <dviroel> #link https://review.opendev.org/c/openstack/watcher-specs/+/943873 (disable cold/live migration for host maintenance strategy)
12:10:57 <dviroel> this one I also need to get back and review the latest patch sets
12:11:15 <dviroel> we already discussed it at the ptg too
12:11:17 <dviroel> pls take a look
12:11:29 <dviroel> #link https://review.opendev.org/c/openstack/watcher-specs/+/947282 (Adds spec for extend compute model attributes)
12:12:08 <dviroel> recently added this spec, to start discussing how we can extend the compute model attributes, and how to use this information to improve our strategies
12:12:22 <dviroel> i would like to receive some feedback there too
12:12:27 <amoralej> wrt disable cold/live i was thinking it'd be nice probably for other strategies too, so it'd be nice if it can be implemented in a reusable way
12:13:23 <dviroel> right, probably something that we could consider, depending on the strategy
12:14:07 <dviroel> most of the strategies implement their own decision on that, which is usually to check the status of the instance
12:14:22 <dviroel> to decide between live and cold migration
12:15:28 <dviroel> we can discuss more about that in the spec, or even in the gerrit change proposed
12:15:35 <amoralej> yep
12:15:57 <dviroel> ++
12:16:13 <dviroel> ok, next in the list
12:16:26 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/945331 (Make prometheus the default devstack example)
12:17:05 <dviroel> both depends-on merged already
12:17:10 <jgilaber> this change is ready for review, it adds devstack local.conf samples to deploy with prometheus as datasource
12:17:32 <jgilaber> it also keeps the gnocchi samples that we have currently
12:18:02 <dviroel> nice, thanks for updating the docs too
12:18:08 <dviroel> I will take a look after the meeting
12:18:27 <jgilaber> thanks dviroel
12:18:46 <dviroel> ping sean-k-mooney to revisit it too
12:19:11 <sean-k-mooney> o/
12:19:34 * dviroel sean-k-mooney o/
12:19:34 <sean-k-mooney> yep ill try and look at that again
12:19:41 <dviroel> tks
12:20:20 <dviroel> next
12:20:34 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/946153 (Query by fqdn_label instead of instance for host metrics)
12:20:57 <dviroel> this is a backport of an important fix
12:21:11 <dviroel> but depends on the other one in the chain
12:21:25 <dviroel> which is:
12:21:34 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/946737/ (Drop sg_core prometheus related vars)
12:21:54 <dviroel> sean-k-mooney: ^ pls, this is a small backport, for ci
12:22:02 <sean-k-mooney> so that's approved
12:22:03 <dviroel> when you have some time
12:22:10 <sean-k-mooney> oh the second one
12:22:14 <sean-k-mooney> sure i'll take a look
12:22:21 <sean-k-mooney> https://review.opendev.org/c/openstack/watcher/+/946153
12:22:33 <sean-k-mooney> i assume it is just sitting in the zuul gate pipeline, right
12:22:45 <dviroel> no, there is a relation chain
12:22:57 <sean-k-mooney> oh right
12:23:09 <sean-k-mooney> ok ya
12:23:14 <amoralej> https://review.opendev.org/c/openstack/watcher/+/946737/ is required to unblock the ci, that's why the rest are rebased on it
12:23:45 <sean-k-mooney> ack i have approved that now
12:23:49 <dviroel> and the 3rd one in the chain will be
12:23:52 <amoralej> thanks
12:23:59 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/946732 (Aggregate by fqdn label instead instance in host cpu metrics)
12:24:13 <sean-k-mooney> ya we should merge all 3 together
12:24:14 <dviroel> which is a follow up on the second one
12:24:33 <dviroel> ack, thanks amoralej for proposing them
12:24:57 <amoralej> I can propose a bugfix release once we have the three merged
12:25:27 <sean-k-mooney> we could
12:25:42 <sean-k-mooney> the more important release is the final bobcat release
12:25:47 <sean-k-mooney> that should happen this week
12:25:58 <sean-k-mooney> but we can do a release of all stable branches this/next week
12:26:13 <dviroel> +1
12:26:18 <amoralej> +1
12:26:52 <dviroel> ok, anyone want to bring any other review?
12:27:18 <dviroel> #topic Bug Triage
12:27:39 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2107467 (workload_stabilization strategy does not show standard_deviation if it's below the audit threshold)
12:27:54 <dviroel> jgilaber o/
12:28:05 <jgilaber> I wanted to get some thoughts on this bug
12:28:10 <jgilaber> to me it seems a UX bug
12:28:37 <sean-k-mooney> it's mostly cosmetic so i would mark it as low importance but it's definitely valid
12:28:52 <sean-k-mooney> you also have been working on fixing this already
12:29:06 <sean-k-mooney> right, this is related to how we store it in the db?
12:29:11 <jgilaber> it's not quite the same
12:29:26 <sean-k-mooney> oh how is this different
12:29:31 <jgilaber> I found this bug when testing my fix for the db
12:30:05 <jgilaber> the standard deviation is only stored if it is larger than the user defined threshold for any of the metrics
12:30:10 <jgilaber> it comes from https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L472
12:30:24 <jgilaber> see the if in line 465
12:30:34 <jgilaber> if it's not, then the default value of 0.0 is stored
12:30:35 <sean-k-mooney> you could argue that its expected
12:30:51 <sean-k-mooney> since it did not meet the specified threshold it was "skipped"
12:30:55 <jgilaber> yes, it's somewhat ambiguous what to expect, that's why I wanted to bring it up
12:30:55 <sean-k-mooney> with that said
12:31:06 <amoralej> even if it's below the threshold, i'd say it should be displayed
12:31:13 <sean-k-mooney> i think long term we will want to enhance watcher to emit notifications for audits
12:31:25 <sean-k-mooney> and we will want to include the efficacy indicators
12:31:40 <jgilaber> on the one hand I would expect to show the deviation calculated even if below the threshold
12:31:40 <sean-k-mooney> so i think it would be ok to change the behavior to always store the calculated values
12:31:46 <amoralej> it may be useful to track trends over time, i.e.
12:31:49 <amoralej> +1
12:31:58 <dviroel> hum, if it is below the expected value, there is no optimization to be done, which means that there are no efficacy indicators?
12:32:23 <sean-k-mooney> to me this is a precondition failure
12:32:34 <sean-k-mooney> it did not meet the minimum required threshold
12:32:48 <sean-k-mooney> so the original authors chose not to store the values
12:33:02 <sean-k-mooney> but i get the ux side that jgilaber is raising
12:33:11 <sean-k-mooney> and i agree it's both confusing and ambiguous
12:33:31 <sean-k-mooney> so i think we can set this to triaged and low
12:33:44 <dviroel> i agree that it is useful information too, to be displayed
12:33:46 <sean-k-mooney> and then fix when we have time unless others object
12:34:01 <jgilaber> there is another complication if we decide to change, what to do when there is more than one metric, display the largest deviation?
12:34:46 <sean-k-mooney> we need to display all of them
12:34:55 <sean-k-mooney> i think we already modified the dashboard to do that
12:35:09 <sean-k-mooney> so to be clear it would not be ok to change the response format
12:35:24 <sean-k-mooney> we can save the calculated value instead of 0.0
12:35:36 <amoralej> does it manage the thresholds independently if it uses two metrics? (cpu and memory, e.g.)
12:35:41 <sean-k-mooney> but we cannot add or remove fields or change the overall response as that would require a new api microversion
12:35:45 <sean-k-mooney> and therefore a spec
12:36:04 <jgilaber> I don't think we can store more than one value in an efficacy indicator
12:36:21 <jgilaber> we would need to add additional ones
12:36:28 <sean-k-mooney> it's a list i believe
12:36:43 <sean-k-mooney> lets look at the api ref
12:37:09 <sean-k-mooney> https://docs.openstack.org/api-ref/resource-optimization/#show-action-plan
12:37:30 <sean-k-mooney> so efficacy_indicators is an array of indicators
12:38:00 <sean-k-mooney> and that can have multiple values which we did fix in watcher-dashboard to display properly
12:38:10 <dviroel> which was fixed in the ui too, I think it was amoralej that fixed it
12:38:36 <sean-k-mooney> yep
12:38:42 <amoralej> yes, but the problem here is that the list of metrics considered is configurable
12:38:54 <amoralej> and, iiuc, we have one deviation per-metric
12:38:55 <sean-k-mooney> thats ok
12:39:18 <amoralej> so it'd be deviation_before_cpu deviation_before_memory, etc... ?
12:39:24 <sean-k-mooney> in the api ref the indicators in efficacy_indicators are not part of the schema
12:39:27 <amoralej> is that how it works?
12:39:38 <sean-k-mooney> im not sure i think we need more info in the bug
12:39:46 <sean-k-mooney> specifically we need the raw api response
12:40:02 <amoralej> there is also a weight parameter for the metrics, so i assumed the different deviations were aggregated somehow
12:40:05 <sean-k-mooney> not how it's rendered in the client but what is actually being returned when there are multiple metrics
12:40:15 <amoralej> yes ^ that
12:40:28 <jgilaber> it simply stores the first deviation that is larger than the threshold
12:40:41 <jgilaber> https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L461
12:41:01 <amoralej> and does it do the optimization based only on the first metric that is above it?
12:41:03 <jgilaber> it iterates over the metrics and the first time one goes over the threshold it returns
12:41:58 <sean-k-mooney> that seems incorrect
12:42:20 <sean-k-mooney> unless metrics is a preferentially ordered list
12:42:36 <sean-k-mooney> so this is starting to grow out of the scope of a simple bug
12:42:40 <sean-k-mooney> and into a feature
12:42:45 <amoralej> ah, so the weight is only considered for the simulation, not for the initial deviation found https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L423
12:42:58 <amoralej> that's strange, tbh ...
12:43:19 <jgilaber> there is nothing in the metrics description that suggests it should be sorted by importance https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L107
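The behaviour described above (iterate over the configured metrics and return at the first deviation over its threshold, otherwise keep the default of 0.0) can be illustrated with a minimal Python sketch. This is not the Watcher code: the function names, the input shape and the use of a population standard deviation are assumptions made for illustration only. The second function shows the alternative raised in the discussion, computing one deviation per metric regardless of the threshold.

```python
import statistics


def first_deviation_over_threshold(host_loads, thresholds):
    """Return the first per-metric deviation that exceeds its threshold.

    host_loads: {metric_name: [normalized load per host]}   (hypothetical shape)
    thresholds: {metric_name: user-defined threshold}       (hypothetical shape)
    Returns 0.0 when no metric exceeds its threshold, mirroring the
    default value mentioned in the meeting.
    """
    for metric, limit in thresholds.items():
        # Assumption: population standard deviation across hosts.
        deviation = statistics.pstdev(host_loads[metric])
        if deviation > limit:
            # Stops at the first metric over its threshold; later metrics
            # are never evaluated.
            return deviation
    return 0.0


def all_deviations(host_loads, thresholds):
    """Return the deviation of every configured metric, threshold or not."""
    return {
        metric: statistics.pstdev(host_loads[metric])
        for metric in thresholds
    }


loads = {"cpu": [0.2, 0.4], "memory": [0.1, 0.9]}
limits = {"cpu": 0.2, "memory": 0.2}
reported = first_deviation_over_threshold(loads, limits)
per_metric = all_deviations(loads, limits)
```

With these sample values only the memory deviation is over its threshold, so the first function reports that one and drops the cpu deviation, while the second keeps both.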
12:43:55 <sean-k-mooney> jgilaber: i was ok with treating this as a bug if it was only an informational change
12:44:03 <sean-k-mooney> but if it's going to change the behavior of the strategy
12:44:09 <opendevreview> Merged openstack/watcher stable/2025.1: Drop sg_core prometheus related vars  https://review.opendev.org/c/openstack/watcher/+/946737
12:44:22 <sean-k-mooney> then i think this is creeping into a spec
12:44:27 <sean-k-mooney> or at least something
12:44:35 <sean-k-mooney> that needs more discussion than we can do right now
12:44:48 <sean-k-mooney> shall we loop back to this again next week
12:44:52 <sean-k-mooney> and think about it a bit more.
12:44:56 <dviroel> sure
12:45:01 <jgilaber> agreed, I did not intend to change the strategy behaviour with my bug, initially I just noticed the UX
12:45:02 <jgilaber> +1
12:45:17 <amoralej> it may be correct, but at least, i'd like to understand better how that works to drive expectations
12:45:27 <sean-k-mooney> jgilaber: if you can, use --debug on the openstack client and attach the raw api output to the bug, if you have time
12:45:38 <jgilaber> sure, I'll do that
12:46:03 <dviroel> thanks for raising that jgilaber
12:46:04 <jgilaber> sean-k-mooney: which output, the action plan?
12:46:19 <sean-k-mooney> the action plan show yes
12:46:30 <jgilaber> ack, will do after the mtg
12:46:30 <sean-k-mooney> i want to see if the api response and the cli output align
12:46:53 <sean-k-mooney> we may be truncating the output or rounding in the client
12:47:07 <sean-k-mooney> but in general i just want to see the actual response
12:47:14 <sean-k-mooney> shall we move on?
12:47:21 <dviroel> sure, ok, the next 2 bugs we already discussed at the ptg
12:47:31 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2106407 (Action Plans status is wrongly reported when Actions fail)
12:47:46 <dviroel> it was missing status
12:47:54 <amoralej> i set it as triaged
12:47:57 <dviroel> thanks amoralej
12:48:00 <amoralej> as we discussed it in ptg
12:48:06 <dviroel> next
12:48:06 <sean-k-mooney> yep and i agree with it being high also
12:48:10 <amoralej> i plan to work on it but didn't have the time for it
12:48:21 <dviroel> +1
12:48:22 <sean-k-mooney> ack
12:48:25 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2104220 (NovaClusterDataModelCollector cache generate wrong action plan)
12:48:42 <dviroel> we also discussed at the ptg
12:48:56 <sean-k-mooney> ya so we still need to confirm if we are looking at the wrong field
12:49:08 <sean-k-mooney> i.e. saving the source host instead of the dest
12:49:18 <dviroel> yes, we need to check the nova notification to see if that is working as expected
12:49:42 <sean-k-mooney> i just set this to high also
12:49:48 <dviroel> i was planning to validate in my env too
12:49:51 <sean-k-mooney> since this is the primary way we update the cache
12:50:03 <sean-k-mooney> that would be good if you have time
12:50:17 <sean-k-mooney> this may have been a regression in nova
12:50:31 <sean-k-mooney> i.e. we could have changed when that event got sent
12:50:36 <dviroel> so I will assign to myself for now, unless someone is already looking at it
12:50:56 <sean-k-mooney> we did do that a few releases ago but it's been long enough that if that was the cause we likely should just update watcher
12:51:33 <amoralej> to be clear, we'd totally fix the issue with models out of sync if we have nova notifications enabled (and we don't have a bug in the model update logic) ?
12:51:46 <sean-k-mooney> we have the notification enabled
12:51:53 <sean-k-mooney> so that's not the problem
12:52:16 <sean-k-mooney> the problem is the field we are updating from seems to have the source host not the destination
12:52:21 <sean-k-mooney> which is what we were expecting
12:52:45 <sean-k-mooney> so the bug is either in how we are parsing the notification and updating the model
12:52:55 <amoralej> i don't mean for that particular bug, but about the expected behavior of watcher. Having nova notifications enabled should ensure watcher is always correct?
12:52:59 <sean-k-mooney> or nova accidentally changed the behavior a few cycles ago and no one noticed
12:53:15 <sean-k-mooney> amoralej: in general yes
12:53:28 <sean-k-mooney> amoralej: that is the recommended way to deploy watcher
12:53:36 <amoralej> good
12:53:44 <sean-k-mooney> i say in general as there is a short interval
12:53:55 <sean-k-mooney> where we won't have processed the notification yet
12:54:05 <sean-k-mooney> but it's much smaller than relying only on the periodic
12:54:08 <amoralej> tbh, i had missed that update based on notifications ...
12:54:09 <amoralej> sure
12:54:21 <amoralej> that is much better
12:54:29 <sean-k-mooney> downstream we skipped enabling it because notifications are not supported in our new installer yet
12:54:46 <sean-k-mooney> specifically in nova
12:54:58 <sean-k-mooney> so we will also need to support that in our new installer once that gap is closed
12:55:00 <amoralej> from performance pov, enabling notifications, is it expensive?
12:55:06 <sean-k-mooney> devstack does it by default
12:55:15 <sean-k-mooney> kind of
12:55:18 <amoralej> ack
12:55:24 <sean-k-mooney> it puts a lot of extra load on rabbit
12:55:36 <sean-k-mooney> it's actually recommended to have a separate rabbit service just for notifications
12:55:37 <amoralej> no need to go into details now, we are almost out of time, but thanks for the clarification
12:55:40 <sean-k-mooney> but the bigger issue
12:55:44 <sean-k-mooney> is if there is no consumer
12:55:51 <sean-k-mooney> the rabbit queue builds forever
12:56:01 <sean-k-mooney> and just fills up ram
12:56:05 <dviroel> ++
12:56:09 <dviroel> ok, we don't have too much time to cover the next 2 bugs in the list
12:56:19 <dviroel> so moving them to the next meeting
12:56:28 <sean-k-mooney> i do have one to highlight
12:56:30 <sean-k-mooney> https://bugs.launchpad.net/watcher/+bug/2108855
12:56:43 <sean-k-mooney> this is a feature request not an actual bug
12:56:52 <dviroel> ack, sean-k-mooney i was reading through
12:57:03 <sean-k-mooney> unfortunately we did not discuss this in the ptg
12:57:29 <dviroel> we can recommend everybody to read this LP bug
12:57:32 <amoralej> ok, so it should not be too hard based on the proposed implementation in observabilityclient
12:57:36 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2108855 (Watcher should include keystone session when creating PrometheusAPICLient)
12:57:39 <sean-k-mooney> the tldr is the openstack telemetry team are proposing to add an auth reverse proxy for providing multi tenancy on top of prometheus
12:57:59 <sean-k-mooney> this is a non-trivial change even if the code is small
12:58:23 <sean-k-mooney> and normally this is a classic example of where a spec would be required because it has support implications for testing and upgrade
12:58:24 <amoralej> in case the session has admin role it will return data for all tenants?
12:58:45 <amoralej> (i hope so) ...
12:58:47 <sean-k-mooney> so that is one of the design questions we need to resolve
12:58:53 <sean-k-mooney> otherwise we should not support this
12:59:06 <sean-k-mooney> but yes i believe that is the intent
12:59:35 <dviroel> right, so we should bring back this topic to the next meeting
12:59:42 <sean-k-mooney> yes
12:59:49 <sean-k-mooney> lets reach out to jaromir
12:59:51 <dviroel> about next meeting
12:59:57 <sean-k-mooney> and see if they can attend next week
13:00:01 <dviroel> #topic chair next meetings
13:00:15 <dviroel> i will be out next week, due to holiday
13:00:23 <dviroel> not sure about others
13:00:29 <dviroel> we need someone to chair
13:00:30 <amoralej> next thursday/friday are local holiday here
13:00:58 <jgilaber> +1 next week is a holiday for me as well
13:01:04 <dviroel> yeah
13:01:24 <mtembo> It's a holiday for me too
13:01:30 <sean-k-mooney> ack
13:01:32 <dviroel> i will let rlandy decide about  cancelling or not
13:01:36 <sean-k-mooney> we can skip next week
13:01:38 <amoralej> maybe we should cancel it
13:01:40 <dviroel> but I think that we should skip
13:01:44 <sean-k-mooney> if we do not have quorum
13:01:47 <dviroel> ack
13:02:07 <rlandy> if enough people are out - yeah
13:02:08 <sean-k-mooney> dviroel: can you send a 2 line message to the list just declaring it skipped
13:02:22 <sean-k-mooney> i think we have 4 people that will not be here at least
13:02:27 <dviroel> #action dviroel to cancel next meeting (ML email)
13:02:31 <sean-k-mooney> so that's over half the normal attendees
13:02:32 <dviroel> ack
13:02:34 <opendevreview> Merged openstack/watcher stable/2025.1: Query by fqdn_label instead of instance for host metrics  https://review.opendev.org/c/openstack/watcher/+/946153
13:02:36 <opendevreview> Merged openstack/watcher stable/2025.1: Aggregate by fqdn label instead instance in host cpu metrics  https://review.opendev.org/c/openstack/watcher/+/946732
13:02:38 <opendevreview> Merged openstack/watcher stable/2024.2: Replace deprecated LegacyEngineFacade  https://review.opendev.org/c/openstack/watcher/+/942909
13:02:39 <dviroel> we are out of time
13:02:39 <opendevreview> Merged openstack/watcher stable/2024.2: Further database refactoring  https://review.opendev.org/c/openstack/watcher/+/942910
13:02:48 <dviroel> thanks for joining all
13:02:58 <dviroel> #endmeeting