12:00:33 <dviroel> #startmeeting watcher
12:00:33 <opendevmeet> Meeting started Thu Apr 24 12:00:33 2025 UTC and is due to finish in 60 minutes. The chair is dviroel. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:33 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:33 <opendevmeet> The meeting name has been set to 'watcher'
12:01:02 <dviroel> hi o/ - who's around today?
12:01:10 <rlandy> o/
12:01:16 <mtembo> Hello o/
12:01:23 <jgilaber> hello o/
12:02:59 <dviroel> here is our meeting agenda for today
12:03:06 <dviroel> #link https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L33 (Meeting agenda)
12:03:18 <amoralej> o/
12:03:23 <dviroel> please, feel free to add your own topics to the agenda
12:03:49 <dviroel> there is a topic to place your changes that require attention from reviewers
12:04:13 <dviroel> there is also a topic for bugs, if you want to discuss any, please add it to the end of the list too
12:05:10 <dviroel> #topic Courtesy ping
12:05:15 <dviroel> I added this one
12:05:30 <dviroel> just want to propose the courtesy ping list idea
12:05:47 <dviroel> it has been part of the manila meetings for a long time already, and imho it is useful
12:06:10 <dviroel> we keep a list of irc nicks, at the top of the schedule, of people that want to receive a courtesy ping when the meeting starts, as a reminder
12:06:46 <dviroel> if you want to receive the ping, just add your nick there, you can also remove it anytime
12:07:18 <dviroel> the chair only needs to copy and paste the list when the irc meeting starts
12:07:36 <jgilaber> +1 sounds useful
12:07:53 <dviroel> so you don't miss anything :)
12:09:36 <dviroel> alright then, people can just add/remove their nicks as they want there
12:09:38 <dviroel> tks
12:09:40 <dviroel> next one
12:09:57 <dviroel> #topic Reviews that need attention
12:10:18 <dviroel> the first 2 are specs ready to review
12:10:35 <dviroel> which I already brought up in the last meeting
12:10:42 <dviroel> #link https://review.opendev.org/c/openstack/watcher-specs/+/943873 (disable cold/live migration for host maintenance strategy)
12:10:57 <dviroel> for this one I also need to get back and review the latest patch sets
12:11:15 <dviroel> we already discussed it in the ptg too
12:11:17 <dviroel> pls take a look
12:11:29 <dviroel> #link https://review.opendev.org/c/openstack/watcher-specs/+/947282 (Adds spec for extend compute model attributes)
12:12:08 <dviroel> I recently added this spec, to start discussing how we can extend the compute model attributes, and how to use this information to improve our strategies
12:12:22 <dviroel> i would like to receive some feedback there too
12:12:27 <amoralej> wrt disable cold/live, i was thinking it'd probably be nice for other strategies too, so it'd be nice if it can be implemented in a reusable way
12:13:23 <dviroel> right, probably something that we could consider, depending on the strategy
12:14:07 <dviroel> most of the strategies implement their own decision on that, which is usually checking the status of the instance
12:14:22 <dviroel> to decide between live and cold migration
12:15:28 <dviroel> we can discuss more about that in the spec, or even in the gerrit change proposed
12:15:35 <amoralej> yep
12:15:57 <dviroel> ++
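For context on the live-vs-cold point above, a minimal sketch of the pattern dviroel describes (checking the instance state before choosing the migration type); the function and the returned values are illustrative assumptions, not Watcher's actual strategy code.

```python
# Illustrative sketch only -- not Watcher's actual strategy code. It shows the
# pattern mentioned above: strategies typically look at the instance state to
# decide between a live and a cold migration.
def choose_migration_type(instance_state):
    """Running instances are live-migrated to avoid downtime; anything else
    (stopped, shelved, ...) can only be cold-migrated."""
    if instance_state and instance_state.lower() == "active":
        return "live"
    return "cold"

# Example usage with hypothetical instance states:
assert choose_migration_type("ACTIVE") == "live"
assert choose_migration_type("SHUTOFF") == "cold"
```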
12:16:13 <dviroel> ok, next in the list
12:16:26 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/945331 (Make prometheus the default devstack example)
12:17:05 <dviroel> both depends-on merged already
12:17:10 <jgilaber> this change is ready for review, it adds devstack local.conf samples to deploy with prometheus as datasource
12:17:32 <jgilaber> it also keeps the gnocchi samples that we have currently
12:18:02 <dviroel> nice, thanks for updating the docs too
12:18:08 <dviroel> I will take a look after the meeting
12:18:27 <jgilaber> thanks dviroel
12:18:46 <dviroel> ping sean-k-mooney to revisit it too
12:19:11 <sean-k-mooney> o/
12:19:34 * dviroel sean-k-mooney o/
12:19:34 <sean-k-mooney> yep i'll try and look at that again
12:19:41 <dviroel> tks
12:20:20 <dviroel> next
12:20:34 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/946153 (Query by fqdn_label instead of instance for host metrics)
12:20:57 <dviroel> this is a backport of an important fix
12:21:11 <dviroel> but it depends on the other one in the chain
12:21:25 <dviroel> which is:
12:21:34 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/946737/ (Drop sg_core prometheus related vars)
12:21:54 <dviroel> sean-k-mooney: ^ pls, this is a small backport, for ci
12:22:02 <sean-k-mooney> so that's approved
12:22:03 <dviroel> when you have some time
12:22:10 <sean-k-mooney> oh the second one
12:22:14 <sean-k-mooney> sure i'll take a look
12:22:21 <sean-k-mooney> https://review.opendev.org/c/openstack/watcher/+/946153
12:22:33 <sean-k-mooney> i assume it is just sitting in the zuul gate pipeline, right
12:22:45 <dviroel> no, there is a relation chain
12:22:57 <sean-k-mooney> oh right
12:23:09 <sean-k-mooney> ok ya
12:23:14 <amoralej> https://review.opendev.org/c/openstack/watcher/+/946737/ is required to unblock the ci, that's why the rest are rebased on it
12:23:45 <sean-k-mooney> ack i have approved that now
12:23:49 <dviroel> and the 3rd one in the chain will be
12:23:52 <amoralej> thanks
12:23:59 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/946732 (Aggregate by fqdn label instead instance in host cpu metrics)
12:24:13 <sean-k-mooney> ya we should merge all 3 together
12:24:14 <dviroel> which is a follow up on the second one
12:24:33 <dviroel> ack, thanks amoralej for proposing them
12:24:57 <amoralej> I can propose a bugfix release once we have the three merged
12:25:27 <sean-k-mooney> we could
12:25:42 <sean-k-mooney> the more important release is the final bobcat release
12:25:47 <sean-k-mooney> that should happen this week
12:25:58 <sean-k-mooney> but we can do a release of all stable branches this/next week
12:26:13 <dviroel> +1
12:26:18 <amoralej> +1
12:26:52 <dviroel> ok, does anyone want to bring up any other review?
12:27:18 <dviroel> #topic Bug Triage
12:27:39 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2107467 (workload_stabilization strategy does not show standard_deviation if it's below the audit threshold)
12:27:54 <dviroel> jgilaber o/
12:28:05 <jgilaber> I wanted to get some thoughts on this bug
12:28:10 <jgilaber> to me it seems like a UX bug
12:28:37 <sean-k-mooney> it's mostly cosmetic so i would mark it as low importance but it's definitely valid
12:28:52 <sean-k-mooney> you have also been working on fixing this already
12:29:06 <sean-k-mooney> right, this is related to how we store it in the db?
12:29:11 <jgilaber> it's not quite the same
12:29:26 <sean-k-mooney> oh, how is this different?
12:29:31 <jgilaber> I found this bug when testing my fix for the db
12:30:05 <jgilaber> the standard deviation is only stored if it is larger than the user-defined threshold for any of the metrics
12:30:10 <jgilaber> it comes from https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L472
12:30:24 <jgilaber> see the if in line 465
12:30:34 <jgilaber> if it's not, then the default value of 0.0 is stored
12:30:35 <sean-k-mooney> you could argue that it's expected
12:30:51 <sean-k-mooney> since it did not meet the specified threshold it was "skipped"
12:30:55 <jgilaber> yes, it's somewhat ambiguous what to expect, that's why I wanted to bring it up
12:30:55 <sean-k-mooney> with that said
12:31:06 <amoralej> even if it's below the threshold, i'd say it should be displayed
12:31:13 <sean-k-mooney> i think long term we will want to enhance watcher to emit notifications for audits
12:31:25 <sean-k-mooney> and we will want to include the efficacy indicators
12:31:40 <jgilaber> on the one hand I would expect it to show the deviation calculated even if below the threshold
12:31:40 <sean-k-mooney> so i think it would be ok to change the behavior to always store the calculated values
12:31:46 <amoralej> it may be useful to track trends over time, i.e.
12:31:49 <amoralej> +1
12:31:58 <dviroel> hum, if it is below the expected value, there is no optimization to be done, which means that there are no efficacy indicators?
12:32:23 <sean-k-mooney> to me this is a precondition failure
12:32:34 <sean-k-mooney> it did not meet the minimum required threshold
12:32:48 <sean-k-mooney> so the original authors chose not to store the values
12:33:02 <sean-k-mooney> but i get the ux side that jgilaber is raising
12:33:11 <sean-k-mooney> and i agree it's both confusing and ambiguous
12:33:31 <sean-k-mooney> so i think we can set this to triaged and low
12:33:44 <dviroel> i agree that it is useful information too, to be displayed
12:33:46 <sean-k-mooney> and then fix it when we have time, unless others object
12:34:01 <jgilaber> there is another complication if we decide to change it: what to do when there is more than one metric, display the largest deviation?
12:34:46 <sean-k-mooney> we need to display all of them
12:34:55 <sean-k-mooney> i think we already modified the dashboard to do that
12:35:09 <sean-k-mooney> so to be clear it would not be ok to change the response format
12:35:24 <sean-k-mooney> we can save the calculated value instead of 0.0
12:35:36 <amoralej> does it manage the thresholds independently if it uses two metrics? (e.g. cpu and memory)
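A simplified sketch of the behaviour jgilaber describes above (not the actual code behind the workload_stabilization.py links): each metric is checked independently against its own threshold, and anything below it keeps the 0.0 default that later shows up in the efficacy indicator. The metric names and threshold values here are example assumptions.

```python
# Simplified sketch of the check discussed above -- not the actual
# workload_stabilization code. Each metric is compared independently against
# its own threshold; a deviation below the threshold is never stored, so the
# efficacy indicator keeps its 0.0 default (the UX issue in bug 2107467).
thresholds = {"instance_cpu_usage": 0.2, "instance_ram_usage": 0.3}  # example values

def stored_deviation(metric, standard_deviation):
    if standard_deviation > thresholds[metric]:
        return standard_deviation  # stored, and later shown to the user
    return 0.0                     # below threshold: the default value stays

print(stored_deviation("instance_cpu_usage", 0.35))  # 0.35
print(stored_deviation("instance_ram_usage", 0.10))  # 0.0 -> looks like no deviation at all
```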
12:35:41 <sean-k-mooney> but we cannot add or remove fields or change the overall response as that would require a new api microversion
12:35:45 <sean-k-mooney> and therefore a spec
12:36:04 <jgilaber> I don't think we can store more than one value in an efficacy indicator
12:36:21 <jgilaber> we would need to add additional ones
12:36:28 <sean-k-mooney> it's a list i believe
12:36:43 <sean-k-mooney> let's look at the api ref
12:37:09 <sean-k-mooney> https://docs.openstack.org/api-ref/resource-optimization/#show-action-plan
12:37:30 <sean-k-mooney> so efficacy_indicators is an array of indicators
12:38:00 <sean-k-mooney> and that can have multiple values which we did fix in watcher-dashboard to display properly
12:38:10 <dviroel> which was fixed in the ui too, I think it was amoralej that fixed it
12:38:36 <sean-k-mooney> yep
12:38:42 <amoralej> yes, but the problem here is that the list of metrics considered is configurable
12:38:54 <amoralej> and, iiuc, we have one deviation per metric
12:38:55 <sean-k-mooney> that's ok
12:39:18 <amoralej> so it'd be deviation_before_cpu, deviation_before_memory, etc...?
12:39:24 <sean-k-mooney> in the api ref the indicators in efficacy_indicators are not part of the schema
12:39:27 <amoralej> is that how it works?
12:39:38 <sean-k-mooney> i'm not sure, i think we need more info in the bug
12:39:46 <sean-k-mooney> specifically we need the raw api response
12:40:02 <amoralej> there is also a weight parameter for the metrics, so i assumed the different deviations were aggregated somehow
12:40:05 <sean-k-mooney> not how it's rendered in the client but what is actually being returned when there are multiple metrics
12:40:15 <amoralej> yes ^ that
12:40:28 <jgilaber> it simply stores the first deviation that is larger than the threshold
12:40:41 <jgilaber> https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L461
12:41:01 <amoralej> and does it do the optimization based only on the first metric that is above it?
12:41:03 <jgilaber> it iterates over the metrics and the first time one goes over the threshold it returns
12:41:58 <sean-k-mooney> that seems incorrect
12:42:20 <sean-k-mooney> unless metrics is a preferentially ordered list
12:42:36 <sean-k-mooney> so this is starting to grow out of the scope of a simple bug
12:42:40 <sean-k-mooney> and into a feature
12:42:45 <amoralej> ah, so the weight is only considered for the simulation, not for the initial deviation found https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L423
12:42:58 <amoralej> that's strange, tbh ...
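For reference on the api-ref point above, the raw action-plan response sean-k-mooney is asking for would carry an efficacy_indicators array roughly like the excerpt below; the indicator names, units and values are assumptions for illustration, not the confirmed schema.

```python
# Illustrative excerpt of a "show action plan" response. The indicator names,
# units and values are assumptions, not data from a real deployment -- the
# point is only that efficacy_indicators is an array, so several per-metric
# values could in principle be reported side by side.
action_plan = {
    "uuid": "00000000-0000-0000-0000-000000000000",
    "state": "RECOMMENDED",
    "efficacy_indicators": [
        {"name": "standard_deviation_before_audit", "unit": None, "value": 0.0},
        {"name": "standard_deviation_after_audit", "unit": None, "value": 0.0},
        {"name": "instance_migrations_count", "unit": None, "value": 2},
    ],
}
```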
12:43:19 <jgilaber> there is nothing in the metrics description that suggests it should be sorted by importance https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L107
12:43:55 <sean-k-mooney> jgilaber: i was ok with treating this as a bug if it was only an informational change
12:44:03 <sean-k-mooney> but if it's going to change the behavior of the strategy
12:44:09 <opendevreview> Merged openstack/watcher stable/2025.1: Drop sg_core prometheus related vars https://review.opendev.org/c/openstack/watcher/+/946737
12:44:22 <sean-k-mooney> then i think this is creeping into a spec
12:44:27 <sean-k-mooney> or at least something
12:44:35 <sean-k-mooney> that needs more discussion than we can do right now
12:44:48 <sean-k-mooney> shall we loop back to this again next week
12:44:52 <sean-k-mooney> and think about it a bit more.
12:44:56 <dviroel> sure
12:45:01 <jgilaber> agreed, I did not intend to change the strategy behaviour with my bug, initially I just noticed the UX
12:45:02 <jgilaber> +1
12:45:17 <amoralej> it may be correct, but at least i'd like to understand better how that works to drive expectations
12:45:27 <sean-k-mooney> jgilaber: if you can, use --debug on the openstack client to attach the raw api output to the bug if you have time
12:45:38 <jgilaber> sure, I'll do that
12:46:03 <dviroel> thanks for raising that jgilaber
12:46:04 <jgilaber> sean-k-mooney: which output, the action plan?
12:46:19 <sean-k-mooney> the action plan show, yes
12:46:30 <jgilaber> ack, will do after the mtg
12:46:30 <sean-k-mooney> i want to see if the api response and the cli output align
12:46:53 <sean-k-mooney> we may be truncating the output or rounding in the client
12:47:07 <sean-k-mooney> but in general i just want to see the actual response
12:47:14 <sean-k-mooney> shall we move on?
12:47:21 <dviroel> sure, ok, the next 2 bugs we already discussed at the ptg
12:47:31 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2106407 (Action Plans status is wrongly reported when Actions fail)
12:47:46 <dviroel> it was missing a status
12:47:54 <amoralej> i set it as triaged
12:47:57 <dviroel> thanks amoralej
12:48:00 <amoralej> as we discussed it in the ptg
12:48:06 <dviroel> next
12:48:06 <sean-k-mooney> yep and i agree with it being high also
12:48:10 <amoralej> i plan to work on it but didn't have the time for it
12:48:21 <dviroel> +1
12:48:22 <sean-k-mooney> ack
12:48:25 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2104220 (NovaClusterDataModelCollector cache generate wrong action plan)
12:48:42 <dviroel> we also discussed this at the ptg
12:48:56 <sean-k-mooney> ya so we still need to confirm if we are looking at the wrong field
12:49:08 <sean-k-mooney> i.e. saving the source host instead of the dest
12:49:18 <dviroel> yes, we need to check the nova notification to see if that is working as expected
12:49:42 <sean-k-mooney> i just set this to high also
12:49:48 <dviroel> i was planning to validate in my env too
12:49:51 <sean-k-mooney> since this is the primary way we update the cache
12:50:03 <sean-k-mooney> that would be good if you have time
12:50:17 <sean-k-mooney> this may have been a regression in nova
12:50:31 <sean-k-mooney> i.e. we could have changed when that event got sent
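A rough sketch of where the suspected mix-up could come from; this is not Watcher's actual notification handler, and the payload field names and model methods are stand-ins used only to illustrate the source-vs-destination question raised above.

```python
# Rough sketch only -- not Watcher's NovaClusterDataModelCollector/notification
# code; field names and model methods are stand-ins. It shows why reading the
# wrong field from a migration notification would leave the cached model
# pointing at the source host and later produce wrong action plans.
def handle_migration_end(model, payload):
    instance = model.get_instance_by_uuid(payload["instance_uuid"])
    # Expected: update the model with the *destination* host.
    destination = payload["host"]
    # Reading a source-host field here instead would reproduce the suspected bug:
    # wrong = payload["source_host"]
    model.map_instance_to_node(instance, destination)
```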
12:50:36 <dviroel> so I will assign it to myself for now, unless someone is already looking at it
12:50:56 <sean-k-mooney> we did do that a few releases ago but it's been long enough that if that was the cause we likely should just update watcher
12:51:33 <amoralej> to be clear, would we totally fix the issue with models being out of sync if we have nova notifications enabled (and we don't have a bug in the model update logic)?
12:51:46 <sean-k-mooney> we have the notifications enabled
12:51:53 <sean-k-mooney> so that's not the problem
12:52:16 <sean-k-mooney> the problem is the field we are updating from seems to have the source host not the destination
12:52:21 <sean-k-mooney> which is what we were expecting
12:52:45 <sean-k-mooney> so the bug is either in how we are parsing the notification and updating the model
12:52:55 <amoralej> i don't mean for that particular bug, but about the expected behavior of watcher. Having nova notifications enabled should ensure watcher is always correct?
12:52:59 <sean-k-mooney> or nova accidentally changed the behavior a few cycles ago and no one noticed
12:53:15 <sean-k-mooney> amoralej: in general yes
12:53:28 <sean-k-mooney> amoralej: that is the recommended way to deploy watcher
12:53:36 <amoralej> good
12:53:44 <sean-k-mooney> i say in general as there is a short interval
12:53:55 <sean-k-mooney> where we won't have processed the notification yet
12:54:05 <sean-k-mooney> but it's much smaller than relying only on the periodic sync
12:54:08 <amoralej> tbh, i had missed that it updates based on notifications ...
12:54:09 <amoralej> sure
12:54:21 <amoralej> that is much better
12:54:29 <sean-k-mooney> downstream we skipped enabling it because notifications are not supported in our new installer yet
12:54:46 <sean-k-mooney> specifically in nova
12:54:58 <sean-k-mooney> so we will also need to support that in our new installer once that gap is closed
12:55:00 <amoralej> from a performance pov, is enabling notifications expensive?
12:55:06 <sean-k-mooney> devstack does it by default
12:55:15 <sean-k-mooney> kind of
12:55:18 <amoralej> ack
12:55:24 <sean-k-mooney> it puts a lot of extra load on rabbit
12:55:36 <sean-k-mooney> it's actually recommended to have a separate rabbit service just for notifications
12:55:37 <amoralej> no need to go into details now, we are almost out of time, but thanks for the clarification
12:55:40 <sean-k-mooney> but the bigger issue
12:55:44 <sean-k-mooney> is if there is no consumer
12:55:51 <sean-k-mooney> the rabbit queue builds forever
12:56:01 <sean-k-mooney> and just fills up ram
12:56:05 <dviroel> ++
12:56:09 <dviroel> ok, we don't have too much time to cover the next 2 bugs in the list
12:56:19 <dviroel> so moving them to the next meeting
12:56:28 <sean-k-mooney> i do have one to highlight
12:56:30 <sean-k-mooney> https://bugs.launchpad.net/watcher/+bug/2108855
12:56:43 <sean-k-mooney> this is a feature request not an actual bug
12:56:52 <dviroel> ack, sean-k-mooney, i was reading through it
12:57:03 <sean-k-mooney> unfortunately we did not discuss this in the ptg
12:57:29 <dviroel> we can recommend that everybody read this LP bug
12:57:32 <amoralej> ok, so it should not be too hard based on the proposed implementation in observabilityclient
12:57:36 <dviroel> #link https://bugs.launchpad.net/watcher/+bug/2108855 (Watcher should include keystone session when creating PrometheusAPIClient)
12:57:39 <sean-k-mooney> the tldr is the openstack telemetry team are proposing to add an auth reverse proxy for providing multi-tenancy on top of prometheus
12:57:59 <sean-k-mooney> this is a non-trivial change even if the code is small
12:58:23 <sean-k-mooney> and normally this is a classic example of where a spec would be required because it has support implications for testing and upgrade
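For readers catching up on bug 2108855, a rough sketch of the idea: build a keystone session and hand it to the Prometheus client so the auth reverse proxy can enforce per-tenant scoping. The keystoneauth1 usage below is standard; the commented-out client call is a hypothetical placeholder, since the actual interface is exactly what the bug/spec discussion would define.

```python
# Rough sketch of the idea in bug 2108855 -- not a confirmed interface.
from keystoneauth1.identity import v3
from keystoneauth1 import session as ks_session

auth = v3.Password(
    auth_url="https://keystone.example.com/v3",  # hypothetical endpoint
    username="watcher",
    password="secret",
    project_name="service",
    user_domain_name="Default",
    project_domain_name="Default",
)
sess = ks_session.Session(auth=auth)

# Hypothetical placeholder: pass the session when creating the Prometheus client,
# instead of creating it without any keystone auth as today.
# client = PrometheusAPIClient("prometheus.example.com:9090", session=sess)
```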
12:58:24 <amoralej> in case the session has the admin role will it return data for all tenants?
12:58:45 <amoralej> (i hope so) ...
12:58:47 <sean-k-mooney> so that is one of the design questions we need to resolve
12:58:53 <sean-k-mooney> otherwise we should not support this
12:59:06 <sean-k-mooney> but yes i believe that is the intent
12:59:35 <dviroel> right, so we should bring this topic back to the next meeting
12:59:42 <sean-k-mooney> yes
12:59:49 <sean-k-mooney> let's reach out to jaromir
12:59:51 <dviroel> about the next meeting
12:59:57 <sean-k-mooney> and see if they can attend next week
13:00:01 <dviroel> #topic chair next meetings
13:00:15 <dviroel> i will be out next week, due to a holiday
13:00:23 <dviroel> not sure about others
13:00:29 <dviroel> we need someone to chair
13:00:30 <amoralej> next thursday/friday are local holidays here
13:00:58 <jgilaber> +1 next week is a holiday for me as well
13:01:04 <dviroel> yeah
13:01:24 <mtembo> It's a holiday for me too
13:01:30 <sean-k-mooney> ack
13:01:32 <dviroel> i will let rlandy decide about cancelling or not
13:01:36 <sean-k-mooney> we can skip next week
13:01:38 <amoralej> maybe we should cancel it
13:01:40 <dviroel> but I think that we should skip
13:01:44 <sean-k-mooney> if we do not have quorum
13:01:47 <dviroel> ack
13:02:07 <rlandy> if enough people are out - yeah
13:02:08 <sean-k-mooney> dviroel: can you send a 2-line message to the list just declaring it skipped
13:02:22 <sean-k-mooney> i think we have at least 4 people that will not be here
13:02:27 <dviroel> #action dviroel to cancel next meeting (ML email)
13:02:31 <sean-k-mooney> so that's over half the normal attendees
13:02:32 <dviroel> ack
13:02:34 <opendevreview> Merged openstack/watcher stable/2025.1: Query by fqdn_label instead of instance for host metrics https://review.opendev.org/c/openstack/watcher/+/946153
13:02:36 <opendevreview> Merged openstack/watcher stable/2025.1: Aggregate by fqdn label instead instance in host cpu metrics https://review.opendev.org/c/openstack/watcher/+/946732
13:02:38 <opendevreview> Merged openstack/watcher stable/2024.2: Replace deprecated LegacyEngineFacade https://review.opendev.org/c/openstack/watcher/+/942909
13:02:39 <dviroel> we are out of time
13:02:39 <opendevreview> Merged openstack/watcher stable/2024.2: Further database refactoring https://review.opendev.org/c/openstack/watcher/+/942910
13:02:48 <dviroel> thanks for joining, all
13:02:58 <dviroel> #endmeeting