12:00:33 #startmeeting watcher
12:00:33 Meeting started Thu Apr 24 12:00:33 2025 UTC and is due to finish in 60 minutes. The chair is dviroel. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:33 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:33 The meeting name has been set to 'watcher'
12:01:02 hi o/ - who's around today?
12:01:10 o/
12:01:16 Hello o/
12:01:23 hello o/
12:02:59 here is our meeting agenda for today
12:03:06 #link https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L33 (Meeting agenda)
12:03:18 o/
12:03:23 please, feel free to add your own topics to the agenda
12:03:49 there is a topic where you can add your changes that require attention from reviewers
12:04:13 there is also a topic for bugs; if you want to discuss any, please add them to the end of the list too
12:05:10 #topic Courtesy ping
12:05:15 I added this one
12:05:30 just want to propose the courtesy ping list idea
12:05:47 it has been part of the manila meetings for a long time already, and imho it is useful
12:06:10 we keep a list of IRC nicks, at the top of the schedule, of people who want to receive a courtesy ping when the meeting starts, as a reminder
12:06:46 if you want to receive the ping, just add your nick there; you can also remove it anytime
12:07:18 the chair will only need to copy and paste the list when the IRC meeting starts
12:07:36 +1 sounds useful
12:07:53 so you don't miss anything :)
12:09:36 alright then, people can just add/remove their nicks there as they want
12:09:38 tks
12:09:40 next one
12:09:57 #topic Reviews that need attention
12:10:18 the first 2 are specs ready for review
12:10:35 which I already brought up in the last meeting
12:10:42 #link https://review.opendev.org/c/openstack/watcher-specs/+/943873 (disable cold/live migration for host maintenance strategy)
12:10:57 this one I also need to get back to and review the latest patch sets
12:11:15 we already discussed it in the PTG too
12:11:17 pls take a look
12:11:29 #link https://review.opendev.org/c/openstack/watcher-specs/+/947282 (Adds spec for extend compute model attributes)
12:12:08 I recently added this spec, to start discussing how we can extend the compute model attributes, and how to use this information to improve our strategies
12:12:22 I would like to receive some feedback there too
12:12:27 wrt disable cold/live, I was thinking it'd probably be useful for other strategies too, so it'd be nice if it can be implemented in a reusable way
12:13:23 right, probably something that we could consider, depending on the strategy
12:14:07 most of the strategies implement their own decision on that, which is usually to check the status of the instance
12:14:22 to decide between live and cold migration
12:15:28 we can discuss that further in the spec, or even in the proposed gerrit change
12:15:35 yep
12:15:57 ++
12:16:13 ok, next in the list
12:16:26 #link https://review.opendev.org/c/openstack/watcher/+/945331 (Make prometheus the default devstack example)
12:17:05 both depends-on changes merged already
12:17:10 this change is ready for review; it adds devstack local.conf samples to deploy with prometheus as the datasource
12:17:32 it also keeps the gnocchi samples that we currently have
12:18:02 nice, thanks for updating the docs too
12:18:08 I will take a look after the meeting
12:18:27 thanks dviroel
12:18:46 ping sean-k-mooney to revisit it too
12:19:11 o/
12:19:34 * dviroel sean-k-mooney o/
12:19:34 yep, I'll try to look at that again
12:19:41 tks
12:20:20 next
12:20:34 #link https://review.opendev.org/c/openstack/watcher/+/946153 (Query by fqdn_label instead of instance for host metrics)
12:20:57 this is a backport of an important fix
12:21:11 but it depends on the other one in the chain
12:21:25 which is:
12:21:34 #link https://review.opendev.org/c/openstack/watcher/+/946737/ (Drop sg_core prometheus related vars)
12:21:54 sean-k-mooney: ^ pls, this is a small backport, for ci
12:22:02 so that's approved
12:22:03 when you have some time
12:22:10 oh, the second one
12:22:14 sure, I'll take a look
12:22:21 https://review.opendev.org/c/openstack/watcher/+/946153
12:22:33 I assume it is just sitting in the zuul gate pipeline, right?
12:22:45 no, there is a relation chain
12:22:57 oh right
12:23:09 ok, ya
12:23:14 https://review.opendev.org/c/openstack/watcher/+/946737/ is required to unblock the ci, that's why the rest are rebased on it
12:23:45 ack, I have approved that now
12:23:49 and the 3rd one in the chain will be
12:23:52 thanks
12:23:59 #link https://review.opendev.org/c/openstack/watcher/+/946732 (Aggregate by fqdn label instead instance in host cpu metrics)
12:24:13 ya, we should merge all 3 together
12:24:14 which is a follow-up on the second one
12:24:33 ack, thanks amoralej for proposing them
12:24:57 I can propose a bug release once we have the three merged
12:25:27 we could
12:25:42 the more important release is the final bobcat release
12:25:47 that should happen this week
12:25:58 but we can do a release of all stable branches this/next week
12:26:13 +1
12:26:18 +1
12:26:52 ok, anyone want to bring up any other review?
12:27:18 #topic Bug Triage
12:27:39 #link https://bugs.launchpad.net/watcher/+bug/2107467 (workload_stabilization strategy does not show standard_deviation if it's below the audit threshold)
12:27:54 jgilaber o/
12:28:05 I wanted to get some thoughts on this bug
12:28:10 to me it seems a UX bug
12:28:37 it's mostly cosmetic, so I would mark it as low importance, but it's definitely valid
12:28:52 you have also been working on fixing this already
12:29:06 right, this is related to how we store it in the db?
12:29:11 it's not quite the same
12:29:26 oh, how is this different?
12:29:31 I found this bug when testing my fix for the db
12:30:05 the standard deviation is only stored if it is larger than the user-defined threshold for any of the metrics
12:30:10 it comes from https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L472
12:30:24 see the if in line 465
12:30:34 if it's not, then the default value of 0.0 is stored
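A minimal sketch of the behaviour being described, not Watcher's actual code: the strategy computes a standard deviation per metric, but it is only persisted as the efficacy indicator when it exceeds the user-supplied threshold; otherwise the 0.0 default is what gets stored, and what the client later displays.

    def indicator_value(std_dev, threshold):
        # illustrative only: the computed deviation is kept only when it
        # crosses the user-defined threshold; below it, the 0.0 default
        # is stored, which is the UX issue raised in bug 2107467
        if std_dev > threshold:
            return std_dev
        return 0.0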
12:30:35 you could argue that it's expected
12:30:51 since it did not meet the specified threshold, it was "skipped"
12:30:55 yes, it's somewhat ambiguous what to expect, that's why I wanted to bring it up
12:30:55 with that said
12:31:06 even if it's below the threshold, I'd say it should be displayed
12:31:13 I think long term we will want to enhance watcher to emit notifications for audits
12:31:25 and we will want to include the efficacy indicators
12:31:40 on the one hand, I would expect it to show the deviation calculated even if below the threshold
12:31:40 so I think it would be ok to change the behavior to always store the calculated values
12:31:46 it may be useful to track trends over time, for example
12:31:49 +1
12:31:58 hum, if it is below the expected value, there is no optimization to be done, which means that there are no efficacy indicators?
12:32:23 to me this is a precondition failure
12:32:34 it did not meet the minimum required threshold
12:32:48 so the original authors chose not to store the values
12:33:02 but I get the UX side that jgilaber is raising
12:33:11 and I agree it's both confusing and ambiguous
12:33:31 so I think we can set this to triaged and low
12:33:44 I agree that it is useful information too, to be displayed
12:33:46 and then fix it when we have time, unless others object
12:34:01 there is another complication if we decide to change it: what to do when there is more than one metric, display the largest deviation?
12:34:46 we need to display all of them
12:34:55 I think we already modified the dashboard to do that
12:35:09 so to be clear, it would not be ok to change the response format
12:35:24 we can save the calculated value instead of 0.0
12:35:36 does it manage the thresholds independently if it uses two metrics? (cpu and memory, e.g.)
12:35:41 but we cannot add or remove fields or change the overall response, as that would require a new api microversion
12:35:45 and therefore a spec
12:36:04 I don't think we can store more than one value in an efficacy indicator
12:36:21 we would need to add additional ones
12:36:28 it's a list, I believe
12:36:43 let's look at the api ref
12:37:09 https://docs.openstack.org/api-ref/resource-optimization/#show-action-plan
12:37:30 so efficacy_indicators is an array of indicators
12:38:00 and that can have multiple values, which we did fix in watcher-dashboard to display properly
12:38:10 which was fixed in the UI too, I think it was amoralej that fixed it
12:38:36 yep
12:38:42 yes, but the problem here is that the list of metrics considered is configurable
12:38:54 and, iiuc, we have one deviation per metric
12:38:55 that's ok
12:39:18 so it'd be deviation_before_cpu, deviation_before_memory, etc...?
12:39:24 in the api ref, the indicators in efficacy_indicators are not part of the schema
12:39:27 is that how it works?
12:39:38 I'm not sure, I think we need more info in the bug
12:39:46 specifically, we need the raw api response
12:40:02 there is also a weight parameter for the metrics, so I assumed the different deviations were aggregated somehow
12:40:05 not how it's rendered in the client, but what is actually being returned when there are multiple metrics
12:40:15 yes ^ that
12:40:28 it simply stores the first deviation that is larger than the threshold
12:40:41 https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L461
12:41:01 and does the optimization based only on the first metric that is above it?
12:41:03 it iterates over the metrics, and the first time one goes over the threshold it returns
12:41:58 that seems incorrect
12:42:20 unless metrics is a preferentially ordered list
12:42:36 so this is starting to grow out of the scope of a simple bug
12:42:40 and into a feature
12:42:45 ah, so the weight is only considered for the simulation, not for the initial deviation found https://github.com/openstack/watcher/blob/c4acce91d6bb87b4ab865bc8e4d442a148dba1d5/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L423
12:42:58 that's strange, tbh ...
12:43:19 there is nothing in the metrics description that suggests it should be sorted by importance https://github.com/openstack/watcher/blob/master/watcher/decision_engine/strategy/strategies/workload_stabilization.py#L107
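A condensed sketch of the detection behaviour being questioned, with illustrative names rather than Watcher's actual signatures: metrics are visited in the order they appear in the strategy's input, the first deviation over its threshold wins, and the per-metric weight plays no part at this stage (it is only applied later, when simulated migrations are scored).

    def first_breach(metrics, deviations, thresholds):
        # metrics are scanned in configuration order, not by importance
        # or weight; the first deviation over its threshold
        # short-circuits the scan, so later metrics are never inspected
        for metric in metrics:
            if deviations[metric] >= thresholds[metric]:
                return metric
        return None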
12:43:55 jgilaber: I was ok with treating this as a bug if it was only an informational change
12:44:03 but if it's going to change the behavior of the strategy
12:44:09 Merged openstack/watcher stable/2025.1: Drop sg_core prometheus related vars https://review.opendev.org/c/openstack/watcher/+/946737
12:44:22 then I think this is creeping into a spec
12:44:27 or at least something
12:44:35 that needs more discussion than we can do right now
12:44:48 shall we loop back to this again next week
12:44:52 and think about it a bit more.
12:44:56 sure
12:45:01 agreed, I did not intend to change the strategy behaviour with my bug, initially I just noticed the UX
12:45:02 +1
12:45:17 it may be correct, but at least I'd like to understand better how that works, to set expectations
12:45:27 jgilaber: if you can, use --debug on the openstack client to attach the raw api output to the bug, if you have time
12:45:38 sure, I'll do that
12:46:03 thanks for raising that, jgilaber
12:46:04 sean-k-mooney: which output, the action plan?
12:46:19 the action plan show, yes
12:46:30 ack, will do after the mtg
12:46:30 I want to see if the api response and the cli output align
12:46:53 we may be truncating the output or rounding in the client
12:47:07 but in general I just want to see the actual response
12:47:14 shall we move on?
12:47:21 sure, ok, the next 2 bugs we already discussed at the ptg
12:47:31 #link https://bugs.launchpad.net/watcher/+bug/2106407 (Action Plans status is wrongly reported when Actions fail)
12:47:46 it was missing status
12:47:54 I set it as triaged
12:47:57 thanks amoralej
12:48:00 as we discussed it in the ptg
12:48:06 next
12:48:06 yep, and I agree with it being high also
12:48:10 I plan to work on it but didn't have the time for it
12:48:21 +1
12:48:22 ack
12:48:25 #link https://bugs.launchpad.net/watcher/+bug/2104220 (NovaClusterDataModelCollector cache generate wrong action plan)
12:48:42 we also discussed this at the ptg
12:48:56 ya, so we still need to confirm if we are looking at the wrong field
12:49:08 i.e. saving the source host instead of the dest
12:49:18 yes, we need to check the nova notification to see if that is working as expected
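A hedged illustration of the check in question: the event type and the nova_object.data envelope below follow nova's versioned notification format, while the handler and model names are made up for the example. Whether the payload consumed at migration end already names the destination host, or still the source, is exactly what needs confirming against a real notification.

    def on_instance_event(event_type, payload, model):
        # payload layout follows nova's versioned notifications;
        # 'model' and its method are hypothetical
        data = payload['nova_object.data']
        if event_type == 'instance.live_migration_post.end':
            # bug 2104220: confirm whether data['host'] is already the
            # destination at this point, or still the source host
            model.map_instance(data['uuid'], data['host'])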
12:49:42 I just set this to high also
12:49:48 I was planning to validate it in my env too
12:49:51 since this is the primary way we update the cache
12:50:03 that would be good, if you have time
12:50:17 this may have been a regression in nova
12:50:31 i.e. we could have changed when that event gets sent
12:50:36 so I will assign it to myself for now, unless someone is already looking at it
12:50:56 we did do that a few releases ago, but it's been long enough that if that was the cause, we likely should just update watcher
12:51:33 to be clear, we'd totally fix the issue with models out of sync if we have nova notifications enabled (and we don't have a bug in the model update logic)?
12:51:46 we have the notifications enabled
12:51:53 so that's not the problem
12:52:16 the problem is the field we are updating from seems to have the source host, not the destination
12:52:21 which is what we were expecting
12:52:45 so the bug is either in how we are parsing the notification and updating the model
12:52:55 I don't mean for that particular bug, but about the expected behavior of watcher. Having nova notifications enabled should ensure watcher is always correct?
12:52:59 or nova accidentally changed the behavior a few cycles ago and no one noticed
12:53:15 amoralej: in general, yes
12:53:28 amoralej: that is the recommended way to deploy watcher
12:53:36 good
12:53:44 I say in general because there is a short interval
12:53:55 where we won't have processed the notification yet
12:54:05 but it's much smaller than relying only on the periodic
12:54:08 tbh, I had missed that it updates based on notifications ...
12:54:09 sure
12:54:21 that is much better
12:54:29 downstream we skipped enabling it because notifications are not supported in our new installer yet
12:54:46 specifically in nova
12:54:58 so we will also need to support that in our new installer once that gap is closed
12:55:00 from a performance pov, is enabling notifications expensive?
12:55:06 devstack does it by default
12:55:15 kind of
12:55:18 ack
12:55:24 it puts a lot of extra load on rabbit
12:55:36 it's actually recommended to have a separate rabbit service just for notifications
12:55:37 no need to go into details now, we are almost out of time, but thanks for the clarification
12:55:40 but the bigger issue
12:55:44 is if there is no consumer
12:55:51 the rabbit queue builds forever
12:56:01 and just fills up ram
12:56:05 ++
12:56:09 ok, we don't have too much time to cover the next 2 bugs in the list
12:56:19 so moving them to the next meeting
12:56:28 I do have one to highlight
12:56:30 https://bugs.launchpad.net/watcher/+bug/2108855
12:56:43 this is a feature request, not an actual bug
12:56:52 ack, sean-k-mooney, I was reading through it
12:57:03 unfortunately we did not discuss this in the ptg
12:57:29 we can recommend that everybody read this LP bug
12:57:32 ok, so it should not be too hard based on the proposed implementation in observabilityclient
12:57:36 #link https://bugs.launchpad.net/watcher/+bug/2108855 (Watcher should include keystone session when creating PrometheusAPIClient)
12:57:39 the tl;dr is that the openstack telemetry team are proposing to add an auth reverse proxy to provide multi-tenancy on top of prometheus
12:57:59 this is a non-trivial change, even if the code is small
12:58:23 and normally this is a classic example of where a spec would be required, because it has support implications for testing and upgrade
12:58:24 in case the session has an admin role, will it return data for all tenants?
12:58:45 (I hope so) ...
12:58:47 so that is one of the design questions we need to resolve
12:58:53 otherwise we should not support this
12:59:06 but yes, I believe that is the intent
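A sketch of the direction being discussed, not a working recipe: the idea in the bug is that watcher would build a keystoneauth1 session and hand it to observabilityclient's PrometheusAPIClient, so queries routed through the proposed auth proxy carry a keystone token and get scoped accordingly. The session= keyword is the proposal under review, not a released API, and the config group name is invented for the example.

    from keystoneauth1 import loading, session
    from observabilityclient.prometheus_client import PrometheusAPIClient

    def build_prometheus_client(conf, host):
        # standard keystoneauth1 plumbing; 'prometheus_client' is a
        # hypothetical config group used only for this example
        auth = loading.load_auth_from_conf_options(conf, 'prometheus_client')
        sess = session.Session(auth=auth)
        # the 'session' kwarg is what bug 2108855 proposes adding to
        # observabilityclient; it does not exist in released versions
        return PrometheusAPIClient(host, session=sess)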
12:59:35 right, so we should bring this topic back to the next meeting
12:59:42 yes
12:59:49 let's reach out to jaromir
12:59:51 about the next meeting
12:59:57 and see if they can attend next week
13:00:01 #topic chair next meetings
13:00:15 I will be out next week, due to a holiday
13:00:23 not sure about others
13:00:29 we need someone to chair
13:00:30 next thursday/friday are local holidays here
13:00:58 +1 next week is a holiday for me as well
13:01:04 yeah
13:01:24 It's a holiday for me too
13:01:30 ack
13:01:32 I will let rlandy decide about cancelling or not
13:01:36 we can skip next week
13:01:38 maybe we should cancel it
13:01:40 but I think that we should skip
13:01:44 if we do not have quorum
13:01:47 ack
13:02:07 if enough people are out - yeah
13:02:08 dviroel: can you send a 2-line message to the list just declaring it skipped
13:02:22 I think we have at least 4 people that will not be here
13:02:27 #action dviroel to cancel next meeting (ML email)
13:02:31 so that's over half the normal attendees
13:02:32 ack
13:02:34 Merged openstack/watcher stable/2025.1: Query by fqdn_label instead of instance for host metrics https://review.opendev.org/c/openstack/watcher/+/946153
13:02:36 Merged openstack/watcher stable/2025.1: Aggregate by fqdn label instead instance in host cpu metrics https://review.opendev.org/c/openstack/watcher/+/946732
13:02:38 Merged openstack/watcher stable/2024.2: Replace deprecated LegacyEngineFacade https://review.opendev.org/c/openstack/watcher/+/942909
13:02:39 we are out of time
13:02:39 Merged openstack/watcher stable/2024.2: Further database refactoring https://review.opendev.org/c/openstack/watcher/+/942910
13:02:48 thanks for joining, all
13:02:58 #endmeeting