#openstack-watcher log

12:00:26 <mtembo> #startmeeting Watcher Meeting - 19 June 2025
12:00:26 <opendevmeet> Meeting started Thu Jun 19 12:00:26 2025 UTC and is due to finish in 60 minutes.  The chair is mtembo. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:26 <opendevmeet> The meeting name has been set to 'watcher_meeting___19_june_2025'
12:01:25 <mtembo> Hello folks, who is around?
12:01:34 <jgilaber> o/
12:01:36 <sean-k-mooney> o/
12:01:37 <chandankumar> o/
12:01:46 <amoralej_> o/
12:02:11 <mtembo> Topics for today:
12:02:12 <mtembo> #link: https://etherpad.opendev.org/p/openstack-watcher-irc-meeting
12:02:45 <morenod> o/
12:04:10 <mtembo> Alright thank you. Let's get started
12:05:06 <mtembo> #topic: (morenod): Refact on creating instances and inject metrics
12:05:23 <mtembo> #link: https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884
12:08:02 <morenod> so the idea behind this is to create a baseline for how we should manage instance creation in the tempest tests. please check the code provided and add your comments so we can get a solution that is valid for all of us. most of the work has been done in base.py.
12:12:09 <mtembo> Alright, thank you David. Let's please review the code and provide feedback.
12:13:10 <mtembo> Moving on to bug triage
12:13:28 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2113862
12:16:37 <sean-k-mooney> so that's a docs bug, not a request for a code change, correct?
12:16:48 <amoralej> I opened that to report some issues i found in the workload_stabilization documentation
12:16:54 <sean-k-mooney> you just want to clarify how we describe things
12:16:58 <amoralej> yes, that's docs only
12:17:13 <sean-k-mooney> ok then let's mark it as triaged, low, and add the doc tag
12:17:23 <sean-k-mooney> i'll go do that now and we can fix it when we have time
12:17:37 <opendevreview> David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884
12:17:41 <sean-k-mooney> we can review the exact wording changes in the patch, but skimming the report it seems ok
12:17:41 <amoralej> yes, that's fine
12:19:48 <mtembo> Thank you. Next bug
12:19:48 <sean-k-mooney> cool done
12:19:49 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2113936
12:21:13 <sean-k-mooney> so we did disable the cpu reporting at the host level but we didn't do it for ceilometer
12:21:17 <sean-k-mooney> in the fake data test
12:21:28 <sean-k-mooney> we could just disable deploying ceilometer in that job
12:22:00 <sean-k-mooney> it would speed it up and force us to ensure all required metrics were injected
12:22:05 <amoralej> anyway, imo that's uncovering a real bug in the way we pull metrics
12:22:34 <amoralej> we fixed a similar one for the host metrics, but not for the instance ones
12:22:35 <amoralej> https://review.opendev.org/c/openstack/watcher/+/952364
12:23:27 <amoralej> the way we identify metrics from a vm is using resource=instance_id so that's what we should use to aggregate too
12:23:37 <sean-k-mooney> yes that looks similar
12:23:54 <sean-k-mooney> i thought we originally did an audit of all of them to use the label explicitly
12:24:48 <amoralej> for host metrics, but we missed this one
12:25:01 <amoralej> i think this is the only one missing
12:25:22 <sean-k-mooney> ok, can you double check that we are using the labels properly on all of them? i'll try and review the patch this week
12:25:26 <sean-k-mooney> ack
12:25:38 <sean-k-mooney> the bug is in progress and we have a fix so i think we can move on
12:26:06 <mtembo> moving on
12:26:07 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2112450
12:26:57 <sean-k-mooney> ya so this is really just improving debuggability, which is a valid thing to treat as a bug
12:27:29 <morenod> the important thing is to find out why
12:27:40 <sean-k-mooney> it's technically a mini feature but we are building the new backend with production deployment in mind
12:27:42 <morenod> we are not always missing the logs, it is about 25% of the executions
12:27:56 <sean-k-mooney> oh
12:28:04 <sean-k-mooney> that's very different
12:28:07 <sean-k-mooney> that is concerning
12:28:11 <amoralej> it is
12:28:27 <amoralej> may it be somehow related to concurrency when using gnocchi?
12:28:40 <sean-k-mooney> is there a pattern, i.e. it does not log if a test fails, or is it random?
12:29:07 <sean-k-mooney> amoralej: technically our logging is buffered by default and oslo has some concurrency protections
12:29:23 <sean-k-mooney> but it's possible that if the greenthread resumes on the wrong thread
12:29:35 <sean-k-mooney> then logging can be lost
12:29:46 <morenod> I've seen errors on succeeded strategies and failed ones. I haven't found any pattern
12:30:02 <morenod> but I can add more examples to the log so maybe somebody finds a pattern
12:30:07 <sean-k-mooney> in general you are not meant to mix eventlet with real threads and we are doing that in watcher
12:30:44 <morenod> the fact is that we are not losing a line, we are losing the entire decision manager log related to the strategy
12:30:46 <amoralej> in prometheus case, you didn't find that issue, right?
12:31:08 <morenod> I can't remember any now, probably not, all the ones I remember are on gnocchi
12:31:29 <amoralej> i haven't found it when debugging prometheus, i'd say
12:33:06 <sean-k-mooney> my guess is it's eventlet related but how exactly i'm not sure. it sounds like a valid bug but without more info it will be hard to determine.
12:34:34 <sean-k-mooney> what do people think, should we mark it wishlist to reflect we may not have capacity to find out why and fix it, or do we make it high as it indicates there may be concurrency/logging problems in general
12:34:51 <sean-k-mooney> or leave it as incomplete
12:34:58 <sean-k-mooney> since we do not have a reliable reproducer
12:35:07 <sean-k-mooney> and/or a working theory of where the exact problem is
12:35:44 <sean-k-mooney> there is obviously something wrong but this could take a lot of time to root cause
12:36:38 <jgilaber> I think wishlist is better than incomplete in this case
12:37:08 <amoralej> may it be related to the model update execution with audit run or something
12:37:09 <jgilaber> and it seems important, so at least medium
12:37:55 <sean-k-mooney> amoralej: if we saw the logs later
12:38:19 <sean-k-mooney> then i would think that maybe the delay was caused by a blocking task or something like that
12:38:28 <jgilaber> the bug says that it does not happen with prometheus, that is in a different zuul job right?
12:38:36 <sean-k-mooney> we kind of need to confirm if they are ever printed or just not there at all
12:38:38 <jgilaber> could it be some setup difference between the two jobs?
12:39:03 <sean-k-mooney> well morenod said it happens sometimes in the gnocchi job
12:39:12 <amoralej> it may be good to reproduce it out of ci, but yep, it will be hard to debug
12:39:15 <sean-k-mooney> so that implies it's either caused by the specific patch under review
12:39:25 <sean-k-mooney> or it's an intermittent latent bug
12:39:43 <morenod> yes, same code on same job sometimes works and sometimes fails
12:40:22 <sean-k-mooney> let's move on for now and come back to this
12:40:37 <mtembo> how do we triage this one? final verdict?
12:41:00 <sean-k-mooney> so incomplete means it cannot be verified and needs more info to triage
12:41:17 <sean-k-mooney> so incomplete and medium sounds about right
12:41:31 <sean-k-mooney> or leave it in new and we can look at it again next week
12:41:50 <sean-k-mooney> i think this would need someone to trace the logs and code very carefully to debug
12:42:15 <amoralej> let's keep adding links to logs if we ever find it again in ci so that we have more data points
12:42:22 <amoralej> in the bug
12:42:40 <morenod> ok
12:42:43 <mtembo> Thanks, updating status ... moving on to next one
12:42:46 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2111712
12:44:07 <jgilaber> that does not look related to watcher, does it?
12:44:18 <amoralej> i'd say so
12:44:50 <sean-k-mooney> no it does not, i would mark it as invalid
12:45:17 <jgilaber> +
12:45:19 <jgilaber> +1
12:46:21 <mtembo> triaged as invalid. moving on to the next one
12:46:22 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2111113
12:47:11 <jgilaber> this one is about zone migration
12:47:20 <jgilaber> the strategy trusts the user input blindly
12:47:37 <jgilaber> and can create migrations to nodes that do not exist
12:47:48 <jgilaber> which nova then correctly refuses and the action fails
12:48:27 <jgilaber> I don't think it's the most urgent problem but it would be nice to warn the user at least
12:49:16 <sean-k-mooney> so there are two ways to make this better
12:49:25 <sean-k-mooney> 1 the migration action needs a pre_condition
12:49:39 <sean-k-mooney> to check that 1 the instance still exists and 2 the dest exists
12:49:50 <sean-k-mooney> but this should be caught in the decision engine or api
12:50:13 <sean-k-mooney> probably the decision engine, and it should check the dest actually exists before computing an action plan
12:50:25 <jgilaber> +1 that was my thought as well
12:50:35 <sean-k-mooney> effectively the audit should validate the inputs before executing it
12:50:37 <amoralej> i was wondering if we should validate this in api, but that would require specific per-strategy logic
12:50:56 <sean-k-mooney> ya the api would need too much knowledge of the strategies
12:51:22 <sean-k-mooney> but the audit logic should validate this before attempting to calculate the action plan
12:52:04 <sean-k-mooney> we can implement a pre_condition function on the strategies to validate the inputs like we have for actions
12:52:33 <sean-k-mooney> then all the decision engine needs to do is call that in the audit execution and it's nicely encapsulated
12:52:34 <amoralej> i was thinking something like that, a validate_input() method on the strategies that the api may call
12:52:54 <amoralej> but yeah, doing it in the decision-engine at audit execution will work
12:52:55 <sean-k-mooney> maybe
12:53:06 <sean-k-mooney> the benefit of that is we could return a 400
12:53:12 <amoralej> exactly
12:53:23 <sean-k-mooney> i think that would be acceptable
12:53:26 <amoralej> don't allow the audit to be created, instead of having it fail
12:53:43 <sean-k-mooney> nova does do some prevalidation in the api for things like whether the neutron network exists
12:53:46 <sean-k-mooney> this is in line with that
12:53:56 <sean-k-mooney> ya
12:54:19 <sean-k-mooney> where we can, we do not allow the instance record to be created if we can validate it sanely in the api
12:54:45 <sean-k-mooney> so i think 1 add the validation logic to each strategy and 2 call that uniformly in the api
12:55:02 <sean-k-mooney> so valid and medium?
12:55:11 <jgilaber> sounds right
12:57:04 <sean-k-mooney> ok updated
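
The two-step proposal for bug 2111113 (a validation method on each strategy, called uniformly from the API so that a bad request gets a 400) could look roughly like the sketch below. This is only an illustration under stated assumptions, not Watcher's actual code: ZoneMigrationSketch, InvalidAuditInput, create_audit, the known_nodes set and the compute_nodes/dst_node parameter layout are all hypothetical names chosen for the example. Only validate_input() comes from the discussion.

```python
from dataclasses import dataclass, field


class InvalidAuditInput(ValueError):
    """Hypothetical error that an API layer could map to an HTTP 400."""


@dataclass
class ZoneMigrationSketch:
    """Hypothetical stand-in for a strategy; not Watcher's real class."""

    known_nodes: set  # assumed: node names the strategy can look up
    input_parameters: dict = field(default_factory=dict)

    def validate_input(self):
        """Raise if any requested destination node is not a known node."""
        entries = self.input_parameters.get("compute_nodes", [])
        missing = sorted(
            entry["dst_node"]
            for entry in entries
            if entry.get("dst_node") and entry["dst_node"] not in self.known_nodes
        )
        if missing:
            raise InvalidAuditInput(
                "unknown destination node(s): " + ", ".join(missing)
            )


def create_audit(strategy):
    """Hypothetical API-side hook: validate first, then create the audit."""
    try:
        strategy.validate_input()
    except InvalidAuditInput as exc:
        # Reject the request up front instead of creating an audit
        # whose migration action would fail later.
        return 400, str(exc)
    return 201, "audit created"
```

With this shape the API needs no per-strategy knowledge: it calls the same validate_input() on whichever strategy the audit uses, and the decision engine could call the same method at audit execution time.
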
12:57:10 <sean-k-mooney> i have a meeting at the top of the hour
12:57:15 <mtembo> I think we are out of time.
12:57:19 <mtembo> volunteers to chair next week's meeting?
12:57:19 <sean-k-mooney> so i suggest we wrap there
12:58:49 <mtembo> I will chair the next meeting
12:59:27 <mtembo> Also the bugs we have not had time for will be transferred to next week
12:59:34 <mtembo> thank you all for attending
13:00:09 <mtembo> #endmeeting