12:00:26 <mtembo> #startmeeting Watcher Meeting - 19 June 2025
12:00:26 <opendevmeet> Meeting started Thu Jun 19 12:00:26 2025 UTC and is due to finish in 60 minutes. The chair is mtembo. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:26 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:26 <opendevmeet> The meeting name has been set to 'watcher_meeting___19_june_2025'
12:01:25 <mtembo> Hello folks, who is around?
12:01:34 <jgilaber> o/
12:01:36 <sean-k-mooney> o/
12:01:37 <chandankumar> o/
12:01:46 <amoralej_> o/
12:02:11 <mtembo> Topics for today:
12:02:12 <mtembo> #link: https://etherpad.opendev.org/p/openstack-watcher-irc-meeting
12:02:45 <morenod> o/
12:04:10 <mtembo> Alright thank you. Let's get started
12:05:06 <mtembo> #topic: (morenod): Refact on creating instances and inject metrics
12:05:23 <mtembo> #link: https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884
12:08:02 <morenod> so the idea behind this is to create a baseline for how we should manage instance creation in tempest tests, so please check the code provided and add your comments so we can get a solution that is valid for all of us. base.py is where most of the work has been done.
12:12:09 <mtembo> Alright, thank you David. Let's please review the code and provide feedback.
12:13:10 <mtembo> Moving on to bug triage
12:13:28 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2113862
12:16:37 <sean-k-mooney> so that's a docs bug, not a request for a code change, correct?
12:16:48 <amoralej> I opened that to report some issues I found in the workload_stabilization documentation
12:16:54 <sean-k-mooney> you just want to clarify how we describe things
12:16:58 <amoralej> yes, that's docs only
12:17:13 <sean-k-mooney> ok then let's mark it as triaged, low and add the doc tag
12:17:23 <sean-k-mooney> i'll go do that now and we can fix it when we have time
12:17:37 <opendevreview> David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884
12:17:41 <sean-k-mooney> the exact wording changes we can review in the patch, but skimming the report it seems ok
12:17:41 <amoralej> yes, that's fine
12:19:48 <mtembo> Thank you. Next bug
12:19:48 <sean-k-mooney> cool, done
12:19:49 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2113936
12:21:13 <sean-k-mooney> so we did disable the cpu reporting at the host level but we didn't do it for ceilometer
12:21:17 <sean-k-mooney> in the fake data test
12:21:28 <sean-k-mooney> we could just disable deploying ceilometer in that job
12:22:00 <sean-k-mooney> it would speed it up and force us to ensure all required metrics were injected
12:22:05 <amoralej> anyway, imo that's uncovering a real bug in the way we pull metrics
12:22:34 <amoralej> we fixed a similar one for the host metrics, but not for the instance ones
12:22:35 <amoralej> https://review.opendev.org/c/openstack/watcher/+/952364
12:23:27 <amoralej> the way we identify metrics from a vm is using resource=instance_id so that's what we should use to aggregate too
12:23:37 <sean-k-mooney> yes that looks similar
12:23:54 <sean-k-mooney> i thought we originally did an audit of all of them to use the label explicitly
12:24:48 <amoralej> for host metrics, but we missed this one
12:25:01 <amoralej> i think this is the only one missing
12:25:22 <sean-k-mooney> ok can you double check that we are using the labels properly on all of them. i'll try and review the patch this week
12:25:26 <sean-k-mooney> ack
12:25:38 <sean-k-mooney> the bug is in progress and we have a fix so i think we can move on
12:26:06 <mtembo> moving on
12:26:07 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2112450
12:26:57 <sean-k-mooney> ya so this is really just improving debuggability, which is a valid thing to treat as a bug
12:27:29 <morenod> the important thing to fix is to find the why
12:27:40 <sean-k-mooney> it's technically a mini feature but we are building the new backend with production deployment in mind
12:27:42 <morenod> we are not always missing the logs, it is about 25% of the executions
12:27:56 <sean-k-mooney> oh
12:28:04 <sean-k-mooney> that's very different
12:28:07 <sean-k-mooney> that is concerning
12:28:11 <amoralej> it is
12:28:27 <amoralej> may it be somehow related to concurrency when using gnocchi?
12:28:40 <sean-k-mooney> is there a pattern, i.e. it does not log if a test fails, or is it random
12:29:07 <sean-k-mooney> amoralej: technically our logging is buffered by default and oslo has some concurrency protections
12:29:23 <sean-k-mooney> but it's possible that if the greenthread resumes on the wrong thread
12:29:35 <sean-k-mooney> then logging can be lost
12:29:46 <morenod> I've seen errors on succeeded strategies and failed ones. I haven't found any pattern
12:30:02 <morenod> but I can add more examples to the log so maybe somebody finds a pattern
12:30:07 <sean-k-mooney> in general you are not meant to mix eventlet with real threads and we are doing that in watcher
12:30:44 <morenod> the fact is that we are not losing a line, we are losing the entire decision manager log related to the strategy
12:30:46 <amoralej> in the prometheus case, you didn't find that issue, right?
12:31:08 <morenod> I can't remember any now, probably not, all the ones I remember are on gnocchi
12:31:29 <amoralej> i haven't found it when debugging prometheus, i'd say
12:33:06 <sean-k-mooney> my guess is it's eventlet related but how exactly i'm not sure. it sounds like a valid bug but without more info it will be hard to determine.
12:34:34 <sean-k-mooney> what do people think, should we mark it wishlist to reflect we may not have capacity to find out why and fix it, or do we make it high as it indicates there may be concurrency/logging problems in general
12:34:51 <sean-k-mooney> or leave it as incomplete
12:34:58 <sean-k-mooney> since we do not have a reliable reproducer
12:35:07 <sean-k-mooney> and/or a working theory of where the exact problem is
12:35:44 <sean-k-mooney> there is obviously something wrong but this could take a lot of time to root cause
12:36:38 <jgilaber> I think wishlist is better than incomplete in this case
12:37:08 <amoralej> may it be related to the model update execution with audit run or something
12:37:09 <jgilaber> and it seems important, so at least medium
12:37:55 <sean-k-mooney> amoralej: if we saw the logs later
12:38:19 <sean-k-mooney> then i would think that maybe the delay was caused by a blocking task or something like that
12:38:28 <jgilaber> the bug says that it does not happen with prometheus, that is in a different zuul job right?
12:38:36 <sean-k-mooney> we kind of need to confirm if they are ever printed or just not there at all
12:38:38 <jgilaber> could it be some setup difference between the two jobs?
12:39:03 <sean-k-mooney> well morenod said it happens sometimes in the gnocchi job
12:39:12 <amoralej> it may be good to reproduce it out of ci, but yep, it will be hard to debug
12:39:15 <sean-k-mooney> so that implies it's either caused by the specific patch under review
12:39:25 <sean-k-mooney> or it's an intermittent latent bug
12:39:43 <morenod> yes, same code on same job sometimes works and sometimes fails
12:40:22 <sean-k-mooney> let's move on for now and come back to this
12:40:37 <mtembo> how do we triage this one. final verdict?
12:41:00 <sean-k-mooney> so incomplete means cannot be verified and needs more info to triage
12:41:17 <sean-k-mooney> so incomplete and medium sound about right
12:41:31 <sean-k-mooney> or leave it in new and we can look at it again next week
12:41:50 <sean-k-mooney> i think this would need someone to trace the logs and code very carefully to debug
12:42:15 <amoralej> let's keep adding links to logs if we ever find it again in ci so that we have more data points
12:42:22 <amoralej> in the bug
12:42:40 <morenod> ok
12:42:43 <mtembo> Thanks, updating status ... moving on to next one
12:42:46 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2111712
12:44:07 <jgilaber> that does not look related to watcher, does it?
12:44:18 <amoralej> i'd say so
12:44:50 <sean-k-mooney> no it does not, i would mark it as invalid
12:45:17 <jgilaber> +
12:45:19 <jgilaber> +1
12:46:21 <mtembo> triaged as invalid. moving on to the next one
12:46:22 <mtembo> #link: https://bugs.launchpad.net/watcher/+bug/2111113
12:47:11 <jgilaber> this one is about zone migration
12:47:20 <jgilaber> the strategy trusts the user input blindly
12:47:37 <jgilaber> and can create migrations to nodes that do not exist
12:47:48 <jgilaber> which nova then correctly refuses and the action fails
12:48:27 <jgilaber> I don't think it's the most urgent problem but it would be nice to warn the user at least
12:49:16 <sean-k-mooney> so there are two ways to make this better
12:49:25 <sean-k-mooney> 1 the migration action needs a pre_condition
12:49:39 <sean-k-mooney> to check that 1 the instance still exists and 2 the dest exists
12:49:50 <sean-k-mooney> but this should be caught in the decision engine or api
12:50:13 <sean-k-mooney> probably the decision engine, and it should check the dest actually exists before computing an action plan
12:50:25 <jgilaber> +1 that was my thought as well
12:50:35 <sean-k-mooney> effectively the audit should validate the inputs before executing it
12:50:37 <amoralej> i was wondering if we should validate this in api, but that would require specific per-strategy logic
12:50:56 <sean-k-mooney> ya the api would need too much knowledge of the strategies
12:51:22 <sean-k-mooney> but the audit logic should validate this before attempting to calculate the action plan
12:52:04 <sean-k-mooney> we can implement a pre_condition function on the strategies to validate the inputs like we have for actions
12:52:33 <sean-k-mooney> then all the decision engine needs to do is call that in the audit execution and it's nicely encapsulated
12:52:34 <amoralej> i was thinking something like that, a validate_input() method on the strategies that the api may call
12:52:54 <amoralej> but yeah, doing it in the decision-engine at audit execution will work
12:52:55 <sean-k-mooney> maybe
12:53:06 <sean-k-mooney> the benefit of that is we could return a 400
12:53:12 <amoralej> exactly
12:53:23 <sean-k-mooney> i think that would be acceptable
12:53:26 <amoralej> don't allow creating the audit instead of getting it failed
12:53:43 <sean-k-mooney> nova does do some prevalidation in the api for things like does the neutron network exist
12:53:46 <sean-k-mooney> this is in line with that
12:53:56 <sean-k-mooney> ya
12:54:19 <sean-k-mooney> where we can, we do not allow the instance record to be created if we can validate it sanely in the api
12:54:45 <sean-k-mooney> so i think 1 add the validation logic to each strategy and 2 call that uniformly in the api
12:55:02 <sean-k-mooney> so valid and medium?
12:55:11 <jgilaber> sounds right
12:57:04 <sean-k-mooney> ok updated
12:57:10 <sean-k-mooney> i have a meeting at the top of the hour
12:57:15 <mtembo> I think we are out of time.
12:57:19 <mtembo> volunteers to chair next week's meeting?
12:57:19 <sean-k-mooney> so i suggest we wrap there
12:58:49 <mtembo> I will chair the next meeting
12:59:27 <mtembo> Also the bugs we have not had time for will be transferred to next week
12:59:34 <mtembo> thank you all for attending
13:00:09 <mtembo> #endmeeting
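
Editor's note on bug 2111113: the outcome of the discussion (validation logic on each strategy, called uniformly from the API so a bad request is rejected with a 400) could look roughly like the sketch below. This is only an illustration under stated assumptions, not Watcher code. The class `ZoneMigrationSketch`, the exception `InvalidAuditInput`, the helper `create_audit`, the `known_nodes` set, and the `compute_nodes`/`src_node`/`dst_node` parameter layout are all hypothetical stand-ins. Only the method name `validate_input()` and the 400 response come from the meeting.

```python
from dataclasses import dataclass, field


class InvalidAuditInput(ValueError):
    """Hypothetical error: audit parameters name resources that do not exist."""


@dataclass
class ZoneMigrationSketch:
    """Hypothetical stand-in for a strategy; not Watcher's real class."""

    # Assumption: hostnames taken from the current compute model.
    known_nodes: set
    # Assumption: user-supplied audit parameters, shaped as a list of
    # {"src_node": ..., "dst_node": ...} entries under "compute_nodes".
    parameters: dict = field(default_factory=dict)

    def validate_input(self) -> None:
        """Raise InvalidAuditInput if any referenced node is unknown."""
        errors = []
        for entry in self.parameters.get("compute_nodes", []):
            for key in ("src_node", "dst_node"):
                node = entry.get(key)
                if node is not None and node not in self.known_nodes:
                    errors.append(f"{key} '{node}' does not exist")
        if errors:
            raise InvalidAuditInput("; ".join(errors))


def create_audit(strategy) -> int:
    """Hypothetical API-side hook returning an HTTP status code.

    Validation failures map to 400 so the audit is never created.
    The 201 success code is illustrative only.
    """
    try:
        strategy.validate_input()
    except InvalidAuditInput:
        return 400
    return 201
```

Keeping the check on the strategy means the API needs no per-strategy knowledge: it calls the same method on whichever strategy the audit selects, and the decision engine could call the same method again at audit execution time.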