opendevreview | David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 | 09:18 |
---|---|---|
opendevreview | David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 | 09:31 |
opendevreview | David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 | 09:35 |
opendevreview | Alfredo Moralejo proposed openstack/watcher master: Report host_ram_usage in KiB when using prometheus datasource https://review.opendev.org/c/openstack/watcher/+/952212 | 09:58 |
opendevreview | David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 | 11:24 |
opendevreview | Alfredo Moralejo proposed openstack/watcher master: Check result of retype action based on type and status https://review.opendev.org/c/openstack/watcher/+/951513 | 11:30 |
mtembo | #startmeeting Watcher Meeting - 19 June 2025 | 12:00 |
opendevmeet | Meeting started Thu Jun 19 12:00:26 2025 UTC and is due to finish in 60 minutes. The chair is mtembo. Information about MeetBot at http://wiki.debian.org/MeetBot. | 12:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 12:00 |
opendevmeet | The meeting name has been set to 'watcher_meeting___19_june_2025' | 12:00 |
mtembo | Hello folks, who is around ? | 12:01 |
jgilaber | o/ | 12:01 |
sean-k-mooney | o/ | 12:01 |
chandankumar | o/ | 12:01 |
amoralej_ | o/ | 12:01 |
mtembo | Topics for today: | 12:02 |
mtembo | #link: https://etherpad.opendev.org/p/openstack-watcher-irc-meeting | 12:02 |
morenod | o/ | 12:02 |
mtembo | Alright thank you. Let's get started | 12:04 |
mtembo | #topic: (morenod): Refact on creating instances and inject metrics | 12:05 |
mtembo | #link: https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 | 12:05 |
*** | amoralej_ is now known as amoralej | 12:06 |
morenod | so the idea behind this is to create a baseline for how we should manage instance creation in tempest tests, so please check the code provided and add your comments so we can get a solution that works for all of us. most of the work has been done in base.py. | 12:08 |
mtembo | Alright, thank you David. Let's please review the code and provide feedback. | 12:12 |
mtembo | Moving on to bug triage | 12:13 |
mtembo | #link: https://bugs.launchpad.net/watcher/+bug/2113862 | 12:13 |
sean-k-mooney | so that's a docs bug, not a request for a code change, correct? | 12:16 |
amoralej | I opened that to report some issues I found in the workload_stabilization documentation | 12:16 |
sean-k-mooney | you just want to clarify how we describe things | 12:16 |
amoralej | yes, that's docs only | 12:16 |
sean-k-mooney | ok then let's mark it as triaged, low, and add the doc tag | 12:17 |
sean-k-mooney | i'll go do that now and we can fix it when we have time | 12:17 |
opendevreview | David proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 | 12:17 |
sean-k-mooney | we can review the exact wording changes in the patch, but skimming the report it seems ok | 12:17 |
amoralej | yes, that's fine | 12:17 |
mtembo | Thank you. Next bug | 12:19 |
sean-k-mooney | cool done | 12:19 |
mtembo | #link: https://bugs.launchpad.net/watcher/+bug/2113936 | 12:19 |
sean-k-mooney | so we did disable the cpu reporting at the host level but we didn't do it for ceilometer | 12:21 |
sean-k-mooney | in the fake data test | 12:21 |
sean-k-mooney | we could just disable deploying ceilometer in that job | 12:21 |
sean-k-mooney | it would speed it up and force us to ensure all required metrics were injected | 12:22 |
amoralej | anyway, imo that's uncovering a real bug in the way we pull metrics | 12:22 |
amoralej | we fixed a similar one for the host metrics, but not for the instance ones | 12:22 |
amoralej | https://review.opendev.org/c/openstack/watcher/+/952364 | 12:22 |
amoralej | the way we identify metrics from a vm is using resource=instance_id so that's what we should use to aggregate too | 12:23 |
sean-k-mooney | yes that looks similar | 12:23 |
sean-k-mooney | i thought we originally did an audit of all of them to use the label explicitly | 12:23 |
amoralej | for host metrics, but we missed this one | 12:24 |
amoralej | i think this is the only one missing | 12:25 |
sean-k-mooney | ok can you double check that we are using the labels properly on all of them. i'll try and review the patch this week | 12:25 |
sean-k-mooney | ack | 12:25 |
sean-k-mooney | the bug is in progress and we have a fix so i think we can move on | 12:25 |
mtembo | moving on | 12:26 |
mtembo | #link: https://bugs.launchpad.net/watcher/+bug/2112450 | 12:26 |
sean-k-mooney | ya so this is really just improving debuggability, which is a valid thing to treat as a bug | 12:26 |
morenod | the important thing to fix is to find the why | 12:27 |
sean-k-mooney | it's technically a mini feature but we are building the new backend with production deployment in mind | 12:27 |
morenod | we are not always missing the logs, it is about 25% of the executions | 12:27 |
sean-k-mooney | oh | 12:27 |
sean-k-mooney | that's very different | 12:28 |
sean-k-mooney | that is concerning | 12:28 |
amoralej | it is | 12:28 |
amoralej | may it be somehow related to concurrency when using gnocchi? | 12:28 |
sean-k-mooney | is there a pattern, i.e. it does not log if a test fails, or is it random? | 12:28 |
sean-k-mooney | amoralej: technically our logging is buffered by default and oslo has some concurrency protections | 12:29 |
sean-k-mooney | but it's possible that if the greenthread resumes on the wrong thread | 12:29 |
sean-k-mooney | then logging can be lost | 12:29 |
morenod | I've seen errors on succeeded strategies and failed ones. I haven't found any pattern | 12:29 |
morenod | but I can add more examples to the log so maybe somebody finds a pattern | 12:30 |
sean-k-mooney | in general you are not meant to mix eventlet with real threads and we are doing that in watcher | 12:30 |
morenod | the fact is that we are not losing a line, we are losing the entire decision manager log related to the strategy | 12:30 |
amoralej | in prometheus case, you didn't find that issue, right? | 12:30 |
morenod | I can't remember any now, probably not, all the ones I remember are on gnocchi | 12:31 |
amoralej | i haven't found it when debugging prometheus, i'd say | 12:31 |
sean-k-mooney | my guess is it's eventlet related but how exactly i'm not sure. it sounds like a valid bug but without more info it will be hard to determine. | 12:33 |
sean-k-mooney | what do people think, should we mark it wishlist to reflect we may not have capacity to find out why and fix it, or do we make it high as it indicates there may be concurrency/logging problems in general | 12:34 |
sean-k-mooney | or leave it as incomplete | 12:34 |
sean-k-mooney | since we do not have a reliable reproducer | 12:34 |
sean-k-mooney | and/or a working theory of where the exact problem is | 12:35 |
sean-k-mooney | there is obviously something wrong but this could take a lot of time to root cause | 12:35 |
jgilaber | I think wishlist is better than incomplete in this case | 12:36 |
amoralej | may it be related to the model update execution with audit run or something | 12:37 |
jgilaber | and it seems important, so at least medium | 12:37 |
sean-k-mooney | amoralej: if we saw the logs later | 12:37 |
sean-k-mooney | then i would think that maybe the delay was caused by a blocking task or something like that | 12:38 |
jgilaber | the bug says that it does not happen with prometheus, that is in a different zuul job right? | 12:38 |
sean-k-mooney | we kind of need to confirm if they are ever printed or just not there at all | 12:38 |
jgilaber | could it be some setup difference between the two jobs? | 12:38 |
sean-k-mooney | well morenod said it happens sometimes in the gnocchi job | 12:39 |
amoralej | it may be good to reproduce it out of ci, but yep, it will be hard to debug | 12:39 |
sean-k-mooney | so that implies it's either caused by the specific patch under review | 12:39 |
sean-k-mooney | or it's an intermittent latent bug | 12:39 |
morenod | yes, same code on same job sometimes works and sometimes fails | 12:39 |
sean-k-mooney | lets move on for now and come back to this | 12:40 |
mtembo | how do we triage this one. final verdict ? | 12:40 |
sean-k-mooney | so incomplete means cannot be verified and need more info to triage | 12:41 |
sean-k-mooney | so incomplete and medium sound about right | 12:41 |
sean-k-mooney | or leave it in new and we can look at it again next week | 12:41 |
sean-k-mooney | i think this would need someone to trace the logs and code very carefully to debug | 12:41 |
amoralej | let's keep adding links to logs if we ever find it again in ci so that we have more data points | 12:42 |
amoralej | in the bug | 12:42 |
morenod | ok | 12:42 |
mtembo | Thanks, updating status ... moving on to next one | 12:42 |
mtembo | #link: https://bugs.launchpad.net/watcher/+bug/2111712 | 12:42 |
jgilaber | that does not look related to watcher, is it? | 12:44 |
amoralej | i'd say so | 12:44 |
sean-k-mooney | no it does not, i would mark it as invalid | 12:44 |
jgilaber | + | 12:45 |
jgilaber | +1 | 12:45 |
mtembo | triaged as invalid. moving on to the next one | 12:46 |
mtembo | #link: https://bugs.launchpad.net/watcher/+bug/2111113 | 12:46 |
jgilaber | this one is about zone migration | 12:47 |
jgilaber | the strategy trusts the user input blindly | 12:47 |
jgilaber | and can create migrations to nodes that do not exist | 12:47 |
jgilaber | which nova then correctly refuses and the action fails | 12:47 |
jgilaber | I don't think it's the most urgent problem but it would be nice to warn the user at least | 12:48 |
sean-k-mooney | so there are two ways to make this better | 12:49 |
sean-k-mooney | 1 the migration action needs a pre_condition | 12:49 |
sean-k-mooney | to check that 1 the instance still exists and 2 the dest exists | 12:49 |
sean-k-mooney | but this should be caught in the decision engine or api | 12:49 |
sean-k-mooney | probably the decision engine, and it should check the dest actually exists before computing an action plan | 12:50 |
jgilaber | +1 that was my thought as well | 12:50 |
sean-k-mooney | effectively the audit should validate the inputs before executing it | 12:50 |
amoralej | i was wondering if we should validate this in api, but that would require specific per-strategy logic | 12:50 |
sean-k-mooney | ya the api would need too much knowledge of the strategies | 12:50 |
sean-k-mooney | but the audit logic should validate this before attempting to calculate the action plan | 12:51 |
sean-k-mooney | we can implement a pre_condition function on the strategies to validate the inputs like we have for actions | 12:52 |
sean-k-mooney | then all the decision engine needs to do is call that in the audit execution and it's nicely encapsulated | 12:52 |
amoralej | i was thinking something like that, a validate_input() method on the strategies that the api may call | 12:52 |
amoralej | but yeah, doing it in the decision-engine at audit execution will work | 12:52 |
sean-k-mooney | maybe | 12:52 |
sean-k-mooney | the benefit of that is we could return a 400 | 12:53 |
amoralej | exactly | 12:53 |
sean-k-mooney | i think that would be acceptable | 12:53 |
amoralej | don't allow the audit to be created, instead of having it fail | 12:53 |
sean-k-mooney | nova does do some prevalidation in the api for things like whether the neutron network exists | 12:53 |
sean-k-mooney | this is in line with that | 12:53 |
sean-k-mooney | ya | 12:53 |
sean-k-mooney | where we can, we do not allow the instance record to be created if we can validate it sanely in the api | 12:54 |
sean-k-mooney | so i think 1 add the validation logic to each strategy and 2 call that uniformly in the api | 12:54 |
sean-k-mooney | so valid and medium? | 12:55 |
jgilaber | sounds right | 12:55 |
sean-k-mooney | ok updated | 12:57 |
sean-k-mooney | i have a meeting at the top of the hour | 12:57 |
mtembo | I think we are out of time. | 12:57 |
mtembo | volunteers to chair next week's meeting? | 12:57 |
sean-k-mooney | so i suggest we wrap there | 12:57 |
mtembo | I will chair the next meeting | 12:58 |
mtembo | Also the bugs we have not had time for will be transferred to next week | 12:59 |
mtembo | thank you all for attending | 12:59 |
mtembo | #endmeeting | 13:00 |
opendevmeet | Meeting ended Thu Jun 19 13:00:09 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 13:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/watcher_meeting___19_june_2025/2025/watcher_meeting___19_june_2025.2025-06-19-12.00.html | 13:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/watcher_meeting___19_june_2025/2025/watcher_meeting___19_june_2025.2025-06-19-12.00.txt | 13:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/watcher_meeting___19_june_2025/2025/watcher_meeting___19_june_2025.2025-06-19-12.00.log.html | 13:00 |
opendevreview | Alfredo Moralejo proposed openstack/watcher master: Use KiB as unit for host_ram_usage when using prometheus datasource https://review.opendev.org/c/openstack/watcher/+/952212 | 14:26 |
amoralej | sean-k-mooney, ^ I hope it's more clear now | 14:26 |
sean-k-mooney | amoralej: yep looks good to me +2 | 14:28 |
opendevreview | Alfredo Moralejo proposed openstack/watcher-tempest-plugin master: Fix injected host_ram_usage metrics https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952897 | 14:33 |
opendevreview | David proposed openstack/watcher-tempest-plugin master: Add workload_stabilization RAM tests and adapt the current one for CPU https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952807 | 14:33 |
opendevreview | Merged openstack/watcher master: Aggregate by label when querying instance cpu usage in prometheus https://review.opendev.org/c/openstack/watcher/+/952364 | 14:46 |
opendevreview | Alfredo Moralejo proposed openstack/watcher stable/2025.1: Aggregate by label when querying instance cpu usage in prometheus https://review.opendev.org/c/openstack/watcher/+/952898 | 14:55 |
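
The "aggregate by label" fix discussed for bug 2113936 can be illustrated with a small sketch. The log only says that a VM's metrics are identified with `resource=instance_id` and that the same label should be used when aggregating. The metric name `ceilometer_cpu`, the `avg`/`rate` combination, the 300 second window and the helper name `build_instance_cpu_query` are assumptions made for illustration, not Watcher's actual query code.

```python
import uuid


def build_instance_cpu_query(instance_id, label="resource",
                             metric="ceilometer_cpu", period=300):
    """Build a PromQL query for one instance, aggregated by its own label.

    Hypothetical helper: metric name, window and aggregation are assumptions.
    """
    # Normalise and validate the id so arbitrary text cannot reach the query.
    instance_id = str(uuid.UUID(instance_id))
    selector = f'{metric}{{{label}="{instance_id}"}}'
    # Aggregating by the same label used for selection keeps series that
    # differ only in other labels (for example the exporter host) together.
    return f"avg by ({label}) (rate({selector}[{period}s]))"


query = build_instance_cpu_query("3f0e4a2c-8d1b-4c7e-9a55-0b6f2d1e7c90")
print(query)
```

Grouping by the selection label rather than by whatever label happens to be unique in one deployment is the design point raised in the meeting: the result is then one series per instance regardless of how many exporters report it.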
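
The outcome agreed for bug 2111113 (add validation logic to each strategy, then call it uniformly from the API so a bad request gets a 400) could look roughly like the sketch below. Every name here is hypothetical: `StrategySketch`, `ZoneMigrationSketch`, `InvalidAuditInput`, `validate_input` and `create_audit` are illustrative stand-ins, and the `compute_nodes` / `dst_node` parameter layout is an assumption rather than a statement about Watcher's real schema.

```python
class InvalidAuditInput(ValueError):
    """Raised when user-supplied audit parameters cannot be satisfied."""


class StrategySketch:
    """Hypothetical base class: strategies accept anything by default."""

    def __init__(self, input_parameters):
        self.input_parameters = input_parameters

    def validate_input(self, known_hosts):
        return None


class ZoneMigrationSketch(StrategySketch):
    """Hypothetical strategy that refuses destinations that do not exist."""

    def validate_input(self, known_hosts):
        for entry in self.input_parameters.get("compute_nodes", []):
            dst = entry.get("dst_node")
            if dst is not None and dst not in known_hosts:
                raise InvalidAuditInput(
                    f"destination node {dst!r} does not exist")


def create_audit(strategy, known_hosts):
    """Hypothetical API handler: validate first, persist only if valid."""
    try:
        strategy.validate_input(known_hosts)
    except InvalidAuditInput as exc:
        # Reject the request instead of creating an audit that will fail.
        return 400, {"error": str(exc)}
    return 201, {"state": "PENDING"}


hosts = {"compute-0", "compute-1"}
bad = ZoneMigrationSketch(
    {"compute_nodes": [{"src_node": "compute-0", "dst_node": "ghost"}]})
good = ZoneMigrationSketch(
    {"compute_nodes": [{"src_node": "compute-0", "dst_node": "compute-1"}]})
print(create_audit(bad, hosts))
print(create_audit(good, hosts))
```

Keeping the check on the strategy means the API layer needs no per-strategy knowledge, which was the concern raised in the discussion; the same method could also be called by the decision engine at audit execution time.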