Thursday, 2025-06-19

opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 09:18
opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 09:31
opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 09:35
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Report host_ram_usage in KiB when using prometheus datasource  https://review.opendev.org/c/openstack/watcher/+/952212 09:58
opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 11:24
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Check result of retype action based on type and status  https://review.opendev.org/c/openstack/watcher/+/951513 11:30
mtembo#startmeeting Watcher Meeting - 19 June 202512:00
opendevmeetMeeting started Thu Jun 19 12:00:26 2025 UTC and is due to finish in 60 minutes.  The chair is mtembo. Information about MeetBot at http://wiki.debian.org/MeetBot.12:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.12:00
opendevmeetThe meeting name has been set to 'watcher_meeting___19_june_2025'12:00
mtemboHello folks, who is around ?12:01
jgilabero/12:01
sean-k-mooneyo/12:01
chandankumaro/12:01
amoralej_o/12:01
mtemboTopics for today:12:02
mtembo#link: https://etherpad.opendev.org/p/openstack-watcher-irc-meeting 12:02
morenodo/12:02
mtemboAlright thank you. Let's get started12:04
mtembo#topic: (morenod): Refact on creating instances and inject metrics12:05
mtembo#link: https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 12:05
*** amoralej_ is now known as amoralej12:06
morenodso the idea behind this is to create a baseline for how we should manage instance creation in tempest tests, so please check the code provided and add your comments so we can get a solution valid for all of us. base.py is where most of the work has been done.12:08
mtemboAlright, thank you David. Let's please review the code and provide feedback. 12:12
mtemboMoving on to bug triage 12:13
mtembo#link: https://bugs.launchpad.net/watcher/+bug/2113862 12:13
sean-k-mooneyso that's a docs bug, not a request for a code change, correct?12:16
amoralejI opened that to report some issues i found about the workload_stabilization documentation12:16
sean-k-mooneyyou just want to clarify how we describe things12:16
amoralejyes, that's docs only12:16
sean-k-mooneyok then let's mark it as triaged, low and add the doc tag12:17
sean-k-mooneyi'll go do that now and we can fix it when we have time12:17
opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Refact on creating instances and inject metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952884 12:17
sean-k-mooneythe exact wording changes we can review in the patch but skimming the report it seems ok12:17
amoralejyes, that's fine12:17
mtemboThank you. Next bug 12:19
sean-k-mooneycool done12:19
mtembo#link: https://bugs.launchpad.net/watcher/+bug/2113936 12:19
sean-k-mooneyso we did disable the cpu reporting at the host level but we didn't do it for ceilometer12:21
sean-k-mooneyin the fake data test12:21
sean-k-mooneywe could just disable deploying ceilometer in that job12:21
sean-k-mooneyit would speed it up and force us to ensure all required metrics were injected12:22
amoralejanyway, imo that's uncovering a real bug in the way we pull metrics12:22
amoralejwe fixed a similar one for the host metrics, but not for the instance ones12:22
amoralejhttps://review.opendev.org/c/openstack/watcher/+/952364 12:22
amoralejthe way we identify metrics from a vm is using resource=instance_id so that's what we should use to aggregate too12:23
sean-k-mooneyyes that looks similar12:23
sean-k-mooneyi thought we originally did an audit of all of them to use the label explicitly12:23
amoralejfor host metrics, but we missed this one12:24
amoraleji think this is the only one missing12:25
sean-k-mooneyok can you double check that we are using the labels properly on all of them. i'll try and review the patch this week12:25
sean-k-mooneyack12:25
sean-k-mooneythe bug is in progress and we have a fix so i think we can move on12:25
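
A minimal sketch of the point made above: if a VM's metrics are identified by resource=instance_id, the same label can be used for aggregation. The metric name `ceilometer_cpu`, the rate window and the helper name are assumptions for illustration, not the query Watcher actually builds.

```python
def build_instance_cpu_query(instance_id, metric="ceilometer_cpu",
                             label="resource", window="300s"):
    """Build a PromQL query that filters and aggregates on the same label.

    All defaults are hypothetical; only the 'resource' label comes from
    the discussion above.
    """
    # Select only the series that belong to this instance.
    selector = f'{metric}{{{label}="{instance_id}"}}'
    # Aggregate by that same label so several matching series
    # (for example from different scrape targets) collapse into one.
    return f"avg by ({label}) (rate({selector}[{window}]))"


query = build_instance_cpu_query("3f1c")
print(query)
```

Aggregating by the label used in the selector keeps the result at one series per instance, regardless of any other labels attached to the samples.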
mtembomoving on 12:26
mtembo#link: https://bugs.launchpad.net/watcher/+bug/2112450 12:26
sean-k-mooneyya so this is really just improving debuggability which is a valid thing to treat as a bug12:26
morenodthe important thing is to find out why12:27
sean-k-mooneyit's technically a mini feature but we are building the new backend with production deployment in mind12:27
morenodwe are not always missing the logs, it is about 25% of the executions12:27
sean-k-mooneyoh12:27
sean-k-mooneythat's very different12:28
sean-k-mooneythat is concerning12:28
amoralejit is12:28
amoralejmaybe somehow related to concurrency when using gnocchi?12:28
sean-k-mooneyis there a pattern, i.e. it does not log if a test fails, or is it random?12:28
sean-k-mooneyamoralej: technically our logging is buffered by default and oslo has some concurrency protections12:29
sean-k-mooneybut it's possible that if the greenthread resumes on the wrong thread12:29
sean-k-mooneythen logging can be lost12:29
morenodI've seen errors on succeeded strategies and failed ones. I haven't found any pattern12:29
morenodbut I can add more examples to the log so maybe somebody finds a pattern12:30
sean-k-mooneyin general you are not meant to mix eventlet with real threads and we are doing that in watcher12:30
morenodthe fact is that we are not losing a line, we are losing the entire decision manager log related to the strategy12:30
amoralejin prometheus case, you didn't find that issue, right?12:30
morenodI can't remember any now, probably not, all I remember are on gnocchi12:31
amoraleji haven't found it when debugging prometheus, i'd say12:31
sean-k-mooneymy guess is it's eventlet related but how exactly i'm not sure. it sounds like a valid bug but without more info it will be hard to determine.12:33
sean-k-mooneywhat do people think, should we mark it wishlist to reflect we may not have capacity to find out why and fix it, do we make it high as it indicates there may be concurrency/logging problems in general12:34
sean-k-mooneyor leave it as incomplete12:34
sean-k-mooneysince we do not have a reliable reproducer12:34
sean-k-mooneyand/or a working theory of where the exact problem is12:35
sean-k-mooneythere is obviously something wrong but this could take a lot of time to root cause12:35
jgilaberI think wishlist is better than incomplete in this case12:36
amoralejcould it be related to the model update execution with the audit run or something?12:37
jgilaberand it seems important, so at least medium12:37
sean-k-mooneyamoralej: if we saw the logs later12:37
sean-k-mooneythen i would think that maybe the delay was caused by a blocking task or something like that12:38
jgilaberthe bug says that it does not happen with prometheus, that is in a different zuul job right?12:38
sean-k-mooneywe kind of need to confirm if they are ever printed or just not there at all12:38
jgilabercould it be some setup difference between the two jobs?12:38
sean-k-mooneywell morenod said it happens sometimes in the gnocchi job12:39
amoralejit may be good to reproduce it out of ci, but yep, it will be hard to debug12:39
sean-k-mooneyso that implies it's either caused by the specific patch under review 12:39
sean-k-mooneyor it's an intermittent latent bug12:39
morenodyes, same code on same job sometimes works and sometimes fails12:39
sean-k-mooneylets move on for now and come back to this12:40
mtembohow do we triage this one. final verdict ?12:40
sean-k-mooneyso incomplete means cannot be verified and needs more info to triage12:41
sean-k-mooneyso incomplete and medium sound about right12:41
sean-k-mooneyor leave it in new and we can look at it again next week12:41
sean-k-mooneyi think this would need someone to trace the logs and code very carefully to debug12:41
amoralejlet's keep adding links to logs if we ever find it again in ci so that we have more data points12:42
amoralejin the bug12:42
morenodok12:42
mtemboThanks, updating status ... moving on to next one 12:42
mtembo#link: https://bugs.launchpad.net/watcher/+bug/2111712 12:42
jgilaberthat does not look related to watcher, does it?12:44
amoraleji'd say so12:44
sean-k-mooneyno it does not, i would mark it as invalid12:44
jgilaber+12:45
jgilaber+112:45
mtembotriaged as invalid. moving on to the next one 12:46
mtembo#link: https://bugs.launchpad.net/watcher/+bug/2111113 12:46
jgilaberthis one is about zone migration12:47
jgilaberthe strategy trusts the user input blindly12:47
jgilaberand can create migrations to nodes that do not exist12:47
jgilaberwhich nova then correctly refuses and the action fails12:47
jgilaberI don't think it's the most urgent problem but it would be nice to warn the user at least12:48
sean-k-mooneyso there are two ways to make this better12:49
sean-k-mooney1 the migration action needs a pre_condition12:49
sean-k-mooneyto check that 1 the instance still exists and 2 the dest exists12:49
sean-k-mooneybut this should be caught in the decision engine or api12:49
sean-k-mooneyprobably the decision engine, and it should check the dest actually exists before computing an action plan12:50
jgilaber+1 that was my thought as well12:50
sean-k-mooneyeffectively the audit should validate the inputs before executing it12:50
amoraleji was wondering if we should validate this in api, but that would require specific per-strategy logic12:50
sean-k-mooneyya the api would need too much knowledge of the strategies12:50
sean-k-mooneybut the audit logic should validate this before attempting to calculate the action plan12:51
sean-k-mooneywe can implement a pre_condition function on the strategies to validate the inputs like we have for actions12:52
sean-k-mooneythen all the decision engine needs to do is call that in the audit execution and it's nicely encapsulated12:52
amoraleji was thinking something like that, a validate_input() method on the strategies that the api may call12:52
amoralejbut yeah, doing it in the decision-engine at audit execution will work12:52
sean-k-mooneymaybe12:52
sean-k-mooneythe benefit of that is we could return a 40012:53
amoralejexactly12:53
sean-k-mooneyi think that would be acceptable12:53
amoralejdon't allow creating the audit, instead of having it fail later12:53
sean-k-mooneynova does do some prevalidation in the api for things like does the neutron network exist12:53
sean-k-mooneythis is in line with that12:53
sean-k-mooneyya12:53
sean-k-mooneywhere we can we do not allow the instance record to be created if we can validate it sanely in the api12:54
sean-k-mooneyso i think 1 add the validation logic to each strategy and 2 call that uniformly in the api12:54
sean-k-mooneyso valid and medium?12:55
jgilabersounds right12:55
sean-k-mooneyok updated 12:57
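
A rough sketch of the approach agreed above: each strategy validates its own inputs, and the API calls that hook uniformly so a bad request gets a 400 instead of a failed audit. Every name here (`ZoneMigrationSketch`, `validate_input`, `create_audit`, `InvalidAuditInput`, the `compute_nodes`/`dst_node` parameter layout) is hypothetical and only illustrates the idea; it is not Watcher's actual code.

```python
class InvalidAuditInput(Exception):
    """Raised when audit parameters reference resources that do not exist."""


class ZoneMigrationSketch:
    """Hypothetical stand-in for a strategy, not Watcher's real class."""

    def __init__(self, known_nodes):
        # In a real deployment this would come from the compute data model.
        self.known_nodes = set(known_nodes)

    def validate_input(self, parameters):
        """Check that every node named in the parameters actually exists."""
        requested = set()
        for entry in parameters.get("compute_nodes", []):
            for key in ("src_node", "dst_node"):
                if key in entry:
                    requested.add(entry[key])
        missing = sorted(requested - self.known_nodes)
        if missing:
            raise InvalidAuditInput(
                f"unknown compute nodes: {', '.join(missing)}")


def create_audit(strategy, parameters):
    """Hypothetical API-side handler returning (status_code, body)."""
    try:
        # The API needs no per-strategy knowledge, it only calls the hook.
        strategy.validate_input(parameters)
    except InvalidAuditInput as exc:
        return 400, {"error": str(exc)}
    return 201, {"parameters": parameters}


strategy = ZoneMigrationSketch(["node1", "node2"])
bad = {"compute_nodes": [{"src_node": "node1", "dst_node": "node3"}]}
print(create_audit(strategy, bad))
```

Keeping the check inside the strategy means the same hook could also be called by the decision engine at audit execution time.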
sean-k-mooneyi have a meeting at the top of the hour12:57
mtemboI think we are out of time. 12:57
mtembovolunteers to chair next week's meeting?12:57
sean-k-mooneyso i suggest we wrap there12:57
mtemboI will chair the next meeting12:58
mtemboAlso the bugs we have not had time for will be transferred to next week 12:59
mtembothank you all for attending12:59
mtembo#endmeeting13:00
opendevmeetMeeting ended Thu Jun 19 13:00:09 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)13:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/watcher_meeting___19_june_2025/2025/watcher_meeting___19_june_2025.2025-06-19-12.00.html 13:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/watcher_meeting___19_june_2025/2025/watcher_meeting___19_june_2025.2025-06-19-12.00.txt 13:00
opendevmeetLog:            https://meetings.opendev.org/meetings/watcher_meeting___19_june_2025/2025/watcher_meeting___19_june_2025.2025-06-19-12.00.log.html 13:00
opendevreviewAlfredo Moralejo proposed openstack/watcher master: Use KiB as unit for host_ram_usage when using prometheus datasource  https://review.opendev.org/c/openstack/watcher/+/952212 14:26
amoralejsean-k-mooney, ^ I hope it's more clear now14:26
sean-k-mooneyamoralej: yep looks good to me +214:28
opendevreviewAlfredo Moralejo proposed openstack/watcher-tempest-plugin master: Fix injected host_ram_usage metrics  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952897 14:33
opendevreviewDavid proposed openstack/watcher-tempest-plugin master: Add workload_stabilization RAM tests and adapt the current one for CPU  https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/952807 14:33
opendevreviewMerged openstack/watcher master: Aggregate by label when querying instance cpu usage in prometheus  https://review.opendev.org/c/openstack/watcher/+/952364 14:46
opendevreviewAlfredo Moralejo proposed openstack/watcher stable/2025.1: Aggregate by label when querying instance cpu usage in prometheus  https://review.opendev.org/c/openstack/watcher/+/952898 14:55
