12:00:37 <rlandy> #startmeeting Watcher IRC Meeting - July 17, 2025
12:00:37 <opendevmeet> Meeting started Thu Jul 17 12:00:37 2025 UTC and is due to finish in 60 minutes. The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:37 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:37 <opendevmeet> The meeting name has been set to 'watcher_irc_meeting___july_17__2025'
12:01:07 <rlandy> Hi all ... who is around?
12:01:18 <amoralej> o/
12:01:21 <morenod> o/
12:01:47 <rlandy> courtesy ping list: dviroel jgilaber sean-k-mooney
12:01:51 <jgilaber> o/
12:01:53 <dviroel> o/
12:03:13 <rlandy> chandankumar is away atm ... let's start
12:03:42 <rlandy> #topic: (chandan|not around today): Move workload_balance_strategy_cpu|ram tests to exclude list to unblock upstream watcher reviews
12:03:56 <rlandy> #link: https://launchpad.net/bugs/2116875
12:04:22 <dviroel> right, it is failing more often now
12:04:52 <dviroel> so we can skip the tests while working on a fix
12:05:20 <dviroel> morenod is already investigating and knows what is happening, I think
12:05:28 <rlandy> any objections to merging this and reverting when the fix is in?
12:06:07 <sean-k-mooney> o/
12:06:20 <amoralej> is it expected that different runs of the same jobs have different node sizes?
12:06:26 <morenod> I've added a comment on the bug; the problem is that compute nodes are sometimes 8 vCPUs and sometimes 4 vCPUs... we need to create the test with a dynamic threshold, based on the number of vCPUs of the compute nodes
12:06:42 <sean-k-mooney> I'm fine with skipping it for now; the other way to do that is to use the skip_because decorator in the tempest plugin
12:06:49 <sean-k-mooney> that takes a bug ref
12:07:06 <sean-k-mooney> but I'm ok with the regex approach for now
12:07:35 <sean-k-mooney> skip_because is slightly less work
12:07:38 <morenod> I'm also ok with skipping; I will need a few more days to have the fix
12:07:40 <sean-k-mooney> because it will skip everywhere
12:08:35 <sean-k-mooney> if there isn't a patch already I would prefer to use https://github.com/openstack/tempest/blob/master/tempest/lib/decorators.py#L60
12:08:45 <rlandy> morenod: reading your comment, the fix will take some time?
12:09:04 <sean-k-mooney> you just add it like this https://github.com/openstack/tempest/blob/master/tempest/api/image/v2/admin/test_image_task.py#L98
12:09:43 <dviroel> sean-k-mooney: yeah, it is preferable
12:09:45 <morenod> rlandy, I'm working on it now; maybe sometime between tomorrow and Monday it will be ready
12:10:23 <morenod> I like the skip_because solution, it is very clear
12:11:15 <jgilaber> amoralej, I can't find the nodeset definitions, but it could be possible that different providers have a label with the same name but using different flavours
12:11:18 <rlandy> #action rlandy to contact chandankumar to review above suggestions while morenod finishes the real fix
12:11:46 <amoralej> I guess that's what is happening; I thought there was a consensus about nodeset definitions
12:11:54 <sean-k-mooney> morenod: there is also a tempest CLI command to list all the decorated tests, I believe, so you can keep track of them over time
12:12:04 <amoralej> anyway, good to adjust the threshold to the actual node sizes
12:12:25 <sean-k-mooney> keep in mind the node size can differ upstream vs downstream, and even within upstream
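
A minimal sketch of the skip_because approach sean-k-mooney links above. The test class and method below are illustrative stand-ins, not the real watcher-tempest-plugin code; only the decorator usage and the bug number (from the Launchpad link earlier) are the point.

    import testtools

    from tempest.lib import decorators


    class WorkloadBalanceTests(testtools.TestCase):

        @decorators.skip_because(bug='2116875')
        def test_execute_workload_balance_strategy_cpu(self):
            # Skipped everywhere the plugin runs until bug 2116875 is fixed,
            # instead of maintaining a per-job exclude regex.
            self.fail('would exercise the cpu workload balance strategy here')

Because the skip carries the bug reference, it shows up in the test output and can be found again when the fix lands, which is the "slightly less work" point made above.
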
12:12:42 <morenod> related but not related to this issue: we disabled the node_exporter in the watcher-operator, but not on devstack-based jobs. I have created this review for that: https://review.opendev.org/c/openstack/watcher/+/955281
12:12:47 <sean-k-mooney> upstream we should always have at least 8GB of RAM, but we can have 4 or 8 CPUs depending on performance
12:12:48 <dviroel> yes, we can run these tests anywhere, so it should be adjusted to node specs
12:13:21 <morenod> we will have dynamic flavors to fix RAM and a dynamic threshold to fix CPU
12:13:56 <sean-k-mooney> that's an approach, and one that compute has used with some success in whitebox, but it's not always easy to do
12:14:06 <sean-k-mooney> but ok, let's see what that looks like
12:15:27 <rlandy> anything more on this topic?
12:16:30 <sean-k-mooney> crickets generally means we can move on :)
12:16:32 <rlandy> thank you for the input - will alert chandankumar to review the conversation
12:16:46 <rlandy> #topic: (dviroel) Eventlet Removal
12:16:54 <rlandy> dviroel, do you want to take this one?
12:17:00 <dviroel> yes
12:17:04 <dviroel> #link https://etherpad.opendev.org/p/watcher-eventlet-removal
12:17:18 <dviroel> the etherpad has links to the changes ready for review
12:17:25 <dviroel> i also added them to the meeting etherpad
12:17:39 <dviroel> tl;dr: the decision engine changes are ready for review
12:18:10 <dviroel> there are other discussions that are not code related, like:
12:18:30 <dviroel> should we keep a prometheus-threading job as voting?
12:18:59 <dviroel> which we can discuss in the change itself
12:19:23 <sean-k-mooney> hum
12:19:59 <sean-k-mooney> so I think we want to run with both versions, and perhaps start with it as non-voting for now
12:20:33 <dviroel> along the same line, I added a new tox py3 job to run a subset of tests with eventlet patching disabled
12:20:34 <sean-k-mooney> but if we are going to officially support both models in 2025.2 then we should make it voting before m3
12:20:58 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/955097
12:21:15 <sean-k-mooney> what I would suggest is: let's start with it as non-voting and look to make the threading jobs voting around the start of August
12:21:29 <dviroel> sean-k-mooney: right, I can add that as a task for m3, to move it to voting
12:21:44 <dviroel> and we can look at the job's history
12:21:45 <sean-k-mooney> for the unit test job, if it's passing I would be more aggressive and make it voting right away
12:22:41 <dviroel> ack, it is passing now, but skipping the 'applier' ones, which will be part of the next effort, to add support to the applier too
12:22:58 <sean-k-mooney> ya, that's what we are doing in nova as well
12:23:08 <sean-k-mooney> we have 75% of the unit tests passing, maybe higher
12:23:22 <sean-k-mooney> so we are using an exclude list to skip the failing ones and burning that down
12:23:23 <dviroel> nice
12:24:04 <sean-k-mooney> on https://review.opendev.org/c/openstack/watcher/+/952499/4
12:24:14 <sean-k-mooney> 1. you wrote it, so it has an implicit +2
12:24:24 <sean-k-mooney> but I have also left it open now for about a week
12:24:42 <sean-k-mooney> so I was planning to +W it after the meeting if there were no other objections
12:24:55 <dviroel> ++
12:25:35 <dviroel> i see no objections :)
12:25:36 <sean-k-mooney> by the way, the watcher-prometheus-integration-threading job failed on the unit test patch, which is partly why I want to keep it non-voting for a week or two, to make sure that's not a regular thing
12:25:39 <dviroel> tks sean-k-mooney
12:26:15 <sean-k-mooney> oh, it was just test_execute_workload_balance_strategy_cpu
12:26:18 <dviroel> sean-k-mooney: but failing
12:26:20 <dviroel> yeah
12:26:25 <sean-k-mooney> that's the instability we discussed above
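
That failure is the test morenod's "dynamic threshold" plan from earlier (12:13) is meant to stabilise. Purely as an illustration of the idea, and not the actual fix: derive the cpu_util threshold from the node's vCPU count instead of hard-coding a value tuned for one flavour. The helper name and the "two busy vCPUs" rule are assumptions made up for this sketch.

    # Illustrative only - not morenod's fix. Pick the workload_balance cpu_util
    # threshold from the compute node's vCPU count so the same test behaves
    # consistently on 4-vCPU and 8-vCPU nodes.
    def cpu_threshold_for(vcpus, busy_vcpus=2.0):
        """Return a cpu_util threshold (percent) crossed once roughly
        `busy_vcpus` worth of CPU is consumed, whatever the node size."""
        return round(100.0 * busy_vcpus / vcpus, 1)


    assert cpu_threshold_for(8) == 25.0   # 8-vCPU node
    assert cpu_threshold_for(4) == 50.0   # 4-vCPU node
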
12:26:25 <dviroel> i was about to say that
12:26:45 <sean-k-mooney> ok, well, that's a good sign
12:27:16 <dviroel> and the same issue can block the decision engine patch from merging too, just fyi
12:27:26 <dviroel> or trigger some rechecks
12:27:38 <dviroel> so maybe we could wait for the skip if needed
12:27:42 <dviroel> let's see
12:28:00 <sean-k-mooney> ack, I may not have time to complete my review of the 2 later patches today, but we can try to get those merged sometime next week I think
12:28:06 <dviroel> ack
12:28:11 <dviroel> there is one more:
12:28:13 <dviroel> #link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/954264
12:28:25 <dviroel> it adds a new scenario test, with a continuous audit
12:28:41 <sean-k-mooney> ya, that's not really eventlet removal as such
12:28:45 <dviroel> there is a specific scenario that I wanted to test, which needs 2 audits to be created
12:28:46 <sean-k-mooney> just missing test coverage
12:29:02 <dviroel> ack
12:29:27 <dviroel> it is a scenario that fails when we move to threading mode
12:29:57 <sean-k-mooney> i see, do you know why?
12:30:14 <sean-k-mooney> did you update the default executor for apscheduler
12:30:31 <sean-k-mooney> to not use green pools in your threading patch
12:31:00 <dviroel> today the continuous audit is started in the audit endpoint constructor, before the main decision engine service forks
12:31:18 <dviroel> so this thread was running in a different process
12:31:28 <dviroel> and getting an outdated model
12:31:54 <sean-k-mooney> is that addressed by https://review.opendev.org/c/openstack/watcher/+/952499/4
12:32:03 <sean-k-mooney> it should be, right?
12:32:08 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/952257
12:32:13 <dviroel> is the one that addresses that
12:32:25 <sean-k-mooney> oh ok
12:32:35 <sean-k-mooney> so when that merges, the new scenario test should pass
12:32:36 <dviroel> here https://review.opendev.org/c/openstack/watcher/+/952257/9/watcher/decision_engine/service.py
12:32:43 <sean-k-mooney> can you add a Depends-On to the tempest change to show that
12:32:55 <dviroel> there is already
12:33:12 <dviroel> there is also one DNM patch that shows the failure too
12:33:21 <sean-k-mooney> not that I can see: https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/954264
12:33:34 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/954364
12:33:38 <sean-k-mooney> oh, you have the Depends-On in the wrong direction
12:33:42 <dviroel> reproduces the issue
12:34:07 <sean-k-mooney> it needs to be from watcher-tempest-plugin -> watcher in this case
12:34:28 <sean-k-mooney> well
12:34:37 <dviroel> sean-k-mooney: yes and no, because the tempest change is passing too, in other jobs
12:34:45 <sean-k-mooney> i guess we could merge the tempest test first, assuming it passes in eventlet mode
12:35:08 <dviroel> correct, there are other jobs that will run that test too
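
For context on the Depends-On direction point above: Zuul's Depends-On footer goes in the commit message of the change that needs the other change applied, so here it would live in the watcher-tempest-plugin commit (954264) and reference the watcher fix, roughly:

    Depends-On: https://review.opendev.org/c/openstack/watcher/+/952257

With that footer, Zuul builds the tempest-plugin change together with the unmerged watcher patch, which is what would demonstrate the new scenario passing in threading mode.
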
12:35:26 <sean-k-mooney> ok, I assume the last two failures of the prometheus job are also the real-data tests?
12:35:35 <dviroel> #link https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1f5/openstack/1f55b6937c9b47a9afb510b960ef12ea/testr_results.html
12:35:55 <dviroel> passing on watcher-tempest-strategies with eventlet
12:36:11 <sean-k-mooney> actually, since we are talking about jobs: the tempest repo does have watcher-tempest-functional-2024-1
12:36:13 <dviroel> chandan added a comment about the failures, yes
12:36:39 <sean-k-mooney> i.e. jobs for the stable branches, but we should add a version of watcher-prometheus-integration for 2025.1
12:37:07 <dviroel> ++
12:37:14 <sean-k-mooney> to make sure we do not break epoxy with prometheus as we extend the test suite
12:38:12 <sean-k-mooney> ok, we can do that separately; I think we can move on. I'll take a look at that test today/tomorrow and we can likely proceed with it
12:38:20 <dviroel> sure
12:38:42 <dviroel> I think that I covered everything, any other questions?
12:38:58 <sean-k-mooney> just a meta one
12:39:12 <sean-k-mooney> it looks like the decision engine will be done this cycle
12:39:25 <sean-k-mooney> how do you feel about the applier
12:39:42 <sean-k-mooney> also, what is the status of the API? are we eventlet-free there?
12:40:06 <dviroel> ack, I will still look at how the decision engine performs, in terms of resource usage and the default number of workers, but it is almost done with these changes
12:40:41 <sean-k-mooney> well, for this cycle it won't be the default, so we can tweak those as we gain experience with it
12:40:46 <dviroel> ++
12:40:50 <sean-k-mooney> I would start small, i.e. 4 workers max
12:41:17 <sean-k-mooney> *threads in the pools, not workers
12:41:33 <dviroel> ack, so one more change for dec-eng is expected for this
12:41:55 <dviroel> yes, in the code it is called workers, but yes, the number of threads in the pool
12:42:21 <sean-k-mooney> well, we have 2 concepts
12:42:29 <dviroel> sean-k-mooney: I plan to work on the applier within this cycle, but I'm not sure if we are going to have it working by the end of the cycle
12:42:33 <sean-k-mooney> workers in oslo normally means the number of processes
12:43:29 <sean-k-mooney> oh I see... CONF.watcher_decision_engine.max_general_workers
12:43:47 <sean-k-mooney> so watcher is using workers for eventlet already
12:43:52 <sean-k-mooney> ok
12:44:07 <sean-k-mooney> so in nova we are intentionally adding new config options
12:44:26 <sean-k-mooney> because the defaults likely won't be the same, but I'll look at what watcher has today and comment in the review
12:44:48 <dviroel> ack, the background scheduler is one that has no config, for instance
12:45:02 <sean-k-mooney> ok, it's 4 already https://docs.openstack.org/watcher/latest/configuration/watcher.html#watcher_decision_engine.max_general_workers
12:45:36 <dviroel> and this one ^ is for the decision engine threadpool
12:45:38 <sean-k-mooney> ya, I think it's fine; normally the eventlet pool size in most servers is set to around 10000
12:45:43 <sean-k-mooney> which would obviously be a problem
12:45:55 <sean-k-mooney> but 4 is fine
12:46:10 <dviroel> the decision engine threadpool today covers the model synchronize threads
12:46:20 <sean-k-mooney> ack
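
For readers not familiar with the workers-vs-threads distinction above: in oslo.config terms an option like max_general_workers is just an integer option in the watcher_decision_engine group, and here it sizes a thread pool rather than a set of worker processes. A hand-written illustration of how such an option is declared follows; only the option name, group name and default of 4 come from the discussion and docs link above, everything else (help text, registration placement) is a sketch, not watcher's actual code.

    from oslo_config import cfg

    # Illustrative declaration only - not copied from watcher.
    decision_engine_opts = [
        cfg.IntOpt('max_general_workers',
                   default=4,
                   help='Size of the decision engine general-purpose thread '
                        'pool (threads, not OS worker processes).'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(decision_engine_opts, group='watcher_decision_engine')

    # Accessed the way sean-k-mooney quotes it:
    # CONF.watcher_decision_engine.max_general_workers
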
12:46:24 <dviroel> ok, I think that we can move on
12:46:33 <dviroel> and continue in gerrit
12:46:42 <rlandy> thanks dviroel
12:46:44 <dviroel> tks sean-k-mooney
12:47:05 <rlandy> there were no other reviews added on the list
12:47:33 <rlandy> anyone want to raise any other patches needing review now?
12:48:19 <rlandy> k - moving on ...
12:48:19 <sean-k-mooney> i have a topic for the end of the meeting, but it's not strictly related to a patch
12:48:23 <rlandy> oops
12:48:29 <sean-k-mooney> we can move on
12:48:38 <rlandy> ok - well - bug triage and then all yours
12:48:46 <rlandy> #topic: Bug Triage
12:49:01 <rlandy> Looking at the status of the watcher-related bugs:
12:49:31 <rlandy> #link: https://bugs.launchpad.net/watcher/+bugs
12:49:36 <rlandy> has 33 bugs listed
12:49:43 <rlandy> 7 of which are in progress
12:50:08 <rlandy> and 2 incomplete ...
12:50:11 <rlandy> https://bugs.launchpad.net/watcher/+bugs?orderby=status&start=0
12:50:27 <rlandy> #link https://bugs.launchpad.net/watcher/+bug/1837400
12:50:36 <rlandy> ^^ only that one is marked "new"
12:51:12 <rlandy> dashboard, client and tempest are all under control, with 2 or 3 bugs either in progress or doc related
12:51:28 <sean-k-mooney> the bug seems valid if it still happens
12:51:39 <sean-k-mooney> however, I agree that it's low priority
12:51:50 <sean-k-mooney> we marked it as needs-re-triage
12:51:55 <rlandy> so raising only this one today:
12:51:57 <sean-k-mooney> because I think we wanted to see if this was fixed
12:52:24 <rlandy> https://bugs.launchpad.net/watcher/+bug/1877956 (bug about cancelling action plans)
12:53:02 <rlandy> as work was done to fix cancelling action plans, and Greg and I tested it yesterday (admittedly from the UI) and that is now working
12:53:43 <dviroel> we found evidence in the code, but I didn't try to reproduce it
12:53:43 <sean-k-mooney> so this was just a looing bug I think
12:53:53 <sean-k-mooney> when I looked at it before, I think this was still a problem
12:53:55 <dviroel> should be a real one
12:54:05 <dviroel> and also easy to fix
12:54:09 <sean-k-mooney> yep
12:54:25 <sean-k-mooney> do we have a tempest test for cancellation yet
12:54:30 <sean-k-mooney> I don't think so, right
12:55:06 <sean-k-mooney> I think we can do this by using the sleep action and maybe the actuator strategy
12:55:06 <rlandy> not as far as I know
12:55:29 <dviroel> yeah, a good opportunity to add one too
12:55:44 <sean-k-mooney> I think we should keep this open and just fix the issue when we have time
12:55:50 <sean-k-mooney> I'll set it to low?
12:55:57 <dviroel> ++
12:56:04 <sean-k-mooney> cool
12:56:33 <rlandy> ok - that's it for triage ...
12:56:40 <rlandy> sean-k-mooney: your topic?
12:56:45 <sean-k-mooney> ya...
12:56:59 <sean-k-mooney> so, who has heard of the service-types-authority repo?
12:57:43 <amoralej> i haven't
12:57:47 <sean-k-mooney> for wider context https://specs.openstack.org/openstack/service-types-authority/ - it's a thing that was created a very long time ago and is not documented as part of the project creation process
12:57:48 <jgilaber> me neither
12:58:16 <sean-k-mooney> I discovered, or rediscovered, it Tuesday night/yesterday
12:58:41 <sean-k-mooney> Aetos is not listed there, and "prometheus" does not follow the required naming conventions
12:58:56 <sean-k-mooney> so the keystone endpoint they want to use, specifically the service-type
12:59:02 <sean-k-mooney> is not valid
12:59:24 <sean-k-mooney> so they are going to have to create a service type; "tenant-metrics" is my suggestion
12:59:30 <sean-k-mooney> then we need to update the spec
12:59:33 <sean-k-mooney> and use that
13:00:03 <sean-k-mooney> but we need to get the TC to approve that, and we need to tell the telemetry team about this requirement
13:00:32 <sean-k-mooney> I spent a while on the TC channel trying to understand this yesterday
13:01:14 <sean-k-mooney> so ya, we need to let juan and jaromir know
13:01:16 <amoralej> did the telemetry team start using the wrong names somewhere?
13:01:33 <sean-k-mooney> they planned to start using prometheus
13:01:40 <sean-k-mooney> for Aetos
13:01:48 <amoralej> at least no need to revert any code, i hope :)
13:02:00 <sean-k-mooney> not yet
13:02:13 <sean-k-mooney> but watcher will need to know the name to do the check for the endpoint
13:02:21 <sean-k-mooney> and the installer will need to use the correct name too
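
To make that endpoint check concrete: once a service type is registered and in the keystone catalog, a client looks the endpoint up by that name, along these lines. This is only a sketch, not watcher code; "tenant-metrics" is just the name suggested above and still needs the service-types-authority/TC step, and the credentials are placeholders.

    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    # Placeholder credentials for illustration only.
    auth = v3.Password(auth_url='http://controller:5000/v3',
                       username='watcher', password='secret',
                       project_name='service',
                       user_domain_name='Default',
                       project_domain_name='Default')
    sess = session.Session(auth=auth)

    # Catalog lookup by service type; raises EndpointNotFound if Aetos is
    # not registered under that type.
    aetos_url = sess.get_endpoint(service_type='tenant-metrics',
                                  interface='internal')
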
13:02:31 <sean-k-mooney> the other thing I found out
13:02:44 <sean-k-mooney> is that we are using the legacy name for watcher downstream, I think
13:03:12 <sean-k-mooney> https://opendev.org/openstack/service-types-authority/src/branch/master/service-types.yaml#L31-L34
13:03:37 <sean-k-mooney> its official service-type should be resource-optimization, not infra-optim
13:03:45 <dviroel> oh, good to know
13:03:55 <sean-k-mooney> so that's a downstream bug that we should fix in the operator
13:04:12 <sean-k-mooney> both are technically valid, but it would be better to use the non-alias version
13:04:55 <sean-k-mooney> so jaromir, I believe, is on PTO for the next week or two
13:05:29 <sean-k-mooney> so we need to sync with the telemetry folks on whether we or they can update the service-types-authority file with the right content
13:05:54 <sean-k-mooney> anyway, that's all I had on this
13:06:30 <dviroel> tks for finding and pursuing this issue sean-k-mooney
13:06:43 <rlandy> thanks for raising this - a lot of PTOs atm ... mtunge is also out from next week, so maybe we try juan if possible
13:07:27 <rlandy> we are over time so I'll move on to ...
13:07:27 <sean-k-mooney> it was mainly by accident; I skim the TC meeting notes and the repo came up this week
13:07:32 <sean-k-mooney> or last
13:07:49 <sean-k-mooney> ya, we can wrap up and move on
13:08:06 <rlandy> Volunteers to chair next meeting:
13:09:11 <opendevreview> Merged openstack/watcher master: Merge decision engine services into a single one https://review.opendev.org/c/openstack/watcher/+/952499
13:09:17 <dviroel> o/
13:09:23 <dviroel> I can chair
13:09:25 <rlandy> thank you dviroel
13:09:31 <rlandy> much appreciated
13:09:36 <rlandy> k folks ... closing out
13:09:40 <rlandy> thank you for attending
13:09:43 <rlandy> #endmeeting