12:00:37 <rlandy> #startmeeting Watcher IRC Meeting - July 17, 2025
12:00:37 <opendevmeet> Meeting started Thu Jul 17 12:00:37 2025 UTC and is due to finish in 60 minutes.  The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:37 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:37 <opendevmeet> The meeting name has been set to 'watcher_irc_meeting___july_17__2025'
12:01:07 <rlandy> Hi all ... who is around?
12:01:18 <amoralej> o/
12:01:21 <morenod> o/
12:01:47 <rlandy> courtesy ping list: dviroel jgilaber sean-k-mooney
12:01:51 <jgilaber> o/
12:01:53 <dviroel> o/
12:03:13 <rlandy> chandankumar is away atm ... let's start
12:03:42 <rlandy> #topic: (chandan|not around today):  Move workload_balance_strategy_cpu|ram tests to exclude list to unblock upstream watcher reviews
12:03:56 <rlandy> #link:  https://launchpad.net/bugs/2116875
12:04:22 <dviroel> right, it is failing more often now
12:04:52 <dviroel> so we can skip the tests while working on a fix
12:05:20 <dviroel> morenod is already investigating and knows what is happening, I think
12:05:28 <rlandy> any objections to merging this and reverting when the fix is in?
12:06:07 <sean-k-mooney> o/
12:06:20 <amoralej> is it expected that different runs of the same jobs have different node sizes?
12:06:26 <morenod> I've added a comment on the bug, the problem is that compute nodes are sometimes 8vcpus and sometimes 4vcpus... we need to create the test with a dynamic threshold, based on the number of vcpus of the compute nodes
12:06:42 <sean-k-mooney> I'm fine with skipping it for now; the other way to do that is to use the skip_because decorator in the tempest plugin
12:06:49 <sean-k-mooney> that takes a bug ref
12:07:06 <sean-k-mooney> but I'm ok with the regex approach for now
12:07:35 <sean-k-mooney> skip_because is slightly less work
12:07:38 <morenod> I'm also ok with skipping, I will need a few more days to have the fix
12:07:40 <sean-k-mooney> because it will skip everywhere
12:08:35 <sean-k-mooney> if there isn't a patch already I would prefer to use https://github.com/openstack/tempest/blob/master/tempest/lib/decorators.py#L60
12:08:45 <rlandy> morenod: reading your comment, the fix will take some time?
12:09:04 <sean-k-mooney> you just add it like this https://github.com/openstack/tempest/blob/master/tempest/api/image/v2/admin/test_image_task.py#L98
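For reference, a minimal sketch of what the skip_because approach could look like in the watcher tempest plugin (the class and base class here are illustrative; only the decorator, the bug number, and the test name come from the discussion above):

    import testtools
    from tempest.lib import decorators

    class WorkloadBalanceStrategyTest(testtools.TestCase):  # real tests subclass the plugin's own base class

        @decorators.skip_because(bug="2116875")
        def test_execute_workload_balance_strategy_cpu(self):
            # body elided; the decorator skips the test and records the launchpad bug reference
            pass

The skipped tests stay visible in the test listing, and the bug reference makes it easy to find and drop the skip once the real fix lands.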
12:09:43 <dviroel> sean-k-mooney: yeah, it is preferable
12:09:45 <morenod> rlandy, I'm working on it now, maybe sometime between tomorrow and Monday it will be ready
12:10:23 <morenod> I like the skip_because solution, it is very clear
12:11:15 <jgilaber> amoralej, I can't find the nodeset definitions, but it could be possible that different providers have a label with the same name but using different flavours
12:11:18 <rlandy> #action rlandy to contact chandankumar to review above suggestions while morenod finishes real fix
12:11:46 <amoralej> i guess that's what is happening, i thought there was a consensus about nodeset definitions
12:11:54 <sean-k-mooney> morenod: there is also a tempest cli command to list all the decorated tests I believe, so you can keep track of them over time
12:12:04 <amoralej> anyway, good to adjust the threshold to the actual node sizes
12:12:25 <sean-k-mooney> keep in mind the node size can differ upstream vs downstream and even in upstream
12:12:42 <morenod> related but not related to this issue, we disabled the node_exporter in the watcher-operator, but not on devstack based jobs. I have created this review for that https://review.opendev.org/c/openstack/watcher/+/955281
12:12:47 <sean-k-mooney> upstream we should always have at least 8GB of RAM but we can have 4 or 8 cpus depending on performance
12:12:48 <dviroel> yes, we can run these tests anywhere, so it should be adjusted to node specs
12:13:21 <morenod> we will have dynamic flavors to fix RAM and dynamic threshold to fix CPU
12:13:56 <sean-k-mooney> that's an approach, and one that compute has used with some success in whitebox, but it's not always easy to do
12:14:06 <sean-k-mooney> but ok, let's see what that looks like
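As a rough illustration of the dynamic-threshold idea (the function, names, and numbers below are made up for this sketch, not taken from morenod's actual fix):

    def cpu_threshold_percent(host_vcpus, load_vcpus=2, margin=0.8):
        # Derive the strategy threshold from the compute node size so the same
        # synthetic load crosses it on both 4-vCPU and 8-vCPU nodes.
        # margin < 1 keeps the threshold a bit below the expected utilisation.
        return (load_vcpus / host_vcpus) * 100 * margin

    # e.g. cpu_threshold_percent(4) == 40.0, cpu_threshold_percent(8) == 20.0

The test would look up the host vCPU count (for example from the hypervisor API) before building the audit template, instead of hard-coding one threshold.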
12:15:27 <rlandy> anything more on this topic?
12:16:30 <sean-k-mooney> crickets generally means we can move on :)
12:16:32 <rlandy> thank you for the input - will alert chandankumar to review the conversation
12:16:46 <rlandy> #topic: (dviroel) Eventlet Removal
12:16:54 <rlandy> dviroel, do you want to take this one?
12:17:00 <dviroel> yes
12:17:04 <dviroel> #link https://etherpad.opendev.org/p/watcher-eventlet-removal
12:17:18 <dviroel> the etherpad has links to the changes ready for review
12:17:25 <dviroel> i also added to the meeting etherpad
12:17:39 <dviroel> tl;dr; the decision engine changes are ready for review
12:18:10 <dviroel> there are other discussions that are not code related like
12:18:30 <dviroel> should we keep a prometheus-threading job as voting?
12:18:59 <dviroel> which we can discuss in the change itself
12:19:23 <sean-k-mooney> hum
12:19:59 <sean-k-mooney> so I think we want to run with both versions and perhaps start with it as non-voting for now
12:20:33 <dviroel> in the same line, I added a new tox py3 job, to run a subset of tests with eventlet patching disabled
12:20:34 <sean-k-mooney> but if we are going to officially support both models in 2025.2 then we should make it voting before m3
12:20:58 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/955097
12:21:15 <sean-k-mooney> what I would suggest is let's start with it as non-voting and look to make the threading jobs voting around the start of August
12:21:29 <dviroel> sean-k-mooney: right, I can add that as a task for m3, to move to voting
12:21:44 <dviroel> and we can look at the job's history
12:21:45 <sean-k-mooney> for the unit test job, if it's passing I would be more aggressive with that and make it voting right away
12:22:41 <dviroel> ack, it is passing now, but skipping the 'applier' ones, which will be part of the next effort, to add support to the applier too
12:22:58 <sean-k-mooney> ya, that's what we are doing in nova as well
12:23:08 <sean-k-mooney> we have 75% of the unit tests passing, maybe higher
12:23:22 <sean-k-mooney> so we are using an exclude list to skip the failing ones and burning that down
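For context, running unit tests with eventlet patching disabled is usually done by making the monkey patching conditional; a minimal sketch, with a hypothetical environment variable name (the real watcher patch may gate this differently, see the review linked above):

    import os

    # Hypothetical switch for illustration only; check the review for the actual mechanism.
    if os.environ.get("DISABLE_EVENTLET_PATCHING", "0").lower() not in ("1", "true", "yes"):
        import eventlet
        eventlet.monkey_patch()

The new tox job would then export that switch and run the subset of tests, excluding the applier ones for now as discussed above.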
12:23:23 <dviroel> nice
12:24:04 <sean-k-mooney> on https://review.opendev.org/c/openstack/watcher/+/952499/4
12:24:14 <sean-k-mooney> 1. you wrote it, so it has an implicit +2
12:24:24 <sean-k-mooney> but I have also left it open now for about a week
12:24:42 <sean-k-mooney> so I was planning to +w it after the meeting if there were no other objections
12:24:55 <dviroel> ++
12:25:35 <dviroel> i see no objections :)
12:25:36 <sean-k-mooney> by the way, the watcher-prometheus-integration-threading job failed on the unit test patch, which is partly why I want to keep it non-voting for a week or two to make sure that's not a regular thing
12:25:39 <dviroel> tks sean-k-mooney
12:26:15 <sean-k-mooney> oh it was just test_execute_workload_balance_strategy_cpu
12:26:18 <dviroel> sean-k-mooney: but failing
12:26:20 <dviroel> yeah
12:26:25 <sean-k-mooney> that's the instability we discussed above
12:26:25 <dviroel> i was about to say that
12:26:45 <sean-k-mooney> ok, well that's a good sign
12:27:16 <dviroel> and the same issue can block the decision engine patch from merging too, just fyi
12:27:26 <dviroel> or trigger some rechecks
12:27:38 <dviroel> so maybe we could wait for the skip if needed
12:27:42 <dviroel> lets see
12:28:00 <sean-k-mooney> ack, I may not have time to complete my review of the 2 later patches today but we can try to get those merged sometime next week I think
12:28:06 <dviroel> ack
12:28:11 <dviroel> there is one more:
12:28:13 <dviroel> #link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/954264
12:28:25 <dviroel> it adds a new scenario test, with continuous audit
12:28:41 <sean-k-mooney> ya, that's not really eventlet removal as such
12:28:45 <dviroel> there is a specific scenario that I wanted to test, which needs 2 audits to be created
12:28:46 <sean-k-mooney> just missing test coverage
12:29:02 <dviroel> ack
12:29:27 <dviroel> it is a scenario that fails when we move to threading mode
12:29:57 <sean-k-mooney> I see, do you know why?
12:30:14 <sean-k-mooney> did you update the default executor for apscheduler
12:30:31 <sean-k-mooney> to not use green pools in your threading patch
12:31:00 <dviroel> today the continuous audit is started in the Audit Endpoint constructor, before the main decision engine service forks
12:31:18 <dviroel> so this thread was running on a different process
12:31:28 <dviroel> and getting an outdated model
12:31:54 <sean-k-mooney> is that addressed by https://review.opendev.org/c/openstack/watcher/+/952499/4
12:32:03 <sean-k-mooney> it should be, right?
12:32:08 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/952257
12:32:13 <dviroel> is the one that address that
12:32:25 <sean-k-mooney> oh ok
12:32:35 <sean-k-mooney> so when that merges the new scenario test should pass
12:32:36 <dviroel> here https://review.opendev.org/c/openstack/watcher/+/952257/9/watcher/decision_engine/service.py
12:32:43 <sean-k-mooney> can you add a Depends-On to the tempest change to show that
12:32:55 <dviroel> there is already
12:33:12 <dviroel> there is also one DNM patch that shows the failure too
12:33:21 <sean-k-mooney> not that I can see https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/954264
12:33:34 <dviroel> #link https://review.opendev.org/c/openstack/watcher/+/954364
12:33:38 <sean-k-mooney> oh, you have the Depends-On in the wrong direction
12:33:42 <dviroel> reproduces the issue
12:34:07 <sean-k-mooney> it needs to be from watcher-tempest-plugin -> watcher in this case
12:34:28 <sean-k-mooney> well
12:34:37 <dviroel> sean-k-mooney: yes and no, because the tempest change is passing too, in other jobs
12:34:45 <sean-k-mooney> i guess we could merge the tempest test first assuming it passes in eventlet mode
12:35:08 <dviroel> correct, there are other jobs that will run that test too
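To make the direction concrete: if the Depends-On route is used, it goes in the commit message footer of the tempest-plugin change, roughly like this (the subject line is illustrative):

    Add scenario test for continuous audits

    Depends-On: https://review.opendev.org/c/openstack/watcher/+/952257

Zuul then tests the plugin change with the watcher fix checked out and will not merge it before the watcher change merges; the alternative discussed above is to merge the tempest test first, since it already passes in eventlet mode in other jobs.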
12:35:26 <sean-k-mooney> ok, I assume the last two failures of the prometheus job are also the real data tests?
12:35:35 <dviroel> #link https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1f5/openstack/1f55b6937c9b47a9afb510b960ef12ea/testr_results.html
12:35:55 <dviroel> passing on watcher-tempest-strategies  with eventlet
12:36:11 <sean-k-mooney> actually, since we are talking about jobs, the tempest repo does have watcher-tempest-functional-2024-1
12:36:13 <dviroel> chandan added a comment about the failures, yes
12:36:39 <sean-k-mooney> i.e. jobs for the stable branches, but we should add a version of watcher-prometheus-integration for 2025.1
12:37:07 <dviroel> ++
12:37:14 <sean-k-mooney> to make sure we do not break Epoxy with prometheus as we extend the test suite
12:38:12 <sean-k-mooney> ok, we can do that separately, I think we can move on. I'll take a look at that test today/tomorrow and we can likely proceed with it
12:38:20 <dviroel> sure
12:38:42 <dviroel> I think that I covered everything, any other question?
12:38:58 <sean-k-mooney> just a meta one
12:39:12 <sean-k-mooney> it looks like the decision engine will be done this cycle
12:39:25 <sean-k-mooney> how do you feel about the applier
12:39:42 <sean-k-mooney> also, what is the status of the API? are we eventlet free there?
12:40:06 <dviroel> ack, I will still look at how the decision engine performs, in terms of resource usage and the default number of workers, but it is almost done with these changes
12:40:41 <sean-k-mooney> well, for this cycle it won't be the default so we can tweak those as we gain experience with it
12:40:46 <dviroel> ++
12:40:50 <sean-k-mooney> I would start small, i.e. 4 workers max
12:41:17 <sean-k-mooney> *threads in the pools, not workers
12:41:33 <dviroel> ack, so one more change for dec-eng is expected for this
12:41:55 <dviroel> yes, in the code it is called workers, but yes, the number of threads in the pool
12:42:21 <sean-k-mooney> well we have 2 concepts
12:42:29 <dviroel> sean-k-mooney: I plan to work on the applier within this cycle, but I'm not sure if we are going to have it working by the end of the cycle
12:42:33 <sean-k-mooney> workers in oslo means the number of processes normally
12:43:29 <sean-k-mooney> oh i see... CONF.watcher_decision_engine.max_general_workers
12:43:47 <sean-k-mooney> so watcher is using workers for eventlet already
12:43:52 <sean-k-mooney> ok
12:44:07 <sean-k-mooney> so in nova we are intentionally adding new config options
12:44:26 <sean-k-mooney> because the default likely won't be the same, but I'll look at what watcher has today and comment in the review
12:44:48 <dviroel> ack, the background scheduler is one that has no config for instance
12:45:02 <sean-k-mooney> ok its 4 already https://docs.openstack.org/watcher/latest/configuration/watcher.html#watcher_decision_engine.max_general_workers
12:45:36 <dviroel> and this one ^ - is for the decision engine threadpool
12:45:38 <sean-k-mooney> ya, I think it's fine; normally the eventlet pool size in most servers is set to around 10000
12:45:43 <sean-k-mooney> which would obviously be a problem
12:45:55 <sean-k-mooney> but 4 is fine
12:46:10 <dviroel> decision engine threadpool today covers the model synchronize threads
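For reference, the knob being discussed is tuned via watcher.conf (the value shown is the current default from the doc link above):

    [watcher_decision_engine]
    max_general_workers = 4

Raising it later is just a matter of bumping that value once there is data on real-world resource usage in threading mode.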
12:46:20 <sean-k-mooney> ack
12:46:24 <dviroel> ok, I think that we can move on
12:46:33 <dviroel> and continue in gerrit
12:46:42 <rlandy> thanks dviroel
12:46:44 <dviroel> tks sean-k-mooney
12:47:05 <rlandy> there were no other reviews added on list
12:47:33 <rlandy> anyone want to raise any other patches needing review now?
12:48:19 <rlandy> k - moving on ...
12:48:19 <sean-k-mooney> i have a topic for the end of the meeting but its not strictly related to a patch
12:48:23 <rlandy> oops
12:48:29 <sean-k-mooney> we can move on
12:48:38 <rlandy> ok - well - bug triage and then all yours
12:48:46 <rlandy> #topic: Bug Triage
12:49:01 <rlandy> Looking at the status of the watcher related bugs:
12:49:31 <rlandy> #link: https://bugs.launchpad.net/watcher/+bugs
12:49:36 <rlandy> has 33 bugs listed
12:49:43 <rlandy> 7  of which are in progress
12:50:08 <rlandy> and 2 incomplete ...
12:50:11 <rlandy> https://bugs.launchpad.net/watcher/+bugs?orderby=status&start=0
12:50:27 <rlandy> #link https://bugs.launchpad.net/watcher/+bug/1837400
12:50:36 <rlandy> ^^ only that one is marked "new"
12:51:12 <rlandy> dashboard, client and tempest are all under control with 2 or 3 bugs either in progress or doc related
12:51:28 <sean-k-mooney> the bug seems valid if it still happens
12:51:39 <sean-k-mooney> however I agree that it's low priority
12:51:50 <sean-k-mooney> we marked it as need-re-triage
12:51:55 <rlandy> so raising only this one today:
12:51:57 <sean-k-mooney> because I think we wanted to see if this was fixed
12:52:24 <rlandy> https://bugs.launchpad.net/watcher/+bug/1877956 (bug about canceling action plans)
12:53:02 <rlandy> as work was done to fix canceling action plans, and Greg and I tested it yesterday (admittedly from the UI) and that is now working
12:53:43 <dviroel> we found evidence in the code, but I didn't try to reproduce it
12:53:43 <sean-k-mooney> so this was just a lingering bug I think
12:53:53 <sean-k-mooney> when I looked at it before I think this was still a problem
12:53:55 <dviroel> should be a real one
12:54:05 <dviroel> and also easy to fix
12:54:09 <sean-k-mooney> yep
12:54:25 <sean-k-mooney> do we have a tempest test for cancellation yet
12:54:30 <sean-k-mooney> I don't think so, right?
12:55:06 <sean-k-mooney> I think we can do this by using the sleep action and maybe the actuator strategy
12:55:06 <rlandy> not as far as I know
12:55:29 <dviroel> yeah, a good opportunity to add one too
12:55:44 <sean-k-mooney> i think we should keep this open and just fix the issue when we have time
12:55:50 <sean-k-mooney> I'll set it to low?
12:55:57 <dviroel> ++
12:56:04 <sean-k-mooney> cool
12:56:33 <rlandy> ok - that's it for triage ...
12:56:40 <rlandy> sean-k-mooney: your topic?
12:56:45 <sean-k-mooney> ya...
12:56:59 <sean-k-mooney> so, who has heard of the service-types-authority repo?
12:57:43 <amoralej> i haven't
12:57:47 <sean-k-mooney> for wider context https://specs.openstack.org/openstack/service-types-authority/ it's a thing that was created a very long time ago and is not documented as part of the project creation process
12:57:48 <jgilaber> me neither
12:58:16 <sean-k-mooney> I discovered, or rediscovered, it Tuesday night/yesterday
12:58:41 <sean-k-mooney> Aetos is not listed there and "prometheus" does not follow the required naming conventions
12:58:56 <sean-k-mooney> so the keystone endpoint they want to use, specifically the service-type
12:59:02 <sean-k-mooney> is not valid
12:59:24 <sean-k-mooney> so they are going to have to create a service type; "tenant-metrics" is my suggestion
12:59:30 <sean-k-mooney> then we need to update the spec
12:59:33 <sean-k-mooney> and use that
13:00:03 <sean-k-mooney> but we need to get the TC to approve that and we need to tell the telemetry team about this requirement
13:01:14 <sean-k-mooney> I spent a while on the TC channel trying to understand this yesterday
13:01:14 <sean-k-mooney> so ya we need to let juan and jaromir know
13:01:16 <amoralej> did the telemetry team start using the wrong names somewhere?
13:01:33 <sean-k-mooney> they planned to start using prometheus
13:01:40 <sean-k-mooney> for Aetos
13:01:48 <amoralej> at least no need to revert any code, i hope :)
13:02:00 <sean-k-mooney> not yet
13:02:13 <sean-k-mooney> but watcher will need to know the name to do the check for the endpoint
13:02:21 <sean-k-mooney> and the installer will need to use the correct name too
13:02:31 <sean-k-mooney> the other thing I found out
13:02:44 <sean-k-mooney> is that we are using the legacy name for watcher downstream, I think
13:03:12 <sean-k-mooney> https://opendev.org/openstack/service-types-authority/src/branch/master/service-types.yaml#L31-L34
13:03:37 <sean-k-mooney> its official service-type should be resource-optimization, not infra-optim
13:03:45 <dviroel> oh, good to know
13:03:55 <sean-k-mooney> so that's a downstream bug that we should fix in the operator
13:04:12 <sean-k-mooney> both are technically valid but it would be better to use the non-alias version
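For illustration, registering the keystone service with the official (non-alias) type would look roughly like this with openstackclient; the service name and description here are the conventional ones, not prescribed by the repo:

    openstack service create --name watcher \
        --description "Infrastructure Optimization" resource-optimization

Existing deployments registered under the infra-optim alias keep working, which is why this is a cleanup rather than an urgent fix.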
13:04:55 <sean-k-mooney> so jaromir, I believe, is on PTO for the next week or two
13:05:29 <sean-k-mooney> so we need to sync with the telemetry folks on whether we or they can update the service-types-authority file with the right content
13:05:54 <sean-k-mooney> anyway, that's all I had on this
13:06:30 <dviroel> tks for finding and pursuing this issue sean-k-mooney
13:06:43 <rlandy> thanks for raising this - a lot of PTOs atm ... mtunge is also out from next week so maybe we try juan if possible
13:07:27 <rlandy> we are over time so I'll move on to ...
13:07:27 <sean-k-mooney> it was mainly by accident, I skim the TC meeting notes and the repo came up this week
13:07:32 <sean-k-mooney> or last
13:07:49 <sean-k-mooney> ya we can wrap up and move on
13:08:06 <rlandy> Volunteers to chair next meeting:
13:09:11 <opendevreview> Merged openstack/watcher master: Merge decision engine services into a single one  https://review.opendev.org/c/openstack/watcher/+/952499
13:09:17 <dviroel> o/
13:09:23 <dviroel> I can chair
13:09:25 <rlandy> thank you dviroel
13:09:31 <rlandy> much appreciated
13:09:36 <rlandy> k folks ... closing out
13:09:40 <rlandy> thank you for attending
13:09:43 <rlandy> #endmeeting