12:00:37 #startmeeting Watcher IRC Meeting - July 17, 2025
12:00:37 Meeting started Thu Jul 17 12:00:37 2025 UTC and is due to finish in 60 minutes. The chair is rlandy. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:00:37 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
12:00:37 The meeting name has been set to 'watcher_irc_meeting___july_17__2025'
12:01:07 Hi all ... who is around?
12:01:18 o/
12:01:21 o/
12:01:47 courtesy ping list: dviroel jgilaber sean-k-mooney
12:01:51 o/
12:01:53 o/
12:03:13 chandankumar is away atm ... let's start
12:03:42 #topic: (chandan|not around today): Move workload_balance_strategy_cpu|ram tests to exclude list to unblock upstream watcher reviews
12:03:56 #link: https://launchpad.net/bugs/2116875
12:04:22 right, it is failing more often now
12:04:52 so we can skip the tests while working on a fix
12:05:20 morenod is already investigating and knows what is happening, I think
12:05:28 any objections to merging this and reverting when the fix is in?
12:06:07 o/
12:06:20 is it expected that different runs of the same jobs have different node sizes?
12:06:26 I've added a comment on the bug, the problem is that compute nodes are sometimes 8 vcpus and sometimes 4 vcpus... we need to create the test with a dynamic threshold, based on the number of vcpus of the compute nodes
12:06:42 I'm fine with skipping it for now; the other way to do that is to use the skip_because decorator in the tempest plugin
12:06:49 that takes a bug ref
12:07:06 but I'm ok with the regex approach for now
12:07:35 skip_because is slightly less work
12:07:38 I'm also ok with skipping, I will need a few more days to have the fix
12:07:40 because it will skip everywhere
12:08:35 if there isn't a patch already I would prefer to use https://github.com/openstack/tempest/blob/master/tempest/lib/decorators.py#L60
12:08:45 morenod: reading your comment, the fix will take some time?
12:09:04 you just add it like this https://github.com/openstack/tempest/blob/master/tempest/api/image/v2/admin/test_image_task.py#L98
12:09:43 sean-k-mooney: yeah, it is preferable
12:09:45 rlandy, I'm working on it now, maybe sometime between tomorrow and Monday it will be ready
12:10:23 I like the skip_because solution, it is very clear
12:11:15 amoralej, I can't find the nodeset definitions, but it could be possible that different providers have a label with the same name but using different flavours
12:11:18 #action rlandy to contact chandankumar to review above suggestions while morenod finishes real fix
12:11:46 I guess that's what is happening, I thought there was a consensus about nodeset definitions
12:11:54 morenod: there is also a tempest cli command to list all the decorated tests, I believe, so you can keep track of them over time
12:12:04 anyway, good to adjust the threshold to the actual node sizes
12:12:25 keep in mind the node size can differ upstream vs downstream and even in upstream
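For reference, the skip_because approach suggested above could look roughly like the sketch below in the watcher-tempest-plugin; only the decorator usage, base class and bug number come from the discussion, while the class name is a stand-in for illustration.

    # Minimal sketch, not the actual watcher-tempest-plugin code; the class is a
    # stand-in, only the decorator and bug reference are taken from the discussion.
    from tempest.lib import decorators
    from tempest import test


    class WorkloadBalanceSketch(test.BaseTestCase):

        @decorators.skip_because(bug="2116875")
        def test_execute_workload_balance_strategy_cpu(self):
            # body elided; the decorator skips the test and records the bug link
            pass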
12:12:42 related but not related to this issue, we disabled the node_exporter in the watcher-operator, but not on devstack based jobs. I have created this review for that https://review.opendev.org/c/openstack/watcher/+/955281
12:12:47 upstream we always should have at least 8GB of RAM but we can have 4 or 8 cpus depending on performance
12:12:48 yes, we can run these tests anywhere, so it should be adjusted to node specs
12:13:21 we will have dynamic flavors to fix RAM and dynamic thresholds to fix CPU
12:13:56 that's an approach, and one that compute has used with some success in whitebox, but it's not always easy to do
12:14:06 but ok, let's see what that looks like
12:15:27 anything more on this topic?
12:16:30 crickets generally means we can move on :)
12:16:32 thank you for the input - will alert chandankumar to review the conversation
12:16:46 #topic: (dviroel) Eventlet Removal
12:16:54 dviroel, do you want to take this one?
12:17:00 yes
12:17:04 #link https://etherpad.opendev.org/p/watcher-eventlet-removal
12:17:18 the etherpad has links to the changes ready for review
12:17:25 I also added it to the meeting etherpad
12:17:39 tl;dr: the decision engine changes are ready for review
12:18:10 there are other discussions that are not code related, like
12:18:30 should we keep a prometheus-threading job as voting?
12:18:59 which we can discuss in the change itself
12:19:23 hum
12:19:59 so I think we want to run with both versions and perhaps start with it as non-voting for now
12:20:33 along the same line, I added a new tox py3 job, to run a subset of tests with eventlet patching disabled
12:20:34 but if we are going to officially support both models in 2025.2 then we should make it voting before m3
12:20:58 #link https://review.opendev.org/c/openstack/watcher/+/955097
12:21:15 what I would suggest is let's start with it as non-voting and look to make the threading jobs voting around the start of August
12:21:29 sean-k-mooney: right, I can add that as a task for m3, to move it to voting
12:21:44 and we can look at the job's history
12:21:45 for the unit test job, if it's passing I would be more aggressive with that and make it voting right away
12:22:41 ack, it is passing now, but skipping the 'applier' ones, which will be part of the next effort, to add support to the applier too
12:22:58 ya, that's what we are doing in nova as well
12:23:08 we have 75% of the unit tests passing, maybe higher
12:23:22 so we are using an exclude list to skip the failing ones and burning that down
12:23:23 nice
12:24:04 on https://review.opendev.org/c/openstack/watcher/+/952499/4
12:24:14 1, you wrote it so it has an implicit +2
12:24:24 but I have also left it open now for about a week
12:24:42 so I was planning to +w it after the meeting if there were no other objections
12:24:55 ++
12:25:35 I see no objections :)
12:25:36 by the way, the watcher-prometheus-integration-threading job failed on the unit test patch, which is partly why I want to keep it non-voting for a week or two to make sure that's not a regular thing
12:25:39 tks sean-k-mooney
12:26:15 oh, it was just test_execute_workload_balance_strategy_cpu
12:26:18 sean-k-mooney: but failing
12:26:20 yeah
12:26:25 that's the instability we discussed above
12:26:25 I was about to say that
12:26:45 ok, well, that's a good sign
12:27:16 and the same issue can block the decision engine patch from merging too, just fyi
12:27:26 or trigger some rechecks
12:27:38 so maybe we could wait for the skip if needed
12:27:42 let's see
12:28:00 ack, I may not have time to complete my review of the 2 later patches today but we can try to get those merged sometime next week I think
12:28:06 ack
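On the tox job mentioned above that runs a subset of unit tests with eventlet patching disabled: the usual pattern is to guard the monkey-patching behind a switch, roughly as in the sketch below; the environment variable name is hypothetical and not necessarily what the watcher change uses.

    # Minimal sketch of conditional monkey-patching; the env var name is hypothetical.
    import os

    if os.environ.get("OS_WATCHER_DISABLE_EVENTLET_PATCHING", "0").lower() not in ("1", "true", "yes"):
        # default behaviour: patch as before
        import eventlet
        eventlet.monkey_patch()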
12:28:11 there is one more:
12:28:13 #link https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/954264
12:28:25 it adds a new scenario test, with a continuous audit
12:28:41 ya, that's not really eventlet removal as such
12:28:45 there is a specific scenario that I wanted to test, which needs 2 audits to be created
12:28:46 just missing test coverage
12:29:02 ack
12:29:27 it is a scenario that fails when we move to threading mode
12:29:57 I see, do you know why?
12:30:14 did you update the default executor for apscheduler
12:30:31 to not use green pools in your threading patch
12:31:00 today a continuous audit is started in the Audit Endpoint constructor, before the main decision engine service fork
12:31:18 so this thread was running on a different process
12:31:28 and getting an outdated model
12:31:54 is that addressed by https://review.opendev.org/c/openstack/watcher/+/952499/4
12:32:03 it should be, right?
12:32:08 #link https://review.opendev.org/c/openstack/watcher/+/952257
12:32:13 is the one that addresses that
12:32:25 oh ok
12:32:35 so when that merges the new scenario test should pass
12:32:36 here https://review.opendev.org/c/openstack/watcher/+/952257/9/watcher/decision_engine/service.py
12:32:43 can you add a depends-on to the tempest change to show that
12:32:55 there is already
12:33:12 there is also one DNM patch that shows the failure too
12:33:21 not that I can see https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/954264
12:33:34 #link https://review.opendev.org/c/openstack/watcher/+/954364
12:33:38 oh, you have the depends-on in the wrong direction
12:33:42 reproduces the issue
12:34:07 it needs to be from watcher-tempest-plugin -> watcher in this case
12:34:28 well
12:34:37 sean-k-mooney: yes and no, because the tempest change is passing too, in other jobs
12:34:45 I guess we could merge the tempest test first, assuming it passes in eventlet mode
12:35:08 correct, there are other jobs that will run that test too
12:35:26 ok, I assume the last two failures of the prometheus job are also the real data tests?
12:35:35 #link https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1f5/openstack/1f55b6937c9b47a9afb510b960ef12ea/testr_results.html
12:35:55 passing on watcher-tempest-strategies with eventlet
12:36:11 actually, since we are talking about jobs, the tempest repo does have watcher-tempest-functional-2024-1
12:36:13 chandan added a comment about the failures, yes
12:36:39 i.e. jobs for the stable branches, but we should add a version of watcher-prometheus-integration for 2025.1
12:37:07 ++
12:37:14 to make sure we do not break Epoxy with prometheus as we extend the test suite
12:38:12 ok, we can do that separately, I think we can move on. I'll take a look at that test today/tomorrow and we can likely proceed with it
12:38:20 sure
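On the depends-on direction discussed above: if the new scenario test were to gate on the watcher fix, the footer would go in the commit message of the watcher-tempest-plugin change, roughly:

    Depends-On: https://review.opendev.org/c/openstack/watcher/+/952257

As noted in the conversation, the test also passes in the existing eventlet-based jobs, so merging it first is equally viable.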
12:38:42 I think that I covered everything, any other questions?
12:38:58 just a meta one
12:39:12 it looks like the decision engine will be done this cycle
12:39:25 how do you feel about the applier
12:39:42 also, what is the status of the api? are we eventlet free there?
12:40:06 ack, I will still look at how the decision engine performs, regarding resource usage and the default number of workers, but it is almost done with these changes
12:40:41 well, for this cycle it won't be the default, so we can tweak those as we gain experience with it
12:40:46 ++
12:40:50 I would start small, i.e. 4 workers max
12:41:17 *threads in the pools, not workers
12:41:33 ack, so one more change for dec-eng is expected for this
12:41:55 yes, in the code it is called workers, but yes, number of threads in the pool
12:42:21 well, we have 2 concepts
12:42:29 sean-k-mooney: I plan to work on the applier within this cycle, but I'm not sure if we are going to have it working by the end of the cycle
12:42:33 workers in oslo normally means the number of processes
12:43:29 oh I see... CONF.watcher_decision_engine.max_general_workers
12:43:47 so watcher is using workers for eventlet already
12:43:52 ok
12:44:07 so in nova we are intentionally adding new config options
12:44:26 because the defaults likely won't be the same, but I'll look at what watcher has today and comment in the review
12:44:48 ack, the background scheduler is one that has no config, for instance
12:45:02 ok, it's 4 already https://docs.openstack.org/watcher/latest/configuration/watcher.html#watcher_decision_engine.max_general_workers
12:45:36 and this one ^ - is for the decision engine threadpool
12:45:38 ya, I think that's fine; normally the eventlet pool size in most servers is set to around 10000
12:45:43 which would obviously be a problem
12:45:55 but 4 is fine
12:46:10 the decision engine threadpool today covers the model synchronize threads
12:46:20 ack
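For reference, the option discussed above as it would appear in watcher.conf; 4 is the documented default per the link above, and it counts threads in the decision engine pool, not oslo-style worker processes. A minimal sketch:

    [watcher_decision_engine]
    # general thread pool size for the decision engine (threads, not processes)
    max_general_workers = 4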
12:46:24 ok, I think that we can move on
12:46:33 and continue in gerrit
12:46:42 thanks dviroel
12:46:44 tks sean-k-mooney
12:47:05 there were no other reviews added on the list
12:47:33 anyone want to raise any other patches needing review now?
12:48:19 k - moving on ...
12:48:19 I have a topic for the end of the meeting but it's not strictly related to a patch
12:48:23 oops
12:48:29 we can move on
12:48:38 ok - well - bug triage and then all yours
12:48:46 #topic: Bug Triage
12:49:01 Looking at the status of the watcher related bugs:
12:49:31 #link: https://bugs.launchpad.net/watcher/+bugs
12:49:36 has 33 bugs listed
12:49:43 7 of which are in progress
12:50:08 and 2 incomplete ...
12:50:11 https://bugs.launchpad.net/watcher/+bugs?orderby=status&start=0
12:50:27 #link https://bugs.launchpad.net/watcher/+bug/1837400
12:50:36 ^^ only that one is marked "new"
12:51:12 dashboard, client and tempest are all under control, with 2 or 3 bugs either in progress or doc related
12:51:28 the bug seems valid if it still happens
12:51:39 however I agree that it's low priority
12:51:50 we marked it as need-re-triage
12:51:55 so raising only this one today:
12:51:57 because I think we wanted to see if this was fixed
12:52:24 https://bugs.launchpad.net/watcher/+bug/1877956 (bug about canceling action plans)
12:53:02 as work was done to fix canceling action plans, and greg and I tested it yesterday (admittedly from the UI) and that is now working
12:53:43 we found evidence in the code, but I didn't try to reproduce
12:53:43 so this was just a looing bug I think
12:53:53 when I looked at it before I think this is still a problem
12:53:55 should be a real one
12:54:05 and also easy to fix
12:54:09 yep
12:54:25 do we have a tempest test for cancellation yet
12:54:30 I don't think so, right
12:55:06 I think we can do this by using the sleep action and maybe the actuator strategy
12:55:06 not as far as I know
12:55:29 yeah, a good opportunity to add one too
12:55:44 I think we should keep this open and just fix the issue when we have time
12:55:50 I'll set it to low?
12:55:57 ++
12:56:04 cool
12:56:33 ok - that's it for triage ...
12:56:40 sean-k-mooney: your topic?
12:56:45 ya...
12:56:59 so who has heard of the service-types-authority repo?
12:57:43 I haven't
12:57:47 for wider context https://specs.openstack.org/openstack/service-types-authority/ it's a thing that was created a very long time ago and is not documented as part of the project creation process
12:57:48 me neither
12:58:16 I discovered or rediscovered it Tuesday night/yesterday
12:58:41 Aetos is not listed there and "prometheus" does not follow the required naming conventions
12:58:56 so the keystone endpoint they want to use, specifically the service-type
12:59:02 is not valid
12:59:24 so they are going to have to create a service type; "tenant-metrics" is my suggestion
12:59:30 then we need to update the spec
12:59:33 and use that
13:00:03 but we need to get the tc to approve that and we need to tell the telemetry team about this requirement
13:00:32 I spent a while on the tc channel trying to understand this yesterday
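As background on why the exact service-type string matters: clients resolve the endpoint from the keystone catalog by that string. A minimal keystoneauth1 sketch, with placeholder credentials and using the suggested (not yet approved) "tenant-metrics" type:

    # Minimal sketch; credentials are placeholders and "tenant-metrics" is only
    # the suggested service type from the discussion above, not an approved one.
    from keystoneauth1 import loading, session

    auth = loading.get_plugin_loader("password").load_from_options(
        auth_url="http://keystone.example:5000/v3",
        username="admin", password="secret", project_name="admin",
        user_domain_id="default", project_domain_id="default",
    )
    sess = session.Session(auth=auth)
    print(sess.get_endpoint(service_type="tenant-metrics", interface="public"))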
13:01:14 so ya, we need to let juan and jaromir know
13:01:16 did the telemetry team start using the wrong names somewhere?
13:01:33 they planned to start using prometheus
13:01:40 for Aetos
13:01:48 at least no need to revert any code, I hope :)
13:02:00 not yet
13:02:13 but watcher will need to know the name to do the check for the endpoint
13:02:21 and the installer will need to use the correct name too
13:02:31 the other thing I found out
13:02:44 is we are using the legacy name for watcher downstream, I think
13:03:12 https://opendev.org/openstack/service-types-authority/src/branch/master/service-types.yaml#L31-L34
13:03:37 its official service-type should be resource-optimization, not infra-optim
13:03:45 oh, good to know
13:03:55 so that's a downstream bug that we should fix in the operator
13:04:12 both are technically valid but it would be better to use the non-alias version
13:04:55 so jaromir, I believe, is on PTO for the next week or two
13:05:29 so we need to sync with the telemetry folks on whether we or they can update the service-types-authority file with the right content
13:05:54 anyway, that's all I had on this
13:06:30 tks for finding and pursuing this issue sean-k-mooney
13:06:43 thanks for raising this - a lot of PTOs atm ... mtunge is also out from next week so maybe we try juan if possible
13:07:27 we are over time so I'll move on to ...
13:07:27 it was mainly by accident, I skimmed the tc meeting notes and the repo came up this week
13:07:32 or last
13:07:49 ya, we can wrap up and move on
13:08:06 Volunteers to chair next meeting:
13:09:11 Merged openstack/watcher master: Merge decision engine services into a single one https://review.opendev.org/c/openstack/watcher/+/952499
13:09:17 o/
13:09:23 I can chair
13:09:25 thank you dviroel
13:09:31 much appreciated
13:09:36 k folks ... closing out
13:09:40 thank you for attending
13:09:43 #endmeeting