| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Skip migrate actions in pre_condition phase https://review.opendev.org/c/openstack/watcher/+/966699 | 07:29 |
|---|---|---|
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Make VM resize timeout configurable with migration defaults https://review.opendev.org/c/openstack/watcher/+/968610 | 08:45 |
| opendevreview | Joan Gilabert proposed openstack/watcher-specs master: Add specification for migrating Watcher to OpenStackSDK https://review.opendev.org/c/openstack/watcher-specs/+/968023 | 10:18 |
| dviroel | hi all o/, watcher meeting will start in 10m | 11:50 |
| dviroel | #startmeeting watcher | 12:00 |
| opendevmeet | Meeting started Thu Nov 27 12:00:40 2025 UTC and is due to finish in 60 minutes. The chair is dviroel. Information about MeetBot at http://wiki.debian.org/MeetBot. | 12:00 |
| opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 12:00 |
| opendevmeet | The meeting name has been set to 'watcher' | 12:00 |
| dviroel | hi all o/, who is around today? | 12:00 |
| jgilaber | o/ | 12:01 |
| chandankumar | o/ | 12:01 |
| amoralej | o/ | 12:01 |
| morenod | o/ | 12:02 |
| dviroel | courtesy ping list: sean-k-mooney | 12:02 |
| dviroel | ty for joining o/ | 12:02 |
| dviroel | let's start with today's meeting agenda | 12:02 |
| dviroel | #link https://etherpad.opendev.org/p/openstack-watcher-irc-meeting#L26 (Meeting agenda) | 12:03 |
| dviroel | feel free to add your own topics to the agenda | 12:03 |
| dviroel | we already have a few there | 12:03 |
| dviroel | #topic Eventlet removal | 12:04 |
| dviroel | i am missing some updates in the eventlet removal effort since vPTG | 12:04 |
| dviroel | in the vPTG we had some discussions about next steps (applier mainly) | 12:04 |
| dviroel | and we have a patch ready for review in the applier | 12:05 |
| dviroel | #link https://review.opendev.org/c/openstack/watcher/+/966226 (Adds support for threading mode in applier) | 12:05 |
| dviroel | this one was already reviewed by most of you | 12:05 |
| dviroel | the tl;dr; in this one is | 12:06 |
| dviroel | we found many issues with concurrent database access from multiple threads using taskflow parallel engine, and | 12:06 |
| sean-k-mooney | o/ | 12:06 |
| dviroel | and we had to switch to the serial engine when using native threads | 12:06 |
| dviroel | we will have this limitation on how many concurrent actions can be executed for now | 12:07 |
| dviroel | but this limitation will be treated in a different patch and launchpad bug, once we merge this first one | 12:07 |
| sean-k-mooney | actually | 12:08 |
| sean-k-mooney | one thing i want to test is if we can go back to parallel now that i changed how the db fixtures work | 12:08 |
| sean-k-mooney | i don't think we will get lucky enough for that to fix it but it's worth a try | 12:08 |
| dviroel | i think that I tried that locally, and hit the same issue | 12:09 |
| sean-k-mooney | ack | 12:09 |
| dviroel | but I can try again to make sure | 12:09 |
| dviroel | thanks sean-k-mooney, who helped in debugging and proposing a solution for this problem | 12:10 |
| dviroel | we had lots of discussions here in the irc during that time | 12:10 |
| dviroel | continuing with this patch | 12:11 |
| sean-k-mooney | i think the serial approach is fine for now but we will want to address that when looking to remove the ability to run in eventlet mode next cycle | 12:11 |
| sean-k-mooney | unless we make progress on the other refactors for scaling | 12:11 |
| sean-k-mooney | i.e. distributing actions, not action plans, to appliers or the other enhancements that may make this a non-issue | 12:12 |
| dviroel | +1 - it can be treated as scaling debt for native thread mode | 12:12 |
| amoralej | +1 | 12:12 |
| dviroel | note that in 966226 we also change some ci jobs | 12:13 |
| dviroel | openstack-tox-py312-threading will now run all unit tests in threading mode and | 12:13 |
| dviroel | watcher-prometheus-integration-threading will disable eventlet monkey patching for the applier too | 12:13 |
| dviroel | please take a moment to review this one when possible | 12:13 |
| dviroel | next patch in the eventlet series: | 12:13 |
| dviroel | another small change in CI, to disable eventlet patching in the api | 12:14 |
| dviroel | #link https://review.opendev.org/c/openstack/watcher/+/967769 (Disable eventlet patching for api service in threading job) | 12:14 |
| dviroel | so we should end up with the threading job running all services with native threads soon | 12:14 |
| dviroel | It is a small patch to review/approve, PTAL when free | 12:14 |
| dviroel | finally, there is one more | 12:15 |
| dviroel | it is a leftover from the decision-engine work | 12:15 |
| dviroel | it is marked as WIP for now | 12:15 |
| dviroel | it is the cluster data model collector timeout for native thread mode | 12:16 |
| dviroel | today it only works in eventlet mode (and maybe in a contradictory way, using collector's 'period' as timeout) | 12:16 |
| dviroel | i sent a first proposal yesterday, to include a new timeout for some collector operations, and to remove the eventlet timeout from the code: | 12:17 |
| dviroel | #link https://review.opendev.org/c/openstack/watcher/+/968568 (Remove eventlet-based timeout in CDM collectors) | 12:17 |
| amoralej | I think the use of period as timeout was on purpose to make sure two collector syncs don't overlap | 12:17 |
| amoralej | that was my understanding, although dunno if that's correct | 12:18 |
| dviroel | amoralej: right, but still could be a different config | 12:18 |
| dviroel | I'm working on some unit testing and testing it locally too | 12:19 |
| amoralej | yes, but then we should make sure that timeout is < period | 12:19 |
| dviroel | +1 | 12:19 |
| amoralej | unless it ensures that a new collection is not executed until the previous one finishes | 12:20 |
| amoralej | which may be the case, i'm not sure | 12:20 |
| dviroel | yes, the scheduler will not start a new one if the previous one is still running. | 12:20 |
| dviroel | apscheduler logs when this happens | 12:21 |
| amoralej | ack, then timeout may be even > period, technically (although it may not make much sense) | 12:21 |
| dviroel | but it makes sense to compare these configs and raise a warning at least | 12:21 |
| dviroel | yeah, it will just skip that time, and try again in the next period | 12:22 |
| dviroel | ok, so if you folks want to take a look and comment, it will be great | 12:23 |
| dviroel | but we can circle back to this CDM change next week in more detail I think | 12:23 |
| dviroel | that's what I have in eventlet topic for now | 12:23 |
| dviroel | lets move to the next one then | 12:24 |
| * | dviroel waits 1 min | 12:24 |
| dviroel | #topic new blueprint for automatic skip actions on pre_condition (second pass) | 12:25 |
| dviroel | hey amoralej o/ | 12:25 |
| amoralej | this is follow up of previous discussion | 12:25 |
| dviroel | ack, from last week | 12:25 |
| amoralej | yes | 12:25 |
| amoralej | i updated the blueprint as discussed https://blueprints.launchpad.net/watcher/+spec/skip-actions-in-pre-condition | 12:26 |
| amoralej | I added some more high level details of the conditions covered and added a sentence about the documentation | 12:26 |
| amoralej | If it's fine for you, i'd like to get it approved | 12:27 |
| dviroel | oh ok, so we should review that and approve | 12:28 |
| jgilaber | I think it captures well what we've discussed previously | 12:28 |
| sean-k-mooney | amoralej: so my only real comment is i asked for the conditions to be listed in the blueprint rather than in the review | 12:29 |
| sean-k-mooney | we can do it in the review but the reason for a blueprint and spec is to agree on that before the implementation is done | 12:30 |
| amoralej | I added | 12:30 |
| amoralej | The high level conditions to be checked for each action are: | 12:30 |
| amoralej | - The instance or volume to be acted on does not exist. | 12:30 |
| amoralej | - In a migration action, the instance is not running in source_node (tbd if that should be skipping or failing) | 12:30 |
| amoralej | - Destination pool or host, if explicitly indicated, does not exist or is disabled (that will lead to a FAILED action). | 12:30 |
| sean-k-mooney | i think we have enough to move forward with | 12:30 |
| amoralej | - Required element status for the action are met, i.e. an instance is ACTIVE before starting a live migration. Otherwise the action wil | 12:30 |
| amoralej | yep | 12:30 |
| sean-k-mooney | right i wanted you to say if it would be skipped or failed for each condition as well | 12:31 |
| sean-k-mooney | but since we agreed to add action docs | 12:31 |
| sean-k-mooney | we can just capture it there | 12:31 |
| amoralej | btw, I've sent a review https://review.opendev.org/c/openstack/watcher/+/968025 to add actions documentation | 12:31 |
| sean-k-mooney | ack we should complete that first as a prerequisite for the auto skipping | 12:32 |
| sean-k-mooney | are there any objections to approving this? | 12:32 |
| sean-k-mooney | if not i'll give it another 30 seconds and make the changes in launchpad | 12:33 |
| dviroel | lgtm | 12:33 |
| jgilaber | +1 | 12:33 |
| sean-k-mooney | cool ill do that now and we can move on | 12:33 |
| amoralej | Thanks | 12:34 |
| dviroel | thanks amoralej for the updates | 12:34 |
| dviroel | next topic then amoralej ? | 12:35 |
| amoralej | yes | 12:35 |
| dviroel | #topic new blueprint for Applier service failure management | 12:35 |
| amoralej | #link https://blueprints.launchpad.net/watcher/+spec/monitor-failed-appliers | 12:35 |
| amoralej | this also comes from PTG discussion | 12:35 |
| amoralej | it's about implementing applier monitoring to handle the case when appliers are detected as FAILED | 12:36 |
| amoralej | the behaviour we agreed on at the PTG is: | 12:36 |
| amoralej | - ActionPlans in ONGOING state will be cancelled and a message will be added to the status_message field of the AP. | 12:36 |
| amoralej | - ActionPlans in PENDING state will be unassigned (hostname field will be emptied) and a new launch_action_plan RPC message will be sent. That way, any available applier will pick it up and execute it. | 12:36 |
| sean-k-mooney | +1 that matches what i recall | 12:38 |
| sean-k-mooney | this personally feels more like it should have a spec to be honest | 12:38 |
| sean-k-mooney | have we thought about the upgrade impact here? | 12:39 |
| sean-k-mooney | we are not changing the rpc interface i guess | 12:39 |
| amoralej | no, we are not | 12:39 |
| sean-k-mooney | but we are changing the semantics of how it works | 12:39 |
| amoralej | in which sense? | 12:40 |
| sean-k-mooney | we are changing the semantics of the cancellation | 12:40 |
| dviroel | I'm still thinking on the action plans from ONGOING to CANCELLED. We will need to do something with the Actions too right? | 12:40 |
| amoralej | I don't think so | 12:40 |
| amoralej | note that that transition already exists | 12:40 |
| sean-k-mooney | so we are talking about a case where the applier dies | 12:41 |
| amoralej | actually, that behavior occurs currently when an applier is started and it detects previous actionplans were ONGOING on it | 12:41 |
| sean-k-mooney | so the current applier that was processing it won't be running so the action will be left in ongoing | 12:41 |
| sean-k-mooney | amoralej: right | 12:41 |
| sean-k-mooney | that's what i meant about changing the semantics | 12:41 |
| sean-k-mooney | now you are suggesting that we cancel the ongoing action plan | 12:42 |
| dviroel | right, because a Cancel today is handled by the current running threads in the applier. So if another applier starts and cancels the AP, not sure what happens with the Actions | 12:42 |
| dviroel | i don't really remember | 12:42 |
| sean-k-mooney | but the current action would not be cancelled | 12:42 |
| jgilaber | so currently the action plan would be left ongoing indefinitely? | 12:43 |
| sean-k-mooney | we likely need to move the action to failed as part of cancelling the action plan | 12:43 |
| sean-k-mooney | jgilaber: yes | 12:43 |
| dviroel | this ^ | 12:43 |
| sean-k-mooney | until the applier is restarted | 12:43 |
| amoralej | yes ^ | 12:43 |
| sean-k-mooney | which we should not rely on | 12:43 |
| amoralej | exactly, because it may affect scale-down scenarios, i.e. | 12:44 |
| dviroel | ack | 12:44 |
| amoralej | https://github.com/openstack/watcher/blob/master/watcher/applier/sync.py#L53-L74 | 12:44 |
| sean-k-mooney | so the reason i'm asking about upgrade is we need to be sure this works if you have not upgraded all instances of the applier to the same version | 12:44 |
| amoralej | that's what is used today | 12:44 |
| sean-k-mooney | i.e. if we have 2025.1 appliers they still need to work with master decision engines or apis | 12:45 |
| amoralej | btw, it cancels actionplans and actions in ongoing pending or ongoing | 12:45 |
| sean-k-mooney | ack | 12:45 |
| sean-k-mooney | so is your proposal to invoke the same behavior | 12:46 |
| amoralej | yes | 12:46 |
| amoralej | for ONGOING | 12:46 |
| sean-k-mooney | from the elected leader | 12:46 |
| sean-k-mooney | of the appliers that is doing the monitoring | 12:46 |
| sean-k-mooney | or will this be done by the decision engine? | 12:47 |
| amoralej | that's my doubt | 12:47 |
| sean-k-mooney | or if we add a watcher conductor in the future will it be done by that | 12:47 |
| amoralej | if we should add this to the existing monitor in the decision engine or add a new one | 12:47 |
| amoralej | at this point, i don't want to add a new conductor service only for this, tbh | 12:47 |
| amoralej | as this would be a big impact for deployment | 12:48 |
| sean-k-mooney | well i think we are building up more and more use cases for the conductor | 12:48 |
| amoralej | which, i think it's too much for this | 12:48 |
| sean-k-mooney | we may be able to punt on adding it for this cycle | 12:48 |
| jgilaber | I think it should either be in the applier or a separate service, I would not want to couple decision engines and applier like this | 12:48 |
| sean-k-mooney | but i think we will have to seriously consider making that change next cycle | 12:48 |
| amoralej | the other option, would be to add a new service to the applier, similar to what we do in the decision_engine | 12:49 |
| sean-k-mooney | jgilaber: ya same | 12:49 |
| sean-k-mooney | amoralej: yes that's what we mean by in the applier. mirroring the approach in the decision engine and the same leader election approach | 12:49 |
| amoralej | i may create a base service class with common code (leader election, get_servise_status, etc..) | 12:49 |
| amoralej | wfm | 12:49 |
| dviroel | +1 | 12:49 |
| sean-k-mooney | ya sharing that makes sense. at least in the short term | 12:50 |
| jgilaber | +1 for the reuse | 12:50 |
| sean-k-mooney | amoralej: on the testing side how are you planning to test this | 12:50 |
| sean-k-mooney | i don't think this should be tested in tempest | 12:50 |
| amoralej | yeah | 12:50 |
| sean-k-mooney | this is a case where if we had the functional testing enhancements it would be very useful | 12:50 |
| amoralej | so, we can create two appliers as we do with decision_engine, to at least see that both are started and leader election works | 12:51 |
| sean-k-mooney | amoralej: yes we can enable multiple appliers in the jobs | 12:51 |
| amoralej | testing real cases of failing over, etc... requires another kind of testing than tempest, i agree | 12:51 |
| sean-k-mooney | i just dont want to use a post playbook or tempest test to kill one and see that it falls over | 12:51 |
| sean-k-mooney | *fails over | 12:51 |
| sean-k-mooney | tempest is a very poor choice for this type of fault injection / disaster recovery type testing | 12:52 |
| sean-k-mooney | but i do want us to test it in a functional test if we can or unit test failing that | 12:53 |
| amoralej | you can see the kind of unit testing I'm doing in https://review.opendev.org/c/openstack/watcher/+/963252/ for decision-engine | 12:54 |
| amoralej | I'd do similar one for the applier | 12:54 |
| amoralej | until we can do something better | 12:54 |
| amoralej | with functional testing | 12:54 |
| sean-k-mooney | ack if you factor out the common code this would also be shared | 12:54 |
| sean-k-mooney | we can test both explicitly too | 12:54 |
| sean-k-mooney | but the leader election part can be common | 12:54 |
| amoralej | yep | 12:55 |
| sean-k-mooney | the specifics of cancelling the action/action plan will be different | 12:55 |
| amoralej | exactly | 12:55 |
| amoralej | so, based on that patch, the monitor_services_status method would be specific | 12:55 |
| amoralej | all the rest would be shared i think | 12:55 |
| amoralej | I need to double check, anyway | 12:56 |
| dviroel | ack, so we keep the blueprint or should we consider a spec for this effort? | 12:56 |
| amoralej | my inclination towards a blueprint was based on the fact that it's actually not changing apis, config options or deployment factors | 12:57 |
| sean-k-mooney | one of my rules of thumb is the bus factor | 12:57 |
| jgilaber | judging by the discussion here it looks like the change would be fairly contained in one place, so I would be ok without a spec | 12:57 |
| sean-k-mooney | are we confident based on the description that if alfredo disappears tomorrow we could complete the work | 12:57 |
| sean-k-mooney | this is probably pushing what can be done without one but it seems the overall consensus is to proceed with it as specless for now | 12:58 |
| sean-k-mooney | so i guess if we agree we can approve | 12:58 |
| sean-k-mooney | going once | 12:59 |
| dviroel | +1 to continue with the bp for now | 12:59 |
| sean-k-mooney | twice | 12:59 |
| jgilaber | +1 | 12:59 |
| sean-k-mooney | cool | 12:59 |
| * | dviroel time check | 13:00 |
| dviroel | amoralej: let's circle back to this BP in following meetings if needed | 13:00 |
| amoralej | ack, thanks | 13:00 |
| dviroel | jgilaber: want to quickly cover your spec or move to next week? | 13:01 |
| jgilaber | we can move the discussion to next week and just announce that I pushed a spec for the change to use openstacksdk | 13:01 |
| dviroel | we can just call for reviews for now then | 13:01 |
| dviroel | ack | 13:02 |
| dviroel | #topic Reviews | 13:02 |
| dviroel | not going to cover all of them | 13:02 |
| dviroel | just asking folks to review the ones that are listed there | 13:02 |
| dviroel | also please consider adding them to the status etherpad | 13:02 |
| dviroel | #link https://etherpad.opendev.org/p/watcher-2026.1-status | 13:02 |
| dviroel | you can better organise your patches there if needed | 13:03 |
| dviroel | easy for review to follow up | 13:03 |
| dviroel | s/review/reviewers | 13:03 |
| dviroel | jgilaber: going to move your LP to next week too | 13:04 |
| jgilaber | sounds good,thanks | 13:04 |
| dviroel | otherwise we can continue discussing it here after the meeting | 13:04 |
| dviroel | #topic Volunteers to chair next meeting | 13:04 |
| dviroel | thanks jgilaber for volunteering again o/ | 13:04 |
| dviroel | let's wrap up for today | 13:04 |
| dviroel | we will meet again next week | 13:04 |
| dviroel | thank you all for participating | 13:05 |
| dviroel | #endmeeting | 13:05 |
| opendevmeet | Meeting ended Thu Nov 27 13:05:05 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 13:05 |
| opendevmeet | Minutes: https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.html | 13:05 |
| opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.txt | 13:05 |
| opendevmeet | Log: https://meetings.opendev.org/meetings/watcher/2025/watcher.2025-11-27-12.00.log.html | 13:05 |
| jgilaber | thanks dviroel! | 13:05 |
| dviroel | thanks folks o/ | 13:05 |
| morenod | thanks dviroel++ | 13:05 |
| amoralej | thanks dviroel++ | 13:05 |
| chandankumar | dviroel++ thanks! | 13:05 |
| opendevreview | David proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller) https://review.opendev.org/c/openstack/watcher/+/967331 | 13:09 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Skip migrate actions in pre_condition phase https://review.opendev.org/c/openstack/watcher/+/966699 | 13:11 |
| opendevreview | chandan kumar proposed openstack/watcher-tempest-plugin master: Remove deprecated client_functional tests https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968651 | 13:26 |
| opendevreview | Douglas Viroel proposed openstack/watcher-tempest-plugin master: Consolidate and improve Zuul CI job definitions https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968247 | 14:00 |
| opendevreview | David proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller) https://review.opendev.org/c/openstack/watcher/+/967331 | 14:02 |
| opendevreview | David proposed openstack/watcher master: [DNM] Testing nodeset with three nodes (two computes + 1 controller) https://review.opendev.org/c/openstack/watcher/+/967331 | 15:53 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Make VM migrations timeout configurable and apply reasonable defaults https://review.opendev.org/c/openstack/watcher/+/967693 | 16:35 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Make VM resize timeout configurable with migration defaults https://review.opendev.org/c/openstack/watcher/+/968610 | 16:35 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add documentation section for actions https://review.opendev.org/c/openstack/watcher/+/968025 | 17:01 |
| opendevreview | Douglas Viroel proposed openstack/watcher master: Remove eventlet-based timeout in CDM collectors https://review.opendev.org/c/openstack/watcher/+/968568 | 18:54 |
| opendevreview | Douglas Viroel proposed openstack/watcher-tempest-plugin master: Wait action plan to finish before asserting state https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/968750 | 19:57 |
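
The applier failure handling agreed in the "Applier service failure management" topic can be illustrated with a small sketch. This is not Watcher code: `ActionPlan`, `handle_failed_applier` and the `relaunch` callback are hypothetical names chosen for illustration, and only the state names and the `hostname` and `status_message` fields come from the discussion. The `relaunch` callback stands in for sending a new `launch_action_plan` RPC message.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ActionPlan:
    # Hypothetical stand-in for an action plan record.
    uuid: str
    state: str                      # e.g. "PENDING", "ONGOING", "CANCELLED"
    hostname: Optional[str] = None  # applier the plan is assigned to
    status_message: str = ""


def handle_failed_applier(
    failed_host: str,
    plans: Iterable[ActionPlan],
    relaunch: Callable[[ActionPlan], None],
) -> List[ActionPlan]:
    """Apply the agreed behaviour to plans owned by a FAILED applier.

    Returns the list of plans that were changed.
    """
    changed = []
    for plan in plans:
        if plan.hostname != failed_host:
            continue  # plan belongs to another applier, leave it alone
        if plan.state == "ONGOING":
            # Nobody is executing this plan any more: cancel it and say why.
            plan.state = "CANCELLED"
            plan.status_message = (
                f"Applier {failed_host} was detected as failed"
            )
            changed.append(plan)
        elif plan.state == "PENDING":
            # Unassign the plan so any available applier can pick it up,
            # then ask for it to be launched again (hypothetical callback).
            plan.hostname = None
            relaunch(plan)
            changed.append(plan)
    return changed
```

The sketch deliberately leaves out the open question raised in the meeting: the actions of a cancelled plan would likely need to be moved to a final state as well, since the failed applier can no longer do it.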
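
In the eventlet topic, the suggestion was to compare the collector `period` with the new timeout and at least raise a warning when the timeout is not smaller than the period. A minimal sketch of such a check, assuming a hypothetical helper name and plain integer arguments in seconds (the real option names are not given in this log):

```python
import logging

LOG = logging.getLogger(__name__)


def check_collector_timeout(period: int, timeout: int) -> bool:
    """Return True when timeout < period; log a warning otherwise.

    Both values are in seconds. The function name and signature are
    illustrative only and do not come from the Watcher code base.
    """
    if timeout < period:
        return True
    LOG.warning(
        "Collector timeout (%ss) is not smaller than its period (%ss); "
        "scheduled syncs may be skipped while a previous one is running.",
        timeout,
        period,
    )
    return False
```

A warning rather than an error matches the discussion: the scheduler does not start a new sync while the previous one is still running, so a timeout larger than the period is allowed, it just may not make much sense.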