| opendevreview | chandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/967522 | 05:21 |
|---|---|---|
| opendevreview | chandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/967522 | 06:45 |
| opendevreview | sean mooney proposed openstack/watcher master: [DB] adopt the AdHocDbFixture from oslo https://review.opendev.org/c/openstack/watcher/+/967390 | 08:47 |
| opendevreview | chandan kumar proposed openstack/watcher-tempest-plugin master: Refactor: Consolidate test utilities into shared BaseCommon class https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/967522 | 09:02 |
| opendevreview | David proposed openstack/watcher-tempest-plugin master: Add test for skipped actions https://review.opendev.org/c/openstack/watcher-tempest-plugin/+/966860 | 10:08 |
| opendevreview | Merged openstack/watcher master: [DB] adopt the AdHocDbFixture from oslo https://review.opendev.org/c/openstack/watcher/+/967390 | 10:21 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable defautl https://review.opendev.org/c/openstack/watcher/+/967693 | 13:53 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable defautl https://review.opendev.org/c/openstack/watcher/+/967693 | 13:55 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693 | 13:55 |
| amoralej | sean-k-mooney, dviroel wrt ^ rethinking about that. Given that nova has its own timeout based on vm size according to https://docs.openstack.org/nova/pike/admin/configuring-migrations.html#advanced-configuration-for-kvm-and-qemu what if watcher do not timeout it at all and keeps waiting until nova succeeds or aborts ? | 14:05 |
| amoralej | "The timeout in seconds is the instance size multiplied by the configurable parameter live_migration_completion_timeout, whose default is 800. For example, shared-storage live migration of an instance with 8GiB memory will time out after 6400 seconds." | 14:06 |
| sean-k-mooney | amoralej: so the timeout on wathcer side shoudl eb a hard limit for pathalogical behavior | 14:07 |
| sean-k-mooney | so basically if we have one at all | 14:07 |
| amoralej | what a pathalogical default may be? | 14:07 |
| sean-k-mooney | then it should be soemthign liek 1 hour or 1 day kind of thing | 14:07 |
| amoralej | yeah, that's what i was thinking about ... | 14:07 |
| sean-k-mooney | basiclly it shoudl be soemthign that is much longer then we would expect for a typeical action to complete | 14:08 |
| dviroel | we need to have a timeout from our side, we cannot depend or other services to timeout our op | 14:08 |
| dviroel | s/or other/on other | 14:09 |
| amoralej | so, I will increase the default to something higher | 14:09 |
| dviroel | yeah, that's a good option I think | 14:09 |
| amoralej | so in nova, with default 800s/ramGB, a 32GBs would be something like 7hr, i can do 8hr as default timeout | 14:10 |
| sean-k-mooney | the problem we have currently with this approch | 14:12 |
| sean-k-mooney | is it will cause teh action plan to pause on that action for up to 8 hours and tie up the applier | 14:12 |
| amoralej | right, an option would be to set workers=2 in the applier so that, we at least have another worker able to run an AP | 14:16 |
| amoralej | and hopefully, we'll soon have the capability to run multiple appliers better :) | 14:17 |
| amoralej | (i mean with at least automatic re-scheduling or canceling existing APs when an applier dies | 14:17 |
| sean-k-mooney | so we can have multipel action plans executing because we cna have 1 per RPC thread | 14:18 |
| amoralej | anyway, I'm open to discuss what the default value should be | 14:18 |
| sean-k-mooney | a better approch woudl be to split the startign of the live migration and the waiting for ti to complete into seperate actions | 14:19 |
| sean-k-mooney | long term we woudl want to fixn the threaded impletion to supprot parelel executiorn or do the refactor to dispatch actions to teh applier not action plans via rpc | 14:19 |
| sean-k-mooney | for what its worth 1 hour for a live migraiton is already exceesive | 14:21 |
| sean-k-mooney | it can take 10s of minutes but it should take form second to maybe 15 minutes | 14:21 |
| sean-k-mooney | its a long tail problem, some vms under a lot of load will take much longer but most shoudl complete orders of magnitude faster | 14:22 |
| amoralej | so 15mins may be a good value? | 14:23 |
| amoralej | feel free to reply in https://review.opendev.org/c/openstack/watcher/+/967693 | 14:23 |
| sean-k-mooney | it could yes. as its a config option you can tune it based on your deployment | 14:24 |
| amoralej | btw, i extended polling interval to 5s | 14:24 |
| sean-k-mooney | that shoudl likely be configurable too and i also dont think it shoudl be fixed long term | 14:24 |
| amoralej | i can also add a parameter migrate_polling_time in this patch | 14:26 |
| sean-k-mooney | im debating if it shoudl be migrate sepcific | 14:26 |
| sean-k-mooney | or if we shoudl have action_polling and action_timeout config that woudl be used for each action | 14:27 |
| sean-k-mooney | we coudl have both | 14:27 |
| sean-k-mooney | long term im also thinkign we will want something like this https://paste.opendev.org/show/bLqoRAuxohGyMM0H0w1l/ | 14:27 |
| amoralej | actually, we have a resize action that may want to use similar timeout | 14:27 |
| amoralej | but i understand others may have very different timeouts | 14:27 |
| sean-k-mooney | ya so im torn between per action timeouts/intervals vs a default with sepcific overried only for the subset that need that | 14:28 |
| amoralej | i.e. change_nova_service_state whould not need a high timeout | 14:28 |
| amoralej | volume_migrate may need higher one | 14:28 |
| sean-k-mooney | right exactly | 14:29 |
| sean-k-mooney | in general i woudl perfer if tiemout where enforece by nova or cinder not watcher | 14:29 |
| amoralej | yeah, that was my original doubt :) | 14:30 |
| sean-k-mooney | but if we have them they shoudl be for the pathalogica case where we go "somethign seame to be wrong" with that said we cant abort most oeratiosn althogh we can fo rlive migraton | 14:30 |
| dviroel | we could have both timeouts, one specific per action, and a global one as fallback, if the user don't provide a specific timeout we fallback to the global one | 14:36 |
| sean-k-mooney | yep that was my intial tought | 14:36 |
| amoralej | note that, at some point, we could do per-audit timeouts | 14:37 |
| sean-k-mooney | have a default that is used for all action and for the ones that might wnat ad diffent value have a sepreate config option that defautl to the global one | 14:37 |
| amoralej | that was my initial idea as next improvement | 14:37 |
| sean-k-mooney | amoralej: we could but im not conviced we shoudl | 14:37 |
| sean-k-mooney | there are 2 diffent admins here. | 14:37 |
| sean-k-mooney | the infra admin has knowlage of host the cloud is deployed and the permance of it | 14:38 |
| sean-k-mooney | the openstack admin has api admin access but may not be involved in managing the deployment and configuration of the cloud | 14:38 |
| sean-k-mooney | so we cant really assume that the person who created the audit | 14:38 |
| sean-k-mooney | hwas deep knowlwage of the cloud | 14:39 |
| amoralej | yes, but, i.e. they know the size and workload activity, which is also a relevant factor | 14:39 |
| sean-k-mooney | now we can allwo them to provide timeout if they do but it shoudl nto be the primary way to do that | 14:39 |
| amoralej | i mean the openstack admin | 14:39 |
| sean-k-mooney | well the createor of the audit wotn have knowlage fo the workload in general | 14:39 |
| sean-k-mooney | either the esize or the activity | 14:40 |
| sean-k-mooney | if this is a public cloud which is what we desgin for by default | 14:40 |
| amoralej | yeah, right | 14:40 |
| sean-k-mooney | we wont have any knoldage of what the end user is runnign in the vm | 14:40 |
| sean-k-mooney | we can provide a way eventually to do this in the audit or action plan for priviate cloud usecase where the admin is alos the ownwer of the workload | 14:41 |
| sean-k-mooney | but that proably not the normal case | 14:41 |
| amoralej | so, if we implement two timeouts, action_timeout (global) + migrate_timeout (for migrate actions). How we would implement the global one? I understand that would not be part of the action itself, but something in the method calling execute() ? | 14:45 |
| sean-k-mooney | we coudl have action have a get_timeout and get_poll_interval function that will read form the relevent config | 14:47 |
| sean-k-mooney | there are other ways to do it too by having a map form class to config option or simialr | 14:48 |
| amoralej | but the logic to implement the timeout would be inside each action execution code iuuc | 14:49 |
| amoralej | not as something "external" checking the execution time on any kind of action | 14:50 |
| sean-k-mooney | i need to look at the code i tought you just implied the opicite | 14:50 |
| sean-k-mooney | if tis in the action execute fucntion it much simpler | 14:50 |
| sean-k-mooney | we define the action to look at its action specific config option | 14:51 |
| sean-k-mooney | and we set the defautl fo that option to the gloabl one | 14:51 |
| sean-k-mooney | so if you dont explciyt defien it it will defautl to the global one | 14:51 |
| dviroel | amoralej: i think that each action would need to be aware of the timeout and end its processing, we don't want to have a thread kill outside anymore | 14:52 |
| amoralej | i can easily implement that if in the migrate in that patch, but if we want to apply at least the global value to all the actions, we'll need to patch all actions | 14:52 |
| sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L25-L52 | 14:53 |
| amoralej | and again, i still have the doubt about how to size timeouts for volume_migrations specially | 14:53 |
| sean-k-mooney | amoralej: yes we woudl have ot modify all the actions | 14:53 |
| sean-k-mooney | we don thave to do that in one patch however | 14:53 |
| dviroel | i think that all other action are out of the scope for the LP bug scope | 14:53 |
| sean-k-mooney | right | 14:54 |
| sean-k-mooney | so for most action i dont think we have a timeout | 14:54 |
| dviroel | I think that we could really backport this server migration timeout fix | 14:54 |
| dviroel | but it depends how it is implemented | 14:54 |
| sean-k-mooney | so we can backport it if its just a straight config option for this one case | 14:55 |
| sean-k-mooney | so we can start with that | 14:55 |
| sean-k-mooney | but we shoudl fix this later for all actions | 14:55 |
| dviroel | ack | 14:55 |
| sean-k-mooney | so inially patch can jsut be 2 now migrate sepcific cofnig options | 14:55 |
| sean-k-mooney | s/now/new/ | 14:56 |
| amoralej | 2 migrate config options you mean migrate_timeout + migrate_polling_interval or migrate_timeout + action_timeout ? | 14:57 |
| amoralej | for the initial patch, i mean | 14:58 |
| amoralej | i can implement any of those options in https://review.opendev.org/c/openstack/watcher/+/967693 | 14:58 |
| sean-k-mooney | timeout and interval | 14:58 |
| amoralej | ack, i think that's a good option | 14:59 |
| amoralej | and global timeout would come as follow-up with the rest of actions | 14:59 |
| amoralej | lemme prepare the patch | 14:59 |
| sean-k-mooney | that a "stop the bleeding" approch to make live migration actully practical | 14:59 |
| amoralej | wfm | 14:59 |
| sean-k-mooney | we can then design a better long term solution | 14:59 |
| dviroel | agree | 14:59 |
| sean-k-mooney | perhasp as part of the scaling work or seperatly | 14:59 |
| amoralej | anyway, i think i will increase the default timeout to something like 1hr at least | 14:59 |
| sean-k-mooney | i woudl start with 900s and 1s personally but we coudl try an hour | 15:00 |
| sean-k-mooney | 1 second might be a bit agressive | 15:01 |
| amoralej | i think so, tbh | 15:01 |
| sean-k-mooney | in ci we will want to tune this to like 10s and 0.2 seconds | 15:01 |
| sean-k-mooney | so they shoudl be floats | 15:01 |
| sean-k-mooney | maybe 1 second min and keeping it an int may be ok | 15:02 |
| sean-k-mooney | for ci | 15:02 |
| amoralej | i think under-second is too small even for CI :) | 15:02 |
| amoralej | we are not running such a big number of migrations that half a second per migration makes big difference ... | 15:03 |
| sean-k-mooney | not in the tempest tests | 15:03 |
| sean-k-mooney | i dont want this to be an eveutal problem in the fuctional tests | 15:03 |
| sean-k-mooney | btu we can alwasy go form a int to a float later | 15:03 |
| sean-k-mooney | that a backward compatible change | 15:04 |
| sean-k-mooney | so stick with int for now | 15:04 |
| sean-k-mooney | we can revisti if ti becomes a problem later | 15:04 |
| sean-k-mooney | i was prematurly optimizing | 15:04 |
| amoralej | ok | 15:05 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693 | 15:59 |
| amoralej | sean-k-mooney dviroel ^ finally i have implemented pollint timeout as float, i'm setting it to 1.5 in one of the unit tests | 15:59 |
| amoralej | no timeout, polling time, i meant | 16:00 |
| amoralej | the case for functional tests convinced me :) | 16:00 |
| sean-k-mooney | so you didnt fix https://review.opendev.org/c/openstack/watcher/+/967693/comment/b82a145e_9bd7118b/ | 16:03 |
| sean-k-mooney | decrementing rety is incorrect | 16:04 |
| sean-k-mooney | its not a rety if you are initallising it to the tiem out value | 16:04 |
| sean-k-mooney | so either you need to change thename | 16:04 |
| sean-k-mooney | or you need to change the approch | 16:04 |
| amoralej | i was just replying to that :) | 16:05 |
| amoralej | yeah, retry name seems incorrect | 16:05 |
| amoralej | but i think logic is correct | 16:05 |
| sean-k-mooney | call it timeout and im fine with it | 16:05 |
| sean-k-mooney | well | 16:05 |
| amoralej | yep | 16:05 |
| sean-k-mooney | ya no negitive values are fine | 16:06 |
| sean-k-mooney | you dont >0 in the while | 16:06 |
| sean-k-mooney | you shhoudl not be using getatrr ther but the exsting host was so i gues sthat is not your bug | 16:06 |
| sean-k-mooney | we can fix that seperatly | 16:06 |
| amoralej | yeah, i used what was used before, i didn't want to touch that, tbh | 16:07 |
| sean-k-mooney | ya its fien for now | 16:11 |
| sean-k-mooney | we shoudl clean up that code evenutlly but its not part of this bug | 16:12 |
| sean-k-mooney | its just not following good python codeing style | 16:12 |
| opendevreview | Alfredo Moralejo proposed openstack/watcher master: Add configurable timeout for migrate actions with reasonable default https://review.opendev.org/c/openstack/watcher/+/967693 | 16:13 |
| opendevreview | Douglas Viroel proposed openstack/watcher master: Adds support for threading mode in applier https://review.opendev.org/c/openstack/watcher/+/966226 | 20:29 |
| opendevreview | Douglas Viroel proposed openstack/watcher master: Disable eventlet patching for api service in threading job https://review.opendev.org/c/openstack/watcher/+/967769 | 20:36 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!