09:00:43 <jakeyip> #startmeeting magnum
09:00:43 <opendevmeet> Meeting started Wed Jan 24 09:00:43 2024 UTC and is due to finish in 60 minutes. The chair is jakeyip. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:43 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:43 <opendevmeet> The meeting name has been set to 'magnum'
09:00:44 <jakeyip> Agenda:
09:00:46 <jakeyip> #link https://etherpad.opendev.org/p/magnum-weekly-meeting
09:00:48 <jakeyip> #topic Roll Call
09:00:53 <jakeyip> o/
09:00:56 <dalees> o/
09:01:33 <jakeyip> #topic review - flakey tests
09:01:58 <dalees> so this work was prompted when trying to merge the Swarm patchset
09:02:33 <jakeyip> thanks. I had a brief look, haven't tested it out yet due to lack of time
09:02:42 <dalees> and there was some state leaking after some tests, so I've now mocked these and the broken test case seems to be stable now.
09:03:28 <dalees> cool - I've taken the liberty of rebasing the Swarm removal onto this too - it passed CI
09:04:03 <dalees> but otherwise there's no rush on merging. it can wait until reviewers have time
09:05:34 <jakeyip> I'm ok with it, but I was hoping mnasiadka can have a look when he comes back, since he was working in this space too
09:05:55 <jakeyip> if there's no urgency maybe we can wait a week? I think he'll be back soon
09:06:22 <dalees> agreed, let's park it until then
09:06:52 <jakeyip> thanks for understanding
09:07:10 <jakeyip> #topic control plane resize
09:07:43 <dalees> so for this I had a question I posted yesterday:
09:07:44 <dalees> I'm interested in thoughts on Magnum-API looking up the driver for a cluster and using validation methods on it. I think the driver is only used by the Magnum Conductor currently, so perhaps the expected way to do this is RPC calls? This is in relation to https://review.opendev.org/c/openstack/magnum/+/906086
09:11:39 <jakeyip> hmm, I have a thought: we currently don't support this (?), but what if the conductor and api are different versions (e.g. in the middle of an upgrade) and there's new driver code?
09:13:18 <dalees> well, the validation (in this case for control plane sizes on create or resize) will be done according to whatever driver version is installed in the magnum-api instance and not the magnum-conductor (which still performs the *actual* resize actions)
09:13:20 <jakeyip> also, we don't store the driver used for a cluster 'at point of creation'; rather, the conductor looks it up, so in time this may be an issue... not really sure
09:15:07 <jakeyip> to be fair, I don't think mixed versions are really much of a problem; we never say we support mixed versions, right?
09:16:18 <jakeyip> generally I think it's ok, not shooting this down, just posting edge cases
09:16:21 <dalees> No, I wouldn't think so - perhaps for a short time during upgrade.
09:17:01 <dalees> yeah, it's tricky. Seems a big waste of effort to call out to RPC just to validate the master count, but maybe that keeps the delineation.
09:18:08 <jakeyip> I agree it's a lot of effort. I'm not keen for such a simple case
09:18:10 <mkjpryor> Hi all
09:18:16 <dalees> hi mkjpryor
09:18:16 <mkjpryor> Apologies for my late arrival
09:19:08 <jakeyip> hi mkjpryor, glad you can join :)
09:19:11 <mkjpryor> Did I miss anything important?
09:19:15 <dalees> mkjpryor: backlog available here: https://meetings.opendev.org/meetings/magnum/2024/magnum.2024-01-24-09.00.log.txt
09:19:32 <mkjpryor> Thanks dalees
09:19:48 <dalees> just discussing a question I had about magnum-api using driver lookup. I'm interested if you have thoughts, too
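
[Editor's note: for context, a minimal sketch of the kind of driver-side validation hook under discussion in https://review.opendev.org/c/openstack/magnum/+/906086. The class and method names below are illustrative assumptions, not the actual patch contents.]

    # Hypothetical base-driver hook that magnum-api could call directly,
    # so an invalid request fails fast with an HTTP 400 instead of an
    # async RPC that ends in UPDATE_FAILED.
    class Driver(object):
        def validate_master_resize(self, cluster, master_count):
            """Raise if this driver cannot run the control plane at
            master_count nodes."""
            raise NotImplementedError()

    class ExampleCAPIDriver(Driver):
        def validate_master_resize(self, cluster, master_count):
            # e.g. require an odd control plane size for etcd quorum;
            # real code would raise a Magnum exception that maps to a
            # 400 response rather than a bare ValueError.
            if master_count < 1 or master_count % 2 == 0:
                raise ValueError(
                    "master_count must be a positive odd number")
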
09:21:48 <mkjpryor> FWIW, I quite like the fact that the API is completely implementation agnostic
09:21:57 <mkjpryor> It makes things very easy to reason about
09:22:15 <mkjpryor> I think it would have to be very compelling functionality to change that
09:22:35 <mkjpryor> dalees: What exactly is the use case you are worried about?
09:23:18 <dalees> right now validation of API requests is performed by magnum-api. It can respond immediately to failures (i.e. master node count not accepted).
09:23:48 <dalees> if I do it in the resize function of the conductor's driver, it's an async RPC call and the cluster transitions into UPDATE_FAILED
09:23:58 <mkjpryor> I get you
09:24:14 <mkjpryor> So you're worried about the case where there is driver-specific validation that needs to happen
09:24:18 <mkjpryor> So we can fail early
09:24:19 <dalees> yes
09:24:49 <dalees> maybe the answer is just a non-async RPC to do that, and still have the conductor locate and run the driver code.
09:25:59 <mkjpryor> I guess you are worried about the API and the conductor versions getting out of sync if we call driver methods directly in the API?
09:26:27 <mkjpryor> I'd be worried about the latency of the RPC call in the API code
09:26:38 <mkjpryor> But I guess there must be other OpenStack services that do that
09:26:42 <mkjpryor> :shrugs:
09:26:49 <dalees> it's a concern we were discussing; in most deployments both would be deployed at once, I'd imagine.
09:27:18 <mkjpryor> I can't imagine many cases, other than a failed upgrade, where they would be different for longer than a few seconds
09:27:32 <mkjpryor> And if you have a failed upgrade, you probably have other problems :D
09:28:33 <jakeyip> openstack upgrade? for us the workflow is probably more than a few seconds :D
09:28:45 <dalees> could be longer, with several api nodes and deployments progressing through them. But yeah, it wouldn't be long. Same problem with a conductor that doesn't know how to answer the validation RPC, I guess.
09:29:07 <mkjpryor> OpenStack upgrade - yeah. True - if you have multiple controllers then it might be a few minutes.
09:29:45 <mkjpryor> Or if the conductor at the old version answers the validation RPC and the conductor at the new version performs the operation
09:30:15 <mkjpryor> I think whatever we do, other than denying API requests during an upgrade, we have a risk of mismatched versions
09:30:47 <mkjpryor> So we might as well do the easy/low-latency thing of calling a driver method directly in the API IMHO
09:33:20 <dalees> okay thanks; I'll think more on your initial comments about it being easier to reason about if the conductor does it. I think that's important.
09:34:25 <dalees> maybe it just needs some defined boundaries and commenting. We don't ever want magnum-api doing driver actions, but validation seems helpful when these drivers aren't all exposing the same capabilities.
09:34:50 <jakeyip> yeah, this case is relatively trivial; I'm concerned that 'doing it right' just takes so much more effort and blocks things. I want to have some time to think about the API doing validation
09:35:27 <mkjpryor> As long as there is a well-defined contract for the driver to implement, on reflection I have less of a problem with the API calling that directly
09:36:12 <mkjpryor> And it is clear which bits of the driver get called by the API and which bits by the conductor
09:36:19 <jakeyip> trying to think about cinder resize, does the api figure out if the backend driver supports resize?
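
[Editor's note: a sketch of the "non-async RPC" alternative dalees raises, assuming plain oslo.messaging; the topic and method names here are hypothetical. A blocking call() would let magnum-api return a 400 immediately, whereas a cast() is fire-and-forget and a failure only surfaces later as UPDATE_FAILED - the blocking behaviour is also the API-side latency mkjpryor is wary of.]

    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='magnum-conductor')
    client = messaging.RPCClient(transport, target)

    def validate_resize(context, cluster_uuid, master_count):
        # call() blocks until the conductor (which owns the driver
        # today) replies, so the API can reject the request up front;
        # the actual resize would still go out as an async cast().
        return client.call(context, 'validate_resize',
                           cluster_uuid=cluster_uuid,
                           master_count=master_count)
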
09:39:42 <dalees> hmm, I'll try and figure that one out, jakeyip - could be similar, or quite different :)
09:41:26 <dalees> just to add concrete code; here are the proposed additions to the base driver that magnum-api may call (I need to add really clear docs!): https://review.opendev.org/c/openstack/magnum/+/906086/2/magnum/drivers/common/driver.py#196
09:42:00 <jakeyip> dalees: what are the bad things if an invalid size is specified and the cluster goes into UPDATE_FAILED?
09:43:34 <dalees> jakeyip: 1) the user doesn't get feedback in their CLI or UI; they get "accepted" and it fails later. 2) UPDATE_FAILED clusters cannot be acted on with some features like upgrade. You have to use a "trick" of resizing them to the same size. (I'm proposing we change this in the patchset too.)
09:44:40 <dalees> Mostly that it's a much worse user experience than getting a 400 "no you can't, here are the valid sizes".
09:46:29 <jakeyip> what's the trick?
09:47:31 <dalees> if a cluster is in UPDATE_FAILED you send it a resize to the same size it already is. magnum-api accepts that action and transitions to UPDATE_COMPLETE.
09:49:05 <jakeyip> ok thanks, just wanted to confirm, because I vaguely remember doing something similar but with worker nodegroups etc.
09:49:22 <dalees> (this is in magnum/conductor/handlers/cluster_conductor.py in the allow_update_status variables)
09:51:21 <dalees> yeah, it doesn't matter which nodegroup you do it with. The whole cluster changes state to UPDATE_COMPLETE, which allows more actions on it.
09:53:22 <jakeyip> the change to allow upgrade when UPDATE_FAILED seems a bit dangerous; can you explain why you need that? is it not possible to resize to the current size, get it to transition to UPDATE_COMPLETE, then upgrade?
09:57:21 <dalees> because that's a silly workaround? Perhaps UPDATE_FAILED covers too many cases, some where you would rather deny upgrades - so I don't stand by that entirely.
09:58:04 <dalees> if it's UPDATE_FAILED because you tried to set the master size to 100, but it failed and stayed at 3 - well, what's the point in denying a cluster upgrade?
09:58:17 <jakeyip> also, if we want Magnum's future to be a thin shim, should we even do validation? what if one day even numbers of controllers are acceptable? (because magic :P). do we want to update validation logic in Magnum? where do we want to draw the line at validation?
09:59:09 <dalees> well yeah, this goes a long way ;)
09:59:12 <jakeyip> yeah, UPDATE_FAILED may have many reasons. I guess it was (is?) good to make sure the cluster is healthy before allowing another operation. this can change, of course
09:59:32 <dalees> for that *exact* example, I've got you covered. The driver defines whatever it wants to support: https://github.com/stackhpc/magnum-capi-helm/pull/41/files
10:02:03 <jakeyip> OK, for the stackhpc example, understood
10:02:24 <jakeyip> so in the size 100 case, should it be an operation to resize to 3 again and then upgrade? what should the status be after the upgrade if we allow it? the db will be storing size 100, I assume?
10:04:19 <dalees> well no, size 100 is right now denied as an HTTP 400 and not stored - that's the use case for validation vs letting magnum-api trigger the async RPC.
10:05:24 <dalees> but I think there are more use cases that Magnum right now encodes that we'll want to move into the driver
10:06:01 <dalees> vexxhost had one a week back, which was CNI support for Cilium. The list is hardcoded in Magnum right now.
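
[Editor's note: an illustrative sketch of letting each driver declare what it supports, in the spirit of the stackhpc/magnum-capi-helm change linked above; the attribute and method names are assumptions, not code from that PR.]

    # Instead of magnum-api hardcoding valid control plane sizes (or the
    # CNI list), the driver owns them - a driver that someday supports
    # even-sized control planes just changes its own constants.
    class ExampleHelmDriver(object):
        supported_master_counts = (1, 3, 5, 7)
        supported_cni = ("calico", "cilium")

        def validate_master_count(self, master_count):
            if master_count not in self.supported_master_counts:
                raise ValueError(
                    "master_count must be one of %s"
                    % (self.supported_master_counts,))
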
10:06:42 <jakeyip> sorry, am thinking of the case where it is allowed, or it is a resize of a worker nodegroup, then no quota or something, resulting in an UPDATE_FAILED status
10:09:18 <jakeyip> we are past time, want to have a think and come back next week?
10:09:48 <dalees> yep, all good - a bit to think about here. thanks for the discussion mkjpryor and jakeyip
10:09:59 <jakeyip> thanks for bringing this up
10:10:33 <jakeyip> #endmeeting