21:00:05 <strigazi> #startmeeting containers
21:00:06 <openstack> Meeting started Tue Mar 5 21:00:05 2019 UTC and is due to finish in 60 minutes. The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:09 <openstack> The meeting name has been set to 'containers'
21:00:11 <strigazi> #topic Roll Call
21:00:17 <strigazi> o/
21:00:19 <schaney> o/
21:00:21 <jakeyip> o/
21:01:48 <brtknr> o/
21:02:38 <strigazi> Hello schaney jakeyip brtknr
21:02:41 <strigazi> #topic Stories/Tasks
21:02:53 <imdigitaljim> o/
21:03:08 <strigazi> I want to mention three things quickly.
21:03:18 <strigazi> CI for swarm and kubernetes is not passing
21:03:21 <colin-> hello
21:03:34 <strigazi> Hello colin- imdigitaljim
21:04:09 <strigazi> I'm finding the error
21:04:44 <strigazi> for example for k8s http://logs.openstack.org/73/639873/3/check/magnum-functional-k8s/06f3638/logs/screen-h-eng.txt.gz?level=ERROR
21:04:59 <strigazi> The error is the same for swarm
21:06:01 <strigazi> If someone wants to take a look and then comment in https://review.openstack.org/#/c/640238/ or in a fix :)
21:06:15 <strigazi> 2.
21:06:50 <strigazi> small regression I have found for the etcd_volume_size label (persistent storage for etcd) https://storyboard.openstack.org/#!/story/2005143
21:07:06 <strigazi> this fix is obvious
21:07:25 <strigazi> 3.
21:07:33 <strigazi> imdigitaljim created Cluster creators that leave WRT Keystone cause major error https://storyboard.openstack.org/#!/story/2005145
21:07:40 <imdigitaljim> yeah thats my 1
21:07:57 <strigazi> it has been discussed many times. the keystone team says there is no fix
21:08:21 <strigazi> in our cloud we manually transfer the trustee user to another account.
21:08:23 <imdigitaljim> could we rework magnum to opt to poll heat based on a service account for 1 part
21:08:31 <imdigitaljim> instead of using trust cred to poll heat
21:08:54 <strigazi> imdigitaljim: some say this is a security issue, it was like this before.
21:09:01 <imdigitaljim> oh?
21:09:09 <strigazi> but this fixes part of the problem
21:09:11 <imdigitaljim> couldnt it be scoped to readonly/gets for heat
21:09:25 <imdigitaljim> the kubernetes side
21:09:33 <imdigitaljim> either might be trust transfer (like you suggest)
21:09:46 <imdigitaljim> or we have been opting for teams to use a bot account type approach for their tenant
21:09:58 <imdigitaljim> that will persist among users leaving
21:10:19 <strigazi> trusts transfer *won't* happen in keystone, ever
21:10:24 <imdigitaljim> yeah
21:10:28 <imdigitaljim> i doubt it would
21:10:43 <jakeyip> does this happen only if the user is deleted from keystone?
21:10:47 <strigazi> they were clear with this in the Dublin PTG
21:10:56 <strigazi> yes
21:10:58 <imdigitaljim> yeah
21:11:12 <strigazi> the trust powers die when the user is deleted
21:11:19 <strigazi> same for application creds
21:11:28 <imdigitaljim> to be honest even if we fix the "magnum to opt to poll heat based on a service account"
21:11:32 <imdigitaljim> that would be a huge improvement
21:11:41 <imdigitaljim> that would at least enable us to delete the clusters
21:11:43 <imdigitaljim> without db edits
21:11:58 <strigazi> admins can delete the cluster anyway
21:12:06 <imdigitaljim> we could not
21:12:11 <strigazi> ?
21:12:14 <imdigitaljim> with our admin accounts
21:12:22 <imdigitaljim> the codepaths bomb out with heat polling
21:12:39 <imdigitaljim> not sure where
21:12:43 <jakeyip> is this a heat issue instead?
21:12:44 <imdigitaljim> the occurrence was just yesterday
21:12:53 <strigazi> maybe you diverged in the code?
21:12:56 <imdigitaljim> no i had to delete the heat stack underneath with normal heat functionality
21:13:03 <imdigitaljim> and then manually remove the cluster via db
21:13:16 <strigazi> wrong policy?
21:13:16 <imdigitaljim> not with that regard
21:13:32 <colin-> +1 re: service account, fwiw
21:14:05 <imdigitaljim> nope
21:15:20 <imdigitaljim> AuthorizationFailure: unexpected keystone client error occurred: Could not find user: <deleted_user>. (HTTP 404) (Request-ID: req-370b414f-239a-4e13-b00d-a1d87184904b)
21:15:34 <strigazi> ok
21:15:36 <jakeyip> ok so figuring out why admin can't use magnum to delete a cluster but can use heat to delete a stack will be a way forward?
21:15:48 <jakeyip> I wonder what is the workflow for normal resources (e.g. nova instances) in case of people leaving?
21:16:02 <strigazi> the problem is magnum can't check the status of the stack
21:16:17 <brtknr> it would be nice if the trust was owned by a role+domain rather than a user, so anyone with the role+domain can act as that role+domain
21:16:24 <imdigitaljim> ^
21:16:25 <imdigitaljim> +1
21:16:26 <imdigitaljim> +1
21:16:36 <brtknr> guess its too late to refactor things now...
21:16:52 <imdigitaljim> imo not really
21:16:53 <strigazi> it is a bit bad as well
21:17:04 <imdigitaljim> but it can be bad based on the use-case
21:17:07 <imdigitaljim> for us its fine
21:17:11 <strigazi> the trust creds are a leak
21:17:39 <imdigitaljim> yeah
21:17:44 <imdigitaljim> the trust creds on the server
21:17:46 <strigazi> userA takes trust creds from userB when they both own the cluster
21:17:50 <imdigitaljim> and you can get access to other clusters
21:17:58 <strigazi> userA is fired, can still access keystone
21:18:23 <brtknr> oh, because trust is still out in the wild?
21:18:32 <strigazi> the polling issue is different than the trust in the cluster
21:18:37 <imdigitaljim> yeah
21:18:40 <brtknr> change trust password *rolls eyes*
21:18:42 <imdigitaljim> different issues
21:18:56 <strigazi> we can do service account for polling again
21:19:07 <imdigitaljim> but an admin readonly scope
21:19:08 <imdigitaljim> ?
21:19:21 <strigazi> That is possible
21:19:32 <strigazi> since the magnum controller is managed by admins
21:19:35 <imdigitaljim> yeah
21:19:44 <imdigitaljim> i think that would be a satisfactory solution
21:19:53 <imdigitaljim> the clusters we can figure out/delete/etc
21:20:03 <imdigitaljim> but magnums behavior is a bit unavoidable
21:20:39 <imdigitaljim> thanks strigazi!
21:20:43 <imdigitaljim> you going to denver?
21:21:04 <strigazi> https://github.com/openstack/magnum/commit/f895b2bd0922f29a9d6b08617cb60258fa101c68#diff-e004adac7f8cb91a28c210e2a8d08ee9
21:21:19 <strigazi> I'm going yes
21:21:31 <imdigitaljim> lets meet up!
21:22:01 <strigazi> sure thing :)
21:22:58 <strigazi> Is anyone going to work on the polling thing? maybe a longer description first in storyboard?
21:23:12 <flwang1> strigazi: re https://storyboard.openstack.org/#!/story/2005145 i think you and ricardo proposed this issue before in mailing list
21:24:10 <strigazi> yes, I mentioned this. I discussed it with the keystone team in Dublin
21:24:11 <flwang1> and IIRC, we need support from keystone side?
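
[A minimal sketch of the magnum-side idea discussed above: polling the Heat stack with a dedicated service account instead of the cluster's trust credentials. It assumes python-heatclient and keystoneauth1; the account name "magnum-poller", the credential values, and the stack id are placeholders, and the read-only scoping mentioned above would have to be enforced through Heat policy rather than by this client code.]

```python
from heatclient import client as heat_client
from keystoneauth1 import session
from keystoneauth1.identity import v3

# Service-account credentials; every value here is a placeholder chosen by
# the operator, not something Magnum defines today.
auth = v3.Password(
    auth_url='https://keystone.example.com/v3',
    username='magnum-poller',
    password='secret',
    project_name='services',
    user_domain_name='Default',
    project_domain_name='Default',
)
sess = session.Session(auth=auth)
heat = heat_client.Client('1', session=sess)

# The stack id would come from the cluster record. Status polling only needs
# a GET on the stack, so it keeps working after the cluster creator (and the
# trust tied to them) has been deleted from Keystone.
stack_id = 'REPLACE_WITH_CLUSTER_STACK_ID'
stack = heat.stacks.get(stack_id)
print(stack.stack_status)
```
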
21:24:41 <strigazi> there won't be help or change
21:24:51 <strigazi> from the keystone side
21:25:10 <strigazi> 22:11 < strigazi> trusts transfer *won't* happen in keystone, ever
21:25:24 <strigazi> nor for application credentials
21:25:25 <flwang1> strigazi: so we have to fix it in magnum?
21:25:31 <strigazi> yes
21:25:45 <strigazi> two issues, one is the polling heat issue
21:26:03 <strigazi> 2nd, the cluster inside the cluster must be rotated
21:26:11 <imdigitaljim> creds inside*
21:26:28 <strigazi> we had a design for this in Dublin, but not man power
21:26:33 <strigazi> yes, creds :)
21:26:43 <imdigitaljim> yeah 1) trust on magnum, fixable and 2) trust on cluster, no clear path yet
21:27:06 <strigazi> 2) we have a rotate certificates api with noop
21:27:17 <strigazi> it can rotate the certs and the trust
21:27:22 <strigazi> that was the design
21:27:26 <flwang1> strigazi: ok, i think we need longer discussion for this one
21:27:44 <imdigitaljim> im more concerned about 1) for the moment which is smaller in scope
21:27:52 <imdigitaljim> 2) might be more challenging and needs more discussion/desing
21:27:55 <imdigitaljim> design
21:27:57 <strigazi> no :) we did it one year ago, someone can implement it :)
21:28:50 <strigazi> I'll bring up the pointer in storyboard
21:30:17 <strigazi> For the autoscaler, are there any outstanding comments? Can we start pushing the maintainers to accept it?
21:30:36 <flwang1> strigazi: i'm happy with current status.
21:30:43 <flwang1> it passed my test
21:31:12 <schaney> strigazi: there are some future enhancements that I am hoping to work with you guys on
21:31:17 <flwang1> strigazi: so we can/should start to push CA team to merge it
21:32:22 <strigazi> schaney: do you want to leave a comment that you are happy with the current state? we can ping the CA team the {'k8s', 'sig', 'openstack'} in some order
21:32:23 <flwang1> schaney: sure, the /resize api is coming
21:34:44 <schaney> I can leave a comment yeah
21:35:15 <schaney> Are you alright with me including some of the stipulations in the comment?
21:35:41 <schaney> for things like nodegroups, resize, and a couple bugs
21:35:59 <strigazi> schaney: I don't know how it will work for them
21:36:18 <schaney> same, not sure if it's better to get something out there and start iterating
21:36:29 <strigazi> +1 ^^
21:36:33 <schaney> or try to get it perfect first
21:36:58 <flwang1> schaney: i would suggest to track them in magnum or open separate issues later, but just my 2c
21:37:30 <imdigitaljim> we'll probably just do PRs against the first iteration
21:37:31 <schaney> track them in magnum vs the autoscaler?
21:37:43 <imdigitaljim> and use issues in autoscaler repo probably
21:37:47 <imdigitaljim> ./shrug
21:38:27 <schaney> yeah, us making PRs to the autoscaler will work for us going forward
21:38:40 <schaney> the current PR has so much going on already
21:38:48 <strigazi> We can focus on the things that work atm, and when it is in, PRs in the CA repo are fine
21:38:53 <flwang1> issues in autoscaler, but don't scare them :)
21:39:03 <flwang1> strigazi: +1
21:39:40 <schaney> one question: has tghartland looked into the TemplateNodeInfo interface method implementation?
21:39:41 <strigazi> as long as we agree on the direction
21:40:08 <schaney> I think the current implementation will cause a crash
21:40:16 <imdigitaljim> imho i think we're all heading the same direction
21:40:47 <strigazi> crash on what?
21:40:56 <strigazi> crash on what? why?
21:40:56 <schaney> the autoscaler
21:41:24 <strigazi> is it reproducible?
21:41:58 <schaney> Should be, I am curious as to if you guys have seen it
21:42:03 <strigazi> no
21:42:28 <schaney> I'll double check, but the current implementation should crash 100% of the time when it gets called
21:42:49 <strigazi> it is a specific call that is not implemented?
21:42:55 <schaney> yes
21:42:57 <strigazi> TemplateNodeInfo this >
21:42:59 <schaney> TemplateNodeInfo()
21:43:16 <strigazi> I'll discuss it with him tmr
21:43:48 <schaney> kk sounds good, I think for good faith for the upstream autoscaler guys, we might want to figure that part out
21:44:11 <schaney> before requesting merge
21:44:38 <strigazi> 100% probability of crash should be fixed first
21:44:58 <schaney> :) yeah
21:45:40 <strigazi> it is the vm flavor basically?
21:45:57 <schaney> yeah pretty much
21:46:29 <schaney> the autoscaler gets confused when there are no schedulable nodes
21:46:54 <schaney> so TemplateNodeInfo() should generate a sample node for a given nodegroup
21:47:14 <strigazi> sounds easy
21:47:56 <schaney> Yeah shouldn't be too bad, just need to fully construct the template node
21:48:07 <strigazi> this however: 'the autoscaler gets confused when there are no schedulable nodes' sounds bad.
21:48:33 <schaney> it tries to run simulations before scaling up
21:48:45 <strigazi> so how it works now?
21:49:04 <schaney> if there are valid nodes, it will use their info in the simulation
21:49:14 <strigazi> it doesn't do any simulations?
21:49:17 <schaney> if there is no valid node, it needs the result of templateNodeInfo
21:50:28 <strigazi> if you can send us a scenario to reproduce, it would help
21:51:15 <schaney> cordon all nodes and put the cluster in a situation to scale up, should show the issue
21:51:36 <strigazi> but, won't it create a new node?
21:52:04 <strigazi> I pinged him, he will try tmr
21:52:33 <flwang1> strigazi: in my testing, it scaled up well
21:52:43 <strigazi> schaney: apart from that, anything else?
21:52:52 <strigazi> to request to merge
21:53:01 <strigazi> flwang1: for me as well
21:54:18 <schaney> I think that was the last crash that I was looking at, everything else will just be tweaking
21:54:30 <strigazi> nice
21:54:38 <schaney> flwang1: to be clear, this issue is only seen when effectively scaling up from 0
21:55:02 <flwang1> schaney: i see. i haven't tested that case
21:55:39 <schaney> rare case, but I was just bringing it up since it will cause a crash
21:55:54 <flwang1> schaney: cool
21:55:58 <strigazi> we can address it
21:56:09 <schaney> awesome
21:58:16 <strigazi> we are almost out of time
21:58:43 <flwang1> strigazi: rolling upgrade status?
21:58:54 <strigazi> I'll just ask one more time, Can someone look into the CI failures?
21:59:05 <flwang1> strigazi: i did
21:59:20 <strigazi> flwang1: end meeting first and then discuss it?
21:59:20 <flwang1> the current ci failure is related to nested virt
21:59:30 <strigazi> how so?
21:59:30 <flwang1> strigazi: sure
21:59:45 <flwang1> i even popped up in infra channel
21:59:51 <strigazi> let's end the meeting first
21:59:58 <colin-> see you next time
22:00:03 <strigazi> thanks everyone
22:00:07 <flwang1> and there is no good way now, seems infra recently upgraded their kernel
22:00:16 <flwang1> manser may have more inputs
22:00:33 <strigazi> #endmeeting
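
[A small sketch of the scale-up-from-zero reproduction scenario schaney describes at 21:51:15, using the official kubernetes Python client: cordon every node, then request more replicas so pods go Pending and the autoscaler has no schedulable node to copy in its simulation, forcing it through TemplateNodeInfo(). The Deployment name "sleeper" and the namespace are placeholders, not anything from the discussion.]

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Cordon every node so no schedulable node exists for the autoscaler's
# scale-up simulation to reuse.
for node in core.list_node().items:
    core.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})

# Scale up an existing workload ("sleeper" is a placeholder Deployment) so
# the new pods stay Pending and a scale-up of the node group is triggered,
# which exercises TemplateNodeInfo() for that node group.
apps.patch_namespaced_deployment_scale(
    "sleeper", "default", {"spec": {"replicas": 10}})
```
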