21:00:05 #startmeeting containers
21:00:06 Meeting started Tue Mar 5 21:00:05 2019 UTC and is due to finish in 60 minutes. The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:07 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:09 The meeting name has been set to 'containers'
21:00:11 #topic Roll Call
21:00:17 o/
21:00:19 o/
21:00:21 o/
21:01:48 o/
21:02:38 Hello schaney jakeyip brtknr
21:02:41 #topic Stories/Tasks
21:02:53 o/
21:03:08 I want to mention three things quickly.
21:03:18 CI for swarm and kubernetes is not passing
21:03:21 hello
21:03:34 Hello colin- imdigitaljim
21:04:09 I'm finding the error
21:04:44 for example for k8s http://logs.openstack.org/73/639873/3/check/magnum-functional-k8s/06f3638/logs/screen-h-eng.txt.gz?level=ERROR
21:04:59 The error is the same for swarm
21:06:01 If someone wants to take a look and then comment in https://review.openstack.org/#/c/640238/ or in a fix :)
21:06:15 2.
21:06:50 small regression I have found for the etcd_volume_size label (persistent storage for etcd) https://storyboard.openstack.org/#!/story/2005143
21:07:06 this fix is obvious
21:07:25 3.
21:07:33 imdigitaljim created "Cluster creators that leave WRT Keystone cause major error" https://storyboard.openstack.org/#!/story/2005145
21:07:40 yeah that's my 1
21:07:57 it has been discussed many times. the keystone team says there is no fix
21:08:21 in our cloud we manually transfer the trustee user to another account.
21:08:23 could we rework magnum to opt to poll heat based on a service account for 1 part
21:08:31 instead of using trust cred to poll heat
21:08:54 imdigitaljim: some say this is a security issue, it was like this before.
21:09:01 oh?
21:09:09 but this fixes part of the problem
21:09:11 couldn't it be scoped to readonly/GETs for heat
21:09:25 the kubernetes side
21:09:33 either might be trust transfer (like you suggest)
21:09:46 or we have been opting for teams to use a bot account type approach for their tenant
21:09:58 that will persist among users leaving
21:10:19 trusts transfer *won't* happen in keystone, ever
21:10:24 yeah
21:10:28 i doubt it would
21:10:43 does this happen only if the user is deleted from keystone?
21:10:47 they were clear about this at the Dublin PTG
21:10:56 yes
21:10:58 yeah
21:11:12 the trust powers die when the user is deleted
21:11:19 same for application creds
21:11:28 to be honest even if we fix the "magnum to opt to poll heat based on a service account"
21:11:32 that would be a huge improvement
21:11:41 that would at least enable us to delete the clusters
21:11:43 without db edits
21:11:58 admins can delete the cluster anyway
21:12:06 we could not
21:12:11 ?
21:12:14 with our admin accounts
21:12:22 the codepaths bomb out with heat polling
21:12:39 not sure where
21:12:43 is this a heat issue instead?
21:12:44 the occurrence was just yesterday
21:12:53 maybe you diverged in the code?
21:12:56 no i had to delete the heat stack underneath with normal heat functionality
21:13:03 and then manually remove the cluster via db
21:13:16 wrong policy?
21:13:16 not with that regard
21:13:32 +1 re: service account, fwiw
21:14:05 nope
21:15:20 AuthorizationFailure: unexpected keystone client error occurred: Could not find user: . (HTTP 404) (Request-ID: req-370b414f-239a-4e13-b00d-a1d87184904b)
21:15:34 ok
21:15:36 ok so figuring out why admin can't use magnum to delete a cluster but can use heat to delete a stack will be a way forward?
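
A minimal sketch of the idea floated above, polling the cluster's Heat stack with a dedicated service account instead of the per-user trust, using keystoneauth1 and python-heatclient. The credential values, the service user name, and the read-only role are placeholders/assumptions, not anything Magnum ships today:

    # Hypothetical sketch: poll a cluster's Heat stack with service credentials
    # instead of the trust stored for the cluster owner. Credential values are
    # placeholders; a real deployment would load them from configuration.
    from keystoneauth1 import loading, session
    from heatclient import client as heat_client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='https://keystone.example.com/v3',   # placeholder
        username='magnum-service',                    # assumed service user
        password='secret',
        project_name='service',
        user_domain_name='Default',
        project_domain_name='Default')
    sess = session.Session(auth=auth)

    heat = heat_client.Client('1', session=sess)

    # Only read access is needed for status polling, which is why a
    # read-only scope for the service account would be enough.
    stack = heat.stacks.get('my-cluster-abc123')      # stack name/id placeholder
    print(stack.stack_status)                         # e.g. CREATE_COMPLETE

Because the credentials belong to the service rather than the cluster creator, the polling keeps working after the creator's Keystone user is deleted, which is exactly the failure mode in the AuthorizationFailure traceback above.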
21:15:48 I wonder what is the workflow for normal resources (e.g. nova instances) in case of people leaving?
21:16:02 the problem is magnum can't check the status of the stack
21:16:17 it would be nice if the trust was owned by a role+domain rather than a user, so anyone with the role+domain can act as that role+domain
21:16:24 ^
21:16:25 +1
21:16:26 +1
21:16:36 guess it's too late to refactor things now...
21:16:52 imo not really
21:16:53 it is a bit bad as well
21:17:04 but it can be bad based on the use-case
21:17:07 for us it's fine
21:17:11 the trust creds are a leak
21:17:39 yeah
21:17:44 the trust creds on the server
21:17:46 userA takes trust creds from userB on a cluster they both own
21:17:50 and you can get access to other clusters
21:17:58 userA is fired, can still access keystone
21:18:23 oh, because trust is still out in the wild?
21:18:32 the polling issue is different than the trust in the cluster
21:18:37 yeah
21:18:40 change trust password *rolls eyes*
21:18:42 different issues
21:18:56 we can do service account for polling again
21:19:07 but an admin readonly scope
21:19:08 ?
21:19:21 That is possible
21:19:32 since the magnum controller is managed by admins
21:19:35 yeah
21:19:44 i think that would be a satisfactory solution
21:19:53 the clusters we can figure out/delete/etc
21:20:03 but magnum's behavior is a bit unavoidable
21:20:39 thanks strigazi!
21:20:43 you going to denver?
21:21:04 https://github.com/openstack/magnum/commit/f895b2bd0922f29a9d6b08617cb60258fa101c68#diff-e004adac7f8cb91a28c210e2a8d08ee9
21:21:19 I'm going yes
21:21:31 lets meet up!
21:22:01 sure thing :)
21:22:58 Is anyone going to work on the polling thing? maybe a longer description first in storyboard?
21:23:12 strigazi: re https://storyboard.openstack.org/#!/story/2005145 i think you and ricardo raised this issue before on the mailing list
21:24:10 yes, I mentioned this. I discussed it with the keystone team in Dublin
21:24:11 and IIRC, we need support from the keystone side?
21:24:41 there won't be help or change
21:24:51 from the keystone side
21:25:10 22:11 < strigazi> trusts transfer *won't* happen in keystone, ever
21:25:24 nor for application credentials
21:25:25 strigazi: so we have to fix it in magnum?
21:25:31 yes
21:25:45 two issues, one is the polling heat issue
21:26:03 2nd, the cluster inside the cluster must be rotated
21:26:11 creds inside*
21:26:28 we had a design for this in Dublin, but no manpower
21:26:33 yes, creds :)
21:26:43 yeah 1) trust on magnum, fixable and 2) trust on cluster, no clear path yet
21:27:06 2) we have a rotate certificates api with noop
21:27:17 it can rotate the certs and the trust
21:27:22 that was the design
21:27:26 strigazi: ok, i think we need longer discussion for this one
21:27:44 im more concerned about 1) for the moment which is smaller in scope
21:27:52 2) might be more challenging and needs more discussion/desing
21:27:55 design
21:27:57 no :) we did it one year ago, someone can implement it :)
21:28:50 I'll bring up the pointer in storyboard
21:30:17 For the autoscaler, are there any outstanding comments? Can we start pushing the maintainers to accept it?
21:30:36 strigazi: i'm happy with current status.
21:30:43 it passed my test
21:31:12 strigazi: there are some future enhancements that I am hoping to work with you guys on
21:31:17 strigazi: so we can/should start to push the CA team to merge it
21:32:22 schaney: do you want to leave a comment you are happy with the current state? we can ping the CA team, the {'k8s', 'sig', 'openstack'} in some order
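
On the rotate-certificates API mentioned above as the planned hook for refreshing both the certs and the in-cluster trust: a rough operator-side sketch of triggering it through python-magnumclient. The client factory arguments and the rotate_ca() call name are assumptions to verify against the installed client, and on most drivers the server side of this call is still the no-op noted in the discussion:

    # Rough sketch only: ask Magnum to rotate the CA for a cluster. The Dublin
    # design discussed above would hang cert *and* trust rotation off this API.
    from keystoneauth1 import loading, session
    from magnumclient import client as magnum_client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='https://keystone.example.com/v3',   # placeholder
        username='admin', password='secret',          # placeholder admin creds
        project_name='admin',
        user_domain_name='Default', project_domain_name='Default')
    sess = session.Session(auth=auth)

    magnum = magnum_client.Client('1', session=sess)  # assumed factory signature
    cluster = magnum.clusters.get('my-cluster')       # name or UUID placeholder
    magnum.certificates.rotate_ca(cluster_uuid=cluster.uuid)  # assumed method name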
21:32:23 schaney: sure, the /resize api is coming
21:34:44 I can leave a comment yeah
21:35:15 Are you alright with me including some of the stipulations in the comment?
21:35:41 for things like nodegroups, resize, and a couple bugs
21:35:59 schaney: I don't know how it will work for them
21:36:18 same, not sure if it's better to get something out there and start iterating
21:36:29 +1 ^^
21:36:33 or try to get it perfect first
21:36:58 schaney: i would suggest tracking them in magnum or opening separate issues later, but just my 2c
21:37:30 we'll probably just do PRs against the first iteration
21:37:31 track them in magnum vs the autoscaler?
21:37:43 and use issues in autoscaler repo probably
21:37:47 ./shrug
21:38:27 yeah, us making PRs to the autoscaler will work for us going forward
21:38:40 the current PR has so much going on already
21:38:48 We can focus on the things that work atm, and when it is in, PRs in the CA repo are fine
21:38:53 issues in autoscaler, but don't scare them :)
21:39:03 strigazi: +1
21:39:40 one question: has tghartland been looking into the TemplateNodeInfo interface method implementation?
21:39:41 as long as we agree on the direction
21:40:08 I think the current implementation will cause a crash
21:40:16 imho i think we're all heading in the same direction
21:40:47 crash on what?
21:40:56 crash on what? why?
21:40:56 the autoscaler
21:41:24 is it reproducible?
21:41:58 Should be, I am curious whether you guys have seen it
21:42:03 no
21:42:28 I'll double check, but the current implementation should crash 100% of the time when it gets called
21:42:49 it is a specific call that is not implemented?
21:42:55 yes
21:42:57 TemplateNodeInfo this >
21:42:59 TemplateNodeInfo()
21:43:16 I'll discuss it with him tmr
21:43:48 kk sounds good, I think in good faith to the upstream autoscaler guys, we might want to figure that part out
21:44:11 before requesting merge
21:44:38 100% probability of crash should be fixed first
21:44:58 :) yeah
21:45:40 it is the vm flavor basically?
21:45:57 yeah pretty much
21:46:29 the autoscaler gets confused when there are no schedulable nodes
21:46:54 so TemplateNodeInfo() should generate a sample node for a given nodegroup
21:47:14 sounds easy
21:47:56 Yeah shouldn't be too bad, just need to fully construct the template node
21:48:07 this however: 'the autoscaler gets confused when there are no schedulable nodes' sounds bad.
21:48:33 it tries to run simulations before scaling up
21:48:45 so how does it work now?
21:49:04 if there are valid nodes, it will use their info in the simulation
21:49:14 it doesn't do any simulations?
21:49:17 if there is no valid node, it needs the result of TemplateNodeInfo
21:50:28 if you can send us a scenario to reproduce, it would help
21:51:15 cordon all nodes and put the cluster in a situation to scale up, should show the issue
21:51:36 but, won't it create a new node?
21:52:04 I pinged him, he will try tmr
21:52:33 strigazi: in my testing, it scaled up well
21:52:43 schaney: apart from that, anything else?
21:52:52 to request to merge
21:53:01 flwang1: for me as well
21:54:18 I think that was the last crash that I was looking at, everything else will just be tweaking
21:54:30 nice
21:54:38 flwang1: to be clear, this issue is only seen when effectively scaling up from 0
21:55:02 schaney: i see. i haven't tested that case
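
A small sketch of the reproduction suggested above (cordon everything, then force a scale-up so the autoscaler has no running node to copy and must fall back on TemplateNodeInfo()), using the kubernetes Python client; the pod name, image, resource request, and namespace are arbitrary placeholders:

    # Sketch of the repro: cordon every node, then create a pod that cannot be
    # scheduled, so a scale-up simulation has to build a template node for the
    # nodegroup via TemplateNodeInfo().
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Cordon all nodes (spec.unschedulable = True), like `kubectl cordon`.
    for node in core.list_node().items:
        core.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})

    # A pending pod with a resource request forces the scale-up path.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="scale-from-zero-probe"),  # placeholder
        spec=client.V1PodSpec(containers=[client.V1Container(
            name="pause",
            image="k8s.gcr.io/pause:3.1",                            # any small image
            resources=client.V1ResourceRequirements(requests={"cpu": "100m"}))]))
    core.create_namespaced_pod(namespace="default", body=pod)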
21:55:39 rare case, but I was just bringing it up since it will cause a crash
21:55:54 schaney: cool
21:55:58 we can address it
21:56:09 awesome
21:58:16 we are almost out of time
21:58:43 strigazi: rolling upgrade status?
21:58:54 I'll just ask one more time: can someone look into the CI failures?
21:59:05 strigazi: i did
21:59:20 flwang1: end the meeting first and then discuss it?
21:59:20 the current ci failure is related to nested virt
21:59:30 how so?
21:59:30 strigazi: sure
21:59:45 i even popped into the infra channel
21:59:51 let's end the meeting first
21:59:58 see you next time
22:00:03 thanks everyone
22:00:07 and there is no good way now, seems infra recently upgraded their kernel
22:00:16 mnaser may have more input
22:00:33 #endmeeting