21:00:29 #startmeeting containers
21:00:30 Meeting started Tue Dec 11 21:00:29 2018 UTC and is due to finish in 60 minutes. The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:31 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:32 #topic Roll Call
21:00:33 The meeting name has been set to 'containers'
21:00:43 o/
21:00:45 o/
21:01:42 o/
21:02:51 #topic Announcements
21:03:01 None
21:03:06 #topic Stories/Tasks
21:03:37 flwang: mnaser pushed some patches to improve the CI
21:03:49 #link https://review.openstack.org/#/q/topic:k8s-speed+(status:open+OR+status:merged)
21:05:09 the last five are kind of a refactor, the first ones are functional tests only.
21:05:29 With nested virt, the k8s test runs in ~40 min
21:05:50 very cool
21:05:53 i'm keen to review them
21:06:12 They are small actually, thanks
21:06:27 strigazi: 40 is ok for now, we can improve it later
21:06:53 ~30 min of that is just deploying devstack
21:07:24 that's quite reasonable then
21:08:29 i love them
21:08:44 in the future, we probably can support sonobuoy
21:08:58 From my side I completed the patch to use the cloud-provider-openstack: k8s_fedora: Use external kubernetes/cloud-provider-openstack https://review.openstack.org/577477
21:09:20 I did it because I was testing lxkong's patch for the delete hooks.
21:10:44 strigazi: yep, i saw that, i just completed the review
21:11:05 cbrumm_: are you using this provider https://hub.docker.com/r/k8scloudprovider/openstack-cloud-controller-manager/ or the upstream from kubernetes?
21:11:06 commented
21:11:08 looks good, we might have some small additions to make to the cloud controller so it will be good to have those in
21:11:27 we're using the k8s upstream
21:11:49 cbrumm_: with which k8s release?
21:12:03 yeah, matched to the same version of k8s, so 1.12 for right now
21:13:16 I use the out-of-tree, out-of-k8s-repo one.
I think new changes go there, did I get this wrong?
21:14:07 we use this one https://github.com/kubernetes/cloud-provider-openstack
21:14:37 oh, ok, my patch is using that one, cool
21:15:15 good, that one is where all the work is going according to the slack channel
21:16:27 that's it from me, flwang cbrumm_ anything you want to bring up?
21:17:04 Nothing, I'm missing people at kubecon and paternity leave
21:17:36 strigazi: the keystone auth patch is ready and in good shape by my standards
21:17:50 may need small polish, need your comments
21:17:54 This kubecon in particular must be very cool
21:18:16 flwang: I'm taking a look
21:18:26 strigazi: and i'd like to discuss the rolling upgrade and auto healing if you have time
21:20:56 flwang: the patch is missing only the tag (?). I just need to test it.
21:21:52 which patch?
21:22:01 the keystone-auth one
21:22:37 yep, we probably need a label for its tag
21:24:26 https://hub.docker.com/r/k8scloudprovider/k8s-keystone-auth/tags/
21:24:28 For upgrades, I'm finishing what we discussed two weeks ago. I was a little busy with the CVE and some downstream ticketing. It's going well
21:24:44 cool
21:25:27 does the upgrade support master nodes?
21:25:48 btw, I did this script to sync containers and create clusters fast http://paste.openstack.org/raw/737052/
21:26:15 flwang: yes, with rebuild (if using a volume for etcd) and in-place
21:26:53 nice, thanks for sharing
21:27:14 flwang: when moving to minor releases, in-place should be ok.
21:27:27 1.11.5 to 1.11.6 for example
21:27:27 i'm happy with the current upgrade design
21:27:49 do you want to discuss auto healing in the meeting or offline?
21:28:10 let's do it now, we have a 1 hour meeting anyway
21:28:28 based on our experience with the k8s CVE
21:29:02 the healing and monitoring of the health can be two separate things that are both required.
21:29:08 To clarify:
21:29:46 i'm listening
21:29:49 monitoring the health status or even version?
it is useful for an operator
21:30:24 For the CVE I had to cook a script that checks all APIs for whether they allow anonymous-auth
21:30:40 *all cluster APIs in our cloud.
21:33:38 For autohealing, we could use the cluster autoscaler with node-problem-detector and draino instead of writing a magnum-specific autohealing mechanism
21:35:47 anybody working on autohealing now? i'm testing node problem detector and draino now
21:35:47 does this make sense?
21:36:23 node problem detector is something we'll be taking a deeper look at in Jan
21:36:25 for health status monitoring, do you mean we still prefer to keep the periodic job to check the health status in Magnum?
21:36:48 yes
21:36:49 cbrumm_: cool, please join us to avoid duplicating effort
21:37:05 strigazi: ok, good
21:37:23 strigazi: there are 2 levels for auto-healing, the openstack infra and the k8s, and also we need to take care of both masters and workers
21:37:24 sure, first we'll just be looking at the raw open source product, we aren't far at all
21:38:06 i discussed auto-healing with flwang yesterday, we have a plan
21:39:19 what is the plan?
21:40:51 why is openstack infra a different level?
21:42:19 use aodh and heat to guarantee the desired node count, NPD/draino/autoscaler for the node problems of workers
21:43:47 fyi, flwang lxkong if we make aodh a dependency, at least at CERN we will diverge, we just decommissioned ceilometer
21:44:40 but we need some component to receive notifications and trigger alarm actions
21:44:49 either it's ceilometer or aodh or something else
21:45:02 two years ago we stopped collecting metrics with ceilometer.
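The anonymous-auth audit mentioned above for the CVE could be sketched as a small shell script (the actual script from the meeting was not shared; the endpoint, port, and variable names here are assumptions):

```shell
#!/bin/sh
# Hedged sketch of an anonymous-auth audit across cluster API servers.
# An API server started with --anonymous-auth=false rejects unauthenticated
# requests with 401; with anonymous auth enabled, the request is authenticated
# as system:anonymous and typically yields 200 or 403 (RBAC denial) instead.
interpret_code() {
    case "$1" in
        401)     echo "anonymous-auth disabled" ;;
        200|403) echo "anonymous-auth ENABLED" ;;
        *)       echo "unexpected status: $1" ;;
    esac
}

# Per-cluster check (requires curl and network access to the master;
# MASTER_IP is an assumed variable):
# code=$(curl -ks -o /dev/null -w '%{http_code}' "https://${MASTER_IP}:6443/version")
# interpret_code "$code"
```

Looping this over the master IPs of every cluster in the cloud gives a quick exposure report without touching the nodes themselves.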
21:45:26 lxkong: as we discussed
21:45:29 for notifications we started using logstash
21:45:55 I don't think aodh is a very appealing option for adoption
21:46:00 the heat/aodh workflow is the infra/physical layer healing
21:46:11 yep
21:46:22 let's focus on the healing which can be detected by NPD
21:46:36 since catalyst cloud doesn't have aodh right now
21:46:51 is there smth that NPD can't detect?
21:47:33 strigazi: if the worker node loses connection or goes down suddenly, i don't think NPD has a chance to detect it
21:47:36 if the node on which NPD is running dies
21:47:37 It can detect whatever you make a check for. It can use "nagios" style plugins
21:47:48 since it's running as a pod on that kubelet, correct me if i'm wrong
21:48:27 flwang: if the node stopped reporting for X time it can be considered for removal
21:48:50 by who?
21:49:20 autoscaler
21:51:07 ok, i haven't added autoscaler to my testing
21:51:21 if autoscaler can detect that kind of issue, then i'm ok with that
21:52:00 i couldn't find the link for it.
21:52:17 that's alright
21:52:22 I think that we can separate the openstack specific checks from the k8s ones.
21:52:26 at least, we're on the same page now
21:52:50 aodh or other solutions can take care of nodes without the cluster noticing
21:52:55 NPD+draino+autoscaler and keep the health monitoring in magnum in parallel
21:53:02 yes
21:53:19 how does autoscaler deal with openstack?
21:53:43 how does autoscaler tell magnum it wants to replace a node?
21:54:43 the current option is talking to heat directly, but we are working on the nodegroups implementation and on a method so that the magnum api will expose a node removal api.
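The "nagios style plugins" mentioned above refer to node-problem-detector's custom plugin monitor: it periodically runs an external check script and maps its exit status to a node condition. A minimal sketch of such a config follows; the condition name, check path, and intervals are all illustrative assumptions, not anything decided in this meeting:

```shell
#!/bin/sh
# Write a hypothetical NPD custom-plugin-monitor config to a temp file.
# NPD would be pointed at it with --config.custom-plugin-monitor=<path>.
# The referenced network_check.sh is an assumed operator-provided script
# that exits 0 when healthy and 1 when a problem is detected.
cat > /tmp/custom-plugin-monitor.json <<'EOF'
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s"
  },
  "source": "network-custom-plugin-monitor",
  "conditions": [
    {
      "type": "NetworkProblem",
      "reason": "NetworkIsUp",
      "message": "node network is up"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NetworkProblem",
      "reason": "NetworkIsDown",
      "path": "/etc/node-problem-detector/network_check.sh",
      "timeout": "3s"
    }
  ]
}
EOF
echo "wrote /tmp/custom-plugin-monitor.json"
```

Draino can then be configured to cordon and drain nodes carrying the resulting condition, leaving the actual node replacement to the autoscaler, which matches the NPD/draino/autoscaler split discussed above.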
21:56:50 strigazi: great, "the magnum api will expose a node removal api", we discussed this yesterday, we probably need an api like 'openstack coe cluster replace --node-ip xxx / --node-name yyy'
21:57:29 it would be nice if autoscaler could talk to magnum instead of heat because it would be easier for the auto scaling case
21:57:48 for the auto scaling scenario, magnum needs to know the number of master/worker nodes
21:58:05 if autoscaler talks to heat directly, magnum can't know that info
21:58:56 in the autoscaler implementation we test here: https://github.com/cernops/autoscaler/pull/3 the autoscaler talks to heat and then to magnum to update the node count.
21:59:51 is it testable now?
22:00:30 yes, it is still WIP but it works
22:02:08 fantastic
22:02:24 who is the right person i should talk to? are you working on that?
22:02:55 flwang: you can create an issue on github
22:03:04 good idea
22:03:12 it is me, tghartland and ricardo
22:03:53 o/ sorry I missed the meeting, cbrumm sent me some of the conversation though. Was there any timeline on the nodegroup implementation in Magnum?
22:03:55 you can create one under cernops for now and then we can move things to k/autoscaler
22:03:59 very cool, does that need any change in gophercloud?
22:04:39 schaney: stein, and the sooner the better
22:04:55 gophercloud will need the Magnum API updates integrated once that's complete
22:05:46 flwang: Thomas (tghartland) updated the deps for gophercloud, when we have new magnum APIs we will have to push changes to gophercloud too
22:06:00 awesome
22:06:39 schaney: you work on gophercloud? autoscaler? k8s? openstack? all of the above? :)
22:08:26 =) yep!
hoping to be able to help polish the autoscaler once everything is released, we sort of went a different direction in the meantime (heat only)
22:09:23 cool, if you have input, just shoot in github, we work there to make it easy for people to give input
22:09:36 schaney: are you the young man we met at Berlin?
22:09:44 will do!
22:09:57 strigazi: who is working on the magnum api change?
22:10:04 @flwang I wasn't in Berlin, no, maybe Duc was there?
22:10:05 is there a story to track that?
22:10:15 schaney: no worries
22:10:26 appreciate your contribution
22:11:11 for nodegroups it is Theodoros https://review.openstack.org/#/q/owner:theodoros.tsioutsias%2540cern.ch+status:open+project:openstack/magnum
22:13:05 for the node group feature, how does the autoscaler call it to replace a single node?
22:13:27 i think we need a place to discuss the overall design
22:13:56 For node-removal we need to document it explicitly. We don't have something yet.
22:14:20 I'll create a new story, I'll push a spec too.
22:14:51 based on discussions that we had in my team, the options are two:
22:15:53 one is to have an endpoint to delete a node from the cluster without passing the NG. The other is to remove a node from a NG explicitly.
22:16:35 We might need both actually, I'll write the spec, unless someone is really keen on writing it. :)
22:17:09 @all Shall we end the meeting?
22:17:19 strigazi: let's end it
22:17:31 sure yeah, thanks!
22:17:36 Thanks everyone
22:17:44 #endmeeting