21:00:29 <strigazi> #startmeeting containers
21:00:30 <openstack> Meeting started Tue Dec 11 21:00:29 2018 UTC and is due to finish in 60 minutes. The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:32 <strigazi> #topic Roll Call
21:00:33 <openstack> The meeting name has been set to 'containers'
21:00:43 <strigazi> o/
21:00:45 <cbrumm_> o/
21:01:42 <flwang> o/
21:02:51 <strigazi> #topic Announcements
21:03:01 <strigazi> None
21:03:06 <strigazi> #topic Stories/Tasks
21:03:37 <strigazi> flwang: mnaser pushed some patches to improve the CI
21:03:49 <strigazi> #link https://review.openstack.org/#/q/topic:k8s-speed+(status:open+OR+status:merged)
21:05:09 <strigazi> the last five are kind of a refactor, the first ones are functional tests only.
21:05:29 <strigazi> With nested virt, the k8s test runs in ~40 min
21:05:50 <flwang> very cool
21:05:53 <flwang> i'm keen to review them
21:06:12 <strigazi> They are small actually, thanks
21:06:27 <flwang> strigazi: 40 is ok for now, we can improve it later
21:06:53 <strigazi> ~30 of that is just deploying devstack
21:07:24 <flwang> that's quite reasonable then
21:08:29 <flwang> i love them
21:08:44 <flwang> in the future, we probably can support sonobuoy
21:08:58 <strigazi> From my side I completed the patch to use the cloud-provider-openstack: "k8s_fedora: Use external kubernetes/cloud-provider-openstack" https://review.openstack.org/577477
21:09:20 <strigazi> I did it because I was testing lxkong's patch for the delete hooks.
21:10:44 <flwang> strigazi: yep, i saw that, i just completed the review
21:11:05 <strigazi> cbrumm_: are you using this provider https://hub.docker.com/r/k8scloudprovider/openstack-cloud-controller-manager/ or the upstream from kubernetes?
21:11:06 <flwang> commented
21:11:08 <cbrumm_> looks good, we might have some small additions to make to the cloud controller so it will be good to have those in
21:11:27 <cbrumm_> we're using the k8s upstream
21:11:49 <strigazi> cbrumm_: with which k8s release?
21:12:03 <cbrumm_> yeah, matched to the same version of k8s, so 1.12 for right now
21:13:16 <strigazi> I use the out-of-tree one, outside the k8s repo. I think new changes go there, did I get this wrong?
21:14:07 <cbrumm_> we use this one https://github.com/kubernetes/cloud-provider-openstack
21:14:37 <strigazi> oh, ok, my patch is using that one, cool
21:15:15 <cbrumm_> good, that one is where all the work is going according to the slack channel
21:16:27 <strigazi> that's it from me, flwang cbrumm_ anything you want to bring up?
21:17:04 <cbrumm_> Nothing, I'm missing people at kubecon and paternity leave
21:17:36 <flwang> strigazi: the keystone auth patch is ready and in good shape by my standards
21:17:50 <flwang> may need a small polish, need your comments
21:17:54 <strigazi> This kubecon in particular must be very cool
21:18:16 <strigazi> flwang: I'm taking a look
21:18:26 <flwang> strigazi: and i'd like to discuss the rolling upgrade and auto healing if you have time
21:20:56 <strigazi> flwang: the patch is missing only the tag (?). I just need to test it.
21:21:52 <flwang> which patch?
21:22:01 <strigazi> the keystone-auth one
21:22:37 <flwang> yep, we probably need a label for its tag
21:24:26 <flwang> https://hub.docker.com/r/k8scloudprovider/k8s-keystone-auth/tags/
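flwang's link above lists the published k8s-keystone-auth images; below is a minimal sketch of fetching those tag names programmatically, for example when deciding what a Magnum label for the tag should default to. It assumes Docker Hub's v2 repositories API and its paginated JSON "results" list, which are not part of the meeting log itself:

```python
#!/usr/bin/env python3
"""List the published image tags for k8scloudprovider/k8s-keystone-auth.

Assumption: Docker Hub exposes GET /v2/repositories/<namespace>/<repo>/tags/
returning JSON with a "results" list whose entries carry a "name" field.
"""
import requests

REPO = "k8scloudprovider/k8s-keystone-auth"
URL = "https://hub.docker.com/v2/repositories/{}/tags/?page_size=100".format(REPO)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
for tag in resp.json().get("results", []):
    print(tag["name"])
```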
21:24:28 <strigazi> For upgrades, I'm finishing what we discussed two weeks ago. I was a little busy with the CVE and some downstream ticketing. It's going well
21:24:44 <flwang> cool
21:25:27 <flwang> does the upgrade support master nodes?
21:25:48 <strigazi> btw, I did this script to sync containers and create clusters fast http://paste.openstack.org/raw/737052/
21:26:15 <strigazi> flwang: yes, with rebuild (if using a volume for etcd) and in-place
21:26:53 <flwang> nice, thanks for sharing
21:27:14 <strigazi> flwang: when moving to minor releases, in-place should be ok.
21:27:27 <strigazi> 1.11.5 to 1.11.6 for example
21:27:27 <flwang> i'm happy with the current upgrade design
21:27:49 <flwang> do you want to discuss auto healing in the meeting or offline?
21:28:10 <strigazi> let's do it now, we have a 1-hour meeting anyway
21:28:28 <strigazi> based on our experience with the k8s CVE
21:29:02 <strigazi> the healing and the monitoring of the health can be two separate things that are both required.
21:29:08 <strigazi> To clarify:
21:29:46 <flwang> i'm listening
21:29:49 <strigazi> monitoring the health status, or even the version, is useful for an operator
21:30:24 <strigazi> For the CVE I had to cook a script that checks all apis if they allow anonymous-auth
21:30:40 <strigazi> *all cluster APIs in our cloud.
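A rough sketch of the kind of sweep strigazi describes, checking whether each cluster API server in the cloud accepts anonymous requests. The endpoints.txt input file, the use of /version, and the "anything other than a 401 means anonymous requests are accepted" heuristic are illustrative assumptions, not the actual script used at CERN:

```python
#!/usr/bin/env python3
"""Check a list of cluster API endpoints for anonymous access.

Assumptions (not from the meeting): endpoints.txt holds one
"https://<master-ip>:6443" per line, and an API server started with
--anonymous-auth=false rejects unauthenticated requests with 401,
while one that allows anonymous auth answers 200/403 instead.
"""
import sys

import requests
import urllib3

urllib3.disable_warnings()  # cluster CAs are not in the local trust store


def allows_anonymous(endpoint, timeout=5):
    """Return True if an unauthenticated GET is not rejected with 401."""
    try:
        resp = requests.get(endpoint + "/version", verify=False, timeout=timeout)
    except requests.RequestException as exc:
        print("{}: unreachable ({})".format(endpoint, exc), file=sys.stderr)
        return False
    return resp.status_code != 401


if __name__ == "__main__":
    with open("endpoints.txt") as f:
        for line in f:
            endpoint = line.strip()
            if not endpoint:
                continue
            state = "ANONYMOUS AUTH ENABLED" if allows_anonymous(endpoint) else "ok"
            print("{}: {}".format(endpoint, state))
```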
21:33:38 <strigazi> For autohealing, we could use the cluster autoscaler with node-problem-detector and draino instead of writing a magnum-specific autohealing mechanism
21:35:47 <flwang> anybody working on autohealing now? i'm testing node problem detector and draino now
21:35:47 <strigazi> does this make sense?
21:36:23 <cbrumm_> node problem detector is something we'll be taking a deeper look at in Jan
21:36:25 <flwang> for health status monitoring, do you mean we still prefer to keep the periodic job to check the health status in Magnum?
21:36:48 <strigazi> yes
21:36:49 <flwang> cbrumm_: cool, please join us to avoid duplicating effort
21:37:05 <flwang> strigazi: ok, good
21:37:23 <lxkong> strigazi: there are 2 levels for auto-healing, the openstack infra and the k8s one, and also we need to take care of both masters and workers
21:37:24 <cbrumm_> sure, first we'll just be looking at the raw open source product, we aren't far at all
21:38:06 <lxkong> i discussed auto-healing with flwang yesterday, we have a plan
21:39:19 <strigazi> what is the plan?
21:40:51 <strigazi> why is openstack infra a different level?
21:42:19 <lxkong> use aodh and heat to guarantee the desired node count, NPD/draino/autoscaler for the node problems of workers
21:43:47 <strigazi> fyi, flwang lxkong: if we make aodh a dependency, at least at CERN we will diverge, we just decommissioned ceilometer
21:44:40 <lxkong> but we need some component to receive notifications and trigger the alarm action
21:44:49 <lxkong> either it's ceilometer or aodh or something else
21:45:02 <strigazi> two years ago we stopped collecting metrics with ceilometer.
21:45:26 <flwang> lxkong: as we discussed
21:45:29 <strigazi> for notifications we started using logstash
21:45:55 <strigazi> I don't think aodh is a very appealing option for adoption
21:46:00 <flwang> the heat/aodh workflow is the infra/physical layer healing
21:46:11 <lxkong> yep
21:46:22 <flwang> let's focus on the healing which can be detected by NPD
21:46:36 <flwang> since catalyst cloud doesn't have aodh right now
21:46:51 <strigazi> is there smth that NPD can't detect?
21:47:33 <flwang> strigazi: if the worker node loses connection or goes down suddenly, i don't think NPD has a chance to detect that
21:47:36 <lxkong> if the node on which NPD is running dies
21:47:37 <cbrumm_> It can detect whatever you make a check for. It can use "nagios" style plugins
21:47:48 <flwang> since it's running as a pod on that kubelet, correct me if i'm wrong
21:48:27 <strigazi> flwang: if the node stopped reporting for X time it can be considered for removal
21:48:50 <flwang> by who?
21:49:20 <strigazi> autoscaler
21:51:07 <flwang> ok, i haven't added the autoscaler to my testing
21:51:21 <flwang> if the autoscaler can detect that kind of issue, then i'm ok with that
21:52:00 <strigazi> i couldn't find the link for it.
21:52:17 <flwang> that's alright
21:52:22 <strigazi> I think that we can separate the openstack-specific checks from the k8s ones.
21:52:26 <flwang> at least, we're on the same page now
21:52:50 <strigazi> aodh or other solutions can take care of nodes without the cluster noticing
21:52:55 <flwang> NPD+draino+autoscaler, and keep the health monitoring in magnum in parallel
21:53:02 <strigazi> yes
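As cbrumm_ notes above, node-problem-detector can run "nagios"-style checks through its custom plugin monitor. A minimal sketch of such a plugin, assuming the monitor's convention that exit code 0 means healthy, 1 means a problem, and stdout becomes the condition message; the read-only root filesystem check itself is only an illustrative example, not something discussed in the meeting:

```python
#!/usr/bin/env python3
"""Example "nagios-style" check for node-problem-detector's custom plugin monitor.

Assumptions (not from the meeting): the custom plugin monitor treats exit
code 0 as healthy and 1 as a problem, and uses stdout as the condition
message; the read-only root filesystem check is just one possible check.
"""
import sys


def root_fs_readonly():
    # Each line of /proc/mounts is: device mountpoint fstype options dump pass
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == "/":
                return "ro" in fields[3].split(",")
    return False


if __name__ == "__main__":
    if root_fs_readonly():
        print("root filesystem is mounted read-only")
        sys.exit(1)  # problem: NPD would set the corresponding node condition
    print("root filesystem is writable")
    sys.exit(0)  # healthy
```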
21:53:19 <flwang> how does the autoscaler deal with openstack?
21:53:43 <flwang> how does the autoscaler tell magnum it wants to replace a node?
21:54:43 <strigazi> the current option is talking to heat directly, but we are working on the nodegroups implementation and on having the magnum api expose a node removal api.
21:56:50 <flwang> strigazi: great, "the magnum api will expose a node removal api", we discussed this yesterday, we probably need an api like 'openstack coe cluster replace <cluster ID> --node-ip xxx / --node-name yyy'
21:57:29 <flwang> it would be nice if the autoscaler could talk to magnum instead of heat, because it would make the auto scaling case easier
21:57:48 <flwang> for the auto scaling scenario, magnum needs to know the number of master/worker nodes
21:58:05 <flwang> if the autoscaler talks to heat directly, magnum can't know that info
21:58:56 <strigazi> in the autoscaler implementation we test here: https://github.com/cernops/autoscaler/pull/3 the autoscaler talks to heat and then to magnum to update the node count.
21:59:51 <flwang> is it testable now?
22:00:30 <strigazi> yes, it is still WIP but it works
22:02:08 <flwang> fantastic
22:02:24 <flwang> who is the right person i should talk to? are you working on that?
22:02:55 <strigazi> flwang: you can create an issue on github
22:03:04 <flwang> good idea
22:03:12 <strigazi> it is me, tghartland and ricardo
22:03:53 <schaney> o/ sorry I missed the meeting, cbrumm sent me some of the conversation though. Was there any timeline on the nodegroup implementation in Magnum?
22:03:55 <strigazi> you can create one under cernops for now and then we can move things to k/autoscaler
22:03:59 <flwang> very cool, does that need any change in gophercloud?
22:04:39 <strigazi> schaney: stein, and the sooner the better
22:04:55 <schaney> gophercloud will need the Magnum API updates integrated once that's complete
22:05:46 <strigazi> flwang: Thomas (tghartland) updated the deps for gophercloud, when we have new magnum APIs we will have to push changes to gophercloud too
22:06:00 <schaney> awesome
22:06:39 <strigazi> schaney: you work on gophercloud? autoscaler? k8s? openstack? all of the above? :)
22:08:26 <schaney> =) yep! hoping to be able to help polish the autoscaler once everything is released, we sort of went in a different direction in the meantime (heat only)
22:09:23 <strigazi> cool, if you have input, just shoot in github, we work there to make it easy for people to give input
22:09:36 <flwang> schaney: are you the young man we met in Berlin?
22:09:44 <schaney> will do!
22:09:57 <flwang> strigazi: who is working on the magnum api change?
22:10:04 <schaney> @flwang I wasn't in Berlin. no, maybe Duc was there?
22:10:05 <flwang> is there a story to track that?
22:10:15 <flwang> schaney: no worries
22:10:26 <flwang> appreciate your contribution
22:11:11 <strigazi> for nodegroups it is Theodoros https://review.openstack.org/#/q/owner:theodoros.tsioutsias%2540cern.ch+status:open+project:openstack/magnum
22:13:05 <flwang> for the node group feature, how does the autoscaler call it to replace a single node?
22:13:27 <flwang> i think we need a place to discuss the overall design
22:13:56 <strigazi> For node-removal we need to document it explicitly. We don't have anything yet.
22:14:20 <strigazi> I'll create a new story, I'll push a spec too.
22:14:51 <strigazi> based on discussions that we had in my team, there are two options:
22:15:53 <strigazi> one is to have an endpoint to delete a node from the cluster without passing the NG. The other is to remove a node from a NG explicitly.
22:16:35 <strigazi> We might need both actually, I'll write the spec, unless someone is really keen on writing it. :)
22:17:09 <strigazi> @all Shall we end the meeting?
22:17:19 <flwang> strigazi: let's end it
22:17:31 <schaney> sure yeah, thanks!
22:17:36 <strigazi> Thanks everyone
22:17:44 <strigazi> #endmeeting