21:00:29 <strigazi> #startmeeting containers
21:00:30 <openstack> Meeting started Tue Dec 11 21:00:29 2018 UTC and is due to finish in 60 minutes. The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:31 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:32 <strigazi> #topic Roll Call
21:00:33 <openstack> The meeting name has been set to 'containers'
21:00:43 <strigazi> o/
21:00:45 <cbrumm_> o/
21:01:42 <flwang> o/
21:02:51 <strigazi> #topic Announcements
21:03:01 <strigazi> None
21:03:06 <strigazi> #topic Stories/Tasks
21:03:37 <strigazi> flwang: mnaser pushed some patches to improve the CI
21:03:49 <strigazi> #link https://review.openstack.org/#/q/topic:k8s-speed+(status:open+OR+status:merged)
21:05:09 <strigazi> the last five are kind of a refactor, the first ones are functional tests only.
21:05:29 <strigazi> With nested virt, the k8s test runs in ~40 min
21:05:50 <flwang> very cool
21:05:53 <flwang> i'm keen to review them
21:06:12 <strigazi> They are small actually, thanks
21:06:27 <flwang> strigazi: 40 is ok for now, we can improve it later
21:06:53 <strigazi> ~30 of that is just deploying devstack
21:07:24 <flwang> that's quite reasonable then
21:08:29 <flwang> i love them
21:08:44 <flwang> in the future, we probably can support sonobuoy
21:08:58 <strigazi> From my side I completed the patch to use the cloud-provider-openstack: "k8s_fedora: Use external kubernetes/cloud-provider-openstack" https://review.openstack.org/577477
21:09:20 <strigazi> I did it because I was testing lxkong's patch for the delete hooks.
21:10:44 <flwang> strigazi: yep, i saw that, i just completed the review
21:11:05 <strigazi> cbrumm_: are you using this provider https://hub.docker.com/r/k8scloudprovider/openstack-cloud-controller-manager/ or the upstream from kubernetes?
21:11:06 <flwang> commented
21:11:08 <cbrumm_> looks good, we might have some small additions to make to the cloud controller so it will be good to have those in
21:11:27 <cbrumm_> we're using the k8s upstream
21:11:49 <strigazi> cbrumm_: with which k8s release?
21:12:03 <cbrumm_> yeah, matched to the same version of k8s, so 1.12 for right now
21:13:16 <strigazi> I use the out-of-tree one, outside the k8s repo. I think new changes go there, did I get this wrong?
21:14:07 <cbrumm_> we use this one https://github.com/kubernetes/cloud-provider-openstack
21:14:37 <strigazi> oh, ok, my patch is using that one, cool
21:15:15 <cbrumm_> good, that one is where all the work is going according to the slack channel
21:16:27 <strigazi> that's it from me, flwang cbrumm_ anything you want to bring up?
21:17:04 <cbrumm_> Nothing, I'm missing people at kubecon and paternity leave
21:17:36 <flwang> strigazi: the keystone auth patch is ready and in good shape by my standards
21:17:50 <flwang> may need a small polish, need your comments
21:17:54 <strigazi> This kubecon in particular must be very cool
21:18:16 <strigazi> flwang: I'm taking a look
21:18:26 <flwang> strigazi: and i'd like to discuss the rolling upgrade and auto healing if you have time
21:20:56 <strigazi> flwang: the patch is missing only the tag (?). I just need to test it.
21:21:52 <flwang> which patch?
21:22:01 <strigazi> the keystone-auth one
21:22:37 <flwang> yep, we probably need a label for its tag
21:24:26 <flwang> https://hub.docker.com/r/k8scloudprovider/k8s-keystone-auth/tags/
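flwang's link above lists the published k8s-keystone-auth images; below is a minimal sketch of fetching those tag names programmatically, for example when deciding what a Magnum label for the tag should default to. It assumes Docker Hub's v2 repositories API and its paginated JSON "results" list, which are not part of the meeting log itself:

```python
#!/usr/bin/env python3
"""List the published image tags for k8scloudprovider/k8s-keystone-auth.

Assumption: Docker Hub exposes GET /v2/repositories/<namespace>/<repo>/tags/
returning JSON with a "results" list whose entries carry a "name" field.
"""
import requests

REPO = "k8scloudprovider/k8s-keystone-auth"
URL = "https://hub.docker.com/v2/repositories/{}/tags/?page_size=100".format(REPO)

resp = requests.get(URL, timeout=10)
resp.raise_for_status()
for tag in resp.json().get("results", []):
    print(tag["name"])
```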
21:24:28 <strigazi> For upgrades, I'm finishing what we discussed two weeks ago. I was a little busy with the CVE and some downstream ticketing. It's going well
21:24:44 <flwang> cool
21:25:27 <flwang> does the upgrade support master nodes?
21:25:48 <strigazi> btw, I did this script to sync containers and create clusters fast http://paste.openstack.org/raw/737052/
21:26:15 <strigazi> flwang: yes, with rebuild (if using a volume for etcd) and in-place
21:26:53 <flwang> nice, thanks for sharing
21:27:14 <strigazi> flwang: when moving to minor releases, in-place should be ok.
21:27:27 <strigazi> 1.11.5 to 1.11.6 for example
21:27:27 <flwang> i'm happy with the current upgrade design
21:27:49 <flwang> do you want to discuss auto healing in the meeting or offline?
21:28:10 <strigazi> let's do it now, we have a 1-hour meeting anyway
21:28:28 <strigazi> based on our experience with the k8s CVE
21:29:02 <strigazi> the healing and the monitoring of the health can be two separate things that are both required.
21:29:08 <strigazi> To clarify:
21:29:46 <flwang> i'm listening
21:29:49 <strigazi> monitoring the health status, or even the version, is useful for an operator
21:30:24 <strigazi> For the CVE I had to cook a script that checks all apis if they allow anonymous-auth
21:30:40 <strigazi> *all cluster APIs in our cloud.
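A rough sketch of the kind of sweep strigazi describes, checking whether each cluster API server in the cloud accepts anonymous requests. The endpoints.txt input file, the use of /version, and the "anything other than a 401 means anonymous requests are accepted" heuristic are illustrative assumptions, not the actual script used at CERN:

```python
#!/usr/bin/env python3
"""Check a list of cluster API endpoints for anonymous access.

Assumptions (not from the meeting): endpoints.txt holds one
"https://<master-ip>:6443" per line, and an API server started with
--anonymous-auth=false rejects unauthenticated requests with 401,
while one that allows anonymous auth answers 200/403 instead.
"""
import sys

import requests
import urllib3

urllib3.disable_warnings()  # cluster CAs are not in the local trust store


def allows_anonymous(endpoint, timeout=5):
    """Return True if an unauthenticated GET is not rejected with 401."""
    try:
        resp = requests.get(endpoint + "/version", verify=False, timeout=timeout)
    except requests.RequestException as exc:
        print("{}: unreachable ({})".format(endpoint, exc), file=sys.stderr)
        return False
    return resp.status_code != 401


if __name__ == "__main__":
    with open("endpoints.txt") as f:
        for line in f:
            endpoint = line.strip()
            if not endpoint:
                continue
            state = "ANONYMOUS AUTH ENABLED" if allows_anonymous(endpoint) else "ok"
            print("{}: {}".format(endpoint, state))
```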
21:33:38 <strigazi> For autohealing, we could use the cluster autoscaler with node-problem-detector and draino instead of writing a magnum-specific autohealing mechanism
21:35:47 <flwang> anybody working on autohealing now? i'm testing node problem detector and draino now
21:35:47 <strigazi> does this make sense?
21:36:23 <cbrumm_> node problem detector is something we'll be taking a deeper look at in Jan
21:36:25 <flwang> for health status monitoring, do you mean we still prefer to keep the periodic job to check the health status in Magnum?
21:36:48 <strigazi> yes
21:36:49 <flwang> cbrumm_: cool, please join us to avoid duplicating effort
21:37:05 <flwang> strigazi: ok, good
21:37:23 <lxkong> strigazi: there are 2 levels for auto-healing, the openstack infra and the k8s one, and also we need to take care of both masters and workers
21:37:24 <cbrumm_> sure, first we'll just be looking at the raw open source product, we aren't far at all
21:38:06 <lxkong> i discussed auto-healing with flwang yesterday, we have a plan
21:39:19 <strigazi> what is the plan?
21:40:51 <strigazi> why is openstack infra a different level?
21:42:19 <lxkong> use aodh and heat to guarantee the desired node count, NPD/draino/autoscaler for the node problems of workers
21:43:47 <strigazi> fyi, flwang lxkong: if we make aodh a dependency, at least at CERN we will diverge, we just decommissioned ceilometer
21:44:40 <lxkong> but we need some component to receive notifications and trigger the alarm action
21:44:49 <lxkong> either it's ceilometer or aodh or something else
21:45:02 <strigazi> two years ago we stopped collecting metrics with ceilometer.
21:45:26 <flwang> lxkong: as we discussed
21:45:29 <strigazi> for notifications we started using logstash
21:45:55 <strigazi> I don't think aodh is a very appealing option for adoption
21:46:00 <flwang> the heat/aodh workflow is the infra/physical layer healing
21:46:11 <lxkong> yep
21:46:22 <flwang> let's focus on the healing which can be detected by NPD
21:46:36 <flwang> since catalyst cloud doesn't have aodh right now
21:46:51 <strigazi> is there smth that NPD can't detect?
21:47:33 <flwang> strigazi: if the worker node loses connection or goes down suddenly, i don't think NPD has a chance to detect that
21:47:36 <lxkong> if the node on which NPD is running dies
21:47:37 <cbrumm_> It can detect whatever you make a check for. It can use "nagios" style plugins
21:47:48 <flwang> since it's running as a pod on that kubelet, correct me if i'm wrong
21:48:27 <strigazi> flwang: if the node stopped reporting for X time it can be considered for removal
21:48:50 <flwang> by who?
21:49:20 <strigazi> autoscaler
21:51:07 <flwang> ok, i haven't added the autoscaler to my testing
21:51:21 <flwang> if the autoscaler can detect that kind of issue, then i'm ok with that
21:52:00 <strigazi> i couldn't find the link for it.
21:52:17 <flwang> that's alright
21:52:22 <strigazi> I think that we can separate the openstack-specific checks from the k8s ones.
21:52:26 <flwang> at least, we're on the same page now
21:52:50 <strigazi> aodh or other solutions can take care of nodes without the cluster noticing
21:52:55 <flwang> NPD+draino+autoscaler, and keep the health monitoring in magnum in parallel
21:53:02 <strigazi> yes
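As cbrumm_ notes above, node-problem-detector can run "nagios"-style checks through its custom plugin monitor. A minimal sketch of such a plugin, assuming the monitor's convention that exit code 0 means healthy, 1 means a problem, and stdout becomes the condition message; the read-only root filesystem check itself is only an illustrative example, not something discussed in the meeting:

```python
#!/usr/bin/env python3
"""Example "nagios-style" check for node-problem-detector's custom plugin monitor.

Assumptions (not from the meeting): the custom plugin monitor treats exit
code 0 as healthy and 1 as a problem, and uses stdout as the condition
message; the read-only root filesystem check is just one possible check.
"""
import sys


def root_fs_readonly():
    # Each line of /proc/mounts is: device mountpoint fstype options dump pass
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == "/":
                return "ro" in fields[3].split(",")
    return False


if __name__ == "__main__":
    if root_fs_readonly():
        print("root filesystem is mounted read-only")
        sys.exit(1)  # problem: NPD would set the corresponding node condition
    print("root filesystem is writable")
    sys.exit(0)  # healthy
```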
21:53:19 <flwang> how does the autoscaler deal with openstack?
21:53:43 <flwang> how does the autoscaler tell magnum it wants to replace a node?
21:54:43 <strigazi> the current option is talking to heat directly, but we are working on the nodegroups implementation and on having the magnum api expose a node removal api.
21:56:50 <flwang> strigazi: great, "the magnum api will expose a node removal api", we discussed this yesterday, we probably need an api like 'openstack coe cluster replace <cluster ID> --node-ip xxx / --node-name yyy'
21:57:29 <flwang> it would be nice if the autoscaler could talk to magnum instead of heat, because it would make the auto scaling case easier
21:57:48 <flwang> for the auto scaling scenario, magnum needs to know the number of master/worker nodes
21:58:05 <flwang> if the autoscaler talks to heat directly, magnum can't know that info
21:58:56 <strigazi> in the autoscaler implementation we test here: https://github.com/cernops/autoscaler/pull/3 the autoscaler talks to heat and then to magnum to update the node count.
21:59:51 <flwang> is it testable now?
22:00:30 <strigazi> yes, it is still WIP but it works
22:02:08 <flwang> fantastic
22:02:24 <flwang> who is the right person i should talk to? are you working on that?
22:02:55 <strigazi> flwang: you can create an issue on github
22:03:04 <flwang> good idea
22:03:12 <strigazi> it is me, tghartland and ricardo
22:03:53 <schaney> o/ sorry I missed the meeting, cbrumm sent me some of the conversation though. Was there any timeline on the nodegroup implementation in Magnum?
22:03:55 <strigazi> you can create one under cernops for now and then we can move things to k/autoscaler
22:03:59 <flwang> very cool, does that need any change in gophercloud?
22:04:39 <strigazi> schaney: stein, and the sooner the better
22:04:55 <schaney> gophercloud will need the Magnum API updates integrated once that's complete
22:05:46 <strigazi> flwang: Thomas (tghartland) updated the deps for gophercloud, when we have new magnum APIs we will have to push changes to gophercloud too
22:06:00 <schaney> awesome
22:06:39 <strigazi> schaney: you work on gophercloud? autoscaler? k8s? openstack? all of the above? :)
22:08:26 <schaney> =) yep! hoping to be able to help polish the autoscaler once everything is released, we sort of went in a different direction in the meantime (heat only)
22:09:23 <strigazi> cool, if you have input, just shoot in github, we work there to make it easy for people to give input
22:09:36 <flwang> schaney: are you the young man we met in Berlin?
22:09:44 <schaney> will do!
22:09:57 <flwang> strigazi: who is working on the magnum api change?
22:10:04 <schaney> @flwang I wasn't in Berlin. no, maybe Duc was there?
22:10:05 <flwang> is there a story to track that?
22:10:15 <flwang> schaney: no worries
22:10:26 <flwang> appreciate your contribution
22:11:11 <strigazi> for nodegroups it is Theodoros https://review.openstack.org/#/q/owner:theodoros.tsioutsias%2540cern.ch+status:open+project:openstack/magnum
22:13:05 <flwang> for the node group feature, how does the autoscaler call it to replace a single node?
22:13:27 <flwang> i think we need a place to discuss the overall design
22:13:56 <strigazi> For node-removal we need to document it explicitly. We don't have anything yet.
22:14:20 <strigazi> I'll create a new story, I'll push a spec too.
22:14:51 <strigazi> based on discussions that we had in my team, there are two options:
22:15:53 <strigazi> one is to have an endpoint to delete a node from the cluster without passing the NG. The other is to remove a node from a NG explicitly.
22:16:35 <strigazi> We might need both actually, I'll write the spec, unless someone is really keen on writing it. :)
22:17:09 <strigazi> @all Shall we end the meeting?
22:17:19 <flwang> strigazi: let's end it
22:17:31 <schaney> sure yeah, thanks!
22:17:36 <strigazi> Thanks everyone
22:17:44 <strigazi> #endmeeting