10:00:15 <strigazi> #startmeeting containers
10:00:16 <openstack> Meeting started Tue Jul 10 10:00:15 2018 UTC and is due to finish in 60 minutes.  The chair is strigazi. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:00:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
10:00:19 <openstack> The meeting name has been set to 'containers'
10:00:20 <strigazi> #topic Roll Call
10:00:23 <flwang1> o/
10:00:29 <strigazi> o/
10:01:30 <strigazi> agenda:
10:01:37 <strigazi> #link https://wiki.openstack.org/wiki/Meetings/Containers#Agenda_for_2018-07-10_1700_UTC
10:01:42 <strigazi> #topic Blueprints/Bugs/Ideas
10:02:19 <strigazi> nodegroups, different flavors and AZs
10:02:21 <flwang1> strigazi: after you
10:03:24 <strigazi> At the moment we only support selecting flavors for the master and worker nodes
10:03:45 <strigazi> For availability zones, we can only select one AZ for all nodes.
10:04:12 <strigazi> There is a need to specify different AZs and flavors
10:04:39 <strigazi> AZ for availability, flavors for special vms, for example flavors with GPUs
10:04:59 <mvpnitesh> strigazi: If we want multiple flavors for the cluster creation, can we go ahead and implement this https://blueprints.launchpad.net/magnum/+spec/support-multiple-flavor
10:05:22 <mvpnitesh> i guess the spec for this BP is not approved, can i modify the same one or raise a new one?
10:05:39 <brtknr> o/
10:05:43 <strigazi> Nodegroups were proposed in the past, IMO it is an over-engineered solution and we don't have the manpower for it.
10:05:58 <strigazi> mvpnitesh: we don't use blueprints any more.
10:06:19 <mvpnitesh> strigazi: ok
10:06:25 <strigazi> mvpnitesh: we migrated to Storyboard. These BPs were moved there with the same name.
10:07:06 <strigazi> brtknr: hi
10:07:41 <strigazi> we can consolidate these three blueprints into one that allows clusters with different types of nodes
10:08:04 <brtknr> strigazi: hi, thanks for reviewing the floating ip patch, i'm confused about the time on the agenda... says 1700.
10:08:06 <strigazi> each node can be different from the others in flavor and AZ
10:08:20 <strigazi> brtknr is it?
10:08:51 <strigazi> brtknr: we alternate and I copied the wrong line
10:09:31 <brtknr> ah, so 1 week at 1000, another week at 1700? i like that solution
10:10:31 <strigazi> brtknr: http://lists.openstack.org/pipermail/openstack-dev/2018-June/131678.html
10:11:08 <strigazi> So, for AZs and flavors. No one is working on it.
10:11:36 <strigazi> We need to design it and target it for S
10:12:15 <strigazi> mvpnitesh: ^^
10:12:34 <mvpnitesh> strigazi: We want to have multiple flavors for a single cluster. I'll look into the storyboard and come up with the design and i'll target that for S
10:13:26 <strigazi> We can discuss it again next week and come prepared.
10:13:40 <strigazi> brtknr: flwang1: are interested in this ^^
10:14:39 <strigazi> brtknr: flwang1: are you interested in this ^^
10:14:41 <flwang1> strigazi: not really ;)
10:14:41 <brtknr> brtknr: multiple flavors and multiple os too if possible
10:14:50 <strigazi> multiple os?
10:15:09 <strigazi> for the special images you have for gpus?
10:15:45 <brtknr> strigazi: yes, i looked into fedora support for gpu, looks a bit hacky
10:15:56 <brtknr> considering nvidia do not officially support gpu
10:16:02 <brtknr> whereas centos is supported
10:16:27 <brtknr> nvidia do not officially support fedora gpu drivers
10:16:49 <strigazi> Let's see, I think we can even use centos-atomic without any changes
10:17:27 <strigazi> Next subject,
10:18:40 <strigazi> I'm working on changing this method to sync the cluster status https://github.com/openstack/magnum/blob/master/magnum/drivers/heat/driver.py#L191
10:19:11 <flwang1> strigazi: what's the background?
10:19:16 <strigazi> Instead of the list call that I mentioned, we can do a get without resolving the outputs of the stack
10:19:23 <flwang1> ah, i see
10:19:24 <flwang1> got it
10:19:58 <strigazi> flwang1: with big stacks magnum tries to kill heat.
10:20:16 <strigazi> flwang1: With the health check you are working on
10:20:38 <strigazi> we can avoid getting the worker nodes entirely
10:20:39 <flwang1> i can remember the issue now
10:22:12 <strigazi> but this is another discussion
10:23:10 <flwang1> ok
10:23:34 <strigazi> This week I expect to push: this patch, the cloud-provider enable/disable patch, k8s upgrades and the keypair patch.
10:23:38 <brtknr> is there a story for this?
10:23:57 <strigazi> this one https://storyboard.openstack.org/#!/story/2002648
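(A minimal sketch of the get-without-outputs call being discussed, assuming python-heatclient's resolve_outputs flag on stacks.get; the function and argument names are placeholders:)

    from heatclient import client as heat_client

    def poll_stack_status(session, stack_id):
        heat = heat_client.Client('1', session=session)
        # Skipping output resolution avoids walking every worker-node resource,
        # which is the expensive part of the current sync for large stacks.
        stack = heat.stacks.get(stack_id, resolve_outputs=False)
        return stack.stack_status, stack.stack_status_reason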
10:24:25 <strigazi> I saw that flwang1 and imdigitaljim had questions about rebuilding and fixing clusters
10:24:58 <strigazi> The above change is also required for supporting users.
10:25:30 <brtknr> Ah, I had the same issue with clusters created using just the heat template definition, which we are using to create clusters with complex configurations
10:25:50 <strigazi> At the moment, when a user has a cluster with a few nodes (not one node)
10:26:08 <strigazi> and a node is in bad shape, we delete this node with heat
10:26:21 <strigazi> Actually we tell them how to do it:
10:27:29 <strigazi> openstack stack update <stack_id> --existing -P minions_to_remove=<comma separated list with resources ids or private ips> -P number_of_minions=<integer>
10:28:30 <strigazi> the resource id can be found either from the name of the vms or by doing openstack stack resource list -n 2 <stack_id>
10:29:23 <strigazi> using the resource id is more helpful since you can even delete nodes that didn't get an ip and also the command we cook for users is shorter :)
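(The same removal workflow sketched through python-heatclient, assuming the minions_to_remove/number_of_minions parameters of the Magnum templates; helper names are illustrative:)

    from heatclient import client as heat_client

    def remove_bad_minion(session, stack_id, bad_resource_id, new_minion_count):
        heat = heat_client.Client('1', session=session)
        # Equivalent of `openstack stack resource list -n 2 <stack_id>`:
        # walk the nested resources to spot the failed node's resource id.
        for res in heat.resources.list(stack_id, nested_depth=2):
            print(res.resource_name, res.resource_status)
        # PATCH-update the existing stack (the --existing flag of the CLI),
        # dropping the bad minion and shrinking the node count accordingly.
        heat.stacks.update(
            stack_id,
            existing=True,
            parameters={
                'minions_to_remove': bad_resource_id,
                'number_of_minions': new_minion_count,
            },
        )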
10:29:31 <flwang1> strigazi: btw, is Ricardo still working on auto healing?
10:29:57 <strigazi> flwang1: he said he will
10:30:26 <strigazi> flwang1: so auto healing will do what I described automatically
10:30:30 <flwang1> strigazi: ok, cool
10:30:59 <strigazi> even so, the change with the keypair is useful so that admins can do this operation
10:31:11 <strigazi> or other users in the project
10:31:40 <strigazi> make sense?
10:32:06 <flwang1> strigazi: yes for me
10:32:15 <strigazi> And for supporting users and the health status
10:32:25 <strigazi> At the moment
10:32:44 <strigazi> users can not retrieve certs if the cluster is in a failed state
10:33:17 <strigazi> We must change this so that we also know what k8s or swarm think about the cluster
10:33:44 <flwang1> yep, we need the health status to help magnum understand the cluster status
10:34:13 <strigazi> Instead of checking the status of the cluster, magnum can check if the CA is created
10:34:51 <strigazi> If the CA is created, users should be able to retrieve the certs
10:34:53 <strigazi> makes sense?
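(A rough, hypothetical sketch of the guard being proposed; the helper names below are illustrative, not current magnum code:)

    def can_issue_certs(cluster, cert_manager):
        # Instead of rejecting any cluster in a *_FAILED state, gate cert
        # retrieval on whether the CA was actually generated and stored.
        if not cluster.ca_cert_ref:
            return False
        return cert_manager.get_cert(cluster.ca_cert_ref) is not None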
10:35:14 <flwang1> we may need some discussion about this
10:35:27 <strigazi> What are your doubts?
10:35:54 <strigazi> The solution we need is
10:36:05 <strigazi> A user created a cluster with 50 nodes
10:36:28 <strigazi> the CA is created, the master is up and also 49 workers
10:36:56 <strigazi> 1 vm failed to boot, to get connectivity or to report to heat.
10:37:08 <flwang1> strigazi: my concern is why the user has to care about a cluster failure
10:37:14 <strigazi> cluster status goes to CREATE_FAILED
10:37:16 <flwang1> can't he just create a new one?
10:37:39 <strigazi> what if it is 100 nodes?
10:38:00 <flwang1> no matter how many nodes, the time of creation should be the 'same'
10:38:05 <strigazi> does it sound reasonable for the rest of the openstack services to take the load again?
10:38:11 <flwang1> so the effort to create a new one is 'same'
10:38:49 <strigazi> flwang1: in practice it will not be the same
10:38:57 <flwang1> i can see your point
10:39:32 <flwang1> i'm happy to have it fixed, but personally i'd like to see a 'fix' api
10:39:33 <strigazi> i was dealing with a 500 node cluster last week and I still feel the pain
10:39:48 <flwang1> instead of introducing too much manual effort for the user
10:40:28 <flwang1> we can continue it offline
10:40:30 <strigazi> when automation fails you need to have manual access
10:40:47 <strigazi> not if, when
10:40:51 <strigazi> ok
10:40:56 <strigazi> that's it from me
10:41:10 <flwang1> you finished?
10:41:24 <strigazi> yes, go ahead
10:41:31 <flwang1> i have a long list
10:41:47 <flwang1> 1. i'm still working on the multi region issue
10:42:01 <flwang1> and the root cause is in heat
10:42:26 <flwang1> i have proposed several patches in heat, heat-agents and os-collect-config
10:42:41 <strigazi> I saw only one for heat
10:42:44 <strigazi> pointers?
10:42:58 <flwang1> https://review.openstack.org/580470
10:43:08 <flwang1> https://review.openstack.org/580229
10:43:16 <flwang1> these two are for heat
10:43:20 <flwang1> one has been merged in master
10:43:26 <flwang1> i'm cherrypicking to queens
10:43:42 <flwang1> for occ https://review.openstack.org/580554
10:44:10 <strigazi> these three?
10:44:17 <flwang1> for heat-agent https://review.openstack.org/580984 is backporting to queens
10:44:35 <flwang1> we don't have to care about the last one
10:44:44 <strigazi> yeap
10:44:51 <flwang1> but we do need the fixes for heat and occ
10:45:24 <flwang1> 2. heat-container-agent images can't be built
10:45:41 <flwang1> it failed to find python-docker-py, not sure if there is anything i missed
10:46:48 <flwang1> 3. etcd race condition issue   https://review.openstack.org/579484
10:46:59 <strigazi> flwang1 I'll have a look in 2
10:47:06 <strigazi> 3 is ok now
10:47:08 <flwang1> strigazi: thanks
10:47:31 <flwang1> strigazi: yep, can you bless #3? ;)
10:47:41 <openstackgerrit> Merged openstack/magnum master: Pass in `region_name` to get correct heat endpoint  https://review.openstack.org/579043
10:47:43 <openstackgerrit> Merged openstack/magnum master: Add release notes link in README  https://review.openstack.org/581242
10:47:51 <strigazi> 3: Patch in Merge Conflict
10:48:10 <flwang1> ah, yep.
10:48:28 <strigazi> flwang1: you can trim down the commit message when you rebase
10:48:32 <flwang1> as for the rename scripts patch that was reverted, can we get it in again? https://review.openstack.org/581099 i have fixed it
10:48:41 <strigazi> flwang1: yes
10:48:47 <flwang1> strigazi: thanks
10:49:12 <strigazi> flwang1: I was checking the cover job, not sure why it sometimes fails
10:49:30 <flwang1> 4. Clear resources created by k8s before delete cluster https://review.openstack.org/497144
10:49:50 <flwang1> the method used in this patch is not good IMHO
10:50:20 <flwang1> technically, there could be many clusters in the same subnet
10:50:27 <strigazi> flwang1: what do you propose?
10:50:49 <flwang1> we're working on a fix in CPO to add cluster id into the LB's description
10:51:03 <flwang1> that's the only safe way IMHO
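(A sketch of the safer lookup being described, assuming the CPO fix puts the cluster UUID into each load balancer's description and using openstacksdk's load_balancer proxy; names are illustrative:)

    import openstack

    def cluster_loadbalancers(cloud_name, cluster_uuid):
        conn = openstack.connect(cloud=cloud_name)
        # Only load balancers whose description carries this cluster's UUID are
        # considered ours; other clusters on the same subnet are left alone.
        return [lb for lb in conn.load_balancer.load_balancers()
                if cluster_uuid in (lb.description or '')]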
10:51:10 <sfilatov> I might be missing something, but why don't we clear loadbalancers in a separate software deployment?
10:51:24 <sfilatov> we won't need to connect to k8s api then
10:51:44 <flwang1> sfilatov: what do you mean?
10:51:55 <flwang1> the lb is created by k8s
10:52:07 <flwang1> we're not talking about the lb of master
10:52:12 <sfilatov> yes
10:52:18 <sfilatov> i'm talking about k8s too
10:52:29 <sfilatov> we could have done kubectl delete on all of them
10:52:33 <sfilatov> inside a cluster
10:53:14 <flwang1> sfilatov: can you just propose a patch? i'm happy to review and test
10:53:24 <sfilatov> yes, I'm working on it
10:53:51 <sfilatov> but i kinda have a patch similar to the one proposed
10:53:51 <strigazi> so to delete a cluster we will do a stack update first?
10:53:59 <sfilatov> and we have a lot of issues with it
10:54:04 <sfilatov> no
10:54:31 <sfilatov> we can have a softwaredeployment for the DELETE action
10:54:49 <strigazi> +1
10:55:18 <sfilatov> we are working on this issue, I can provide a patch by the end of this or next week I guess
10:55:27 <flwang1> sfilatov: nice
10:55:39 <flwang1> strigazi: that's me
10:55:56 <sfilatov> can you link a bug or a blueprint for this issue?
10:56:02 <flwang1> i'm still keen to understand the auto upgrade and auto healing status
10:56:10 <flwang1> sfilatov: wait a sec
10:56:13 <strigazi> brtknr: sfilatov something to add?
10:56:25 <flwang1> sfilatov: story/1712062
10:57:00 <sfilatov> thx
10:57:33 <flwang1> np
10:57:45 <DimGR> hi
10:58:50 <strigazi> Anything else for the meeting?
10:59:43 <flwang1> strigazi: nope, im good
10:59:48 <strigazi> ok then, see you this Thursday or in 1 week - 1 hour
10:59:59 <strigazi> #endmeeting