09:00:39 <flwang1> #startmeeting magnum
09:00:39 <openstack> Meeting started Wed Feb 12 09:00:39 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:42 <openstack> The meeting name has been set to 'magnum'
09:00:50 <flwang1> #topic roll call
09:00:54 <flwang1> o/
09:02:59 <brtknr> o/
09:03:07 <flwang1> brtknr: hey, how are you?
09:03:14 <brtknr> good thanks, and you!
09:03:16 <brtknr> ?
09:03:39 <flwang1> very good
09:03:56 <flwang1> let's wait for strigazi a bit
09:04:09 <flwang1> brtknr: did you see my comments on your csi patch?
09:05:52 <brtknr> yes, thanks for reviewing
09:06:04 <brtknr> did you see the issue on devstack?
09:06:22 <brtknr> can you leave a comment about what k8s version you used and whether it was podman or coreos etc.?
09:06:37 <brtknr> i haven't seen the same issue locally
09:06:43 <flwang1> ok, will do
09:06:52 <flwang1> i'm using v1.16.3 with podman
09:06:56 <flwang1> and coreos
09:08:01 <flwang1> brtknr: let's go through the agenda?
09:08:22 <brtknr> sounds good
09:08:57 <brtknr> btw, instead of --kubelet-insecure-tls, maybe i can use --kubelet-certificate-authority
09:09:07 <brtknr> and use the ca for the cluster?
09:09:34 <flwang1> brtknr: that would be great, otherwise enterprise users won't like it
09:09:39 <flwang1> 1. Help with removing the constraint that there must be a minimum of 1 worker in a given nodegroup (including default-worker).
09:09:50 <flwang1> have you already got any idea for this?
09:10:53 <brtknr> i can manually specify count as 0 and get the cluster to reach CREATE_COMPLETE, but i haven't been able to override the value of node_count to 0; at the moment, it defaults to 1
09:12:04 <brtknr> i haven't been able to figure out where exactly this constraint is applied. any pointer would be appreciated, but i realise there is no easy answer without properly digging underneath
09:12:08 <flwang1> what do you mean manually specify? like openstack coe cluster create xxx --node-count 0?
09:12:29 <brtknr> that will not work because there is an api-level constraint
09:12:48 <brtknr> when i remove the api-level constraint, the node-count still defaults to 1
09:12:53 <flwang1> brtknr: i see. so you mean you hacked the code to set it to 0?
09:13:02 <brtknr> i can override count in the kubecluster.yaml file
09:13:16 <brtknr> and only then does the cluster reach CREATE_COMPLETE
09:13:26 <flwang1> ah, i see.
09:13:51 <flwang1> i would say if it works at the Heat level, then the overall idea should work
09:13:59 <flwang1> we can manage it in magnum scope
09:14:24 <flwang1> it would be nice if you can dig and propose a patch so that we can start to review from there
09:14:32 <brtknr> sounds good
09:15:12 <flwang1> 2. metrics server CrashLoopBackOff
09:15:20 <flwang1> is there anything we should discuss about this?
09:17:07 <brtknr> flwang1: i saw your comment
09:17:19 <brtknr> i will see if there is a way to do this without insecure-tls
09:18:27 <brtknr> there is a --kubelet-certificate-authority and a --tls-cert-file option which i haven't explored
09:18:30 <flwang1> cool, i appreciate your work on this issue
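For reference on the flag swap brtknr mentions above, a minimal sketch of what the relevant metrics-server container arguments could look like with kubelet certificate verification enabled instead of skipped. Only the two --kubelet-* flags come from the discussion; the CA path, volume name, and secret name are illustrative assumptions, not the actual magnum change.

```yaml
# Illustrative fragment of a metrics-server Deployment pod spec.
# Assumes the cluster CA certificate is made available in a secret;
# the secret name and mount path are placeholders.
containers:
  - name: metrics-server
    args:
      - --kubelet-preferred-address-types=InternalIP
      # instead of skipping verification entirely:
      # - --kubelet-insecure-tls
      # verify kubelet serving certificates against the cluster CA:
      - --kubelet-certificate-authority=/etc/metrics-server/ca.crt
    volumeMounts:
      - name: cluster-ca
        mountPath: /etc/metrics-server
        readOnly: true
volumes:
  - name: cluster-ca
    secret:
      secretName: cluster-ca  # placeholder
```

This only helps if the kubelet serving certificates are actually signed by that CA, which is the open question in the exchange above.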
09:19:08 <flwang1> 3. the heat agent log
09:19:10 <brtknr> we're still on the roll call topic btw
09:19:16 <brtknr> lol
09:19:16 <flwang1> ah, sorry
09:19:26 <flwang1> #topic heat agent log
09:19:44 <flwang1> with this one, we probably need to wait for the investigation result from strigazi
09:19:52 <flwang1> so let's skip it for now?
09:20:21 <brtknr> i was digging into the heat agent log yesterday as it's sometimes impossible to see what is happening while the cluster is creating
09:20:52 <brtknr> the main problem is subprocess.communicate does not provide an option to stream output
09:21:39 <flwang1> :(
09:21:42 <brtknr> there may be a way to output stdout and stderr to a file on the other hand
09:22:04 <brtknr> from wherever it's being executed
09:22:17 <brtknr> but happy to wait for what strigazi has to say
09:22:35 <flwang1> cool, thanks
09:22:57 <flwang1> move on?
09:25:54 <brtknr> sure
09:27:48 <flwang1> #topic volume AZ
09:27:56 <flwang1> brtknr: did you get a chance to review the volume AZ fix https://review.opendev.org/705592 ?
09:30:47 <brtknr> yes, it has complicated logic
09:31:34 <flwang1> brtknr: yep, i don't like it TBH, but i can't figure out a better way to solve it
09:32:17 <brtknr> how has it worked fine until now?
09:32:29 <flwang1> sorry?
09:32:42 <flwang1> you mean why it was working?
09:33:07 <brtknr> until now, how have we survived without this patch is what i'm asking
09:33:14 <flwang1> probably because most of the companies are not using multi AZ
09:33:30 <flwang1> without multi AZ, users won't run into this issue
09:34:02 <flwang1> as far as i know, Nectar (jakeyip) are hacking the code
09:34:30 <flwang1> i don't know if cern is using multi az
09:35:58 <elenalindq> Hi there, may I interrupt with a question? I tried openstack coe cluster update $CLUSTER_NAME replace node_count=3 and it failed because of a lack of resources, so my stack is in UPDATE_FAILED state. I fixed the quota and tried to rerun the update command hoping it will kick it off again, but nothing happens. If I try openstack stack update <stack_id> --existing it will kick off the update, which succeeds (heat shows
09:35:58 <elenalindq> UPDATE_COMPLETE), but openstack coe cluster list still shows my stack in UPDATE_FAILED. Is there a way to rerun the update from magnum? Using Openstack Train.
09:36:01 <brtknr> flwang1: can we not get a default value for az in the same way nova does?
09:37:03 <brtknr> elenalindq: if you are using train with an up to date CLI, you can rerun `openstack coe cluster resize <cluster_name> 3`
09:37:08 <flwang1> brtknr: cinder can handle the "" for az
09:37:12 <brtknr> and this will reupdate the heat stack
09:37:28 <elenalindq> thank you brtknr!
09:37:30 <brtknr> flwang1: but not nova?
09:37:33 <flwang1> brtknr: cinder can NOT handle the "" for az
09:37:40 <flwang1> but nova can
09:37:59 <flwang1> cinder will just return a 400 IIRC
09:38:22 <brtknr> flwang1: can we look into the nova code to see how they infer a sensible default for region name?
09:38:26 <flwang1> you can easily test this without using multi az
09:38:56 <flwang1> are you trying to solve this issue in cinder?
09:39:45 <brtknr> i think magnum should have an internal default for availability zone
09:39:52 <brtknr> rather than ""
09:40:04 <flwang1> like a config option?
09:40:37 <flwang1> then how can you set the default value for this option?
09:40:50 <flwang1> and this may break backward compatibility :(
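To make the complication being discussed more concrete: Cinder returns a 400 for an empty availability zone string where Nova tolerates it, so the template has to branch on whether an AZ was actually supplied. The following is only a rough sketch of that kind of Heat condition; the parameter names, resource name, and fallback default are assumptions and this is not the actual logic in https://review.opendev.org/705592.

```yaml
heat_template_version: 2018-08-31

parameters:
  availability_zone:
    type: string
    default: ""
  default_volume_az:
    type: string
    default: "nova"  # hypothetical deployment-wide fallback

conditions:
  az_given:
    not:
      equals:
        - get_param: availability_zone
        - ""

resources:
  docker_volume:
    type: OS::Cinder::Volume
    properties:
      size: 10
      # Only forward an AZ that was actually supplied; Cinder rejects "".
      availability_zone:
        if:
          - az_given
          - get_param: availability_zone
          - get_param: default_volume_az
```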
09:43:03 <brtknr> flwang1: hmm, will cinder accept None?
09:43:29 <flwang1> brtknr: no, based on what i tried
09:44:25 <flwang1> you can give the patch a try and we can discuss offline
09:44:48 <flwang1> it's a small issue but it just makes the template complicated, i understand that
09:46:17 <flwang1> brtknr: let's move on
09:46:31 <flwang1> #topic docker storage for fedora coreos
09:46:34 <flwang1> docker storage driver for fedora coreos https://review.opendev.org/696256
09:47:04 <flwang1> brtknr: can you pls revisit the above patch?
09:47:08 <flwang1> strigazi: ^
09:51:02 <brtknr> flwang1: my main issue with that patch is that we should try to use the same configure-docker-storage.sh and add a condition in there for fedora coreos
09:51:21 <brtknr> it is hard to factor out common elements when they are in different files
09:52:11 <flwang1> brtknr: we can't use the same script. i'm kind of using the same one from coreos
09:52:20 <flwang1> because the logic is different
09:53:25 <flwang1> brtknr: see https://github.com/openstack/magnum/blob/master/magnum/drivers/k8s_coreos_v1/templates/fragments/configure-docker.yaml
09:54:37 <brtknr> ah ok, i see what you mean
09:54:57 <brtknr> my bad
09:55:44 <brtknr> when i tested it, it worked for me
09:55:51 <flwang1> all good, please revisit it, because we do need it for the fedora coreos driver to remove the TODO :)
09:56:28 <flwang1> let's move on, we only have 5 mins
09:56:30 <brtknr> i just realised that atomic has its own fragment
09:56:41 <brtknr> so this pattern makes sense
09:56:49 <flwang1> #topic autoscaler podman issue
09:57:00 <brtknr> i am happy to take this patch as is
09:57:21 <brtknr> on the topic of the autoscaler, are you guys planning to work on supporting nodegroups?
09:57:31 <flwang1> this is a brand new bug, see https://github.com/kubernetes/autoscaler/issues/2819
09:58:17 <flwang1> brtknr: i'm planning to support the /resize api first and then nodegroups, not sure if the cern guys will take on the nodegroups support
09:58:39 <flwang1> brtknr: and here is the fix https://review.opendev.org/707336
09:58:53 <flwang1> we just need to add the volume mount for /etc/machine-id
09:59:03 <flwang1> the bug reporter has confirmed that works for him
09:59:20 <brtknr> flwang1: excellent, looks reasonable to me
09:59:47 <brtknr> we haven't started using coreos in prod yet, but this kind of bug is precisely the reason why
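Since the fix mentioned above is small, here is a sketch of the shape it takes: a hostPath mount exposing the host's /etc/machine-id to the cluster-autoscaler container. This is only an illustration of the idea; the container name and the surrounding Deployment fields are assumed, and the real change is in https://review.opendev.org/707336.

```yaml
# Pod spec fragment; names are illustrative.
spec:
  containers:
    - name: cluster-autoscaler
      volumeMounts:
        - name: machine-id
          mountPath: /etc/machine-id
          readOnly: true
  volumes:
    - name: machine-id
      hostPath:
        path: /etc/machine-id
        type: File
```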
10:00:06 <brtknr> what do you mean, support the resize api and then nodegroups?
10:00:17 <brtknr> doesn't it already support resize?
10:01:38 <flwang1> brtknr: it's using the old way https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/magnum/magnum_manager_heat.go#L231
10:02:21 <flwang1> in short, it calls heat to remove the node and then calls the magnum update api to update the correct number, which doesn't make sense given we now have the resize api
10:03:08 <brtknr> flwang1: ah okay, that's true
10:03:26 <flwang1> i'd like to use resize to replace it
10:03:35 <brtknr> why did we not update the node count on magnum directly?
10:03:38 <flwang1> to drop the dependency on heat
10:04:07 <flwang1> because the magnum update api can't specify which node to delete
10:04:27 <brtknr> flwang1: oh i see
10:04:53 <flwang1> :)
10:04:57 <brtknr> but if you remove the node from the heat stack and update the node count, magnum doesn't remove any extra nodes?
10:05:21 <flwang1> magnum just updates the number, because the node has been deleted
10:05:53 <flwang1> it just "magically works" :)
10:06:06 <brtknr> i really want support for nodegroup autoscaling
10:06:11 <flwang1> anyway, i think we all agree resize will be the right way to do this
10:06:18 <brtknr> happy to work on this but will need to learn golang first
10:06:23 <flwang1> brtknr: then show me the code :D
10:06:42 <flwang1> let's move on
10:06:51 <flwang1> i'm going to close this meeting now
10:07:01 <flwang1> we can discuss the out-of-box storage class offline
10:07:02 <brtknr> #topic out of box storage class?
10:07:22 <flwang1> brtknr: it's related to this one https://review.opendev.org/676832
10:07:26 <flwang1> i proposed it before
10:07:27 <brtknr> i see dioguerra lurking in the background
10:09:01 <flwang1> :)
10:09:10 <flwang1> let's end the meeting first
10:09:13 <flwang1> #endmeeting