09:00:39 <flwang1> #startmeeting magnum
09:00:39 <openstack> Meeting started Wed Feb 12 09:00:39 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:42 <openstack> The meeting name has been set to 'magnum'
09:00:50 <flwang1> #topic roll call
09:00:54 <flwang1> o/
09:02:59 <brtknr> o/
09:03:07 <flwang1> brtknr: hey, how are you?
09:03:14 <brtknr> good thanks, and you!
09:03:16 <brtknr> ?
09:03:39 <flwang1> very good
09:03:56 <flwang1> let's wait for strigazi a bit
09:04:09 <flwang1> brtknr: did you see my comments on your csi patch?
09:05:52 <brtknr> yes, thanks for reviewing
09:06:04 <brtknr> did you see the issue on devstack?
09:06:22 <brtknr> can you leave a comment about what k8s version you used and whether it was podman or coreos etc.?
09:06:37 <brtknr> i haven't seen the same issue locally
09:06:43 <flwang1> ok, will do
09:06:52 <flwang1> i'm using v1.16.3 with podman
09:06:56 <flwang1> and coreos
09:08:01 <flwang1> brtknr: let's go through the agenda?
09:08:22 <brtknr> sounds good
09:08:57 <brtknr> btw, instead of --kubelet-insecure-tls, maybe i can use --kubelet-certificate-authority
09:09:07 <brtknr> and use the ca for the cluster?
09:09:34 <flwang1> brtknr: that would be great, otherwise enterprise users won't like it
09:09:39 <flwang1> 1. Help with removing the constraint that there must be a minimum of 1 worker in a given nodegroup (including default-worker).
09:09:50 <flwang1> have you already got any idea for this?
09:10:53 <brtknr> i can manually specify count as 0 and get the cluster to reach CREATE_COMPLETE, but i haven't been able to override the value of node_count to 0; at the moment, it defaults to 1
09:12:04 <brtknr> i haven't been able to figure out where exactly this constraint is applied. any pointer would be appreciated, but i realise there is no easy answer without properly digging underneath
09:12:08 <flwang1> what do you mean manually specify? like openstack coe cluster create xxx --node-count 0?
09:12:29 <brtknr> that will not work because there is an api-level constraint
09:12:48 <brtknr> when i remove the api-level constraint, the node-count still defaults to 1
09:12:53 <flwang1> brtknr: i see. so you mean you hacked the code to set it to 0?
09:13:02 <brtknr> i can override count in the kubecluster.yaml file
09:13:16 <brtknr> and only then does the cluster reach CREATE_COMPLETE
09:13:26 <flwang1> ah, i see.
09:13:51 <flwang1> i would say if it works at the Heat level, then the overall idea should work
09:13:59 <flwang1> we can manage it in magnum scope
09:14:24 <flwang1> it would be nice if you can dig and propose a patch so that we can start to review from there
09:14:32 <brtknr> sounds good
09:15:12 <flwang1> 2. metrics server CrashLoopBackOff
09:15:20 <flwang1> is there anything we should discuss about this?
09:17:07 <brtknr> flwang1: i saw your comment
09:17:19 <brtknr> i will see if there is a way to do this without insecure-tls
09:18:27 <brtknr> there is a --kubelet-certificate-authority and a --tls-cert-file option which i haven't explored
09:18:30 <flwang1> cool, i appreciate your work on this issue
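For reference on the flag swap brtknr mentions above, a minimal sketch of what the relevant metrics-server container arguments could look like with kubelet certificate verification enabled instead of skipped. Only the two --kubelet-* flags come from the discussion; the CA path, volume name, and secret name are illustrative assumptions, not the actual magnum change.

```yaml
# Illustrative fragment of a metrics-server Deployment pod spec.
# Assumes the cluster CA certificate is made available in a secret;
# the secret name and mount path are placeholders.
containers:
  - name: metrics-server
    args:
      - --kubelet-preferred-address-types=InternalIP
      # instead of skipping verification entirely:
      # - --kubelet-insecure-tls
      # verify kubelet serving certificates against the cluster CA:
      - --kubelet-certificate-authority=/etc/metrics-server/ca.crt
    volumeMounts:
      - name: cluster-ca
        mountPath: /etc/metrics-server
        readOnly: true
volumes:
  - name: cluster-ca
    secret:
      secretName: cluster-ca  # placeholder
```

This only helps if the kubelet serving certificates are actually signed by that CA, which is the open question in the exchange above.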
09:19:08 <flwang1> 3. the heat agent log
09:19:10 <brtknr> we're still on the roll call topic btw
09:19:16 <brtknr> lol
09:19:16 <flwang1> ah, sorry
09:19:26 <flwang1> #topic heat agent log
09:19:44 <flwang1> with this one, we probably need to wait for the investigation result from strigazi
09:19:52 <flwang1> so let's skip it for now?
09:20:21 <brtknr> i was digging into the heat agent log yesterday as it's sometimes impossible to see what is happening while the cluster is creating
09:20:52 <brtknr> the main problem is subprocess.communicate does not provide an option to stream output
09:21:39 <flwang1> :(
09:21:42 <brtknr> there may be a way to output stdout and stderr to a file on the other hand
09:22:04 <brtknr> from wherever it's being executed
09:22:17 <brtknr> but happy to wait for what strigazi has to say
09:22:35 <flwang1> cool, thanks
09:22:57 <flwang1> move on?
09:25:54 <brtknr> sure
09:27:48 <flwang1> #topic volume AZ
09:27:56 <flwang1> brtknr: did you get a chance to review the volume AZ fix https://review.opendev.org/705592 ?
09:30:47 <brtknr> yes, it has complicated logic
09:31:34 <flwang1> brtknr: yep, i don't like it TBH, but i can't figure out a better way to solve it
09:32:17 <brtknr> how has it worked fine until now?
09:32:29 <flwang1> sorry?
09:32:42 <flwang1> you mean why it was working?
09:33:07 <brtknr> until now, how have we survived without this patch is what i'm asking
09:33:14 <flwang1> probably because most of the companies are not using multi AZ
09:33:30 <flwang1> without multi AZ, users won't run into this issue
09:34:02 <flwang1> as far as i know, Nectar (jakeyip) are hacking the code
09:34:30 <flwang1> i don't know if cern is using multi az
09:35:58 <elenalindq> Hi there, may I interrupt with a question? I tried openstack coe cluster update $CLUSTER_NAME replace node_count=3 and it failed because of a lack of resources, so my stack is in UPDATE_FAILED state. I fixed the quota and tried to rerun the update command hoping it will kick it off again, but nothing happens. If I try openstack stack update <stack_id> --existing it will kick off the update, which succeeds (heat shows
09:35:58 <elenalindq> UPDATE_COMPLETE), but openstack coe cluster list still shows my stack in UPDATE_FAILED. Is there a way to rerun the update from magnum? Using Openstack Train.
09:36:01 <brtknr> flwang1: can we not get a default value for az in the same way nova does?
09:37:03 <brtknr> elenalindq: if you are using train with an up to date CLI, you can rerun `openstack coe cluster resize <cluster_name> 3`
09:37:08 <flwang1> brtknr: cinder can handle the "" for az
09:37:12 <brtknr> and this will reupdate the heat stack
09:37:28 <elenalindq> thank you brtknr!
09:37:30 <brtknr> flwang1: but not nova?
09:37:33 <flwang1> brtknr: cinder can NOT handle the "" for az
09:37:40 <flwang1> but nova can
09:37:59 <flwang1> cinder will just return a 400 IIRC
09:38:22 <brtknr> flwang1: can we look into the nova code to see how they infer a sensible default for region name?
09:38:26 <flwang1> you can easily test this without using multi az
09:38:56 <flwang1> are you trying to solve this issue in cinder?
09:39:45 <brtknr> i think magnum should have an internal default for availability zone
09:39:52 <brtknr> rather than ""
09:40:04 <flwang1> like a config option?
09:40:37 <flwang1> then how can you set the default value for this option?
09:40:50 <flwang1> and this may break backward compatibility :(
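To make the complication being discussed more concrete: Cinder returns a 400 for an empty availability zone string where Nova tolerates it, so the template has to branch on whether an AZ was actually supplied. The following is only a rough sketch of that kind of Heat condition; the parameter names, resource name, and fallback default are assumptions and this is not the actual logic in https://review.opendev.org/705592.

```yaml
heat_template_version: 2018-08-31

parameters:
  availability_zone:
    type: string
    default: ""
  default_volume_az:
    type: string
    default: "nova"  # hypothetical deployment-wide fallback

conditions:
  az_given:
    not:
      equals:
        - get_param: availability_zone
        - ""

resources:
  docker_volume:
    type: OS::Cinder::Volume
    properties:
      size: 10
      # Only forward an AZ that was actually supplied; Cinder rejects "".
      availability_zone:
        if:
          - az_given
          - get_param: availability_zone
          - get_param: default_volume_az
```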
09:43:03 <brtknr> flwang1: hmm, will cinder accept None?
09:43:29 <flwang1> brtknr: no, based on what i tried
09:44:25 <flwang1> you can give the patch a try and we can discuss offline
09:44:48 <flwang1> it's a small issue but it just makes the template complicated, i understand that
09:46:17 <flwang1> brtknr: let's move on
09:46:31 <flwang1> #topic docker storage for fedora coreos
09:46:34 <flwang1> docker storage driver for fedora coreos https://review.opendev.org/696256
09:47:04 <flwang1> brtknr: can you pls revisit the above patch?
09:47:08 <flwang1> strigazi: ^
09:51:02 <brtknr> flwang1: my main issue with that patch is that we should try to use the same configure-docker-storage.sh and add a condition in there for fedora coreos
09:51:21 <brtknr> it is hard to factor out common elements when they are in different files
09:52:11 <flwang1> brtknr: we can't use the same script. i'm kind of using the same one from coreos
09:52:20 <flwang1> because the logic is different
09:53:25 <flwang1> brtknr: see https://github.com/openstack/magnum/blob/master/magnum/drivers/k8s_coreos_v1/templates/fragments/configure-docker.yaml
09:54:37 <brtknr> ah ok, i see what you mean
09:54:57 <brtknr> my bad
09:55:44 <brtknr> when i tested it, it worked for me
09:55:51 <flwang1> all good, please revisit it, because we do need it for the fedora coreos driver to remove the TODO :)
09:56:28 <flwang1> let's move on, we only have 5 mins
09:56:30 <brtknr> i just realised that atomic has its own fragment
09:56:41 <brtknr> so this pattern makes sense
09:56:49 <flwang1> #topic autoscaler podman issue
09:57:00 <brtknr> i am happy to take this patch as is
09:57:21 <brtknr> on the topic of the autoscaler, are you guys planning to work on supporting nodegroups?
09:57:31 <flwang1> this is a brand new bug, see https://github.com/kubernetes/autoscaler/issues/2819
09:58:17 <flwang1> brtknr: i'm planning to support the /resize api first and then nodegroups, not sure if the cern guys will take on the nodegroups support
09:58:39 <flwang1> brtknr: and here is the fix https://review.opendev.org/707336
09:58:53 <flwang1> we just need to add the volume mount for /etc/machine-id
09:59:03 <flwang1> the bug reporter has confirmed that works for him
09:59:20 <brtknr> flwang1: excellent, looks reasonable to me
09:59:47 <brtknr> we haven't started using coreos in prod yet, but this kind of bug is precisely the reason why
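Since the fix mentioned above is small, here is a sketch of the shape it takes: a hostPath mount exposing the host's /etc/machine-id to the cluster-autoscaler container. This is only an illustration of the idea; the container name and the surrounding Deployment fields are assumed, and the real change is in https://review.opendev.org/707336.

```yaml
# Pod spec fragment; names are illustrative.
spec:
  containers:
    - name: cluster-autoscaler
      volumeMounts:
        - name: machine-id
          mountPath: /etc/machine-id
          readOnly: true
  volumes:
    - name: machine-id
      hostPath:
        path: /etc/machine-id
        type: File
```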
10:00:06 <brtknr> what do you mean, support the resize api and then nodegroups?
10:00:17 <brtknr> doesn't it already support resize?
10:01:38 <flwang1> brtknr: it's using the old way https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/magnum/magnum_manager_heat.go#L231
10:02:21 <flwang1> in short, it calls heat to remove the node and then calls the magnum update api to update the correct number, which doesn't make sense given we now have the resize api
10:03:08 <brtknr> flwang1: ah okay, that's true
10:03:26 <flwang1> i'd like to use resize to replace it
10:03:35 <brtknr> why did we not update the node count on magnum directly?
10:03:38 <flwang1> to drop the dependency on heat
10:04:07 <flwang1> because the magnum update api can't specify which node to delete
10:04:27 <brtknr> flwang1: oh i see
10:04:53 <flwang1> :)
10:04:57 <brtknr> but if you remove the node from the heat stack and update the node count, magnum doesn't remove any extra nodes?
10:05:21 <flwang1> magnum just updates the number, because the node has been deleted
10:05:53 <flwang1> it just "magically works" :)
10:06:06 <brtknr> i really want support for nodegroup autoscaling
10:06:11 <flwang1> anyway, i think we all agree resize will be the right way to do this
10:06:18 <brtknr> happy to work on this but will need to learn golang first
10:06:23 <flwang1> brtknr: then show me the code :D
10:06:42 <flwang1> let's move on
10:06:51 <flwang1> i'm going to close this meeting now
10:07:01 <flwang1> we can discuss the out-of-box storage class offline
10:07:02 <brtknr> #topic out of box storage class?
10:07:22 <flwang1> brtknr: it's related to this one https://review.opendev.org/676832
10:07:26 <flwang1> i proposed it before
10:07:27 <brtknr> i see dioguerra lurking in the background
10:09:01 <flwang1> :)
10:09:10 <flwang1> let's end the meeting first
10:09:13 <flwang1> #endmeeting