09:00:39 #startmeeting magnum
09:00:39 Meeting started Wed Feb 12 09:00:39 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:40 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:42 The meeting name has been set to 'magnum'
09:00:50 #topic roll call
09:00:54 o/
09:02:59 o/
09:03:07 brtknr: hey, how are you?
09:03:14 good thanks, and you?
09:03:39 very good
09:03:56 let's wait for strigazi a bit
09:04:09 brtknr: did you see my comments on your CSI patch?
09:05:52 yes, thanks for reviewing
09:06:04 did you see the issue on devstack?
09:06:22 can you leave a comment about which k8s version you used and whether it was podman or coreos etc.?
09:06:37 i haven't seen the same issue locally
09:06:43 ok, will do
09:06:52 i'm using v1.16.3 with podman
09:06:56 and coreos
09:08:01 brtknr: shall we go through the agenda?
09:08:22 sounds good
09:08:57 btw, instead of --kubelet-insecure-tls, maybe i can use --kubelet-certificate-authority
09:09:07 and use the CA for the cluster?
09:09:34 brtknr: that would be great, otherwise enterprise users won't like it
09:09:39 1. Help with removing the constraint that there must be a minimum of 1 worker in a given nodegroup (including default-worker).
09:09:50 have you already got any idea for this?
09:10:53 i can manually specify count as 0 and get the cluster to reach CREATE_COMPLETE, but i haven't been able to override the value of node_count to 0; at the moment it defaults to 1
09:12:04 i haven't been able to figure out where exactly this constraint is applied. any pointer would be appreciated, but i realise there is no easy answer without properly digging underneath
09:12:08 what do you mean by manually specify? like openstack coe cluster create xxx --node-count 0 ?
09:12:29 that will not work because there is an API-level constraint
09:12:48 when i remove the API-level constraint, the node count still defaults to 1
09:12:53 brtknr: i see. so you mean you hacked the code to set it to 0?
09:13:02 i can override count in the kubecluster.yaml file
09:13:16 and only then does the cluster reach CREATE_COMPLETE
09:13:26 ah, i see.
09:13:51 i would say if it works at the Heat level, then the overall idea should work
09:13:59 we can manage it in magnum scope
09:14:24 it would be nice if you can dig in and propose a patch so that we can start to review from there
09:14:32 sounds good
09:15:12 2. metrics-server CrashLoopBackOff
09:15:20 is there anything we should discuss about this?
09:17:07 flwang1: i saw your comment
09:17:19 i will see if there is a way to do this without insecure TLS
09:18:27 there is a --kubelet-certificate-authority and a --tls-cert-file option which i haven't explored
09:18:30 cool, i appreciate your work on this issue
09:19:08 3. the heat agent log
09:19:10 we're still on the roll call topic btw
09:19:16 lol
09:19:16 ah, sorry
09:19:26 #topic heat agent log
09:19:44 for this one, we probably need to wait for the investigation result from strigazi
09:19:52 so let's skip it for now?
09:20:21 i was digging into the heat agent log yesterday, as it's sometimes impossible to see what is happening while the cluster is creating
09:20:52 the main problem is subprocess.communicate does not provide an option to stream output
09:21:39 :(
09:21:42 on the other hand, there may be a way to redirect stdout and stderr to a file
09:22:04 from wherever it's being executed
09:22:17 but happy to wait for what strigazi has to say
09:22:35 cool, thanks
09:22:57 move on?
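A minimal sketch of the streaming approach brtknr describes above: instead of waiting for subprocess.communicate() to return all output at once, the output could be appended to a log file line by line while the fragment is still running. This is illustrative only; the function and log path are assumptions, not the heat agent's actual code.

import subprocess

def run_and_stream(cmd, log_path="/var/log/heat-agent-step.log"):
    # Run a command and append its combined stdout/stderr to a log file
    # line by line, so progress is visible while the step is still running.
    with open(log_path, "a", buffering=1) as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
        for line in proc.stdout:       # stream lines as they are produced
            log.write(line)
        return proc.wait()             # return the exit code

Usage would be something like run_and_stream(["bash", "some-fragment.sh"]), with the caller tailing the log file to watch progress.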
09:25:54 sure
09:27:48 #topic volume AZ
09:27:56 brtknr: did you get a chance to review the volume AZ fix https://review.opendev.org/705592 ?
09:30:47 yes, it has complicated logic
09:31:34 brtknr: yep, i don't like it TBH, but i can't figure out a better way to solve it
09:32:17 how has it worked fine until now?
09:32:29 sorry?
09:32:42 you mean why it was working?
09:33:07 until now, how have we survived without this patch is what i'm asking
09:33:14 probably because most companies are not using multi-AZ
09:33:30 without multi-AZ, users won't run into this issue
09:34:02 as far as i know, Nectar (jakeyip) are hacking the code
09:34:30 i don't know if CERN is using multi-AZ
09:35:58 Hi there, may I interrupt with a question? I tried openstack coe cluster update $CLUSTER_NAME replace node_count=3 and it failed because of lack of resources, so my stack is in UPDATE_FAILED state. I fixed the quota and I tried to rerun the update command hoping it would kick it off again, but nothing happens. If I try openstack stack update --existing it will kick off the update, which succeeds (heat shows UPDATE_COMPLETE), but openstack coe cluster list still shows my stack in UPDATE_FAILED. Is there a way to rerun the update from magnum? Using OpenStack Train.
09:36:01 flwang1: can we not get a default value for az in the same way nova does?
09:37:03 elenalindq: if you are using Train with an up-to-date CLI, you can rerun `openstack coe cluster resize $CLUSTER_NAME 3`
09:37:08 brtknr: cinder can handle the "" for az
09:37:12 and this will re-update the heat stack
09:37:28 thank you brtknr!
09:37:30 flwang1: but not nova?
09:37:33 brtknr: correction, cinder can NOT handle the "" for az
09:37:40 but nova can
09:37:59 cinder will just return a 400 IIRC
09:38:22 flwang1: can we look into the nova code to see how they infer a sensible default for region name?
09:38:26 you can easily test this without using multi-AZ
09:38:56 are you trying to solve this issue in cinder?
09:39:45 i think magnum should have an internal default for availability zone
09:39:52 rather than ""
09:40:04 like a config option?
09:40:37 then how can you set the default value for this option?
09:40:50 and this may break backward compatibility :(
09:43:03 flwang1: hmm, will cinder accept None?
09:43:29 brtknr: no, based on what i tried
09:44:25 you can give the patch a try and we can discuss offline
09:44:48 it's a small issue, but it just makes the template complicated, i understand that
09:46:17 brtknr: let's move on
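On the volume AZ topic above: the core constraint discussed is that nova tolerates an empty availability zone while cinder rejects "" with a 400, so the AZ has to be passed to cinder only when one is actually set. A small Python sketch of that conditional; the function name and shape are illustrative, not the actual patch (which expresses this in the Heat template).

def cinder_volume_properties(volume_size, availability_zone=""):
    # Build properties for a cinder volume. Cinder rejects an empty-string
    # availability zone with a 400, while nova tolerates "", so the key is
    # only included when a real AZ was supplied.
    props = {"size": volume_size}
    if availability_zone:
        props["availability_zone"] = availability_zone
    return props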
09:46:31 #topic docker storage for fedora coreos
09:46:34 docker storage driver for fedora coreos https://review.opendev.org/696256
09:47:04 brtknr: can you pls revisit the above patch?
09:47:08 strigazi: ^
09:51:02 flwang1: my main issue with that patch is that we should try to use the same configure-docker-storage.sh and add a condition in there for fedora coreos
09:51:21 it is hard to factor out common elements when they are in different files
09:52:11 brtknr: we can't use the same script. i'm kind of using the same one from coreos
09:52:20 because the logic is different
09:53:25 brtknr: see https://github.com/openstack/magnum/blob/master/magnum/drivers/k8s_coreos_v1/templates/fragments/configure-docker.yaml
09:54:37 ah ok, i see what you mean
09:54:57 my bad
09:55:44 when i tested it, it worked for me
09:55:51 all good, please revisit it, because we do need it for the fedora coreos driver to remove the TODO :)
09:56:28 let's move on, we only have 5 mins
09:56:30 i just realised that atomic has its own fragment
09:56:41 so this pattern makes sense
09:56:49 #topic autoscaler podman issue
09:57:00 i am happy to take this patch as is
09:57:21 on the topic of the autoscaler, are you guys planning to work on supporting nodegroups?
09:57:31 this is a brand new bug, see https://github.com/kubernetes/autoscaler/issues/2819
09:58:17 brtknr: i'm planning to support the /resize API first and then nodegroups; not sure if the CERN guys will take on the nodegroups support
09:58:39 brtknr: and here is the fix https://review.opendev.org/707336
09:58:53 we just need to add the volume mount for /etc/machine-id
09:59:03 the bug reporter has confirmed that works for him
09:59:20 flwang1: excellent, looks reasonable to me
09:59:47 we haven't started using coreos in prod yet, but this kind of bug is precisely the reason why
10:00:06 what do you mean by support the resize API and then nodegroups?
10:00:17 doesn't it already support resize?
10:01:38 brtknr: it's using the old way https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/magnum/magnum_manager_heat.go#L231
10:02:21 in short, it calls heat to remove the node and then calls the magnum update API to set the correct number, which doesn't make sense given we now have the resize API
10:03:08 flwang1: ah okay, that's true
10:03:26 i'd like to use resize to replace it
10:03:35 why did we not update node count on magnum directly?
10:03:38 to drop the dependency on heat
10:04:07 because the magnum update API can't specify which node to delete
10:04:27 flwang1: oh i see
10:04:53 :)
10:04:57 but if you remove the node from the heat stack and update the node count, magnum doesn't remove any extra nodes?
10:05:21 magnum just updates the number because the node has already been deleted
10:05:53 it just "magically works" :)
10:06:06 i really want support for nodegroup autoscaling
10:06:11 anyway, i think we all agree resize is the right way to do this
10:06:18 happy to work on this but will need to learn golang first
10:06:23 brtknr: then show me the code :D
10:06:42 let's move on
10:06:51 i'm going to close this meeting now
10:07:01 we can discuss the out-of-the-box storage class offline
10:07:02 #topic out of the box storage class?
10:07:22 brtknr: it's related to this one https://review.opendev.org/676832
10:07:26 i proposed it before
10:07:27 i see dioguerra lurking in the background
10:09:01 :)
10:09:10 let's end the meeting first
10:09:13 #endmeeting
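For reference on the resize discussion above: the autoscaler's current cloud provider deletes the node via heat and then patches the cluster's node_count, whereas the Magnum cluster resize action takes the target count and the specific nodes to remove in one request. A rough sketch of what that single call looks like; the endpoint path and body fields follow the Magnum cluster-actions API, but auth handling is simplified and assumed, and this is not the autoscaler's actual code (which is Go).

import requests

def resize_cluster(magnum_endpoint, token, cluster_uuid,
                   node_count, nodes_to_remove=None):
    # One call replacing the heat-then-magnum two-step scale-down:
    # the resize action lets the caller specify both the target count
    # and exactly which nodes should be removed.
    body = {"node_count": node_count}
    if nodes_to_remove:
        body["nodes_to_remove"] = nodes_to_remove  # UUIDs of nodes to delete
    resp = requests.post(
        f"{magnum_endpoint}/v1/clusters/{cluster_uuid}/actions/resize",
        json=body,
        headers={"X-Auth-Token": token},
    )
    resp.raise_for_status()
    return resp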