09:00:11 #startmeeting magnum
09:00:12 Meeting started Wed Mar 11 09:00:11 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:15 The meeting name has been set to 'magnum'
09:00:28 o/
09:00:35 ο/
09:00:38 o/
09:01:26 My topics can go last, I'm still 5 mins away from work
09:01:35 not easy to type on the phone
09:02:06 ok
09:02:09 #topic Allow updating health_status, health_status_reason https://review.opendev.org/710384
09:02:48 strigazi: brtknr: i'd like to propose the above change to allow updating health_status and health_status_reason via the update API
09:03:46 flwang1: would it make sense to configure who can do this by policy?
09:03:49 i'm still doing testing, but i'd like to get your comments
09:04:01 strigazi: that's a good idea
09:04:24 i can do that
09:04:49 the context is, all the k8s clusters on our cloud are private, so they are not accessible by the magnum control plane
09:04:50 Would it make sense to make all health updates use this rather than magnum making API calls to the k8s endpoint?
09:05:12 so we would like to let the magnum-auto-healer send an API call to update the health status from inside the cluster
09:05:13 I.e. do we need 2 health monitoring mechanisms side by side?
09:05:38 I think we need to options, not two running together
09:05:45 s/to/two/
09:05:55 strigazi: +1
09:06:17 brtknr: the two options work for different scenarios
09:06:37 if the cluster is a private cluster, then currently we don't have an option to update the health status
09:06:52 but if it's a public cluster, then magnum can handle it correctly
09:08:15 But the API would work for both types of clusters
09:08:35 brtknr: yes
09:09:24 you can disable the magnum server-side health monitoring if you want
09:10:02 but the problem is, not all vendors will deploy the magnum auto healer
09:10:33 make sense?
09:11:45 Ok
09:12:18 yeah i am happy to have the option, it's a nice workaround for private clusters
09:13:19 strigazi: apart from the role control, any other comments?
09:14:02 Do we have the different roles that magnum expects documented somewhere?
09:14:25 for this case or in general?
09:14:25 e.g. only a heat_stack_owner can deploy a cluster, for example
09:14:37 no, as a general comment, out-of-tree things should be opt-in. Only kubernetes cannot be opt-in
09:16:02 magnum has its own policy. We can tune policy-in-code or policy fie
09:16:09 s/fie/file/
09:18:30 in general, magnum has the policy.json, and you can define any role and update the file based on your needs
09:18:42 +1
09:18:54 shall we move on?
09:19:16 brtknr: have you arrived?
09:19:40 yep i've been at my desk for 10 mins :D
09:19:48 seamless transition
09:19:58 #topic Restore deploy_{stdout,stderr,status_code} https://review.opendev.org/#/c/710487/
09:20:02 brtknr: ^
09:20:30 Ok basically it was bugging me that deploy_stderr was empty, and Rico also pointed out that this breaks backward compatibility
09:21:15 threading seems like the only way to handle two input streams simultaneously without the weird buffering behaviour you get when you use "select"
09:21:40 put the same thing as stderr and stdout \o/
09:22:10 brtknr: what do you think is the best option?
09:22:40 i think using threading to write to the same file but capture the two streams separately is the winner for me
09:23:06 since there might be a genuine deploy_stderr, which would otherwise always resolve to being empty
09:23:29 I'm only a little scared that we're making a critical component of ours more complicated
09:23:59 The main reason for this is backwards compatibility?
09:24:20 it works though... threading is not a complicated thing...
09:24:58 ok if it works
09:25:17 i kind of share the same concern as strigazi
09:25:34 and personally, i'd like to see us merge all these things back into the heat-agents repo
09:26:05 it would be nice if we could share the maintenance of the heat-container-agent
09:26:24 The parallel change Rico suggested doesn't work as it reads stderr first and then consumes stdout all at once at the end
09:26:37 +1 ^^
09:27:55 I'm happy to share the maintenance burden with the heat team... looks like they have even incorporated some tests
09:28:13 IMO the best options are: two files and different outputs (as proposed by brtknr initially) or one file and duplicated output in heat
09:28:39 so my little tiny comment for this patch is, please collaborate with the heat team to make sure we are not far away from the original heat-agents code
09:29:20 we can try threading if you want, it produces exactly what we need.
09:29:44 That is my concern, removing deploy_stderr feels like cutting off an arm from the original thing
09:29:45 flwang1: I think heat follows us in this case
09:30:33 strigazi: good
09:30:56 i don't really care who follows who, TBH, i just want to see we're synced
09:31:25 flwang1: they follow == it was low priority for them
09:31:37 ok, fair enough
09:31:49 while for us it is high, we can sync of course.
09:32:24 please test the threading implementation, it's available at brtknr/heat-container-agent:ussuri-dev
09:32:43 i have tested with both coreos and atomic and it works
09:33:01 i will, let's move this way then.
09:33:09 last thing to add
09:34:05 I was hesitant because the number of bugs users found instead of us was very high in train.
09:34:22 let's move on
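A minimal sketch of the threading approach discussed above, assuming the agent runs the deployment script as a subprocess; the function names are illustrative and not the actual hook code in heat-agents or the brtknr/heat-container-agent:ussuri-dev image:

    import subprocess
    import threading

    def _tee(stream, captured, sink, lock):
        # Read one pipe line by line; keep a private copy for the
        # deploy_* output and append to the shared combined log under a lock.
        for line in iter(stream.readline, b''):
            captured.append(line)
            with lock:
                sink.write(line)
                sink.flush()
        stream.close()

    def run_and_capture(cmd, combined_log_path):
        # Run cmd, stream stdout and stderr into one combined log file as
        # they arrive, and still return them separately afterwards.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out_lines, err_lines = [], []
        lock = threading.Lock()
        with open(combined_log_path, 'ab') as sink:
            readers = [
                threading.Thread(target=_tee,
                                 args=(proc.stdout, out_lines, sink, lock)),
                threading.Thread(target=_tee,
                                 args=(proc.stderr, err_lines, sink, lock)),
            ]
            for t in readers:
                t.start()
            for t in readers:
                t.join()
        status_code = proc.wait()
        return (b''.join(out_lines).decode('utf-8', 'replace'),
                b''.join(err_lines).decode('utf-8', 'replace'),
                status_code)

This keeps deploy_stdout and deploy_stderr distinct while still producing a single, chronologically interleaved log file, which is the behaviour described in the discussion above.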
09:34:47 strigazi: should we try to enable the k8s functional testing again?
09:34:53 no
09:34:54 or at least put it on our high priority?
09:35:03 we can't
09:35:17 still blocked by the nested virt?
09:35:33 lack of infra/slow infra
09:36:09 CatalystCloud hasn't enabled nested virt yet
09:36:25 but we will in the near future, then we may be able to contribute the infra for testing
09:36:30 CI i mean
09:36:40 sounds good
09:37:20 move on?
09:37:36 yeap
09:38:07 #topic Release 9.3.0
09:39:05 We need some fixes for logging
09:39:16 disable zincati updates
09:39:45 and fix cluster resize (we found a corner case)
09:40:09 if so, we can hold it a bit? brtknr
09:40:15 flwang1: yeah no problem
09:40:27 that's why I wanted to ask you guys if there was anything
09:41:09 i appreciate that
09:41:30 strigazi: anything else?
09:42:12 the logging issue is too serious for us. Heavy services break nodes (fill up the disk)
09:42:18 that's it
09:42:32 how did you choose a value of 50 million?
09:42:38 it is strange you haven't encountered it
09:42:48 50μ
09:42:50 50m
09:43:14 Ah 50m
09:43:22 megabytes
09:43:36 I think it is reasonable. Can't explode nodes
09:43:53 should this be configurable?
09:43:59 can you explain more? is it a very frequent issue?
09:44:03 It is not aggressive for reasonable services. I mean the logs will stay there for long
09:45:04 if a service produces a lot of logs and they are not rotated, the disk fills up.
09:45:15 or creates pressure
09:46:02 so do you think this option should be configurable?
09:46:09 with a default value of 50m?
09:46:14 or is that overkill?
09:46:26 it feels like overkill
09:46:48 the nodes are not a proper place to hold a lot of logs
09:47:12 this is not an opinion :)
09:47:24 what do you think?
09:48:16 the log is for the pod/container, right?
09:48:31 and of k8s services in podman
09:48:38 s/of/for/
09:48:58 ok, max 100 pods per node, so 50 * 100 = 5G, that makes sense to me
09:49:07 and i think it's large enough
09:49:29 as the local log
09:50:39 move on?
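For reference on the 50m figure: a per-container log cap of that size can be expressed, for example, through Docker's json-file log driver options, e.g. in /etc/docker/daemon.json on the nodes. The snippet below is purely illustrative; the patch under discussion may wire the limit through different knobs, in particular for the k8s services running under podman:

    {
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "50m"
        }
    }

With the rough estimate of 100 pods per node quoted above, a 50 MB cap bounds per-node container logs to around 5 GB.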
09:50:53 #topic https://review.opendev.org/#/c/712154/ Fedora CoreOS Configuration
09:50:57 strigazi: ^
09:51:59 I pushed the ignition user_data in a human-readable format
09:52:21 from that format the user_data.json can be generated
09:52:49 cool, looks good, i will review it
09:53:01 when do we take this?
09:53:13 before my patch of logging or after?
09:53:38 in https://review.opendev.org/#/c/712154/3/magnum/drivers/k8s_fedora_coreos_v1/templates/fcct-config.yaml line 167, does $CONTAINER_INFRA_PREFIX come from heat or /etc/environment?
09:53:38 s/of/for/ this is a new pattern for me now
09:53:39 is there any dependency b/w them?
09:53:59 i think we should take this first, then the logging
09:54:16 flwang1: yes, both update fcct-config.yaml
09:54:17 brtknr: this means both will be backported
09:54:41 yes
09:54:43 ok
09:55:24 i agree, format first
09:55:36 which can make your rebase easier, i gues
09:55:39 guess
09:55:51 i can +2 quickly if you can address my comment
09:56:22 I will
09:57:19 Should we add a test to make sure that the user-data.json generated from fcct-config.yaml matches the one in the commit tree?
09:57:21 :P
09:57:24 let's test first though :) you never know. And give flwang1 a chance to review
09:57:48 thanks :)
09:59:12 strigazi: what is this cluster resize corner case?
09:59:24 1. cluster create
09:59:48 is this after stein->train upgrade?
09:59:48 1.1 with an old kube_version (not kube_tag) in kubecluster.yaml
10:00:02 2. create a nodegroup
10:00:09 3. resize default ng
10:00:30 causes change of user_data
10:01:00 yikes
10:01:12 do you have a fix?
10:01:41 does this rebuild the whole cluster?
10:01:47 or just the nodegroups?
10:02:49 strigazi:
10:02:52 hi team, can i close the meeting?
10:02:59 yes
10:03:00 we can discuss this resize issue offline
10:03:06 we have
10:03:06 #endmeeting
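On the test suggested at 09:57:19 (regenerating user_data.json from fcct-config.yaml and comparing it with the committed file), a sketch of what such a check could look like, assuming the fcct transpiler is on PATH and accepts the --pretty/--strict flags of recent releases; the file paths and test name are illustrative, not code from the Magnum tree:

    import json
    import subprocess
    import unittest

    class UserDataInSyncTest(unittest.TestCase):
        """Check that the committed user_data.json still matches what
        fcct generates from the human-readable fcct-config.yaml."""

        def test_user_data_matches_fcct_config(self):
            with open('fcct-config.yaml', 'rb') as src:
                generated = subprocess.run(
                    ['fcct', '--pretty', '--strict'],
                    stdin=src, stdout=subprocess.PIPE, check=True).stdout
            with open('user_data.json') as committed:
                self.assertEqual(json.loads(generated), json.load(committed))

    if __name__ == '__main__':
        unittest.main()

Comparing the parsed JSON rather than the raw bytes keeps the check independent of whitespace and key ordering in the committed file.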