09:00:11 <flwang1> #startmeeting magnum
09:00:12 <openstack> Meeting started Wed Mar 11 09:00:11 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:15 <openstack> The meeting name has been set to 'magnum'
09:00:28 <brtknr> o/
09:00:35 <strigazi> ο/
09:00:38 <flwang1> o/
09:01:26 <brtknr> My topics can go last, I’m still 5 mins away from work
09:01:35 <brtknr> not easy to type on the phone
09:02:06 <flwang1> ok
09:02:09 <flwang1> #topic Allow updating health_status, health_status_reason https://review.opendev.org/710384
09:02:48 <flwang1> strigazi: brtknr: i'd like to propose the above change to allow updating the health_status and health_status_reason via the update api
09:03:46 <strigazi> flwang1: would it make sense to configure who can do this by policy?
09:03:49 <flwang1> i'm still doing testing, but i'd like to get your comments
09:04:01 <flwang1> strigazi: that's a good idea
09:04:24 <flwang1> i can do that
09:04:49 <flwang1> the context is, all the k8s clusters on our cloud are private, which are not accessible by the magnum control plane
09:04:50 <brtknr> Would it make sense to make all health updates use this rather than magnum making api calls to the k8s endpoint?
09:05:12 <flwang1> so we would like to let the magnum-auto-healer send an api call to update the health status from inside the cluster
09:05:13 <brtknr> I.e. do we need 2 health monitoring mechanisms side by side?
09:05:38 <strigazi> I think we need to options, not two running together
09:05:45 <strigazi> s/to/two/
09:05:55 <flwang1> strigazi: +1
09:06:17 <flwang1> brtknr: the two options work for different scenarios
09:06:37 <flwang1> if the cluster is a private cluster, then currently we don't have an option to update the health status
09:06:52 <flwang1> but if it's a public cluster, then magnum can handle it correctly
09:08:15 <brtknr> But the api would work for both types of clusters
09:08:35 <flwang1> brtknr: yes
09:09:24 <flwang1> you can disable the magnum server side health monitoring if you want
09:10:02 <flwang1> but the problem is, not all vendors will deploy the magnum auto healer
09:10:33 <flwang1> make sense?
09:11:45 <brtknr> Ok
09:12:18 <brtknr> yeah i am happy to have the option, it's a nice workaround for private clusters
09:13:19 <flwang1> strigazi: except the role control, any other comments?
09:14:02 <brtknr> Do we have the different roles that magnum expects documented somewhere?
09:14:25 <flwang1> for this case or in general?
09:14:25 <brtknr> e.g. only a heat_stack_owner can deploy a cluster, for example
09:14:37 <strigazi> no, as a general comment, out-of-tree things should be opt-in. Only kubernetes cannot be opt-in
09:16:02 <strigazi> magnum has its own policy. We can tune policy-in-code or policy fie
09:16:09 <strigazi> s/fie/file/
09:18:30 <flwang1> in general, magnum has the policy.json, and you can define any role and update the file based on your need
09:18:42 <strigazi> +1
09:18:54 <flwang1> shall we move on?
09:19:16 <strigazi> brtknr: have you arrived?
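[Editor's note: a minimal sketch of the kind of update call an in-cluster reporter such as the magnum-auto-healer could send once https://review.opendev.org/710384 lands. The endpoint, token handling and the exact patch paths/values are assumptions based on the discussion above, not the merged API; operators could still restrict who may send it via magnum's policy file.]

    # Hypothetical in-cluster health reporter; all values are placeholders.
    import json
    import urllib.request

    MAGNUM_ENDPOINT = "https://magnum.example.com:9511/v1"  # assumed endpoint
    CLUSTER_UUID = "11111111-2222-3333-4444-555555555555"   # placeholder
    TOKEN = "gAAAA..."                                       # Keystone token

    # JSON-patch style body, mirroring how other cluster attributes are updated
    patch = [
        {"op": "replace", "path": "/health_status", "value": "UNHEALTHY"},
        {"op": "replace", "path": "/health_status_reason",
         "value": {"kube-minion-0.Ready": "False"}},
    ]

    req = urllib.request.Request(
        url="%s/clusters/%s" % (MAGNUM_ENDPOINT, CLUSTER_UUID),
        data=json.dumps(patch).encode(),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": TOKEN,
            "OpenStack-API-Version": "container-infra latest",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)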
09:19:40 <brtknr> yep ive been at my desk for 10 mins :D
09:19:48 <brtknr> seamless transition
09:19:58 <flwang1> #topic Restore deploy_{stdout,stderr,status_code} https://review.opendev.org/#/c/710487/
09:20:02 <flwang1> brtknr: ^
09:20:30 <brtknr> Ok basically it was bugging me that deploy_stderr was empty, and Rico also pointed out that this breaks backward compatibility
09:21:15 <brtknr> threading seems like the only way to handle two input streams simultaneously without the weird buffering behaviour you get when you use "select"
09:21:40 <strigazi> put the same thing as stderr and stdout \o/
09:22:10 <strigazi> brtknr: what do you think is the best option?
09:22:40 <brtknr> i think using threading to write to the same file but capture the two streams separately is the winner for me
09:23:06 <brtknr> since there might be a genuine deploy_stderr which would otherwise always resolve to being empty
09:23:29 <strigazi> I'm only a little scared that we make a critical component for us more complicated
09:23:59 <strigazi> The main reason for this is backwards compatibility?
09:24:20 <brtknr> it works though... threading is not a complicated thing...
09:24:58 <strigazi> ok if it works
09:25:17 <flwang1> i kind of share the same concern as strigazi
09:25:34 <flwang1> and personally, i'd like to see us merge all the things back to the heat-agents repo
09:26:05 <flwang1> it would be nice if we can share the maintenance of the heat-container-agents
09:26:24 <brtknr> The parallel change Rico suggested doesn't work as it reads stderr first and then consumes stdout all at once at the end
09:26:37 <strigazi> +1 ^^
09:27:55 <brtknr> I'm happy to share the maintenance burden with the heat team... looks like they have even incorporated some tests
09:28:13 <strigazi> IMO the best options are: two files and different outputs (as proposed by brtknr initially) or one file and duplicated output in heat
09:28:39 <flwang1> so my little tiny comment for this patch is, please collaborate with the heat team to make sure we are not far away from the original heat-agents code
09:29:20 <strigazi> we can try threading if you want, it produces exactly what we need.
09:29:44 <brtknr> That is my concern, removing deploy_stderr feels like cutting off an arm from the original thing
09:29:45 <strigazi> flwang1: I think heat follows us in this case
09:30:33 <flwang1> strigazi: good
09:30:56 <flwang1> i don't really care who follows who, TBH, i just want to see we're synced
09:31:25 <strigazi> flwang1: they follow == it was low priority for them
09:31:37 <flwang1> ok, fair enough
09:31:49 <strigazi> while for us it is high, we can sync of course.
09:32:24 <brtknr> please test the threading implementation, it's available at brtknr/heat-container-agent:ussuri-dev
09:32:43 <brtknr> i have tested with both coreos and atomic and it works
09:33:01 <strigazi> i will, let's move this way then.
09:33:09 <strigazi> last thing to add
09:34:05 <strigazi> I was hesitant because the number of bugs users found instead of us was very high in Train.
09:34:22 <strigazi> let's move on
09:34:47 <flwang1> strigazi: should we try to enable the k8s functional testing again?
09:34:53 <strigazi> no
09:34:54 <flwang1> or at least put it on our high priority list?
09:35:03 <strigazi> we can't
09:35:17 <flwang1> still blocked by the nested virt?
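[Editor's note: a minimal sketch of the threading approach discussed in the deploy_{stdout,stderr,status_code} topic above: one reader thread per pipe so neither stream blocks, the streams kept separate for deploy_stdout/deploy_stderr, and both appended to one combined log file. This only illustrates the idea and is not the code in brtknr's heat-container-agent patch; the log path is a placeholder.]

    import subprocess
    import threading

    def run(cmd, combined_log="/tmp/deploy.log"):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        captured = {"stdout": [], "stderr": []}
        lock = threading.Lock()

        def reader(pipe, name):
            # Read line by line, keep a per-stream copy and append to the
            # shared combined log under a lock so writes stay whole.
            with open(combined_log, "ab") as log:
                for line in iter(pipe.readline, b""):
                    captured[name].append(line)
                    with lock:
                        log.write(line)
                        log.flush()
            pipe.close()

        threads = [threading.Thread(target=reader, args=(proc.stdout, "stdout")),
                   threading.Thread(target=reader, args=(proc.stderr, "stderr"))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        status = proc.wait()
        return (b"".join(captured["stdout"]).decode(),
                b"".join(captured["stderr"]).decode(),
                status)

    # deploy_stdout, deploy_stderr, deploy_status_code = run(["/path/to/script"])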
09:35:33 <strigazi> lack of infra/slow infra
09:36:09 <flwang1> CatalystCloud hasn't enabled the nested virt yet
09:36:25 <flwang1> but we will do in the near future, then we may be able to contribute the infra for testing
09:36:30 <flwang1> CI i mean
09:36:40 <strigazi> sounds good
09:37:20 <flwang1> move on?
09:37:36 <strigazi> yeap
09:38:07 <flwang1> #topic Release 9.3.0
09:39:05 <strigazi> We need some fixes for logging
09:39:16 <strigazi> disable zincati updates
09:39:45 <strigazi> and fix cluster resize (we found a corner case)
09:40:09 <flwang1> if so, we can hold it a bit? brtknr
09:40:15 <brtknr> flwang1: yeah no problem
09:40:27 <brtknr> that's why I wanted to ask you guys if there was anything
09:41:09 <flwang1> i appreciate that
09:41:30 <flwang1> strigazi: anything else?
09:42:12 <strigazi> the logging issue is too serious for us. Heavy services break nodes (fill up the disk)
09:42:18 <strigazi> that's it
09:42:32 <brtknr> how did you choose a value of 50 million?
09:42:38 <strigazi> it is strange you haven't encountered it
09:42:48 <strigazi> 50μ
09:42:50 <strigazi> 50m
09:43:14 <brtknr> Ah 50m
09:43:22 <strigazi> megabytes
09:43:36 <strigazi> I think it is reasonable. Can't explode nodes
09:43:53 <brtknr> should this be configurable?
09:43:59 <flwang1> can you explain more? is it a very frequent issue?
09:44:03 <strigazi> It is not aggressive for reasonable services. I mean the logs will stay there for long
09:45:04 <strigazi> if a service produces a lot of logs and they are not rotated the disk fills up.
09:45:15 <strigazi> or creates pressure
09:46:02 <brtknr> so do you think this option should be configurable?
09:46:09 <brtknr> with a default value of 50m?
09:46:14 <brtknr> or is that overkill?
09:46:26 <strigazi> it feels like overkill
09:46:48 <strigazi> the nodes are not a proper place to hold a lot of logs
09:47:12 <strigazi> this is not an opinion :)
09:47:24 <strigazi> what do you think?
09:48:16 <flwang1> the log is for the pod/container, right?
09:48:31 <strigazi> and of k8s services in podman
09:48:38 <strigazi> s/of/for/
09:48:58 <flwang1> ok, max 100 pods per node, so 50 * 100 = 5G, that makes sense to me
09:49:07 <flwang1> and i think it's large enough
09:49:29 <flwang1> as the local log
09:50:39 <strigazi> move on?
09:50:53 <flwang1> #topic https://review.opendev.org/#/c/712154/ Fedora CoreOS Configuration
09:50:57 <flwang1> strigazi: ^
09:51:59 <strigazi> I pushed the ignition user_data in a human readable format
09:52:21 <strigazi> from that format the user_data.json can be generated
09:52:49 <flwang1> cool, looks good, i will review it
09:53:01 <strigazi> when do we take this?
09:53:13 <strigazi> before my patch of logging or after?
09:53:38 <brtknr> in https://review.opendev.org/#/c/712154/3/magnum/drivers/k8s_fedora_coreos_v1/templates/fcct-config.yaml line 167, does $CONTAINER_INFRA_PREFIX come from heat or /etc/environment?
09:53:38 <strigazi> s/of/for/ this is a new pattern for me now
09:53:39 <flwang1> is there any dependency b/w them?
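[Editor's note: a back-of-envelope check of the 50m per-container cap from the Release 9.3.0 topic above; the 100 pods per node figure is the round number used in the conversation, not a kubelet default.]

    # Rough worst-case disk usage from the proposed per-container log cap.
    max_log_size_mb = 50      # proposed cap per container/service log
    max_pods_per_node = 100   # round figure used in the discussion
    worst_case_gb = max_log_size_mb * max_pods_per_node / 1024.0
    print("worst case local log usage: ~%.1f GB per node" % worst_case_gb)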
09:53:59 <brtknr> i think we should take this first then the logging
09:54:16 <brtknr> flwang1: yes, both update fcct-config.yaml
09:54:17 <strigazi> brtknr: this means both will be backported
09:54:41 <brtknr> yes
09:54:43 <strigazi> ok
09:55:24 <flwang1> i agree format first
09:55:36 <flwang1> which can make your rebase easier, i gues
09:55:39 <flwang1> guess
09:55:51 <brtknr> i can +2 quickly if you can address my comment
09:56:22 <strigazi> I will
09:57:19 <brtknr> Should we add a test to make sure that the user-data.json generated from fcct-config.yaml matches the one in the commit tree?
09:57:21 <brtknr> :P
09:57:24 <strigazi> let's test first though :) you never know. And give flwang1 a chance to review
09:57:48 <flwang1> thanks :)
09:59:12 <brtknr> strigazi: what is this cluster resize corner case?
09:59:24 <strigazi> 1. cluster create
09:59:48 <brtknr> is this after stein->train upgrade?
09:59:48 <strigazi> 1.1 with an old kube_version (not kube_tag) in kubecluster.yaml
10:00:02 <strigazi> 2. create a nodegroup
10:00:09 <strigazi> 3. resize default ng
10:00:30 <strigazi> causes change of user_data
10:01:00 <brtknr> yikes
10:01:12 <brtknr> do you have a fix?
10:01:41 <brtknr> does this rebuild the whole cluster?
10:01:47 <brtknr> or just the nodegroups?
10:02:49 <brtknr> strigazi:
10:02:52 <flwang1> hi team, can i close the meeting?
10:02:59 <brtknr> yes
10:03:00 <flwang1> we can discuss this resize issue offline
10:03:06 <strigazi> we have
10:03:06 <flwang1> #endmeeting
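[Editor's note: one possible shape for the regression test brtknr suggests above, keeping the committed user_data.json in sync with fcct-config.yaml. The template paths and the transpiler invocation (fcct reading a Fedora CoreOS Config on stdin and writing ignition JSON to stdout, with --pretty/--strict) are assumptions, not the driver's actual test code.]

    import json
    import subprocess
    import unittest

    TEMPLATES = "magnum/drivers/k8s_fedora_coreos_v1/templates"

    class TestUserDataInSync(unittest.TestCase):
        def test_user_data_matches_fcct_config(self):
            with open("%s/fcct-config.yaml" % TEMPLATES, "rb") as f:
                fcc = f.read()
            # fcct CLI flags here are an assumption about the transpiler
            generated = subprocess.run(
                ["fcct", "--pretty", "--strict"],
                input=fcc, stdout=subprocess.PIPE, check=True).stdout
            with open("%s/user_data.json" % TEMPLATES) as f:
                committed = json.load(f)
            # The committed ignition JSON should match a fresh regeneration
            self.assertEqual(committed, json.loads(generated))

    if __name__ == "__main__":
        unittest.main()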