09:00:11 <flwang1> #startmeeting magnum
09:00:12 <openstack> Meeting started Wed Mar 11 09:00:11 2020 UTC and is due to finish in 60 minutes.  The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:15 <openstack> The meeting name has been set to 'magnum'
09:00:28 <brtknr> o/
09:00:35 <strigazi> ο/
09:00:38 <flwang1> o/
09:01:26 <brtknr> My topics can go last, I’m still 5 mins away from work
09:01:35 <brtknr> not easy to type on the phone
09:02:06 <flwang1> ok
09:02:09 <flwang1> #topic Allow updating health_status, health_status_reason https://review.opendev.org/710384
09:02:48 <flwang1> strigazi: brtknr: i'd like to propose the above change to allow updating the health_status and health_status_reason via the update API
09:03:46 <strigazi> flwang1: would it make sense to configure who can do this by policy?
09:03:49 <flwang1> i'm still doing testing, but i'd like to get you guys' comments
09:04:01 <flwang1> strigazi: that's a good idea
09:04:24 <flwang1> i can do that
09:04:49 <flwang1> the context is, all the k8s clusters on our cloud are private, so they are not accessible from the magnum control plane
09:04:50 <brtknr> Would it make sense to make all health updates use this rather than magnum making API calls to the k8s endpoint?
09:05:12 <flwang1> so we would like to let the magnum-auto-healer send an API call to update the health status from inside the cluster
09:05:13 <brtknr> I.e. do we need 2 health monitoring mechanisms side by side?
09:05:38 <strigazi> I think we need two options, not two running together
09:05:55 <flwang1> strigazi: +1
09:06:17 <flwang1> brtknr: the two options work for different scenarios
09:06:37 <flwang1> if the cluster is a private cluster, then currently we don't have an option to update the health status
09:06:52 <flwang1> but if it's a public cluster, then magnum can handle it correctly
09:08:15 <brtknr> But the api would work for both types of clusters
09:08:35 <flwang1> brtknr: yes
09:09:24 <flwang1> you can disable the magnum server side health monitoring if you want
09:10:02 <flwang1> but the problem is, not all vendors will deploy the magnum auto-healer
09:10:33 <flwang1> make sense?
09:11:45 <brtknr> Ok
09:12:18 <brtknr> yeah i am happy to have the option, it's a nice workaround for private clusters
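(For context, a rough sketch of the kind of call the magnum-auto-healer could make from inside a private cluster, assuming the proposed patch lands and the cluster PATCH endpoint accepts health_status / health_status_reason; the endpoint, token, UUID and reason payload below are illustrative, not taken from the patch.)

```python
# Hypothetical values: a real deployment would use the keystone catalog and a
# service token instead of hard-coded strings.
import requests

MAGNUM_ENDPOINT = "https://magnum.example.com:9511/v1"
CLUSTER_UUID = "11111111-2222-3333-4444-555555555555"
TOKEN = "<keystone-token>"

patch = [
    {"op": "replace", "path": "/health_status", "value": "UNHEALTHY"},
    {"op": "replace", "path": "/health_status_reason",
     "value": {"kube-master-0.Ready": "False"}},  # example reason payload
]

resp = requests.patch(
    f"{MAGNUM_ENDPOINT}/clusters/{CLUSTER_UUID}",
    headers={"X-Auth-Token": TOKEN,
             "OpenStack-API-Version": "container-infra latest"},
    json=patch,
)
resp.raise_for_status()
```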
09:13:19 <flwang1> strigazi: apart from the role control, any other comments?
09:14:02 <brtknr> Do we have the different roles that magnum expects documented somewhere?
09:14:25 <flwang1> for this case or general?
09:14:25 <brtknr> e.g. only a heat_stack_owner can deploy a cluster
09:14:37 <strigazi> no, as a general comment, out-of-tree things should be opt-in. Only kubernetes cannot be opt-in
09:16:02 <strigazi> magnum has its own policy. We can tune policy-in-code or the policy file
09:18:30 <flwang1> in general, magnum has the policy.json, and you can define any role and update the file based on your needs
09:18:42 <strigazi> +1
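(A sketch of the policy-based control strigazi suggests, using oslo.policy in the way Magnum's policy-in-code generally works; the rule name "cluster:update_health_status" and the "magnum_healer" role are hypothetical examples, not names taken from Magnum.)

```python
# Operators could still override the default rule via policy.json / policy.yaml,
# which the Enforcer loads on top of these in-code defaults.
from oslo_config import cfg
from oslo_policy import policy

enforcer = policy.Enforcer(cfg.CONF)
enforcer.register_default(policy.RuleDefault(
    "cluster:update_health_status",          # hypothetical rule name
    "role:admin or role:magnum_healer",      # hypothetical check string
    description="Update health_status / health_status_reason of a cluster"))

def can_update_health(credentials, target_cluster):
    # credentials: dict of the caller's roles/project;
    # target_cluster: dict describing the cluster (e.g. its project_id).
    return enforcer.enforce("cluster:update_health_status",
                            target_cluster, credentials)
```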
09:18:54 <flwang1> shall we move on?
09:19:16 <strigazi> brtknr: have you arrived?
09:19:40 <brtknr> yep I've been at my desk for 10 mins :D
09:19:48 <brtknr> seamless transition
09:19:58 <flwang1> #topic Restore deploy_{stdout,stderr,status_code} https://review.opendev.org/#/c/710487/
09:20:02 <flwang1> brtknr: ^
09:20:30 <brtknr> Ok basically it was bugging me that deploy_stderr was empty, and Rico also pointed out that this breaks backward compatibility
09:21:15 <brtknr> threading seems like the only way to handle two input streams simultaneously without weird buffering behaviour you get when you use "select"
09:21:40 <strigazi> put the same thing as stderr and stdout \o/
09:22:10 <strigazi> brtknr: what do you think is the best option?
09:22:40 <brtknr> i think using threading to write to the same file but capture the two streams separately is the winner for me
09:23:06 <brtknr> since there might be genuine stderr output, which would otherwise always resolve to an empty deploy_stderr
09:23:29 <strigazi> I'm only a little scared that we're making a component that is critical for us more complicated
09:23:59 <strigazi> The main reason for this is backwards compatibility?
09:24:20 <brtknr> it works though... threading is not a complicated thing...
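(A minimal sketch of the threading approach brtknr describes: one reader thread per stream, each keeping its own copy for deploy_stdout / deploy_stderr while both append to a shared log; the command and log path are illustrative, not the hook's actual code.)

```python
import subprocess
import threading

def pump(stream, copy, log):
    # Drain one pipe line by line: keep a private copy for deploy_stdout /
    # deploy_stderr while also appending to the shared log file.
    for line in iter(stream.readline, b""):
        copy.append(line)
        log.write(line)
    stream.close()

proc = subprocess.Popen(["/bin/sh", "-c", "echo out; echo err >&2"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out_chunks, err_chunks = [], []
with open("/tmp/deploy.log", "ab") as log:
    readers = [threading.Thread(target=pump, args=(proc.stdout, out_chunks, log)),
               threading.Thread(target=pump, args=(proc.stderr, err_chunks, log))]
    for t in readers:
        t.start()
    for t in readers:
        t.join()
deploy_status_code = proc.wait()
deploy_stdout = b"".join(out_chunks).decode()
deploy_stderr = b"".join(err_chunks).decode()
```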
09:24:58 <strigazi> ok if it works
09:25:17 <flwang1> i kind of share the same concern as strigazi
09:25:34 <flwang1> and personally, i'd like to see us merge all the things back into the heat-agents repo
09:26:05 <flwang1> it would be nice if we could share the maintenance of the heat-container-agent
09:26:24 <brtknr> The parallel change Rico suggested doesn't work as it reads stderr first and then consumes stdout all at once at the end
09:26:37 <strigazi> +1 ^^
09:27:55 <brtknr> I'm happy to share the maintenance burden with the heat team... looks like they have even incorporated some tests
09:28:13 <strigazi> IMO the best options are: two files and different outputs (as proposed by brtknr initially) or one file and duplicated output in heat
09:28:39 <flwang1> so my tiny comment for this patch is: please collaborate with the heat team to make sure we don't drift far from the original heat-agents code
09:29:20 <strigazi> we can try threading if you want, it produces exactly what we need.
09:29:44 <brtknr> That is my concern, removing deploy_stderr feels like cutting off an arm from the original thing
09:29:45 <strigazi> flwang1: I think heat follows us in this case
09:30:33 <flwang1> strigazi: good
09:30:56 <flwang1> i don't really care who follows who, TBH, i just want to see we're synced
09:31:25 <strigazi> flwang1: they follow == it was low priority for them
09:31:37 <flwang1> ok, fair enough
09:31:49 <strigazi> while for us it is high, we can sync of course.
09:32:24 <brtknr> please test the threading implementation, it's available at brtknr/heat-container-agent:ussuri-dev
09:32:43 <brtknr> i have tested with both coreos and atomic and it works
09:33:01 <strigazi> i will, let's move this way then.
09:33:09 <strigazi> last thing to add
09:34:05 <strigazi> I was hesitant because the number of bugs users found instead of us was very high in Train.
09:34:22 <strigazi> let's move on
09:34:47 <flwang1> strigazi: should we try to enable the k8s functional testing again?
09:34:53 <strigazi> no
09:34:54 <flwang1> or at least put it on our high priority?
09:35:03 <strigazi> we can't
09:35:17 <flwang1> still blocked by the nested virt?
09:35:33 <strigazi> lack of infra/slow infra
09:36:09 <flwang1> CatalystCloud hasn't enabled nested virt yet
09:36:25 <flwang1> but we will in the near future, and then we may be able to contribute the infra for testing
09:36:30 <flwang1> CI i mean
09:36:40 <strigazi> sounds good
09:37:20 <flwang1> move on?
09:37:36 <strigazi> yeap
09:38:07 <flwang1> #topic Release 9.3.0
09:39:05 <strigazi> We need some fixes for logging
09:39:16 <strigazi> disable zincati updates
09:39:45 <strigazi> and fix cluster resize (we found a corner case)
09:40:09 <flwang1> if so, can we hold it a bit? brtknr
09:40:15 <brtknr> flwang1: yeah no problem
09:40:27 <brtknr> that's why I wanted to ask you guys if there was anything
09:41:09 <flwang1> i appreciate that
09:41:30 <flwang1> strigazi: anything else?
09:42:12 <strigazi> the logging issue is too serious for us. Heavy services break nodes (fill up the disk)
09:42:18 <strigazi> that's it
09:42:32 <brtknr> how did you choose a value of 50 million?
09:42:38 <strigazi> it is strange you haven't encountered it
09:42:48 <strigazi> 50m
09:43:14 <brtknr> Ah 50m
09:43:22 <strigazi> megabytes
09:43:36 <strigazi> I think it is reasonable. Can't explode nodes
09:43:53 <brtknr> should this be configurable?
09:43:59 <flwang1> can you explain more? is it a very frequent issue?
09:44:03 <strigazi> It is not aggressive for reasonable services. I mean the logs will stay there for a long time
09:45:04 <strigazi> if a service produces a lot of logs and they are not rotated, the disk fills up.
09:45:15 <strigazi> or it creates pressure
09:46:02 <brtknr> so do you think this option should be configurable?
09:46:09 <brtknr> with a default value of 50m?
09:46:14 <brtknr> or is that overkill?
09:46:26 <strigazi> it feels like overkill
09:46:48 <strigazi> the nodes are not a proper place to hold a lot of logs
09:47:12 <strigazi> this is not an opinion :)
09:47:24 <strigazi> what do you think?
09:48:16 <flwang1> the log limit is per pod/container, right?
09:48:31 <strigazi> and for the k8s services in podman
09:48:58 <flwang1> ok, max 100 pods per node, so 50 MB * 100 = 5 GB, that makes sense to me
09:49:07 <flwang1> and i think it's large enough
09:49:29 <flwang1> as the local log
09:50:39 <strigazi> move on?
09:50:53 <flwang1> #topic https://review.opendev.org/#/c/712154/ Fedora CoreOS Configuration
09:50:57 <flwang1> strigazi: ^
09:51:59 <strigazi> I pushed the ignition user_data in a human-readable format
09:52:21 <strigazi> from that format the user_data.json can be generated
09:52:49 <flwang1> cool, looks good, i will review it
09:53:01 <strigazi> when do we take this?
09:53:13 <strigazi> before my patch for logging or after?
09:53:38 <brtknr> in https://review.opendev.org/#/c/712154/3/magnum/drivers/k8s_fedora_coreos_v1/templates/fcct-config.yaml  line 167, does $CONTAINER_INFRA_PREFIX come from heat or /etc/environment?
09:53:38 <strigazi> this is a new pattern for me now
09:53:39 <flwang1> is there any dependency between them?
09:53:59 <brtknr> i think we should take this first then the logging
09:54:16 <brtknr> flwang1: yes, both update fcct-config.yaml
09:54:17 <strigazi> brtknr: this means both will be backported
09:54:41 <brtknr> yes
09:54:43 <strigazi> ok
09:55:24 <flwang1> i agree, format first
09:55:36 <flwang1> which can make your rebase easier, i guess
09:55:51 <brtknr> i can +2 quickly if you can address my comment
09:56:22 <strigazi> I will
09:57:19 <brtknr> Should we add a test to make sure that the user-data.json generated from fcct-config.yaml matches the one in the commit tree?
09:57:21 <brtknr> :P
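(A sketch of the test brtknr suggests: regenerate the ignition JSON from fcct-config.yaml and compare it with the committed user_data.json; the file paths and the exact fcct invocation are assumptions, not taken from the review.)

```python
import json
import subprocess
import unittest

# Assumed paths, following the review under discussion.
FCC = "magnum/drivers/k8s_fedora_coreos_v1/templates/fcct-config.yaml"
IGN = "magnum/drivers/k8s_fedora_coreos_v1/templates/user_data.json"

class TestUserDataInSync(unittest.TestCase):
    def test_user_data_matches_fcct_config(self):
        # fcct reads the config on stdin and prints the ignition JSON on stdout.
        with open(FCC, "rb") as src:
            generated = subprocess.run(["fcct", "--pretty"], stdin=src,
                                       check=True, capture_output=True).stdout
        with open(IGN) as committed:
            self.assertEqual(json.loads(generated), json.load(committed))

if __name__ == "__main__":
    unittest.main()
```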
09:57:24 <strigazi> let's test first though :) you never know. And give flwang1 a chance to review
09:57:48 <flwang1> thanks :)
09:59:12 <brtknr> strigazi: what is this cluster resize corner case?
09:59:24 <strigazi> 1. cluster create
09:59:48 <brtknr> is this after stein->train upgrade?
09:59:48 <strigazi> 1.1 with an old kube_version (not kube_tag) in kubecluster.yaml
10:00:02 <strigazi> 2. create a nodegroup
10:00:09 <strigazi> 3. resize default ng
10:00:30 <strigazi> causes change of user_data
10:01:00 <brtknr> yikes
10:01:12 <brtknr> do you have a fix?
10:01:41 <brtknr> does this rebuild the whole cluster?
10:01:47 <brtknr> or just the nodegroups?
10:02:49 <brtknr> strigazi:
10:02:52 <flwang1> hi team, can i close the meeting?
10:02:59 <brtknr> yes
10:03:00 <flwang1> we can discuss this resize issue offline
10:03:06 <strigazi> we have
10:03:06 <flwang1> #endmeeting