openstackgerrit | Feilong Wang proposed openstack/magnum master: [k8s] Upgrade calico/coredns to the latest stable version https://review.opendev.org/705599 | 02:06 |
*** xinliang has joined #openstack-containers | 03:28 | |
*** xinliang has quit IRC | 03:51 | |
*** flwang1 has quit IRC | 04:07 | |
*** ykarel|away is now known as ykarel | 04:26 | |
*** udesale has joined #openstack-containers | 04:51 | |
*** vishalmanchanda has joined #openstack-containers | 05:03 | |
*** vesper11 has quit IRC | 07:16 | |
*** vesper has joined #openstack-containers | 07:16 | |
*** sapd1_x has joined #openstack-containers | 08:18 | |
*** ykarel is now known as ykarel|lunch | 08:39 | |
*** xinliang has joined #openstack-containers | 08:43 | |
*** flwang1 has joined #openstack-containers | 08:43 | |
flwang1 | brtknr: ping | 08:43 |
flwang1 | strigazi: around? | 08:47 |
*** xinliang has quit IRC | 08:48 | |
strigazi | o/ | 08:56 |
flwang1 | strigazi: before the meeting, quick question | 08:57 |
flwang1 | did you see my email about the cluster upgrade? | 08:58 |
flwang1 | strigazi: did you ever think about the upgrade issue from fedora atomic to fedora coreos? | 08:58 |
strigazi | I just saw it, not possible with the API. I tried to support it (mixing coreos and atomic) but you guys said no :) | 08:59 |
strigazi | I don't think it is wise to pursue this | 08:59 |
strigazi | We channel users to use multiple clusters and drop the old ones | 09:00 |
strigazi | Upgrade in place is more useful for CVEs | 09:00 |
strigazi | at least that is our strategy at CERN | 09:00 |
flwang1 | i see. i tried and i realized it's very hard | 09:01 |
flwang1 | #startmeeting magnum | 09:01 |
openstack | Meeting started Wed Mar 25 09:01:12 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot. | 09:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 09:01 |
*** openstack changes topic to " (Meeting topic: magnum)" | 09:01 | |
openstack | The meeting name has been set to 'magnum' | 09:01 |
flwang1 | #topic roll call | 09:01 |
*** openstack changes topic to "roll call (Meeting topic: magnum)" | 09:01 | |
flwang1 | o/ | 09:01 |
strigazi | o/ | 09:01 |
flwang1 | brtknr: ^ | 09:01 |
brtknr | o/ | 09:01 |
flwang1 | i think just us | 09:02 |
flwang1 | are you guys still safe? | 09:02 |
flwang1 | NZ will lockdown in the next 4 weeks :( | 09:02 |
strigazi | all good here | 09:02 |
brtknr | yep, only left the house to go for a run yesterday but haven't really left home for 2 weeks | 09:03 |
flwang1 | be kind and stay safe | 09:03 |
brtknr | other than to go shopping | 09:03 |
brtknr | you too! | 09:03 |
flwang1 | #topic update health status | 09:03 |
*** openstack changes topic to "update health status (Meeting topic: magnum)" | 09:03 | |
flwang1 | thanks for the good review from brtknr | 09:03 |
flwang1 | i think it's in a good shape now | 09:04 |
flwang1 | and i have proposed a PR to the magnum auto healer https://github.com/kubernetes/cloud-provider-openstack/pull/985 if you want to give it a try | 09:04 |
brtknr | i think it would still be good to try and pursue updating the reason only and letting magnum conductor infer health_status | 09:05 |
flwang1 | we (catalyst cloud) are keen to have this, because all our clusters are private and right now we can't monitor their status | 09:05 |
brtknr | otherwise there will be multiple places with logic for determining health status | 09:05 |
flwang1 | brtknr: we can, but how about doing it in a separate follow-up patch? | 09:06 |
brtknr | also why make 2 calls to the API when you can do this with one? | 09:06 |
flwang1 | i'm not feeling confident to change the internal health update logic in this patch | 09:07 |
flwang1 | strigazi: thoughts? | 09:08 |
strigazi | I'm trying to understand which two calls you are talking about | 09:08 |
brtknr | 1 api call to update health_status and another api call to health_status_reason | 09:09 |
flwang1 | brtknr: did we? | 09:09 |
strigazi | 1 call should be enough | 09:09 |
flwang1 | brtknr: i forgot the details | 09:09 |
flwang1 | I'm happy to improve it if it can be improved; my point is, i'd like to do it in a separate patch | 09:10 |
flwang1 | instead of mixing in this patch | 09:10 |
strigazi | +1 to separate patch | 09:11 |
strigazi | (gerrit should allow many patches in a change) | 09:11 |
strigazi | (but it doesn't) | 09:11 |
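A minimal sketch of the "one call" point above, assuming the patch under review lets a single cluster PATCH carry both fields (the CLI form is the standard `openstack coe cluster update`; the two attribute paths and example reason payload are the assumptions here):

    openstack coe cluster update <cluster> replace \
        health_status=HEALTHY \
        health_status_reason='{"api":"ok","worker-0.Ready":"True"}'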
flwang1 | strigazi: please help review the current one, thanks | 09:12 |
flwang1 | brtknr: are you ok with that? | 09:12 |
brtknr | also the internal poller is always setting the status to UNKNOWN | 09:12 |
brtknr | something needs to be done about that | 09:12 |
brtknr | otherwise it will be like a lottery | 09:13 |
brtknr | 50% of the time, the status will be UNKNOWN | 09:13 |
brtknr | which defeats the point of having an external updater | 09:13 |
flwang1 | brtknr: can you explain? | 09:14 |
flwang1 | it shouldn't be | 09:14 |
flwang1 | i'm happy to fix it in the following patch | 09:14 |
brtknr | when I was testing it, the CLI would update the health_status, but the poller would reset it back to UNKNOWN | 09:15 |
flwang1 | if both api and worker nodes are OK, then the health status should be healthy | 09:15 |
flwang1 | brtknr: you're mixing the two things | 09:15 |
flwang1 | are you saying there is a bug with the internal health status update logic? | 09:16 |
flwang1 | i saw your patch for the corner case and I have already +2'd that, are you saying there is another potential bug? | 09:16 |
brtknr | there is no bug currently apart from the master_lb edge case | 09:16 |
flwang1 | good | 09:17 |
brtknr | but if we want to be able to update externally, it's a race condition between the internal poller and the external update | 09:17 |
brtknr | the internal poller sets it back to UNKNOWN every polling interval? | 09:17 |
brtknr | make sense? | 09:17 |
flwang1 | yep, but after we fixed the corner case by https://review.opendev.org/#/c/714589/, then we should be good | 09:17 |
flwang1 | strigazi: can you help review https://review.opendev.org/#/c/714589/ ? | 09:18 |
flwang1 | brtknr: are you ok we improve the health_status calculation in a separate patch? | 09:18 |
strigazi | If a magnum deployment is relying on an external thing, why not disable the conductor? I will have a look | 09:19 |
flwang1 | strigazi: it's not a hard dependency | 09:20 |
strigazi | it's not, I know | 09:20 |
flwang1 | we can introduce a config if you think that's better | 09:20 |
flwang1 | a config to totally disable the internal polling for health status | 09:20 |
strigazi | I mean for someone who uses the external controller it makes sense | 09:20 |
flwang1 | right | 09:21 |
strigazi | what brtknr proposes makes sense in this patch | 09:21 |
brtknr | i am slightly uncomfortable with it because if we have the health_status calculation logic in both CPO and magnum-conductor, we need to make 2 patches if we ever want to change this logic... my argument is that we should do this in one place... we already have this logic in magnum-conductor so it makes sense to keep it there and let the magnum-auto-healer simply provide the health_status_reason | 09:21 |
brtknr | and let it work out the health_status from the reason... i'm okay with it being a separate patch but i'd like to test them together | 09:22 |
flwang1 | sure, i mean if the current patch is in good shape, we can get it in, which will make the following patch easier to test and review | 09:23 |
flwang1 | i just don't want to submit a large patch because we don't have any functional tests in the gate | 09:24 |
flwang1 | as you know, we're fully relying on our manual testing to keep up the magnum code quality | 09:25 |
flwang1 | that's why i prefer to get smaller patches in | 09:25 |
flwang1 | hopefully that makes sense for this case | 09:25 |
brtknr | ok makes sense | 09:26 |
brtknr | lets move to the next topic | 09:26 |
flwang1 | thanks, let's move on | 09:26 |
flwang1 | #topic https://review.opendev.org/#/c/714423/ - rootfs kubelet | 09:26 |
*** openstack changes topic to "https://review.opendev.org/#/c/714423/ - rootfs kubelet (Meeting topic: magnum)" | 09:26 | |
flwang1 | brtknr: ^ | 09:27 |
brtknr | ok so turns out mounting rootfs to kubelet fixes the cinder selinux issue | 09:27 |
brtknr | i tried mounting just the selinux specific things but that didn't help | 09:27 |
brtknr | selinux specific things: /sys/fs/selinux, /var/lib/selinux/, /etc/selinux | 09:28 |
strigazi | kubelet has access to the docker socket or another cri socket. The least privileged pattern made little sense here. | 09:28 |
brtknr | we mounted /rootfs to kubelet in atomic; strigazi suggested doing this ages ago but flwang and i were cautious, but we should take this | 09:29 |
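A rough sketch of the bind mount under discussion, for the podman-launched kubelet on Fedora CoreOS; the real change lives in the kubelet systemd unit fragment of the driver, and the mount options shown here are an assumption:

    podman run --name kubelet --privileged --network host --pid host \
        -v /:/rootfs:ro,rslave \
        ...   # existing kubelet mounts and arguments unchanged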
*** xinliang has joined #openstack-containers | 09:29 | |
flwang1 | brtknr: after taking this, do we still have to disable selinux? | 09:29 |
brtknr | flwang1: nope | 09:29 |
brtknr | it's up to you guys whether you want to take the selinux_mode patch | 09:30 |
brtknr | it might be useful for other things | 09:30 |
strigazi | the patch is useful | 09:30 |
flwang1 | if that's the case, i prefer to mount rootfs and still keep selinux enabled | 09:30 |
brtknr | ok :) lets take both then :P | 09:31 |
brtknr | selinux in fcos is always enabled by default | 09:31 |
flwang1 | i'm ok with that | 09:32 |
flwang1 | strigazi: ^ | 09:32 |
strigazi | of course I agree with it, optionally disabling a security feature (selinux) and giving extra access to an already super uber privileged process (kubelet) | 09:34 |
flwang1 | cool | 09:34 |
flwang1 | next topic? | 09:34 |
flwang1 | #topic https://review.opendev.org/#/c/714574/ - cluster name for network | 09:34 |
*** openstack changes topic to "https://review.opendev.org/#/c/714574/ - cluster name for network (Meeting topic: magnum)" | 09:34 | |
flwang1 | i'm happy to take this one | 09:34 |
flwang1 | private as the network name is annoying sometimes | 09:35 |
brtknr | :) | 09:35 |
brtknr | glad you agree | 09:35 |
flwang1 | strigazi: ^ | 09:36 |
flwang1 | anything else we need to discuss? | 09:36 |
strigazi | is it an issue when two clusters with the same name exist? | 09:36 |
flwang1 | not a problem | 09:37 |
strigazi | we should do the same for subnets if not there | 09:37 |
brtknr | nope, it will be the same as when there are two networks called private | 09:37 |
brtknr | subnets get their name from heat stack | 09:37 |
flwang1 | but sometimes it's not handy to find the correct network | 09:37 |
brtknr | e.g. k8s-flannel-coreos-f2mpsj3k7y6i-network-2imn745rxgzv-private_subnet-27qmm3u76ubp | 09:37 |
brtknr | so its not a problem there | 09:37 |
strigazi | ok | 09:38 |
strigazi | makes sense | 09:38 |
*** ykarel|lunch is now known as ykarel | 09:38 | |
flwang1 | anything else we should discuss? | 09:39 |
brtknr | hmm i made a few patches yesterday | 09:39 |
brtknr | https://review.opendev.org/714719 | 09:40 |
brtknr | changing repo for etcd | 09:40 |
brtknr | is that okay with you guys? | 09:40 |
brtknr | i prefer quay.io/coreos as it uses the same release tag as the etcd development repo | 09:40 |
brtknr | it annoys me that k8s.gcr.io drops the v from the release version | 09:41 |
flwang1 | building etcd system container for atomic? | 09:41 |
brtknr | also on https://github.com/etcd-io/etcd/releases, they say they use quay.io/coreos/etcd as their secondary container registry | 09:41 |
strigazi | where does the project publish their builds? We should use that one (i don't know which one it is) | 09:42 |
brtknr | i am also okay to use gcr.io/etcd-development/etcd | 09:42 |
brtknr | according to https://github.com/etcd-io/etcd/releases, they publish to gcr.io/etcd-development/etcd and quay.io/coreos/etcd officially | 09:42 |
flwang1 | i like quay.io since it's maintained by coreos | 09:43 |
brtknr | i am happy with either | 09:44 |
strigazi | I would choose the primary, but for us it doesn't matter, we mirror | 09:44 |
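For reference, a minimal mirroring flow like the one strigazi describes, assuming the quay.io image is the one picked and using a placeholder mirror URL; clusters would then point at the mirror via the container_infra_prefix label:

    docker pull quay.io/coreos/etcd:v3.4.3
    docker tag quay.io/coreos/etcd:v3.4.3 registry.example.com/magnum/etcd:v3.4.3
    docker push registry.example.com/magnum/etcd:v3.4.3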
flwang1 | agree | 09:44 |
flwang1 | brtknr: done? | 09:44 |
flwang1 | i have a question about metrics-server | 09:44 |
brtknr | okay shall i change it to primary or leave it as secondary? | 09:44 |
flwang1 | when i run 'kubectl top node', i got : | 09:45 |
flwang1 | Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io) | 09:45 |
brtknr | is your metric server running? | 09:45 |
flwang1 | yes | 09:46 |
brtknr | flwang1: do you have this patch: https://review.opendev.org/#/c/705984/ | 09:46 |
flwang1 | yes | 09:47 |
strigazi | what do the metrics-server logs say? | 09:47 |
flwang1 | http://paste.openstack.org/show/791116/ | 09:48 |
flwang1 | http://paste.openstack.org/show/791115/ | 09:48 |
flwang1 | i can't see much error from the metrics-server | 09:49 |
strigazi | which one? | 09:49 |
strigazi | 16 or 15 | 09:49 |
flwang1 | 791116 | 09:49 |
brtknr | flwang1: is this master branch? | 09:49 |
flwang1 | yes | 09:49 |
flwang1 | i tested the calico and coredns upgrade | 09:49 |
flwang1 | maybe related to the calico issue | 09:50 |
flwang1 | i will test it again with a master branch, no calico change | 09:50 |
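Some generic checks for that ServiceUnavailable error, assuming the usual metrics-server deployment name and label in kube-system; the APIService status conditions normally say why the aggregated API is unreachable:

    kubectl -n kube-system get pods -l k8s-app=metrics-server -o wide
    kubectl -n kube-system logs deployment/metrics-server
    kubectl get apiservice v1beta1.metrics.k8s.io -o yaml   # check status.conditions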
flwang1 | as for the calico patch, strigazi, i do need your help | 09:50 |
flwang1 | i think i have done everything and i can't see anything wrong, but the connection between nodes and pods doesn't work | 09:51 |
brtknr | flwang1: is this with calico plugin? | 09:51 |
brtknr | its not working for me either with calico | 09:51 |
flwang1 | ok | 09:51 |
brtknr | probably to do with pod to pod communication issue | 09:51 |
brtknr | its working with flannel | 09:52 |
flwang1 | then it should be the calico version upgrade issue | 09:52 |
strigazi | left this in gerrit too: "With IP encapsulation it works but the non-encapsulated mode is not working." | 09:52 |
*** ivve has joined #openstack-containers | 09:53 | |
brtknr | how do you enable ip encapsulation? | 09:53 |
brtknr | strigazi: | 09:53 |
flwang1 | strigazi: just to be clear, you mean 'CALICO_IPV4POOL_IPIP' == 'Always' ? | 09:54 |
strigazi | https://review.opendev.org/#/c/705599/13/magnum/drivers/common/templates/kubernetes/fragments/calico-service.sh@454 | 09:54 |
strigazi | Always | 09:54 |
strigazi | yes | 09:54 |
strigazi | Never should work though | 09:55 |
strigazi | as it used to work | 09:55 |
strigazi | when you have SDN on SDN this can happen :) | 09:55 |
strigazi | I mean packets being lost :) | 09:55 |
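The setting in question can be inspected on a running cluster; the resource names below assume the Calico v3.x defaults, and note the env var only takes effect when the default pool is first created:

    kubectl -n kube-system get ds calico-node -o yaml | grep -A1 CALICO_IPV4POOL_IPIP
    kubectl get ippools.crd.projectcalico.org default-ipv4-ippool -o yaml | grep -i ipip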
flwang1 | strigazi: should we ask the calico team for help? | 09:56 |
strigazi | yes | 09:56 |
flwang1 | and i just found it hard to debug because the toolbox doesn't work on fedora coreos | 09:56 |
strigazi | in devstack we run calico on openvswitch | 09:56 |
flwang1 | so i can't use tcpdump to check the traffic | 09:56 |
flwang1 | strigazi: did you try it on prod? | 09:57 |
flwang1 | is it working? | 09:57 |
strigazi | flwang1: come on, privileged daemonset with centos and install whatever you want :) | 09:57 |
strigazi | or add a sidecar to calico node | 09:57 |
flwang1 | strigazi: you mean just run a centos daemonset? | 09:57 |
strigazi | or exec in calico node, it is RHEL | 09:58 |
strigazi | microdnf install | 09:58 |
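For example, something along these lines to get tcpdump inside the calico-node pod for tracing (whether the package is reachable by microdnf inside that image is an assumption):

    kubectl -n kube-system exec <calico-node-pod> -- microdnf install -y tcpdump
    kubectl -n kube-system exec -it <calico-node-pod> -- tcpdump -i any -nn icmp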
flwang1 | ok, will try | 09:58 |
flwang1 | strigazi: did you try it on prod? is it working? | 09:58 |
strigazi | a sidecar is optimal | 09:59 |
strigazi | not yet, today, BUT | 09:59 |
strigazi | in prod we don't run on openvswitch | 09:59 |
strigazi | we use linux-bridge | 09:59 |
strigazi | so it may work | 09:59 |
strigazi | I will update gerrit | 09:59 |
flwang1 | pls do, at least it can help us understand the issue | 09:59 |
flwang1 | should I split the calico and coredns upgrade into 2 patches? | 10:00 |
brtknr | flwang1: probably good practice :) | 10:00 |
strigazi | as you want, it doesn't hurt | 10:00 |
flwang1 | i combined them because they're very critical services | 10:00 |
flwang1 | so i want to test them together for the conformance test | 10:01 |
brtknr | they're not dependent on each other though right? | 10:01 |
flwang1 | no dependency | 10:01 |
strigazi | they are not | 10:01 |
brtknr | have we ruled out the coredns upgrade as the cause of the regression? | 10:01 |
strigazi | if you update coredns can you make it run on master too? | 10:01 |
flwang1 | i don't think it's related to coredns | 10:02 |
strigazi | it can't be | 10:02 |
strigazi | trust but verify though | 10:02 |
flwang1 | strigazi: you mean make coredns run only on the master node? | 10:02 |
strigazi | flwang1: no, run in master as well | 10:02 |
brtknr | strigazi: why? | 10:02 |
flwang1 | ah, sure, i can do that | 10:02 |
brtknr | why run on master as well? | 10:03 |
flwang1 | brtknr: i even want to run it only on master ;) | 10:03 |
strigazi | because the user might have a stupid app that will run next to coredns and kill it | 10:03 |
flwang1 | since it's critical service | 10:03 |
strigazi | then things on master don't have DNS | 10:04 |
flwang1 | we don't want to lose it when a worker node is down as well | 10:04 |
flwang1 | let's end the meeting first | 10:04 |
brtknr | ok and I suppose we want it to run on workers too because we want the dns service to scale with the number of workers | 10:04 |
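Illustration only of the scheduling knob being discussed; the label is the usual upstream one, and the blanket toleration (which replaces any existing ones) would really belong in Magnum's coredns template rather than a live patch like this:

    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide    # where does coredns run today?
    kubectl -n kube-system patch deployment coredns --type=merge \
        -p '{"spec":{"template":{"spec":{"tolerations":[{"operator":"Exists"}]}}}}'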
flwang1 | #endmeeting | 10:05 |
*** openstack changes topic to "OpenStack Containers Team | Meeting: every Wednesday @ 9AM UTC | Agenda: https://etherpad.openstack.org/p/magnum-weekly-meeting" | 10:05 | |
openstack | Meeting ended Wed Mar 25 10:05:00 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 10:05 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/magnum/2020/magnum.2020-03-25-09.01.html | 10:05 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/magnum/2020/magnum.2020-03-25-09.01.txt | 10:05 |
openstack | Log: http://eavesdrop.openstack.org/meetings/magnum/2020/magnum.2020-03-25-09.01.log.html | 10:05 |
flwang1 | strigazi: if you have time, pls help debug the calico issue | 10:05 |
flwang1 | meanwhile, i will consult the calico team as well | 10:05 |
strigazi | does the argument about dns make sense? | 10:05 |
strigazi | flwang1: please cc me | 10:05 |
strigazi | if it is public | 10:05 |
flwang1 | i will go to the calico slack channel | 10:05 |
strigazi | github issue too? | 10:06 |
flwang1 | good idea | 10:06 |
brtknr | i think they will try and ask for cash for advice :) | 10:06 |
flwang1 | i will cc you then | 10:06 |
flwang1 | brtknr: no, they won't ;) | 10:06 |
flwang1 | i asked them before and they're nice | 10:06 |
brtknr | ok maybe just the weave people then | 10:07 |
flwang1 | alright, i have to go, guys | 10:07 |
strigazi | good night | 10:07 |
brtknr | ok sleep well! | 10:07 |
flwang1 | ttyl | 10:08 |
*** flwang1 has quit IRC | 10:08 | |
*** rcernin has quit IRC | 10:17 | |
*** trident has quit IRC | 10:29 | |
*** trident has joined #openstack-containers | 10:31 | |
*** trident has quit IRC | 10:33 | |
*** xinliang has quit IRC | 10:36 | |
*** trident has joined #openstack-containers | 10:37 | |
*** sapd1_x has quit IRC | 11:06 | |
*** yolanda has quit IRC | 11:27 | |
*** markguz_ has quit IRC | 11:31 | |
*** yolanda has joined #openstack-containers | 11:32 | |
*** udesale_ has joined #openstack-containers | 12:42 | |
*** udesale has quit IRC | 12:45 | |
*** ramishra has quit IRC | 12:47 | |
*** ramishra has joined #openstack-containers | 12:58 | |
*** sapd1_x has joined #openstack-containers | 13:38 | |
*** sapd1_x has quit IRC | 14:55 | |
*** udesale_ has quit IRC | 14:57 | |
*** ykarel is now known as ykarel|away | 15:25 | |
brtknr | cosmicsound: are you using etcd tag 3.4.3? | 15:38 |
brtknr | you need to override it when using coreos | 15:38 |
brtknr | i think | 15:39 |
brtknr | check that etcd is running on the master | 15:39 |
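A couple of hedged checks for that, assuming the Fedora CoreOS driver runs etcd as a podman container under a systemd unit of the same name (the unit/container names are assumptions), plus the label used to pin the tag:

    ssh core@<master-ip> 'sudo podman ps --filter name=etcd'
    ssh core@<master-ip> 'sudo systemctl status etcd --no-pager'
    # when creating the template: --labels etcd_tag=v3.4.3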
*** sapd1 has joined #openstack-containers | 15:41 | |
*** sapd1 has quit IRC | 16:24 | |
openstackgerrit | Bharat Kunwar proposed openstack/magnum master: Build new autoscaler containers https://review.opendev.org/714986 | 16:27 |
*** tobias-urdin has joined #openstack-containers | 18:01 | |
tobias-urdin | quick question, if anybody knows: i deployed a kubernetes v1.15.7 cluster with magnum | 18:11 |
tobias-urdin | and using k8scloudprovider/openstack-cloud-controller-manager:v1.15.0 | 18:11 |
tobias-urdin | is the openstack ccm v1.15.0 supposed to work with v1.15.7? | 18:11 |
tobias-urdin | lxkong: maybe knows? ^ | 18:12 |
tobias-urdin | fails on: | 18:12 |
tobias-urdin | kubectl create -f /srv/magnum/kubernetes/openstack-cloud-controller-manager.yaml | 18:12 |
tobias-urdin | error: SchemaError(io.k8s.api.autoscaling.v2beta1.ExternalMetricSource): invalid object doesn't have additional properties | 18:12 |
tobias-urdin | https://github.com/openstack/magnum/blob/master/magnum/drivers/common/templates/kubernetes/fragments/kube-apiserver-to-kubelet-role.sh#L154 | 18:12 |
tobias-urdin | can reproduce; the error message doesn't help to point out anything specific in the yaml file, so probably an incompatibility issue | 18:12 |
tobias-urdin | i will try to respawn the cluster with v1.15.0 instead, maybe openstack-cloud-controller-manager needs to release new versions to support stable v1.15 | 18:13 |
*** irclogbot_1 has quit IRC | 18:37 | |
tobias-urdin | with k8s v1.15.0 error: SchemaError(io.k8s.api.node.v1alpha1.RuntimeClassSpec): invalid object doesn't have additional properties | 18:48 |
*** irclogbot_0 has joined #openstack-containers | 19:01 | |
NobodyCam | Good Morning Folks; I am attempting to deploy a v1.15.9 kubernetes cluster with Rocky and am having some issues. | 19:06 |
NobodyCam | "kube_cluster_deploy" ends up timing out. are there tricks to get this working... I.e. setting calico tags differently? "kube_tag=v1.15.9,tiller_enabled=True,availability_zone=nova,calico_tag=v2.6.12,calico_cni_tag=v1.11.8,calico_kube_controllers_tag=v1.0.5,heat_container_agent_tag=rawhide" | 19:08 |
*** irclogbot_0 has quit IRC | 19:37 | |
*** irclogbot_2 has joined #openstack-containers | 19:40 | |
*** irclogbot_2 has quit IRC | 19:42 | |
*** irclogbot_1 has joined #openstack-containers | 19:45 | |
*** irclogbot_1 has quit IRC | 20:00 | |
*** irclogbot_3 has joined #openstack-containers | 20:03 | |
*** irclogbot_3 has quit IRC | 20:12 | |
*** irclogbot_2 has joined #openstack-containers | 20:15 | |
*** irclogbot_2 has quit IRC | 20:16 | |
tobias-urdin | the issue seems to be the kubectl version in the heat-container-agent; if i copy the file openstack-cloud-controller-manager.yaml to my computer and run it from there, it works | 20:17 |
tobias-urdin | /var/lib/containers/atomic/heat-container-agent.0/rootfs/usr/bin/kubectl version | 20:18 |
tobias-urdin | Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"archive", BuildDate:"2018-07-25T11:20:04Z", GoVersion:"go1.11beta2", Compiler:"gc", Platform:"linux/amd64"} | 20:18 |
tobias-urdin | and locally | 20:18 |
tobias-urdin | $kubectl version | 20:18 |
tobias-urdin | Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:42:56Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"} | 20:18 |
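A common workaround for that kind of SchemaError from an old client-side validator is to skip client validation (a standard kubectl flag); the longer-term fix is a heat-container-agent image that ships a newer kubectl, as discussed later:

    kubectl create --validate=false -f /srv/magnum/kubernetes/openstack-cloud-controller-manager.yaml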
*** irclogbot_0 has joined #openstack-containers | 20:21 | |
NobodyCam | https://www.irccloud.com/pastebin/JjeXp3Pk/ | 20:42 |
NobodyCam | I end up with : | 20:44 |
NobodyCam | cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d | 20:44 |
NobodyCam | kubelet.go:2173] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized | 20:44 |
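Some generic first checks when the CNI config never appears, assuming the Fedora Atomic driver where the heat agent runs under a heat-container-agent systemd unit (the names here are assumptions):

    kubectl -n kube-system get pods -o wide                  # did the calico/flannel daemonset pod start on that node?
    ssh <minion-ip> 'sudo ls /etc/cni/net.d/'                # stays empty until the network plugin writes its config
    ssh <minion-ip> 'sudo journalctl -u heat-container-agent --no-pager | tail -n 50'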
tobias-urdin | NobodyCam: sorry was following up on my issue before yours, not related | 20:44 |
NobodyCam | :) All Good! | 20:45 |
NobodyCam | I was following your issue because on the surface it seemed close to what I was seeing locally | 20:46 |
brtknr | tobias-urdin: which version of magnum are you running? | 21:02 |
brtknr | NobodyCam: I wouldn’t change the default calico and calico_cni_tag | 21:04 |
brtknr | also i think the latest kube_tag supported in rocky is v1.15.11 | 21:04 |
NobodyCam | brtknr: Thank you :) I have attempted with the Defaults too | 21:08 |
tobias-urdin | brtknr: 7.2.0 rocky release | 21:26 |
*** flwang1 has joined #openstack-containers | 22:11 | |
flwang1 | brtknr: ping | 22:11 |
brtknr | flwang1: pong | 22:17 |
brtknr | i was waiting for u! | 22:17 |
flwang1 | brtknr: :) | 22:17 |
flwang1 | brtknr: i'm reviewing the logic of _poll_health_status | 22:18 |
flwang1 | i don't really understand why you said 2 api calls to get the health status | 22:18 |
flwang1 | brtknr: are you still there? | 22:21 |
*** Jeffrey4l has quit IRC | 22:27 | |
brtknr | flwang1: yes sorry i'm trying to get my dsvm up again after the calico cluster was repeatedly failing on the clean master branch | 22:28 |
brtknr | flwang1: ok i take back that we dont need two api calls | 22:29 |
*** Jeffrey4l has joined #openstack-containers | 22:29 | |
brtknr | i didn't realise that it was possible to make multiple updates in a single api call | 22:29 |
brtknr | that said, i am still not a great fan of the logic for determining health status living in the magnum-auto-healer :( | 22:30 |
brtknr | i feel like the easiest time to make this change is now, it will only become harder to change this once it is merged | 22:31 |
flwang1 | so you mean totally disallowing updates to the health_status field? | 22:32 |
flwang1 | my point is, the health_status_reason is really a dict/json, and it could be anything inside, depending on how the cloud provider would like to leverage it | 22:33 |
flwang1 | for example, i'm trying to put the 'updated_at' into the health_status_reason, so that the 3rd monitor code can get more information from there | 22:34 |
flwang1 | if we totally limit the format of the health_status_reason, then we will lose the flexibility | 22:35 |
flwang1 | brtknr: ^ | 22:35 |
*** vishalmanchanda has quit IRC | 22:35 | |
brtknr | flwang1: yes, i mean prevent update of health_status field | 22:36 |
flwang1 | I can see your point, we get a bit of benefit but meanwhile we will lose a lot of flexibility | 22:36 |
brtknr | and let the magnum-conductor work it out | 22:36 |
flwang1 | brtknr: another thing is | 22:37 |
flwang1 | our current heath_status_reason is quite simple, as you can see, now it only get the Ready condition | 22:37 |
flwang1 | if you put more information in, say supporting more conditions of the nodes and master, then the calculation could be a mess | 22:38 |
flwang1 | which i don't think magnum should control that much | 22:38 |
flwang1 | in other words, when I designed this | 22:39 |
flwang1 | the main thing the cloud provider admin cares about is the health status, and the health_status_reason is just a reference, not the other way around | 22:39 |
flwang1 | brtknr: TBH, i don't want to maintain such logic in magnum | 22:40 |
flwang1 | magnum is a platform; as long as we open these 2 fields for the cloud admin, we want to grant flexibility instead of limiting it | 22:41 |
brtknr | ok fine i see your argument | 22:42 |
NobodyCam | I am able to deploy up to v1.13.11 on my rocky install | 22:42 |
brtknr | if we have the option to disable the polling from magnum side, i would be happy with that solution | 22:42 |
NobodyCam | 1.14.x and above fail | 22:42 |
flwang1 | brtknr: you mean totally disable it? for that case, we probably have to introduce a config | 22:43 |
flwang1 | but actually, it's really a cluster-by-cluster setting | 22:43 |
NobodyCam | https://wiki.openstack.org/wiki/Magnum#Compatibility_Matrix says 1.15.X should work? | 22:44 |
flwang1 | i don't think totally disable it is a good idea, TBH | 22:44 |
brtknr | flwang1: it doesn't make sense to have the internal poller and magnum auto healer stepping on each other's toes | 22:45 |
brtknr | NobodyCam: I got 1.15.x working when i last tested rocky | 22:45 |
brtknr | i probably had to use heat_container_agent_tag=train-stable | 22:46 |
brtknr | i cant remember 100% | 22:46 |
NobodyCam | nice! I'm still working on it... | 22:46 |
NobodyCam | oh Thank you I can try that | 22:46 |
flwang1 | brtknr: as i mentioned above, some clusters may be public with no auto healer running on them, and some clusters may be private with the auto healer running on them | 22:46 |
brtknr | NobodyCam: actually try train-stable-2 | 22:46 |
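Putting brtknr's suggestions together, an illustrative Rocky-era template (the label names are real Magnum labels; the image, flavors and network names are placeholders):

    openstack coe cluster template create k8s-v1-15 \
        --coe kubernetes --image <fedora-atomic-image> --external-network <ext-net> \
        --master-flavor <flavor> --flavor <flavor> --network-driver calico \
        --labels kube_tag=v1.15.11,heat_container_agent_tag=train-stable-2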
brtknr | flwang1: i meant the option to disable it for each cluster | 22:47 |
brtknr | e.g. if auto healer is running | 22:47 |
brtknr | could disable automatically if auto healer is running | 22:48 |
flwang1 | you mean checking the magnum-auto-healer when doing the accessible validation? | 22:48 |
flwang1 | or a separate function to disable it? | 22:48 |
flwang1 | no problem, i can do that | 22:49 |
brtknr | something that stops the internal poller and magnum auto healer fighting like cats and dogs | 22:51 |
flwang1 | sure, i will fix it in next ps | 22:54 |
flwang1 | thank you for your review | 22:54 |
flwang1 | and glad to see we're on the same page now | 22:54 |
brtknr | flwang1: :) | 22:55 |
brtknr | flwang1: btw is calico working for you on master branch | 22:55 |
brtknr | without your patch | 22:55 |
flwang1 | brtknr: i didn't try that yet TBH | 22:55 |
flwang1 | but it works well on our prod | 22:56 |
brtknr | hmmm | 22:56 |
*** rcernin has joined #openstack-containers | 22:56 | |
brtknr | it appears to be working on stable/train but broken on master | 22:57 |
brtknr | e.g. the cluster-autoscaler cannot reach keystone for auth | 22:57 |
brtknr | same with cinder-csi-plugin | 22:57 |
brtknr | otherwise all reports healthy | 22:57 |
flwang1 | brtknr: try to open port 179 on the master | 22:59 |
brtknr | what? manually? | 23:00 |
brtknr | flwang1: but its not an issue on stable/train branch | 23:05 |
brtknr | only on master | 23:05 |
flwang1 | ok, then i'm not sure, probably a regression issue | 23:05 |
brtknr | flwang1: hmm looks like the regression might be caused by your patch changing the default calico_ipv4pool | 23:18 |
flwang1 | brtknr: no way | 23:19 |
flwang1 | it's impossible :) | 23:19 |
brtknr | flwang1: yes way! | 23:20 |
flwang1 | how? can you pls show me? | 23:20 |
brtknr | when i revert the change, it works | 23:20 |
flwang1 | then you should check your local settings | 23:21 |
flwang1 | are you using 10.100.x.x for your local vm network? | 23:21 |
brtknr | magnum/drivers/heat/k8s_coreos_template_def.py:58: cluster.labels.get('calico_ipv4pool', '192.168.0.0/16') | 23:21 |
brtknr | magnum/drivers/heat/k8s_fedora_template_def.py:58: cluster.labels.get('calico_ipv4pool', '192.168.0.0/16') | 23:21 |
flwang1 | shit, my bad | 23:22 |
flwang1 | brtknr: i will submit a fix soon | 23:23 |
brtknr | flwang1: mind if i propose? | 23:24 |
flwang1 | i will submit the patch in 5 secs | 23:25 |
brtknr | ok | 23:27 |
brtknr | we should remove the defaults from kubecluster.yaml | 23:28 |
brtknr | especially if pod_network_cidr depends on it | 23:28 |
brtknr | i wonder if this will fix calico upgrade? | 23:28 |
openstackgerrit | Feilong Wang proposed openstack/magnum master: Fix calico regression issue caused by default ipv4pool change https://review.opendev.org/715093 | 23:28 |
flwang1 | let's see | 23:29 |
flwang1 | i will test the calico upgrade again with this one | 23:29 |
flwang1 | brtknr: https://review.opendev.org/715093 | 23:29 |
flwang1 | brtknr: i'm sorry for the regression issue :( | 23:31 |
flwang1 | and my stupid confidence :D | 23:31 |
brtknr | hey, we approved it so partly our fault too | 23:31 |
brtknr | we should remove those defaults from kubecluster.yaml if they are never used | 23:32 |
flwang1 | brtknr: or remove the defaults from the python code, thoughts? | 23:32 |
flwang1 | let's get this one in, and you can work on how to handle the default value? | 23:33 |
brtknr | hmm makes more sense to handle it in python code though | 23:34 |
brtknr | since pod_network_cidr has to match flannel_network_cidr or calico_ipv4pool | 23:35 |
brtknr | i'm sure this logic would be far more complicated in a heat template | 23:35 |
flwang1 | ok, but anyway, let's do that in a separate patch | 23:45 |
brtknr | flwang1: but do you agree with what i said? that the logic would be more complicated to implement in heat template? | 23:57 |