openstackgerrit | Feilong Wang proposed openstack/magnum master: [k8s] Upgrade calico/coredns to the latest stable version https://review.opendev.org/705599 | 02:06 |
*** xinliang has joined #openstack-containers | 03:28 | |
*** xinliang has quit IRC | 03:51 | |
*** flwang1 has quit IRC | 04:07 | |
*** ykarel|away is now known as ykarel | 04:26 | |
*** udesale has joined #openstack-containers | 04:51 | |
*** vishalmanchanda has joined #openstack-containers | 05:03 | |
*** vesper11 has quit IRC | 07:16 | |
*** vesper has joined #openstack-containers | 07:16 | |
*** sapd1_x has joined #openstack-containers | 08:18 | |
*** ykarel is now known as ykarel|lunch | 08:39 | |
*** xinliang has joined #openstack-containers | 08:43 | |
*** flwang1 has joined #openstack-containers | 08:43 | |
flwang1 | brtknr: ping | 08:43 |
flwang1 | strigazi: around? | 08:47 |
*** xinliang has quit IRC | 08:48 | |
strigazi | o/ | 08:56 |
flwang1 | strigazi: before the meeting, quick question | 08:57 |
flwang1 | did you see my email about the cluster upgrade? | 08:58 |
flwang1 | strigazi: did you ever think about the upgrade issue from fedora atomic to fedora coreos? | 08:58 |
strigazi | I just saw it, not possible with the API. I tried to support it (mixing coreos and atomic) but you guys said no :) | 08:59 |
strigazi | I don't think it is wise to pursue this | 08:59 |
strigazi | We channel users to use multiple clusters and drop the old ones | 09:00 |
strigazi | Upgrade in place is more useful for CVEs | 09:00 |
strigazi | at least that is our strategy at CERN | 09:00 |
flwang1 | i see. i tried and i realized it's very hard | 09:01 |
flwang1 | #startmeeting magnum | 09:01 |
openstack | Meeting started Wed Mar 25 09:01:12 2020 UTC and is due to finish in 60 minutes. The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot. | 09:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 09:01 |
*** openstack changes topic to " (Meeting topic: magnum)" | 09:01 | |
openstack | The meeting name has been set to 'magnum' | 09:01 |
flwang1 | #topic roll call | 09:01 |
*** openstack changes topic to "roll call (Meeting topic: magnum)" | 09:01 | |
flwang1 | o/ | 09:01 |
strigazi | o/ | 09:01 |
flwang1 | brtknr: ^ | 09:01 |
brtknr | o/ | 09:01 |
flwang1 | i think just us | 09:02 |
flwang1 | are you guys still safe? | 09:02 |
flwang1 | NZ will lockdown in the next 4 weeks :( | 09:02 |
strigazi | all good here | 09:02 |
brtknr | yep, only left the house to go for a run yesterday but haven't really left home for 2 weeks | 09:03 |
flwang1 | be kind and stay safe | 09:03 |
brtknr | other than to go shopping | 09:03 |
brtknr | you too! | 09:03 |
flwang1 | #topic update health status | 09:03 |
*** openstack changes topic to "update health status (Meeting topic: magnum)" | 09:03 | |
flwang1 | thanks for the good review from brtknr | 09:03 |
flwang1 | i think it's in a good shape now | 09:04 |
flwang1 | and i have proposed a PR to the magnum auto healer https://github.com/kubernetes/cloud-provider-openstack/pull/985 if you want to give it a try | 09:04 |
brtknr | i think it would still be good to try and pursue updating the reason only and letting magnum conductor infer health_status | 09:05 |
flwang1 | we (catalyst cloud) are keen to have this, because all our clusters are private and right now we can't monitor their status | 09:05 |
brtknr | otherwise there will be multiple places with logic for determining health status | 09:05 |
flwang1 | brtknr: we can, but how about doing it in a separate follow-up patch? | 09:06 |
brtknr | also why make 2 calls to the API when you can do this with one? | 09:06 |
flwang1 | i'm not feeling confident to change the internal health update logic in this patch | 09:07 |
flwang1 | strigazi: thoughts? | 09:08 |
strigazi | I'm trying to understand which two calls you are talking about | 09:08 |
brtknr | 1 api call to update health_status and another api call to health_status_reason | 09:09 |
flwang1 | brtknr: did we? | 09:09 |
strigazi | 1 call should be enough | 09:09 |
flwang1 | brtknr: i forgot the details | 09:09 |
flwang1 | I'm happy to improve it if it can be improved; my point is, i'd like to do it in a separate patch | 09:10 |
flwang1 | instead of mixing in this patch | 09:10 |
strigazi | +1 to separate patch | 09:11 |
strigazi | (gerrit should allow many patches in a change) | 09:11 |
strigazi | (but it doesn't) | 09:11 |
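A minimal sketch of the "one call" point above, assuming the patch under review lets a single cluster PATCH carry both fields (the CLI form is the standard `openstack coe cluster update`; the two attribute paths and example reason payload are the assumptions here):

    openstack coe cluster update <cluster> replace \
        health_status=HEALTHY \
        health_status_reason='{"api":"ok","worker-0.Ready":"True"}'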
flwang1 | strigazi: please help review the current one, thanks | 09:12 |
flwang1 | brtknr: are you ok with that? | 09:12 |
brtknr | also the internal poller is always setting the status to UNKNOWN | 09:12 |
brtknr | something needs to be done about that | 09:12 |
brtknr | otherwise it will be like a lottery | 09:13 |
brtknr | 50% of the time, the status will be UNKNOWN | 09:13 |
brtknr | which defeats the point of having an external updater | 09:13 |
flwang1 | brtknr: can you explain? | 09:14 |
flwang1 | it shouldn't be | 09:14 |
flwang1 | i'm happy to fix it in the following patch | 09:14 |
brtknr | when I was testing it, the CLI would update the health_status, but the poller would reset it back to UNKNOWN | 09:15 |
flwang1 | if both api and worker nodes are OK, then the health status should be healthy | 09:15 |
flwang1 | brtknr: you're mixing the two things | 09:15 |
flwang1 | are you saying there is a bug with the internal health status update logic? | 09:16 |
flwang1 | i saw your patch for the corner case and I have already +2'd that, are you saying there is another potential bug? | 09:16 |
brtknr | there is no bug currently apart from the master_lb edge case | 09:16 |
flwang1 | good | 09:17 |
brtknr | but if we want to be able to update externally, it's a race condition between the internal poller and the external update | 09:17 |
brtknr | the internal poller sets it back to UNKNOWN every polling interval? | 09:17 |
brtknr | make sense? | 09:17 |
flwang1 | yep, but after we fixed the corner case by https://review.opendev.org/#/c/714589/, then we should be good | 09:17 |
flwang1 | strigazi: can you help review https://review.opendev.org/#/c/714589/ ? | 09:18 |
flwang1 | brtknr: are you ok we improve the health_status calculation in a separate patch? | 09:18 |
strigazi | If a magnum deployment is relying on an external thing, why not disable the conductor? I will have a look | 09:19 |
flwang1 | strigazi: it's not a hard dependency | 09:20 |
strigazi | it's not, I know | 09:20 |
flwang1 | we can introduce a config if you think that's better | 09:20 |
flwang1 | a config to totally disable the internal polling for health status | 09:20 |
strigazi | I mean for someone who uses the external controller it makes sense | 09:20 |
flwang1 | right | 09:21 |
strigazi | what brtknr proposes makes sense in this patch | 09:21 |
brtknr | i am slightly uncomfortable with it because if we have the health_status calculation logic in both CPO and magnum-conductor, we need to make 2 patches if we ever want to change this logic... my argument is that we should do this in one place... we already have this logic in magnum-conductor so it makes sense to keep it there and let the magnum-auto-healer simply provide the health_status_reason | 09:21 |
brtknr | and let it work out the health_status from the reason... i'm okay with it being a separate patch but i'd like to test them together | 09:22 |
flwang1 | sure, i mean if the current patch is in good shape, we can get it in, which will make the following patch easier to test and review | 09:23 |
flwang1 | i just don't want to submit a large patch because we don't have any functional tests in the gate | 09:24 |
flwang1 | as you know, we're fully relying on our manual testing to keep up the magnum code quality | 09:25 |
flwang1 | that's why i prefer to get smaller patches in | 09:25 |
flwang1 | hopefully that makes sense for this case | 09:25 |
brtknr | ok makes sense | 09:26 |
brtknr | lets move to the next topic | 09:26 |
flwang1 | thanks, let's move on | 09:26 |
flwang1 | #topic https://review.opendev.org/#/c/714423/ - rootfs kubelet | 09:26 |
*** openstack changes topic to "https://review.opendev.org/#/c/714423/ - rootfs kubelet (Meeting topic: magnum)" | 09:26 | |
flwang1 | brtknr: ^ | 09:27 |
brtknr | ok so turns out mounting rootfs to kubelet fixes the cinder selinux issue | 09:27 |
brtknr | i tried mounting just the selinux specific things but that didn't help | 09:27 |
brtknr | selinux specific things: /sys/fs/selinux, /var/lib/selinux/, /etc/selinux | 09:28 |
strigazi | kubelet has access to the docker socket or another cri socket. The least privileged pattern made little sense here. | 09:28 |
brtknr | we mounted /rootfs to kubelet in atomic; strigazi suggested doing this ages ago but flwang and i were cautious, but we should take this | 09:29 |
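A rough sketch of the bind mount under discussion, for the podman-launched kubelet on Fedora CoreOS; the real change lives in the kubelet systemd unit fragment of the driver, and the mount options shown here are an assumption:

    podman run --name kubelet --privileged --network host --pid host \
        -v /:/rootfs:ro,rslave \
        ...   # existing kubelet mounts and arguments unchanged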
*** xinliang has joined #openstack-containers | 09:29 | |
flwang1 | brtknr: after taking this, do we still have to disable selinux? | 09:29 |
brtknr | flwang1: nope | 09:29 |
brtknr | it's up to you guys whether you want to take the selinux_mode patch | 09:30 |
brtknr | it might be useful for other things | 09:30 |
strigazi | the patch is useful | 09:30 |
flwang1 | if that's the case, i prefer to mount rootfs and still keep selinux enabled | 09:30 |
brtknr | ok :) lets take both then :P | 09:31 |
brtknr | selinux in fcos is always enabled by default | 09:31 |
flwang1 | i'm ok with that | 09:32 |
flwang1 | strigazi: ^ | 09:32 |
strigazi | of course I agree with it, optionally disabling a security feature (selinux) and giving extra access to an already super uber privileged process (kubelet) | 09:34 |
flwang1 | cool | 09:34 |
flwang1 | next topic? | 09:34 |
flwang1 | #topic https://review.opendev.org/#/c/714574/ - cluster name for network | 09:34 |
*** openstack changes topic to "https://review.opendev.org/#/c/714574/ - cluster name for network (Meeting topic: magnum)" | 09:34 | |
flwang1 | i'm happy to take this one | 09:34 |
flwang1 | private as the network name is annoying sometimes | 09:35 |
brtknr | :) | 09:35 |
brtknr | glad you agree | 09:35 |
flwang1 | strigazi: ^ | 09:36 |
flwang1 | anything else we need to discuss? | 09:36 |
strigazi | is it an issue when two clusters with the same name exist? | 09:36 |
flwang1 | not a problem | 09:37 |
strigazi | we should do the same for subnets if not there | 09:37 |
brtknr | nope, it will be the same as when there are two networks called private | 09:37 |
brtknr | subnets get their name from heat stack | 09:37 |
flwang1 | but sometimes it's not handy to find the correct network | 09:37 |
brtknr | e.g. k8s-flannel-coreos-f2mpsj3k7y6i-network-2imn745rxgzv-private_subnet-27qmm3u76ubp | 09:37 |
brtknr | so its not a problem there | 09:37 |
strigazi | ok | 09:38 |
strigazi | makes sense | 09:38 |
*** ykarel|lunch is now known as ykarel | 09:38 | |
flwang1 | anything else we should discuss? | 09:39 |
brtknr | hmm i made a few patches yesterday | 09:39 |
brtknr | https://review.opendev.org/714719 | 09:40 |
brtknr | changing repo for etcd | 09:40 |
brtknr | is that okay with you guys? | 09:40 |
brtknr | i prefer quay.io/coreos as it uses the same release tag as the etcd development repo | 09:40 |
brtknr | it annoys me that k8s.gcr.io drops the v from the release version | 09:41 |
flwang1 | building etcd system container for atomic? | 09:41 |
brtknr | also on https://github.com/etcd-io/etcd/releases, they say they use quay.io/coreos/etcd as their secondary container registry | 09:41 |
strigazi | where does the project publish their builds? We should use that one (i don't know which one it is) | 09:42 |
brtknr | i am also okay to use gcr.io/etcd-development/etcd | 09:42 |
brtknr | according to https://github.com/etcd-io/etcd/releases, they publish to gcr.io/etcd-development/etcd and quay.io/coreos/etcd officially | 09:42 |
flwang1 | i like quay.io since it's maintained by coreos | 09:43 |
brtknr | i am happy with either | 09:44 |
strigazi | I would choose the primary, but for us it doesn't matter, we mirror | 09:44 |
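For reference, a minimal mirroring flow like the one strigazi describes, assuming the quay.io image is the one picked and using a placeholder mirror URL; clusters would then point at the mirror via the container_infra_prefix label:

    docker pull quay.io/coreos/etcd:v3.4.3
    docker tag quay.io/coreos/etcd:v3.4.3 registry.example.com/magnum/etcd:v3.4.3
    docker push registry.example.com/magnum/etcd:v3.4.3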
flwang1 | agree | 09:44 |
flwang1 | brtknr: done? | 09:44 |
flwang1 | i have a question about metrics-server | 09:44 |
brtknr | okay shall i change it to primary or leave it as secondary? | 09:44 |
flwang1 | when i run 'kubectl top node', i got : | 09:45 |
flwang1 | Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io) | 09:45 |
brtknr | is your metric server running? | 09:45 |
flwang1 | yes | 09:46 |
brtknr | flwang1: do you have this patch: https://review.opendev.org/#/c/705984/ | 09:46 |
flwang1 | yes | 09:47 |
strigazi | what do the metrics-server logs say? | 09:47 |
flwang1 | http://paste.openstack.org/show/791116/ | 09:48 |
flwang1 | http://paste.openstack.org/show/791115/ | 09:48 |
flwang1 | i can't see much error from the metrics-server | 09:49 |
strigazi | which one? | 09:49 |
strigazi | 16 or 15 | 09:49 |
flwang1 | 791116 | 09:49 |
brtknr | flwang1: is this master branch? | 09:49 |
flwang1 | yes | 09:49 |
flwang1 | i tested the calico and coredns upgrade | 09:49 |
flwang1 | maybe related to the calico issue | 09:50 |
flwang1 | i will test it again with a master branch, no calico change | 09:50 |
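Some generic checks for that ServiceUnavailable error, assuming the usual metrics-server deployment name and label in kube-system; the APIService status conditions normally say why the aggregated API is unreachable:

    kubectl -n kube-system get pods -l k8s-app=metrics-server -o wide
    kubectl -n kube-system logs deployment/metrics-server
    kubectl get apiservice v1beta1.metrics.k8s.io -o yaml   # check status.conditions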
flwang1 | as for the calico patch, strigazi, i do need your help | 09:50 |
flwang1 | i think i have done everything and i can't see anything wrong, but the connection between nodes and pods doesn't work | 09:51 |
brtknr | flwang1: is this with calico plugin? | 09:51 |
brtknr | its not working for me either with calico | 09:51 |
flwang1 | ok | 09:51 |
brtknr | probably to do with pod to pod communication issue | 09:51 |
brtknr | its working with flannel | 09:52 |
flwang1 | then it should be the calico version upgrade issue | 09:52 |
strigazi | left this in gerrit too: "With IP encapsulation it works but the non-encapsulated mode is not working." | 09:52 |
*** ivve has joined #openstack-containers | 09:53 | |
brtknr | how do you enable ip encapsulation? | 09:53 |
brtknr | strigazi: | 09:53 |
flwang1 | strigazi: just to be clear, you mean 'CALICO_IPV4POOL_IPIP' == 'Always' ? | 09:54 |
strigazi | https://review.opendev.org/#/c/705599/13/magnum/drivers/common/templates/kubernetes/fragments/calico-service.sh@454 | 09:54 |
strigazi | Always | 09:54 |
strigazi | yes | 09:54 |
strigazi | Never should work though | 09:55 |
strigazi | as it used to work | 09:55 |
strigazi | when you have SDN on SDN this can happen :) | 09:55 |
strigazi | I mean packets being lost :) | 09:55 |
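The setting in question can be inspected on a running cluster; the resource names below assume the Calico v3.x defaults, and note the env var only takes effect when the default pool is first created:

    kubectl -n kube-system get ds calico-node -o yaml | grep -A1 CALICO_IPV4POOL_IPIP
    kubectl get ippools.crd.projectcalico.org default-ipv4-ippool -o yaml | grep -i ipip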
flwang1 | strigazi: should we ask the calico team for help? | 09:56 |
strigazi | yes | 09:56 |
flwang1 | and i just found it hard to debug because the toolbox doesn't work on fedora coreos | 09:56 |
strigazi | in devstack we run calico on openvswitch | 09:56 |
flwang1 | so i can't use tcpdump to check the traffic | 09:56 |
flwang1 | strigazi: did you try it on prod? | 09:57 |
flwang1 | is it working? | 09:57 |
strigazi | flwang1: come on, privileged daemonset with centos and install whatever you want :) | 09:57 |
strigazi | or add a sidecar to calico node | 09:57 |
flwang1 | strigazi: you mean just run a centos daemonset? | 09:57 |
strigazi | or exec in calico node, it is RHEL | 09:58 |
strigazi | microdnf install | 09:58 |
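For example, something along these lines to get tcpdump inside the calico-node pod for tracing (whether the package is reachable by microdnf inside that image is an assumption):

    kubectl -n kube-system exec <calico-node-pod> -- microdnf install -y tcpdump
    kubectl -n kube-system exec -it <calico-node-pod> -- tcpdump -i any -nn icmp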
flwang1 | ok, will try | 09:58 |
flwang1 | strigazi: did you try it on prod? is it working? | 09:58 |
strigazi | a sidecar is optimal | 09:59 |
strigazi | not yet, today, BUT | 09:59 |
strigazi | in prod we don't run on openvswitch | 09:59 |
strigazi | we use linux-bridge | 09:59 |
strigazi | so it may work | 09:59 |
strigazi | I will update gerrit | 09:59 |
flwang1 | pls do, at least it can help us understand the issue | 09:59 |
flwang1 | should I split the calico and coredns upgrade into 2 patches? | 10:00 |
brtknr | flwang1: probably good practice :) | 10:00 |
strigazi | as you want, it doesn't hurt | 10:00 |
flwang1 | i combined them because they're very critical services | 10:00 |
flwang1 | so i want to test them together for the conformance test | 10:01 |
brtknr | they're not dependent on each other though right? | 10:01 |
flwang1 | no dependency | 10:01 |
strigazi | they are not | 10:01 |
brtknr | have we ruled out the coredns upgrade as the cause of the regression? | 10:01 |
strigazi | if you update coredns can you make it run on master too? | 10:01 |
flwang1 | i don't think it's related to coredns | 10:02 |
strigazi | it can't be | 10:02 |
strigazi | trust but verify though | 10:02 |
flwang1 | strigazi: you mean make coredns run only on the master node? | 10:02 |
strigazi | flwang1: no, run in master as well | 10:02 |
brtknr | strigazi: why? | 10:02 |
flwang1 | ah, sure, i can do that | 10:02 |
brtknr | why run on master as well? | 10:03 |
flwang1 | brtknr: i even want to run it only on master ;) | 10:03 |
strigazi | because the user might have a stupid app that will run next to coredns and kill it | 10:03 |
flwang1 | since it's critical service | 10:03 |
strigazi | then things on master don't have DNS | 10:04 |
flwang1 | we don't want to lose it when a worker node is down as well | 10:04 |
flwang1 | let's end the meeting first | 10:04 |
brtknr | ok and I suppose we want it to run on workers too because we want the dns service to scale with the number of workers | 10:04 |
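Illustration only of the scheduling knob being discussed; the label is the usual upstream one, and the blanket toleration (which replaces any existing ones) would really belong in Magnum's coredns template rather than a live patch like this:

    kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide    # where does coredns run today?
    kubectl -n kube-system patch deployment coredns --type=merge \
        -p '{"spec":{"template":{"spec":{"tolerations":[{"operator":"Exists"}]}}}}'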
flwang1 | #endmeeting | 10:05 |
*** openstack changes topic to "OpenStack Containers Team | Meeting: every Wednesday @ 9AM UTC | Agenda: https://etherpad.openstack.org/p/magnum-weekly-meeting" | 10:05 | |
openstack | Meeting ended Wed Mar 25 10:05:00 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 10:05 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/magnum/2020/magnum.2020-03-25-09.01.html | 10:05 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/magnum/2020/magnum.2020-03-25-09.01.txt | 10:05 |
openstack | Log: http://eavesdrop.openstack.org/meetings/magnum/2020/magnum.2020-03-25-09.01.log.html | 10:05 |
flwang1 | strigazi: if you have time, pls help debug the calico issue | 10:05 |
flwang1 | meanwhile, i will consult the calico team as well | 10:05 |
strigazi | does the argument about dns make sense? | 10:05 |
strigazi | flwang1: please cc me | 10:05 |
strigazi | if it is public | 10:05 |
flwang1 | i will go to the calico slack channel | 10:05 |
strigazi | github issue too? | 10:06 |
flwang1 | good idea | 10:06 |
brtknr | i think they will try and ask for cash for advice :) | 10:06 |
flwang1 | i will cc you then | 10:06 |
flwang1 | brtknr: no, they won't ;) | 10:06 |
flwang1 | i asked them before and they're nice | 10:06 |
brtknr | ok maybe just the weave people then | 10:07 |
flwang1 | alright, i have to go, guys | 10:07 |
strigazi | good night | 10:07 |
brtknr | ok sleep well! | 10:07 |
flwang1 | ttyl | 10:08 |
*** flwang1 has quit IRC | 10:08 | |
*** rcernin has quit IRC | 10:17 | |
*** trident has quit IRC | 10:29 | |
*** trident has joined #openstack-containers | 10:31 | |
*** trident has quit IRC | 10:33 | |
*** xinliang has quit IRC | 10:36 | |
*** trident has joined #openstack-containers | 10:37 | |
*** sapd1_x has quit IRC | 11:06 | |
*** yolanda has quit IRC | 11:27 | |
*** markguz_ has quit IRC | 11:31 | |
*** yolanda has joined #openstack-containers | 11:32 | |
*** udesale_ has joined #openstack-containers | 12:42 | |
*** udesale has quit IRC | 12:45 | |
*** ramishra has quit IRC | 12:47 | |
*** ramishra has joined #openstack-containers | 12:58 | |
*** sapd1_x has joined #openstack-containers | 13:38 | |
*** sapd1_x has quit IRC | 14:55 | |
*** udesale_ has quit IRC | 14:57 | |
*** ykarel is now known as ykarel|away | 15:25 | |
brtknr | cosmicsound: are you using etcd tag 3.4.3? | 15:38 |
brtknr | you need to override it when using coreos | 15:38 |
brtknr | i think | 15:39 |
brtknr | check that etcd is running on the master | 15:39 |
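A couple of hedged checks for that, assuming the Fedora CoreOS driver runs etcd as a podman container under a systemd unit of the same name (the unit/container names are assumptions), plus the label used to pin the tag:

    ssh core@<master-ip> 'sudo podman ps --filter name=etcd'
    ssh core@<master-ip> 'sudo systemctl status etcd --no-pager'
    # when creating the template: --labels etcd_tag=v3.4.3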
*** sapd1 has joined #openstack-containers | 15:41 | |
*** sapd1 has quit IRC | 16:24 | |
openstackgerrit | Bharat Kunwar proposed openstack/magnum master: Build new autoscaler containers https://review.opendev.org/714986 | 16:27 |
*** tobias-urdin has joined #openstack-containers | 18:01 | |
tobias-urdin | quick question, if anybody knows: i deployed a kubernetes v1.15.7 cluster with magnum | 18:11 |
tobias-urdin | and using k8scloudprovider/openstack-cloud-controller-manager:v1.15.0 | 18:11 |
tobias-urdin | is the openstack ccm v1.15.0 supposed to work with v1.15.7? | 18:11 |
tobias-urdin | lxkong: maybe knows? ^ | 18:12 |
tobias-urdin | fails on: | 18:12 |
tobias-urdin | kubectl create -f /srv/magnum/kubernetes/openstack-cloud-controller-manager.yaml | 18:12 |
tobias-urdin | error: SchemaError(io.k8s.api.autoscaling.v2beta1.ExternalMetricSource): invalid object doesn't have additional properties | 18:12 |
tobias-urdin | https://github.com/openstack/magnum/blob/master/magnum/drivers/common/templates/kubernetes/fragments/kube-apiserver-to-kubelet-role.sh#L154 | 18:12 |
tobias-urdin | can reproduce; the error message doesn't help to point out anything specific in the yaml file, so probably an incompatibility issue | 18:12 |
tobias-urdin | i will try to respawn the cluster with v1.15.0 instead, maybe openstack-cloud-controller-manager needs to release new versions to support stable v1.15 | 18:13 |
*** irclogbot_1 has quit IRC | 18:37 | |
tobias-urdin | with k8s v1.15.0 error: SchemaError(io.k8s.api.node.v1alpha1.RuntimeClassSpec): invalid object doesn't have additional properties | 18:48 |
*** irclogbot_0 has joined #openstack-containers | 19:01 | |
NobodyCam | Good Morning Folks; I am attempting to deploy a v1.15.9 kubernetes cluster with Rocky and am having some issues. | 19:06 |
NobodyCam | "kube_cluster_deploy" ends up timing out. are there tricks to get this working... I.e. setting calico tags differently? "kube_tag=v1.15.9,tiller_enabled=True,availability_zone=nova,calico_tag=v2.6.12,calico_cni_tag=v1.11.8,calico_kube_controllers_tag=v1.0.5,heat_container_agent_tag=rawhide" | 19:08 |
*** irclogbot_0 has quit IRC | 19:37 | |
*** irclogbot_2 has joined #openstack-containers | 19:40 | |
*** irclogbot_2 has quit IRC | 19:42 | |
*** irclogbot_1 has joined #openstack-containers | 19:45 | |
*** irclogbot_1 has quit IRC | 20:00 | |
*** irclogbot_3 has joined #openstack-containers | 20:03 | |
*** irclogbot_3 has quit IRC | 20:12 | |
*** irclogbot_2 has joined #openstack-containers | 20:15 | |
*** irclogbot_2 has quit IRC | 20:16 | |
tobias-urdin | the issue seems to be the kubectl version in the heat-container-agent; if i copy the file openstack-cloud-controller-manager.yaml to my computer and run it from there, it works | 20:17 |
tobias-urdin | /var/lib/containers/atomic/heat-container-agent.0/rootfs/usr/bin/kubectl version | 20:18 |
tobias-urdin | Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"archive", BuildDate:"2018-07-25T11:20:04Z", GoVersion:"go1.11beta2", Compiler:"gc", Platform:"linux/amd64"} | 20:18 |
tobias-urdin | and locally | 20:18 |
tobias-urdin | $kubectl version | 20:18 |
tobias-urdin | Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:42:56Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"} | 20:18 |
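A common workaround for that kind of SchemaError from an old client-side validator is to skip client validation (a standard kubectl flag); the longer-term fix is a heat-container-agent image that ships a newer kubectl, as discussed later:

    kubectl create --validate=false -f /srv/magnum/kubernetes/openstack-cloud-controller-manager.yaml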
*** irclogbot_0 has joined #openstack-containers | 20:21 | |
NobodyCam | https://www.irccloud.com/pastebin/JjeXp3Pk/ | 20:42 |
NobodyCam | I end up with : | 20:44 |
NobodyCam | cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d | 20:44 |
NobodyCam | kubelet.go:2173] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized | 20:44 |
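Some generic first checks when the CNI config never appears, assuming the Fedora Atomic driver where the heat agent runs under a heat-container-agent systemd unit (the names here are assumptions):

    kubectl -n kube-system get pods -o wide                  # did the calico/flannel daemonset pod start on that node?
    ssh <minion-ip> 'sudo ls /etc/cni/net.d/'                # stays empty until the network plugin writes its config
    ssh <minion-ip> 'sudo journalctl -u heat-container-agent --no-pager | tail -n 50'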
tobias-urdin | NobodyCam: sorry was following up on my issue before yours, not related | 20:44 |
NobodyCam | :) All Good! | 20:45 |
NobodyCam | I was following your issue because on the surface it seemed close to what I was seeing locally | 20:46 |
brtknr | tobias-urdin: which version of magnum are you running? | 21:02 |
brtknr | NobodyCam: I wouldn’t change the default calico and calico_cni_tag | 21:04 |
brtknr | also i think the latest kube_tag supported in rocky is v1.15.11 | 21:04 |
NobodyCam | brtknr: Thank you :) I have attempted with the Defaults too | 21:08 |
tobias-urdin | brtknr: 7.2.0 rocky release | 21:26 |
*** flwang1 has joined #openstack-containers | 22:11 | |
flwang1 | brtknr: ping | 22:11 |
brtknr | flwang1: pong | 22:17 |
brtknr | i was waiting for u! | 22:17 |
flwang1 | brtknr: :) | 22:17 |
flwang1 | brtknr: i'm reviewing the logic of _poll_health_status | 22:18 |
flwang1 | i don't really understand why you said 2 api calls to get the health status | 22:18 |
flwang1 | brtknr: are you still there? | 22:21 |
*** Jeffrey4l has quit IRC | 22:27 | |
brtknr | flwang1: yes sorry i'm trying to get my dsvm up again after the calico cluster was repeatedly failing on the clean master branch | 22:28 |
brtknr | flwang1: ok i take back that we dont need two api calls | 22:29 |
*** Jeffrey4l has joined #openstack-containers | 22:29 | |
brtknr | i didn't realise that it was possible to make multiple updates in a single api call | 22:29 |
brtknr | that said, i am still not a great fan of the logic for determining health status living in the magnum-auto-healer :( | 22:30 |
brtknr | i feel like the easiest time to make this change is now, it will only become harder to change this once it is merged | 22:31 |
flwang1 | so you mean totally disallowing updates to the health_status field? | 22:32 |
flwang1 | my point is, the health_status_reason is really a dict/json, and it could be anything inside, depending on how the cloud provider would like to leverage it | 22:33 |
flwang1 | for example, i'm trying to put the 'updated_at' into the health_status_reason, so that the 3rd monitor code can get more information from there | 22:34 |
flwang1 | if we totally limit the format of the health_status_reason, then we will lose the flexibility | 22:35 |
flwang1 | brtknr: ^ | 22:35 |
*** vishalmanchanda has quit IRC | 22:35 | |
brtknr | flwang1: yes, i mean prevent update of health_status field | 22:36 |
flwang1 | I can see your point, we get a bit of benefit but meanwhile we will lose a lot of flexibility | 22:36 |
brtknr | and let the magnum-conductor work it out | 22:36 |
flwang1 | brtknr: another thing is | 22:37 |
flwang1 | our current heath_status_reason is quite simple, as you can see, now it only get the Ready condition | 22:37 |
flwang1 | if you put more information in, say supporting more conditions of the nodes and master, then the calculation could be a mess | 22:38 |
flwang1 | which i don't think magnum should control that much | 22:38 |
flwang1 | in other words, when I designed this | 22:39 |
flwang1 | the main thing the cloud provider admin cares about is the health status, and the health_status_reason is just a reference, not the other way around | 22:39 |
flwang1 | brtknr: TBH, i don't want to maintain such logic in magnum | 22:40 |
flwang1 | magnum is a platform; as long as we open these 2 fields for the cloud admin, we want to grant flexibility instead of limiting it | 22:41 |
brtknr | ok fine i see your argument | 22:42 |
NobodyCam | I am able to deploy up to v1.13.11 on my rocky install | 22:42 |
brtknr | if we have the option to disable the polling from magnum side, i would be happy with that solution | 22:42 |
NobodyCam | 1.14.x and above fail | 22:42 |
flwang1 | brtknr: you mean totally disable it? for that case, we probably have to introduce a config | 22:43 |
flwang1 | but actually, it's really a cluster-by-cluster setting | 22:43 |
NobodyCam | https://wiki.openstack.org/wiki/Magnum#Compatibility_Matrix says 1.15.X should work? | 22:44 |
flwang1 | i don't think totally disable it is a good idea, TBH | 22:44 |
brtknr | flwang1: it doesn't make sense to have the internal poller and magnum auto healer stepping on each other's toes | 22:45 |
brtknr | NobodyCam: I got 1.15.x working when i last tested rocky | 22:45 |
brtknr | i probably had to use heat_container_agent_tag=train-stable | 22:46 |
brtknr | i cant remember 100% | 22:46 |
NobodyCam | nice! I'm still working on it... | 22:46 |
NobodyCam | oh Thank you I can try that | 22:46 |
flwang1 | brtknr: as i mentioned above, some clusters may be public with no auto healer running on them, and some clusters may be private with the auto healer running on them | 22:46 |
brtknr | NobodyCam: actually try train-stable-2 | 22:46 |
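Putting brtknr's suggestions together, an illustrative Rocky-era template (the label names are real Magnum labels; the image, flavors and network names are placeholders):

    openstack coe cluster template create k8s-v1-15 \
        --coe kubernetes --image <fedora-atomic-image> --external-network <ext-net> \
        --master-flavor <flavor> --flavor <flavor> --network-driver calico \
        --labels kube_tag=v1.15.11,heat_container_agent_tag=train-stable-2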
brtknr | flwang1: i meant the option to disable it for each cluster | 22:47 |
brtknr | e.g. if auto healer is running | 22:47 |
brtknr | could disable automatically if auto healer is running | 22:48 |
flwang1 | you mean checking the magnum-auto-healer when doing the accessible validation? | 22:48 |
flwang1 | or a separate function to disable it? | 22:48 |
flwang1 | no problem, i can do that | 22:49 |
brtknr | something that stops the internal poller and magnum auto healer fighting like cats and dogs | 22:51 |
flwang1 | sure, i will fix it in next ps | 22:54 |
flwang1 | thank you for your review | 22:54 |
flwang1 | and glad to see we're on the same page now | 22:54 |
brtknr | flwang1: :) | 22:55 |
brtknr | flwang1: btw is calico working for you on master branch | 22:55 |
brtknr | without your patch | 22:55 |
flwang1 | brtknr: i didn't try that yet TBH | 22:55 |
flwang1 | but it works well on our prod | 22:56 |
brtknr | hmmm | 22:56 |
*** rcernin has joined #openstack-containers | 22:56 | |
brtknr | it appears to be working on stable/train but broken on master | 22:57 |
brtknr | e.g. the cluster-autoscaler cannot reach keystone for auth | 22:57 |
brtknr | same with cinder-csi-plugin | 22:57 |
brtknr | otherwise all reports healthy | 22:57 |
flwang1 | brtknr: try to open port 179 on the master | 22:59 |
brtknr | what? manually? | 23:00 |
brtknr | flwang1: but its not an issue on stable/train branch | 23:05 |
brtknr | only on master | 23:05 |
flwang1 | ok, then i'm not sure, probably a regression issue | 23:05 |
brtknr | flwang1: hmm looks like the regression might be caused by your patch changing the default calico_ipv4pool | 23:18 |
flwang1 | brtknr: no way | 23:19 |
flwang1 | it's impossible :) | 23:19 |
brtknr | flwang1: yes way! | 23:20 |
flwang1 | how? can you pls show me? | 23:20 |
brtknr | when i revert the change, it works | 23:20 |
flwang1 | then you should check your local settings | 23:21 |
flwang1 | are you using 10.100.x.x for your local vm network? | 23:21 |
brtknr | magnum/drivers/heat/k8s_coreos_template_def.py:58: cluster.labels.get('calico_ipv4pool', '192.168.0.0/16') | 23:21 |
brtknr | magnum/drivers/heat/k8s_fedora_template_def.py:58: cluster.labels.get('calico_ipv4pool', '192.168.0.0/16') | 23:21 |
flwang1 | shit, my bad | 23:22 |
flwang1 | brtknr: i will submit a fix soon | 23:23 |
brtknr | flwang1: mind if i propose? | 23:24 |
flwang1 | i will submit the patch in 5 secs | 23:25 |
brtknr | ok | 23:27 |
brtknr | we should remove the defaults from kubecluster.yaml | 23:28 |
brtknr | especially if pod_network_cidr depends on it | 23:28 |
brtknr | i wonder if this will fix calico upgrade? | 23:28 |
openstackgerrit | Feilong Wang proposed openstack/magnum master: Fix calico regression issue caused by default ipv4pool change https://review.opendev.org/715093 | 23:28 |
flwang1 | let's see | 23:29 |
flwang1 | i will test the calico upgrade again with this one | 23:29 |
flwang1 | brtknr: https://review.opendev.org/715093 | 23:29 |
flwang1 | brtknr: i'm sorry for the regression issue :( | 23:31 |
flwang1 | and my stupid confidence :D | 23:31 |
brtknr | hey, we approved it so partly our fault too | 23:31 |
brtknr | we should remove those defaults from kubecluster.yaml if they are never used | 23:32 |
flwang1 | brtknr: or remove the defaults from the python code, thoughts? | 23:32 |
flwang1 | let's get this one in, and you can work on how to handle the default value? | 23:33 |
brtknr | hmm makes more sense to handle it in python code though | 23:34 |
brtknr | since pod_network_cidr has to match flannel_network_cidr or calico_ipv4pool | 23:35 |
brtknr | i'm sure this logic would be far more complicated in a heat template | 23:35 |
flwang1 | ok, but anyway, let's do that in a separate patch | 23:45 |
brtknr | flwang1: but do you agree with what i said? that the logic would be more complicated to implement in heat template? | 23:57 |