airship-irc-bot | <mattmceuen> Hey @sean.eagan I'm able to consistently reproduce an issue with current airshipctl (via airship in a pod), where 1. the `controlplane-ephemeral` phase completes 2. the subsequent interaction with the target cluster fails because it's not really up I see this airshipctl/cliutils output: ```[airshipctl] 2021/07/14 16:05:12 opendev.org/airship/airshipctl/pkg/k8s/applier/applier.go:77: Getting infos for bundle, inventory id is | 16:26 |
---|---|---|
airship-irc-bot | controlplane-ephemeral ... baremetalhost.metal3.io/node01 is Current: Resource is current kubeadmcontrolplane.controlplane.cluster.x-k8s.io/cluster-controlplane is Current: Resource is current metal3cluster.infrastructure.cluster.x-k8s.io/target-cluster is Current: Resource is current [airshipctl] 2021/07/14 16:05:18 opendev.org/airship/airshipctl/pkg/k8s/applier/applier.go:92: applier channel closed | 16:26 |
airship-irc-bot | metal3machinetemplate.infrastructure.cluster.x-k8s.io/cluster-controlplane is Current: Resource is current [airshipctl] 2021/07/14 16:05:18 opendev.org/airship/airshipctl/pkg/phase/client.go:295: executing phase: kubectl-get-node-target``` But the state of the `kubeadmcontrolplane` is still: ```root@airship-in-a-pod:/# kubectl --kubeconfig /root/.airship/kubeconfig --context ephemeral-cluster get kubeadmcontrolplane -A | 16:26 |
airship-irc-bot | NAMESPACE NAME READY INITIALIZED REPLICAS READY REPLICAS UPDATED REPLICAS UNAVAILABLE REPLICAS target-infra cluster-controlplane 1 1 1 ``` | 16:26 |
airship-irc-bot | <mattmceuen> We'd expect cliutils/airshipctl to wait till the kubeadmcontrolplane was in a READY status before proceeding, right? Do you know of any corner cases that could defeat that? | 16:26 |
airship-irc-bot | <kk6740> @mattmceuen no, that isn’t the case i think. because cluster-api objects don’t fully implement ready conditions. So as far as i remember, we had a special waiter that made sure control plane is reachable | 16:33 |
airship-irc-bot | <mattmceuen> Ah ok - thanks Konstantine, that may be what's getting defeated (and may actually be the thing that's breaking with `Unable to connect to the server: dial tcp 10.23.25.102:6443: connect: no route to host`) -- I'll dig into that a bit | 16:35 |
airship-irc-bot | <kk6740> what is the next phase that is coming after that? | 16:43 |
airship-irc-bot | <kk6740> in airship in the pod env | 16:43 |
airship-irc-bot | <mattmceuen> kubectl-get-node-target is what's defined in the gating phase plan (so I don't think specific to aiap) | 16:44 |
airship-irc-bot | <mattmceuen> Later on we call a kubectl-wait-cluster-target -- maybe we should be calling it at that point as well? | 16:44 |
airship-irc-bot | <kk6740> get node should be able to wait for node to become reachable, and retry 60 times with interval 30 seconds | 16:46 |
airship-irc-bot | <mattmceuen> ok, let me make sure that's the phase that really gets called in aiap :slightly_smiling_face: | 16:46 |
airship-irc-bot | <kk6740> i don’t remember aiap setup, maybe its not working with the plan at the moment | 16:47 |
airship-irc-bot | <mattmceuen> It looks like `kubectl-get-node-target` is the phase that's failing, with ```[airshipctl] 2021/07/14 16:05:18 opendev.org/airship/airshipctl/pkg/phase/client.go:295: executing phase: kubectl-get-node-target {"Message":"starting generic container","Operation":"GenericContainerStart","Timestamp":"2021-07-14T16:05:18.894734428Z","Type":"GenericContainerEvent"} [airshipctl] 2021/07/14 16:05:18 | 16:48 |
airship-irc-bot | opendev.org/airship/airshipctl/pkg/k8s/kubeconfig/builder.go:258: Received error when extracting context, ignoring kubeconfig. Error: failed merging kubeconfig: source context 'target-cluster' does not exist in source kubeconfig [airshipctl] 2021/07/14 16:05:18 opendev.org/airship/airshipctl/pkg/k8s/kubeconfig/builder.go:168: Merging kubecontext for cluster 'target-cluster', into site kubeconfig [airshipctl] 2021/07/14 16:05:18 | 16:48 |
airship-irc-bot | opendev.org/airship/airshipctl/pkg/phase/executors/container.go:184: Config reference is specified, looking for the object in config ref: '&ObjectReference{Kind:ConfigMap,Namespace:,Name:kubectl-get-node,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,}' [airshipctl] 2021/07/14 16:05:19 Filtering input bundle by Group: , Version: , Kind: + kubectl --kubeconfig /kubeconfig --context target-cluster --request-timeout 10s get node Unable to connect | 16:48 |
airship-irc-bot | to the server: dial tcp 10.23.25.102:6443: connect: no route to host``` | 16:48 |
airship-irc-bot | <kk6740> did it exit? because it should try it 60 times | 16:49 |
airship-irc-bot | <kk6740> and output this error until it succeeds | 16:49 |
airship-irc-bot | <mattmceuen> Yeah, it exited immediately | 16:49 |
airship-irc-bot | <mattmceuen> no loop | 16:49 |
airship-irc-bot | <mattmceuen> weird | 16:49 |
airship-irc-bot | <kk6740> https://github.com/airshipit/airshipctl/blob/master/manifests/function/phase-helpers/wait_node/kubectl_wait_node.sh | 16:50 |
airship-irc-bot | <kk6740> this is the script | 16:50 |
airship-irc-bot | <mattmceuen> `set -xe` | 16:50 |
airship-irc-bot | <mattmceuen> Plus `timeout 20 kubectl --context $KCTL_CONTEXT get node` | 16:51 |
airship-irc-bot | <mattmceuen> won't that quit out if kubectl returns non-zero? | 16:51 |
airship-irc-bot | <kk6740> ```"$(timeout 20 \ kubectl --context $KCTL_CONTEXT \ get node -o name | wc -l)"``` This is executed in a subshell, and i don’t think `-xe` flags are propagated to the subshell | 16:51 |
airship-irc-bot | <mattmceuen> ah I see | 16:52 |
airship-irc-bot | <kk6740> but i have a feeling that this part succeeds ```"$(timeout 20 \ kubectl --context $KCTL_CONTEXT \ get node -o name | wc -l)"``` while this one fails: ```timeout 20 kubectl --context $KCTL_CONTEXT get node``` | 16:53 |
airship-irc-bot | <mattmceuen> This is what we see in the airshipctl output as the failing line: ```kubectl --kubeconfig /kubeconfig --context target-cluster --request-timeout 10s get node``` | 16:53 |
airship-irc-bot | <kk6740> w8 i am looking at the wrong script :slightly_smiling_face: | 16:53 |
airship-irc-bot | <mattmceuen> That doesn't match either of the ones in the script, does it? | 16:53 |
airship-irc-bot | <mattmceuen> hahaha | 16:53 |
airship-irc-bot | <kk6740> https://github.com/airshipit/airshipctl/blob/master/manifests/function/phase-helpers/get_node/kubectl_get_node.sh | 16:54 |
airship-irc-bot | <kk6740> can u check previous phase? | 16:54 |
airship-irc-bot | <mattmceuen> Goes straight from `controlplane-ephemeral` -> `kubectl-get-node-target` | 16:55 |
airship-irc-bot | <kk6740> that certainly looks like a wrong order | 16:55 |
airship-irc-bot | <kk6740> because i think we need to wait for the node first, and then get it | 16:55 |
airship-irc-bot | <mattmceuen> The strange thing is, why am I hitting this and the gates aren't? | 16:55 |
airship-irc-bot | <kk6740> i actually dont think we need to get it at all :slightly_smiling_face: | 16:55 |
airship-irc-bot | <mattmceuen> yeah :slightly_smiling_face: | 16:56 |
airship-irc-bot | <kk6740> that may be because controlplane-ephemeral actually waits for some conditions | 16:56 |
airship-irc-bot | <kk6740> but not exactly what we need | 16:57 |
airship-irc-bot | <kk6740> so sometimes by the time it exits, node is available, and sometimes it's not | 16:57 |
airship-irc-bot | <mattmceuen> yeah, so may be sensitive to the environment it's running in | 16:58 |
airship-irc-bot | <kk6740> yes, so to mitigate that, we need different order in the plan | 16:58 |
airship-irc-bot | <mattmceuen> what about the `kubectl-wait-cluster-target`, is this the waiting scenario it's designed for, or no? | 16:59 |
airship-irc-bot | <kk6740> exactly that phase is working against target-cluster api, and it's not available yet at this point. But we can reuse the script, and create a phase `kubectl-wait-cluster-ephemeral` that would do the same for us | 17:02 |
airship-irc-bot | <kk6740> but on ephemeral API | 17:02 |
airship-irc-bot | <mattmceuen> ah I see | 17:02 |
airship-irc-bot | <kk6740> wait-cluster waits for a cluster object to reach controlPlaneReady condition, so this is a good way for us to wait | 17:03 |
airship-irc-bot | <kk6740> so it would be good for transparency and predictability: ```- kubectl-wait-cluster-ephemeral - kubectl-wait-node-target - kubectl-get-node-target``` | 17:05 |
airship-irc-bot | <mattmceuen> I added the first two with: https://review.opendev.org/c/airship/airshipctl/+/800819 and will check whether it fixes my problem in aiap. We already have kubectl-get-node-target though | 17:34 |
airship-irc-bot | <kk6740> :+1: let's see how it goes | 17:37 |
airship-irc-bot | <hr858f> Team, can I have reviews: https://review.opendev.org/c/airship/sip/+/800566? Thanks. | 22:05 |
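
The log above turns on two shell behaviours: a bare `kubectl get node` under `set -xe` exits on the first connection error, while a retry loop keeps trying until the node answers. Below is a minimal sketch of such a loop, assuming a POSIX shell. `retry` is a hypothetical helper written for illustration; it is not the contents of `kubectl_wait_node.sh` or `kubectl_get_node.sh`. The 60 attempts and 30-second interval in the usage comment are the figures kk6740 mentions.

```shell
set -e

# Hypothetical helper: retry <attempts> <interval-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  attempts=$1
  interval=$2
  shift 2
  i=1
  while :; do
    # A command tested by `if` does not trigger `set -e` when it fails,
    # so a failed attempt does not abort the whole script.
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    echo "attempt $i of $attempts failed, retrying in ${interval}s" >&2
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage in the scenario discussed above (not executed here):
#   retry 60 30 kubectl --context "$KCTL_CONTEXT" --request-timeout 10s get node
```

Called as a plain command, `retry` still lets `set -e` stop the script once every attempt has failed, so a cluster that never comes up is reported as an error rather than silently ignored.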