-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: [zuul/nodepool] 929503: Node instances: label modifiers (aws spot, fleet) https://review.opendev.org/c/zuul/nodepool/+/929503 | 08:04 | |
@harbott.osism.tech:regio.chat | Clark: the job has been failing for weeks, unlikely to be related to raxflex | 12:35 |
---|---|---|
@harbott.osism.tech:regio.chat | is there a reason that the openstack nodepool driver doesn't support `volume-type` as attribute like aws + gce do? | 12:36 |
-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: [zuul/nodepool] 929503: Node instances: label modifiers (aws spot, fleet) https://review.opendev.org/c/zuul/nodepool/+/929503 | 12:41 | |
@fungicide:matrix.org | > <@harbott.osism.tech:regio.chat> is there a reason that the openstack nodepool driver doesn't support `volume-type` as attribute like aws + gce do? | 12:43 |
my guess would be simple oversight, since it seems like bfv use cases could take advantage of it (assuming the attribute isn't always baked into the flavor) | ||
-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: [zuul/zuul] 929641: node model: adapt to instance_properties https://review.opendev.org/c/zuul/zuul/+/929641 | 13:09 | |
-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: | 14:10 | |
- [zuul/nodepool] 929503: Node instances: label modifiers (aws spot, fleet) https://review.opendev.org/c/zuul/nodepool/+/929503 | ||
- [zuul/nodepool] 929651: metastatic driver: instance_properties flag https://review.opendev.org/c/zuul/nodepool/+/929651 | ||
@clarkb:matrix.org | > <@harbott.osism.tech:regio.chat> Clark: the job has been failing for weeks, unlikely to be related to raxflex | 15:08 |
Thanks, I wonder what the best path forward is here. I guess I can try to spend a little time debugging that job, or we can mark it non voting for now? Any opinions from the group here? | ||
@jim:acmegating.com | i think it should be fixed; it's usually quite reliable and helpful | 15:36 |
@clarkb:matrix.org | the job log does seem to show what looks like a successful microk8s deployment, but we don't have any k8s service logs so difficult to confirm. Might need to hold a node and see what logs are available and what might be breaking unless someone is familiar with microk8s and how we should better collect those logs upfront | 15:38 |
@clarkb:matrix.org | autohold is in place and I have rechecked the job. We shall see what that reveals | 16:06 |
@harbott.osism.tech:regio.chat | * ah, I had held a node already, but didn't get to look earlier. from `/var/log/syslog` | 16:12 |
``` | ||
2024-09-17T16:10:42.168263+00:00 debian microk8s.daemon-kubelite[3016068]: E0917 16:10:42.168115 3016068 controller.go:97] Error removing old endpoints from kubernetes service: no API server IP addresses were listed in storage, refusing to erase all endpoints for the kubernetes Service | ||
``` | ||
@harbott.osism.tech:regio.chat | although that's maybe not actually the fatal error | 16:13 |
@clarkb:matrix.org | that almost reads like k8s is keeping the api up for us so ya probably something else instead? One thing I wanted to check is the port bindings | 16:14 |
@jim:acmegating.com | (i did confirm that this has failed on clouds other than raxflex (as expected)) | 16:14 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Tobias Henkel: [zuul/zuul] 922450: Add spec for OIDC Workload Identity Federation https://review.opendev.org/c/zuul/zuul/+/922450 | 16:19 | |
@jim:acmegating.com | fungi swest tristanC Albin Vass ^ made a small update if you want to re-review | 16:23 |
@jim:acmegating.com | Clark: i see some different errors in the job failures. also -- the most recent one is a FAILURE not a RETRY | 16:29 |
@jim:acmegating.com | couple more failures in there too | 16:29 |
@clarkb:matrix.org | corvus: you mean the tracebacks differ in different runs? We're polling for ready nodes within a timeout so plenty of time for different behaviors I suppose | 16:30 |
@jim:acmegating.com | https://zuul.opendev.org/t/zuul/builds?job_name=nodepool-functional-k8s&skip=0 says that we have pre-run failures and run failures | 16:31 |
@jim:acmegating.com | the run failures are tracebacks; the pre-run are k8s setup failures | 16:31 |
@jim:acmegating.com | perhaps a related (or same) underlying cause... but just pointing that out as a variable | 16:31 |
@clarkb:matrix.org | oh interesting I hadn't noticed that thanks | 16:34 |
@clarkb:matrix.org | the failure in pre-run looks like an issue querying the namespaces too. I wonder if it is a race to failure of the k8s cluster, and then depending on how fast that occurs we see the job errors/failures in different locations | 16:35 |
@clarkb:matrix.org | 172.99.69.73 is my new held node and `netstat -lnp` reports `tcp6 0 0 :::16443 :::* LISTEN 132812/kubelite` which looks like something is listening anyway | 16:46 |
@jim:acmegating.com | journalctl -u snap.microk8s.daemon-kubelite.service | 16:51 |
@jim:acmegating.com | what in the world is going on there ^ ? | 16:51 |
@jim:acmegating.com | 283741 lines, most are shell greps? | 16:51 |
@clarkb:matrix.org | it logs "Starting kubelite" over and over again | 16:52 |
@clarkb:matrix.org | so maybe we're working until the time where it restarts kubelite again? | 16:52 |
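A rough way to check the restart-loop theory is to count how often the start marker shows up in the unit's journal. The sketch below assumes the journal text has already been captured as a string; the `count_restarts` helper and the sample lines are hypothetical illustrations, not output copied from the held node.

```python
def count_restarts(journal_text: str, marker: str = "Starting kubelite") -> int:
    """Count journal lines that contain the (assumed) service start marker."""
    return sum(1 for line in journal_text.splitlines() if marker in line)


# Hypothetical sample, shaped like journalctl output but invented for illustration.
SAMPLE = """\
Sep 17 16:20:40 debian microk8s.daemon-kubelite[12884]: Starting kubelite
Sep 17 16:20:41 debian microk8s.daemon-kubelite[12884]: "Failed to start ContainerManager"
Sep 17 16:20:52 debian microk8s.daemon-kubelite[13102]: Starting kubelite
"""

restarts = count_restarts(SAMPLE)
print(f"kubelite start markers seen: {restarts}")
```

On a real node the input would presumably come from `journalctl -u snap.microk8s.daemon-kubelite.service --no-pager`; a count far above one would support the idea that the service keeps restarting.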
@clarkb:matrix.org | `E0917 16:20:41.708716 12884 kubelet.go:1519] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"` | 16:53 |
@clarkb:matrix.org | https://github.com/kubernetes/kubernetes/issues/122955#issuecomment-2020403422 maybe? | 16:54 |
@clarkb:matrix.org | so I'm remembering that we pinned microk8s because they made a broken release for debian in 1.29. There is now a 1.30 stable release and several other updates to both 1.28 and 1.29 (most recently as of September 9) | 16:55 |
@clarkb:matrix.org | maybe we try bumping the version to the 1.30 stable release and see if that is happier | 16:55 |
@jim:acmegating.com | can't hurt (maybe something in debian was updated that fixes the new stuff and breaks the old?) | 16:56 |
@jim:acmegating.com | https://github.com/canonical/microk8s/issues/4361#issuecomment-2336768715 | 16:57 |
@jim:acmegating.com | maybe 1.31? :) | 16:57 |
@jim:acmegating.com | let's just upload a bunch of changes | 16:57 |
@jim:acmegating.com | try 1.30 and 31 | 16:58 |
@clarkb:matrix.org | I'll start with 31 | 16:58 |
@clarkb:matrix.org | I didn't realize 31 was newest because the snap store sorts 30 at the top of the list | 16:58 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 929685: Update MicroK8s to 1.31 stable release https://review.opendev.org/c/zuul/nodepool/+/929685 | 17:00 | |
@jim:acmegating.com | ├─snap.microk8s.daemon-kubelite.service | 17:03 |
├─snap.microk8s.daemon-apiserver-kicker.service | ||
├─snap.microk8s.daemon-cluster-agent.service | ||
├─snap.microk8s.daemon-containerd.service | ||
├─snap.microk8s.daemon-k8s-dqlite.service | ||
@jim:acmegating.com | do we want to grab journalctl logs from all those? | 17:03 |
@jim:acmegating.com | not sure if there's a better way to get logs | 17:03 |
@clarkb:matrix.org | ++ presumably there may also be a kubepods service? | 17:05 |
@jim:acmegating.com | oh how about `microk8s inspect` | 17:05 |
@jim:acmegating.com | https://microk8s.io/docs/troubleshooting | 17:05 |
@clarkb:matrix.org | I ran that on this node it produces a tarball | 17:05 |
@clarkb:matrix.org | we could just grab that and keep it simple | 17:05 |
@jim:acmegating.com | me too :) | 17:06 |
@jim:acmegating.com | i untarred it in /tmp | 17:06 |
@jim:acmegating.com | seems to have those logs and more | 17:06 |
@clarkb:matrix.org | the tarball is only 1.6MB and the extra info might not be something you'd share from prod machines but should be fine from test nodes | 17:07 |
@clarkb:matrix.org | the change to use 1.31/stable passed | 17:09 |
@clarkb:matrix.org | or the job passed, still waiting on the buildset for the change | 17:09 |
@jim:acmegating.com | Clark: i approved it; i think that's straightforward enough | 17:10 |
@clarkb:matrix.org | thanks | 17:11 |
@clarkb:matrix.org | corvus: were you working on a change to collect a microk8s inspect tarball or should I do that? Should we just do it for every job success or failure? | 17:15 |
@jim:acmegating.com | Clark: i did not (though i did look to see that i think it probably wants to go in the collect-kubernetes-logs role); i think maybe all runs? | 17:17 |
@clarkb:matrix.org | I can't find evidence that inspect takes an argument to specify the tarball output. But we can parse the stdout to get the path | 17:17 |
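Parsing the path out of stdout could look roughly like the sketch below. It assumes only that the command prints an absolute path ending in `.tar.gz` somewhere in its output; the `find_report_tarball` helper, the regex, and the sample text are hypothetical and are not taken from real `microk8s inspect` output.

```python
import re
from typing import Optional

# Assumption: the report path is an absolute path with no spaces, ending in .tar.gz.
TARBALL_RE = re.compile(r"(/\S+\.tar\.gz)")


def find_report_tarball(stdout: str) -> Optional[str]:
    """Return the last .tar.gz path mentioned in stdout, or None if there is none."""
    matches = TARBALL_RE.findall(stdout)
    return matches[-1] if matches else None


# Hypothetical stdout, invented for illustration only.
SAMPLE_STDOUT = """\
Inspecting services
Building the report tarball
  Report tarball is at /var/snap/microk8s/current/inspection-report-example.tar.gz
"""

print(find_report_tarball(SAMPLE_STDOUT))
```

Matching on the file extension rather than on the surrounding wording keeps the sketch from depending on the exact message text, which may differ between microk8s releases.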
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Collection microk8s inspect info in k8s log collection role https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 17:30 | |
@clarkb:matrix.org | something like that maybe? We don't appear to have triggered any of the ensure-k8s jobs though. Let me see if I can fix that | 17:31 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Collection microk8s inspect info in k8s log collection role https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 17:35 | |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 17:45 | |
@clarkb:matrix.org | the test jobs for this hit the same problem as nodepool so I've updated things to be a version bump with better testing and log collection | 17:46 |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/nodepool] 929685: Update MicroK8s to 1.31 stable release https://review.opendev.org/c/zuul/nodepool/+/929685 | 17:49 | |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 17:56 | |
@clarkb:matrix.org | I think ^ is working now. Unfortunately it triggers 168 ansible lint warnings. I have meetings I need to pay attention to and prep for for the next little while but I guess I can put a lint fixup change under ^ | 18:05 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 18:15 | |
@clarkb:matrix.org | if anyone is wondering ansible continues to fail to parse comments with unbalanced ' in them | 18:27 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 18:27 | |
@clarkb:matrix.org | https://review.opendev.org/c/zuul/nodepool/+/929573 does pass testing now that we bumped the microk8s version. It has a couple of +2s should I go ahead and approve it or did anyone else want to review the hack first? | 19:54 |
@jim:acmegating.com | done | 19:57 |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/nodepool] 929573: Set git repo ownership for nodepool dib integration testing https://review.opendev.org/c/zuul/nodepool/+/929573 | 20:45 | |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 21:50 | |
@clarkb:matrix.org | so there were 167 warnings and 1 failure. The failure is that I didn't set pipefail on the script in the shell task. Previously I had a shell comment with a single `'` in it that exploded ansible and that's fine. But if you don't put pipefail in your script that is an error... | 21:51 |
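For context on why linters care about pipefail (the rule involved here is likely ansible-lint's `risky-shell-pipe`, though the log does not name it): without it, a pipeline's exit status is that of its last command, so an earlier failure is hidden. The sketch below assumes `bash` is on the PATH; `pipeline_status` is a hypothetical helper.

```python
import subprocess


def pipeline_status(script: str) -> int:
    """Run a script under bash (assumed to be installed) and return its exit status."""
    return subprocess.run(["bash", "-c", script], check=False).returncode


# Without pipefail the failing `false` is masked by the succeeding `true`.
without_pipefail = pipeline_status("false | true")

# With pipefail the pipeline reports the failure of `false`.
with_pipefail = pipeline_status("set -o pipefail; false | true")

print(without_pipefail, with_pipefail)
```

In a shell task, the same effect comes from starting the script with `set -o pipefail`, which needs a shell that supports the option, such as bash.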
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 | 22:05 | |
@clarkb:matrix.org | I think ^ is ready for review now | 22:27 |