Tuesday, 2024-09-17

-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: [zuul/nodepool] 929503: Node instances: label modifiers (aws spot, fleet) https://review.opendev.org/c/zuul/nodepool/+/929503 08:04
@harbott.osism.tech:regio.chatClark: the job has been failing for weeks, unlikely to be related to raxflex12:35
@harbott.osism.tech:regio.chatis there a reason that the openstack nodepool driver doesn't support `volume-type` as attribute like aws + gce do?12:36
-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: [zuul/nodepool] 929503: Node instances: label modifiers (aws spot, fleet) https://review.opendev.org/c/zuul/nodepool/+/929503 12:41
@fungicide:matrix.org> <@harbott.osism.tech:regio.chat> is there a reason that the openstack nodepool driver doesn't support `volume-type` as attribute like aws + gce do?12:43
my guess would be simple oversight, since it seems like bfv use cases could take advantage of it (assuming the attribute isn't always baked into the flavor)
-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: [zuul/zuul] 929641: node model: adapt to instance_properties https://review.opendev.org/c/zuul/zuul/+/929641 13:09
-@gerrit:opendev.org- Tudor-Stefan Tabacel-Manea proposed: 14:10
- [zuul/nodepool] 929503: Node instances: label modifiers (aws spot, fleet) https://review.opendev.org/c/zuul/nodepool/+/929503
- [zuul/nodepool] 929651: metastatic driver: instance_properties flag https://review.opendev.org/c/zuul/nodepool/+/929651
@clarkb:matrix.org> <@harbott.osism.tech:regio.chat> Clark: the job has been failing for weeks, unlikely to be related to raxflex15:08
Thanks, I wonder what the best path forward is here. I guess I can try to spend a little time debugging that job, or we can mark it non voting for now? Any opinions from the group here?
@jim:acmegating.comi think it should be fixed; it's usually quite reliable and helpful15:36
@clarkb:matrix.orgthe job log does seem to show what looks like a successful microk8s deployment, but we don't have any k8s service logs so difficult to confirm. Might need to hold a node and see what logs are available and what might be breaking unless someone is familiar with microk8s and how we should better collect those logs upfront15:38
@clarkb:matrix.orgautohold is in place and I have rechecked the job. We shall see what that reveals16:06
@harbott.osism.tech:regio.chat * ah, I had held a node already, but didn't get to look earlier. from `/var/log/syslog`16:12
```
2024-09-17T16:10:42.168263+00:00 debian microk8s.daemon-kubelite[3016068]: E0917 16:10:42.168115 3016068 controller.go:97] Error removing old endpoints from kubernetes service: no API server IP addresses were listed in storage, refusing to erase all endpoints for the kubernetes Service
```
@harbott.osism.tech:regio.chatalthough that's maybe not actually the fatal error16:13
@clarkb:matrix.orgthat almost reads like k8s is keeping the api up for us so ya probably something else instead? One thing I wanted to check is the port bindings16:14
@jim:acmegating.com(i did confirm that this has failed on clouds other than raxflex (as expected))16:14
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Tobias Henkel: [zuul/zuul] 922450: Add spec for OIDC Workload Identity Federation https://review.opendev.org/c/zuul/zuul/+/922450 16:19
@jim:acmegating.comfungi swest tristanC Albin Vass ^ made a small update if you want to re-review16:23
@jim:acmegating.comClark: i see some different errors in the job failures.  also -- the most recent one is a FAILURE not a RETRY16:29
@jim:acmegating.comcouple more failures in there too16:29
@clarkb:matrix.orgcorvus: you mean the tracebacks differ in different runs? We're polling for ready nodes within a timeout so plenty of time for different behaviors I suppose16:30
@jim:acmegating.com https://zuul.opendev.org/t/zuul/builds?job_name=nodepool-functional-k8s&skip=0 says that we have pre-run failures and run failures 16:31
@jim:acmegating.comthe run failures are tracebacks; the pre-run are k8s setup failures16:31
@jim:acmegating.comperhaps a related (or same) underlying cause... but just pointing that out as a variable16:31
@clarkb:matrix.orgoh interesting I hadn't noticed that thanks16:34
@clarkb:matrix.orgthe failure in pre-run looks like an issue querying the namespaces too. I wonder if it is race to failure of the k8s cluster then depending on how fast that occurs we see the job errors/failures in different locations16:35
@clarkb:matrix.org172.99.69.73 is my new held node and `netstat -lnp` reports `tcp6       0      0 :::16443                :::*                    LISTEN      132812/kubelite` which looks like something is listening anyway16:46
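The `netstat -lnp` check above only shows that a socket is bound. A minimal sketch of probing the same port from a script follows; the helper name `is_listening` and the default timeout are illustrative assumptions, not part of any job or role discussed here.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On the held node this could be called as `is_listening("localhost", 16443)`. A successful connect says only that something accepts TCP connections on that port, not that the API server behind it is healthy.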
@jim:acmegating.com journalctl -u snap.microk8s.daemon-kubelite.service 16:51
@jim:acmegating.comwhat in the world is going on there ^ ?16:51
@jim:acmegating.com283741 lines, most are shell greps?16:51
@clarkb:matrix.orgit logs "Starting kubelite" over and over again16:52
@clarkb:matrix.orgso maybe we're working until the time where it restarts kubelite again?16:52
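One way to put a number on the restart loop described above is to count the start-up marker in the unit's journal. This is a rough sketch: the function name `count_restarts` is hypothetical, and the marker string is taken from the message quoted in this log.

```python
from typing import Iterable


def count_restarts(lines: Iterable[str], marker: str = "Starting kubelite") -> int:
    """Count log lines that contain the start-up marker."""
    return sum(1 for line in lines if marker in line)
```

It could be fed the lines produced by `journalctl -u snap.microk8s.daemon-kubelite.service --no-pager`, for example by iterating over `sys.stdin` in a pipeline.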
@clarkb:matrix.org`E0917 16:20:41.708716   12884 kubelet.go:1519] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"`16:53
@clarkb:matrix.org https://github.com/kubernetes/kubernetes/issues/122955#issuecomment-2020403422 maybe? 16:54
@clarkb:matrix.orgso I'm remembering that we pinned microk8s because they made a broken release for debian in 1.29. There is now a 1.30 stable release and several other updates to both 1.28 and 1.29 (most recently as of September 9)16:55
@clarkb:matrix.orgmaybe we try bumping the version to the 1.30 stable release and see if that is happier16:55
@jim:acmegating.comcan't hurt (maybe something in debian was updated that fixes the new stuff and breaks the old?)16:56
@jim:acmegating.com https://github.com/canonical/microk8s/issues/4361#issuecomment-2336768715 16:57
@jim:acmegating.commaybe 1.31? :)16:57
@jim:acmegating.comlet's just upload a bunch of changes16:57
@jim:acmegating.comtry 1.30 and 3116:58
@clarkb:matrix.orgI'll start with 3116:58
@clarkb:matrix.orgI didn't realize 31 was newest because the snap store sorts 30 at the top of the list16:58
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 929685: Update MicroK8s to 1.31 stable release https://review.opendev.org/c/zuul/nodepool/+/929685 17:00
@jim:acmegating.com ├─snap.microk8s.daemon-kubelite.service 17:03
├─snap.microk8s.daemon-apiserver-kicker.service
├─snap.microk8s.daemon-cluster-agent.service
├─snap.microk8s.daemon-containerd.service
├─snap.microk8s.daemon-k8s-dqlite.service
@jim:acmegating.comdo we want to grab journalctl logs from all those?17:03
@jim:acmegating.comnot sure if there's a better way to get logs17:03
@clarkb:matrix.org++ presumably there may also be a kubepods service?17:05
@jim:acmegating.comoh how about `microk8s inspect`17:05
@jim:acmegating.com https://microk8s.io/docs/troubleshooting 17:05
@clarkb:matrix.orgI ran that on this node it produces a tarball17:05
@clarkb:matrix.orgwe could just grab that and keep it simple17:05
@jim:acmegating.comme too :)17:06
@jim:acmegating.comi untarred it in /tmp17:06
@jim:acmegating.comseems to have those logs and more17:06
@clarkb:matrix.orgthe tarball is only 1.6MB and the extra info might not be something you'd share from prod machines but should be fine from test nodes17:07
@clarkb:matrix.orgthe change to use 1.31/stable passed17:09
@clarkb:matrix.orgor the job passed, still waiting on the buildset for the change17:09
@jim:acmegating.comClark: i approved it; i think that's straightforward enough17:10
@clarkb:matrix.orgthanks17:11
@clarkb:matrix.orgcorvus: were you working on a change to collect a microk8s inspect tarball or should I do that? Should we just do it for every job success or failure?17:15
@jim:acmegating.comClark: i did not (though i did look to see that i think it probably wants to go in the collect-kubernetes-logs role); i think maybe all runs?17:17
@clarkb:matrix.orgI can't find evidence that inspect takes an argument to specify the tarball output. But we can parse the stdout to get the path17:17
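A rough sketch of the "parse the stdout to get the path" idea, assuming only that the command prints an absolute path ending in `.tar.gz` somewhere in its output. The exact wording of the `microk8s inspect` output is not shown in this log, so the regex is deliberately loose and the names `TARBALL_RE` and `find_tarball_path` are hypothetical.

```python
import re
from typing import Optional

# Matches an absolute path ending in .tar.gz anywhere in the text.
TARBALL_RE = re.compile(r"(/\S+\.tar\.gz)")


def find_tarball_path(stdout: str) -> Optional[str]:
    """Return the last .tar.gz path mentioned in the output, or None."""
    matches = TARBALL_RE.findall(stdout)
    return matches[-1] if matches else None
```

Returning `None` when nothing matches lets the caller skip log collection rather than fail the job if the output format changes.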
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Collection microk8s inspect info in k8s log collection role https://review.opendev.org/c/zuul/zuul-jobs/+/929689 17:30
@clarkb:matrix.orgsomething like that maybe? We don't appear to have triggered any of the ensure-k8s jobs though. Let me see if I can fix that17:31
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Collection microk8s inspect info in k8s log collection role https://review.opendev.org/c/zuul/zuul-jobs/+/929689 17:35
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 17:45
@clarkb:matrix.orgthe test jobs for this hit the same problem as nodepool so I've updated things to be a version bump with better testing and log collection17:46
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/nodepool] 929685: Update MicroK8s to 1.31 stable release https://review.opendev.org/c/zuul/nodepool/+/929685 17:49
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 17:56
@clarkb:matrix.orgI think ^ is working now. Unfortunately it triggers 168 ansible lint warnings. I have meetings I need to pay attention to and prep for for the next little while but I guess I can put a lint fixup change under ^18:05
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 18:15
@clarkb:matrix.orgif anyone is wondering ansible continues to fail to parse comments with unbalanced ' in them18:27
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 18:27
@clarkb:matrix.org https://review.opendev.org/c/zuul/nodepool/+/929573 does pass testing now that we bumped the microk8s version. It has a couple of +2s should I go ahead and approve it or did anyone else want to review the hack first? 19:54
@jim:acmegating.comdone19:57
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/nodepool] 929573: Set git repo ownership for nodepool dib integration testing https://review.opendev.org/c/zuul/nodepool/+/929573 20:45
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 21:50
@clarkb:matrix.orgso there were 167 warnings and 1 failure. The failure is that I didn't set pipefail on the script in the shell task. Previously I had a shell comment with a single `'` in it that exploded ansible and that's fine. But if you don't put pipefail in your script that is an error...21:51
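For context on why a missing pipefail gets flagged: without it, a shell pipeline reports only the exit status of its last command, so an earlier failure is hidden. The sketch below demonstrates that with `bash`, assuming `bash` is on the PATH. The helper `pipeline_exit_code` is hypothetical and the two one-line scripts are illustrative, not taken from the change under review.

```python
import subprocess


def pipeline_exit_code(script: str) -> int:
    """Run script under bash and return its exit status."""
    return subprocess.run(["bash", "-c", script], check=False).returncode


# `false` fails, but `cat` succeeds, so the pipeline as a whole reports success.
without_pipefail = pipeline_exit_code("false | cat")

# With pipefail set, the failure of `false` becomes the pipeline's status.
with_pipefail = pipeline_exit_code("set -o pipefail; false | cat")
```

Here `without_pipefail` is zero and `with_pipefail` is non-zero, which is the difference the lint check is guarding against in shell tasks that use pipes.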
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-jobs] 929689: Bump the default ensure-kubernetes microk8s version to 1.31/stable https://review.opendev.org/c/zuul/zuul-jobs/+/929689 22:05
@clarkb:matrix.orgI think ^ is ready for review now22:27
