@iwienand:matrix.org | https://github.com/quay/quay/blob/f774e4c6b6c674c822057a1d3dd49a3d5c36e6ca/endpoints/v2/manifest.py looks generally like the quay manifest handling | 00:11 |
---|---|---|
@iwienand:matrix.org | urgh, there's so many layers, and it abstracts everything as it seems to handle about 8 different formats from docker/OCI etc | 00:23 |
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-registry] Reject mismatched layer sizes, with some exceptions https://review.opendev.org/c/zuul/zuul-registry/+/807232 | 04:23 | |
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/zuul] Update tenant quota spec config example https://review.opendev.org/c/zuul/zuul/+/800766 | 08:38 | |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] Fix unstable cross-scheduler config priming test https://review.opendev.org/c/zuul/zuul/+/807272 | 08:44 | |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] [GUI] Buildset: add "show retries" toggles https://review.opendev.org/c/zuul/zuul/+/806201 | 12:18 | |
@erbarr:matrix.org | hello, i was wondering how I need to setup my connections and tenants so that I can inherit the ironic-base job. In my tenant if I list opendev.org/openstack/ironic to match the required project, I get an error in the scheduler and I don't get any jobs. https://opendev.org/openstack/ironic/src/branch/master/zuul.d/ironic-jobs.yaml#L10 | 14:32 |
@fungicide:matrix.org | your tenant needs a connection for opendev.org in that case (though it can probably be a git connection if you don't want to set up an account for the gerrit connection driver) | 14:37 |
@gtema:matrix.org | ``` | 14:38 |
[connection "opendev"] | ||
name=opendev | ||
driver=git | ||
baseurl=https://opendev.org | ||
``` | ||
@gtema:matrix.org | that's what works for us erbarr | 14:38 |
@fungicide:matrix.org | and then put the opendev source in your tenant config listing opendev.org/openstack/ironic as an untrusted project | 14:39 |
@fungicide:matrix.org | well, in the tenant config it doesn't have to be a fully qualified project name since the domain part is taken from the connection details, so can just be openstack/ironic as the untrusted project entry there | 14:42 |
@fungicide:matrix.org | you might also consider limiting that untrusted project entry with `include: [job]` so that it doesn't pull in things like the project definition | 14:52 |
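A minimal sketch of such a tenant entry, assuming the connection above is named "opendev" and the account-based Gerrit connection is named "gerrit" (the tenant and project names are illustrative, not a real config):

```yaml
- tenant:
    name: example-tenant
    source:
      gerrit:
        untrusted-projects:
          - my-org/my-project
      opendev:
        untrusted-projects:
          # The short name is enough; the host comes from the connection.
          # Limiting to jobs avoids pulling in e.g. the project definition.
          - openstack/ironic:
              include: [job]
```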
@fungicide:matrix.org | we've also got a bunch of real-world examples of tenant configs we use in opendev: https://opendev.org/openstack/project-config/src/branch/master/zuul/main.yaml | 14:54 |
@erbarr:matrix.org | i have that entry of opendev and I have gerrit, I can trigger from project changes, it adds comments to review, but when I add opendev.org/openstack/ironic to the tenant, cloning the project fails by trying a url https://opendev.org/opendev.org/openstack/ironic | 15:04 |
@erbarr:matrix.org | openstack/ironic works though, but fails to load jobs that require opendev.org/openstack/ironic or others that start with opendev.org | 15:05
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul] REST API: Implement nodes filtering https://review.opendev.org/c/zuul/zuul/+/736042 | 15:08 | |
@clarkb:matrix.org | corvus: ianw I wrote a bit of a novel on https://review.opendev.org/c/zuul/zuul-registry/+/807232 to try and convince myself there isn't a race around the strict verification blob cleanup. If you can double check that I would appreciate it. | 15:18
@jim:acmegating.com | Clark, ianw: what's the end goal here? if the registry rejects the upload, how does that improve the situation? | 15:20 |
@clarkb:matrix.org | corvus: it has a fixup case specifically for the docker pushes and would only reject other clients pushing bad images | 15:21 |
@clarkb:matrix.org | I think the idea is we can special case buggy clients but if they are not special cased we error | 15:21 |
@jim:acmegating.com | ah | 15:21 |
@clarkb:matrix.org | I think the change is ok if we're happy that the existing lock check and bail out is sufficient to avoid the race I describe in my comment | 15:29 |
@jim:acmegating.com | Clark: replied to your last paragraph | 15:31
@jim:acmegating.com | why would we delete anything? | 15:32 |
@clarkb:matrix.org | corvus: the line I've commented on deletes the blobs if manifest size verification fails. I think the idea is that image isn't valid at all and we shouldn't store it | 15:34 |
@jim:acmegating.com | Clark: i left a -1 to that effect. | 15:34 |
@jim:acmegating.com | correct, the image manifest is wrong. but the layer is correct. no need to delete it. | 15:34 |
@clarkb:matrix.org | Wouldn't that be an orphaned layer then? | 15:35 |
@jim:acmegating.com | it could be referenced by a different manifest that is correct | 15:35 |
@jim:acmegating.com | yes, it would. there's a cleanup process for that. | 15:35 |
@jim:acmegating.com | i don't think we're running it, but it's a necessary part of operations anyway, and we should rely on it. | 15:35 |
@clarkb:matrix.org | I see we dedup the layers across images. | 15:35 |
@clarkb:matrix.org | and deleting here would affect any other images with that layer | 15:35 |
@jim:acmegating.com | yes -- though it's not like we're doing anything special to de-dupe the layers, that's just inherent to a CAS | 15:36 |
@clarkb:matrix.org | yup | 15:36 |
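(Illustrative only, not zuul-registry's actual storage code: in a content-addressable store the blob's location is derived from the digest of its bytes, so two manifests pushing the same layer end up referencing a single stored object; no separate de-duplication step is involved.)

```python
import hashlib
import os

def store_blob(root, data):
    # The blob is named by its own sha256 digest, so identical layers
    # pushed for different images map to the same path automatically.
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    path = os.path.join(root, "blobs", digest)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return digest
```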
@clarkb:matrix.org | Should I push an updated patchset so we can keep pushing this along? ianw is weekending at this point so unlikely to do that for a couple of days | 15:41 |
@jim:acmegating.com | Clark: i feel like it's a substantial enough change it would be best if he could weigh in on that, unless it's urgent? | 15:43 |
@clarkb:matrix.org | I guess it isn't super urgent, the only place we've observed it breaking is with buildkit and not much is using that yet? | 15:43 |
@jim:acmegating.com | Clark: (if it's urgent, sure and we can make sure he knows to push a patch to add it back if he wants to continue discussion.) | 15:43 |
@jim:acmegating.com | what is using buildkit now? | 15:44 |
@clarkb:matrix.org | With zuul-registry I think it's all still just attempts at this point: the zuul things mordred was looking at and the opendev assets image that started ianw down this debugging path | 15:44
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] Fix unstable cross-scheduler config priming test https://review.opendev.org/c/zuul/zuul/+/807272 | 15:45 | |
@clarkb:matrix.org | Basically work in progress and not production usage yet | 15:45 |
@jim:acmegating.com | Clark: a middle ground would be to push a patchset, both of us leave +2s on it, and ianw can approve on what we call "sunday" if he likes it :) | 15:46 |
@clarkb:matrix.org | that seems like a reasonable compromise. I'll do that | 15:47 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] Remove nodeset from NodeRequest https://review.opendev.org/c/zuul/zuul/+/806063 | 15:48 | |
-@gerrit:opendev.org- Clark Boylan proposed on behalf of Ian Wienand: [zuul/zuul-registry] Reject mismatched layer sizes, with some exceptions https://review.opendev.org/c/zuul/zuul-registry/+/807232 | 15:51 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/807221 | 16:23 | |
@jim:acmegating.com | Clark: ^ | 16:23 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] Add user_data field to Node https://review.opendev.org/c/zuul/nodepool/+/807362 | 16:30 | |
@mordred:inaugust.com | corvus: nodepool and python-base/builder use buildkit | 16:39 |
@jim:acmegating.com | mordred: are we failing builds on those due to this error now? | 16:44 |
@jim:acmegating.com | i guess because of the way this manifests, we can build exactly 1 thing with buildkit, but anything that depends on it fails? | 16:46 |
@jim:acmegating.com | but docker doesn't fail pulling... only podman? | 16:46 |
@mordred:inaugust.com | We're failing builds on ianw's patches to update to bullseye. I don't think it's URGENT | 16:46
@jim:acmegating.com | why would just those fail? | 16:47 |
-@gerrit:opendev.org- Atit Patel proposed: [zuul/zuul] Correcting timeout logic to fix retry errors https://review.opendev.org/c/zuul/zuul/+/807364 | 16:48 | |
@mordred:inaugust.com | https://review.opendev.org/c/zuul/nodepool/+/806312/11 for instance | 16:50 |
@mordred:inaugust.com | I'm not sure? | 16:50 |
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/nodepool] Add Tenant-Scoped Resource Quota https://review.opendev.org/c/zuul/nodepool/+/800765 | 16:51 | |
@jim:acmegating.com | mordred: was -buster built without buildkit, and bullseye with? | 16:55
@jim:acmegating.com | i think that would explain it; basically we have a poisoned bullseye image, so anything that tries to use that will fail | 16:56 |
@jim:acmegating.com | Clark: ^? | 16:56 |
@clarkb:matrix.org | I'm not sure | 16:57 |
@clarkb:matrix.org | I thought they were built the same way | 16:57 |
@clarkb:matrix.org | https://review.opendev.org/c/zuul/nodepool/+/806312/11 specifically is failing on a dib thing that is being fixed iirc | 16:58 |
@clarkb:matrix.org | https://review.opendev.org/c/openstack/diskimage-builder/+/806318 is the depends-on change, but it needs to land and then get a release that 806312 can set in requirements.txt | 16:58
@jim:acmegating.com | Clark, mordred: ah, maybe that job specifically uses podman? | 17:00 |
@jim:acmegating.com | oh but that job is still passing on buster :/ | 17:01 |
@jim:acmegating.com | very weird | 17:02 |
-@gerrit:opendev.org- Atit Patel proposed: [zuul/zuul] Changes in time out logic, addition of retry to handle slower API responses https://review.opendev.org/c/zuul/zuul/+/805396 | 17:10 | |
-@gerrit:opendev.org- Atit Patel proposed: [zuul/zuul] Correcting timeout logic to fix retry errors caused by incorrect implementation of an object `retries` https://review.opendev.org/c/zuul/zuul/+/807364 | 17:12 | |
@mordred:inaugust.com | corvus: Iirc, at least some of the ianw changes explicitly enabled buildkit... But I'm not at a computer presently | 17:13 |
@clarkb:matrix.org | mordred: I don't think those have landed yet due to these problems we are having | 17:17 |
@clarkb:matrix.org | corvus: the logs for the failed test in https://zuul.opendev.org/t/zuul/build/53db0725e0ee4d9ab8b998b5b5b922f0 show that the _retry() function is hitting an exception. Specifically a no node error which makes me wonder if something is causing the build request to go away between the update and the put? | 21:08 |
@jim:acmegating.com | Clark: neat, i didn't hit that when running locally (which is why i didn't catch the log bug) | 21:20 |
@jim:acmegating.com | Clark: that test removes a tenant, which recursively deletes the tenant info in zk including the pipeline result queues | 21:24 |
@jim:acmegating.com | Clark: i think maybe we have to handle NoNodeError on the put and pass it | 21:25 |
@clarkb:matrix.org | aha that would cause a no node error | 21:25 |
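A minimal sketch of the shape of that fix, assuming a kazoo client; the helper name and path handling are illustrative, not Zuul's actual API:

```python
from kazoo.exceptions import NoNodeError

def put_build_request(zk_client, path, data):
    """Write the build request back, tolerating a vanished znode."""
    try:
        zk_client.set(path, data)
    except NoNodeError:
        # The tenant (and its pipeline result queues) may have been
        # recursively deleted between reading the request and writing
        # it back; there is nothing left to update, so just move on.
        pass
```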
@jim:acmegating.com | i'll fix that up when i finish up the change i'm writing | 21:26 |
@clarkb:matrix.org | I'm not sure I see the log bug | 21:27 |
@jim:acmegating.com | oh sorry, it's what i fixed between ps1 and 2 | 21:29 |
@jim:acmegating.com | flake8 caught it | 21:29 |
@clarkb:matrix.org | oh got it | 21:31 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 21:36 | |
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280 | ||
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639 | ||
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653 | ||
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814 | ||
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816 | ||
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821 | ||
- [zuul/zuul] Remove internal nodepool request cache https://review.opendev.org/c/zuul/zuul/+/807196 | ||
- [zuul/zuul] Report nodepool resource stats gauges in scheduler https://review.opendev.org/c/zuul/zuul/+/807406 | ||
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/807221 | 21:40 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] Try harder to unlock failed build requests https://review.opendev.org/c/zuul/zuul/+/807221 | 21:41 | |
@jim:acmegating.com | Clark: ^ maybe something like that | 21:41 |
@clarkb:matrix.org | the double pass of log is fun | 21:43 |
@jim:acmegating.com | everyone needs a log! log, log, log! | 21:49 |
@clarkb:matrix.org | corvus: does https://review.opendev.org/c/zuul/zuul/+/806280/ impact our ability to freely restart nodepool launchers? I don't think so as it wasn't creating the node requests but now I'm wondering how that works today. Does it just pick up the requests and start booting new instances when it comes back? or does it go back to the queue? | 22:53 |
@jim:acmegating.com | Clark: what's "it"? a launcher? | 22:54 |
@jim:acmegating.com | Clark: a launcher basically just walks the request queue repeatedly; that's about it. | 22:55 |
@jim:acmegating.com | to directly answer your first question: nodepool won't perceive a difference. | 22:55 |
@clarkb:matrix.org | it being the node request. If a launcher is handling a node request and we restart the launcher what happens? I guess the ephemeral property wasn't important there because the executor handled that | 22:55
@jim:acmegating.com | ah yep. that's correct. what happens (before and after that change) is the node request gets unlocked (because the launcher locks it) and another launcher (possibly the restarted one, possibly a different one) locks it and starts over | 22:57 |
@clarkb:matrix.org | aha the lock remains ephemeral but not the request itself | 22:57 |
@clarkb:matrix.org | that makes sense | 22:57
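A minimal sketch of that split using kazoo directly (the znode paths are assumptions for illustration): the request itself is a persistent znode, while the launcher's lock is an ephemeral node tied to the launcher's ZooKeeper session, so a restart releases the lock and another launcher can pick the request up and start over.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk01.example.com:2181")
zk.start()

# Persistent node: the request survives launcher restarts.
zk.create("/nodepool/requests/100-0000000001", b"{...}", makepath=True)

# Ephemeral lock node: released automatically when the launcher's session
# ends, letting another launcher lock the request and handle it.
lock = zk.Lock("/nodepool/requests-lock/100-0000000001", identifier="launcher-1")
with lock:
    pass  # handle the request here
```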
@clarkb:matrix.org | corvus: does the cleanup there need to somehow signal to a launcher that a request is no longer needed and unlock it? My concern there is the leaked noderequests may still be fulfilled by nodepool | 22:58 |
@clarkb:matrix.org | but then zuul will do nothing with them? | 22:59 |
@clarkb:matrix.org | Then once the node request goes away the nodes can be reused? At worst we're delayed an hour? | 23:00 |
@clarkb:matrix.org | The code itself seems fine and does what it says on the tin. I just worry about ^ at this point | 23:01 |
@jim:acmegating.com | Clark: the cleanup is a failsafe, it shouldn't be necessary | 23:03 |
@jim:acmegating.com | Clark: but if something does slip through, then the cleanup will delete the node request, and as soon as a launcher sees a 'ready' node allocated to a node request that doesn't exist anymore, it will recycle it. | 23:03 |
@clarkb:matrix.org | yup the launchers handle that already for cancelled jobs iirc. | 23:04 |
@jim:acmegating.com | all the typical cases, like zuul canceling a build before it starts, etc, should immediately delete the node request. the cleanup is more like handling, i dunno, the zk server disappearing right as a cancel request happens :) | 23:04 |
@clarkb:matrix.org | Thinking out loud here I wonder if it might be a good idea to have a small utility that skims zuul logs for things like this where we've tripped failsafes and don't expect to | 23:04 |
@clarkb:matrix.org | For nodepool restarting the NodeWorker thread would be something to look for too | 23:05 |
@jim:acmegating.com | we could -- we'd have to examine each one and see if it's something we could actually prevent though. | 23:06 |
@jim:acmegating.com | node worker threads are expected to restart occasionally, that shouldn't be an error | 23:06 |
@jim:acmegating.com | basically each time you change the configuration, you get a new node worker | 23:07 |
@jim:acmegating.com | (while the old one gracefully stops) | 23:07 |
@clarkb:matrix.org | corvus: that was one of the things I was looking at with the debugging yesterday and I think that isn't true. Only the provider manager is restarted; the NodeWorker typically doesn't restart | 23:07
@clarkb:matrix.org | it's a little confusing as there are a lot of threads. The driver-specific thread does restart when the config changes, but the NodeWorker thread sits above the driver-specific thread; it only starts when the provider is added and restarts if the previous thread died | 23:08
@jim:acmegating.com | Clark: okay let's get specific -- exactly what class are you talking about? | 23:08 |
@jim:acmegating.com | cause there isn't anything called a NodeWorker in nodepool? | 23:09 |
@clarkb:matrix.org | https://opendev.org/zuul/nodepool/src/commit/433e4f33d187496a61a333a6f976572725ead4a2/nodepool/launcher.py#L1077-L1096 | 23:09 |
@clarkb:matrix.org | Sorry it is a PoolWorker | 23:10 |
@clarkb:matrix.org | We run a top level Nodepool thread which runs a PoolWorker thread for each pool regardless of driver specifics. That PoolWorker thread interacts with driver specific threads which do get restarted | 23:11 |
@jim:acmegating.com | okay. so what's the error you want to scan for? | 23:11 |
@clarkb:matrix.org | Specifically https://opendev.org/zuul/nodepool/src/commit/433e4f33d187496a61a333a6f976572725ead4a2/nodepool/launcher.py#L1092 would indicate an unexpected thread death | 23:12 |
@clarkb:matrix.org | which could indicate uncaught exceptions or perhaps some sort of OOM if OOMKiller can kill specific lightweight threads and not the parent process (that is an interesting question I don't know the answer to but I suspect it only kills parent threads) | 23:12 |
@clarkb:matrix.org | it isn't a majorly urgent thing, but thinking out loud I thought maybe it could be useful to give hints for bad things to operators | 23:13 |
@jim:acmegating.com | yeah, my preference would be to make sure that the run method in that thread catches all exceptions and that code isn't necessary. then exceptions just get logged as exceptions. | 23:13 |
@jim:acmegating.com | that code is pretty non-idiomatic for zuul | 23:14 |
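A minimal sketch of the more idiomatic shape being suggested (not the actual PoolWorker code): the worker's own run loop catches and logs exceptions, so an external "restart dead threads" check isn't needed and failures surface as logged exceptions instead.

```python
import logging
import threading
import time

class PoolWorkerSketch(threading.Thread):
    log = logging.getLogger("nodepool.PoolWorkerSketch")

    def __init__(self, interval=5):
        super().__init__()
        self._interval = interval
        self._running = True

    def run(self):
        while self._running:
            try:
                self._runOnce()
            except Exception:
                # Log and keep looping rather than letting the thread die
                # and relying on a watchdog to notice and restart it.
                self.log.exception("Error in pool worker loop:")
            time.sleep(self._interval)

    def _runOnce(self):
        pass  # process node requests for this pool

    def stop(self):
        self._running = False
```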
@clarkb:matrix.org | > <@jim:acmegating.com> yeah, my preference would be to make sure that the run method in that thread catches all exceptions and that code isn't necessary. then exceptions just get logged as exceptions. | 23:14 |
Yup, that is what my earlier change to address the issue does: it catches exceptions in more places. | ||
@jim:acmegating.com | anyway, then the "look for bad stuff" operation becomes "look for log messages with severity ERROR" :) | 23:14 |
@clarkb:matrix.org | fair enough | 23:15 |
@clarkb:matrix.org | looks like linters failed on the node request change | 23:15 |
@jim:acmegating.com | though maybe it's worth thinking about whether we want those to be warning or error. might be worth getting specific about that. | 23:15 |
@clarkb:matrix.org | I'm going to +2 it now to remind myself that I liked the current state before it got updated | 23:15 |
@jim:acmegating.com | maybe error should be "zuul can not fix this" vs warning meaning "zuul noted an anomaly which it corrected" | 23:16 |
@clarkb:matrix.org | I like that. | 23:16 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 23:17 | |
- [zuul/zuul] Make node requests persistent https://review.opendev.org/c/zuul/zuul/+/806280 | ||
- [zuul/zuul] Add node request cache to zk nodepool interface https://review.opendev.org/c/zuul/zuul/+/806639 | ||
- [zuul/zuul] Wrap nodepool request completed events with election https://review.opendev.org/c/zuul/zuul/+/806653 | ||
- [zuul/zuul] Remove unecessary node request cancelation code https://review.opendev.org/c/zuul/zuul/+/806814 | ||
- [zuul/zuul] Refactor the checkNodeRequest method https://review.opendev.org/c/zuul/zuul/+/806816 | ||
- [zuul/zuul] Don't store node requests/nodesets on queue items https://review.opendev.org/c/zuul/zuul/+/806821 | ||
- [zuul/zuul] Remove internal nodepool request cache https://review.opendev.org/c/zuul/zuul/+/807196 | ||
- [zuul/zuul] Report nodepool resource stats gauges in scheduler https://review.opendev.org/c/zuul/zuul/+/807406 | ||
@jim:acmegating.com | Clark: ^ fixed the lint error | 23:18 |