-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/nodepool] 884234: Double check node request during node cleanup https://review.opendev.org/c/zuul/nodepool/+/884234 | 06:11 | |
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/nodepool] 884241: Double check node allocation in a few more places https://review.opendev.org/c/zuul/nodepool/+/884241 | 06:14 | |
@jjbeckman:matrix.org | Ah, I missed your comment. Thank you for the nudge in the right direction. | 08:42 |
I've set the following in `main.yaml` | ||
``` | ||
- authorization-rule: | ||
name: rule-name | ||
conditions: | ||
- groups: {redacted} | ||
- api-root: | ||
authentication-realm: {redacted}.com | ||
access-rules: rule-name | ||
- tenant: | ||
name: example-tenant | ||
admin-rules: | ||
- rule-name | ||
``` | ||
As a result, when accessing the root URL, after authentication, I get stuck with the message "Login in progress You will be redirected shortly...". | ||
I assume something is missing from my configuration? | ||
@jjbeckman:matrix.org | I've been able to get "authentication + getting privileged actions working in the Web UI". Thanks so much for your advice :D | 08:43 |
@mhuin:matrix.org | 🎉 | 08:43 |
@flaper87:matrix.org | 👋 Is there a way to re-trigger a post job from the UI? I'm looking for a way to retrigger failed jobs manually (periodic jobs, jobs in the post pipeline, etc) | 08:57 |
@mhuin:matrix.org | > <@flaper87:matrix.org> 👋 Is there a way to re-trigger a post job from the UI? I'm looking for a way to retrigger failed jobs manually (periodic jobs, jobs in the post pipeline, etc) | 08:58 |
if you are authenticated and your token matches the admin rules set for your tenant, you can re-enqueue from the buildset's summary page | ||
@flaper87:matrix.org | mhu: a-ha, thanks. Then I think my problem is the admin-rules because I don't see it anywhere. I'm using an OAuth proxy in front of Zuul rather than the Zuul oauth support for now. I'll probably switch soon. That said, do you have an example of how to make all users admin? This is a small org and we don't really need to have multiple roles just yet | 09:02 |
@flaper87:matrix.org | "Make authenticated users admin" should be enough | 09:04 |
@flaper87:matrix.org | * "Make authenticated users admin" should be enough, looking at this: https://zuul-ci.org/docs/zuul/latest/developer/specs/tenant-scoped-admin-web-API.html#access-control-configuration | 09:04 |
@mhuin:matrix.org | I'm not sure an oauth proxy would work for authentication in the web UI. IIRC the openid lib used in the UI expects to fetch an auth token | 09:04 |
@mhuin:matrix.org | I think you could write a trivial rule, like some condition about the issuer or the audience since this should be the same for every user and under your control | 09:05 |
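A minimal sketch of such a rule for `main.yaml`, assuming Google as the issuer (the `iss` value and rule name are illustrative; check the claims in one of your actual tokens first):
```
- authorization-rule:
    name: all-authenticated-admin
    conditions:
      # any token whose issuer claim matches is granted admin
      - iss: https://accounts.google.com
- tenant:
    name: example-tenant
    admin-rules:
      - all-authenticated-admin
```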
@flaper87:matrix.org | mhu: thanks! Do you know if there's an easy way to test authorization rules without fully deploying them? | 09:15 |
@mhuin:matrix.org | No but that could be a good feature to add to the zuul-admin CLI | 09:16 |
@flaper87:matrix.org | I've enabled oauth with Google but it's definitely not matching my rule | 09:16 |
@mhuin:matrix.org | what I do is grab a JWT - you can do that on the user page once you're authenticated on zuul - and analyze it in the debugger at https://jwt.io | 09:16 |
@flaper87:matrix.org | ah, thanks! That's useful. I also just found out that Zuul's web client stores a `zuul_auth_param` cookie in the session storage | 09:20 |
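For inspecting a token locally rather than pasting it into jwt.io, a stdlib-only Python sketch (pass the JWT grabbed from the user page as the first argument):
```
# decode a JWT payload to see its claims; no signature verification
import base64, json, sys

token = sys.argv[1]
payload = token.split(".")[1]           # JWTs are header.payload.signature
payload += "=" * (-len(payload) % 4)    # restore stripped base64 padding
print(json.dumps(json.loads(base64.urlsafe_b64decode(payload)), indent=2))
```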
@woju:invisiblethingslab.com | Hi, I'm trying to deploy zuul and I'm hitting this error when starting the scheduler: | 13:12 |
``` | ||
2023-05-25 12:05:14,846 ERROR zuul.Scheduler: Error starting Zuul: | ||
Traceback (most recent call last): | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/cmd/scheduler.py", line 102, in run | ||
self.sched.prime(self.config) | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/scheduler.py", line 1054, in prime | ||
tenant = loader.loadTenant( | ||
^^^^^^^^^^^^^^^^^^ | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/configloader.py", line 2853, in loadTenant | ||
new_tenant = self.tenant_parser.fromYaml( | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/configloader.py", line 1832, in fromYaml | ||
self._cacheTenantYAML(abide, tenant, loading_errors, min_ltimes, | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/configloader.py", line 2152, in _cacheTenantYAML | ||
future.result() | ||
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result | ||
return self.__get_result() | ||
^^^^^^^^^^^^^^^^^^^ | ||
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result | ||
raise self._exception | ||
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run | ||
result = self.fn(*self.args, **self.kwargs) | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/configloader.py", line 2211, in _cacheTenantYAMLBranch | ||
self._updateUnparsedBranchCache( | ||
File "/opt/zuul/lib/python3.11/site-packages/zuul/configloader.py", line 2325, in _updateUnparsedBranchCache | ||
min_ltimes[source_context.project_canonical_name][ | ||
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
KeyError: 'localhost/gramine-zuul-config' | ||
2023-05-25 12:05:14,848 DEBUG zuul.Scheduler: Stopping scheduler | ||
``` | ||
(Environment: Debian testing, python 3.11, virtualenv with latest version of zuul from pip as of about a week ago, systemd with custom units) | ||
@woju:invisiblethingslab.com | What did I do wrong, and where should I look for problems? | 13:12 |
@woju:invisiblethingslab.com | I've tried reading through the `_cacheTenantYAML` function, but I don't understand much | 13:12 |
-@gerrit:opendev.org- Matthieu Huin https://matrix.to/#/@mhuin:matrix.org proposed: [zuul/zuul-operator] 884393: Update zuul-ci, opendevorg containers to fetch from quay.io https://review.opendev.org/c/zuul/zuul-operator/+/884393 | 13:13 | |
@woju:invisiblethingslab.com | also there's a similar traceback in the log about another project, same function and line | 13:14 |
@fungicide:matrix.org | woju: is that project included in the tenant config? | 13:21 |
@woju:invisiblethingslab.com | yes, it is | 13:22 |
@woju:invisiblethingslab.com | let me paste | 13:22 |
@woju:invisiblethingslab.com | ``` | 13:22 |
- tenant: | ||
name: gramine | ||
source: | ||
local: | ||
config-projects: | ||
- gramine-zuul-config | ||
- zuul-base-jobs | ||
- zuul-jobs | ||
untrusted-projects: | ||
- gramine | ||
``` | ||
@woju:invisiblethingslab.com | "local" is `file:///srv/local` and those repositories are cloned there, though some have only stub `.zuul.yaml` | 13:23 |
@woju:invisiblethingslab.com | there's a single pipeline `check` and an example job in the `gramine*` repos | 13:23 |
@fungicide:matrix.org | `local` is using the git source connection driver? | 13:25 |
@woju:invisiblethingslab.com | ``` | 13:26 |
[connection local] | ||
driver=git | ||
baseurl=file:///srv/git | ||
poll_delay=10 | ||
``` | ||
@woju:invisiblethingslab.com | sorry I'm not pasting the whole file, but it's full of secrets :) | 13:26 |
@fungicide:matrix.org | sure, just making sure i understand how it's trying to access those repositories | 13:27 |
@fungicide:matrix.org | i'm a little out of my depth on this question, having only used the git driver for trivial cases (it doesn't really have usable triggers besides ref-updated, and isn't appropriate for gating proposed changes since there's no code review workflow visible to zuul). hopefully someone else can either spot the configuration mistake or knows whether this is a limitation of the git driver | 13:31 |
@woju:invisiblethingslab.com | fyi those keys really are missing from `min_ltimes`; I quickly added a `self.log.debug` and got: | 13:33 |
``` | ||
2023-05-25 13:31:29,542 DEBUG zuul.TenantParser: _cacheTenantYAMLBranch(..., min_ltimes={'localhost/zuul-base-jobs': {'master': 132}, 'localhost/zuul-jobs': {'master': 133}}) | ||
``` | ||
@woju:invisiblethingslab.com | eventually I'll move untrusted repo(s) to github, but right now I wanted to just get something running | 13:34 |
@fungicide:matrix.org | yeah, it's possible this is a corner case where there's a missing implementation and we don't have test coverage; i'm trying to follow the source and see if i can understand how that could happen | 13:38 |
@woju:invisiblethingslab.com | I've followed this up to `zk/layout.py:LayoutStateStore.getMinLtimes` and it looks like this is data from zookeeper: | 13:48 |
@woju:invisiblethingslab.com | ``` | 13:48 |
2023-05-25 13:46:07,084 DEBUG zuul.LayoutStore: getMinLtimes(layout_state=<LayoutState gramine: ltime=252, hostname=ip-172-26-9-217.eu-central-1.compute.internal, last_reconfigured=1684776455>) path='/zuul/layout-data/c20051a77d95465b920531cde9bf2191' | ||
2023-05-25 13:46:07,085 DEBUG zuul.LayoutStore: data=b'{"min_ltimes": {"localhost/zuul-base-jobs": {"master": 132}, "localhost/zuul-jobs": {"master": 133}}}' | ||
2023-05-25 13:46:07,086 DEBUG zuul.ConfigLoader: loadTenant(..., tenant_name='gramine', min_ltimes={'localhost/zuul-base-jobs': {'master': 132}, 'localhost/zuul-jobs': {'master': 133}}) | ||
2023-05-25 13:46:07,118 DEBUG zuul.TenantParser: _cacheTenantBranch(..., min_ltimes={'localhost/zuul-base-jobs': {'master': 132}, 'localhost/zuul-jobs': {'master': 133}}) | ||
2023-05-25 13:46:07,118 DEBUG zuul.TenantParser: _cacheTenantYAMLBranch(..., min_ltimes={'localhost/zuul-base-jobs': {'master': 132}, 'localhost/zuul-jobs': {'master': 133}}) | ||
``` | ||
@woju:invisiblethingslab.com | I have no idea what to do now | 13:49 |
@fungicide:matrix.org | looks like this may have changed a couple of months back in https://review.opendev.org/835100 which introduced min_ltimes, though that didn't touch the drivers so i suspect it's not a driver-specific behavior (unless it's relying on aspects of the other drivers which the git driver lacks) | 13:51 |
@woju:invisiblethingslab.com | thanks for the link, I'll read what I can | 13:53 |
@jim:acmegating.com | woju: assuming that your system is dead in the water and you can handle losing the running state information, you might consider running this command to make sure that it's not some kind of data corruption: https://zuul-ci.org/docs/zuul/latest/client.html#delete-state | 13:53 |
@woju:invisiblethingslab.com | yeah, it's pretty much empty | 13:53 |
@jim:acmegating.com | that's the big red "reset everything" button | 13:53 |
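For reference, the button behind that doc link is a single command (run it with all Zuul components stopped; the `zuul-admin` entry point is the Zuul 8.x name for the client, older releases used `zuul`):
```
zuul-admin delete-state
```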
@fungicide:matrix.org | also i suppose if /zuul/layout-data isn't writeable by the service that could cause cache problems | 13:53 |
@fungicide:matrix.org | mmm, that's the zk path though not a disk path | 13:54 |
@fungicide:matrix.org | and presumably zk is working for other aspects of the system | 13:54 |
@woju:invisiblethingslab.com | yes, this is `self.log.debug(f'{path=}')`, a local variable from this function | 13:55 |
@woju:invisiblethingslab.com | > <@jim:acmegating.com> woju: assuming that your system is dead in the water and you can handle losing the running state information, you might consider running this command to make sure that it's not some kind of data corruption: https://zuul-ci.org/docs/zuul/latest/client.html#delete-state | 13:58 |
this has helped, thank you! | ||
@fungicide:matrix.org | zuul equivalent of "turn it off and on again" ;) | 14:00 |
@fungicide:matrix.org | woju: so it's getting further after clearing the state? | 14:01 |
@woju:invisiblethingslab.com | it got `defaultdict`s as the min_ltimes= argument, and the service started (I still have this ad-hoc logging) | 14:07 |
@woju:invisiblethingslab.com | I'll continue setting it up, let's see how far I get | 14:07 |
@fungicide:matrix.org | i wonder if you could have been running a version of zuul from before the schema change and the migration didn't get applied to the existing data when updating to the newer version | 14:09 |
@woju:invisiblethingslab.com | probably no chance; I started deploying this installation two weeks ago and this change is from March? | 14:10 |
@fungicide:matrix.org | the change merged in march but may not have appeared in a release until much later | 14:11 |
@woju:invisiblethingslab.com | ah, yes, I didn't check this | 14:11 |
@fungicide:matrix.org | er, no, ignore me. i can't remember what year it is any more. the change that updated the schema for that was from march of last year, not this year | 14:12 |
@woju:invisiblethingslab.com | :) | 14:12 |
@fungicide:matrix.org | but as far as release timelines, there was a several-month gap between 8.2.0 (2023-02-21) and 8.3.0 (2023-05-16) so if you started a couple of weeks ago then something related may have changed in 8.3.0/8.3.1 which led to the existing data not being viable | 14:14 |
@woju:invisiblethingslab.com | let me check the versions | 14:15 |
@woju:invisiblethingslab.com | python's package metadata says 8.3.1 | 14:15 |
@woju:invisiblethingslab.com | I suspect I had something misconfigured and data in zookeeper got corrupted | 14:17 |
@fungicide:matrix.org | that seems increasingly likely. the last change to the zk model was in 8.2.0 and i don't see any warnings in the release notes about any additional steps needed on update either | 14:18 |
@tristanc_:matrix.org | corvus: Clark: looking at https://review.opendev.org/c/zuul/zuul-operator/+/881245, it seems like kubernetes is no longer pulling the image from the intermediate registry; is this a known issue with quay.io? | 15:39 |
@jim:acmegating.com | tristanC: it seems like the docker-based solutions for speculative execution don't work for images with quay.io. so we're moving all the other repos to podman to address that. i'm not sure exactly what is broken with that k8s setup, but i suspect we might need to switch to microk8s since that is the most recent thing that ianw got working with speculative registry jobs. | 15:44 |
@tristanc_:matrix.org | corvus: alright, I meant to try that, would you mind if I update your change? | 15:44 |
@jim:acmegating.com | tristanC: please do and thanks! | 15:45 |
@jim:acmegating.com | i've been focusing on the other repos. i think we're 99% there; the podman changes are ready for both zuul and nodepool, but i want to make sure i fully understand the implications of some cgroup stuff in https://review.opendev.org/883952 before we actually merge them. | 15:46 |
@jim:acmegating.com | tristanC: btw there should be some jobs in zuul-jobs using the microk8s roles for reference | 15:48 |
-@gerrit:opendev.org- Tristan Cacqueray proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 15:48 | |
@tristanc_:matrix.org | corvus: on 883952, it seems like the service is restarting because of `PermissionError: [Errno 13] Permission denied: '/opt/dib/images/builder_id.txt'` | 16:07 |
@jim:acmegating.com | tristanC: agreed, i have a held node and it looks like podman is not setting up the expected subuidmap, so i'm trying to work out the right set of arguments for that | 16:13 |
-@gerrit:opendev.org- Tristan Cacqueray proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 18:19 | |
@jim:acmegating.com | okay so the issue with 883952 is that we bind mount files owned by user 10001 into the container. that is the 'nodepool' user on the host, and also the 'nodepool' user in the container. we are running as the 'zuul' user on the host, and the zuul user on the host does not have permission to map host uid 10001 into the container (it's allowed uids 1000 (itself) and 100000-165535). so this is a design mismatch between rootless and root containers. this test mimics the opendev deployment style, where we have zuul and nodepool users on the host which are intended to match the in-container users of the same name and uid for our bind mounts. i think we can resolve it either by doing "sudo podman" so we don't end up with a userns and we mount everything in (so basically, run like docker), or by redesigning the test and opendev deployments for rootless use (which probably means chowning this directory on the host to be owned by "zuul", which we would then map to either the "nodepool" user in the container or the "root" user (and then run the nodepool process in the container as root, which isn't really root; it's the user that launched the container, i.e. zuul)) | 18:23 |
@jim:acmegating.com | Clark: fyi opendev podman thoughts ^ | 18:24 |
@clarkb:matrix.org | corvus: another option may be to exec podman as nodepool on the host? | 18:26 |
@clarkb:matrix.org | then its uids should all match up by default? | 18:26 |
@jim:acmegating.com | yes | 18:26 |
@clarkb:matrix.org | I think ansible has `become` tooling to do that | 18:27 |
@jim:acmegating.com | that actually makes the whole thing sound slightly less insane :) | 18:27 |
@jim:acmegating.com | (it complicates the test a little bit though) | 18:27 |
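If Clark's suggestion were wired into the test, the Ansible side might look roughly like this (a sketch; the image name and bind mount are placeholders, not the actual job's values):
```
# run podman as the nodepool user via ansible's privilege escalation
- name: Start nodepool-builder as the nodepool user
  become: true
  become_user: nodepool
  # placeholder image and mount, for illustration only
  ansible.builtin.command: podman run -d --name nodepool-builder -v /opt/dib:/opt/dib nodepool-builder-image
```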
@tristanc_:matrix.org | if the host uid matches the container uid, then are we using `--userns keep-id`? | 18:28 |
@jim:acmegating.com | well, we aren't now, but it might be an option if we go with clark's suggestion | 18:29 |
@jim:acmegating.com | (oh, another option would be to, on the host, allow the zuul user to use the nodepool uid; ie, set up /etc/subuid to allow that) | 18:32 |
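As a sketch, that `/etc/subuid` grant would be one line per file (the format is user:first-id:count; 10001 is the nodepool uid from the discussion above):
```
# /etc/subuid (and matching /etc/subgid): allow zuul to map host uid 10001
zuul:10001:1
```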
-@gerrit:opendev.org- Tristan Cacqueray proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 18:37 | |
@jim:acmegating.com | Clark: hrm, podman as the nodepool user isn't working out that well; if i su to nodepool, i'm not actually able to run podman, apparently due to login issues | 19:11 |
@jim:acmegating.com | for one, it keeps trying to do this: `mkdir /run/user/1000/libpod/tmp: permission denied` (nodepool is user 10001, zuul is user 1000) | 19:11 |
@mordred:inaugust.com | corvus: does it make any difference if you su - nodepool? | 19:12 |
-@gerrit:opendev.org- Tristan Cacqueray proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 19:12 | |
@jim:acmegating.com | mordred: yeah, i also did `sudo su -` then `su - nodepool` | 19:12 |
@mordred:inaugust.com | wow | 19:12 |
@jim:acmegating.com | nothing in env says 1000, so i dunno where it's getting that | 19:13 |
@mordred:inaugust.com | that seems like a bug in podman | 19:13 |
@mordred:inaugust.com | (which is a shame - because I really like the "run podman as the nodepool user" approach - as it models the intended reduced privileges really nicely) | 19:14 |
@jim:acmegating.com | yeah it sounds like the shortest way out of this maze... | 19:14 |
@clarkb:matrix.org | It really wants that systemd session stuff | 19:17 |
@clarkb:matrix.org | You probably need to login in a way that triggers systemd session stuff. I don't know how to do that | 19:17 |
@mordred:inaugust.com | what's the command that sounds like systemd but is less systemd? systemd-launch or systemd-runcommand or something like that? | 19:19 |
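The command being reached for is probably `systemd-run`; for rootless podman, though, the usual missing piece is a user session. A hedged sketch (uids as mentioned above: zuul is 1000, nodepool is 10001):
```
# enable lingering so nodepool gets its own /run/user/10001 without an
# interactive login, then point podman at it instead of the inherited
# /run/user/1000 that su carries over
sudo loginctl enable-linger nodepool
sudo -u nodepool env XDG_RUNTIME_DIR=/run/user/10001 podman info
```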
@jim:acmegating.com | Clark: mordred tristanC i'm about out of ideas. this issue is pretty easy to replicate and test locally, but for now, i think we're going to have to leave the test running as root (with "sudo podman") and likewise opendev will probably have to run podman containers as root. | 19:27 |
@jim:acmegating.com | if we're okay accepting that, then i think we can proceed. if we aren't okay with that, then i think someone needs to figure out an alternative, or we roll back to dockerhub. | 19:28 |
@tristanc_:matrix.org | corvus: would it be possible to mount podman volumes instead of an existing host directory? I think doing that should handle uid mapping transparently. | 19:29 |
@clarkb:matrix.org | > <@jim:acmegating.com> if we're okay accepting that, then i think we can proceed. if we aren't okay with that, then i think someone needs to figure out an alternative, or we roll back to dockerhub. | 19:30 |
I don't think this is any worse than with docker so should be fine? With possibility of improvement in the future | ||
@jim:acmegating.com | tristanC: the opendev admins like to have things like /opt/dib bind mounted so it's accessible outside of containers and allows for complete removal and replacement of containers without losing vital data | 19:31 |
@jim:acmegating.com | Clark: i agree, it's no worse so i think we can live with it. | 19:32 |
@tristanc_:matrix.org | corvus: you can still access the volume data from the host through ~/.local/share/containers/storage/volumes | 19:33 |
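A sketch of that volume-based approach (the volume and image names are placeholders):
```
# let podman own the storage so uid mapping inside the volume just works
podman volume create dib-images
podman run -d --name nodepool-builder -v dib-images:/opt/dib/images nodepool-builder-image
# the data remains reachable from the host:
podman volume inspect dib-images --format '{{ .Mountpoint }}'
```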
@tristanc_:matrix.org | also, for the record, we had quite a few issues running images featuring a USER statement in openshift, and in the end we removed the user creation inside the image, and let the runtime assign the uid. | 19:36 |
-@gerrit:opendev.org- Tristan Cacqueray proposed on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul-operator] 881245: Publish container images to quay.io https://review.opendev.org/c/zuul/zuul-operator/+/881245 | 19:43 | |
@fungicide:matrix.org | problem with the container runtime assigning a uid is that you don't have a stable uid on the host end, so if you replace that container and mount the old data back in, it may no longer be owned by the same uids | 20:13 |
@avass:matrix.vassast.org | > <@fungicide:matrix.org> problem with the container runtime assigning a uid is that you don't have a stable uid on the host end, so if you replace that container and mount the old data back in, it may no longer be owned by the same uids | 20:14 |
I think I've used gids to work around that. But maybe that doesn't work for your issue
@fungicide:matrix.org | well, also if we want to create things from the host that are owned by users inside the container, we need to be able to predict the uids the container's users will have | 20:14 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 883985: Add error information to config-errors API endpoint https://review.opendev.org/c/zuul/zuul/+/883985 | 22:22 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 883985: Add error information to config-errors API endpoint https://review.opendev.org/c/zuul/zuul/+/883985 | 22:41 |