corvus | thanks; the delta from your last review on keycloak is very small :) | 00:00 |
clarkb | corvus: for the auth token, do we supply the value in the config as our token directly, or do we generate a time-specific token? And if we generate a time-specific token, what prevents anyone from doing that? (I've got the zuul client docs up and they don't seem to speak to this) | 00:02 |
clarkb | oh there it is zuul create-token | 00:02 |
clarkb | zuul create-auth-token | 00:02 |
clarkb | but that still doesn't explain how this is ACL'd https://zuul-ci.org/docs/zuul/reference/client.html#create-auth-token | 00:03 |
corvus | clarkb: my plan is to generate one token with no limit on the lifetime, and stick that in a config file on zuul0X for use with zuul-client | 00:05 |
corvus | if we lose that token, we'll need to change the secret, but we can do that easily | 00:05 |
clarkb | corvus: ok, and what prevents anyone from running zuul create-auth-token and creating another token? | 00:06 |
clarkb | that is what I'm confused about. The command doesn't seem to accept a secret at all | 00:06 |
clarkb | so wondering where the trust is involved | 00:06 |
corvus | clarkb: oh we run that on zuul02 and it reads zuul.conf, so only we can create those tokens | 00:06 |
corvus | and then the acl is here: https://review.opendev.org/820277 | 00:06 |
clarkb | oh this isn't creating a token via the rest api | 00:06 |
clarkb | creating the token is local via the local config. Then the created token can be used via the rest api according to the rules in 820277 | 00:07 |
corvus | correct, 'zuul create-auth-token' is an admin-only command that can only be run with the actual production zuul.conf. | 00:07 |
corvus | clarkb: exactly | 00:07 |
clarkb | ok now to find the docs on those config settings | 00:08 |
corvus | and i set up the rules so we can use a single token for every tenant | 00:08 |
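A minimal sketch of the two pieces being discussed, assuming Zuul's HS256 authenticator driver; the section name, secret, tenant, and rule names here are placeholders, not the actual values from 820277. The `issuer_id` in zuul.conf becomes the token's `iss` claim, and the tenant authorization rule matches on that issuer:

```ini
; zuul.conf (sketch; secret and names are illustrative)
[auth zuul_operator]
driver=HS256
secret=NIZVzc7vw7NxGOANXIWGbQKB5kMHdz7V
issuer_id=zuul_operator
client_id=zuul_web
```

```yaml
# tenant config (sketch): accept any token minted by our issuer
- authorization-rule:
    name: opendev_operator
    conditions:
      - iss: zuul_operator
- tenant:
    name: openstack
    admin-rules:
      - opendev_operator
```

The token itself would then be minted on the scheduler host with something like `zuul create-auth-token --auth-config zuul_operator --tenant openstack --user admin` (flags per the client docs linked earlier; the names are again placeholders).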
clarkb | corvus: if I read the docs correctly you may not actually be planning to use the create-auth-token command and instead manually generate the token? | 00:13 |
clarkb | I'm basing this on the fact that you seem to be relying on the issuer to accept the token via the iss config in 820277. But create-auth-token doesn't seem to allow you to specify that? Or maybe it is automatic based on the zuul-web canonical name | 00:13 |
corvus | it uses the value from the config file (issuer_id) | 00:14 |
clarkb | aha | 00:14 |
clarkb | and we could filter this further by matching sub is admin or similar, but in this case it is sufficient to match the issuer since we are the only issuer | 00:14 |
clarkb | I know we've broken down the zuul docs in terms of tutorials, guides, and reference but digging through docs for this makes me wish that everything was more centralized. | 00:15 |
clarkb | I'll have to think on how to make this stuff more discoverable and easy to read in the docs. I don't have any good ideas other than I wish it wasn't in three different places right now :) | 00:17 |
clarkb | corvus: one last question. The docs say that revoking tokens is not trivial. I assume the process is basically to change the secret in the config? then old tokens won't match anymore? | 00:21 |
clarkb | Basically it is doable but might require a restart? | 00:21 |
corvus | yep | 00:22 |
corvus | (it may as well be a shared-secret system the way we'll be using it -- but this way we don't need to implement another auth mech) | 00:23 |
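To illustrate why this "may as well be a shared-secret system" and why rotating the secret revokes every outstanding token, here is a hand-rolled HS256 JWT in shell (claims, issuer, and secrets are all illustrative; real tokens come from `zuul create-auth-token`):

```shell
# Build header.claims.signature, base64url-encoded, HMAC-SHA256 signed.
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }
sign() {  # $1 = shared secret
  header=$(printf '{"alg":"HS256","typ":"JWT"}' | b64url)
  claims=$(printf '{"iss":"zuul_operator","sub":"admin"}' | b64url)
  sig=$(printf '%s.%s' "$header" "$claims" \
        | openssl dgst -sha256 -hmac "$1" -binary | b64url)
  printf '%s.%s.%s\n' "$header" "$claims" "$sig"
}
old=$(sign oldsecret)
new=$(sign newsecret)
# Identical claims, different secret => different signature, so every
# token minted under the old secret stops verifying after rotation.
[ "$old" != "$new" ] && echo "old tokens invalidated"
```

The signature is the only part that depends on the secret, which is exactly why revocation is "change the secret and restart" rather than per-token.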
clarkb | cool and then if we only ever leave the token on the servers themselves the risk is roughly the same as the other side of the verification | 00:23 |
corvus | yep | 00:23 |
clarkb | if we move the secret off host we should time scope them | 00:23 |
corvus | yeah. hopefully instead of doing that though we just run keycloak for our local convenience :) | 00:24 |
clarkb | right | 00:24 |
clarkb | I'm just making sure I've got a good grasp of how this works. This has helped, thanks | 00:24 |
clarkb | and for keycloak we'd add a new auth opendevkeycloak entry or similar that is driver openidconnect. Then in our admin rules we would configure it to look for that issuer and probably specific users | 00:25 |
clarkb | and then token issuance happens via keycloak's api | 00:25 |
clarkb | whatever that method might be | 00:25 |
corvus | clarkb: yep, and we can do stuff with keycloak groups and zuul tenants, etc. | 00:26 |
Clark[m] | Related to understanding how things work, John Carmack did a commencement speech recently where he talks about not being afraid to dig into details and actually understand how things work. I'm sure I'll never have as much understanding as he does, but I find that I enjoy technology a lot more when I don't treat it as magic but instead as largely decipherable tools. https://www.youtube.com/watch?v=YOZnqjHkULc | 00:29 |
clarkb | heh and now to the other client. Of course sometimes you wonder if software and computers truly are decipherable :) | 00:30 |
opendevreview | Ian Wienand proposed opendev/system-config master: Update bridge playbook match https://review.opendev.org/c/opendev/system-config/+/820281 | 00:38 |
opendevreview | Ian Wienand proposed opendev/system-config master: Rename install-ansible to bootstrap-bridge https://review.opendev.org/c/opendev/system-config/+/820282 | 00:38 |
opendevreview | Ian Wienand proposed opendev/system-config master: Rename install-ansible to bootstrap-bridge https://review.opendev.org/c/opendev/system-config/+/820282 | 00:45 |
ianw | corvus: sorry, lost it in the scrollback, did you want a full zuul restart, or just schedulers/web? | 00:46 |
corvus | Roll sched and web only | 00:48 |
ianw | ok, i can do that, and gerrit, in say 1-30/2 hrs from now | 00:49 |
corvus | Thx. I won't be around then fyi. Or even now really :) | 00:50 |
ianw | no worries. what could possibly go wrong :) | 00:50 |
ianw | as clarkb says, always good to get into the details | 00:50 |
corvus | You know where the big red reset button is | 00:51 |
*** rlandy|ruck is now known as rlandy|out | 00:55 |
kevinz | clarkb: ianw: Np, I will update today for the Cert. | 02:30 |
ianw | ok going for some restarts | 02:45 |
ianw | ok, https://zuul.opendev.org/t/openstack/build/414cdcdd14bc432c9909c692a3841aed/logs pushed 3.3: digest: sha256:152fc54f4d91f938cfe6bf5a762f129f8716e05a46619a5fe31eaaca5eabd7c5 | 02:48 |
ianw | that matches https://hub.docker.com/layers/opendevorg/gerrit/3.3/images/sha256-152fc54f4d91f938cfe6bf5a762f129f8716e05a46619a5fe31eaaca5eabd7c5?context=explore | 02:48 |
ianw | "RepoDigests": [ | 02:50 |
ianw | "opendevorg/gerrit@sha256:152fc54f4d91f938cfe6bf5a762f129f8716e05a46619a5fe31eaaca5eabd7c5" | 02:50 |
ianw | ], | 02:50 |
ianw | matches on review. so we're ready to restart with 3.3.8 | 02:51 |
ianw | old image is a071a9727a92 | 02:51 |
fungi | lgtm | 02:53 |
ianw | ... and back Powered by Gerrit Code Review (3.3.8-9-g783af24727-dirty) | 02:53 |
fungi | yay! thanks | 02:54 |
ianw | #status log restarted gerrit with 3.3.8 from https://review.opendev.org/c/opendev/system-config/+/819733/ | 02:54 |
opendevstatus | ianw: finished logging | 02:54 |
ianw | now zuul | 02:54 |
*** sshnaidm is now known as sshnaidm|off | 02:57 | |
ianw | zuul/zuul-scheduler <none> 0a216ce83b59 26 hours ago 491MB | 03:00 |
ianw | i'm not so sure about this on zuul01 | 03:00 |
ianw | https://hub.docker.com/layers/zuul/zuul-scheduler/latest/images/sha256-25347323eeaead7f8a8ca27f5b8ffd5ee62dda5ddcb508b30f1b6390727674bb?context=explore | 03:00 |
ianw | is the latest | 03:00 |
ianw | "zuul/zuul-scheduler@sha256:25347323eeaead7f8a8ca27f5b8ffd5ee62dda5ddcb508b30f1b6390727674bb" | 03:02 |
ianw | it's ok, i'm just blind, that's the old image | 03:02 |
ianw | everything matches | 03:02 |
ianw | 2021-12-03 03:02:40,151 DEBUG zuul.CommandSocket: Received b'stop' from socket | 03:03 |
wxy-xiyuan | Hi, @ianw, could you please take a look at https://review.opendev.org/c/openstack/project-config/+/818723 if you're free? When I tried to add openEuler support to devstack, the team suggested that the CI could be ready at the same time. Before I write the job, the node should be ready first I guess. | 03:03 |
ianw | https://zuul.opendev.org/components shows zuul01 init | 03:06 |
ianw | 01 is up, going to restart 02 & web | 03:24 |
ianw | wxy-xiyuan: sorry, i had missed that one. one comment inline | 03:33 |
ianw | z2 scheduler up, restarting web now | 03:39 |
wxy-xiyuan | @ianw, big thanks! will reply soon | 03:40 |
opendevreview | wangxiyuan proposed openstack/project-config master: Add openEuler 20.03 LTS SP2 node https://review.opendev.org/c/openstack/project-config/+/818723 | 03:44 |
ianw | we now seem to have a beaker next to pipelines, which i think means it worked | 03:57 |
ianw | #status log performed rolling restart of zuul01/02 and zuul-web | 03:58 |
opendevstatus | ianw: finished logging | 03:58 |
fungi | https://zuul.openstack.org/status is back as well now that the fix is in place | 04:14 |
*** ysandeep|out is now known as ysandeep|ruck | 04:54 | |
*** pojadhav|afk is now known as pojadhav | 05:00 | |
wxy-xiyuan | ianw, fixed now. :) | 06:06 |
*** raukadah is now known as chandankumar | 06:12 | |
*** ysandeep|ruck is now known as ysandeep|afk | 06:16 | |
ianw | wxy-xiyuan: one other thing; is there a reason it's not added to nl02? as it's a new distro, we could restrict it to the rax servers to start and then roll out everywhere when it is working (i.e. do it in a follow-on) | 07:48 |
ianw | it is easier to debug one thing at a time | 07:48 |
wxy-xiyuan | My thought is to enable it in a small place for testing first. Once it's stable enough, we can add it to everywhere. So I just added it in nl01 for x86 | 07:51 |
opendevreview | Ian Wienand proposed opendev/system-config master: infra-prod: setup system-config on bridge in bootstrap job https://review.opendev.org/c/opendev/system-config/+/820320 | 07:52 |
ianw | wxy-xiyuan: ok; might be worth adding it in a follow-on but mark it wip | 07:53 |
wxy-xiyuan | Sure | 07:53 |
ianw | clarkb / fungi: ^^ that's a bit of a hail mary change and i'll think on it some more; but it feels right. basically, an infra-prod-bootstrap-bridge job should *always* run, and it should a) install the production ansible and b) update system-config to the buildset reference | 07:55 |
ianw | at the moment, we use playbooks/zuul/run-production-playbook.yaml to run "install-ansible.yaml". that actually feels wrong -- that is using the "production" ansible to install ... the production ansible | 07:57 |
ianw | now I don't think you'd actually ever notice unless we wiped out ansible on bridge, but it still feels like we've got a hidden bootstrap problem with that | 07:58 |
ianw | it is an open question for me whether our existing install-ansible role is 100% idempotent. it probably is, but i'd want to investigate | 07:59 |
ianw | by removing the file matcher and making this run unconditionally, i think we have avoided the fundamental problem of the other queues not updating the source | 08:00 |
*** jpena|off is now known as jpena | 08:02 | |
ianw | the DISABLE-ANSIBLE flag also needs integration. I think the place to do that is from setup-keys -- after setting things up so each executor can log into bridge, each job can then check the on-disk flag file on bridge and stop itself before it goes on | 08:02 |
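The flag-file guard described above amounts to a check like the following (the real path on bridge is not stated here, so this sketch uses a temporary stand-in path):

```shell
# Simulate the DISABLE-ANSIBLE guard: a job proceeds only when the
# operator-created flag file is absent.
flag=$(mktemp -u)          # mktemp -u yields a path that does not exist
check() { if [ -f "$flag" ]; then echo disabled; else echo proceeding; fi; }
first=$(check)             # no flag: deployment jobs run normally
touch "$flag"              # operator sets the flag on bridge
second=$(check)            # flag present: jobs stop before doing work
rm -f "$flag"
echo "$first then $second"
```

Running this prints "proceeding then disabled", mirroring the before/after behavior each executor-side check would see.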
*** ysandeep|afk is now known as ysandeep | 08:07 | |
*** ysandeep is now known as ysandeep|ruck | 08:08 | |
opendevreview | Ian Wienand proposed openstack/project-config master: Update the opendev/system-config tag https://review.opendev.org/c/openstack/project-config/+/819715 | 08:16 |
opendevreview | Elod Illes proposed openstack/project-config master: Add rights to neutron-dynamic-routing-stable-maint https://review.opendev.org/c/openstack/project-config/+/820351 | 08:51 |
*** ysandeep|ruck is now known as ysandeep|lunch | 08:58 | |
*** ysandeep|lunch is now known as ysandeep | 10:13 | |
*** ysandeep is now known as ysandeep|ruck | 10:16 | |
opendevreview | Arnaud Morin proposed openstack/project-config master: Disable nodepool temporarily https://review.opendev.org/c/openstack/project-config/+/820369 | 10:44 |
*** rlandy_ is now known as rlandy|ruck | 11:14 | |
*** arxcruz|rover is now known as arxcruz | 12:43 | |
fungi | ianw: i've pretty much always assumed there were bootstrapping gaps for bridge.o.o, but i agree it would be good to close them where we can | 13:08 |
*** pojadhav is now known as pojadhav|brb | 13:48 | |
*** pojadhav|brb is now known as pojadhav | 15:20 | |
*** chandankumar is now known as raukadah | 15:29 | |
clarkb | ianw: fungi: left a couple of thoughts on that system-config update. I think it is close to what we want, but needs a few edits. Also as noted we might be able to do it in two stages where we can confirm the first job is doing what we want before we rely on it | 16:08 |
fungi | i'm still thinking through how best to test the hanging newlist call | 16:08 |
clarkb | fungi: Maybe hack up the test input for the test job and undo the "don't send emails" | 16:10 |
fungi | yeah, maybe drop all the lists except the mailman meta-list for one site | 16:11 |
clarkb | fungi: for that I think you'll want to update the inventory stuff to remove all the lists except for your lists.openinfra.dev list. Then toggle the test flag to false | 16:11 |
clarkb | ya exactly | 16:11 |
clarkb | then you should be in a state where it is stuck and the job will timeout. Then you can hop onto the held node and rerun the newlist manually | 16:11 |
*** marios is now known as marios|afk | 16:15 | |
*** pojadhav is now known as pojadhav|dinner | 16:36 | |
*** ysandeep|ruck is now known as ysandeep|out | 16:38 | |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Reproduce mailman newlist hanging https://review.opendev.org/c/opendev/system-config/+/820392 | 16:38 |
clarkb | ya something along those lines should work | 16:39 |
fungi | if it ever gets a node assigned | 16:44 |
fungi | there it goes | 16:45 |
corvus | infra-root: frickler requested that i move the docker volume mapped directory /var/keycloak/log to /var/log/keycloak in https://review.opendev.org/819923 -- do we have a collective preference for that? | 16:45 |
corvus | in some cases we have, eg, /var/log/zuul, but in others, i see us putting all the docker volume dirs under one /var/foo | 16:46 |
fungi | i guess it's a question of whether it's more convenient to group the mapped dirs together in one place on the host fs | 16:46 |
clarkb | I don't think we've been very consistent about how to capture logs for containers. Partly because services do logging in a variety of ways. For services that write to stdout/stderr we've got syslog capturing systems that write them to dedicated files on disk. Services like zuul and gitea we capture in log dirs but as you mention they are done in different ways | 16:47 |
fungi | i don't have a personal preference, other than for consistency, which we already seem to lack at this point | 16:47 |
corvus | yeah, that change seems to have collected a very large number of nit comments where people said "we do it this way" and in fact, we do it that way half the time, and we do it another way the other half of the time. so i'm trying to navigate that and produce something that will actually get some +2 reviews. | 16:48 |
corvus | so i'm trying to figure out what the actual right answers are | 16:48 |
corvus | (i copied that from the etherpad role, btw, so everything in that change has precedent) | 16:49 |
fungi | i'll admit i do tend to look in /var/log first when trying to find logs, but a quick skim of the docker-compose file typically sorts me out if i don't find whatever i'm looking for | 16:50 |
corvus | okay, i'll switch it then since frickler has a preference and no one else does | 16:50 |
*** jpena is now known as jpena|off | 16:51 | |
clarkb | For me my biggest concerns at this point are understanding the user we're running as (1000 in this case) and ensuring we don't accidentally delete state because it was written into the ephemeral image and not a bind mount (I think we're mounting the h2 db dir?) | 16:51 |
corvus | clarkb: correct | 16:51 |
clarkb | Otherwise it is hard to be consistent with every application simply because they differ and referring to the configs (docker-compose in many cases) is a good way to determine that when debugging | 16:52 |
corvus | (tbh, i prefer the other way considering that the actual filesystem location in the container is /opt/jboss/keycloak/standalone/log , and it's a sibling directory to data, so in my original change, they are both sibling directories in both locations) | 16:52 |
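The two layouts under discussion differ only in the host-side paths; the container paths come from the message above. A hedged docker-compose sketch of the frickler-preferred variant (service name illustrative, image omitted):

```yaml
# Sketch: logs to /var/log/<service>, state to /var/<service>, both
# bind-mounted so nothing lives only in the ephemeral image layer.
services:
  keycloak:
    volumes:
      - /var/keycloak/data:/opt/jboss/keycloak/standalone/data
      - /var/log/keycloak:/opt/jboss/keycloak/standalone/log
```

corvus's original layout would instead map both to siblings under /var/keycloak, matching the sibling data/log directories inside the container.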
*** pojadhav|dinner is now known as pojadhav | 16:53 | |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 16:54 |
corvus | clarkb, ianw, frickler ^ i think i addressed all the comments | 16:54 |
*** marios|afk is now known as marios | 16:59 | |
*** marios is now known as marios|out | 17:08 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Add a second Zuul user in gerrit testing https://review.opendev.org/c/opendev/system-config/+/820395 | 17:10 |
clarkb | That's a naive first step in testing around case-sensitive usernames | 17:10 |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 17:24 |
*** pojadhav is now known as pojadhav|out | 17:24 | |
corvus | clarkb, fungi ^ one more ps to fix the thing fungi caught | 17:24 |
fungi | thanks! | 17:25 |
fungi | so the good news is that i was able to recreate the hanging newlist call: https://zuul.opendev.org/t/openstack/build/ef9e10d4365b4aa69afac0bd4bd149de | 17:34 |
clarkb | yay | 17:34 |
fungi | the ara report has the tasks up to the one which tries to create the lists since that task times out and never completes | 17:34 |
fungi | doesn't actually have that task itself, so i simply assume it's hanging like we observed in production | 17:35 |
clarkb | ya the json only writes out when the task completes iirc | 17:35 |
clarkb | and we probably timed out and killed ansible before that happened due to the hang | 17:36 |
clarkb | fungi: I guess you can run newlist by hand next and see what it prompts for and if that is different for the mailman meta list than a normal list? | 17:41 |
clarkb | it may be that there is a required value to be provided and it can't default like the normal list case | 17:41 |
fungi | i suspect it's more that our workaround isn't working around. i'm going to test that next | 17:43 |
clarkb | fungi: also why didn't testing catch this when you added the new site? I guess that is why it is good news we replicated | 17:45 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Reproduce mailman newlist hanging https://review.opendev.org/c/opendev/system-config/+/820392 | 17:45 |
fungi | well, we don't test it because we explicitly avoid sending notifications, and it's the notification sending which prompts | 17:46 |
fungi | i replicated it by running the test job without disabling notifications | 17:46 |
fungi | we previously thought we had replicated it and that replacing stdin with an empty string would do the same as what we had tested, i believe | 17:47 |
fungi | but we may need to add </dev/null or something like that | 17:48 |
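The `</dev/null` idea can be demonstrated with a stand-in for newlist's "hit enter to notify" prompt (the function below is hypothetical, not mailman's actual prompt text):

```shell
# A prompting command blocks on read until stdin yields a line or EOF.
prompt_cmd() { printf 'Hit enter to notify the list owner... '; read -r _; echo done; }
# Redirecting stdin from /dev/null makes read hit EOF immediately, so
# the command completes non-interactively instead of hanging forever.
out=$(prompt_cmd </dev/null)
echo "$out"
```

Under Ansible the same hang occurs because nothing ever arrives on stdin, so the redirect (or an explicit stdin value) is what breaks the wait.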
clarkb | oh right I forgot about that so ya removing the no email flag makes it prompt (I wish software didn't do that, but what can you do) | 17:49 |
clarkb | fungi: yes I'm fairly certain we managed to confirm it was fixed, but I suppose its possible we only convinced ourselves of that and reality was different | 17:49 |
fungi | well, what's fun is that removing the no email flag makes it prompt if it thinks you're in an interactive shell, and it looks like ansible probably goes out of its way to convince shells that's the case | 17:50 |
fungi | part of the problem is that killing the newlist once it reaches the notification prompt basically has the desired effect minus notification, since the list has already been created at that point | 17:52 |
fungi | so subsequent runs will see the list exists and not rerun newlist | 17:52 |
clarkb | fun. | 17:52 |
fungi | the effective way to test this would be to configure exim to send all messages to /dev/null or something | 17:54 |
fungi | so that newlist can believe it's notifying admins | 17:54 |
clarkb | fungi: and remove the test only flag? I'd be open to that | 17:54 |
clarkb | but also not sure I know how to make exim garuntee that | 17:54 |
fungi | well, we could in theory use this change or a similar one to work out the details | 17:55 |
fungi | so that we avoid annoying/confusing real list admins | 17:55 |
clarkb | ++ | 17:56 |
clarkb | fungi: maybe we need to run under nohup? | 17:58 |
clarkb | a new patchset to your existing change should be able to test something like that. Then we can work backwards to swap out exim configs? | 17:59 |
fungi | yep. that ought to work | 18:00 |
fungi | mmm, nohup also redirects stdout/stderr to local files automatically. that may make debugging harder | 18:06 |
fungi | i'll wip it for the moment while we test whether it's an effective workaround at all | 18:06 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Run newlist under nohup https://review.opendev.org/c/opendev/system-config/+/820397 | 18:09 |
clarkb | fungi: nohup manpage says we can redirect to other files if we prefer. We would have to switch from command to shell module in the ansible to use redirects | 18:13 |
fungi | yeah, and if we can use redirects we could just </dev/null explicitly instead | 18:13 |
clarkb | ya | 18:18 |
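The nohup default-vs-explicit-redirect behavior mentioned above, sketched with a trivial stand-in command (the echoed text is illustrative):

```shell
# nohup appends to ./nohup.out by default; an explicit redirect
# overrides that and keeps output somewhere log collection can find it.
tmp=$(mktemp)
nohup sh -c 'echo list created' >"$tmp" 2>&1
out=$(cat "$tmp")
rm -f "$tmp"
echo "$out"
```

With the explicit `>"$tmp" 2>&1` nothing lands in nohup.out, which addresses the "makes debugging harder" concern about nohup's automatic local files.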
fungi | i think we ended up using cmd and overloading stdin that way because "shell" is frowned upon? | 18:19 |
clarkb | ya I think the linter prefers it that way but if redirecting then you need the shell and its fine | 18:21 |
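A hedged sketch of the task change being discussed (list name, owner address, and password variable are all illustrative):

```yaml
# The command module passes argv directly, so "</dev/null" would be a
# literal argument there; the shell module is needed for the redirect.
- name: Create the mailing list non-interactively
  shell: newlist mylist mylist-owner@example.org "{{ list_password }}" </dev/null
```

This trades the linter's preference for `command` against actually getting shell redirection semantics, which is the point of the switch.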
clarkb | fungi: https://zuul.opendev.org/t/openstack/build/d6d023ba318e44e6b5aefd861614061e/log/job-output.txt#22193-22221 | 18:32 |
clarkb | fungi: maybe the "fix" here is to switch to sending a newline on the stdin | 18:32 |
fungi | 819923,9 looks like it failed system-config-run-keycloak when trying to start apache, but we don't collect apache logs | 18:32 |
clarkb | https://docs.ansible.com/ansible/latest/collections/ansible/builtin/shell_module.html#parameter-stdin_add_newline but apparently it is already the default to send a newline | 18:33 |
clarkb | fungi: https://opendev.org/opendev/puppet-mailman/src/branch/master/lib/puppet/provider/mailman_list/mailman.rb#L69-L93 that is what puppet was doing I think. Unfortunately no explicit stdin handling | 18:38 |
clarkb | fungi: is it possible that mailman updates and/or mailman on focal is the cause of this behavior change? | 18:39 |
clarkb | that could explain why we were confident it was fixed but now it isn't | 18:39 |
fungi | i can certainly try running the job on bionic | 18:42 |
fungi | or xenial? | 18:43 |
clarkb | ya it would've been xenial before | 18:43 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Reproduce mailman newlist hanging https://review.opendev.org/c/opendev/system-config/+/820392 | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Redirect stdin for newlist https://review.opendev.org/c/opendev/system-config/+/820397 | 18:47 |
fungi | switched the reproducer to xenial, updated the workaround to redirect stdin with a shell task | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Pipe yes into newlist https://review.opendev.org/c/opendev/system-config/+/820397 | 19:25 |
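What "pipe yes into newlist" amounts to, sketched with a hypothetical stand-in for the prompt, alongside the single-newline variant discussed earlier:

```shell
# prompt_cmd stands in for newlist waiting on its "hit enter" prompt.
prompt_cmd() { read -r _; echo notified; }
# One newline on stdin answers one prompt; 'yes ""' supplies an
# endless stream of empty lines, answering however many prompts appear.
via_newline=$(printf '\n' | prompt_cmd)
via_yes=$(yes '' | prompt_cmd)
echo "$via_newline / $via_yes"
```

Both variants unblock the read; `yes` is just the belt-and-suspenders version when the number of prompts is unknown (yes exits on SIGPIPE once the consumer finishes).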
clarkb | infra-root I copied my raw notes file for the gerrit user summit into my homedir on review02 | 19:27 |
clarkb | I'd typically prefer to stick them in an etherpad but they are very raw and have event urls and I'm not comfortable putting them on etherpad right away | 19:28 |
clarkb | if you would like them more curated on an etherpad I can work on that next week | 19:28 |
corvus | clarkb: don't forget to scrub the names for gdpr compliance! (/sarcasm -- maybe, honestly, don't know) | 19:30 |
clarkb | corvus: ya... thats one of the things since I put some names in there | 19:30 |
clarkb | and they seem very cautious in that community about names :) | 19:31 |
corvus | simple solution: give everyone aliases from clue. Colonel Mustard uploaded gerrit 3.4 in the office with the release script. | 19:32 |
clarkb | hahahaha | 19:32 |
fungi | the newlist </dev/null in https://zuul.opendev.org/t/openstack/build/052e2fa7f2ef4fa592db74ffe991136d looks like it did work (i think the prompt was printed but bypassed), though subsequent tests failed for it | 19:32 |
fungi | oh, but maybe that's because i didn't fix the tests for the lists i omitted | 19:32 |
clarkb | fungi: ya the tests check specific sites and lists | 19:32 |
clarkb | if </dev/null worked then we probably have a reasonable workaround | 19:33 |
fungi | i agree, i'll try switching back to that one momentarily | 19:34 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Reproduce mailman newlist hanging https://review.opendev.org/c/opendev/system-config/+/820392 | 19:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Redirect stdin for newlist https://review.opendev.org/c/opendev/system-config/+/820397 | 19:39 |
fungi | also i confirmed the reproducer still reproduces on xenial, so i don't think this crept in with the focal upgrade, i think it was just never thoroughly tested | 19:40 |
fungi | further, i think corvus would make an excellent professor plum | 19:42 |
ianw | > in some cases we have, eg, /var/log/zuul, but in others, i see us putting all the docker volume dirs under one /var/foo | 20:33 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Reproduce mailman newlist hanging https://review.opendev.org/c/opendev/system-config/+/820392 | 20:34 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Redirect stdin for newlist https://review.opendev.org/c/opendev/system-config/+/820397 | 20:34 |
ianw | my feeling on that is probably that if it's under /var/foo, /var/foo might be a separate cinder volume. i feel like maybe gerrit/graphite are things that have separate storage volumes | 20:34 |
ianw | anyway, not super fussed | 20:34 |
corvus | well, etherpad was the role i copied all that from | 20:35 |
corvus | (so everything someone disagreed with was true for the etherpad role). there's /var/etherpad/db and /var/etherpad/www | 20:35 |
corvus | i'm not super fussed either, but given the differences between roles and review comments, maybe we ought to go through and articulate a policy | 20:36 |
corvus | infra-root: public service announcement: because of all the recent 'zuul delete-state' runs, there are no autohold records in zuul, but there are some held nodes in nodepool. might be worth a check of the nodepool nodes. | 20:38 |
ianw | clarkb: thanks, will loop back on your comments, sorry yes i meant to add the nodes:[] from your prior comment on that, thanks for picking up | 20:39 |
fungi | clarkb: one other thing i've noticed... even though i set my address as the testlist admin, i did *not* receive any notification from the test node. checked my mta's logs and there were no connections (not even rejections) from the node's ip address | 21:28 |
fungi | not sure if we're successfully blocking test nodes from sending e-mail already, or if that workaround is causing newlist not to generate the notification | 21:30 |
fungi | (though it has an exit code of 0 so it didn't act like that was a failure) | 21:30 |
clarkb | corvus: thanks for the reminder I am pretty sure I have a couple I can clean up. Will check momentarily | 21:30 |
clarkb | fungi: it could be the test node provider blocks smtp | 21:31 |
clarkb | I have requested that nodepool delete my held nodes; they should disappear momentarily | 21:32 |
clarkb | fungi: maybe that is the easiest thing to do though add an iptables rule blocking port 25? | 21:33 |
clarkb | left that as a suggestion on the change so that it doesn't get missed if fungi is weekending already | 21:37 |
fungi | this is how i weekend ;) | 21:45 |
clarkb | fungi: if you were closer I'd take you fishing or something so that you could weekend more weekendy | 21:46 |
fungi | but yeah, i agree, an egress rule blocking destination port 25/tcp before the allow all egress rule would be a great addition | 21:46 |
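The egress rule described above, as an iptables-save style fragment (a sketch; the real rules file layout in system-config may differ):

```
# Reject outbound SMTP ahead of the blanket egress accept, so test
# nodes cannot deliver real mail but the attempt still shows in logs.
-A OUTPUT -p tcp --dport 25 -j REJECT --reject-with tcp-reset
-A OUTPUT -j ACCEPT
```

Using REJECT with a TCP reset (rather than DROP) makes the MTA fail fast instead of waiting on connection timeouts, which keeps test jobs from stalling.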
fungi | if you were closer you wouldn't need to take me fishing, we could just cast from the yard ;) | 21:46 |
clarkb | ha indeed | 21:46 |
fungi | anyway, firewall rule is a stellar idea, far less effort than reconfiguring exim to drop outbound messages on the floor | 21:48 |
ianw | i probably have some gerrit held nodes, they can be removed if in there, otherwise i'll clean up and test the new gerrit 3.4 images next week | 21:48 |
fungi | clarkb: the main reason i brought it up is that i suspect we need to test whether it tried to deliver the message, so that we know the workaround isn't just equivalent to always doing newlist -q | 21:48 |
clarkb | fungi: ah good point. Maybe we can test that on a held node? | 21:49 |
clarkb | through manual invocations of newlist | 21:49 |
fungi | also, maybe we can configure our parent job to collect exim logs | 21:49 |
fungi | if i had the exim logs and they showed mailman sending notifications through exim (even if undeliverable because of firewall drop/reject) that would be enough to satisfy my concern | 21:50 |
fungi | so i think that's what i'll do. a change to block outbound 25 on our test list nodes, a change to collect exim logs in all our deployment tests, and then drop the -q and associated conditional in the workaround | 21:52 |
clarkb | sounds like a plan | 21:53 |
clarkb | by the way my naive case-insensitive username collision test fails and this is apparently expected. The reason for this is while current gerrit treats existing usernames as case sensitive (therefore not breaking our existing users) it won't let you create new users that have collisions | 21:54 |
clarkb | that makes testing of the behavior changes a bit difficult, but probably good enough for now | 21:54 |
fungi | in fact, it might be a good idea to just make system-config-run block outbound 25/tcp from everything? | 21:55 |
fungi | system-config-run is only inherited by test jobs, right? | 21:56 |
clarkb | yes system-config-run is independent of our prod stuff | 21:56 |
clarkb | there is overlap where the roles and common group/host vars are used | 21:57 |
clarkb | but system-config-run runs distinct playbooks to set up stuff and will also put new zuul test job specific host/group vars in place | 21:57 |
fungi | okay, i'll propose a single change which blocks 25/tcp outbound in system-config-run and collects exim logs | 21:57 |
fungi | if that makes sense | 21:57 |
clarkb | yup I think that sounds great | 21:58 |
fungi | that way any of our deployment test jobs shouldn't be able to accidentally send outbound e-mail, but also if we're curious about whether something tried we can look in the log | 21:59 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/820267 would be a good one to review for early next week. It upgrades gitea to 1.15.7 | 22:05 |
clarkb | I'm a bit distracted this afternoon with parenting duties so please don't approve now unless you intend on watching it :) but I'll happily land it monday or fix issues if people find them | 22:05 |
opendevreview | James E. Blair proposed opendev/system-config master: Add a keycloak server https://review.opendev.org/c/opendev/system-config/+/819923 | 22:17 |
corvus | clarkb, fungi, ianw: i now have a definite preference for the ordering of server certs; i updated that to list the server first so that we don't have to template out more of the apache config | 22:18 |
corvus | (and really, the individual server name is optional anyway; we could drop it and be fine; it's just a convenience for us when debugging with direct access) | 22:19 |
Clark[m] | Ah because it can always be keycloak.opendev.org in the file path that way? | 22:19 |
corvus | yep | 22:20 |
opendevreview | James E. Blair proposed opendev/system-config master: Update letsencrypt role docs to suggest a specific order https://review.opendev.org/c/opendev/system-config/+/820409 | 22:26 |
ianw | corvus: i'm fine with adding on backups as we find out. is there a particular reason you don't want to grab the service-status page via apache in the testinfra? i've definitely seen things before where it was listening, but not actually responding correctly, so my preference is to do more end-to-end validation in testinfra if we can | 22:36 |
corvus | ianw: oh sorry i forgot to reply to that comment... do we do that anywhere? | 22:40 |
ianw | we generally call out to curl; quite a few examples e.g. https://opendev.org/opendev/system-config/src/branch/master/testinfra/test_codesearch.py#L23 | 22:40 |
ianw | as this has a UI, could even do the screenshot stuff | 22:41 |
ianw | i don't mind if this is a follow-on; just we do have facilities to do a lot more testing there | 22:41 |
corvus | have an example with a system-status page? | 22:42 |
corvus | i'm looking and can't find one | 22:42 |
ianw | oh, i don't think explicitly a system-status page | 22:42 |
ianw | but we do have examples of using requests directly | 22:43 |
ianw | https://opendev.org/opendev/system-config/src/branch/master/testinfra/test_paste.py#L35 is what i'm thinking of. so if it's json that might be easier too | 22:45 |
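(The request-and-validate pattern ianw points at in test_paste.py boils down to fetching a page and asserting on its content, not just on the port being open. A minimal sketch of the JSON-validation half; the payload shape here is hypothetical, not keycloak's actual status output.)

```python
import json

def service_ok(body: str) -> bool:
    """Return True if the fetched body looks like a healthy JSON status.

    Checking the response content catches the "listening but not actually
    responding correctly" failure mode that a simple port check misses.
    """
    try:
        data = json.loads(body)
    except ValueError:
        return False
    # Hypothetical field name; a real test would assert on whatever
    # the service's status document actually contains.
    return data.get("status") == "ok"

# In a real testinfra test the body would come from something like
# requests.get("https://<server>/...", verify=False).text
print(service_ok('{"status": "ok"}'))       # True
print(service_ok('<html>not json</html>'))  # False
```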
corvus | how about we put in a check for loading the main page (so something like "test_paste") in a followup? I already destroyed my local test env; i think at this point it'll be easier to just look at the server when we boot it to get the correct output. i think that's worth more than the apache status page anyway (which is only there again because i copied that from etherpad) | 22:50 |
corvus | incidentally, the reason i missed that comment is that it's on line 23, and gerrit and gertty disagree on whether that file has a line 23. | 22:55 |
corvus | (the final byte of that file is the newline on line 22) | 22:55 |
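(The gerrit/gertty disagreement corvus describes comes down to whether a trailing newline terminates the last line or separates it from an empty new one. In miniature:)

```python
# A 22-line file whose final byte is the newline ending line 22:
content = "line 21\nline 22\n"

# splitlines() treats the trailing newline as a line terminator,
# so the file has 2 lines and no line 23.
print(len(content.splitlines()))  # 2

# split("\n") treats it as a separator, producing a phantom empty
# final element -- the "line 23" that one tool sees and the other doesn't.
print(len(content.split("\n")))   # 3
```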
fungi | seems fine to keep on the to do list for a followup change, i assume ianw would be fine with that too since he's +2'd the current change | 22:57 |
clarkb | corvus: that is a neat bug :) | 22:57 |
corvus | yeah. i'm clearly going to need to "fix" it even if i don't agree it's "broken" :) | 22:57 |
clarkb | the best bugs are the ones you fix that were never really broken in the first place | 22:58 |
corvus | is it okay from a system-config ansible perspective for me to +w that now? | 22:58 |
clarkb | I just sat back down after a school run and can take another look, however if only the le names moved I'm fine with a +w | 22:59 |
clarkb | it also doesn't have a new server yet so this is largely a noop until we boot one right? | 22:59 |
clarkb | so ya +w should be fine | 22:59 |
fungi | yes, that's all | 23:14 |
ianw | follow-up is fine. i've just been writing a talk about how amazing our testing is, so i'm attuned to it atm :) | 23:16 |
corvus | should i +w that change, or should i create the server and add it to dns and inventory first? | 23:17 |
corvus | not sure about the chicken/egg thing here | 23:17 |
corvus | also, i'm going to guess 4g for this server | 23:18 |
ianw | i'd probably make the server and add to inventory then merge the change but i don't think it matters | 23:20 |
Clark[m] | Ya doesn't matter too much. You'll just have to follow-up with an inventory update next | 23:20 |
corvus | okay, i'll do the server first... and redo it since i just named it keystone instead of keycloak | 23:23 |
corvus | er, anyone know how to run "openstack server list" on bridge? | 23:28 |
fungi | huh, yeah, i'm getting a "temporary failure in name resolution" | 23:30 |
corvus | oh okay so that was supposed to work | 23:30 |
fungi | sudo ~fungi/osc/bin/openstack --os-cloud openstackci-vexxhost --os-region-name ca-ymq-1 server list | 23:30 |
fungi | i installed latest osc in a venv there | 23:31 |
corvus | oh should i make this in vexxhost and not rax dfw? | 23:31 |
fungi | oh, i see, i think the name resolution failures are something to do with osc being installed in a docker container? | 23:31 |
fungi | /usr/local/bin/openstack on bridge is a wrapper script calling docker run | 23:32 |
corvus | oh, so running your command with rax instead of vexx might work? | 23:32 |
fungi | there may be dns resolver configuration problems within the container itself | 23:32 |
corvus | running `~fungi/osc/bin/openstack --os-cloud openstackci-rax` as root does not work for me | 23:33 |
corvus | `Version 2 is not supported, use supported version 3 instead.` | 23:33 |
fungi | yeah, i was just trying to figure that out as well | 23:34 |
fungi | i wonder if rackspace changed their keystone | 23:34 |
corvus | somehow launch-node works tho | 23:34 |
fungi | yeah, our clouds.yaml seems to set identity_api_version: 2 | 23:35 |
corvus | okay, i managed to find the right rackspace web login and deleted the server | 23:37 |
fungi | i'm betting we need to update the clouds.yaml now | 23:38 |
ianw | i am 100% sure we've had this problem with the openstack wrapper on bridge before | 23:41 |
ianw | i just can not find any details about it | 23:42 |
ianw | i feel like we might have done something like a docker restart and it started working | 23:43 |
fungi | oh, i think the "Version 2 is not supported, use supported version 3 instead." error is coming from osc itself. rackspace still uses/needs keystone v2 api, so you have to use an old osc release to talk to rackspace | 23:43 |
fungi | it seems like osc has given up on the idea of backward compatibility there | 23:45 |
Clark[m] | This is why shade existed and it makes me wonder if we need to resurrect that sort of idea with the sdk team | 23:47 |
fungi | though i can't be sure, as i'm not actually finding that error string in osc | 23:51 |
Clark[m] | It would be in the keystoneauth library | 23:54 |
Clark[m] | And ya grepping those repos is often an exercise in frustration because the code does too much magic | 23:54 |
fungi | i think it's actually bubbling up from cinderclient? | 23:58 |
fungi | sudo ~fungi/osc/bin/openstack --os-cloud openstackci-rax --os-region-name DFW --os-volume-api-version 3 server list | 23:59 |
fungi | that's working for me now | 23:59 |
fungi | overriding the volume_api_version: 2 from clouds.yaml | 23:59 |
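(In clouds.yaml terms, the workaround fungi landed on amounts to overriding the pinned cinder API version; roughly like the fragment below. Keys shown are real openstacksdk config options, but the values and layout are illustrative, not the actual bridge configuration.)

```yaml
clouds:
  openstackci-rax:
    auth:
      auth_url: https://identity.api.rackspacecloud.com/v2.0/
    identity_api_version: 2   # rackspace still speaks keystone v2
    # the cinderclient bundled with newer osc refuses volume API v2,
    # so override the pin (or pass --os-volume-api-version 3 on the CLI):
    volume_api_version: 3
```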
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!