opendevreview | Jens Harbott proposed openstack/project-config master: Drop repos with config errors from x/ namespace https://review.opendev.org/c/openstack/project-config/+/923509 | 12:01 |
frickler | seems gerrit is sometimes rather slow to respond currently | 12:07 |
jrosser | it was pretty slow yesterday when i was making cherry-picks | 12:13 |
frickler | ack, let me check load + memory | 12:15 |
frickler | mnaser: I dropped x/ospurge from the cleanup patch above ^^, but it seems https://review.opendev.org/c/x/ospurge/+/872697 is still having some issues, e.g. senlin has been retired in the meantime; I've only deleted it from the zuul config for now | 12:16 |
*** iurygregory__ is now known as iurygregory | 12:17 | |
fungi | frickler: jrosser: which operations are slow, specifically? is it git pushes, or page loads, or...? | 12:26 |
jrosser | I used the gerrit ui to create cherry picks to stable branches, and I feel like it was maybe 30s until the notification on irc happened | 12:27 |
fungi | okay, so that could be delays on the event stream side then, i suppose | 12:28 |
jrosser | but that’s hard to judge the actual time retrospectively | 12:28 |
jrosser | the page reloads and shows the cherry-pick too after that time | 12:28 |
fungi | we do get second resolution in gerrit comments and irc logs, so could compare those to see if there's a significant delay between the two | 12:28 |
frickler | for me it was submitting a new PS with git-review, so git push essentially I guess | 12:30 |
frickler | I can check my action in the log while I'm already looking at it | 12:30 |
fungi | i'll be pushing some reviews here in the next few hours as well, so can pay close attention to what performance i'm getting | 12:31 |
frickler | INFO com.google.gerrit.server.git.MultiProgressMonitor : Processing changes: refs: 1, updated: 1 [CONTEXT ratelimit_period="1 MINUTES [skipped: 3]" ] | 12:35 |
frickler | not sure what the "skipped: 3" is actually saying, but something like this is getting logged rather often, also with the commit I pushed | 12:36 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Move minikube out of /tmp https://review.opendev.org/c/zuul/zuul-jobs/+/924738 | 14:37 |
leonardo-serrano | Hi everyone, I'm trying to add an apt source list for the tests on starlingx/distcloud. So far I've come up with an ansible playbook added to the "pre-run" section. Checking zuul logs it seems to me that the "proper" way to handle this would be to use the "configure-mirrors" role. Does anyone have some advice on how to use it? I didn't quite understand the docs | 14:59 |
fungi | leonardo-serrano: you probably want the ensure-package-repositories role: https://zuul-ci.org/docs/zuul-jobs/latest/system-roles.html#role-ensure-package-repositories | 15:14 |
fungi | use it in a pre-run playbook so it's done prior to any tasks which might attempt to install packages from that repository | 15:15 |
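Wiring an extra pre-run playbook into a job is done in the job definition itself; a minimal sketch (job and parent names are illustrative, not the actual starlingx job definitions) looks roughly like this:

```yaml
# Hypothetical job variant; the pre-run line is the relevant part.
- job:
    name: stx-distcloud-tox-py39
    parent: tox-py39
    # A child's pre-run playbooks run after all ancestor pre-runs and before
    # the job's run phase, which matters for the bindep ordering discussed below.
    pre-run: playbooks/configure-package-repos.yaml
```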
leonardo-serrano | fungi: Thank you for the feedback. I'm unsure if adding to pre-run would resolve the issue because bindep runs before the pre-run section of the job I'm working on. It seems like the target job first calls opendev.org/opendev/base-jobs/playbooks/base/pre.yaml@master, which calls bindep. I'm taking these logs as reference: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_192/ | 15:24 |
leonardo-serrano | 924676/3/check/stx-distcloud-tox-py39/192db38/job-output.txt | 15:24 |
leonardo-serrano | (Link: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_192/924676/3/check/stx-distcloud-tox-py39/192db38/job-output.txt) | 15:25 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test https://review.opendev.org/c/zuul/zuul-jobs/+/924750 | 15:34 |
clarkb | as a general rule of thumb I would not use irc bots as an indicator for gerrit slowness :) | 15:47 |
jrosser | i misspoke there really | 15:47 |
jrosser | the web page took an exceedingly long time to refresh showing that the cherry-pick was made | 15:48 |
clarkb | got it. As a side note to that the version of gerrit we upgraded to recently now allows for those web actions to be asynchronous if you opt into that in the config. I think it is probably better to get the signal that things may be slow though | 15:49 |
clarkb | git operations on certain repos in particular can be slow. A lot of that is a function of size. I think zuul can take several minutes to merge nova changes for example | 15:50 |
clarkb | so it's less about absolute values and more about deviation from typical behavior I guess | 15:50 |
frickler | infra-root: in case you haven't seen, nova has some CVE patches to merge, so it would be nice to avoid any disruptions from the opendev side until that is done | 15:51 |
clarkb | ack. Probably best to avoid doing the gitea db doctoring too then since that is how people are likely to get the patches | 15:51 |
frickler | not sure if that would also be relevant for the gitea work, likely not | 15:51 |
frickler | but can't hurt just to play it safe, too | 15:52 |
clarkb | frickler: I think the main risk would be reduced capacity to the git cluster for serving the commits. It's probably safe enough but I'm happy to focus on that tomorrow when I have all day to work through it | 15:52 |
clarkb | (whereas today there are meetings all morning) | 15:52 |
frickler | that's assuming things will be done tomorrow, which may or may not be optimistic ;) | 15:53 |
fungi | this time it's just patches for nova, at least | 16:14 |
clarkb | looks like stable devstack is broken though? | 16:19 |
fungi | always... *sigh* | 16:20 |
fungi | leonardo-serrano: the zuul console view is a little easier to analyze: https://zuul.opendev.org/t/openstack/build/192db382e8fb4306b31752715fe6a594/console | 16:23 |
fungi | it looks like bindep is being called in the pre-run playbook for the "unittests" ancestor job, so you could make your own alternative to unittests (and tox) jobs and then use those as the ancestors of the distcloud job definitions | 16:25 |
fungi | but that does sound a bit duplicative | 16:25 |
clarkb | fungi: leonardo-serrano ya this seems like a bug. The job should be working if it manages to get to the bindep role via a parent job | 16:25 |
clarkb | but I haven't had a chance to look at it yet | 16:25 |
fungi | leonardo-serrano: it's worth noting that bindep isn't really intended to be used for listing packages outside the ones normally supplied officially by distributions, so it doesn't have a means of specifying alternative package repositories presently. it might make more sense to explicitly install those packages in a later pre-run playbook instead of trying to do it with bindep.txt | 16:26 |
clarkb | oh if that's the problem then ya | 16:28 |
fungi | yes, the problem statement from earlier was that they want to install packages from an alternative repository as part of a job | 16:28 |
clarkb | if you are using non standard repos you either need your own jobs to configure them in the correct order so that bindep works, or add the package repos then packages as a followup step. I recommend the followup step approach because anyone fetching your code locally and running bindep will be in the same situation as the ci jobs now and it will break for them | 16:29 |
clarkb | having an explicit secondary step ensures that others can more easily reproduce that process without special knowledge needed before running bindep | 16:29 |
fungi | so probably easier to not use bindep.txt for those packages, and combine the ensure-package-repositories role with package tasks for the packages you need | 16:29 |
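A rough sketch of that combined approach using plain Ansible modules (repository URL and package names are placeholders; the ensure-package-repositories role linked above is the more portable way to do the repository half):

```yaml
# Hypothetical pre-run playbook: add the extra apt repository, then explicitly
# install the packages that would otherwise have been listed in bindep.txt.
- hosts: all
  become: true
  tasks:
    - name: Add the extra apt repository
      ansible.builtin.apt_repository:
        repo: "deb https://example.org/starlingx/debian bookworm main"  # placeholder
        state: present

    - name: Install packages from the extra repository
      ansible.builtin.apt:
        name:
          - some-distcloud-dependency  # placeholder package name
        state: present
        update_cache: true
```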
clarkb | https://github.com/pypa/setuptools/issues/4483 is the underlying issue | 16:37 |
clarkb | its fascinating | 16:37 |
fungi | oh, for the openstack stable branch problems? | 16:39 |
clarkb | yes | 16:40 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test https://review.opendev.org/c/zuul/zuul-jobs/+/924750 | 16:40 |
clarkb | https://opendev.org/openstack/requirements/src/branch/stable/2023.1/upper-constraints.txt#L339 pins to 21.3 but need 22.0 or newer | 16:40 |
clarkb | the proposed workaround is to disable the package install that trips it... rather than fixing the requirements | 16:40 |
fungi | looks like it impacted lots and lots of people | 16:45 |
frickler | well bumping requirements for stable branches is also not without possible complications | 16:58 |
frickler | I'm still hoping setuptools will find a more backwards compatible solution | 16:59 |
clarkb | I don't think it is that complicated? We just bump the constraint and if tests pass merge it | 16:59 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test https://review.opendev.org/c/zuul/zuul-jobs/+/924750 | 16:59 |
clarkb | I'm about to push that change | 16:59 |
frickler | tests in the reqs repo cover only a small subset of things. also need to consider the possible impact for distros and deployment tooling | 17:01 |
frickler | like you wouldn't notice the dbcounter issue in a reqs patch | 17:01 |
clarkb | frickler: this shouldn't affect distros because it's a packaging problem which they solve separately. And things are already broken. We are unlikely to break things worse imo | 17:01 |
clarkb | remote: https://review.opendev.org/c/openstack/requirements/+/924764 Update packaging constraint to match 2023.2 | 17:02 |
leonardo-serrano | clarkb: fungi: That's a great point on being able to easily reproduce tests locally. In that case I think the current approach of using a custom playbook in pre-run is the way to go. Thanks for the help | 17:32 |
opendevreview | James E. Blair proposed opendev/system-config master: Run zuul-launcher https://review.opendev.org/c/opendev/system-config/+/924188 | 17:48 |
*** thuvh1 is now known as thuvh | 18:20 | |
opendevreview | Merged zuul/zuul-jobs master: prepare-workspace-git: Add allow deleting current branch https://review.opendev.org/c/zuul/zuul-jobs/+/917539 | 18:21 |
clarkb | there were issues with our cleanup run playbook failing while I was out because we didn't have ssh connectivity. That was fixed by having zuul ignore cleanup run playbook failures? | 20:35 |
clarkb | I suspect that may still be a problem with the base-jobs update I'm about to push but I guess we can sort it out as we go since it should only affect jobs that are failing anyway. | 20:36 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Stop using deprecated cleanup-run playbooks https://review.opendev.org/c/opendev/base-jobs/+/924786 | 20:42 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key into a cleanup playbook https://review.opendev.org/c/opendev/base-jobs/+/924787 | 20:42 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key on all base jobs to cleanup https://review.opendev.org/c/opendev/base-jobs/+/924788 | 20:42 |
clarkb | there is a whole stack to capture that. corvus if you have time can you take a look and make sure there aren't any security concerns etc? | 20:42 |
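For reviewers following along, the overall shape of the stack is roughly the following (a sketch from memory of the Zuul docs rather than the literal base-jobs diff; the playbook paths and the exact post-run dictionary syntax should be checked against the changes themselves): the deprecated top-level cleanup-run is dropped and the cleanup playbook becomes a post-run entry flagged as cleanup, with the ssh key removal moving into it.

```yaml
# Hedged sketch of the intended base job layout, not the actual opendev/base-jobs change.
- job:
    name: base
    pre-run: playbooks/base/pre.yaml
    post-run:
      - playbooks/base/post.yaml
      - name: playbooks/base/cleanup.yaml   # remove-build-sshkey would live here
        cleanup: true                       # still runs even if the job is aborted
      - playbooks/base/post-logs.yaml
```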
opendevreview | Clark Boylan proposed openstack/project-config master: Reduce centos-8-stream min ready to 0 https://review.opendev.org/c/openstack/project-config/+/924789 | 20:57 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove centos-8-stream image uploads and labels https://review.opendev.org/c/openstack/project-config/+/924790 | 20:57 |
opendevreview | Clark Boylan proposed openstack/project-config master: Stop building centos 8 stream images https://review.opendev.org/c/openstack/project-config/+/924791 | 20:57 |
clarkb | and there are the nodepool cleanups for centos-8-stream | 20:57 |
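The min-ready piece is the simplest of the three; in the nodepool launcher config it is just the label entry (illustrative fragment, not the full project-config file):

```yaml
# nodepool.yaml fragment: stop pre-booting ready nodes for the label so the
# remaining ones can be deleted before the label and images are removed entirely.
labels:
  - name: centos-8-stream
    min-ready: 0
```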
fungi | clarkb: the underlying problem, i think was that host unreachable events during cleanup-run playbooks were being treated like host unreachable events during post-run playbooks, leading to jobs being repeatedly restarted | 21:01 |
fungi | after the security fix to do away with the cleanup-run phase and deprecate cleanup-run playbooks | 21:02 |
clarkb | oh right so we stopped retrying but it may still fail. However that isn't an issue because we only run the tasks if we have already failed | 21:02 |
clarkb | I was trying to figure out why it doesn't seem to happen for all jobs and I think it is due to ssh control persistence. If that playbook starts quickly enough after the last ssh traffic I think it sneaks in and is fine, but if not then you get the extra failed tasks | 21:03 |
clarkb | anyway that should mean the first proposed change is safe, then we can decide if we want to rearrange the order of things to remove the ssh keys at the end of cleanup. I suspect this is more correct (I'm not entirely sure why we remove ssh keys at all honestly but if we do then doing so in the playbook that always runs (cleanup) seems safest) | 21:04 |
clarkb | but please do review the security implications of all that and point out if I've missed something important | 21:04 |
corvus | clarkb: left a discussion comment on that | 21:07 |
corvus | oh interesting, we don't have a cleanup-ssh | 21:08 |
clarkb | corvus: so it would be post then cleanup then post logs? and then the move of the ssh key removal to cleanup would still be fine in that order | 21:08 |
clarkb | ? | 21:08 |
clarkb | I like that. It isn't something we could do before since cleanup was a separate phase but taking advantage of that now makes sense to me | 21:09 |
corvus | zuul-base-jobs has remove-build-sshkey in a cleanup playbook | 21:09 |
clarkb | ok cool so the followup changes should be fine too then | 21:10 |
clarkb | let me push up a new stack with the things reordered slightly | 21:10 |
corvus | yeah, i think your stack matches the idea in zuul-base-jobs; so that plus the reordering sounds good to me | 21:11 |
clarkb | corvus: and if cleanup.yaml fails due to the ssh thing post-logs.yaml should still run right? I'm like 99% sure we don't ever short circuit like that | 21:14 |
clarkb | we can skip things at deeper nest levels but we won't skip playbooks at the same level due to failures in one | 21:14 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Stop using deprecated cleanup-run playbooks https://review.opendev.org/c/opendev/base-jobs/+/924786 | 21:14 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key into a cleanup playbook https://review.opendev.org/c/opendev/base-jobs/+/924787 | 21:14 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key on all base jobs to cleanup https://review.opendev.org/c/opendev/base-jobs/+/924788 | 21:14 |
clarkb | I'm going to self approve https://review.opendev.org/c/openstack/project-config/+/924789 to reduce centos-8-stream min ready to 0 since that is a trivial change and will allow us to delete any ready nodes that may be hanging around | 21:16 |
corvus | clarkb: right, if the job did not abort, then all post playbooks at the same level run and also all cleanups, even if some fail. if the job did abort then the same is true but only for cleanups. | 21:19 |
corvus | clarkb: one comment on the second change; also -- i think it might be safer to merge those together (either squashed or in rapid succession) since i think #1 without #2 exacerbates the ssh controlpersist issue | 21:22 |
clarkb | corvus: hrm in that case I guess I can make both changes to base-test then we can land both changes against the other two. I can rearrange things | 21:22 |
clarkb | corvus: re ignore_errors if the cleanup playbook fails on that task and the job is successful otherwise won't it fail the whole job? That was the reason for the flag according to the comment | 21:23 |
opendevreview | Merged openstack/project-config master: Reduce centos-8-stream min ready to 0 https://review.opendev.org/c/openstack/project-config/+/924789 | 21:28 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Use modern cleanup tooling in base-test https://review.opendev.org/c/opendev/base-jobs/+/924786 | 21:31 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Update cleanups for base and base-minimal https://review.opendev.org/c/opendev/base-jobs/+/924787 | 21:31 |
clarkb | corvus: ^ I went ahead and kept the comment and ignore_errors for now since they don't hurt anything but we can clean that up if we feel confident failures of that task are unlikely to fail anything further | 21:32 |
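The pattern being kept looks roughly like this (a minimal sketch; the real playbooks use the remove-build-sshkey role, and the path and marker shown here are assumptions):

```yaml
# Illustrative cleanup task: strip the per-build ssh key, but don't let a
# failure of this one task fail an otherwise successful job.
- hosts: all
  tasks:
    - name: Remove the per-build ssh key from authorized_keys
      ansible.builtin.lineinfile:
        path: /home/zuul/.ssh/authorized_keys   # assumed location
        regexp: 'zuul-build'                    # hypothetical marker for the build key
        state: absent
      ignore_errors: true   # otherwise a failure here would surface as POST_FAILURE
```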
corvus | clarkb: yeah, i think i got that backwards; it was not necessary before but it is necessary now | 21:32 |
corvus | that would translate to a POST_FAILURE if it fails | 21:33 |
clarkb | ok so we want to keep it and avoid that state in an otherwise successful job | 21:33 |
corvus | (but POST_FAILURE won't override failure, and that only runs on failure... so actually i think it is okay to remove it. but not for the reason i said. for different reasons. :) | 21:33 |
clarkb | corvus: only the stat tasks run on failure | 21:34 |
clarkb | the ssh removal should run always just as it was in post.yaml | 21:34 |
corvus | right, so the stats: only run when FAILURE is already set, and POST_FAILURE won't override them, so no need to deal with that... | 21:35 |
corvus | for the ssh removal -- if the job is SUCCESS up to that point and that fails... shrug.... i guess keeping ignore_errors is okay, but i kind of don't think that POST_FAILURE is bad in that case | 21:36 |
corvus | possibly an improvement if it causes us to notice it failing :) | 21:36 |
corvus | i agree it's the most conservative thing to do now, but i'm like 60% thinking we should still remove it and accept POST_FAILURE there and see what happens. maybe on friday though? :) | 21:37 |
clarkb | ya lets do that in a followup when things are a bit quieter | 21:37 |
clarkb | unless we want to just do the whole thing later. I guess that is fine too | 21:38 |
opendevreview | James E. Blair proposed opendev/system-config master: Run zuul-launcher https://review.opendev.org/c/opendev/system-config/+/924188 | 21:44 |
clarkb | infra-root please hold off on approving changes to project-config or system-config due to these errors: https://zuul.opendev.org/t/openstack/build/42ba72266e284e8092ba20937b796c62 | 23:22 |
clarkb | it isn't clear to me what state the git repos are in so we want to carefully approve things after we think we've fixed that | 23:23 |
clarkb | https://zuul.opendev.org/t/openstack/build/95bc3906ac124f6088ab7674322e0a2a appears to be similar but different failure mode | 23:25 |
clarkb | `you try to unset an option which does not exist (ret=5),` this is the issue with the second failure. Still not sure what is going on with the first | 23:28 |
fungi | any idea yet what led to that starting? | 23:32 |
clarkb | fungi: it's almost certainly this change https://review.opendev.org/c/zuul/zuul-jobs/+/917539 | 23:32 |
clarkb | ok the second error is due to the git config var necessary to push to a non-bare repo not being set. In both cases it seems like the vars are not being set in the repo properly | 23:33 |
corvus | skipped, since /home/zuul/src/opendev.org/opendev/system-config exists for https://zuul.opendev.org/t/openstack/build/95bc3906ac124f6088ab7674322e0a2a/console#1/1/7/bridge01.opendev.org | 23:33 |
clarkb | ya it's due to the creates line on that task | 23:34 |
clarkb | just noticed that | 23:34 |
clarkb | so this change is only future compatible with new nodes/repos | 23:34 |
clarkb | this is why we only saw this on the infra-prod jobs since bridge isn't a throwaway node | 23:34 |
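A stripped-down illustration of that failure mode, assuming a task shaped like the one in prepare-workspace-git (the variable name and commands are placeholders; receive.denyCurrentBranch is the standard git option for pushing into a non-bare checkout, and the role's actual commands differ):

```yaml
# Because of "creates", the whole block is skipped when the destination already
# exists -- so repos that predate the change (like those on bridge) never get the
# newly added git config, and later steps that expect it fail.
- name: Initialize repo and set config needed to push into a non-bare checkout
  ansible.builtin.shell: |
    git init "{{ repo_dest }}"
    git -C "{{ repo_dest }}" config receive.denyCurrentBranch ignore
  args:
    creates: "{{ repo_dest }}"   # task does not run at all if this path already exists
```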
corvus | i think that's right | 23:35 |
corvus | so we should split that shell block | 23:35 |
corvus | or... | 23:35 |
clarkb | do we want to revert and try again or push forward with a fix? I think I've convinced myself that if we fix it the git reset on line 61 here: https://review.opendev.org/c/zuul/zuul-jobs/+/917539/6/roles/prepare-workspace-git/tasks/main.yaml means we should be ok | 23:35 |
corvus | wrap the first part in a shell check if the dir exists? | 23:35 |
corvus | i lean toward roll-forward; i'll propose a fix | 23:36 |
clarkb | thanks I'll be around to review it | 23:36 |
clarkb | and the reason for the two failure modes: in the case of jobs that rely on project-config, with the change updating project-config we hit the issue with changing the working tree, so it fails earlier. In the case of just system-config, which has been relatively static today, we don't fail until we try to unset the config option | 23:38 |
clarkb | checking bridge system-config is up to date and working tree is clean. project-config is one commit behind but also has a clean working tree | 23:40 |
clarkb | all that to say I think we're in a safe steady state at the moment | 23:41 |
clarkb | in the case of project-config the missing commit is the one setting centos-8-stream min ready to 0, so the impact there is very small and it will auto resolve on the hourly run for infra-prod-service-nodepool after this is fixed | 23:42 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Fix prepare-workspace-git operating on existing repos https://review.opendev.org/c/zuul/zuul-jobs/+/924802 | 23:43 |
clarkb | corvus: you should update the test role as well | 23:44 |
corvus | derp | 23:44 |
corvus | (also.. i feel like we may be able to drop those at this point) | 23:44 |
corvus | why are they different? | 23:45 |
corvus | https://paste.opendev.org/show/b5zGoVD8UD5sbKC2rQeD/ | 23:46 |
corvus | that chunk at the bottom, lines 39-69 | 23:46 |
clarkb | maybe that is how the testing works? | 23:47 |
clarkb | corvus: looking at bridge we don't have null remote for origin on system-config | 23:47 |
corvus | that's supposed to be a copy of the role for use in base-test jobs | 23:47 |
clarkb | that means your proposed fix will create additional deltas to the current behavior I think? | 23:47 |
clarkb | it could be that way because the null origin was added after that role was used to create the repo and it never ran those tasks against the repo? | 23:48 |
corvus | clarkb: i wonder why we don't; is it because that role never cloned the repo? or because we do something else to add it later? | 23:48 |
clarkb | I don't see anything obvious adding it later in that role but could be something else I suppose | 23:49 |
corvus | nothing in zuul-jobs should do it; only something in our system-config playbooks | 23:49 |
clarkb | anyway it does make me wonder if maybe we should keep the change as small as possible just to fix this specific issue then sort that stuff out later? | 23:49 |
corvus | i can't think of a reason to say that the proposed behavior is wrong or inconsistent with what the role is intended to do | 23:49 |
corvus | so we could minimize the change, but i'm not sure how we would decide it's okay to do this | 23:50 |
clarkb | I'm more concerned about breaking existing users including opendev but there may be others | 23:50 |
clarkb | that may rely on that info | 23:50 |
corvus | but what info? | 23:50 |
clarkb | the origin remote info | 23:51 |
corvus | like, someone cloned a repo outside of zuul, then used prepare-workspace-git to update the repo, but expects the origin remote to stay? | 23:51 |
clarkb | corvus: I think the fail case is more likely to be someone set the origin remote after prepare-workspace-git created the repo then it wuold never be changed | 23:52 |
clarkb | but I'm still not finding anything that is setting it for opendev so can't say what the situation is here yet | 23:52 |
corvus | but if they did that once, their job should do it again, right? | 23:52 |
clarkb | maybe? it could've been done manually too I suppose. I'm mostly just trying to call out the behavior change and identify that it is a risk that could break people | 23:53 |
clarkb | I'm beginning to suspect that what happened for opendev is similar to what I describe of someone setting it later in order to fetch something out of band for testing or whatever and then it being stuck, because I'm not finding automation that does that | 23:54 |
corvus | right, but i believe the change still causes the role to operate within its specification: "This role uses git operations ... to mirror the locally prepared git repos to the remote nodes" | 23:54 |
clarkb | that is orthogonal to my concern though. I'm concerned about the change in behavior; even if the behavior was unexpected before, people may rely on it | 23:55 |
clarkb | looks like project config also has a remote set that is not null | 23:55 |
corvus | so i agree that it could break someone that was (somehow) relying on that weird behavior quirk. based on the docs, i don't think that hypothetical person should have expected to rely on that. | 23:55 |
corvus | i'm starting to think this conversation is on the wrong channel | 23:56 |
corvus | perhaps we should move it to #zuul:opendev.org | 23:56 |
clarkb | I think I figured it out for opendev fwiw | 23:57 |
clarkb | when we run in the periodic pipelines we want to update from latest tip of master | 23:58 |
clarkb | and not the contents when we were enqueued to zuul | 23:58 |
clarkb | and I think somewhere we have task(s) that update from gitea which may be setting the remote | 23:58 |
clarkb | I haven't found that code yet and I don't know if the proposed change would break that | 23:58 |