Tuesday, 2024-07-23

opendevreviewJens Harbott proposed openstack/project-config master: Drop repos with config errors from x/ namespace  https://review.opendev.org/c/openstack/project-config/+/92350912:01
fricklerseems gerrit is sometimes rather slow to respond currently12:07
jrosserit was pretty slow yesterday when i was making cherry-picks12:13
fricklerack, let me check load + memory12:15
fricklermnaser: I dropped x/ospurge from the cleanup patch above ^^, but it seems https://review.opendev.org/c/x/ospurge/+/872697 is still having some issues, like senlin having been retired in the meantime; I've only deleted it from the zuul config for now12:16
*** iurygregory__ is now known as iurygregory12:17
fungifrickler: jrosser: which operations are slow, specifically? is it git pushes, or page loads, or...?12:26
jrosserI used the gerrit ui to create cherry picks to stable branches, and I feel like it was maybe 30s until the notification on irc happened12:27
fungiokay, so that could be delays on the event stream side then, i suppose12:28
jrosserbut it's hard to judge the actual time retrospectively12:28
jrosserthe page reloads and shows the cherry-pick too after that time12:28
fungiwe do get second resolution in gerrit comments and irc logs, so could compare those to see if there's a significant delay between the two12:28
fricklerfor me it was submitting a new PS with git-review, so git push essentially I guess12:30
fricklerI can check my action in the log while I'm already looking at it12:30
fungii'll be pushing some reviews here in the next few hours as well, so can pay close attention to what performance i'm getting12:31
fricklerINFO  com.google.gerrit.server.git.MultiProgressMonitor : Processing changes: refs: 1, updated: 1 [CONTEXT ratelimit_period="1 MINUTES [skipped: 3]" ]12:35
fricklernot sure what the "skipped: 3" is actually saying, but something like this is getting logged rather often, also with the commit I pushed12:36
opendevreviewJan Gutter proposed zuul/zuul-jobs master: Move minikube out of /tmp  https://review.opendev.org/c/zuul/zuul-jobs/+/92473814:37
leonardo-serranoHi everyone, I'm trying to add an apt source list for the tests on starlingx/distcloud. So far I've come up with an ansible playbook added to the "pre-run" section. Checking zuul logs it seems to me that the "proper" way to handle this would be to use the "configure-mirrors" role. Does anyone have some advice on how to use it? I didn't quite understand the docs14:59
fungileonardo-serrano: you probably want the ensure-package-repositories role: https://zuul-ci.org/docs/zuul-jobs/latest/system-roles.html#role-ensure-package-repositories15:14
fungiuse it in a pre-run playbook so it's done prior to any tasks which might attempt to install packages from that repository15:15
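For reference, a minimal sketch of the wiring fungi describes: attach a repository-setup playbook to the job's pre-run phase. The job variant shown and the playbook path are hypothetical, and the repository definitions for ensure-package-repositories should be taken from the role documentation linked above rather than from this sketch. Note the ordering caveat in the first comment, which is exactly the limitation raised next.

    # Hypothetical child-job wiring; pre-run playbooks run parent-first, so this
    # playbook executes after the base job's pre.yaml (where bindep is invoked).
    - job:
        name: stx-distcloud-tox-py39
        pre-run: playbooks/add-package-repos.yaml

    # playbooks/add-package-repos.yaml -- the repository details go in the
    # variables documented for the role (deliberately not guessed at here).
    - hosts: all
      roles:
        - ensure-package-repositories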
leonardo-serranofungi: Thank you for the feedback. I'm unsure if adding to pre-run would resolve the issue because bindep runs before the pre-run section of the job I'm working on. It seems like the target job first calls opendev.org/opendev/base-jobs/playbooks/base/pre.yaml@master, which calls bindep. I'm taking these logs as reference: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_192/15:24
leonardo-serrano924676/3/check/stx-distcloud-tox-py39/192db38/job-output.txt 15:24
opendevreviewJan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test  https://review.opendev.org/c/zuul/zuul-jobs/+/92475015:34
clarkbas a general rule of thumb I would not use irc bots as an indicator for gerrit slowness :)15:47
jrosseri misspoke there really15:47
jrosserthe web page took an exceedingly long time to refresh showing that the cherry-pick was made15:48
clarkbgot it. As a side note, the version of gerrit we upgraded to recently now allows those web actions to be asynchronous if you opt into that in the config. I think it is probably better to get the signal that things may be slow though15:49
clarkbgit operations on certain repos in particular can be slow. A lot of that is a function of size. I think zuul can take several minutes to merge nova changes for example15:50
clarkbso it's less about absolute values and more about deviation from typical behavior I guess15:50
fricklerinfra-root: in case you haven't seen, nova has some CVE patches to merge, so it would be nice to avoid any disruptions from the opendev side until that is done15:51
clarkback. Probably best to avoid doing the gitea db doctoring too then since that is how people are likely to get the patches15:51
fricklernot sure if that would also be relevant for the gitea work, likely not15:51
fricklerbut can't hurt just to play it safe, too15:52
clarkbfrickler: I think the main risk would be reduced capacity in the git cluster for serving the commits. It's probably safe enough but I'm happy to focus on that tomorrow when I have all day to work through it15:52
clarkb(whereas today there are meetings all morning)15:52
fricklerthat's assuming things will be done tomorrow, which may or may not be optimistic ;)15:53
fungithis time it's just patches for nova, at least16:14
clarkblooks like stable devstack is broken though?16:19
fungialways... *sigh*16:20
fungileonardo-serrano: the zuul console view is a little easier to analyze: https://zuul.opendev.org/t/openstack/build/192db382e8fb4306b31752715fe6a594/console16:23
fungiit looks like bindep is being called in the pre-run playbook for the "unittests" ancestor job, so you could make your own alternative to unittests (and tox) jobs and then use those as the ancestors of the distcloud job definitions16:25
fungibut that does sound a bit duplicative16:25
clarkbfungi: leonardo-serrano ya this seems like a bug. The job should be working if it manages to get to the bindep role via a parent job16:25
clarkbbut I haven't had a chance to look at it yet16:25
fungileonardo-serrano: it's worth noting that bindep isn't really intended to be used for listing packages outside the ones normally supplied officially by distributions, so it doesn't have a means of specifying alternative package repositories presently. it might make more sense to explicitly install those packages in a later pre-run playbook instead of trying to do it with bindep.txt16:26
clarkboh if that's the problem then ya16:28
fungiyes, the problem statement from earlier was that they want to install packages from an alternative repository as part of a job16:28
clarkbif you are using non-standard repos you either need your own jobs to configure them in the correct order so that bindep works, or add the package repos and then the packages as a followup step. I recommend the followup step approach because anyone fetching your code locally and running bindep will be in the same situation as the ci jobs now and it will break for them16:29
clarkbhaving an explicit secondary step ensures that others can more easily reproduce that process without special knowledge needed before running bindep16:29
fungiso probably easier to not use bindep.txt for those packages, and combine the ensure-package-repositories role with package tasks for the packages you need16:29
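To make that concrete, here is a sketch of a pre-run playbook following the explicit-followup-step approach: add the repository, then install the packages from it directly rather than listing them in bindep.txt. It assumes an apt-based node; the repository line, the omitted key handling, and the package names are placeholders.

    - hosts: all
      become: true
      tasks:
        - name: Add the extra apt repository (placeholder URL, key setup omitted)
          ansible.builtin.apt_repository:
            repo: "deb https://example.org/starlingx/debian bookworm main"
            state: present

        - name: Install the packages that come from that repository
          ansible.builtin.apt:
            name:
              - example-package-one
              - example-package-two
            update_cache: true
            state: present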
clarkbhttps://github.com/pypa/setuptools/issues/4483 is the underlying issue16:37
clarkbit's fascinating16:37
fungioh, for the openstack stable branch problems?16:39
clarkbyes16:40
opendevreviewJan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test  https://review.opendev.org/c/zuul/zuul-jobs/+/92475016:40
clarkbhttps://opendev.org/openstack/requirements/src/branch/stable/2023.1/upper-constraints.txt#L339 pins to 21.3 but we need 22.0 or newer16:40
clarkbthe proposed workaround is to disable the package install that trips it... rather than fixing the requirements16:40
fungilooks like it impacted lots and lots of people16:45
fricklerwell bumping requirements for stable branches is also not without possible complications16:58
fricklerI'm still hoping setuptools will find a more backwards compatible solution16:59
clarkbI don't think it is that complicated? We just bump the constraint and if tests pass merge it16:59
opendevreviewJan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test  https://review.opendev.org/c/zuul/zuul-jobs/+/92475016:59
clarkbI'm about to push that change16:59
fricklertests in the reqs repo only cover a small subset of things. also need to consider the possible impact for distros and deployment tooling17:01
fricklerlike you wouldn't notice the dbcounter issue in a reqs patch17:01
clarkbfrickler: this shouldn't affect distros because it's a packaging problem which they solve separately. And things are already broken. We are unlikely to break things worse imo17:01
clarkbremote:   https://review.opendev.org/c/openstack/requirements/+/924764 Update packaging constraint to match 2023.217:02
leonardo-serranoclarkb: fungi: That's a great point on being able to easily reproduce tests locally. In that case I think the current approach of using a custom playbook in pre-run is the way to go. Thanks for the help17:32
opendevreviewJames E. Blair proposed opendev/system-config master: Run zuul-launcher  https://review.opendev.org/c/opendev/system-config/+/92418817:48
*** thuvh1 is now known as thuvh18:20
opendevreviewMerged zuul/zuul-jobs master: prepare-workspace-git: Add allow deleting current branch  https://review.opendev.org/c/zuul/zuul-jobs/+/91753918:21
clarkbthere were issues with our cleanup-run playbook failing while I was out because we didn't have ssh connectivity. That was fixed by having zuul ignore cleanup-run playbook failures?20:35
clarkbI suspect that may still be a problem with the base-jobs update I'm about to push but I guess we can sort it out as we go since it should only affect jobs that are failing anyway.20:36
opendevreviewClark Boylan proposed opendev/base-jobs master: Stop using deprecated cleanup-run playbooks  https://review.opendev.org/c/opendev/base-jobs/+/92478620:42
opendevreviewClark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key into a cleanup playbook  https://review.opendev.org/c/opendev/base-jobs/+/92478720:42
opendevreviewClark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key on all base jobs to cleanup  https://review.opendev.org/c/opendev/base-jobs/+/92478820:42
clarkbthere is a whole stack to capture that. corvus if you have time can you take a look and make sure there aren't any security concerns etc?20:42
opendevreviewClark Boylan proposed openstack/project-config master: Reduce centos-8-stream min ready to 0  https://review.opendev.org/c/openstack/project-config/+/92478920:57
opendevreviewClark Boylan proposed openstack/project-config master: Remove centos-8-stream image uploads and labels  https://review.opendev.org/c/openstack/project-config/+/92479020:57
opendevreviewClark Boylan proposed openstack/project-config master: Stop building centos 8 stream images  https://review.opendev.org/c/openstack/project-config/+/92479120:57
clarkband there are the nodepool cleanups for centos-8-stream20:57
fungiclarkb: the underlying problem, i think, was that host unreachable events during cleanup-run playbooks were being treated like host unreachable events during post-run playbooks, leading to jobs being repeatedly restarted21:01
fungiafter the security fix to do away with the cleanup-run phase and deprecate cleanup-run playbooks21:02
clarkboh right so we stopped retrying but it may still fail. However that isn't an issue because we only run the tasks if we have already failed21:02
clarkbI was trying to figure out why it doesn't seem to happen for all jobs and I think it is due to ssh control persistence. If that playbook starts quickly enough after the last ssh traffic I think it sneaks in and is fine, but if not then you get the extra failed tasks21:03
clarkbanyway that should mean the first proposed change is safe then we can decide if we want to rearrange the order of things to remove the ssh keys at the end of cleanup. I suspect this is more correct (I'm not entirely sure why we remove ssh keys at all honestly but if we do then doing so in the playbook that always runs (cleanup) seems safest)21:04
clarkbbut please do review the security implications of all that and point out if I've missed something important21:04
corvusclarkb: left a discussion comment on that21:07
corvusoh interesting, we don't have a cleanup-ssh21:08
clarkbcorvus: so it would be post then cleanup then post logs? and then the move of the ssh key removal to cleanup would still be fine in that order21:08
clarkb?21:08
clarkbI like that. It isn't something we could do before since cleanup was a separate phase but taking advantage of that now makes sense to me21:09
corvuszuul-base-jobs has remove-build-sshkey in a cleanup playbook21:09
clarkbok cool so the followup changes should be fine too then21:10
clarkblet me push up a new stack with the things reordered slightly21:10
corvusyeah, i think your stack matches the idea in zuul-base-jobs; so that plus the reordering sounds good to me21:11
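A rough sketch of that agreed layout, assuming the post-run cleanup attribute that replaced the deprecated cleanup-run phase; the playbook paths mirror the names used in this discussion but are illustrative rather than the exact opendev/base-jobs contents.

    # Ordering per the discussion: post, then cleanup, then post-logs.
    - job:
        name: base
        post-run:
          - playbooks/base/post.yaml
          - name: playbooks/base/cleanup.yaml
            cleanup: true        # also runs if the build was aborted
          - playbooks/base/post-logs.yaml

    # playbooks/base/cleanup.yaml -- the failure-only stat tasks are omitted.
    # Errors from the ssh key removal are currently ignored (see the
    # ignore_errors discussion below); dropping that would surface them as
    # POST_FAILURE on otherwise successful builds.
    - hosts: all
      roles:
        - remove-build-sshkey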
clarkbcorvus: and if cleanup.yaml fails due to the ssh thing post-logs.yaml should still run right? I'm like 99% sure we don't ever short circuit like that21:14
clarkbwe can skip things at deeper nest levels but we won't skip playbooks at the same level due to failures in one21:14
opendevreviewClark Boylan proposed opendev/base-jobs master: Stop using deprecated cleanup-run playbooks  https://review.opendev.org/c/opendev/base-jobs/+/92478621:14
opendevreviewClark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key into a cleanup playbook  https://review.opendev.org/c/opendev/base-jobs/+/92478721:14
opendevreviewClark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key on all base jobs to cleanup  https://review.opendev.org/c/opendev/base-jobs/+/92478821:14
clarkbI'm going to self approve https://review.opendev.org/c/openstack/project-config/+/924789 to reduce centos-8-stream min ready to 0 since that is a trivial change and will allow us to delete any ready nodes that may be hanging around21:16
corvusclarkb: right, if the job did not abort, then all post playbooks at the same level run and also all cleanups, even if some fail.  if the job did abort then the same is true but only for cleanups.21:19
corvusclarkb: one comment on the second change; also -- i think it might be safer to merge those together (either squashed or in rapid succession) since i think #1 without #2 exacerbates the ssh controlpersist issue21:22
clarkbcorvus: hrm in that case I guess I can make both changes to base-test then we can land both changes against the other two. I can rearrange things21:22
clarkbcorvus: re ignore_errors if the cleanup playbook fails on that task and the job is successful otherwise won't it fail the whole job? That was the reason for the flag according to the comment21:23
opendevreviewMerged openstack/project-config master: Reduce centos-8-stream min ready to 0  https://review.opendev.org/c/openstack/project-config/+/92478921:28
opendevreviewClark Boylan proposed opendev/base-jobs master: Use modern cleanup tooling in base-test  https://review.opendev.org/c/opendev/base-jobs/+/92478621:31
opendevreviewClark Boylan proposed opendev/base-jobs master: Update cleanups for base and base-minimal  https://review.opendev.org/c/opendev/base-jobs/+/92478721:31
clarkbcorvus: ^ I went ahead and kept the comment and ignore_errors for now since they don't hurt anything but we can clean that up if we feel confident failures of that task are unlikely to fail anything further21:32
corvusclarkb: yeah, i think i got that backwards; it was not necessary before but it is necessary now21:32
corvusthat would translate to a POST_FAILURE if it fails21:33
clarkbok so we want to keep it and avoid that state in an otherwise successful job21:33
corvus(but POST_FAILURE won't override failure, and that only runs on failure... so actually i think it is okay to remove it.  but not for the reason i said.  for different reasons.  :)21:33
clarkbcorvus: only the stat tasks run on failure21:34
clarkbthe ssh removal should run always just as it was in post.yaml21:34
corvusright, so the stats: only run when FAILURE is already set, and POST_FAILURE won't override them, so no need to deal with that...21:35
corvusfor the ssh removal -- if the job is SUCCESS up to that point and that fails... shrug.... i guess keeping ignore_errors is okay, but i kind of don't think that POST_FAILURE is bad in that case21:36
corvuspossibly an improvement if it causes us to notice it failing :)21:36
corvusi agree it's the most conservative thing to do now, but i'm like 60% thinking we should still remove it and accept POST_FAILURE there and see what happens.  maybe on friday though? :)21:37
clarkbya let's do that in a followup when things are a bit quieter21:37
clarkbunless we want to just do the whole thing later. I guess that is fine too21:38
opendevreviewJames E. Blair proposed opendev/system-config master: Run zuul-launcher  https://review.opendev.org/c/opendev/system-config/+/92418821:44
clarkbinfra-root please hold off on approving changes to project-config or system-config due to these errors: https://zuul.opendev.org/t/openstack/build/42ba72266e284e8092ba20937b796c6223:22
clarkbit isn't clear to me what state the git repos are in so we want to carefully approve things after we think we've fixed that23:23
clarkbhttps://zuul.opendev.org/t/openstack/build/95bc3906ac124f6088ab7674322e0a2a appears to be similar but with a different failure mode23:25
clarkb`you try to unset an option which does not exist (ret=5),` this is the issue with the second failure. Still not sure what is going on with the first23:28
fungiany idea yet what led to that starting?23:32
clarkbfungi: it's almost certainly this change https://review.opendev.org/c/zuul/zuul-jobs/+/91753923:32
clarkbok the second error is due to the git config var necessary to push to a non-bare repo not being set. In both cases it seems like the vars are not being set in the repo properly23:33
corvusskipped, since /home/zuul/src/opendev.org/opendev/system-config exists for https://zuul.opendev.org/t/openstack/build/95bc3906ac124f6088ab7674322e0a2a/console#1/1/7/bridge01.opendev.org23:33
clarkbya it's due to the creates line on that task23:34
clarkbjust noticed that23:34
clarkbso this change only works for new nodes/repos23:34
clarkbthis is why we only saw this on the infra-prod jobs since bridge isn't a throwaway node23:34
corvusi think that's right23:35
corvusso we should split that shell block23:35
corvusor...23:35
clarkbdo we want to revert and try again or push forward with a fix? I think I've convinced myself that if we fix it the git reset on line 61 here: https://review.opendev.org/c/zuul/zuul-jobs/+/917539/6/roles/prepare-workspace-git/tasks/main.yaml means we should be ok23:35
corvuswrap the first part in a shell check if the dir exists?23:35
corvusi lean toward roll-forward; i'll propose a fix23:36
clarkbthanks I'll be around to review it23:36
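An illustrative reconstruction of the failure mode, not the actual prepare-workspace-git tasks: when a combined shell block is guarded with creates:, every step in it is skipped on a node where the repo already exists (like bridge), so config added to that block in the new change never lands there and a later git config --unset exits 5. The repo_dest variable is a placeholder, and receive.denyCurrentBranch is only an assumption for "the git config var necessary to push to a non-bare repo".

    # Not the real role tasks -- a minimal reconstruction of the pattern.
    - name: Initialize the repo and set push-related config (combined block)
      shell: |
        git init "{{ repo_dest }}"
        # config added to this block later never runs on pre-existing repos
        git -C "{{ repo_dest }}" config receive.denyCurrentBranch ignore
      args:
        creates: "{{ repo_dest }}"   # skips the whole block if the path exists

    # On a long-lived node like bridge the block above is skipped, so this
    # fails with ret=5 ("you try to unset an option which does not exist").
    - name: Drop the temporary push config again after updating the repo
      command: git -C "{{ repo_dest }}" config --unset receive.denyCurrentBranch

    # The roll-forward direction discussed above: guard only the creation step
    # (or check for the directory inside the shell block) so the config
    # handling also runs on existing repos before anything tries to unset it.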
clarkband the reason for the two failure modes: in the case of jobs that rely on project-config, with the change updating project-config, we hit the issue with changing the working tree, so it fails earlier. In the case of just system-config, which has been relatively static today, we don't fail until we try to unset the config option23:38
clarkbchecking bridge: system-config is up to date and the working tree is clean. project-config is one commit behind but also has a clean working tree23:40
clarkball that to say I think we're in a safe steady state at the moment23:41
clarkbin the case of project-config the missing commit is the one to set min-ready for centos-8-stream to 0 so the impact there is very small and it will auto resolve on the hourly run for infra-prod-service-nodepool after this is fixed23:42
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Fix prepare-workspace-git operating on existing repos  https://review.opendev.org/c/zuul/zuul-jobs/+/92480223:43
clarkbcorvus: you should update the test role as well23:44
corvusderp23:44
corvus(also.. i feel like we may be able to drop those at this point)23:44
corvuswhy are they different?23:45
corvushttps://paste.opendev.org/show/b5zGoVD8UD5sbKC2rQeD/23:46
corvusthat chunk at the bottom, lines 39-6923:46
clarkbmaybe that is how the testing works?23:47
clarkbcorvus: looking at bridge we don't have a null remote for origin on system-config23:47
corvusthat's supposed to be a copy of the role for use in base-test jobs23:47
clarkbthat means your proposed fix will create additional deltas to the current behavior I think?23:47
clarkbit could be that way because the null origin was added after that role was used to create the repo and it never ran those tasks against the repo?23:48
corvusclarkb: i wonder why we don't; is it because that role never cloned the repo?  or because we do something else to add it later?23:48
clarkbI don't see anything obvious adding it later in that role but could be something else I suppose23:49
corvusnothing in zuul-jobs should do it; only something in our system-config playbooks23:49
clarkbanyway it does make me wonder if maybe we should keep the change as small as possible just to fix this specific issue then sort that stuff out later?23:49
corvusi can't think of a reason to say that the proposed behavior is wrong or inconsistent with what the role is intended to do23:49
corvusso we could minimize the change, but i'm not sure how we would decide it's okay to do this23:50
clarkbI'm more concerned about breaking existing users including opendev but there may be others23:50
clarkbthat may rely on that info23:50
corvusbut what info? 23:50
clarkbthe origin remote info23:51
corvuslike, someone cloned a repo outside of zuul, then used prepare-workspace-git to update the repo, but expects the origin remote to stay?23:51
clarkbcorvus: I think the fail case is more likely to be that someone set the origin remote after prepare-workspace-git created the repo, and then it would never be changed23:52
clarkbbut I'm still not finding anything that is setting it for opendev so can't say what the situation is here yet23:52
corvusbut if they did that once, their job should do it again, right?23:52
clarkbmaybe? it could've been done manually too I suppose. I'm mostly just trying to call out the behavior change and identify that it is a risk that could break people23:53
clarkbI'm beginning to suspect that what happened for opendev is similar to what I described: someone setting it later in order to fetch something out of band for testing or whatever, and then it being stuck, because I'm not finding automation that does that23:54
corvusright, but i believe the change still causes the role to operate within its specification: "This role uses git operations ... to mirror the locally prepared git repos to the remote nodes"23:54
clarkbthat is orthogonal to my concern though. I'm concerned about the change in behavior; even if the behavior was unexpected before, people may rely on it23:55
clarkblooks like project config also has a remote set that is not null23:55
corvusso i agree that it could break someone that was (somehow) relying on that weird behavior quirk.  based on the docs, i don't think that hypothetical person should have expected to rely on that.23:55
corvusi'm starting to think this conversation is on the wrong channel23:56
corvusperhaps we should move it to #zuul:opendev.org 23:56
clarkbI think I figured it out for opendev fwiw23:57
clarkbwhen we run in the periodic pipelines we want to update from latest tip of master23:58
clarkband not the contents when we were enqueued to zuul23:58
clarkband I think somewhere we have task(s) that update from gitea which may be setting the remote23:58
clarkbI haven't found that code yet and I don't know if the proposed change would break that23:58
