opendevreview | Jens Harbott proposed openstack/project-config master: Drop repos with config errors from x/ namespace https://review.opendev.org/c/openstack/project-config/+/923509 | 12:01 |
frickler | seems gerrit is sometimes rather slow to respond currently | 12:07 |
jrosser | it was pretty slow yesterday when i was making cherry-picks | 12:13 |
frickler | ack, let me check load + memory | 12:15 |
frickler | mnaser: I dropped x/ospurge from the cleanup patch above ^^, but it seems https://review.opendev.org/c/x/ospurge/+/872697 is still having some issues, e.g. senlin has been retired in the meantime; I've only deleted it from the zuul config for now | 12:16 |
*** iurygregory__ is now known as iurygregory | 12:17 | |
fungi | frickler: jrosser: which operations are slow, specifically? is it git pushes, or page loads, or...? | 12:26 |
jrosser | I used the gerrit ui to create cherry picks to stable branches, and I feel like it was maybe 30s until the notification on irc happened | 12:27 |
fungi | okay, so that could be delays on the event stream side then, i suppose | 12:28 |
jrosser | but that’s hard to judge the actual time retrospectively | 12:28 |
jrosser | the page reloads and shows the cherry-pick too after that time | 12:28 |
fungi | we do get second resolution in gerrit comments and irc logs, so could compare those to see if there's a significant delay between the two | 12:28 |
frickler | for me it was submitting a new PS with git-review, so git push essentially I guess | 12:30 |
frickler | I can check my action in the log while I'm already looking at it | 12:30 |
fungi | i'll be pushing some reviews here in the next few hours as well, so can pay close attention to what performance i'm getting | 12:31 |
frickler | INFO com.google.gerrit.server.git.MultiProgressMonitor : Processing changes: refs: 1, updated: 1 [CONTEXT ratelimit_period="1 MINUTES [skipped: 3]" ] | 12:35 |
frickler | not sure what the "skipped: 3" is actually saying, but something like this is getting logged rather often, also with the commit I pushed | 12:36 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Move minikube out of /tmp https://review.opendev.org/c/zuul/zuul-jobs/+/924738 | 14:37 |
leonardo-serrano | Hi everyone, I'm trying to add an apt source list for the tests on starlingx/distcloud. So far I've come up with an ansible playbook added to the "pre-run" section. Checking zuul logs it seems to me that the "proper" way to handle this would be to use the "configure-mirrors" role. Does anyone have some advice on how to use it? I didn't quite understand the docs | 14:59 |
fungi | leonardo-serrano: you probably want the ensure-package-repositories role: https://zuul-ci.org/docs/zuul-jobs/latest/system-roles.html#role-ensure-package-repositories | 15:14 |
fungi | use it in a pre-run playbook so it's done prior to any tasks which might attempt to install packages from that repository | 15:15 |
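Wiring an extra pre-run playbook into a job is done in the job definition itself; a minimal sketch (job and parent names are illustrative, not the actual starlingx job definitions) looks roughly like this:

```yaml
# Hypothetical job variant; the pre-run line is the relevant part.
- job:
    name: stx-distcloud-tox-py39
    parent: tox-py39
    # A child's pre-run playbooks run after all ancestor pre-runs and before
    # the job's run phase, which matters for the bindep ordering discussed below.
    pre-run: playbooks/configure-package-repos.yaml
```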
leonardo-serrano | fungi: Thank you for the feedback. I'm unsure if adding to pre-run would resolve the issue because bindep runs before the pre-run section of the job I'm working on. It seems like the target job first calls opendev.org/opendev/base-jobs/playbooks/base/pre.yaml@master, which calls bindep. I'm taking these logs as reference: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_192/ | 15:24 |
leonardo-serrano | 924676/3/check/stx-distcloud-tox-py39/192db38/job-output.txt | 15:24 |
leonardo-serrano | (Link: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_192/924676/3/check/stx-distcloud-tox-py39/192db38/job-output.txt) | 15:25 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test https://review.opendev.org/c/zuul/zuul-jobs/+/924750 | 15:34 |
clarkb | as a general rule of thumb I would not use irc bots as an indicator for gerrit slowness :) | 15:47 |
jrosser | i misspoke there really | 15:47 |
jrosser | the web page took an exceedingly long time to refresh showing that the cherry-pick was made | 15:48 |
clarkb | got it. As a side note to that the version of gerrit we upgraded to recently now allows for those web actions to be asynchronous if you opt into that in the config. I think it is probably better to get the signal that things may be slow though | 15:49 |
clarkb | git operations on certain repos in particular can be slow. A lot of that is a function of size. I think zuul can take several minutes to merge nova changes for example | 15:50 |
clarkb | so it's less about absolute values and more about deviation from typical behavior I guess | 15:50 |
frickler | infra-root: in case you haven't seen, nova has some CVE patches to merge, so it would be nice to avoid any disruptions from the opendev side until that is done | 15:51 |
clarkb | ack. Probably best to avoid doing the gitea db doctoring too then since that is how people are likely to get the patches | 15:51 |
frickler | not sure if that would also be relevant for the gitea work, likely not | 15:51 |
frickler | but can't hurt just to play it safe, too | 15:52 |
clarkb | frickler: I think the main risk would be reduced capacity to the git cluster for serving the commits. It's probably safe enough but I'm happy to focus on that tomorrow when I have all day to work through it | 15:52 |
clarkb | (whereas today there are meetings all morning) | 15:52 |
frickler | that's assuming things will be done tomorrow, which may or may not be optimistic ;) | 15:53 |
fungi | this time it's just patches for nova, at least | 16:14 |
clarkb | looks like stable devstack is broken though? | 16:19 |
fungi | always... *sigh* | 16:20 |
fungi | leonardo-serrano: the zuul console view is a little easier to analyze: https://zuul.opendev.org/t/openstack/build/192db382e8fb4306b31752715fe6a594/console | 16:23 |
fungi | it looks like bindep is being called in the pre-run playbook for the "unittests" ancestor job, so you could make your own alternative to unittests (and tox) jobs and then use those as the ancestors of the distcloud job definitions | 16:25 |
fungi | but that does sound a bit duplicative | 16:25 |
clarkb | fungi: leonardo-serrano ya this seems like a bug. The job should be working if it manages to get to the bindep role via a parent job | 16:25 |
clarkb | but I haven't had a chance to look at it yet | 16:25 |
fungi | leonardo-serrano: it's worth noting that bindep isn't really intended to be used for listing packages outside the ones normally supplied officially by distributions, so it doesn't have a means of specifying alternative package repositories presently. it might make more sense to explicitly install those packages in a later pre-run playbook instead of trying to do it with bindep.txt | 16:26 |
clarkb | oh if that's the problem then ya | 16:28 |
fungi | yes, the problem statement from earlier was that they want to install packages from an alternative repository as part of a job | 16:28 |
clarkb | if you are using non standard repos you either need your own jobs to configure them in the correct order so that bindep works, or add the package repos then packages as a followup step. I recommend the followup step approach because anyone fetching your code locally and running bindep will be in the same situation as the ci jobs now and it will break for them | 16:29 |
clarkb | having an explicit secondary step ensures that others can more easily reproduce that process without special knowledge needed before running bindep | 16:29 |
fungi | so probably easier to not use bindep.txt for those packages, and combine the ensure-package-repositories role with package tasks for the packages you need | 16:29 |
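A rough sketch of that combined approach using plain Ansible modules (repository URL and package names are placeholders; the ensure-package-repositories role linked above is the more portable way to do the repository half):

```yaml
# Hypothetical pre-run playbook: add the extra apt repository, then explicitly
# install the packages that would otherwise have been listed in bindep.txt.
- hosts: all
  become: true
  tasks:
    - name: Add the extra apt repository
      ansible.builtin.apt_repository:
        repo: "deb https://example.org/starlingx/debian bookworm main"  # placeholder
        state: present

    - name: Install packages from the extra repository
      ansible.builtin.apt:
        name:
          - some-distcloud-dependency  # placeholder package name
        state: present
        update_cache: true
```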
clarkb | https://github.com/pypa/setuptools/issues/4483 is the underlying issue | 16:37 |
clarkb | its fascinating | 16:37 |
fungi | oh, for the openstack stable branch problems? | 16:39 |
clarkb | yes | 16:40 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test https://review.opendev.org/c/zuul/zuul-jobs/+/924750 | 16:40 |
clarkb | https://opendev.org/openstack/requirements/src/branch/stable/2023.1/upper-constraints.txt#L339 pins to 21.3 but need 22.0 or newer | 16:40 |
clarkb | the proposed workaround is to disable the package install that trips it... rather than fixing the requirements | 16:40 |
fungi | looks like it impacted lots and lots of people | 16:45 |
frickler | well bumping requirements for stable branches is also not without possible complications | 16:58 |
frickler | I'm still hoping setuptools will find a more backwards compatible solution | 16:59 |
clarkb | I don't think it is that complicated? We just bump the constraint and if tests pass merge it | 16:59 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: [DNM] triggering a test https://review.opendev.org/c/zuul/zuul-jobs/+/924750 | 16:59 |
clarkb | I'm about to push that change | 16:59 |
frickler | tests in the reqs repo cover only a small subset of things. also need to consider the possible impact for distros and deployment tooling | 17:01 |
frickler | like you wouldn't notice the dbcounter issue in a reqs patch | 17:01 |
clarkb | frickler: this shouldn't affect distros because it's a packaging problem which they solve separately. And things are already broken. We are unlikely to break things worse imo | 17:01 |
clarkb | remote: https://review.opendev.org/c/openstack/requirements/+/924764 Update packaging constraint to match 2023.2 | 17:02 |
leonardo-serrano | clarkb: fungi: That's a great point on being able to easily reproduce tests locally. In that case I think the current approach of using a custom playbook in pre-run is the way to go. Thanks for the help | 17:32 |
opendevreview | James E. Blair proposed opendev/system-config master: Run zuul-launcher https://review.opendev.org/c/opendev/system-config/+/924188 | 17:48 |
*** thuvh1 is now known as thuvh | 18:20 | |
opendevreview | Merged zuul/zuul-jobs master: prepare-workspace-git: Add allow deleting current branch https://review.opendev.org/c/zuul/zuul-jobs/+/917539 | 18:21 |
clarkb | there were issues with our cleanup run playbook failing while I was out because we didn't have ssh connectivity. That was fixed by having zuul ignore cleanup run playbook failures? | 20:35 |
clarkb | I suspect that may still be a problem with the base-jobs update I'm about to push but I guess we can sort it out as we go since it should only affect jobs that are failing anyway. | 20:36 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Stop using deprecated cleanup-run playbooks https://review.opendev.org/c/opendev/base-jobs/+/924786 | 20:42 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key into a cleanup playbook https://review.opendev.org/c/opendev/base-jobs/+/924787 | 20:42 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key on all base jobs to cleanup https://review.opendev.org/c/opendev/base-jobs/+/924788 | 20:42 |
clarkb | there is a whole stack to capture that. corvus if you have time can you take a look and make sure there aren't any security concerns etc? | 20:42 |
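For reviewers following along, the overall shape of the stack is roughly the following (a sketch from memory of the Zuul docs rather than the literal base-jobs diff; the playbook paths and the exact post-run dictionary syntax should be checked against the changes themselves): the deprecated top-level cleanup-run is dropped and the cleanup playbook becomes a post-run entry flagged as cleanup, with the ssh key removal moving into it.

```yaml
# Hedged sketch of the intended base job layout, not the actual opendev/base-jobs change.
- job:
    name: base
    pre-run: playbooks/base/pre.yaml
    post-run:
      - playbooks/base/post.yaml
      - name: playbooks/base/cleanup.yaml   # remove-build-sshkey would live here
        cleanup: true                       # still runs even if the job is aborted
      - playbooks/base/post-logs.yaml
```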
opendevreview | Clark Boylan proposed openstack/project-config master: Reduce centos-8-stream min ready to 0 https://review.opendev.org/c/openstack/project-config/+/924789 | 20:57 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove centos-8-stream image uploads and labels https://review.opendev.org/c/openstack/project-config/+/924790 | 20:57 |
opendevreview | Clark Boylan proposed openstack/project-config master: Stop building centos 8 stream images https://review.opendev.org/c/openstack/project-config/+/924791 | 20:57 |
clarkb | and there are the nodepool cleanups for centos-8-stream | 20:57 |
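The min-ready piece is the simplest of the three; in the nodepool launcher config it is just the label entry (illustrative fragment, not the full project-config file):

```yaml
# nodepool.yaml fragment: stop pre-booting ready nodes for the label so the
# remaining ones can be deleted before the label and images are removed entirely.
labels:
  - name: centos-8-stream
    min-ready: 0
```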
fungi | clarkb: the underlying problem, i think was that host unreachable events during cleanup-run playbooks were being treated like host unreachable events during post-run playbooks, leading to jobs being repeatedly restarted | 21:01 |
fungi | after the security fix to do away with the cleanup-run phase and deprecate cleanup-run playbooks | 21:02 |
clarkb | oh right so we stopped retrying but it may still fail. However that isn't an issue because we only run the tasks if we have already failed | 21:02 |
clarkb | I was trying to figure out why it doesn't seem to happen for all jobs and I think it is due to ssh control persistence. If that playbook starts quickly enough after the last ssh traffic I think it sneaks in and is fine, but if not then you get the extra failed tasks | 21:03 |
clarkb | anyway that should mean the first proposed change is safe, then we can decide if we want to rearrange the order of things to remove the ssh keys at the end of cleanup. I suspect this is more correct (I'm not entirely sure why we remove ssh keys at all honestly but if we do then doing so in the playbook that always runs (cleanup) seems safest) | 21:04 |
clarkb | but please do review the security implications of all that and point out if I've missed something important | 21:04 |
corvus | clarkb: left a discussion comment on that | 21:07 |
corvus | oh interesting, we don't have a cleanup-ssh | 21:08 |
clarkb | corvus: so it would be post then cleanup then post logs? and then the move of the ssh key removal to cleanup would still be fine in that order | 21:08 |
clarkb | ? | 21:08 |
clarkb | I like that. It isn't something we could do before since cleanup was a separate phase but taking advantage of that now makes sense to me | 21:09 |
corvus | zuul-base-jobs has remove-build-sshkey in a cleanup playbook | 21:09 |
clarkb | ok cool so the followup changes should be fine too then | 21:10 |
clarkb | let me push up a new stack with the things reordered slightly | 21:10 |
corvus | yeah, i think your stack matches the idea in zuul-base-jobs; so that plus the reordering sounds good to me | 21:11 |
clarkb | corvus: and if cleanup.yaml fails due to the ssh thing post-logs.yaml should still run right? I'm like 99% sure we don't ever short circuit like that | 21:14 |
clarkb | we can skip things at deeper nest levels but we won't skip playbooks at the same level due to failures in one | 21:14 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Stop using deprecated cleanup-run playbooks https://review.opendev.org/c/opendev/base-jobs/+/924786 | 21:14 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key into a cleanup playbook https://review.opendev.org/c/opendev/base-jobs/+/924787 | 21:14 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Move remove-build-ssh-key on all base jobs to cleanup https://review.opendev.org/c/opendev/base-jobs/+/924788 | 21:14 |
clarkb | I'm going to self approve https://review.opendev.org/c/openstack/project-config/+/924789 to reduce centos-8-stream min ready to 0 since that is a trivial change and will allow us to delete any ready nodes that may be hanging around | 21:16 |
corvus | clarkb: right, if the job did not abort, then all post playbooks at the same level run and also all cleanups, even if some fail. if the job did abort then the same is true but only for cleanups. | 21:19 |
corvus | clarkb: one comment on the second change; also -- i think it might be safer to merge those together (either squashed or in rapid succession) since i think #1 without #2 exacerbates the ssh controlpersist issue | 21:22 |
clarkb | corvus: hrm in that case I guess I can make both changes to base-test then we can land both changes against the other two. I can rearrange things | 21:22 |
clarkb | corvus: re ignore_errors if the cleanup playbook fails on that task and the job is successful otherwise won't it fail the whole job? That was the reason for the flag according to the comment | 21:23 |
opendevreview | Merged openstack/project-config master: Reduce centos-8-stream min ready to 0 https://review.opendev.org/c/openstack/project-config/+/924789 | 21:28 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Use modern cleanup tooling in base-test https://review.opendev.org/c/opendev/base-jobs/+/924786 | 21:31 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Update cleanups for base and base-minimal https://review.opendev.org/c/opendev/base-jobs/+/924787 | 21:31 |
clarkb | corvus: ^ I went ahead and kept the comment and ignore_errors for now since they don't hurt anything but we can clean that up if we feel confident failures of that task are unlikely to fail anything further | 21:32 |
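The pattern being kept looks roughly like this (a minimal sketch; the real playbooks use the remove-build-sshkey role, and the path and marker shown here are assumptions):

```yaml
# Illustrative cleanup task: strip the per-build ssh key, but don't let a
# failure of this one task fail an otherwise successful job.
- hosts: all
  tasks:
    - name: Remove the per-build ssh key from authorized_keys
      ansible.builtin.lineinfile:
        path: /home/zuul/.ssh/authorized_keys   # assumed location
        regexp: 'zuul-build'                    # hypothetical marker for the build key
        state: absent
      ignore_errors: true   # otherwise a failure here would surface as POST_FAILURE
```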
corvus | clarkb: yeah, i think i got that backwards; it was not necessary before but it is necessary now | 21:32 |
corvus | that would translate to a POST_FAILURE if it fails | 21:33 |
clarkb | ok so we want to keep it and avoid that state in an otherwise successful job | 21:33 |
corvus | (but POST_FAILURE won't override failure, and that only runs on failure... so actually i think it is okay to remove it. but not for the reason i said. for different reasons. :) | 21:33 |
clarkb | corvus: only the stat tasks run on failure | 21:34 |
clarkb | the ssh removal should run always just as it was in post.yaml | 21:34 |
corvus | right, so the stats: only run when FAILURE is already set, and POST_FAILURE won't override them, so no need to deal with that... | 21:35 |
corvus | for the ssh removal -- if the job is SUCCESS up to that point and that fails... shrug.... i guess keeping ignore_errors is okay, but i kind of don't think that POST_FAILURE is bad in that case | 21:36 |
corvus | possibly an improvement if it causes us to notice it failing :) | 21:36 |
corvus | i agree it's the most conservative thing to do now, but i'm like 60% thinking we should still remove it and accept POST_FAILURE there and see what happens. maybe on friday though? :) | 21:37 |
clarkb | ya lets do that in a followup when things are a bit quieter | 21:37 |
clarkb | unless we want to just do the whole thing later. I guess that is fine too | 21:38 |
opendevreview | James E. Blair proposed opendev/system-config master: Run zuul-launcher https://review.opendev.org/c/opendev/system-config/+/924188 | 21:44 |
clarkb | infra-root please hold off on approving changes to project-config or system-config due to these errors: https://zuul.opendev.org/t/openstack/build/42ba72266e284e8092ba20937b796c62 | 23:22 |
clarkb | it isn't clear to me what state the git repos are in so we want to carefully approve things after we think we've fixed that | 23:23 |
clarkb | https://zuul.opendev.org/t/openstack/build/95bc3906ac124f6088ab7674322e0a2a appears to be similar but different failure mode | 23:25 |
clarkb | `you try to unset an option which does not exist (ret=5),` this is the issue with the second failure. Still not sure what is going on with the first | 23:28 |
fungi | any idea yet what led to that starting? | 23:32 |
clarkb | fungi: it's almost certainly this change https://review.opendev.org/c/zuul/zuul-jobs/+/917539 | 23:32 |
clarkb | ok the second error is due to the git config var necessary to push to a non-bare repo not being set. In both cases it seems like the vars are not being set in the repo properly | 23:33 |
corvus | skipped, since /home/zuul/src/opendev.org/opendev/system-config exists for https://zuul.opendev.org/t/openstack/build/95bc3906ac124f6088ab7674322e0a2a/console#1/1/7/bridge01.opendev.org | 23:33 |
clarkb | ya it's due to the creates line on that task | 23:34 |
clarkb | just noticed that | 23:34 |
clarkb | so this change is only future compatible with new nodes/repos | 23:34 |
clarkb | this is why we only saw this on the infra-prod jobs since bridge isn't a throwaway node | 23:34 |
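A stripped-down illustration of that failure mode, assuming a task shaped like the one in prepare-workspace-git (the variable name and commands are placeholders; receive.denyCurrentBranch is the standard git option for pushing into a non-bare checkout, and the role's actual commands differ):

```yaml
# Because of "creates", the whole block is skipped when the destination already
# exists -- so repos that predate the change (like those on bridge) never get the
# newly added git config, and later steps that expect it fail.
- name: Initialize repo and set config needed to push into a non-bare checkout
  ansible.builtin.shell: |
    git init "{{ repo_dest }}"
    git -C "{{ repo_dest }}" config receive.denyCurrentBranch ignore
  args:
    creates: "{{ repo_dest }}"   # task does not run at all if this path already exists
```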
corvus | i think that's right | 23:35 |
corvus | so we should split that shell block | 23:35 |
corvus | or... | 23:35 |
clarkb | do we want to revert and try again or push forward with a fix? I think I've convinced myself that if we fix it the git reset on line 61 here: https://review.opendev.org/c/zuul/zuul-jobs/+/917539/6/roles/prepare-workspace-git/tasks/main.yaml means we should be ok | 23:35 |
corvus | wrap the first part in a shell check if the dir exists? | 23:35 |
corvus | i lean toward roll-forward; i'll propose a fix | 23:36 |
clarkb | thanks I'll be around to review it | 23:36 |
clarkb | and the reason for the two failure modes: in the case of jobs that rely on project-config, with the change updating project-config we hit the issue with changing the working tree, so it fails earlier. In the case of just system-config, which has been relatively static today, we don't fail until we try to unset the config option | 23:38 |
clarkb | checking bridge system-config is up to date and working tree is clean. project-config is one commit behind but also has a clean working tree | 23:40 |
clarkb | all that to say I think we're in a safe steady state at the moment | 23:41 |
clarkb | in the case of project-config the missing commit is the one setting centos-8-stream min ready to 0, so the impact there is very small and it will auto resolve on the hourly run for infra-prod-service-nodepool after this is fixed | 23:42 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Fix prepare-workspace-git operating on existing repos https://review.opendev.org/c/zuul/zuul-jobs/+/924802 | 23:43 |
clarkb | corvus: you should update the test role as well | 23:44 |
corvus | derp | 23:44 |
corvus | (also.. i feel like we may be able to drop those at this point) | 23:44 |
corvus | why are they different? | 23:45 |
corvus | https://paste.opendev.org/show/b5zGoVD8UD5sbKC2rQeD/ | 23:46 |
corvus | that chunk at the bottom, lines 39-69 | 23:46 |
clarkb | maybe that is how the testing works? | 23:47 |
clarkb | corvus: looking at bridge we don't have null remote for origin on system-config | 23:47 |
corvus | that's supposed to be a copy of the role for use in base-test jobs | 23:47 |
clarkb | that means your proposed fix will create additional deltas to the current behavior I think? | 23:47 |
clarkb | it could be that way because the null origin was added after that role was used to create the repo and it never ran those tasks against the repo? | 23:48 |
corvus | clarkb: i wonder why we don't; is it because that role never cloned the repo? or because we do something else to add it later? | 23:48 |
clarkb | I don't see anything obvious adding it later in that role but could be something else I suppose | 23:49 |
corvus | nothing in zuul-jobs should do it; only something in our system-config playbooks | 23:49 |
clarkb | anyway it does make me wonder if maybe we should keep the change as small as possible just to fix this specific issue then sort that stuff out later? | 23:49 |
corvus | i can't think of a reason to say that the proposed behavior is wrong or inconsistent with what the role is intended to do | 23:49 |
corvus | so we could minimize the change, but i'm not sure how we would decide it's okay to do this | 23:50 |
clarkb | I'm more concerned about breaking existing users including opendev but there may be others | 23:50 |
clarkb | that may rely on that info | 23:50 |
corvus | but what info? | 23:50 |
clarkb | the origin remote info | 23:51 |
corvus | like, someone cloned a repo outside of zuul, then used prepare-workspace-git to update the repo, but expects the origin remote to stay? | 23:51 |
clarkb | corvus: I think the fail case is more likely to be someone set the origin remote after prepare-workspace-git created the repo then it wuold never be changed | 23:52 |
clarkb | but I'm still not finding anything that is setting it for opendev so can't say what the situation is here yet | 23:52 |
corvus | but if they did that once, their job should do it again, right? | 23:52 |
clarkb | maybe? it could've been done manually too I suppose. I'm mostly just trying to call out the behavior change and identify that it is a risk that could break people | 23:53 |
clarkb | I'm beginning to suspect that what happened for opendev is similar to what I describe of someone setting it later in order to fetch something out of band for testing or whatever and then it being stuck, because I'm not finding automation that does that | 23:54 |
corvus | right, but i believe the change still causes the role to operate within its specification: "This role uses git operations ... to mirror the locally prepared git repos to the remote nodes" | 23:54 |
clarkb | that is orthogonal to my concern though. I'm concerned about the change in behavior; even if the behavior was unexpected before, people may rely on it | 23:55 |
clarkb | looks like project config also has a remote set that is not null | 23:55 |
corvus | so i agree that it could break someone that was (somehow) relying on that weird behavior quirk. based on the docs, i don't think that hypothetical person should have expected to rely on that. | 23:55 |
corvus | i'm starting to think this conversation is on the wrong channel | 23:56 |
corvus | perhaps we should move it to #zuul:opendev.org | 23:56 |
clarkb | I think I figured it out for opendev fwiw | 23:57 |
clarkb | when we run in the periodic pipelines we want to update from latest tip of master | 23:58 |
clarkb | and not the contents when we were enqueued to zuul | 23:58 |
clarkb | and I think somewhere we have task(s) that update from gitea which may be setting the remote | 23:58 |
clarkb | I haven't found that code yet and I don't know if the proposed change would break that | 23:58 |