Wednesday, 2022-02-16

clarkb	ianw: left a couple of notes on the first change in the wheel python stack	00:09
ianw	clarkb: thanks; new version posted	00:14
clarkb	ianw: ok left comments on the gpg encrypted log files stack too	00:43
ianw	clarkb: it was a bit of a conscious choice to not tee the logs there, as i wondered if it was all quite long and not quite necessary in the console output, because it's not formatted that well for that context	00:47
clarkb	ianw: hrm I tend to use them. But I know the ara is there as an alternative (as would the new log files)	00:50
clarkb	I have an old habit of looking at the job output file before anything else. I can be convinced this is a bad idea :)	00:50
ianw	as it changes the status quo, i can put in a "tee" there for now, and propose stopping doing this in a separate change	00:51
clarkb	I'm happy to see if others have a preference	00:51
clarkb	and just remove it if that is what people prefer	00:51
*** rlandy\|ruck is now known as rlandy\|out		00:52
opendevreview	Merged openstack/diskimage-builder master: Update platform support to describe stable testing https://review.opendev.org/c/openstack/diskimage-builder/+/418204	01:22
ianw	thanks for the reviews, i will tweak things after some lunch	01:23
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file https://review.opendev.org/c/zuul/zuul-jobs/+/828818	02:06
Clark[m]	fungi: email indicates your GitHub fix is likely working	02:35
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file https://review.opendev.org/c/zuul/zuul-jobs/+/828818	02:53
opendevreview	Ian Wienand proposed opendev/system-config master: Base work for exporting encrypted logs https://review.opendev.org/c/opendev/system-config/+/828810	03:03
opendevreview	Ian Wienand proposed opendev/system-config master: run-production-playbook: return encrypted logs https://review.opendev.org/c/opendev/system-config/+/829147	03:03
opendevreview	Ian Wienand proposed opendev/system-config master: zuul/run-base.yaml : don't echo test playbooks to console log https://review.opendev.org/c/opendev/system-config/+/829470	03:03
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file https://review.opendev.org/c/zuul/zuul-jobs/+/828818	03:09
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file https://review.opendev.org/c/zuul/zuul-jobs/+/828818	03:19
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file https://review.opendev.org/c/zuul/zuul-jobs/+/828818	03:24
*** ysandeep\|out is now known as ysandeep		05:36
opendevreview	Ian Wienand proposed opendev/system-config master: Base work for exporting encrypted logs https://review.opendev.org/c/opendev/system-config/+/828810	05:40
opendevreview	Ian Wienand proposed opendev/system-config master: run-production-playbook: return encrypted logs https://review.opendev.org/c/opendev/system-config/+/829147	05:40
opendevreview	Ian Wienand proposed opendev/system-config master: zuul/run-base.yaml : don't echo test playbooks to console log https://review.opendev.org/c/opendev/system-config/+/829470	05:40
*** ykarel_ is now known as ykarel		06:49
*** dwhite449 is now known as dwhite44		07:38
*** amoralej\|off is now known as amoralej		08:03
*** jpena\|off is now known as jpena		08:33
*** ysandeep is now known as ysandeep\|lunch		08:59
lourot	fungi, re: github mirroring fix, this worked, thanks a lot!	08:59
fungi	perfect. at least the missing repos were created. content may not show up for them until a new change merges in each	09:00
*** ysandeep\|lunch is now known as ysandeep		10:03
*** pojadhav- is now known as pojadhav		10:25
*** ykarel_ is now known as ykarel		10:31
*** rlandy\|out is now known as rlandy\|ruck		11:08
*** dviroel\|out is now known as dviroel		11:21
opendevreview	Merged openstack/project-config master: Add Rocky Linux to nodepool elements tooling https://review.opendev.org/c/openstack/project-config/+/829405	11:26
opendevreview	Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages https://review.opendev.org/c/zuul/zuul-jobs/+/829533	12:48
opendevreview	Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages https://review.opendev.org/c/zuul/zuul-jobs/+/829533	12:49
*** amoralej is now known as amoralej\|lunch		13:02
*** pojadhav is now known as pojadhav\|brb		13:18
opendevreview	Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Fix for CentOS/RHEL 9 packages https://review.opendev.org/c/zuul/zuul-jobs/+/829533	13:27
frickler	kevinz_: hi, any update on the certificate? I'm still seeing the expiry warnings	13:39
opendevreview	Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Allow overriding package name https://review.opendev.org/c/zuul/zuul-jobs/+/829544	13:48
*** amoralej\|lunch is now known as amoralej		14:01
admin1	hi all.. i am setting up openstack-ansible and got this message	14:16
admin1	You are not building wheels while running role against multiple hosts. This might result in DOS-ing OpenDev infrustructure servers. In order to proceed, please ensure that you have repo servers for selected OS version and architecture. If you want to avoid building wheel on purpose, ensure that you run playbook in serial manner. In case of causing	14:16
admin1	unreasonable load on the opendev.org git servers, your access may be blocked to protect other users and the OpenDev CI infrastructure which are reliant on this service.	14:16
fungi	admin1: please ask in #openstack-ansible	14:19
admin1	they sent me here :D	14:20
fungi	all the opendev sysadmins know is that sometimes openstack-ansible users flood our git servers with repository clone requests from hundreds of systems at once and knock us offline, so the openstack-ansible maintainers added that warning after diagnosing the cause	14:21
opendevreview	Neil Hanlon proposed openstack/project-config master: Add rockylinux-8 to nodepool configuration https://review.opendev.org/c/openstack/project-config/+/828435	14:33
*** pojadhav\|brb is now known as pojadhav		14:51
*** weechat1 is now known as amorin		15:02
*** pojadhav is now known as pojadhavdinner		15:21
*** pojadhavdinner is now known as pojadhav\|dinner		15:21
*** ysandeep is now known as ysandeep\|out		15:26
fungi	another open source videoconferencing platform i hadn't seen before: https://github.com/jangouts/jangouts	15:28
fungi	looks like the underlying webrtc gateway implementation came from meetecho.com which i'd also never heard of	15:29
fungi	part of the pandemic wfh vc explosion bubble, i guess	15:29
*** dviroel is now known as dviroel\|lunch		15:31
*** ysandeep\|out is now known as ysandeep		16:00
fungi	nope, i guessed wrong: "Meetecho was born in 2009 as an official academic spin-off of the University of Napoli Federico II."	16:01
*** ysandeep is now known as ysandeep\|out		16:04
*** rlandy\|ruck is now known as rlandy\|ruck\|mtg		16:22
opendevreview	Szymon Datko proposed zuul/zuul-jobs master: [ensure-python] Allow overriding package name https://review.opendev.org/c/zuul/zuul-jobs/+/829544	16:33
*** rlandy\|ruck\|mtg is now known as rlandy\|ruck		16:37
*** dviroel\|lunch is now known as dviroel		16:39
*** marios is now known as marios\|out		17:04
*** pojadhav\|dinner is now known as pojadhav		17:08
corvus	i'm going to look into the log streaming issue that ianw reported; anything new i should be aware of?	17:31
*** jpena is now known as jpena\|off		17:32
clarkb	not that I know of	17:33
clarkb	I think we've all been distracted by other stuff and haven't had a chance to look at it closer	17:33
clarkb	infra-root I'm approving https://review.opendev.org/c/opendev/system-config/+/829134 https://review.opendev.org/c/opendev/system-config/+/829119 and then https://review.opendev.org/c/openstack/project-config/+/829121 will be ready to land. This last one could use one more review	17:37
fungi	thanks for the heads up, looking now	17:46
clarkb	fungi: re gerrit gitea I think I can technically land that change now. But historically davido has submitted changes for me when he has +2'd them so I'm thinking maybe he wants extra review on these?	17:49
fungi	yeah, that's what i was wondering	17:50
fungi	i didn't notice him adding other requested reviewers, but may have missed it	17:50
clarkb	I can ping davido on their slack and get his opinion so that it is clear	17:50
opendevreview	Merged opendev/system-config master: Remove configuration management for wiki servers https://review.opendev.org/c/opendev/system-config/+/829134	17:58
opendevreview	Merged opendev/system-config master: Stop using puppet repos that will be retired https://review.opendev.org/c/opendev/system-config/+/829119	17:58
clarkb	corvus: https://zuul.opendev.org is non responsive right now and zuul-web is spinning a cpu	18:27
*** pojadhav is now known as pojadhav\|out		18:27
clarkb	is this possibly related to your debugging? Should we restart zuul web?	18:27
fungi	or did he already restart zuul-web and it's still doing its smart-reconfig?	18:28
clarkb	oh maybe?	18:28
clarkb	the process is old but it does appear it is reloading its configs	18:29
fungi	if memory serves, it takes zuul-web 15-20 minutes to restart now that it's been reimplemented as basically another scheduler	18:29
clarkb	tailing the debug log shows it talking about config files	18:29
corvus	i did sigusr2, so yappi is running... maybe that's slowing this all down	18:35
clarkb	ah	18:35
clarkb	yes that could do it	18:35
fungi	from the logs, i suspect we ended up doing a reconfigure with yappi going	18:35
corvus	i hit it again, hopefully it speeds up now	18:35
clarkb	thanks	18:35
corvus	check out the grafana	18:36
corvus	peaked at 200 queued requests	18:36
corvus	back to normal now	18:37
*** rlandy\|ruck is now known as rlandy\|ruck\|mtg		19:00
*** rlandy\|ruck\|mtg is now known as rlandy\|ruck		20:14
opendevreview	Ian Wienand proposed zuul/zuul-jobs master: encrypt-file : role to encrypt a file https://review.opendev.org/c/zuul/zuul-jobs/+/828818	20:37
opendevreview	Jonathan Rosser proposed zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled https://review.opendev.org/c/zuul/zuul-jobs/+/829028	20:38
opendevreview	Jonathan Rosser proposed zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled https://review.opendev.org/c/zuul/zuul-jobs/+/829028	20:39
*** rlandy\|ruck is now known as rlandy\|ruck\|mtg		20:43
clarkb	ianw: if you get a chance can you review https://review.opendev.org/c/opendev/system-config/+/829141 for improved haproxying with gitea?	20:48
clarkb	fungi: looks like davido merged the gerrit changes	20:51
clarkb	fungi: I think we can test it without the depends on now	20:52
fungi	oh, yep! on it	20:53
opendevreview	Jeremy Stanley proposed opendev/system-config master: Use Gitea for Gerrit's code browser URLs https://review.opendev.org/c/opendev/system-config/+/825339	20:59
opendevreview	Jeremy Stanley proposed opendev/system-config master: DNM: Fail our Gerrit testing for an autohold https://review.opendev.org/c/opendev/system-config/+/825396	20:59
fungi	i've set a fresh autohold for the dnm change there and released the previous hold	21:00
clarkb	exciting	21:02
clarkb	and then if we update for that we'll pull in the ls-members fix too	21:02
fungi	yup	21:03
*** rlandy\|ruck\|mtg is now known as rlandy\|ruck		21:04
*** amoralej is now known as amoralej\|off		21:10
ianw	clarkb: why do we need to "verify none" for the production case -- there wouldn't we have valid SSL certificates?	21:20
clarkb	ianw: we do have valid ssl certs in prod. The problem is testing it since you have to provide the ca files as well. Mostly just worried that if we land the change we'll suddenly have no valid backends because testing is difficult	21:21
ianw	clarkb: perhaps as a follow-up we should switch one to do ssl checks in production, and if it's ok, switch the rest?	21:22
fungi	well, even in prod we'd have to point to a ca file	21:22
clarkb	fungi: ya but if they are all verify none but one the ca-file should be pretty non impactful for the other 7	21:23
clarkb	ianw: I think that approach works	21:23
fungi	ianw: keep in mind that this isn't a regression over what we already had with the tcp checks, even before apache, and we're separately checking and alerting on cert validity anyway outside of haproxy	21:23
clarkb	we need to bind mount the /etc/ssl/certs/ca-certificates.crt file into the container then set the config to use it	21:23
clarkb	and if one backend has a sad that isn't the end of the world	21:23
ianw	ahh, i had sort of assumed the LB container would have the certs set up for LE	21:24
ianw	i'm fine with it btw, just thinking through what we could do	21:24
clarkb	ianw: we consume haproxy from upstream and they don't seem to include any certs	21:25
clarkb	which kinda makes sense for the target audience of the image I guess	21:25
fungi	the odds that one of our gitea servers would spontaneously have an invalid cert while the others are fine is fairly low, and having the load balancer make decisions based on cert validity also increases complexity, thus the chances that it might decide to take all the backends out of the pool because of an error somewhere	21:25
clarkb	but we can bind mount it in from the host	21:25
clarkb	fungi: ya also that	21:25
clarkb	it does add more complexity and that is always opportunity for unexpected failure	21:25
*** dviroel is now known as dviroel\|out		21:26
fungi	my primary fear with load balancers is that they'll hit a condition in their check logic which causes them to invalidate all backends	21:26
ianw	fair points, i guess it's more about the LB knowing who it's talking to in the back-end	21:27
fungi	and this is fear borne from experience managing very expensive commercial load balancers for decades in a past life. it definitely happens	21:27
fungi	telling the customer that a minor change to their website caused the load balancer cluster to suddenly decide none of their servers were viable destinations was never fun, and inevitably led them to question why they were even using load balancers if they could cause the site to go offline rather than preventing it	21:29
fungi	so, yeah, simpler checks are better. the complexity of the layer-7 check is a mitigation for the reverse-proxy causing the service to seem up for simpler tcp socket checks when it isn't, but i think we need to carefully weigh any increase in check complexity against the benefits provided	21:32
fungi	in particular, we know that we take the gitea containers offline when the container images are replaced	21:33
fungi	so i think not sending traffic to them under those (fairly frequent) outages is worth the added risk	21:33
clarkb	ya I think we definitely want a check that covers both apache and gitea	21:35
clarkb	currently we only have apache. The proposed change should also cover gitea	21:35
clarkb	looking at the timing of the change landing it appears I'll be starting my walk to get kids from school around when the job should merge. infra-root if you'd like I can put the lb in the emergency file now and remove it and run the playbook when I return	21:45
clarkb	Considering it is tested I'm not too worried about it. But I won't be able to fix it for a little bit after it lands if something goes wrong	21:46
opendevreview	Merged zuul/zuul-jobs master: Allow some configure-mirrors repositories to be disabled https://review.opendev.org/c/zuul/zuul-jobs/+/829028	21:47
fungi	i should be done with dinner by then	21:59
clarkb	cool I'll leave it be then	22:04
opendevreview	Merged opendev/system-config master: Haproxy http checks for Gitea https://review.opendev.org/c/opendev/system-config/+/829141	22:08
clarkb	it will end up behind the hourly jobs. Those take about half an hour iirc. I'll check when the school run is done	22:09
ianw	i'm around, so can watch too	22:14
rcastillo\|rover	Hello. We're running into some issues with our tripleo centos 9 content provider jobs. They're failing on retry and the running jobs don't seem to print any logs	22:19
rcastillo\|rover	https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-9-content-provider&skip=0	22:19
rcastillo\|rover	some issue with stream 9 nodes?	22:20
ianw	rcastillo\|rover: hrm, let me have a look	22:20
ianw	https://zuul.opendev.org/t/openstack/build/ad2af72e30044ef7bd081ff8b035d711 let's check that one, seems the latest	22:20
ianw	Adding node request <NodeRequest 300-0017269841 ['centos-9-stream']> for job <FrozenJob tripleo-ci-centos-9-content-provider> to item <QueueItem 4cd6063e4397459dabb0793b8f36afbf for <Change 0x7f1cc4175250 openstack/tripleo-ansible 828920,3> in check>	22:24
ianw	2022-02-16 22:05:04,927 DEBUG zuul.nodepool: [e: 1a7616b0e3c74112b17d45540797e1ab] Node request <NodeRequest 300-0017269841 ['centos-9-stream']> fulfilled	22:24
ianw	2022-02-16 22:05:04,884 DEBUG nodepool.driver.NodeRequestHandler[nl04.opendev.org-PoolWorker.ovh-bhs1-main-ae1885d0488944b68fb9a380bcdca2b9]: [e: 1a7616b0e3c74112b17d45540797e1ab] [node_request: 300-0017269841] Fulfilled node request	22:32
ianw	nl04 fulfilled the request, and it went to ovh-bhs1	22:32
ianw	2022-02-16 22:04:24,900 ERROR zuul.AnsibleJob: [e: 1a7616b0e3c74112b17d45540797e1ab] [build: e89d10ecb10a4812b1dd9e4ea50fdb1a] Exception while executing job	22:44
ianw	on ze03 ... this might be a clue	22:44
ianw	https://paste.opendev.org/show/bZxLtgjgsH1w3q7Yj4xP/ is the full error	22:48
ianw	2022-02-16 22:04:24,900 ERROR zuul.AnsibleJob: ValueError: SHA b'01e50ed7dac6cd25ce268d01cc457910633ccbf0' could not be resolved, git returned: b'01e50ed7dac6cd25ce268d01cc457910633ccbf0 missing'	22:49
ianw	is the interesting bit	22:49
fungi	corrupt repo cache?	22:49
ianw	corvus: ^ might be able to short-cut my further investigations :)	22:49
ianw	yeah, possibly	22:49
ianw	2022-02-16 22:04:24,899 DEBUG zuul.Repo.Ref: Create reference refs/heads/patchback/backports/stable-4/7f793c83f1aae99e936238dc1873251ac2c358a3/pr-4162 at 01e50ed7dac6cd25ce268d01cc457910633ccbf0 in /var/lib/zuul/builds/e89d10ecb10a4812b1dd9e4ea50fdb1a/work/src/github.com/ansible-collections/community.general/.git	22:50
ianw	this is before it	22:50
ianw	https://review.opendev.org/c/openstack/tripleo-ansible/+/828920/ does not have a depends-on	22:52
ianw	so that's something	22:52
clarkb	looks like haproxy update hasn't happened yet and I'm back so can watch it	22:53
ianw	root@ze03:/var/lib/zuul/executor-git/github.com/ansible-collections/ansible-collections%2Fcommunity.general# git show 01e50ed7dac6cd25ce268d01cc457910633ccbf0	22:54
ianw	fatal: bad object 01e50ed7dac6cd25ce268d01cc457910633ccbf0	22:54
clarkb	ianw: I think we need to check if that object is broken upstream too	22:55
clarkb	if it is then there isn't much we can do. If it isn't then we likely need to take the executor out of rotation, remove the repo and let it repopulate after starting the executor back up again	22:55
clarkb	considering this seems to be somewhat widespread I suspect this isn't a problem on our end but something we are managing to fetch from upstream (but we should confirm that)	22:55
ianw	this must have run several times across executors you'd think	22:56
ianw	yeah, what you said :)	22:56
ianw	[iwienand@fedora19 community.general (main)]$ git show 01e50ed7dac6cd25ce268d01cc457910633ccbf0	22:56
ianw	fatal: bad object 01e50ed7dac6cd25ce268d01cc457910633ccbf0	22:56
clarkb	that is unfortunate for ansible, but I'm not sure we can do anything about it?	22:57
clarkb	eg that should be fixed upstream	22:57
ianw	i guess i will file an issue with ansible, i feel like github may have to get involved	22:57
clarkb	ya it is possible	22:58
clarkb	I guess on our end we might be able to better capture the error somehow	22:58
*** prometheanfire is now known as Guest2		22:59
ianw	i'm not exactly sure how zuul came up with 01e50ed...	23:02
clarkb	looking at the traceback it is doing a setRepostate which sets up all the refs/branches iirc	23:03
ianw	grep -r '01e50ed' * in .git/refs doesn't match anything	23:03
ianw	https://github.com/ansible-collections/community.general/pull/4208 seems related	23:04
*** Guest2 is now known as prometheanfire		23:05
clarkb	the commit for that PR seems to be 01250ed	23:05
clarkb	if I can type it right :)	23:05
ianw	https://github.com/ansible-collections/community.general/pull/4208/commits/01e50ed7dac6cd25ce268d01cc457910633ccbf0	23:06
*** dhill is now known as Guest5		23:06
ianw	the branch was deleted	23:06
*** Guest5 is now known as dhill_		23:06
dhill_	hi	23:07
dhill_	rlandy\|ruck, hi	23:07
ianw	i just want to make sure before blaming github that this is not some generic error you see if you have an old ref	23:07
clarkb	ianw: the thing Zuul is attempting to do there is set all the branches to the right state. So maybe we aren't handling a branch being deleted properly if that is what has happened? However what is odd is that the ref is in the repo so you would've expected to find it anyway	23:08
rlandy\|ruck	dhill_: hey	23:08
clarkb	ianw: how do we know the branch was deleted? Does github tell us that?	23:08
rlandy\|ruck	join the party	23:08
dhill_	rlandy\|ruck, we have too many irc networks	23:09
clarkb	or just that clicking on the branch in the PR is a 404?	23:09
rlandy\|ruck	discussion in progress	23:09
ianw	$ git show aaaaaed7dac6cd25ce268d01cc457910633ccbf0	23:09
ianw	fatal: bad object aaaaaed7dac6cd25ce268d01cc457910633ccbf0	23:09
ianw	so we get the same error with a bogus ref anyway	23:09
ianw	which makes me think it's zuul's fault here, somehow	23:09
ianw	clarkb: i'm looking at the last comment in https://github.com/ansible-collections/community.general/pull/4208	23:09
clarkb	ianw: aha thanks	23:09
clarkb	ianw: as far as zuul doing something wrong. I think the expectation is that Zuul sets every branch to the ref that is pointed to upstream. If zuul somehow missed the delete branch event and then didn't notice the branch was gone when listing branches to set state on that could happen?	23:10
clarkb	I think another thing that makes this confusing is this repo seems to use a rebase not merge method of committing PRs (the sha for the PR commit changed in the target branch and there is no merge commit)	23:11
clarkb	basically the branch and the commit get discarded. And ya maybe it is possible for zuul to get confused in that state	23:11
ianw	./remotes/origin	23:12
ianw	./remotes/origin/patchback	23:12
ianw	./remotes/origin/patchback/backports	23:12
ianw	./remotes/origin/patchback/backports/stable-4	23:12
ianw	./remotes/origin/patchback/backports/stable-4/05c3e0d69f3653a352b5fbf1671ff4e2c0e9c812	23:12
ianw	./remotes/origin/patchback/backports/stable-4/05c3e0d69f3653a352b5fbf1671ff4e2c0e9c812/pr-4136	23:12
ianw	that is what ze03's git cache repo has	23:12
ianw	oh, pr-4136 is ! 4208	23:13
clarkb	and that is a valid branch	23:13
clarkb	ianw: did it log which branch repo state setting it was angry about?	23:14
ianw	let me go back and find the exception to see what's above it more	23:14
clarkb	I'm curious if we asked zuul to try again and see if it works now and this is a race with deleting branches and trying to update them locally	23:15
clarkb	Zuul checks repo and finds PR branch with sha foo, repo deletes branch with sha foo, zuul tries to set branch with sha foo and cannot find sha foo sort of ordering	23:15
ianw	seems to happen between 21:53:16 and 22:04:41 so that's a big window	23:16
ianw	like 8 retries	23:16
dhill_	ianw, so if I "recheck" my failing patch we'll know ?	23:16
dhill_	ianw, https://review.opendev.org/c/openstack/tripleo-heat-templates/+/829610	23:16
ianw	dhill_: yes, it's worth trying at least please	23:16
dhill_	ianw, I just rechecked this one	23:17
ianw	clarkb: that's all we have above exception for this repo -> https://paste.opendev.org/show/bKwUKfj4VfWd5OavNSi8/	23:18
clarkb	829610 appears to have been in the check queue for an hour and 16 minutes already	23:18
dhill_	it fails here	23:19
dhill_	tripleo-ci-centos-9-content-provider https://zuul.opendev.org/t/openstack/build/00999e9311d4423b92df82451a91cbce : RETRY_LIMIT in 23s	23:19
ianw	felixfontein deleted the patchback/backports/stable-4/7f793c83f1aae99e936238dc1873251ac2c358a3/pr-4162 branch 1 hour ago	23:19
clarkb	dhill_: ya but that is from 21:54 and it is 23:19 now	23:19
clarkb	dhill_: and it is in check for an hour and 16 minutes so ~22:02 ish?	23:19
clarkb	if you hover the one hour ago it gives you a proper timestamp. I suspect that this is a race	23:21
ianw	hovering that tells me it was 8:52am local time, which was ... 1hour 28 minutes ago, so UTC 21:52	23:21
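The timestamp arithmetic above can be checked with a short sketch. It assumes the "local time" is UTC+11 (an offset inferred from 8:52am local matching 21:52 UTC, not stated in the log) and uses the 21:53:16 first-failure time quoted earlier; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC+11, inferred from the figures in the chat.
LOCAL = timezone(timedelta(hours=11))

# GitHub hover timestamp for the branch deletion: 8:52am local.
# At UTC+11 that is already 17 Feb locally while the log date is 16 Feb UTC.
deleted_local = datetime(2022, 2, 17, 8, 52, tzinfo=LOCAL)
deleted_utc = deleted_local.astimezone(timezone.utc)

# Start of the failure window quoted in the chat (UTC).
first_failure_utc = datetime(2022, 2, 16, 21, 53, 16, tzinfo=timezone.utc)

# Gap between the deletion (minute resolution only) and the first failure.
gap = first_failure_utc - deleted_utc
```

Because the hover timestamp only has minute resolution, `gap` is an upper bound: the first failure came roughly a minute after the branch was deleted.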
clarkb	ya and the failure was a couple minutes after that. Depending on how long it took zuul to construct a state for the job (which isn't always quick depending on merges etc) we could hit this race	23:21
clarkb	In a repo that merged commits this would be a non issue	23:21
clarkb	you'd have the commit in the repo still and could checkout a branch to that (even if the branch shouldn't exist anymore)	23:22
ianw	ok, that sort of makes sense, but why is it still failing on recheck	23:22
clarkb	I think it is specifically the combo of deleting while zuul is processing state for jobs containing that repo and the repo must use lossy merge process	23:22
clarkb	ianw: I don't think it is?	23:22
clarkb	the change dhill_ linked to is in check and has been since ~22:02 ish	23:23
dhill_	clarkb, yeah it failed with a retry	23:23
clarkb	829610 is currently in check and has been since 22:02	23:23
clarkb	I think the recheck worked and its been running?	23:23
clarkb	I feel like I'm missing something	23:23
ianw	yeah, i agree; tripleo-ci-centos-9-content-provider is https://zuul.opendev.org/t/openstack/stream/3bb4b65e64a34334b7fdbd67fae4cd7f?logfile=console.log and currently paused	23:26
clarkb	infra-root the haproxy config update happened but we don't reload the haproxy config when that happens. We do reload when the docker compose file is updated. I'm going to manually run the graceful reload command. Then push up a change to fix our config management	23:27
fungi	sounds good, thanks	23:27
clarkb	the docker-compose kill -s HUP haproxy command has been run and there is a new process. The old ones go away when they are done serving content iirc. I can still reach opendev.org so that's good	23:29
fungi	same	23:29
opendevreview	Clark Boylan proposed opendev/system-config master: Reload haproxy when its config updates https://review.opendev.org/c/opendev/system-config/+/829615	23:30
*** diablo_rojo_phone is now known as Guest7		23:36
clarkb	dhill_: rlandy\|ruck rcastillo\|rover ianw reporting some info back from the zuul channel. A typical solution to this would be setting https://zuul-ci.org/docs/zuul/latest/tenants.html#attr-tenant.untrusted-projects.%3Cproject%3E.exclude-unprotected-branches however that only helps if the repo in question protects the branches you care about (and I'm not sure how to determine that	23:39
clarkb	here)	23:39
clarkb	https://review.opendev.org/c/zuul/zuul/+/804177/ is an alternative that would help us but it hasn't merged yet. You'd specify which branches you cared about in those repos regardless of their protection strategy	23:40
* rlandy\|ruck reads back		23:43
rlandy\|ruck	hmmm ... good question	23:45
clarkb	corvus notes that the second option presented by 804177 presents its own challenges because now you're getting a window on the repository and if you need something outside that window you may fail too	23:47
clarkb	Really the best option would be to not delete commits :)	23:47
clarkb	but I suspect we won't get very far making that suggestion to people	23:47
rlandy\|ruck	I am not sure I have a good answer on the branches	23:47
dhill_	I don't understand anything of this lol	23:48
dhill_	:/	23:48
rlandy\|ruck	we'd need to set up something like:	23:49
rlandy\|ruck	exclude-unprotected-branches:	23:49
rlandy\|ruck	option	23:49
rlandy\|ruck	but be clear what exact branches we need	23:49
rlandy\|ruck	which is not always constant	23:50
clarkb	dhill_: let me try to explain a different way. For every buildset (collection of jobs) that zuul runs it attempts to configure a consistent repository state for every repository the jobs know about. This can take some time as it has to inspect the git content of all the places the jobs are configured. If one of those repos (like the ansible collection repo) does lossy deletions while	23:50
clarkb	zuul is processing that state this can happen.	23:50
clarkb	Basically people in github are deleting content from their repo while zuul is trying to process that content and that results in brokenness. Rerunning does seem to work though	23:50
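The race described above can be sketched as a toy simulation. This is not Zuul's code: `ToyRepo`, `snapshot_branches`, `delete_branch`, `set_repo_state`, and the SHAs are hypothetical, and the "repository" is just a dict of branches plus a set of known commits. It only illustrates why a lossy branch deletion between the snapshot and the ref update fails, while a rerun or a merge-style repo does not.

```python
class MissingObject(Exception):
    """A recorded SHA is no longer present in the repository."""


class ToyRepo:
    """Toy stand-in for a git repo: branch names map to commit SHAs."""

    def __init__(self, branches):
        self.branches = dict(branches)
        self.objects = set(self.branches.values())

    def snapshot_branches(self):
        # Record the branch -> SHA state that the build should reproduce.
        return dict(self.branches)

    def delete_branch(self, name, lossy):
        sha = self.branches.pop(name)
        # Lossy (rebase-style) workflow: the old commit is discarded too.
        # Merge-style workflow: the commit stays in the repo.
        if lossy:
            self.objects.discard(sha)

    def set_repo_state(self, state):
        # Point every recorded branch at its recorded SHA.
        for branch, sha in state.items():
            if sha not in self.objects:
                raise MissingObject(f"{sha} missing (wanted for {branch})")


def attempt(repo, state):
    try:
        repo.set_repo_state(state)
        return "ok"
    except MissingObject:
        return "failed"


def run(lossy):
    repo = ToyRepo({"main": "aaa111", "pr-branch": "bbb222"})
    state = repo.snapshot_branches()              # snapshot taken
    repo.delete_branch("pr-branch", lossy=lossy)  # upstream deletes the branch
    first = attempt(repo, state)                  # stale snapshot is applied
    retry = attempt(repo, repo.snapshot_branches())  # a rerun takes a fresh snapshot
    return first, retry
```

Here `run(lossy=True)` models the rebase-style repo from the discussion and `run(lossy=False)` models a repo that merges commits, where the old SHA is still available.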
clarkb	rlandy\|ruck: ya that is what 804177 would add to zuul and ya you'd be potentially trying to keep up with upstream changes that way	23:51
rlandy\|ruck	which doesn't seem sustainable	23:51
dhill_	no failures yet	23:51
rlandy\|ruck	considering the number of repos we include	23:51
clarkb	In general I think my suggestion would be that people don't perform lossy operations on their git repos as a good first step. But people do it because they like the cleanliness of it	23:52
rlandy\|ruck	clarkb: ianw: fungi: thanks for following this through	23:52
rlandy\|ruck	and yeah - it's a possible hit for us on any repo	23:52
rlandy\|ruck	at any time	23:52
clarkb	rlandy\|ruck: well only those that allow lossy operations. Those in gerrit shouldn't	23:53
clarkb	I suppose another option here is to stop using those repos via zuul and clone them from github instead	23:53
rlandy\|ruck	that has other challenges	23:53
clarkb	you'd trade lossy operation races for network bw and failures	23:53
rlandy\|ruck	correct	23:53
rlandy\|ruck	we have done that in the past	23:53
fungi	and could no longer depends-on pull requests	23:53
rlandy\|ruck	and decided zuul was far more reliable	23:53
rlandy\|ruck	I think we go forward as is - seems like the most maintainable solution	23:54
clarkb	if you talk to your upstreams they may be willing to protect the branches that you care about then you can set that flag	23:54
clarkb	and/or suggest they stop deleting content :)	23:54
rlandy\|ruck	I'll pick it up tomorrow morning with our internal infra people	23:55
rlandy\|ruck	we need a standard line here wrt repos we intend to include	23:56
rlandy\|ruck	thanks again ... I'm out	23:56
*** rlandy\|ruck is now known as rlandy\|out		23:56
