Wednesday, 2021-12-08

*** rlandy|ruck is now known as rlandy|out		00:09
*** dviroel is now known as dviroel|out		00:10
opendevreview	Merged openstack/project-config master: Update the opendev/system-config tag https://review.opendev.org/c/openstack/project-config/+/819715	00:26
*** timburke__ is now known as timburke		00:33
opendevreview	Merged openstack/project-config master: Fix Neutron periodic dashboard https://review.opendev.org/c/openstack/project-config/+/820912	00:34
opendevreview	Merged openstack/project-config master: Add rights to neutron-dynamic-routing-stable-maint https://review.opendev.org/c/openstack/project-config/+/820351	00:41
*** raukadah is now known as chandankumar		04:43
*** ysandeep|out is now known as ysandeep		04:50
opendevreview	yatin proposed openstack/project-config master: Fix Neutron periodic dashboard https://review.opendev.org/c/openstack/project-config/+/820980	06:37
*** bhagyashris_ is now known as bhagyashris		06:57
*** bhagyashris_ is now known as bhagyashris		07:19
*** ysandeep is now known as ysandeep|lunch		07:23
opendevreview	Merged openstack/project-config master: Fix Neutron periodic dashboard https://review.opendev.org/c/openstack/project-config/+/820980	08:29
*** ysandeep|lunch is now known as ysandeep		08:35
*** ykarel_ is now known as ykarel		09:21
*** ysandeep is now known as ysandeep|afk		10:11
*** dviroel|out is now known as dviroel		10:38
*** rlandy|out is now known as rlandy|ruck		11:05
*** ysandeep|afk is now known as ysandeep		11:16
*** jcapitao is now known as jcapitao_lunch		12:02
*** ysandeep is now known as ysandeep|brb		12:49
*** outbrito_ is now known as outbrito		13:02
*** ysandeep|brb is now known as ysandeep		13:07
*** jcapitao_lunch is now known as jcapitao		13:34
*** ysandeep is now known as ysandeep|dinner		13:49
opendevreview	daniel.pawlik proposed openstack/ci-log-processing master: Convert max-skipped parameter to int https://review.opendev.org/c/openstack/ci-log-processing/+/820848	13:50
*** ykarel is now known as ykarel|away		14:07
*** dviroel is now known as dviroel|lunch		14:56
slaweq	Hi infra team	15:13
slaweq	I want to ask about one potential improvement in zuul	15:13
slaweq	in the Neutron team we were thinking about how to reduce the number of rechecks on patches, and the resources used by neutron	15:14
slaweq	and one potential improvement could be if jobs which finish with POST_FAILURE were automatically retried	15:14
slaweq	or if we could recheck only such POST_FAILURE jobs	15:15
slaweq	as in most cases when a job finishes with POST_FAILURE, it's not really related to the patch itself	15:15
slaweq	and it should be safe not to recheck everything else in such a case	15:15
slaweq	wdyt about it? would it be doable maybe?	15:16
fungi	we do automatically rerun builds which fail in a pre-run playbook, so rerunning builds which fail in a post-run playbook probably wouldn't be that different, except that for consistent failures of that sort you'd potentially wait far longer for a retry_limit result. one problem i foresee is that failures in the run playbook are often followed by failures in post-run (run didn't create	15:25
fungi	some artifact which is collected at the end of the job, for example) so this would potentially hide such error conditions	15:25
fungi	also if it were implemented the same way as how pre-run failures are caught, i think that's a global behavior of the scheduler so would affect all jobs for all projects in all tenants	15:26
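fungi's point about retries can be made concrete with a small simulation. This is a hypothetical Python model, not Zuul's scheduler code: it assumes pre-run failures are always retried up to `attempts` times (Zuul's job attribute of that name defaults to 3) and that the `retry_post` flag, which is invented here, would extend the same treatment to post-run. A post-run playbook that fails every time then costs three full builds before a RETRY_LIMIT result is reported, instead of one build and an immediate POST_FAILURE.

```python
from typing import Callable, Optional, Tuple


def run_with_retries(
    run_build: Callable[[int], Optional[str]],
    retry_post: bool = False,
    attempts: int = 3,
) -> Tuple[str, int]:
    """Hypothetical model of build retries, not Zuul's scheduler code.

    run_build(attempt) returns the phase that failed ("pre-run", "run" or
    "post-run"), or None if the build succeeded.
    Returns (result, number_of_builds_started).
    """
    for attempt in range(1, attempts + 1):
        failed_phase = run_build(attempt)
        if failed_phase is None:
            return "SUCCESS", attempt
        # Pre-run failures are always retried; post-run only if opted in.
        if failed_phase == "pre-run" or (retry_post and failed_phase == "post-run"):
            continue
        return ("POST_FAILURE" if failed_phase == "post-run" else "FAILURE"), attempt
    # Every attempt hit a retryable failure.
    return "RETRY_LIMIT", attempts


def post_run_always_fails(attempt: int) -> Optional[str]:
    # e.g. post-run cannot collect an artifact that run never created
    return "post-run"


print(run_with_retries(post_run_always_fails))                   # ('POST_FAILURE', 1)
print(run_with_retries(post_run_always_fails, retry_post=True))  # ('RETRY_LIMIT', 3)
```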
*** ysandeep|dinner is now known as ysandeep		15:45
slaweq	fungi regarding failures in the RUN phase, I think that if there are such errors, then the job finishes with "FAILED", not with "POST_FAILURE"	16:10
slaweq	but I agree that it could potentially hide some other errors which happened earlier	16:10
slaweq	so maybe there could be a way to recheck only jobs which ended up in POST_FAILURE state	16:11
slaweq	that would save at least some infra resources in some "recheck" cases	16:11
*** dviroel|lunch is now known as dviroel		16:13
clarkb	slaweq: it's more nuanced than that. If the run failure induces failure in post then you get a post failure. This is very common	16:16
clarkb	Since post-run tends to process outputs of run, if run fails to produce those outputs properly this happens	16:16
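clarkb's point that a run failure can surface as POST_FAILURE follows from how the final result is chosen. The sketch below is a simplified, hypothetical Python model, not Zuul's executor code: it assumes post-run always executes, even after run fails, and that a failing post-run overrides a FAILURE from run, which is the precedence clarkb describes. The artifact name `testr_results.html` is only illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Build:
    """Hypothetical build state, for illustration only."""
    artifacts: List[str] = field(default_factory=list)


def run_phase(build: Build, tests_pass: bool, crash_early: bool) -> bool:
    """Run phase: a crash stops before the results file is written."""
    if crash_early:
        return False
    build.artifacts.append("testr_results.html")
    return tests_pass


def post_phase(build: Build) -> bool:
    """Post-run phase: collecting a missing artifact fails the playbook."""
    return "testr_results.html" in build.artifacts


def final_result(tests_pass: bool, crash_early: bool) -> str:
    build = Build()
    run_ok = run_phase(build, tests_pass, crash_early)
    post_ok = post_phase(build)  # post-run runs even after a run failure
    if not post_ok:
        return "POST_FAILURE"  # assumed to take precedence over FAILURE
    return "SUCCESS" if run_ok else "FAILURE"


print(final_result(tests_pass=True, crash_early=False))   # SUCCESS
print(final_result(tests_pass=False, crash_early=False))  # FAILURE
print(final_result(tests_pass=False, crash_early=True))   # POST_FAILURE
```

In the last case the real problem happened in run, yet the reported result points at post-run, which is why POST_FAILURE alone does not show that a failure is unrelated to the patch.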
Reed_	dpawlik How's it going? Any luck connecting to OpenSearch?	16:18
slaweq	clarkb I see, but that's why I'm asking if it would maybe be possible to recheck only jobs which ended up like that: not to allow "selective" rechecks always, but at least in this specific scenario. Maybe e.g. allowed only for the core team, I don't know	16:18
slaweq	if it's not possible, then that's fine with me too	16:19
slaweq	at least I'll have an answer for that :	16:19
clarkb	slaweq: I addressed that in the mailing list thread	16:19
slaweq	😀	16:19
clarkb	it is a bad idea for a number of reasons. Most importantly it circumvents "clean check"	16:19
slaweq	clarkb I totally agree that it's a bad idea in general	16:19
slaweq	but I was hoping to maybe have an exception only for such POST_FAILURE jobs	16:20
slaweq	if it's a bad idea too, that's fine	16:20
clarkb	slaweq: I'm not sure that POST_FAILURE changes anything for why it is a bad idea.	16:20
clarkb	slaweq: I wrote this in the emails, but basically where I stand is that anything that doesn't fix or remove errors is only going to accelerate the problems, not make them better	16:21
fungi	slaweq: i think that would be very hard (likely intractably hard) for zuul to determine afterward: given the nature of job dependencies, some builds may have been skipped and so wouldn't actually represent a post_failure result, for example, or dependencies may need to get rerun (consider a build ending in post_failure which needs an image registry being run by another job it depends on)	16:21
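fungi's dependency concern can be illustrated with a hypothetical graph walk. Assume a buildset is a mapping from job name to result plus a mapping from each job to the jobs it depends on; all job names below are made up. The sketch conservatively assumes that everything a rerun job depends on must also run again (for instance a paused job serving an image registry, which is gone once the buildset ends), and that jobs SKIPPED because of a rerun job never produced a result at all. A naive "rerun only POST_FAILURE" filter would miss both groups.

```python
from typing import Dict, List, Set


def jobs_to_rerun(
    results: Dict[str, str],
    depends_on: Dict[str, List[str]],
) -> Set[str]:
    """Hypothetical sketch: jobs a selective POST_FAILURE rerun would need."""
    rerun = {job for job, result in results.items() if result == "POST_FAILURE"}
    changed = True
    while changed:  # repeat until no new job is added (fixed point)
        changed = False
        for job in list(results):
            parents = depends_on.get(job, [])
            if job in rerun:
                # Upward: anything a rerun job depends on must run again.
                needed = set(parents) - rerun
                if needed:
                    rerun |= needed
                    changed = True
            elif results[job] == "SKIPPED" and any(p in rerun for p in parents):
                # Downward: jobs skipped because of a rerun job never ran.
                rerun.add(job)
                changed = True
    return rerun


results = {
    "build-image": "SUCCESS",
    "image-registry": "SUCCESS",
    "functional": "POST_FAILURE",
    "functional-child": "SKIPPED",
    "unit": "SUCCESS",
}
depends_on = {
    "image-registry": ["build-image"],
    "functional": ["image-registry"],
    "functional-child": ["functional"],
}
print(sorted(jobs_to_rerun(results, depends_on)))
# ['build-image', 'functional', 'functional-child', 'image-registry']
```

Only `unit` escapes the rerun here, so even this simple case ends up repeating most of the buildset.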
clarkb	slaweq: if you ignore errors and retry until you pass, you allow more errors into the system quicker	16:22
clarkb	this is why clean check exists. We did a bit of analysis on a number of gate-breaking issues and found significant numbers of them were rechecked and forced through	16:24
clarkb	I think a better approach is to fix the errors. And if the scope is too large to fix, then reduce the scope	16:24
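The arithmetic behind "retry until you pass" is easy to sketch. Assume, hypothetically, that a change introduces a bug which makes a job fail on a fraction `fail_rate` of runs, independently each time. The chance that the change still gets a passing result within `attempts` runs is then 1 - fail_rate ** attempts. The rates used here are illustrative and do not come from the analysis clarkb mentions.

```python
def chance_bug_slips_through(fail_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent runs passes
    even though the change makes the job fail with probability `fail_rate`."""
    if not 0.0 <= fail_rate <= 1.0 or attempts < 1:
        raise ValueError("fail_rate must be in [0, 1] and attempts >= 1")
    return 1.0 - fail_rate ** attempts


# A bug that breaks the job half the time gets through one run 50% of the
# time, but with up to three rechecks it gets through 87.5% of the time.
for k in (1, 2, 3):
    print(k, chance_bug_slips_through(0.5, k))
```

Only a bug that fails every run (fail_rate of 1.0) is reliably stopped, which is why rechecking intermittent failures lets more errors in.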
clarkb	slaweq: looking at the last 4 neutron jobs with POST_FAILURE results, 3 of them are related to subunit processing not correctly setting the command to run. So it's executing the command arguments with no command prefix and failing	16:35
clarkb	slaweq: the fourth has no logs, which implies complete network connectivity loss (could be the job mangled the network stack or the kernel panicked etc)	16:36
clarkb	I'll look at the subunit thing today	16:36
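The subunit failure clarkb found at 16:35 (the command's arguments being executed with no command in front of them) is a general pitfall when a command is discovered at runtime. The sketch below is hypothetical Python, not the zuul-jobs role or the fix in the change linked at 17:04: it looks for a candidate executable with `shutil.which` and refuses to build an argv when nothing is found, instead of silently running the arguments as if they were the command. The candidate names and the `last --subunit` arguments are illustrative.

```python
import shutil
from typing import List, Optional, Sequence


def find_command(candidates: Sequence[str]) -> Optional[str]:
    """Return the path of the first candidate executable found, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_argv(candidates: Sequence[str], args: Sequence[str]) -> List[str]:
    """Build a full argv, failing loudly if no command could be discovered."""
    command = find_command(candidates)
    if command is None:
        # Without this check, an empty command would leave ["last", "--subunit"]
        # to be executed as though "last" were the program to run.
        raise FileNotFoundError(f"none of {list(candidates)} found on PATH")
    return [command, *args]
```

A call such as `build_argv(["stestr", "testr"], ["last", "--subunit"])` then either returns a complete argv or raises, so a broken discovery step shows up as a clear error rather than a confusing POST_FAILURE.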
clarkb	slaweq: https://review.opendev.org/c/zuul/zuul-jobs/+/821101 "Try to fix broken stestr command discovery" is the change	17:04
*** rlandy|ruck is now known as rlandy|ruck|mtg		17:09
*** ysandeep is now known as ysandeep|out		17:10
*** rlandy|ruck|mtg is now known as rlandy|ruck		18:05
*** tobias-urdin3 is now known as tobias-urdin		20:10
*** aluria is now known as Guest8008		20:13
slaweq	clarkb ok, I will check them. Thx	20:52
*** rlandy|ruck is now known as rlandy|ruck|bbl		23:35
*** ysandeep|out is now known as ysandeep		23:51

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!