*** rlandy|ruck is now known as rlandy|out | 00:09 | |
*** dviroel is now known as dviroel|out | 00:10 | |
opendevreview | Merged openstack/project-config master: Update the opendev/system-config tag https://review.opendev.org/c/openstack/project-config/+/819715 | 00:26 |
*** timburke__ is now known as timburke | 00:33 | |
opendevreview | Merged openstack/project-config master: Fix Neutron periodic dashboard https://review.opendev.org/c/openstack/project-config/+/820912 | 00:34 |
opendevreview | Merged openstack/project-config master: Add rights to neutron-dynamic-routing-stable-maint https://review.opendev.org/c/openstack/project-config/+/820351 | 00:41 |
*** raukadah is now known as chandankumar | 04:43 | |
*** ysandeep|out is now known as ysandeep | 04:50 | |
opendevreview | yatin proposed openstack/project-config master: Fix Neutron periodic dashboard https://review.opendev.org/c/openstack/project-config/+/820980 | 06:37 |
*** bhagyashris_ is now known as bhagyashris | 06:57 | |
*** bhagyashris_ is now known as bhagyashris | 07:19 | |
*** ysandeep is now known as ysandeep|lunch | 07:23 | |
opendevreview | Merged openstack/project-config master: Fix Neutron periodic dashboard https://review.opendev.org/c/openstack/project-config/+/820980 | 08:29 |
*** ysandeep|lunch is now known as ysandeep | 08:35 | |
*** ykarel_ is now known as ykarel | 09:21 | |
*** ysandeep is now known as ysandeep|afk | 10:11 | |
*** dviroel|out is now known as dviroel | 10:38 | |
*** rlandy|out is now known as rlandy|ruck | 11:05 | |
*** ysandeep|afk is now known as ysandeep | 11:16 | |
*** jcapitao is now known as jcapitao_lunch | 12:02 | |
*** ysandeep is now known as ysandeep|brb | 12:49 | |
*** outbrito_ is now known as outbrito | 13:02 | |
*** ysandeep|brb is now known as ysandeep | 13:07 | |
*** jcapitao_lunch is now known as jcapitao | 13:34 | |
*** ysandeep is now known as ysandeep|dinner | 13:49 | |
opendevreview | daniel.pawlik proposed openstack/ci-log-processing master: Convert max-skipped parameter to int https://review.opendev.org/c/openstack/ci-log-processing/+/820848 | 13:50 |
*** ykarel is now known as ykarel|away | 14:07 | |
*** dviroel is now known as dviroel|lunch | 14:56 | |
slaweq | Hi infra team | 15:13 |
slaweq | I want to ask about one potential improvement in zuul | 15:13 |
slaweq | in Neutron team we were thinking how to reduce the number of rechecks on patches, and the resources used by neutron | 15:14 |
slaweq | and one of the potential improvement could be if maybe jobs which finish with POST_FAILURE could be automatically retried | 15:14 |
slaweq | or if we could recheck only such POST_FAILURE jobs | 15:15 |
slaweq | as in most of the cases when job will finish with POST_FAILURE it's not really related to the patch itself | 15:15 |
slaweq | and it should be safe to not recheck everything else in such case | 15:15 |
slaweq | wdyt about it? would it be doable maybe? | 15:16 |
fungi | we do automatically rerun builds which fail in a pre-run playbook, so rerunning builds which fail in a post-run playbook probably wouldn't be that different, except that for consistent failures of that sort you'd potentially wait far longer for a retry_limit result. one problem i foresee is that failures in the run playbook are often followed by failures in post-run (run didn't create | 15:25 |
fungi | some artifact which is collected at the end of the job, for example) so this would potentially hide such error conditions | 15:25 |
fungi | also if it were implemented the same way as how pre-run failures are caught, i think that's a global behavior of the scheduler so would affect all jobs for all projects in all tenants | 15:26 |
*** ysandeep|dinner is now known as ysandeep | 15:45 | |
slaweq | fungi regarding failures in RUN phase, I think that if there are such errors, then job finishes with "FAILED" not with "POST_FAILURE" | 16:10 |
slaweq | but I agree that it could potentially hide some other errors which happened earlier | 16:10 |
slaweq | so maybe there would be way to recheck on jobs which ended up in POST_FAILURE state | 16:11 |
slaweq | that would save at least some infra resources in some "recheck" cases | 16:11 |
*** dviroel|lunch is now known as dviroel | 16:13 | |
clarkb | slaweq: it's more nuanced than that. If the run failure induces failure in post then you get a post failure. This is very common | 16:16 |
clarkb | Since post-run tends to process outputs of run and if run fails to produce those outputs properly this happens | 16:16 |
Reed_ | dpawlik How's it going? Any luck connecting to OpenSearch? | 16:18 |
slaweq | clarkb I see, but that's why I'm asking if it would be maybe possible to recheck only jobs which ended up like that, to not recheck "selectively" always, but at least in this specific scenario. Maybe e.g. allowed only for core team, I don't know | 16:18 |
slaweq | if it's not possible, then it's fine too for me | 16:19 |
slaweq | at least I'll have answer for that 😀 | 16:19 |
clarkb | slaweq: I addressed that in the mailing list thread | 16:19 |
clarkb | it is a bad idea for a number of reasons. Most importantly it circumvents "clean check" | 16:19 |
slaweq | clarkb I totally agree that it's bad idea in general | 16:19 |
slaweq | but I was hoping to maybe have exception only for such POST_FAILURE jobs | 16:20 |
slaweq | if it's bad idea too, that's fine | 16:20 |
clarkb | slaweq: I'm not sure that POST_FAILURE changes anything for why it is a bad idea. | 16:20 |
clarkb | slaweq: I wrote this in the emails but basically where I stand is anything that doesn't fix or remove errors is only going to accelerate the problems not make them better | 16:21 |
fungi | slaweq: i think that would be very hard (likely intractably hard) for zuul to determine afterward, given the nature of job dependencies some may have been skipped so wouldn't actually represent a post_failure result for example, or dependencies may need to get rerun (consider a build ending in post_failure which needs an image registry being run by another job it depends on) | 16:21 |
clarkb | slaweq: if you ignore errors and retry until you pass you allow more errors into the system quicker | 16:22 |
clarkb | this is why clean check exists. We did a bit of analysis on a number of gate breaking issues and found significant numbers of them were rechecked and forced through | 16:24 |
clarkb | I think a better approach is to fix the errors. And if the scope is too large to fix then reduce the scope | 16:24 |
clarkb | slaweq: looking at the last 4 neutron jobs with POST_FAILURE results, 3 of them are related to subunit processing not correctly setting the command to run. So it's executing the command arguments with no command prefix and failing | 16:35 |
clarkb | slaweq: the fourth has no logs which implies complete network connectivity loss (could be the job mangled the network stack or the kernel panicked etc) | 16:36 |
clarkb | I'll look at the subunit thing today | 16:36 |
clarkb | slaweq: remote: https://review.opendev.org/c/zuul/zuul-jobs/+/821101 Try to fix broken stestr command discovery is the change | 17:04 |
*** rlandy|ruck is now known as rlandy|ruck|mtg | 17:09 | |
*** ysandeep is now known as ysandeep|out | 17:10 | |
*** rlandy|ruck|mtg is now known as rlandy|ruck | 18:05 | |
*** tobias-urdin3 is now known as tobias-urdin | 20:10 | |
*** aluria is now known as Guest8008 | 20:13 | |
slaweq | clarkb ok, I will check them. Thx | 20:52 |
*** rlandy|ruck is now known as rlandy|ruck|bbl | 23:35 | |
*** ysandeep|out is now known as ysandeep | 23:51 |
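
The retry discussion above (fungi at 15:25, clarkb at 16:16) can be illustrated with a toy model. This is only a sketch and not Zuul's actual implementation: the names `Phase`, `build_result`, `final_result`, `max_attempts`, and `retry_post_failure` are hypothetical, the default of three attempts is an assumption, and the result strings simply mirror those mentioned in the log. It shows why automatically retrying POST_FAILURE could hide an earlier failure in the run phase.

```python
from enum import Enum


class Phase(Enum):
    PRE_RUN = "pre-run"
    RUN = "run"
    POST_RUN = "post-run"


def build_result(failed_phases):
    """Map the set of phases that failed in one attempt to a result string."""
    if Phase.PRE_RUN in failed_phases:
        return "RETRY"
    if Phase.POST_RUN in failed_phases:
        # Reported even when the run phase failed as well, which is
        # the masking problem described in the discussion.
        return "POST_FAILURE"
    if Phase.RUN in failed_phases:
        return "FAILURE"
    return "SUCCESS"


def final_result(attempts, max_attempts=3, retry_post_failure=False):
    """Return the reported result for a build.

    attempts: list of sets of failed phases, one set per attempt, in order.
    The caller is expected to supply an outcome for every attempt that
    could be made; if every supplied attempt is retried, RETRY_LIMIT
    is returned.
    """
    for failed in attempts[:max_attempts]:
        result = build_result(failed)
        retry = result == "RETRY" or (
            retry_post_failure and result == "POST_FAILURE"
        )
        if not retry:
            return result
    return "RETRY_LIMIT"


# A run failure that also breaks post-run is reported as POST_FAILURE.
broken = {Phase.RUN, Phase.POST_RUN}
print(final_result([broken]))
# With the hypothetical retry_post_failure switch, a consistent failure
# of this kind is retried until the limit is reached.
print(final_result([broken] * 3, retry_post_failure=True))
```

In this model a consistently broken job consumes every attempt before reporting anything, which matches the concern about waiting far longer for a retry_limit result.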