*** rlandy is now known as rlandy|out | 01:00 | |
*** ysandeep|out is now known as ysandeep | 05:55 | |
*** jpena|off is now known as jpena | 07:18 | |
slaweq | fungi hi, can You help me again with tobiko-upload-git-mirror job? I added ssh key encrypted with zuul-client in tobiko repo: https://zuul-ci.org/docs/zuul/latest/project-config.html#encryption | 08:09 |
slaweq | but job is still failing without any logs: https://zuul.opendev.org/t/openstack/build/9fd7eeac81354a23a0e09d1dd5a8e150/logs | 08:09 |
slaweq | can You check for me what can be wrong there still? | 08:09 |
slaweq | thx in advance | 08:09 |
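For context, a minimal sketch of the secret-encryption step being discussed. The tenant, project, secret, and file names below are placeholders; the exact flags should be checked against the zuul-client documentation linked above.

```shell
# zuul-client fetches the project's public key from the Zuul API and prints
# an encrypted blob suitable for pasting into the project's secrets YAML.
zuul-client --zuul-url https://zuul.opendev.org encrypt \
    --tenant openstack \
    --project x/tobiko \
    --secret-name git_mirror_ssh_key \
    --field-name ssh_private_key \
    --infile ~/.ssh/id_mirror \
    --outfile secret.yaml

# A truncated or re-wrapped ciphertext in the committed YAML typically shows
# up at runtime as "ValueError: Encryption/decryption failed", so the pasted
# blob should match the generated output exactly.
```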
*** ysandeep is now known as ysandeep|afk | 08:17 | |
* frickler not fungi, but taking a look at logs | 08:33 | |
slaweq | frickler thx a lot :) | 09:01 |
frickler | slaweq: still seeing the same: ValueError: Encryption/decryption failed. | 10:12 |
*** rlandy|out is now known as rlandy | 10:24 | |
frickler | slaweq: looking at https://opendev.org/x/tobiko/commit/ae6deb883f6ce87c2d069013d94c753af793974a the new secret seems much shorter than the previous one, not sure if that might be legit or is an indication of corrupted data | 10:26 |
slaweq | frickler yeah, maybe I did something wrong first time | 10:30 |
slaweq | I will try to merge that new patch and check then | 10:30 |
*** ysandeep|afk is now known as ysandeep | 10:42 | |
*** rlandy is now known as rlandy|mtg | 11:09 | |
yadnesh | hello all, could someone take a look at this job https://zuul.opendev.org/t/openstack/build/c39ca54a9167461fb974033a1846a3dc/logs | 11:19 |
yadnesh | it seems to be timing out while downloading pip modules RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master] | 11:19 |
opendevreview | Merged openstack/openstack-zuul-jobs master: Remove pike branch from periodic-stable templates https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/851613 | 11:19 |
*** yadnesh is now known as yadnesh|afk | 11:36 | |
fungi | yadnesh|afk: the cause is fairly straightforward. notice these messages in the job log: | 11:43 |
fungi | INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C. | 11:44 |
fungi | now see this comment in the project's tox configuration: https://opendev.org/openstack/python-aodhclient/src/branch/stable/yoga/tox.ini#L16 | 11:44 |
fungi | yadnesh|afk: the solutions available are to either start using constraints like the vast majority of openstack projects, or get rid of some of your dependencies, or choose much tighter dependency version ranges, or increase your job timeout | 11:45 |
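A sketch of the first option (OpenStack upper constraints), which keeps the pip resolver from backtracking through old releases; the branch name is only an example:

```shell
# Pinning every transitive dependency to one tested version set removes most
# of the resolver's search space, so the install no longer times out.
pip install \
    -c https://releases.openstack.org/constraints/upper/yoga \
    -r requirements.txt -r test-requirements.txt
```

In a tox environment this is usually expressed as a `-c{env:TOX_CONSTRAINTS_FILE:...}` entry at the top of `deps`, which is the pattern most OpenStack projects already follow.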
*** rlandy|mtg is now known as rlandy | 11:50 | |
*** yadnesh|afk is now known as yadnesh | 12:16 | |
*** dasm|off is now known as dasm | 13:00 | |
*** ministry is now known as __ministry | 13:11 | |
*** blarnath is now known as d34dh0r53 | 13:32 | |
*** ysandeep is now known as ysandeep|dinner | 14:19 | |
*** yadnesh is now known as yadnesh|away | 14:52 | |
*** dviroel_ is now known as dviroel | 14:55 | |
*** ysandeep|dinner is now known as ysandeep | 15:18 | |
*** ysandeep is now known as ysandeep|out | 15:33 | |
clarkb | dansmith: fungi: I spot checked jobs listed by https://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=0 and all have had logs so far | 15:36 |
clarkb | there seem to be some persistently broken jobs too which likely indicates a problem with those specific jobs | 15:36 |
fungi | did they log what failed though? | 15:36 |
clarkb | yes | 15:37 |
fungi | https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656 was the example brought up in the tc meeting, i was about to look into it | 15:37 |
clarkb | for example all the monasca errors are due to python not being found | 15:37 |
clarkb | that example has logs fwiw | 15:38 |
fungi | yeah, i haven't dived into them yet, but i wonder if the confusion is that the failed ansible tasks aren't rendered in the console view (because we only render tasks which complete) | 15:39 |
clarkb | it timed out | 15:39 |
clarkb | so ya I'm not sure that fits the "no logs post failure" situation | 15:39 |
clarkb | https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656/log/job-output.txt#34100-34215 looks like slow log processing likely due to ansible loop costs | 15:40 |
clarkb | this is related to the improvements I've already made to stage-output, but I'm sure more can be done | 15:41 |
clarkb | just processing apache logs there took 13 minutes | 15:42 |
fungi | yeah, i'm on my netbook at the moment and browsing that build result is bogging down the browser something fierce | 15:42 |
*** tkajinam is now known as Guest2971 | 15:43 | |
clarkb | also looks like the first task in stage-output which stats all the files in a loop wasn't fast. I didn't update that part of stage-output. I only updated the portion that renames files to have .txt suffixes | 15:44 |
clarkb | so that may be another target for optimizing | 15:44 |
dansmith | clarkb: ack, I think the one I got with a slew of post_failures with no logs was last week sometime, but it was just one | 15:46 |
dansmith | so the post failures this week, which were fewer, were on my radar, but several (including the ones mentioned) were with-logs and not super obvious | 15:47 |
fungi | there have been some ssh host key mismatch errors which look like rogue virtual machines in cloud providers getting into arp overwrite situations with our test nodes, and if that hits at the wrong moment then the executor can break while trying to retrieve logs from the node | 15:47 |
clarkb | ya my experience has been that as long as there are logs these tend to be debuggable. | 15:47 |
clarkb | fungi: the issue there is that rsync doesn't share the ssh control persistence with regular ssh so it is more susceptible to those failures | 15:48 |
fungi | yeah, that definitely doesn't help matters | 15:48 |
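As an aside, a minimal sketch of the general technique being referred to (not necessarily how the executor is configured); the host name and paths are placeholders:

```shell
# Open (or reuse) a multiplexed master connection to the test node.
ssh -o ControlMaster=auto \
    -o ControlPersist=10m \
    -o ControlPath=~/.ssh/cm-%r@%h-%p \
    -Nf testnode.example.org

# Point rsync at the same control socket so it rides the existing connection
# instead of negotiating a fresh one (and re-checking the host key).
rsync -av \
    -e 'ssh -o ControlPath=~/.ssh/cm-%r@%h-%p' \
    testnode.example.org:/var/log/ ./collected-logs/
```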
dansmith | clarkb: so what was the failure in the above-quoted one? | 15:49 |
dansmith | like, I see this: POST-RUN END RESULT_NORMAL | 15:49 |
fungi | slowness in processing all the logs from the job led to a timeout for the post-run playbooks | 15:50 |
dansmith | and the successful log upload, seems like POST_FAILURE doesn't fit | 15:50 |
clarkb | dansmith: I linked to it. the post run timed out processing all of the devstack log files | 15:50 |
clarkb | https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656/log/job-output.txt#34100-34215 | 15:50 |
clarkb | it took 13 minutes processing apache logs | 15:50 |
clarkb | with a half hour timeout (and there were other bits that are slow too) | 15:50 |
dansmith | aha, multiple post-runs | 15:51 |
dansmith | I guess I look for the post-run from the bottom, and expect that to be a post_failure-inducing fail | 15:52 |
clarkb | https://review.opendev.org/c/zuul/zuul-jobs/+/858961 https://review.opendev.org/c/zuul/zuul-jobs/+/857228 and https://review.opendev.org/c/zuul/zuul-jobs/+/855402 are three changes I've already made to speed these sorts of things up for many jobs | 15:55 |
clarkb | But basically timeouts like this indicate we need to make more changes like ^ | 15:55 |
fungi | and/or reduce unnecessary log collection where it's reasonable to do so | 15:56 |
clarkb | ++ | 15:56 |
clarkb | also if anyone happens to know how to convince ansible to treat high task runtime costs as a bug that should be fixed I will owe you a $beverage | 16:04 |
JayF | clarkb: Red Hat runs ansible. Red Hat has a large number of developers invested in OpenStack. Maybe we can use that as leverage? | 16:12 |
clarkb | JayF: maybe, it's definitely something we've struggled to convince them of in the past. But part of the issue with those tasks above is that ansible has something like a 1-2 second per-task minimum cost. When you run a loop in ansible you take the number of loop entries and multiply it by that base cost to get a floor for the best-case runtime | 16:13 |
fungi | the ansible core devs have frequently viewed zuul's use case as rather fringe, which hasn't helped when we bring up problems we experience | 16:13 |
clarkb | in this case if you have a ton of log files to process and loop over them it can easily be just 5 minutes of ansible task startup cost when the actual work is a few milliseconds | 16:13 |
fungi | also ansible-lint really does push people to write inefficient loops over ansible tasks rather than have a single shell task with a loop inside it | 16:14 |
clarkb | yes the ansible way to do things is slow | 16:15 |
fungi | anyway, point being solving this within ansible upstream would involve a change in approach and mindset, not just some simple patching | 16:17 |
fungi | they've made explicit design choices to trade efficiency for clarity and consistency | 16:19 |
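A rough way to see the per-task overhead being described; the numbers are illustrative, not measured from this job:

```shell
# Each ansible task invocation pays a fixed startup cost, typically on the
# order of a second or two even for a no-op module run against localhost.
time ansible localhost -m ping

# Looping over N files as N separate tasks multiplies that floor by N:
# ~150 log files at ~1.5s per task is already several minutes before any
# real work happens, which is why a single shell/find task over the whole
# set is so much cheaper than a per-file loop.
python3 -c 'print(150 * 1.5 / 60, "minutes of task startup overhead")'
```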
clarkb | I'm looking at the apache tasks for that job more closely and I think it is taking a minute to run each task. | 16:26 |
clarkb | It's possible we are deep into swap or something which is just compounding the problem? It is weird to me that we aren't running those tasks in parallel across the different nodes too | 16:27 |
clarkb | we run ansible with -f 5 in zuul jobs iirc, which should allow a two node job to run tasks like that in parallel. | 16:27 |
clarkb | it doesn't seem to be quite so bad earlier in the job so ya maybe there is some side effect (like swapping) that the job is inducing that causes it to get worse | 16:29 |
*** jpena is now known as jpena|off | 16:34 | |
*** dviroel is now known as dviroel|biab | 19:55 | |
*** dviroel|biab is now known as dviroel | 20:51 | |
*** dviroel is now known as dviroel|afk | 21:33 | |
*** rlandy is now known as rlandy|bbl | 22:12 | |
*** dasm is now known as dasm|off | 23:52 | |
*** rlandy|bbl is now known as rlandy | 23:55 | |
*** rlandy is now known as rlandy|out | 23:59 |