*** dasm|off is now known as Guest1645 | 04:00 | |
*** jpena|off is now known as jpena | 07:33 | |
*** dviroel|out is now known as dviroel | 11:23 | |
*** Guest1645 is now known as dasm | 13:50 | |
slaweq | hi infra team, I have a question about POST_FAILURES in jobs | 14:41 |
slaweq | we noticed that in some neutron jobs, like functional or fullstack, we get POST_FAILURE results at least a few times a week | 14:41
slaweq | but when I check the job-output.txt file, I don't really see any errors | 14:42
slaweq | like e.g. https://c228ca193be60f87086b-704d2e5cde896695f5c12544f01f1d12.ssl.cf5.rackcdn.com/840420/30/check/neutron-functional-with-uwsgi/3638734/job-output.txt | 14:42 |
slaweq | so I'm not sure what is causing that POST_FAILURE there | 14:42 |
slaweq | all such failures that I checked are in OVH GRA1 | 14:42
slaweq | do you maybe know what the error was there? | 14:43
slaweq | and do you already know about any issue like that? | 14:43
frickler | slaweq: that looks like a failure during log upload, although most of the logs seem to be in place | 15:02 |
clarkb | slaweq: I find it helps to start from the zuul build page rather than the logs. There's much more info that way. https://zuul.opendev.org/t/openstack/build/363873433e4d4f1ab66f6b2e97fb9429 for your example | 15:02
frickler | this directory seems to be very large, not sure if that's always the case https://c228ca193be60f87086b-704d2e5cde896695f5c12544f01f1d12.ssl.cf5.rackcdn.com/840420/30/check/neutron-functional-with-uwsgi/3638734/controller/logs/dsvm-functional-logs/index.html | 15:03 |
clarkb | frickler: slaweq: note that large numbers of files also seem to slow down swift uploads | 15:03 |
frickler | not necessarily related to the failure, but maybe worth improving, like just uploading a tgz from it? | 15:03 |
clarkb | that dir could be problematic for multiple reasons | 15:04 |
clarkb | anyway I like the zuul build page because you can check https://zuul.opendev.org/t/openstack/build/363873433e4d4f1ab66f6b2e97fb9429/console but I agree with frickler this must be an issue with the log upload itself, which doesn't show up there (because it is a chicken-and-egg problem with the uploads) | 15:05
clarkb | we can check ze03's executor log though | 15:05 |
clarkb | slaweq: [build: 363873433e4d4f1ab66f6b2e97fb9429] Ansible complete, result RESULT_TIMED_OUT code None | 15:08 |
clarkb | That was for the playbook that fetches the devstack logs | 15:08 |
clarkb | this is the step that copies from the test node to the executor. Not the step that copies from executor to swift | 15:09 |
clarkb | https://zuul.opendev.org/t/openstack/build/363873433e4d4f1ab66f6b2e97fb9429/log/job-output.txt#18905 the logs actually capture that. I'm surprised the console log doesn't show it, but I guess since the timeout operates above ansible, the ansible console log has a hard time showing it | 15:12
clarkb | https://zuul.opendev.org/t/openstack/build/363873433e4d4f1ab66f6b2e97fb9429/log/job-output.txt#15596 that task seems to be consuming a significant amount of time | 15:13 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L103 that is doing a mv of files that, based on the path, should be on the same fs (it is only changing the file suffix, not the directory) | 15:14
clarkb | But it takes about 3 seconds per file | 15:15 |
clarkb | I wonder if the slowness is ansible or the host | 15:15 |
clarkb | if you look at that log, the task that checks sudo https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/stage-output/tasks/main.yaml#L13 also takes about 3 seconds | 15:21
clarkb | I think either ansible is being slow or the host is being slow. But it doesn't seem specifically related to performing disk writes | 15:21
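The roughly 3 seconds per file discussed above is consistent with generic per-task Ansible overhead (connection setup, the sudo check, module transfer) rather than slow disk writes. As a purely illustrative sketch, not the actual stage-output role code, the difference between paying that overhead per file and paying it once looks roughly like this (variable names are assumptions):

```yaml
# Illustrative only: "log_files" and "log_dir" are assumed variables, not
# anything defined in zuul-jobs.

# Slow pattern: one full Ansible task round trip (connection, become/sudo
# handling, module transfer) for every single file, so even a cheap mv ends
# up costing a few seconds per item.
- name: Rename each log file individually
  command: mv "{{ item }}" "{{ item }}.txt"
  loop: "{{ log_files }}"

# Faster pattern: a single task that renames everything in one shell pass,
# paying the per-task overhead only once.
- name: Rename all log files in one task
  shell: |
    set -e
    for f in "{{ log_dir }}"/*.log; do
      mv "$f" "${f%.log}.txt"
    done
  args:
    executable: /bin/bash
```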
clarkb | Looking at the dstat log for that job I don't see anything that stands out to me (cpu utilization seems fine, plenty of memory, not swapping a ton, etc) | 15:23 |
*** dviroel is now known as dviroel|lunch | 15:25 | |
clarkb | Comparing to another job whose logs I've been looking at, https://zuul.opendev.org/t/openstack/build/c9b77addea87426e995d9a0ba0b1784f/log/job-output.txt#21348 , the sudo check takes almost a second and a half there. Not fast either, but twice as fast as this example | 15:26
*** dviroel|lunch is now known as dviroel | 16:25 | |
slaweq | clarkb: frickler: thx. I will try to upload those logs as a tar.gz archive. I hope it will help | 16:26
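A minimal sketch of what that could look like as a pre-collection task in the job (the directory path, variable name, and task placement are assumptions, not taken from the actual neutron job definitions):

```yaml
# Compress the per-test functional log directory into a single tarball before
# Zuul's log collection runs, so swift uploads one large object instead of
# thousands of small files. "devstack_log_dir" is an assumed variable, not a
# real neutron/devstack job variable.
- name: Archive dsvm-functional-logs into a single tarball
  shell: |
    set -e
    cd "{{ devstack_log_dir }}"
    tar czf dsvm-functional-logs.tar.gz dsvm-functional-logs
    rm -rf dsvm-functional-logs
  args:
    executable: /bin/bash
```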
fungi | or if there are a lot of files being logged unnecessarily, finding ways to not copy them could help | 16:37 |
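Along the same lines as fungi's point, pruning files that nobody inspects before collection would also shrink the upload; a hedged sketch, with the path and the "empty file" criterion purely illustrative:

```yaml
# Illustrative only: delete empty per-test log files before Zuul stages and
# uploads the directory. "devstack_log_dir" is again an assumed variable.
- name: Prune empty functional test logs before upload
  command: find "{{ devstack_log_dir }}/dsvm-functional-logs" -type f -empty -delete
```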
*** jpena is now known as jpena|off | 16:43 | |
*** dviroel is now known as dviroel|afk | 19:47 | |
*** dasm is now known as dasm|off | 21:40 | |
*** rlandy is now known as rlandy|bbl | 22:25 | |
*** dviroel|afk is now known as dviroel | 23:12 |