Thursday, 2025-09-18

00:31 *** haleyb is now known as haleyb|out
11:18 <seongsoocho[m]> Hi Infra Team,... (full message at <https://matrix.org/oftc/media/v1/media/download/AVMHsz-QNRZyYlms6tOmkUCI8lGxiVn99cdec6h431_DbXWNM3cCPSXgnwrIcqDCnsHWApiA9cpZ3mZ51VuuhB1CeZpKULsgAG1hdHJpeC5vcmcva0tWTHh3aVBzalNRTGR0UFd3WkZUZ2pX>)
11:25 <frickler> seongsoocho[m]: it would help if you send messages only one line at a time; that makes them readable for people outside of Matrix, too. (no need to repeat this one, just a note for next time)
11:26 <frickler> we can review and hopefully merge https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/921878 without impact, then likely you can amend the project-config change to create a new job first, which could then be tested against specific projects
11:27 <seongsoocho[m]> frickler: got it. I'll send messages in one line from next time.
11:43 *** ykarel_ is now known as ykarel
12:57 <opendevreview> Seongsoo Cho proposed openstack/openstack-zuul-jobs master: Add ansible play for weblate client configuration  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/921878
13:21 *** croeland1 is now known as croelandt
15:00 <opendevreview> Chaemin Lim proposed openstack/openstack-zuul-jobs master: Add ansible play for weblate client configuration  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/921878
18:26 <sfernand> hey guys! I'm working on some zuul jobs for Cinder and noticed something weird when some tempest tests fail to execute. The tempest run completes as expected even if some tests fail, but the job reports POST_FAILURE instead of FAILURE. I see no controller logs, so I suspect it might be some timeout pushing log information or something related
18:26 <sfernand> https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/zuul-info/zuul-info.controller.txt
18:27 <clarkb> let me see
18:27 <sfernand> oops, sorry, wrong link
18:27 <sfernand> https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/job-output.txt
18:28 <clarkb> sfernand: https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/job-output.txt#34961 this shows at least one test does fail
18:29 <clarkb> oh I see, you expect a FAILURE result, not POST_FAILURE
18:29 <sfernand> yep! I expect it to output FAILURE so I could check the logs
18:29 <clarkb> POST_FAILURE occurs when at least one post-run playbook fails
18:29 <clarkb> and this value overrides the SUCCESS/FAILURE state of the run playbook
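[The override clarkb describes follows from how a Zuul job separates its phases; a minimal sketch of a job definition, with hypothetical job and playbook names, shows where each result comes from:]

```yaml
# Hypothetical job definition; the names below are made up for illustration.
- job:
    name: cinder-tempest-example
    parent: devstack-tempest
    run: playbooks/tempest/run.yaml        # SUCCESS/FAILURE is decided here
    post-run: playbooks/tempest/post.yaml  # any failure here -> POST_FAILURE,
                                           # overriding the run result
```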
18:30 <clarkb> I don't see any obvious failures in post-run at https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/console so maybe the failure occurs after we upload logs. Let me see if I can find this in the executor logs
18:32 <sfernand> oh I see
18:33 <clarkb> sfernand: https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/job-output.txt#54711-54798 this is the problem. The post-run playbook to capture system logs timed out
18:34 <clarkb> something about `TASK [capture-system-logs : Stage various logs and reports]` is taking about half an hour
18:34 <clarkb> a naive guess is that significant amounts of logs have been written, so they need more processing than can be performed in that period and the task times out
18:34 <sfernand> wow
18:36 <sfernand> yeah, all tests are failing due to volume or server creation timeouts, so it writes lots of log lines saying things like "waiting for resource"
18:37 <clarkb> unfortunately it looks like the df and du tasks didn't produce usable output
18:37 <clarkb> that may have given us some insight into the scope of the problem
18:38 <clarkb> because that playbook timed out and wasn't ended properly we don't see it in the console page
18:38 <clarkb> ya I think it is scrubbed out of the json entirely too :/
18:39 <clarkb> so the best clue we have is the job-output.txt file indicating which task was started when the playbook timed out
18:39 <clarkb> from that you might be able to infer where the specific issues are or potentially add more debugging
18:41 <clarkb> sfernand: https://opendev.org/openstack/devstack/src/branch/master/roles/capture-system-logs/tasks/main.yaml this is what that task is doing
18:42 <clarkb> I suspect either https://opendev.org/openstack/devstack/src/branch/master/roles/capture-system-logs/tasks/main.yaml#L36-L40 or https://opendev.org/openstack/devstack/src/branch/master/roles/capture-system-logs/tasks/main.yaml#L44-L53
18:42 <sfernand> sorry for the dumb question, but is there a way to change capture-system-logs for testing in just a specific job?
18:43 <clarkb> sfernand: the same script will be used in every job that calls it. So you either need to modify devstack to switch behavior based on parameters, or change the script to work for everything. I would probably start by pushing updates to that script to try and identify where the specific problem is before settling on any solution
18:44 <clarkb> however, if your job is creating many core dumps, or so many deprecation warnings that you cannot process them in half an hour, I would consider each of those to be bugs that should be fixed in your job and not in the script
18:44 <sfernand> yeah for sure
18:45 <clarkb> and if pushing debug updates to that script (use Depends-On to pull those updates into your change and see what happens) doesn't identify a source of the problem, we can probably also hold a node and inspect it directly
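[The Depends-On mechanism clarkb mentions is a footer in the commit message of the consuming change, which makes Zuul run the job with the un-merged devstack change applied. A sketch of such a commit message; the change number 999999 is a placeholder, not a real review:]

```
Add extra debugging to my Cinder tempest job

Depends-On: https://review.opendev.org/c/openstack/devstack/+/999999
```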
18:48 <clarkb> I would probably modify it to do something like `if [ -d /var/core ] ; then ls -lh /var/core && du -hs /var/core ; fi`
18:48 <clarkb> then similarly with the deprecation warnings: drop all the seds and just do something like `| wc -l` to see how many are in there
18:49 <clarkb> maybe also `du -hs {{ stage_dir }}/logs/* {{ stage_dir }}/apache/*`
18:49 <clarkb> something along those lines to try and narrow down where the slowdown might be occurring
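[Put together, the debug changes clarkb sketches above might look like the following shell snippet. This is a sketch only: STAGE_DIR stands in for the role's `{{ stage_dir }}` variable, and the grep pattern is an assumption about what the role's sed pipeline matches, not its actual expression.]

```shell
#!/bin/sh
# Debugging sketch for the capture-system-logs role: report sizes and counts
# instead of doing the expensive per-file processing.
# STAGE_DIR is a stand-in for the role's {{ stage_dir }} variable.
STAGE_DIR="${STAGE_DIR:-/opt/stack/logs-stage}"

# Core dumps: just list and size the directory instead of staging its contents.
if [ -d /var/core ]; then
    ls -lh /var/core
    du -hs /var/core
fi

# Deprecation warnings: drop the sed pipeline and only count matching lines to
# gauge how much text the full processing would have to chew through.
# (The pattern 'deprecat' is an assumption, not what the role actually uses.)
if [ -d "$STAGE_DIR/logs" ]; then
    grep -ri 'deprecat' "$STAGE_DIR/logs" | wc -l
fi

# Per-tree sizes to narrow down which log directory is the big one.
du -hs "$STAGE_DIR"/logs/* "$STAGE_DIR"/apache/* 2>/dev/null || true
```

[Each step prints cheap summary data, so even a pathological log tree should finish well inside the post-run timeout.]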
18:51 <sfernand> yeah that is helpful, thanks a lot clarkb!
18:53 <sfernand> I will talk to the devstack folks and propose a change to the script with the debugging

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!