Thursday, 2025-09-18

00:31 *** haleyb is now known as haleyb|out
11:18 <seongsoocho[m]> Hi Infra Team,... (full message at <https://matrix.org/oftc/media/v1/media/download/AVMHsz-QNRZyYlms6tOmkUCI8lGxiVn99cdec6h431_DbXWNM3cCPSXgnwrIcqDCnsHWApiA9cpZ3mZ51VuuhB1CeZpKULsgAG1hdHJpeC5vcmcva0tWTHh3aVBzalNRTGR0UFd3WkZUZ2pX>)
11:25 <frickler> seongsoocho[m]: it would help if you send messages only one line at a time; that makes them readable for people outside of Matrix, too. (no need to repeat this one, just a note for next time)
11:26 <frickler> we can review and hopefully merge https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/921878 without impact, then likely you can amend the project-config change to create a new job first, which could then be tested against specific projects
11:27 <seongsoocho[m]> frickler: got it. I'll send messages in one line from next time.
11:43 *** ykarel_ is now known as ykarel
12:57 <opendevreview> Seongsoo Cho proposed openstack/openstack-zuul-jobs master: Add ansible play for weblate client configuration  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/921878
13:21 *** croeland1 is now known as croelandt
15:00 <opendevreview> Chaemin Lim proposed openstack/openstack-zuul-jobs master: Add ansible play for weblate client configuration  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/921878
18:26 <sfernand> hey guys! I'm working on some zuul jobs for Cinder and noticed something weird when some tempest tests fail to execute. The tempest run completes as expected even if some tests fail, but the job reports POST_FAILURE instead of FAILURE. I see no controller logs, so I suspect it might be some timeout pushing log information or something related
18:26 <sfernand> https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/zuul-info/zuul-info.controller.txt
18:27 <clarkb> let me see
18:27 <sfernand> oops, sorry, wrong link
18:27 <sfernand> https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/job-output.txt
18:28 <clarkb> sfernand: https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/job-output.txt#34961 this shows at least one test does fail
18:29 <clarkb> oh I see, you expect a FAILURE result, not POST_FAILURE
18:29 <sfernand> yep! I expect it to output FAILURE so I could check the logs
18:29 <clarkb> POST_FAILURE occurs when at least one post-run playbook fails
18:29 <clarkb> and this value overrides the SUCCESS/FAILURE state of the run playbook
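[The override clarkb describes follows from how a Zuul job separates its phases; a minimal sketch of a job definition, with hypothetical job and playbook names, shows where each result comes from:]

```yaml
# Hypothetical job definition; the names below are made up for illustration.
- job:
    name: cinder-tempest-example
    parent: devstack-tempest
    run: playbooks/tempest/run.yaml        # SUCCESS/FAILURE is decided here
    post-run: playbooks/tempest/post.yaml  # any failure here -> POST_FAILURE,
                                           # overriding the run result
```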
18:30 <clarkb> I don't see any obvious failures in post-run at https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/console so maybe the failure occurs after we upload logs. Let me see if I can find this in the executor logs
18:32 <sfernand> oh I see
18:33 <clarkb> sfernand: https://zuul.opendev.org/t/openstack/build/702ef66f6d8e4c80b9a68b4dccb08046/log/job-output.txt#54711-54798 this is the problem. The post-run playbook to capture system logs timed out
18:34 <clarkb> something about `TASK [capture-system-logs : Stage various logs and reports]` is taking about half an hour
18:34 <clarkb> a naive guess is that significant amounts of logs have been written, so they need more processing than can be performed in that period and the task times out
18:34 <sfernand> wow
18:36 <sfernand> yeah, all tests are failing due to volume or server creation timeouts, so it writes lots of log lines saying things like "waiting for resource"
18:37 <clarkb> unfortunately it looks like the df and du tasks didn't produce usable output
18:37 <clarkb> that may have given us some insight into the scope of the problem
18:38 <clarkb> because that playbook timed out and wasn't ended properly we don't see it in the console page
18:38 <clarkb> ya I think it is scrubbed out of the json entirely too :/
18:39 <clarkb> so the best clue we have is the job-output.txt file indicating which task was started when the playbook timed out
18:39 <clarkb> from that you might be able to infer where the specific issues are or potentially add more debugging
18:41 <clarkb> sfernand: https://opendev.org/openstack/devstack/src/branch/master/roles/capture-system-logs/tasks/main.yaml this is what that task is doing
18:42 <clarkb> I suspect either https://opendev.org/openstack/devstack/src/branch/master/roles/capture-system-logs/tasks/main.yaml#L36-L40 or https://opendev.org/openstack/devstack/src/branch/master/roles/capture-system-logs/tasks/main.yaml#L44-L53
18:42 <sfernand> sorry for the dumb question, but is there a way to change capture-system-logs for testing in just a specific job?
18:43 <clarkb> sfernand: the same script will be used in every job that calls it. So you either need to modify devstack to switch behavior based on parameters, or change the script to work for everything. I would probably start by pushing updates to that script to try and identify where the specific problem is before settling on any solution
18:44 <clarkb> however, if your job is creating many core dumps, or so many deprecation warnings that you cannot process them in half an hour, I would consider each of those to be bugs that should be fixed in your job and not in the script
18:44 <sfernand> yeah for sure
18:45 <clarkb> and if pushing debug updates to that script (use Depends-On to pull those updates into your change and see what happens) doesn't identify a source of the problem, we can probably also hold a node and inspect it directly
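[The Depends-On mechanism clarkb mentions is a footer in the commit message of the consuming change, which makes Zuul run the job with the un-merged devstack change applied. A sketch of such a commit message; the change number 999999 is a placeholder, not a real review:]

```
Add extra debugging to my Cinder tempest job

Depends-On: https://review.opendev.org/c/openstack/devstack/+/999999
```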
18:48 <clarkb> I would probably modify it to do something like `if [ -d /var/core ] ; then ls -lh /var/core && du -hs /var/core ; fi`
18:48 <clarkb> then similarly with the deprecation warnings: drop all the seds and just do something like `| wc -l` to see how many are in there
18:49 <clarkb> maybe also `du -hs {{ stage_dir }}/logs/* {{ stage_dir }}/apache/*`
18:49 <clarkb> something along those lines to try and narrow down where the slowdown might be occurring
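[Put together, the debug changes clarkb sketches above might look like the following shell snippet. This is a sketch only: STAGE_DIR stands in for the role's `{{ stage_dir }}` variable, and the grep pattern is an assumption about what the role's sed pipeline matches, not its actual expression.]

```shell
#!/bin/sh
# Debugging sketch for the capture-system-logs role: report sizes and counts
# instead of doing the expensive per-file processing.
# STAGE_DIR is a stand-in for the role's {{ stage_dir }} variable.
STAGE_DIR="${STAGE_DIR:-/opt/stack/logs-stage}"

# Core dumps: just list and size the directory instead of staging its contents.
if [ -d /var/core ]; then
    ls -lh /var/core
    du -hs /var/core
fi

# Deprecation warnings: drop the sed pipeline and only count matching lines to
# gauge how much text the full processing would have to chew through.
# (The pattern 'deprecat' is an assumption, not what the role actually uses.)
if [ -d "$STAGE_DIR/logs" ]; then
    grep -ri 'deprecat' "$STAGE_DIR/logs" | wc -l
fi

# Per-tree sizes to narrow down which log directory is the big one.
du -hs "$STAGE_DIR"/logs/* "$STAGE_DIR"/apache/* 2>/dev/null || true
```

[Each step prints cheap summary data, so even a pathological log tree should finish well inside the post-run timeout.]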
18:51 <sfernand> yeah that is helpful, thanks a lot clarkb!
18:53 <sfernand> I will talk to the devstack folks and propose a change to the script with the debugging

Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!