*** rlandy is now known as rlandy|out | 01:00 | |
*** ysandeep|out is now known as ysandeep | 05:55 | |
*** jpena|off is now known as jpena | 07:18 | |
slaweq | fungi hi, can You help me again with tobiko-upload-git-mirror job? I added ssh key encrypted with zuul-client in tobiko repo: https://zuul-ci.org/docs/zuul/latest/project-config.html#encryption | 08:09 |
slaweq | but job is still failing without any logs: https://zuul.opendev.org/t/openstack/build/9fd7eeac81354a23a0e09d1dd5a8e150/logs | 08:09 |
slaweq | can You check for me what can be wrong there still? | 08:09 |
slaweq | thx in advance | 08:09 |
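For context, a minimal sketch of the secret-encryption step being discussed. The tenant, project, secret, and file names below are placeholders; the exact flags should be checked against the zuul-client documentation linked above.

```shell
# zuul-client fetches the project's public key from the Zuul API and prints
# an encrypted blob suitable for pasting into the project's secrets YAML.
zuul-client --zuul-url https://zuul.opendev.org encrypt \
    --tenant openstack \
    --project x/tobiko \
    --secret-name git_mirror_ssh_key \
    --field-name ssh_private_key \
    --infile ~/.ssh/id_mirror \
    --outfile secret.yaml

# A truncated or re-wrapped ciphertext in the committed YAML typically shows
# up at runtime as "ValueError: Encryption/decryption failed", so the pasted
# blob should match the generated output exactly.
```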
*** ysandeep is now known as ysandeep|afk | 08:17 | |
* frickler not fungi, but taking a look at logs | 08:33 | |
slaweq | frickler thx a lot :) | 09:01 |
frickler | slaweq: still seeing the same: ValueError: Encryption/decryption failed. | 10:12 |
*** rlandy|out is now known as rlandy | 10:24 | |
frickler | slaweq: looking at https://opendev.org/x/tobiko/commit/ae6deb883f6ce87c2d069013d94c753af793974a the new secret seems much shorter than the previous one, not sure if that might be legit or is an indication of corrupted data | 10:26 |
slaweq | frickler yeah, maybe I did something wrong first time | 10:30 |
slaweq | I will try to merge that new patch and check then | 10:30 |
*** ysandeep|afk is now known as ysandeep | 10:42 | |
*** rlandy is now known as rlandy|mtg | 11:09 | |
yadnesh | hello all, could someone take a look at this job https://zuul.opendev.org/t/openstack/build/c39ca54a9167461fb974033a1846a3dc/logs | 11:19 |
yadnesh | it seems to be timing out while downloading pip modules RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/zuul/zuul-jobs/playbooks/tox/run.yaml@master] | 11:19 |
opendevreview | Merged openstack/openstack-zuul-jobs master: Remove pike branch from periodic-stable templates https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/851613 | 11:19 |
*** yadnesh is now known as yadnesh|afk | 11:36 | |
fungi | yadnesh|afk: the cause is fairly straightforward. notice these messages in the job log: | 11:43 |
fungi | INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C. | 11:44 |
fungi | now see this comment in the project's tox configuration: https://opendev.org/openstack/python-aodhclient/src/branch/stable/yoga/tox.ini#L16 | 11:44 |
fungi | yadnesh|afk: the solutions available are to either start using constraints like the vast majority of openstack projects, or get rid of some of your dependencies, or choose much tighter dependency version ranges, or increase your job timeout | 11:45 |
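A sketch of the first option (OpenStack upper constraints), which keeps the pip resolver from backtracking through old releases; the branch name is only an example:

```shell
# Pinning every transitive dependency to one tested version set removes most
# of the resolver's search space, so the install no longer times out.
pip install \
    -c https://releases.openstack.org/constraints/upper/yoga \
    -r requirements.txt -r test-requirements.txt
```

In a tox environment this is usually expressed as a `-c{env:TOX_CONSTRAINTS_FILE:...}` entry at the top of `deps`, which is the pattern most OpenStack projects already follow.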
*** rlandy|mtg is now known as rlandy | 11:50 | |
*** yadnesh|afk is now known as yadnesh | 12:16 | |
*** dasm|off is now known as dasm | 13:00 | |
*** ministry is now known as __ministry | 13:11 | |
*** blarnath is now known as d34dh0r53 | 13:32 | |
*** ysandeep is now known as ysandeep|dinner | 14:19 | |
*** yadnesh is now known as yadnesh|away | 14:52 | |
*** dviroel_ is now known as dviroel | 14:55 | |
*** ysandeep|dinner is now known as ysandeep | 15:18 | |
*** ysandeep is now known as ysandeep|out | 15:33 | |
clarkb | dansmith: fungi: I spot checked jobs listed by https://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=0 and all have had logs so far | 15:36 |
clarkb | there seem to be some persistently broken jobs too which likely indicates a problem with those specific jobs | 15:36 |
fungi | did they log what failed though? | 15:36 |
clarkb | yes | 15:37 |
fungi | https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656 was the example brought up in the tc meeting, i was about to look into it | 15:37 |
clarkb | for example all the monasca errors are due to python not being found | 15:37 |
clarkb | that example has logs fwiw | 15:38 |
fungi | yeah, i haven't dived into them yet, but i wonder if the confusion is that the failed ansible tasks aren't rendered in the console view (because we only render tasks which complete) | 15:39 |
clarkb | it timed out | 15:39 |
clarkb | so ya I'm not sure that fits the "no logs post failure" situation | 15:39 |
clarkb | https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656/log/job-output.txt#34100-34215 looks like slow log processing likely due to ansible loop costs | 15:40 |
clarkb | this is related to the improvements I've already made to stage-output, but I'm sure more can be done | 15:41 |
clarkb | just processing apache logs there took 13 minutes | 15:42 |
fungi | yeah, i'm on my netbook at the moment and browsing that build result is bogging down the browser something fierce | 15:42 |
*** tkajinam is now known as Guest2971 | 15:43 | |
clarkb | also looks like the first task in stage-output which stats all the files in a loop wasn't fast. I didn't update that part of stage-output. I only updated the portion that renames files to have .txt suffixes | 15:44 |
clarkb | so that may be another target for optimizing | 15:44 |
dansmith | clarkb: ack, I think the one I got with a slew of post_failures with no logs was last week sometime, but it was just one | 15:46 |
dansmith | so the post failures this week, which were fewer, were on my radar, but several (including the ones mentioned) were with-logs and not super obvious | 15:47 |
fungi | there have been some ssh host key mismatch errors which look like rogue virtual machines in cloud providers getting into arp overwrite situations with our test nodes, and if that hits at the wrong moment then the executor can break while trying to retrieve logs from the node | 15:47 |
clarkb | ya my experience has been that as long as there are logs these tend to be debuggable. | 15:47 |
clarkb | fungi: the issue there is that rsync doesn't share the ssh control persistence with regular ssh so it is more susceptible to those failures | 15:48 |
fungi | yeah, that definitely doesn't help matters | 15:48 |
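As an aside, a minimal sketch of the general technique being referred to (not necessarily how the executor is configured); the host name and paths are placeholders:

```shell
# Open (or reuse) a multiplexed master connection to the test node.
ssh -o ControlMaster=auto \
    -o ControlPersist=10m \
    -o ControlPath=~/.ssh/cm-%r@%h-%p \
    -Nf testnode.example.org

# Point rsync at the same control socket so it rides the existing connection
# instead of negotiating a fresh one (and re-checking the host key).
rsync -av \
    -e 'ssh -o ControlPath=~/.ssh/cm-%r@%h-%p' \
    testnode.example.org:/var/log/ ./collected-logs/
```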
dansmith | clarkb: so what was the failure in the above-quoted one? | 15:49 |
dansmith | like, I see this: POST-RUN END RESULT_NORMAL | 15:49 |
fungi | slowness in processing all the logs from the job led to a timeout for the post-run playbooks | 15:50 |
dansmith | and the successful log upload, seems like POST_FAILURE doesn't fit | 15:50 |
clarkb | dansmith: I linked to it. the post run timed out processing all of the devstack log files | 15:50 |
clarkb | https://zuul.opendev.org/t/openstack/build/3c40559f664543359c0109b28dc07656/log/job-output.txt#34100-34215 | 15:50 |
clarkb | it took 13 minutes processing apache logs | 15:50 |
clarkb | with a half hour timeout (and there were other bits that are slow too) | 15:50 |
dansmith | aha, multiple post-runs | 15:51 |
dansmith | I guess I look for the post-run from the bottom, and expect that to be a post_failure-inducing fail | 15:52 |
clarkb | https://review.opendev.org/c/zuul/zuul-jobs/+/858961 https://review.opendev.org/c/zuul/zuul-jobs/+/857228 and https://review.opendev.org/c/zuul/zuul-jobs/+/855402 are three changes I've already made to speed these sorts of things up for many jobs | 15:55 |
clarkb | But basically timeouts like this indicate we need to make more changes like ^ | 15:55 |
fungi | and/or reduce unnecessary log collection where it's reasonable to do so | 15:56 |
clarkb | ++ | 15:56 |
clarkb | also if anyone happens to know how to convince ansible to treat high task runtime costs as a bug that should be fixed I will owe you a $beverage | 16:04 |
JayF | clarkb: Red Hat runs ansible. Red Hat has a large number of developers invested in OpenStack. Maybe we can use that as leverage? | 16:12 |
clarkb | JayF: maybe, it's definitely something we've struggled to convince them of in the past. But part of the issue with those tasks above is that ansible has something like a 1-2 second per-task minimum cost. When you run a loop in ansible you take the number of loop entries and multiply it by that base cost to get a floor for the best-case runtime | 16:13 |
fungi | the ansible core devs have frequently viewed zuul's use case as rather fringe, which hasn't helped when we bring up problems we experience | 16:13 |
clarkb | in this case if you have a ton of log files to process and loop over them it can easily be just 5 minutes of ansible task startup cost when the actual work is a few milliseconds | 16:13 |
fungi | also ansible-lint really does push people to write inefficient loops over ansible tasks rather than have a single shell task with a loop inside it | 16:14 |
clarkb | yes the ansible way to do things is slow | 16:15 |
fungi | anyway, point being solving this within ansible upstream would involve a change in approach and mindset, not just some simple patching | 16:17 |
fungi | they've made explicit design choices to trade efficiency for clarity and consistency | 16:19 |
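A rough way to see the per-task overhead being described; the numbers are illustrative, not measured from this job:

```shell
# Each ansible task invocation pays a fixed startup cost, typically on the
# order of a second or two even for a no-op module run against localhost.
time ansible localhost -m ping

# Looping over N files as N separate tasks multiplies that floor by N:
# ~150 log files at ~1.5s per task is already several minutes before any
# real work happens, which is why a single shell/find task over the whole
# set is so much cheaper than a per-file loop.
python3 -c 'print(150 * 1.5 / 60, "minutes of task startup overhead")'
```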
clarkb | I'm looking at the apache tasks for that job more closely and I think it is taking a minute to run each task. | 16:26 |
clarkb | It's possible we are deep into swap or something which is just compounding the problem? It is weird to me that we aren't running those tasks in parallel across the different nodes too | 16:27 |
clarkb | we run ansible with -f 5 in zuul jobs iirc, which should allow a two node job to run tasks like that in parallel. | 16:27 |
clarkb | it doesn't seem to be quite so bad earlier in the job so ya maybe there is some side effect (like swapping) that the job is inducing that causes it to get worse | 16:29 |
*** jpena is now known as jpena|off | 16:34 | |
*** dviroel is now known as dviroel|biab | 19:55 | |
*** dviroel|biab is now known as dviroel | 20:51 | |
*** dviroel is now known as dviroel|afk | 21:33 | |
*** rlandy is now known as rlandy|bbl | 22:12 | |
*** dasm is now known as dasm|off | 23:52 | |
*** rlandy|bbl is now known as rlandy | 23:55 | |
*** rlandy is now known as rlandy|out | 23:59 |