frickler | I'd still argue that we'd want to throw away only 10*10m rather than 20*10m; early detection does nothing to reduce the rate of actual gate failures | 06:00 |
frickler | also, 10m is unreasonable; the devstack phase alone takes 20m, and it very rarely fails. assuming the failure happens on average halfway through the tempest run, that would be 40m vs. 60m | 06:02 |
frickler | or, for the more relevant case of a 2h full tempest job, 70m vs. 120m | 06:02 |
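frickler's estimate can be sanity-checked with a little arithmetic (a hypothetical sketch using only the durations quoted above: a 20-minute devstack phase, with the failure landing on average halfway through the tempest phase):

```python
# Hedged sketch of frickler's estimate: with early failure detection, a
# failing job costs the devstack phase plus, on average, half the tempest
# phase, instead of running the full job to completion.

def early_fail_minutes(devstack_min, total_min):
    tempest_min = total_min - devstack_min
    return devstack_min + tempest_min / 2

print(early_fail_minutes(20, 60))   # vs. 60m without early detection
print(early_fail_minutes(20, 120))  # vs. 120m for a 2h full tempest job
```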
opendevreview | Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify vencrypt feature https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824 | 06:09 |
opendevreview | yatin proposed openstack/devstack master: [DNM] check abandoned patch https://review.opendev.org/c/openstack/devstack/+/924822 | 07:59 |
opendevreview | yatin proposed openstack/devstack master: [DNM] check abandoned patch https://review.opendev.org/c/openstack/devstack/+/924822 | 08:03 |
opendevreview | Katarina Strenkova proposed openstack/tempest master: Fix cleanup of keypairs for --prefix option https://review.opendev.org/c/openstack/tempest/+/924919 | 10:56 |
*** ykarel_ is now known as ykarel | 12:07 | |
opendevreview | Riccardo Pittau proposed openstack/devstack master: Install simplejson in devstack venv https://review.opendev.org/c/openstack/devstack/+/924867 | 12:38 |
frickler | melwitt: fyi ^^ is the workaround for the issue that you noticed yesterday. triggered by osc-lib being released but osc still pending, see discussion in #-sdks | 12:48 |
ykarel | kopecmartin, dansmith can we also get this https://review.opendev.org/c/openstack/devstack/+/924864 | 14:04 |
dansmith | ykarel: no, your quota has been exceeded! | 14:06 |
dansmith | ykarel: but seriously, I don't have +2 on unmaintained/* | 14:06 |
dansmith | you'll need to ask someone who does | 14:06 |
dansmith | I gave you a shiny +1 at least :) | 14:08 |
frickler | ykarel: elod is also out, maybe lajoskatona can simply single-approve? | 14:12 |
ykarel | thx dansmith frickler , lajoskatona if you can +W ^ | 14:14 |
lajoskatona | frickler, ykarel: let me check | 14:17 |
lajoskatona | ykarel: done | 14:18 |
opendevreview | Merged openstack/devstack-plugin-ceph stable/2024.1: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924727 | 14:18 |
ykarel | thx lajoskatona | 14:18 |
melwitt | thanks frickler | 15:04 |
*** dviroel is now known as dviroel|afk | 15:14 | |
clarkb | frickler: I think those numbers were just illustrative. I don't know what the actual timespans look like for a tempest job particularly since there have been improvements with the openstack cli server for example. | 15:48 |
clarkb | but the idea is to see if that helps well enough to relieve the pressure on resources during gate thrash and high demand. | 15:48 |
clarkb | I think we're comparing ~20 * however long it takes to reach the first tempest test fail (so 25-120 minutes?) to 10 * 120 minutes | 15:49 |
clarkb | there are scenarios where the first option is better | 15:49 |
clarkb | but we don't really know until we try (or spend a lot of effort modeling it; in this case I think it's less effort and quicker wins to just experiment directly) | 15:50 |
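clarkb's comparison can be put in rough node-minutes (an illustrative back-of-the-envelope only; the window sizes and durations come from the discussion above):

```python
# Cost of one gate reset in node-minutes: every change in the window has its
# jobs restarted, so the waste is roughly window * minutes burned per change.

def gate_reset_cost(window, minutes_per_change):
    return window * minutes_per_change

full_run = 120  # a full tempest job, in minutes

cost_capped = gate_reset_cost(10, full_run)      # window capped at 10, no early detection
cost_early_fast = gate_reset_cost(20, 25)        # window 20, failure surfaces at 25 min
cost_early_slow = gate_reset_cost(20, full_run)  # window 20, failure only surfaces at the end

# early detection wins whenever failures surface in under full_run/2 minutes
```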
frickler | yes, and lowering the lower bound is an easy and quick way to test things, so I don't understand why people insist on blocking that experiment :( | 15:59 |
clarkb | because we aren't under that pressure right now, we don't get the benefits either way. So shouldn't we at least try to do the more nimble and correct thing? | 16:01 |
frickler | I still think that both are valid options that do not exclude each other. also it will be too late to start the experiment while zuul is under fire. I would even go as far as suggesting to set the lower bound very low, like 2 or 3, and then watch how the calculated value develops | 16:04 |
clarkb | another thing to keep in mind is that zuul is designed to enable high throughput. Setting low bounds like that effectively undoes the very reason for zuul existing | 16:05 |
clarkb | (zuul was written because the old limit was 1 for openstack and openstack jobs at the time took about an hour to complete so openstack could only merge a maximum of 24 items a day. Zuul was specifically created to address this bottleneck) | 16:06 |
clarkb | Personally I don't want to subvert the tool because the tests are flaky. I'd rather use the built-in pressure relief valves to enable improvements while still aiming at higher throughput as a goal | 16:07 |
frickler | the lower bound is only a lower bound. IMO the algorithm to have a floating queue depth was exactly made for our case, we just don't let it play out by capping the lower bound at a much too high value | 16:11 |
frickler | nothing would block it reaching a queue depth of 50 or more if the gate was really deep green | 16:13 |
clarkb | right, it's a balancing act. But just like TCP slow start, which it is modeled after, you probably don't want to start at the lowest value, because then you'll never reach best-case throughput even if everything runs smoothly most of the time | 16:14 |
frickler | I purposely refrained from suggesting to start with a lower bound of 1 | 16:15 |
frickler | though the tcp example is also misleading, we only have essentially a single long running connection in our case | 16:16 |
frickler | so the startup behaviour doesn't matter much after a couple of days of running. and that is one more argument to start the experiment on a quiet day | 16:17 |
clarkb | if we set the floor to 2 I bet we never go above 8 | 16:19 |
clarkb | and the average will probably be around 3 | 16:20 |
clarkb | maybe that is a good thing if it motivates people to improve job stability | 16:20 |
frickler | but if you think that that is the wrong outcome, isn't then the algorithm flawed? | 16:20 |
corvus | if things are bad, we sit at the floor. we're at the floor now. we're almost always at the floor. so setting the floor is essentially setting the fixed window size for openstack. it would be good to let the tool do its job and try to achieve the max throughput. | 16:20 |
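The floating-window behaviour corvus describes can be sketched roughly like this (a simplified model, not Zuul's actual implementation; the grow/halve rules, the floor of 20, and the ceiling are illustrative):

```python
# Simplified floating gate window: grow by one per successful merge, halve on
# a gate reset, and never drop below the configured floor (TCP-slow-start-ish).

def step_window(window, success, floor=20, ceiling=100):
    if success:
        return min(window + 1, ceiling)
    return max(window // 2, floor)

window = 20
for _ in range(15):                 # a streak of successful merges
    window = step_window(window, True)
# one gate failure halves the window, clamped back up to the floor
window = step_window(window, False)
```

With a flaky gate the failures keep arriving before the window can climb, so it spends nearly all its time pinned at the floor, which is the behaviour corvus describes.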
corvus | meanwhile, there is also another tool available that could have a significant impact on resource usage and throughput. it could actually increase throughput without any downside whatsoever: early failure detection. i think it would be a really good idea for openstack to take advantage of it. | 16:22 |
corvus | especially since it's so easy. | 16:22 |
corvus | here is how we configured it for the zuul project itself: https://opendev.org/zuul/zuul/src/branch/master/.zuul.yaml#L103-L106 | 16:22 |
frickler | if it were so easy, it would long be done. avoiding false positives is non-trivial IMO | 16:22 |
clarkb | frickler: I think that openstack's blasé approach to CI stability is flawed (note I don't think any individual is responsible; it's a culture problem) and that leads to the algorithm being a less good choice for openstack. We know that other zuul users don't have these problems and have had to implement a window ceiling as a result | 16:22 |
corvus | that's a regex that seems to work very well for zuul with stestr | 16:23 |
clarkb | I would love for openstack to go back to having a culture where needing the window ceiling is something we strive for but it has been a while since I got that feeling from the community | 16:23 |
corvus | any job that uses stestr can probably start with that and have a really good chance that it just works | 16:23 |
corvus | but depending on what else is output during a job, it might not work, which is why some familiarity with the job is important | 16:24 |
frickler | lots of devstack based jobs add some custom testing at the end. nobody knows even near all of those jobs | 16:24 |
corvus | it fails safe. so even if there's a false positive, it would be seen in testing. | 16:26 |
corvus | (and that's a pretty tight regex, i would be surprised by a false positive) | 16:27 |
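For reference, a pattern in the same spirit can be tested locally (this is an illustrative regex, not necessarily the exact one in zuul's .zuul.yaml linked above; the test name is made up):

```python
import re

# Illustrative pattern for stestr failure lines of the shape gmann quotes
# later in the discussion: '{0} tempest.api.test_fake [28.423190s] ... FAILED'
FAIL_RE = re.compile(r"\{\d+\} \S+ \[\d+\.\d+s\] \.\.\. FAILED")

assert FAIL_RE.search("{0} tempest.api.test_fake [28.423190s] ... FAILED")
assert not FAIL_RE.search("{0} tempest.api.test_fake [28.423190s] ... ok")
```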
frickler | how would that be seen other than as a gate failure? | 16:27 |
corvus | it would be seen in check. and there is special logging for it, so it can be discovered in the log for the build. | 16:28 |
*** iurygregory__ is now known as iurygregory | 16:35 | |
frickler | "If an early failure is triggered, the job result will be recorded as FAILURE even if the job playbooks ultimately succeed." that sounds safe enough I guess | 16:38 |
corvus | yeah, so a false positive will not go unnoticed. it should mean that it either works as expected or if there's a problem with the regex, then someone will need to fix the regex or disable it. | 16:41 |
frickler | how could a child job disable it? according to the docs, regexes can only be added. does this also support negation? | 16:46 |
frickler | but anyway, I think we could simply copy the regex from zuul-nox to tempest if kopecmartin and/or gmann agree | 16:47 |
frickler | and then see how far that gets us | 16:47 |
gmann | clarkb: corvus: frickler: I am ok to try it and to check whether the zuul regex will work fine, or what other safe regex we can define. | 16:48 |
gmann | but two questions. 1. same as frickler pointed out, the 'option is combined from all parents' and not overridden 2. 'special logging': do we have an example of that special logging? I think that can help to detect false failures | 16:50 |
gmann | due to the first one, I am hesitant to do that in the devstack base job, but we can do it in the tempest base job and let projects carefully check the right regex as per their jobs and define it themselves (mainly for non-tempest jobs) | 16:51 |
gmann | for tempest jobs we know what the job output looks like and what strings we log when tests pass or fail | 16:52 |
frickler | for devstack itself I don't think this is needed, devstack fails very fast on any error (except for some pathological cases like rpm issues, but that I'd consider a bug in devstack anyway) | 16:54 |
frickler | so yes this is why I suggested tempest base job | 16:54 |
frickler | let me see if I can get this pushed still before I end my day | 16:55 |
gmann | by the devstack base job I mean we could define this in a single place for all the jobs that inherit from it, but yes it will be good to do it in the tempest base job | 16:55 |
gmann | this is what a test failure looks like: '..[28.423190s] ... FAILED', so the zuul-defined regex seems safe to me, and a false failure would be very, very rare | 16:55 |
gmann | frickler: ok, let me know if you can push it, otherwise I can push the change | 16:56 |
corvus | question #2: https://zuul.opendev.org/t/zuul/build/418cdcd6b4f141de9a3d162aa6be4113/log/job-output.txt#6620 | 16:56 |
corvus | that's the log output | 16:56 |
gmann | clarkb: ++ thanks | 16:56 |
gmann | corvus: thanks | 16:56 |
corvus | question #1 -- yes it's append-only so a child job would need to get its parent to fix or remove it. it is the re2 syntax, but i don't believe that negate is supported. that should be possible to add. | 16:58 |
gmann | that regex might not cover all failures, but the most common test failure cases will be covered, and as we see how it goes we can improve it gradually | 16:58 |
corvus | gmann: ++ | 16:59 |
frickler | corvus: would it make sense to add another warning at the end of the run playbook for the case where there was an early failure but then rc=0? or is there one already maybe? | 17:02 |
corvus | there is not, but i agree that would make sense and would not be difficult | 17:03 |
gmann | ++. in that case, maybe the job status can be something else so that it is easily detected, like 'FAILED?' with a question mark, if that is easy to do | 17:03 |
corvus | that's more difficult; but i think adding a build event might be possible; those show up in build metadata; i don't think they are directly queryable, but they could be in the future. but they can be easily scanned in the api. | 17:05 |
gmann | k, makes sense. at least we will be able to run such a query in the opensearch dashboard periodically and see if there are any such cases | 17:08 |
opendevreview | Jens Harbott proposed openstack/tempest master: Add early failure detection for tempest jobs https://review.opendev.org/c/openstack/tempest/+/924956 | 17:16 |
opendevreview | Jens Harbott proposed openstack/tempest master: DNM: Test early failure detection https://review.opendev.org/c/openstack/tempest/+/924957 | 17:16 |
opendevreview | Merged openstack/devstack master: Install simplejson in devstack venv https://review.opendev.org/c/openstack/devstack/+/924867 | 17:29 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924960 | 17:48 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924960 | 17:53 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924960 | 17:54 |
*** iurygregory__ is now known as iurygregory | 17:57 | |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2024.1: Follow Up Patch Fix Ingress Deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924964 | 17:59 |
frickler | btw. as I was just looking at the final nova patch, integrated gate has window_size=30 currently, so it isn't glued to the 20 permanently at least | 18:10 |
corvus | https://graphite.opendev.org/?width=586&height=308&target=stats.gauges.zuul.tenant.openstack.pipeline.gate.queue.integrated.window&from=00%3A00_20240701&until=23%3A59_20240725 | 18:14 |
corvus | there have been some impressive spikes, but it looks like it stays pretty close to the floor | 18:15 |
opendevreview | Jens Harbott proposed openstack/tempest master: Add early failure detection for tempest base job https://review.opendev.org/c/openstack/tempest/+/924956 | 18:39 |
opendevreview | Jens Harbott proposed openstack/tempest master: DNM: Test early failure detection https://review.opendev.org/c/openstack/tempest/+/924957 | 18:39 |
clarkb | corvus: ^ that DNM change failed to match and early-fail. is that because the message comes from a debug task and we're maybe filtering by task type when doing early detection? | 21:08 |
clarkb | otherwise the regex seems to work with some (naive admittedly) local testing | 21:08 |
corvus | clarkb: has to come from a shell task | 21:17 |
corvus | if you want it to early fail from an ansible task, just have the task fail. that's automatic. :) | 21:17 |
corvus | so to make that dnm test effective, it should be shell: 'echo "{0} tempest.api.test_fake [0.1s] ... FAILED"' or something vaguely like that | 21:18 |
clarkb | cool that confirms that the test is flawed and not the regex at least until we can rewrite the test | 21:18 |
clarkb | yup | 21:18 |
clarkb | I guess I can push an update to do that | 21:18 |
opendevreview | Clark Boylan proposed openstack/tempest master: DNM: Test early failure detection https://review.opendev.org/c/openstack/tempest/+/924957 | 21:21 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!