Thursday, 2024-07-25

fricklerI'd still argue that we'd want to throw away only 10*10m rather than 20*10m; early detection does nothing to reduce the rate of actual gate failures06:00
frickleralso, 10m is unreasonable: the devstack phase alone takes 20m, and it very rarely fails. assuming the failure happens on average halfway through the tempest run, that would be 40m vs. 60m 06:02
frickleror, for the more relevant case of a 2h full tempest job, 70m vs. 120m06:02
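The arithmetic behind the 40m vs. 60m and 70m vs. 120m figures can be sketched as follows. This is a minimal illustration, not a measurement: the function name and parameters are hypothetical, and it assumes a fixed 20-minute devstack phase that never fails, with the first tempest failure landing halfway through the remaining run.

```python
def minutes_until_early_failure(total_minutes, setup_minutes=20, failure_fraction=0.5):
    """Estimate when an early failure would be reported.

    Assumptions (taken from the discussion, not measured): the setup
    (devstack) phase takes setup_minutes and never fails, and the first
    test failure occurs failure_fraction of the way through the test phase.
    """
    test_minutes = total_minutes - setup_minutes
    return setup_minutes + failure_fraction * test_minutes


for total in (60, 120):
    early = minutes_until_early_failure(total)
    print(f"{total}m job: failure reported after ~{early:.0f}m, saving ~{total - early:.0f}m")
```

Under these assumptions the saving per failed job is half of the tempest phase, so it grows with job length but never includes the setup time.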
opendevreviewAmit Uniyal proposed openstack/whitebox-tempest-plugin master: verify vencrypt feature  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92382406:09
opendevreviewyatin proposed openstack/devstack master: [DNM] check abandoned patch  https://review.opendev.org/c/openstack/devstack/+/92482207:59
opendevreviewyatin proposed openstack/devstack master: [DNM] check abandoned patch  https://review.opendev.org/c/openstack/devstack/+/92482208:03
opendevreviewKatarina Strenkova proposed openstack/tempest master: Fix cleanup of keypairs for --prefix option  https://review.opendev.org/c/openstack/tempest/+/92491910:56
*** ykarel_ is now known as ykarel12:07
opendevreviewRiccardo Pittau proposed openstack/devstack master: Install simplejson in devstack venv  https://review.opendev.org/c/openstack/devstack/+/92486712:38
fricklermelwitt: fyi ^^ is the workaround for the issue that you noticed yesterday. triggered by osc-lib being released but osc still pending, see discussion in #-sdks12:48
ykarelkopecmartin, dansmith can we also get this https://review.opendev.org/c/openstack/devstack/+/92486414:04
dansmithykarel: no, your quota has been exceeded!14:06
dansmithykarel: but seriously, I don't have +2 on unmaintained/*14:06
dansmithyou'll need to ask someone who does14:06
dansmithI gave you a shiny +1 at least :)14:08
fricklerykarel: elod is also out, maybe lajoskatona can simply single-approve?14:12
ykarelthx dansmith frickler , lajoskatona if you can +W ^14:14
lajoskatonafrickler, ykarel: let me check14:17
lajoskatonaykarel: done14:18
opendevreviewMerged openstack/devstack-plugin-ceph stable/2024.1: Fix ingress deamon  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/92472714:18
ykarelthx lajoskatona 14:18
melwittthanks frickler 15:04
*** dviroel is now known as dviroel|afk15:14
clarkbfrickler: I think those numbers were just illustrative. I don't know what the actual timespans look like for a tempest job particularly since there have been improvements with the openstack cli server for example.15:48
clarkbbut the idea is to see if that helps sufficiently well to relieve the pressure on resources during gate thrash and high demand.15:48
clarkbI think we're comparing ~20 * however long it takes to reach the first tempest test fail (so 25-120 minutes?) to 10 * 120 minutes15:49
clarkbthere are scenarios where the first option is better15:49
clarkbbut we don't really know until we try (or spend a lot of effort modeling it; in this case I think it's less effort and a quicker win to just experiment directly)15:50
frickleryes, and lowering the lower bound is an easy and quick way to test things, so I don't understand why people insist on blocking that experiment :(15:59
clarkbbecause we aren't under that pressure right now so we don't get the benefits either way. So shouldn't we at least try to do the more nimble and correct thing?16:01
fricklerI still think that both are valid options that do not exclude each other. also it will be too late to start the experiment while zuul is under fire. I would even go as far as suggesting to set the lower bound very low, like 2 or 3, and then watch how the calculated value develops16:04
clarkbanother thing to keep in mind is that zuul is designed to enable high throughput. Setting low bounds like that effectively undoes the very reason for zuul existing16:05
clarkb(zuul was written because the old limit was 1 for openstack and openstack jobs at the time took about an hour to complete so openstack could only merge a maximum of 24 items a day. Zuul was specifically created to address this bottleneck)16:06
clarkbPersonally I don't want to subvert the tool because the tests are flaky. I'd rather use the built in pressure relief valves to enable improvements while still aiming at higher throughput as a goal16:07
fricklerthe lower bound is only a lower bound. IMO the algorithm to have a floating queue depth was exactly made for our case, we just don't let it play out by capping the lower bound at a much too high value16:11
fricklernothing would block it reaching a queue depth of 50 or more if the gate was really deep green16:13
clarkbright, it's a balancing act. But just like tcp slow start, which it is modeled after, you probably don't want to set it to the lowest value to start because then you'll never reach best case throughput even if everything runs smoothly most of the time16:14
fricklerI purposely refrained from suggesting to start with a lower bound of 116:15
fricklerthough the tcp example is also misleading, we only have essentially a single long running connection in our case16:16
fricklerso the startup behaviour doesn't matter much after a couple of days of running. and that is one more argument to start the experiment on a quiet day16:17
clarkbif we set the floor to 2 I bet we never go above 816:19
clarkband the average will probably be around 316:20
clarkbmaybe that is a good thing if it motivates people to improve job stability16:20
fricklerbut if you think that that is the wrong outcome, isn't then the algorithm flawed?16:20
corvusif things are bad, we sit at the floor.  we're at the floor now.  we're almost always at the floor.  so setting the floor is essentially setting the fixed window size for openstack.  it would be good to let the tool do its job and try to achieve the max throughput.16:20
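The floor behaviour discussed here can be explored with a toy simulation. This is a rough sketch under stated assumptions, not Zuul's implementation: it assumes the window grows by 1 for each item that merges and is halved on each gate failure, never dropping below the floor. The function name, the 30% failure rate, and the step count are hypothetical.

```python
import random


def simulate_window(floor, failure_rate, steps=1000, seed=0):
    """Toy model of a floating gate window (assumed rules, see lead-in).

    Each step is one item reaching the head of the queue: on success the
    window grows by 1, on failure it is halved but kept at or above floor.
    """
    rng = random.Random(seed)
    window = floor
    history = []
    for _ in range(steps):
        if rng.random() < failure_rate:
            window = max(floor, window // 2)
        else:
            window += 1
        history.append(window)
    return history


for floor in (2, 20):
    history = simulate_window(floor, failure_rate=0.3)
    print(f"floor={floor}: max={max(history)}, mean={sum(history) / len(history):.1f}")
```

Varying `failure_rate` shows how quickly the window collapses back toward the floor as the gate gets flakier, which is the effect both sides of the discussion are describing.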
corvusmeanwhile, there is also another tool available that could have a significant impact on resource usage and throughput.  like it could actually speed up throughput without any downside whatsoever.  that's the early failure detection.  i think it would be a really good idea for openstack to take advantage of it.16:22
corvusespecially since it's so easy.16:22
corvushere is how we configured it for the zuul project itself: https://opendev.org/zuul/zuul/src/branch/master/.zuul.yaml#L103-L10616:22
fricklerif it were so easy, it would long be done. avoiding false positives is non-trivial IMO16:22
clarkbfrickler: I think that openstack's blase approach to CI stability is flawed (note I don't think any individual is responsible, it's a culture problem) and that leads to the algorithm being a less good choice for openstack. We know that other zuul users don't have these problems and have had to implement a window ceiling as a result16:22
corvusthat's a regex that seems to work very well for zuul with stestr16:23
clarkbI would love for openstack to go back to having a culture where needing the window ceiling is something we strive for but it has been a while since I got that feeling from the community16:23
corvusany job that uses stestr can probably start with that and have a really good chance that it just works16:23
corvusbut depending on what else is output during a job, it might not work, which is why some familiarity with the job is important16:24
fricklerlots of devstack based jobs add some custom testing at the end. nobody knows even near all of those jobs16:24
corvusit fails safe.  so even if there's a false positive, it would be seen in testing.16:26
corvus(and that's a pretty tight regex, i would be surprised by a false positive)16:27
fricklerhow would that be seen other than as a gate failure?16:27
corvusit would be seen in check.  and there is special logging for it, so it can be discovered in the log for the build.16:28
*** iurygregory__ is now known as iurygregory16:35
frickler"If an early failure is triggered, the job result will be recorded as FAILURE even if the job playbooks ultimately succeed." that sounds safe enough I guess16:38
corvusyeah, so a false positive will not go unnoticed.  it should mean that it either works as expected or if there's a problem with the regex, then someone will need to fix the regex or disable it.16:41
fricklerhow could a child job disable it? according to the docs regexes can only be added. does this also support negate?16:46
fricklerbut anyway, I think we could simply copy the regex from zuul-nox to tempest if kopecmartin and/or gmann agree16:47
fricklerand then see how far that gets us16:47
gmannclarkb: corvus: frickler:  I am ok to try, and to check whether the zuul regex will work fine or whether there is any other safe regex we can define.16:48
gmannbut two questions. 1. same as frickler pointed out, that 'option is combined from all parent' and not overridden 2. 'special logging ': do we have an example of that special logging? I think that can help to detect false failures16:50
gmanndue to the first one, I am hesitant to do that in the devstack base job, but we can do it in the tempest base job and let projects carefully check the right regex for their own jobs and define it themselves (mainly for non-tempest jobs)16:51
gmannfor tempest jobs we know what the job output looks like and which strings we log when a test passes or fails16:52
fricklerfor devstack itself I don't think this is needed, devstack fails very fast on any error (except for some pathological cases like rpm issues, but that I'd consider a bug in devstack anyway)16:54
fricklerso yes this is why I suggested tempest base job16:54
fricklerlet me see if I can get this pushed still before I end my day16:55
gmannwith the devstack base job I meant that we could define this in a single place for all the parent jobs, but yes, it will be good to do it in the tempest base job16:55
gmannthis is how a test failure is logged: '..[28.423190s] ... FAILED', so the zuul-defined regex seems safe to me, and a false failure should be very rare16:55
gmannfrickler: ohk, ping me if you can push otherwise I can push the change16:56
corvusquestion #2: https://zuul.opendev.org/t/zuul/build/418cdcd6b4f141de9a3d162aa6be4113/log/job-output.txt#662016:56
corvusthat's the log output16:56
gmannclarkb: ++ thanks16:56
gmanncorvus: thanks16:56
corvusquestion #1 -- yes it's append-only so a child job would need to get its parent to fix or remove it.  it is the re2 syntax, but i don't believe that negate is supported.  that should be possible to add.16:58
gmannthat regex might not cover every failure, but the most common test failure case will be covered, and as we see how it goes we can improve it gradually 16:58
corvusgmann: ++16:59
fricklercorvus: would it make sense to add another warning at the end of the run playbook for the case there was an early failure but then an rc=0? or is there one already maybe?17:02
corvusthere is not, but i agree that would make sense and would not be difficult17:03
gmann++. in that case, maybe job status can be something else so that it is easily detected 'FAILED ?' with question mark if that is easy to do17:03
corvusthat's more difficult; but i think adding a build event might be possible; those show up in build metadata; i don't think they are directly queryable, but they could be in the future.  but they can be easily scanned in the api.17:05
gmannk. make sense. at least we will be able to run such query in opensearch dashboard periodically and see if there are any such case17:08
opendevreviewJens Harbott proposed openstack/tempest master: Add early failure detection for tempest jobs  https://review.opendev.org/c/openstack/tempest/+/92495617:16
opendevreviewJens Harbott proposed openstack/tempest master: DNM: Test early failure detection  https://review.opendev.org/c/openstack/tempest/+/92495717:16
opendevreviewMerged openstack/devstack master: Install simplejson in devstack venv  https://review.opendev.org/c/openstack/devstack/+/92486717:29
opendevreviewAshley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/92496017:48
opendevreviewAshley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/92496017:53
opendevreviewAshley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/92496017:54
*** iurygregory__ is now known as iurygregory17:57
opendevreviewAshley Rodriguez proposed openstack/devstack-plugin-ceph stable/2024.1: Follow Up Patch Fix Ingress Deamon  https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/92496417:59
fricklerbtw. as I was just looking at the final nova patch, integrated gate has window_size=30 currently, so it isn't glued to the 20 permanently at least18:10
corvushttps://graphite.opendev.org/?width=586&height=308&target=stats.gauges.zuul.tenant.openstack.pipeline.gate.queue.integrated.window&from=00%3A00_20240701&until=23%3A59_2024072518:14
corvusthere have been some impressive spikes, but it looks like it stays pretty close to the floor18:15
opendevreviewJens Harbott proposed openstack/tempest master: Add early failure detection for tempest base job  https://review.opendev.org/c/openstack/tempest/+/92495618:39
opendevreviewJens Harbott proposed openstack/tempest master: DNM: Test early failure detection  https://review.opendev.org/c/openstack/tempest/+/92495718:39
clarkbcorvus: ^ that DNM change failed to match and early fail. is that because the message comes from a debug task and we're maybe filtering by task type when doing early detection?21:08
clarkbotherwise the regex seems to work with some (naive admittedly) local testing21:08
corvusclarkb: has to come from a shell task21:17
corvusif you want it to early fail from an ansible task, just have the task fail.  that's automatic.  :)21:17
corvusso to make that dnm test effective, it should be shell: 'echo "{0} tempest.api.test_fake [0.1s] ... FAILED"' or something vaguely like that21:18
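A pattern of this kind can be checked locally against sample output lines before it goes into a job. The sketch below is only an illustration: `FAILURE_PATTERN` is a hypothetical regex written for this example, not the one from the zuul project's `.zuul.yaml`, and Python's `re` module stands in for re2 (the pattern sticks to constructs both should accept).

```python
import re

# Hypothetical pattern for an stestr-style failure line such as
#   {0} tempest.api.test_fake [0.1s] ... FAILED
# It is NOT copied from the zuul project's configuration.
FAILURE_PATTERN = re.compile(r"\{\d+\} \S+ \[[\d.]+s\] \.\.\. FAILED")


def is_early_failure(line):
    """Return True if the output line looks like a failed test result."""
    return FAILURE_PATTERN.search(line) is not None


samples = [
    "{0} tempest.api.test_fake [0.1s] ... FAILED",
    "{0} tempest.api.test_fake [0.1s] ... ok",
    "some unrelated log line mentioning FAILED",
]
for line in samples:
    print(is_early_failure(line), line)
```

Keeping the worker id, timing, and `... FAILED` suffix all in the pattern is what makes it tight: a stray "FAILED" elsewhere in the job output does not match.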
clarkbcool that confirms that the test is flawed and not the regex at least until we can rewrite the test21:18
clarkbyup21:18
clarkbI guess I can push an update to do that21:18
opendevreviewClark Boylan proposed openstack/tempest master: DNM: Test early failure detection  https://review.opendev.org/c/openstack/tempest/+/92495721:21
