frickler | I'd still argue that we'd want to throw away only 10*10m rather than 20*10m; early detection does nothing to reduce the rate of actual gate failures | 06:00 |
frickler | also, 10m is unreasonable; the devstack phase alone takes 20m, and it very rarely fails. assuming the failure happens on average halfway through the tempest run, that would be 40m vs. 60m | 06:02 |
frickler | or, for the more relevant case of a 2h full tempest job, 70m vs. 120m | 06:02 |
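frickler's estimate can be sanity-checked with a little arithmetic (a hypothetical sketch using only the durations quoted above: a 20-minute devstack phase, with the failure landing on average halfway through the tempest phase):

```python
# Hedged sketch of frickler's estimate: with early failure detection, a
# failing job costs the devstack phase plus, on average, half the tempest
# phase, instead of running the full job to completion.

def early_fail_minutes(devstack_min, total_min):
    tempest_min = total_min - devstack_min
    return devstack_min + tempest_min / 2

print(early_fail_minutes(20, 60))   # vs. 60m without early detection
print(early_fail_minutes(20, 120))  # vs. 120m for a 2h full tempest job
```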
opendevreview | Amit Uniyal proposed openstack/whitebox-tempest-plugin master: verify vencrypt feature https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/923824 | 06:09 |
opendevreview | yatin proposed openstack/devstack master: [DNM] check abandoned patch https://review.opendev.org/c/openstack/devstack/+/924822 | 07:59 |
opendevreview | yatin proposed openstack/devstack master: [DNM] check abandoned patch https://review.opendev.org/c/openstack/devstack/+/924822 | 08:03 |
opendevreview | Katarina Strenkova proposed openstack/tempest master: Fix cleanup of keypairs for --prefix option https://review.opendev.org/c/openstack/tempest/+/924919 | 10:56 |
*** ykarel_ is now known as ykarel | 12:07 | |
opendevreview | Riccardo Pittau proposed openstack/devstack master: Install simplejson in devstack venv https://review.opendev.org/c/openstack/devstack/+/924867 | 12:38 |
frickler | melwitt: fyi ^^ is the workaround for the issue that you noticed yesterday. triggered by osc-lib being released but osc still pending, see discussion in #-sdks | 12:48 |
ykarel | kopecmartin, dansmith can we also get this https://review.opendev.org/c/openstack/devstack/+/924864 | 14:04 |
dansmith | ykarel: no, your quota has been exceeded! | 14:06 |
dansmith | ykarel: but seriously, I don't have +2 on unmaintained/* | 14:06 |
dansmith | you'll need to ask someone who does | 14:06 |
dansmith | I gave you a shiny +1 at least :) | 14:08 |
frickler | ykarel: elod is also out, maybe lajoskatona can simply single-approve? | 14:12 |
ykarel | thx dansmith frickler , lajoskatona if you can +W ^ | 14:14 |
lajoskatona | frickler, ykarel: let me check | 14:17 |
lajoskatona | ykarel: done | 14:18 |
opendevreview | Merged openstack/devstack-plugin-ceph stable/2024.1: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924727 | 14:18 |
ykarel | thx lajoskatona | 14:18 |
melwitt | thanks frickler | 15:04 |
*** dviroel is now known as dviroel|afk | 15:14 | |
clarkb | frickler: I think those numbers were just illustrative. I don't know what the actual timespans look like for a tempest job particularly since there have been improvements with the openstack cli server for example. | 15:48 |
clarkb | but the idea is to see if that helps well enough to relieve the pressure on resources during gate thrash and high demand. | 15:48 |
clarkb | I think we're comparing ~20 * however long it takes to reach the first tempest test fail (so 25-120 minutes?) to 10 * 120 minutes | 15:49 |
clarkb | there are scenarios where the first option is better | 15:49 |
clarkb | but we don't really know until we try (or spend a lot of effort modeling it; in this case I think it's less effort and quicker wins to just experiment directly) | 15:50 |
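clarkb's comparison can be put in rough node-minutes (an illustrative back-of-the-envelope only; the window sizes and durations come from the discussion above):

```python
# Cost of one gate reset in node-minutes: every change in the window has its
# jobs restarted, so the waste is roughly window * minutes burned per change.

def gate_reset_cost(window, minutes_per_change):
    return window * minutes_per_change

full_run = 120  # a full tempest job, in minutes

cost_capped = gate_reset_cost(10, full_run)      # window capped at 10, no early detection
cost_early_fast = gate_reset_cost(20, 25)        # window 20, failure surfaces at 25 min
cost_early_slow = gate_reset_cost(20, full_run)  # window 20, failure only surfaces at the end

# early detection wins whenever failures surface in under full_run/2 minutes
```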
frickler | yes, and lowering the lower bound is an easy and quick way to test things, so I don't understand why people insist on blocking that experiment :( | 15:59 |
clarkb | because we aren't under that pressure right now, we don't get the benefits either way. So shouldn't we at least try to do the more nimble and correct thing? | 16:01 |
frickler | I still think that both are valid options that do not exclude each other. also it will be too late to start the experiment while zuul is under fire. I would even go as far as suggesting to set the lower bound very low, like 2 or 3, and then watch how the calculated value develops | 16:04 |
clarkb | another thing to keep in mind is that zuul is designed to enable high throughput. Setting low bounds like that effectively undoes the very reason for zuul existing | 16:05 |
clarkb | (zuul was written because the old limit was 1 for openstack and openstack jobs at the time took about an hour to complete so openstack could only merge a maximum of 24 items a day. Zuul was specifically created to address this bottleneck) | 16:06 |
clarkb | Personally I don't want to subvert the tool because the tests are flaky. I'd rather use the built-in pressure relief valves to enable improvements while still aiming at higher throughput as a goal | 16:07 |
frickler | the lower bound is only a lower bound. IMO the algorithm to have a floating queue depth was exactly made for our case, we just don't let it play out by capping the lower bound at a much too high value | 16:11 |
frickler | nothing would block it reaching a queue depth of 50 or more if the gate was really deep green | 16:13 |
clarkb | right, it's a balancing act. But just like TCP slow start, which it is modeled after, you probably don't want to start at the lowest value, because then you'll never reach best-case throughput even if everything runs smoothly most of the time | 16:14 |
frickler | I purposely refrained from suggesting to start with a lower bound of 1 | 16:15 |
frickler | though the tcp example is also misleading, we only have essentially a single long running connection in our case | 16:16 |
frickler | so the startup behaviour doesn't matter much after a couple of days of running. and that is one more argument to start the experiment on a quiet day | 16:17 |
clarkb | if we set the floor to 2 I bet we never go above 8 | 16:19 |
clarkb | and the average will probably be around 3 | 16:20 |
clarkb | maybe that is a good thing if it motivates people to improve job stability | 16:20 |
frickler | but if you think that that is the wrong outcome, isn't then the algorithm flawed? | 16:20 |
corvus | if things are bad, we sit at the floor. we're at the floor now. we're almost always at the floor. so setting the floor is essentially setting the fixed window size for openstack. it would be good to let the tool do its job and try to achieve the max throughput. | 16:20 |
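The floating-window behaviour corvus describes can be sketched roughly like this (a simplified model, not Zuul's actual implementation; the grow/halve rules, the floor of 20, and the ceiling are illustrative):

```python
# Simplified floating gate window: grow by one per successful merge, halve on
# a gate reset, and never drop below the configured floor (TCP-slow-start-ish).

def step_window(window, success, floor=20, ceiling=100):
    if success:
        return min(window + 1, ceiling)
    return max(window // 2, floor)

window = 20
for _ in range(15):                 # a streak of successful merges
    window = step_window(window, True)
# one gate failure halves the window, clamped back up to the floor
window = step_window(window, False)
```

With a flaky gate the failures keep arriving before the window can climb, so it spends nearly all its time pinned at the floor, which is the behaviour corvus describes.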
corvus | meanwhile, there is also another tool available that could have a significant impact on resource usage and throughput. it could actually increase throughput without any downside whatsoever: early failure detection. i think it would be a really good idea for openstack to take advantage of it. | 16:22 |
corvus | especially since it's so easy. | 16:22 |
corvus | here is how we configured it for the zuul project itself: https://opendev.org/zuul/zuul/src/branch/master/.zuul.yaml#L103-L106 | 16:22 |
frickler | if it were so easy, it would long be done. avoiding false positives is non-trivial IMO | 16:22 |
clarkb | frickler: I think that openstack's blasé approach to CI stability is flawed (note I don't think any individual is responsible; it's a culture problem) and that leads to the algorithm being a less good choice for openstack. We know that other zuul users don't have these problems and have had to implement a window ceiling as a result | 16:22 |
corvus | that's a regex that seems to work very well for zuul with stestr | 16:23 |
clarkb | I would love for openstack to go back to having a culture where needing the window ceiling is something we strive for but it has been a while since I got that feeling from the community | 16:23 |
corvus | any job that uses stestr can probably start with that and have a really good chance that it just works | 16:23 |
corvus | but depending on what else is output during a job, it might not work, which is why some familiarity with the job is important | 16:24 |
frickler | lots of devstack based jobs add some custom testing at the end. nobody knows even near all of those jobs | 16:24 |
corvus | it fails safe. so even if there's a false positive, it would be seen in testing. | 16:26 |
corvus | (and that's a pretty tight regex, i would be surprised by a false positive) | 16:27 |
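For reference, a pattern in the same spirit can be tested locally (this is an illustrative regex, not necessarily the exact one in zuul's .zuul.yaml linked above; the test name is made up):

```python
import re

# Illustrative pattern for stestr failure lines of the shape gmann quotes
# later in the discussion: '{0} tempest.api.test_fake [28.423190s] ... FAILED'
FAIL_RE = re.compile(r"\{\d+\} \S+ \[\d+\.\d+s\] \.\.\. FAILED")

assert FAIL_RE.search("{0} tempest.api.test_fake [28.423190s] ... FAILED")
assert not FAIL_RE.search("{0} tempest.api.test_fake [28.423190s] ... ok")
```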
frickler | how would that be seen other than as a gate failure? | 16:27 |
corvus | it would be seen in check. and there is special logging for it, so it can be discovered in the log for the build. | 16:28 |
*** iurygregory__ is now known as iurygregory | 16:35 | |
frickler | "If an early failure is triggered, the job result will be recorded as FAILURE even if the job playbooks ultimately succeed." that sounds safe enough I guess | 16:38 |
corvus | yeah, so a false positive will not go unnoticed. it should mean that it either works as expected or if there's a problem with the regex, then someone will need to fix the regex or disable it. | 16:41 |
frickler | how could a child job disable it? according to the docs, regexes can only be added. does this also support negation? | 16:46 |
frickler | but anyway, I think we could simply copy the regex from zuul-nox to tempest if kopecmartin and/or gmann agree | 16:47 |
frickler | and then see how far that gets us | 16:47 |
gmann | clarkb: corvus: frickler: I am ok to try it and to check whether the zuul regex will work fine, or what other safe regex we can define. | 16:48 |
gmann | but two questions. 1. same as frickler pointed out, the 'option is combined from all parents' and not overridden 2. 'special logging': do we have an example of that special logging? I think that can help to detect false failures | 16:50 |
gmann | due to the first one, I am hesitant to do that in the devstack base job, but we can do it in the tempest base job and let projects carefully check the right regex as per their jobs and define it themselves (mainly for non-tempest jobs) | 16:51 |
gmann | for tempest jobs we know what the job output looks like and what strings we log when tests pass or fail | 16:52 |
frickler | for devstack itself I don't think this is needed, devstack fails very fast on any error (except for some pathological cases like rpm issues, but that I'd consider a bug in devstack anyway) | 16:54 |
frickler | so yes this is why I suggested tempest base job | 16:54 |
frickler | let me see if I can get this pushed still before I end my day | 16:55 |
gmann | by the devstack base job I mean we could define this in a single place for all the jobs that inherit from it, but yes it will be good to do it in the tempest base job | 16:55 |
gmann | this is what a test failure looks like: '..[28.423190s] ... FAILED', so the zuul-defined regex seems safe to me, and a false failure would be very, very rare | 16:55 |
gmann | frickler: ok, let me know if you can push it, otherwise I can push the change | 16:56 |
corvus | question #2: https://zuul.opendev.org/t/zuul/build/418cdcd6b4f141de9a3d162aa6be4113/log/job-output.txt#6620 | 16:56 |
corvus | that's the log output | 16:56 |
gmann | clarkb: ++ thanks | 16:56 |
gmann | corvus: thanks | 16:56 |
corvus | question #1 -- yes it's append-only so a child job would need to get its parent to fix or remove it. it is the re2 syntax, but i don't believe that negate is supported. that should be possible to add. | 16:58 |
gmann | that regex might not cover all failures, but the most common test failure cases will be covered, and as we see how it goes we can improve it gradually | 16:58 |
corvus | gmann: ++ | 16:59 |
frickler | corvus: would it make sense to add another warning at the end of the run playbook for the case where there was an early failure but then rc=0? or is there one already maybe? | 17:02 |
corvus | there is not, but i agree that would make sense and would not be difficult | 17:03 |
gmann | ++. in that case, maybe the job status can be something else so that it is easily detected, like 'FAILED?' with a question mark, if that is easy to do | 17:03 |
corvus | that's more difficult; but i think adding a build event might be possible; those show up in build metadata; i don't think they are directly queryable, but they could be in the future. but they can be easily scanned in the api. | 17:05 |
gmann | k, makes sense. at least we will be able to run such a query in the opensearch dashboard periodically and see if there are any such cases | 17:08 |
opendevreview | Jens Harbott proposed openstack/tempest master: Add early failure detection for tempest jobs https://review.opendev.org/c/openstack/tempest/+/924956 | 17:16 |
opendevreview | Jens Harbott proposed openstack/tempest master: DNM: Test early failure detection https://review.opendev.org/c/openstack/tempest/+/924957 | 17:16 |
opendevreview | Merged openstack/devstack master: Install simplejson in devstack venv https://review.opendev.org/c/openstack/devstack/+/924867 | 17:29 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924960 | 17:48 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924960 | 17:53 |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2023.2: Fix ingress deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924960 | 17:54 |
*** iurygregory__ is now known as iurygregory | 17:57 | |
opendevreview | Ashley Rodriguez proposed openstack/devstack-plugin-ceph stable/2024.1: Follow Up Patch Fix Ingress Deamon https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/924964 | 17:59 |
frickler | btw. as I was just looking at the final nova patch, integrated gate has window_size=30 currently, so it isn't glued to the 20 permanently at least | 18:10 |
corvus | https://graphite.opendev.org/?width=586&height=308&target=stats.gauges.zuul.tenant.openstack.pipeline.gate.queue.integrated.window&from=00%3A00_20240701&until=23%3A59_20240725 | 18:14 |
corvus | there have been some impressive spikes, but it looks like it stays pretty close to the floor | 18:15 |
opendevreview | Jens Harbott proposed openstack/tempest master: Add early failure detection for tempest base job https://review.opendev.org/c/openstack/tempest/+/924956 | 18:39 |
opendevreview | Jens Harbott proposed openstack/tempest master: DNM: Test early failure detection https://review.opendev.org/c/openstack/tempest/+/924957 | 18:39 |
clarkb | corvus: ^ that DNM change failed to match and early-fail. is that because the message comes from a debug task and we're maybe filtering by task type when doing early detection? | 21:08 |
clarkb | otherwise the regex seems to work with some (naive admittedly) local testing | 21:08 |
corvus | clarkb: has to come from a shell task | 21:17 |
corvus | if you want it to early fail from an ansible task, just have the task fail. that's automatic. :) | 21:17 |
corvus | so to make that dnm test effective, it should be shell: 'echo "{0} tempest.api.test_fake [0.1s] ... FAILED"' or something vaguely like that | 21:18 |
clarkb | cool that confirms that the test is flawed and not the regex at least until we can rewrite the test | 21:18 |
clarkb | yup | 21:18 |
clarkb | I guess I can push an update to do that | 21:18 |
opendevreview | Clark Boylan proposed openstack/tempest master: DNM: Test early failure detection https://review.opendev.org/c/openstack/tempest/+/924957 | 21:21 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!