*** __ministry is now known as Guest2506 | 07:24 | |
*** ralonsoh_ is now known as ralonsoh | 09:21 | |
opendevreview | Jaromír Wysoglad proposed openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos https://review.opendev.org/c/openstack/project-config/+/937339 | 09:52 |
*** __ministry is now known as Guest2522 | 10:42 | |
opendevreview | Merged openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos https://review.opendev.org/c/openstack/project-config/+/937339 | 13:29 |
*** darmach5 is now known as darmach | 13:30 | |
ykarel | Hi, noticed Provider: rax-ord, Label: ubuntu-noble has only 1 cpu (ansible_processor_nproc: 1), is that normal? | 14:48 |
ykarel | noticed it when a neutron unit test job timed out | 14:48 |
ykarel | the same provider on a successful job has 8 cpus | 14:48 |
ykarel | pass:- https://66dcde1d9aeaad4e1862-35f3f5b682b7c3e09c41881ae7991d96.ssl.cf2.rackcdn.com/936850/6/check/openstack-tox-py312/28b1f63/zuul-info/host-info.ubuntu-noble.yaml | 14:49 |
ykarel | fail:- https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml | 14:49 |
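(A minimal sketch of the comparison being made here, assuming the host-info YAML files are fetchable as shown and that the Ansible facts appear somewhere in the parsed document; the recursive lookup is just defensive since the exact nesting isn't quoted in the log.)

```python
import requests
import yaml

def find_fact(node, key):
    """Recursively search parsed YAML for an Ansible fact by key."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            hit = find_fact(value, key)
            if hit is not None:
                return hit
    elif isinstance(node, list):
        for item in node:
            hit = find_fact(item, key)
            if hit is not None:
                return hit
    return None

urls = {
    "pass": "https://66dcde1d9aeaad4e1862-35f3f5b682b7c3e09c41881ae7991d96.ssl.cf2.rackcdn.com/936850/6/check/openstack-tox-py312/28b1f63/zuul-info/host-info.ubuntu-noble.yaml",
    "fail": "https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml",
}

for label, url in urls.items():
    facts = yaml.safe_load(requests.get(url, timeout=30).text)
    print(label, "ansible_processor_nproc =", find_fact(facts, "ansible_processor_nproc"))
```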
fungi | ykarel: not normal, no | 14:59 |
frickler | that's weird indeed, I don't see how that could happen with the set of labels that we use | 15:13 |
frickler | or maybe ansible for some reason is collecting bogus data? possibly we should also dump /proc/cpuinfo somewhere? could also be interesting to track the cpu flags situation a bit | 15:15 |
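(A rough sketch of the /proc/cpuinfo dump frickler suggests, assuming the standard Linux layout where each logical CPU has a "processor" line and a "flags" line; something like this could be dropped into a log-collection step.)

```python
# Summarize /proc/cpuinfo: logical CPU count plus the union of CPU flags.
from pathlib import Path

processors = 0
flags = set()
for line in Path("/proc/cpuinfo").read_text().splitlines():
    if line.startswith("processor"):
        processors += 1
    elif line.startswith("flags"):
        flags.update(line.split(":", 1)[1].split())

print(f"logical processors: {processors}")
print(f"cpu flags ({len(flags)}): {' '.join(sorted(flags))}")
```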
fungi | i'm trying to get a snapshot of the server instances currently booted there to see if any report using an incorrect flavor | 15:25 |
fungi | but the api is hanging | 15:25 |
ykarel | doesn't seem like ansible is reporting bogus data, as the failed job also ran tests with concurrency 1 | 15:26 |
frickler | we also seem to be kind of running at capacity since about 11:30, see https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1 but also other providers | 15:27 |
frickler | although nowhere completely using the expected max, still suspiciously flatlining | 15:28 |
fungi | the rackspace cloud status page doesn't indicate any outages nor maintenance in ord | 15:33 |
clarkb | fungi: frickler ykarel iirc that's the issue caused by the old version of xen | 15:38 |
clarkb | the ansible facts capture the xen version and noble with an older xen is sad | 15:39 |
fungi | several attempts now to perform `openstack server list` there, and it just hangs indefinitely. presumably zuul is having a similar experience right now | 15:39 |
ykarel | clarkb, you mean ^ can happen inconsistently on same provider? | 15:47 |
clarkb | ykarel: on the same hypervisor(s) on the same provider yes I think so | 15:48 |
clarkb | I'm still digging through logs to find where this was discussed previously but it came up on october 2, 2024 and I was really hoping that those directly affected would dig into it more | 15:48 |
clarkb | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13 | 15:49 |
ykarel | thx clarkb, now i see some discussion in #opendev, so will watch there | 15:51 |
clarkb | ykarel: fwiw I'm still trying to encourage people to link to the logs inside of zuul so that we can link specific lines more easily | 15:56 |
clarkb | but I've confirmed the fail case above has an old version of xen like we saw before | 15:57 |
fungi | where did you find the xen version reported? | 15:57 |
clarkb | fungi: in the host-info.ubuntu-noble.yaml file linked above, it's near the top of the file | 15:58 |
clarkb | (I would link to it if I had a link via zuul but I'm not going to work backwards as it's too much work) | 15:58 |
ykarel | clarkb, ack will keep that in mind next time, I'm just more used to the other interface | 15:59 |
clarkb | and then this file https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/inventory.yaml gives us the hypervisor host id | 15:59 |
fungi | got it | 15:59 |
fungi | i didn't realize we got the xen version from host-info | 16:00 |
ykarel | https://zuul.openstack.org/build/5c1b0433c263448c9c66ead7ff84866f/log/zuul-info/host-info.ubuntu-noble.yaml#1 | 16:00 |
ykarel | ^ the failed build zuul link | 16:01 |
ykarel | host_id: 7eca1835ed13e21e6a6b3c7bba861f314865eb616acfeaf63911026b | 16:01 |
clarkb | fungi: it's reported as the "bios version" | 16:03 |
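(Based on the discussion above, a quick check for the old-Xen case could look roughly like this; ansible_bios_version and ansible_processor_nproc are standard Ansible fact names, but whether they sit at the top level of the host-info file is an assumption.)

```python
import requests
import yaml

# host-info.yaml from the failed build linked above
URL = ("https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/"
       "zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/"
       "zuul-info/host-info.ubuntu-noble.yaml")

facts = yaml.safe_load(requests.get(URL, timeout=30).text)

# Assumes the facts are top-level keys; adjust if they sit under a wrapper key.
print("bios version (xen):", facts.get("ansible_bios_version"))
print("nproc:", facts.get("ansible_processor_nproc"))
```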
fungi | that makes more sense, yep | 16:04 |
haleyb | frickler: can you take a look at https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/937194 ? I think it's sane based on my testing in the charms repo, thanks | 17:22 |
tonyb | I wonder if there is enough data in logsearch to create a query to prove or disprove that hypothesis | 18:55 |
clarkb | tonyb: I think frickler tried but doesn't seem like there are many occurrences | 19:05 |
tonyb | ahh okay. | 19:06 |
clarkb | tonyb: I can clean up your gerrit autohold in zuul right? that's from a while back | 19:06 |
tonyb | oh probably. I thought I'd cleaned them all up | 19:23 |
clarkb | tonyb: I still see one using zuul-client autohold-list I can clean it up | 19:23 |
tonyb | please do | 19:27 |
clarkb | done | 19:27 |
tonyb | c'est parfait | 19:32 |
sean-k-mooney | was there a zuul restart today? | 20:00 |
fungi | sean-k-mooney: no, last zuul restart was saturday (automated weekly upgrades) | 20:01 |
sean-k-mooney | actually no, i think the canceled jobs in https://review.opendev.org/c/openstack/nova/+/924593/4?tab=change-view-tab-header-zuul-results-summary are because the py312 job timed out | 20:01 |
fungi | sean-k-mooney: find a very lengthy deep dive into what caused that in the #opendev channel log | 20:01 |
sean-k-mooney | that almost never happens so i bet it was a package install issue or something like that | 20:01 |
sean-k-mooney | well i know we have the fast fail thing to cancel the build set | 20:02 |
fungi | short answer, a job on the change ahead of those started to fail early which caused all the jobs for changes after it to get cancelled, but then the failing job hit another issue which caused it to be marked unreachable and zuul retried the job, so the change ended up merging | 20:02 |
sean-k-mooney | but we almost never have the tox jobs time out so it's relatively novel to see it cancel like that | 20:02 |
fungi | the plan is to fix zuul to not retry builds which are indicating early failure pattern matches | 20:03 |
sean-k-mooney | fungi: did we fix that already? | 20:03 |
sean-k-mooney | i thought we fixed it the other way | 20:03 |
sean-k-mooney | to try and have zuul not abort if it could be retried | 20:03 |
sean-k-mooney | fungi: didn't you write a patch to address that previously | 20:04 |
fungi | sean-k-mooney: no, the problem is the job hit a legitimate failure in a tempest test and so indicated to zuul that it was in the process of failing but would finish running the remaining tempest tests first, but then at the end something happened to make the node unreachable, which triggered zuul's retry of the build | 20:05 |
fungi | but by then the in progress builds for all the changes behind that one had already been canceled and the changes set to report as failed | 20:05 |
sean-k-mooney | oh ok | 20:05 |
sean-k-mooney | i thought unreachable was translated to error without a retry | 20:06 |
sean-k-mooney | so that's a different failure mode than we had before | 20:06 |
fungi | unreachable job node makes zuul think it probably failed due to an issue in the underlying cloud provider unrelated to the job | 20:06 |
sean-k-mooney | right but i thought the retry only happened if it failed in a pre-run playbook | 20:07 |
fungi | but really the job should have failed due to the earlier tempest failure and not been retried automatically | 20:07 |
sean-k-mooney | whereas tempest would be in run or post | 20:07 |
fungi | any failure in a pre-run playbook triggers a retry; after pre-run, only unreachable errors trigger a retry | 20:07 |
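(A toy restatement of the retry rule described here, not Zuul's actual code: failures in pre-run always retry, while after pre-run only an unreachable node does; the proposed fix discussed above is to skip the retry when the build has already signalled an early failure.)

```python
def should_retry(phase: str, failed: bool, unreachable: bool,
                 early_failure_reported: bool = False) -> bool:
    """Illustrative only, not Zuul's implementation."""
    if phase == "pre-run":
        # any pre-run failure (including unreachable) is retried
        return failed or unreachable
    if unreachable:
        # proposed fix: don't retry if the build already reported an early failure
        return not early_failure_reported
    return False

# The case in this log: tempest reported an early failure during run, then the
# node went unreachable; today that retries, with the fix it would not.
print(should_retry("run", failed=True, unreachable=True, early_failure_reported=True))
```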
sean-k-mooney | ok i trust that ye have that in hand then | 20:08 |
sean-k-mooney | and yes the previous change that merged did have a retry of some of the jobs https://zuul.opendev.org/t/openstack/buildset/e0429ce698fa427b9dfdc2ed2682b578 | 20:09 |
fungi | yeah, short-short summary is that this is a corner case in zuul's retry logic, first time we've observed it, but fix is probably easy | 20:09 |
sean-k-mooney | it is similar but not quite the same as the previous time | 20:09 |
sean-k-mooney | previously it was something to do with the post playbook, i think in this case it's related to run | 20:10 |
fungi | sean-k-mooney: hairy details including job log links can be found starting at https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-12-09.log.html#t2024-12-09T17:22:52 | 20:11 |
fungi | took us a while to hit on the cause | 20:12 |
sean-k-mooney | i have enough to feel ok with rechecking the patch | 20:13 |
sean-k-mooney | i wanted to determine why it aborted before doing that | 20:13 |
fungi | yeah, that's the way to go. thanks for bringing it up! | 20:13 |
fungi | clarkb was the one who spotted it, and thought it looked odd enough to dig into | 20:13 |
sean-k-mooney | we are currently in the process of merging 2 different 30+ patch series which is fun :) | 20:14 |
fungi | you're finding all the bugs! | 20:14 |
sean-k-mooney | they are both making progress but most are taking at least one recheck | 20:14 |
sean-k-mooney | we really do have to eventually address the cinder volume issues | 20:15 |
sean-k-mooney | tl;dr cinder volume attach/detach can trigger a kernel panic or just never complete at the qemu level | 20:16 |
sean-k-mooney | depending on the version of qemu, kernel, phase of the moon, that are in use in any given job... | 20:16 |
fungi | setUpClass (tempest.api.compute.admin.test_volume.AttachSCSIVolumeTestJSON) is what failed in the job that ultimately got retried, so possibly the same | 20:17 |
fungi | https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8/log/job-output.txt#34945 | 20:17 |
sean-k-mooney | oh we don't have the post playbook logs | 20:20 |
sean-k-mooney | ya we can't see exactly but it may be the same issue | 20:20 |
sean-k-mooney | i kind of wish we had a better way to annotate tests as flaky (i.e. to pass in a set of regexes to match on known error messages or just a list of flaky tests) so that in any given job we could declare which tests to allow to be retried | 20:22 |
sean-k-mooney | i know we want to have stable tests but sometimes we just can't fix the instability without reducing coverage | 20:24 |
sean-k-mooney | the detach issue only happens with the lvm backend in ci as far as i am aware, but we don't want to lose all testing of iscsi by just testing with ceph | 20:24 |
sean-k-mooney | we use the lvm driver as a stand-in for all the vendor-specific SANs like netapp that use iscsi | 20:25 |
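(A purely hypothetical sketch of the job-local flaky-test override being wished for here; nothing like this exists in tempest or stestr today, and the names and pattern below are invented for illustration.)

```python
import re

# Hypothetical per-job policy: a project ships regexes for tests it is willing
# to retry in its own gate without marking them flaky in tempest itself.
FLAKY_PATTERNS = [
    re.compile(r"tempest\.api\.compute\.admin\.test_volume\..*"),  # example only
]

def is_allowed_flaky(test_id: str) -> bool:
    return any(p.search(test_id) for p in FLAKY_PATTERNS)

def classify(results):
    """Split failing tests into 'retry locally' vs 'hard fail'."""
    retry, hard_fail = [], []
    for test_id, status in results:
        if status == "fail":
            (retry if is_allowed_flaky(test_id) else hard_fail).append(test_id)
    return retry, hard_fail

print(classify([
    ("tempest.api.compute.admin.test_volume.AttachSCSIVolumeTestJSON", "fail"),
    ("tempest.api.compute.servers.test_servers.ServersTestJSON", "pass"),
]))
```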
fungi | at one time there was the openstackhealth service which used data from subunit2sql to determine fail rates for specific tempest tests | 20:27 |
fungi | but maintaining solutions like that definitely requires people spending time on it, which is hard for our contributors as a whole to prioritize | 20:28 |
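(For illustration, the kind of per-test aggregation a service like that performed; the flat (test_id, status) record format is assumed here, not the real subunit2sql schema.)

```python
from collections import Counter, defaultdict

def fail_rates(results):
    """Per-test fail rate from (test_id, status) records."""
    counts = defaultdict(Counter)
    for test_id, status in results:
        counts[test_id][status] += 1
    return {tid: c["fail"] / sum(c.values()) for tid, c in counts.items()}

sample = [("tempest.api.foo.TestA.test_x", "success"),
          ("tempest.api.foo.TestA.test_x", "fail"),
          ("tempest.api.bar.TestB.test_y", "success")]
print(fail_rates(sample))  # test_x fails half the time, test_y never
```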
sean-k-mooney | tempest has a way for us to manually mark specific tests as flaky | 20:32 |
sean-k-mooney | but they have not really wanted us to use that because once it's marked as such it will likely never get removed | 20:33 |
sean-k-mooney | that's why i wish there was a way to do that externally in the jobs | 20:33 |
sean-k-mooney | then nova could declare them flaky in our gate but cinder could keep them "non-flaky" in theirs | 20:34 |
sean-k-mooney | we are not sure why they fail more often in our gates than theirs but it's probably load related | 20:34 |
fungi | without knowing much about the mechanism, it doesn't sound like it would be too hard to add a project-local override list | 20:35 |
sean-k-mooney | fungi: i think we are using stestr underneath to run the tests, so without code changes to tempest to annotate tests as flaky i think we would need support from stestr to pass a flaky test regex or list | 20:44 |
sean-k-mooney | that's not something it currently supports | 20:44 |
sean-k-mooney | anyway i'm going to call it a day o/ | 20:44 |
fungi | aha, got it. have a good evening! | 20:47 |