Monday, 2024-12-09

07:24 *** __ministry is now known as Guest2506
09:21 *** ralonsoh_ is now known as ralonsoh
09:52 <opendevreview> Jaromír Wysoglad proposed openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/937339
10:42 *** __ministry is now known as Guest2522
13:29 <opendevreview> Merged openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/937339
13:30 *** darmach5 is now known as darmach
14:48 <ykarel> Hi, noticed Provider: rax-ord, Label: ubuntu-noble has only 1 cpu (ansible_processor_nproc: 1), is that normal?
14:48 <ykarel> noticed when a neutron unit test job timed out
14:48 <ykarel> same provider on a successful job has 8 cpus
14:49 <ykarel> pass: https://66dcde1d9aeaad4e1862-35f3f5b682b7c3e09c41881ae7991d96.ssl.cf2.rackcdn.com/936850/6/check/openstack-tox-py312/28b1f63/zuul-info/host-info.ubuntu-noble.yaml
14:49 <ykarel> fail: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml
14:59 <fungi> ykarel: not normal, no
15:13 <frickler> that's weird indeed, I don't see how that could happen with the set of labels we use
15:15 <frickler> or maybe ansible is collecting bogus data for some reason? possibly we should also dump /proc/cpuinfo somewhere? could also be interesting to track the cpu flags situation a bit
15:25 <fungi> i'm trying to get a snapshot of the server instances currently booted there to see if any report using an incorrect flavor
15:25 <fungi> but the api is hanging
15:26 <ykarel> doesn't seem like ansible is bogus, as the failed job also ran the tests with concurrency 1
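
For reference, a minimal sketch (not an existing opendev role or script) of the kind of /proc/cpuinfo dump frickler suggests, counting processors and collecting cpu flags so they can be compared against what Ansible reports in ansible_processor_nproc:

    # Sketch: read /proc/cpuinfo (Linux only) and report processor count
    # and cpu flags; illustrative, not part of any existing zuul job.
    from collections import Counter

    def read_cpuinfo(path="/proc/cpuinfo"):
        cpus, flags = 0, Counter()
        with open(path) as f:
            for line in f:
                if line.startswith("processor"):
                    cpus += 1
                elif line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
        return cpus, flags

    if __name__ == "__main__":
        cpus, flags = read_cpuinfo()
        print(f"processors: {cpus}")
        print("flags:", " ".join(sorted(flags)))
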
15:27 <frickler> we also seem to have been kind of running at capacity since about 11:30, see https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1 but also other providers
15:28 <frickler> although nowhere completely using the expected max, still suspiciously flatlining
15:33 <fungi> the rackspace cloud status page doesn't indicate any outages or maintenance in ord
15:38 <clarkb> fungi: frickler ykarel iirc that's the issue caused by the old version of xen
15:39 <clarkb> the ansible facts capture the xen version, and noble with an older xen is sad
15:39 <fungi> several attempts now to perform `openstack server list` there, and it just hangs indefinitely. presumably zuul is having a similar experience right now
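
A hedged sketch of that flavor check with openstacksdk; the cloud name "rax-ord" and the timeout value are assumptions here, not the actual opendev clouds.yaml configuration:

    # Sketch: list server instances in one provider and report their
    # flavors, with an api timeout so a hung API call doesn't block
    # forever. Cloud name and timeout are illustrative assumptions.
    import openstack

    conn = openstack.connect(cloud="rax-ord", api_timeout=60)

    for server in conn.compute.servers():
        # server.flavor layout varies by SDK version/microversion, so
        # fall back to printing the whole object if there's no name.
        flavor = getattr(server.flavor, "original_name", None) or server.flavor
        print(server.name, server.status, flavor)
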
15:47 <ykarel> clarkb, you mean ^ can happen inconsistently on the same provider?
15:48 <clarkb> ykarel: on the same hypervisor(s) on the same provider, yes I think so
15:48 <clarkb> I'm still digging through logs to find where this was discussed previously, but it came up on october 2, 2024 and I was really hoping that those directly affected would dig into it more
15:49 <clarkb> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13
15:51 <ykarel> thx clarkb, now i see some discussion in #opendev, so will watch there
15:56 <clarkb> ykarel: fwiw I'm still trying to encourage people to link to the logs inside of zuul so that we can link specific lines more easily
15:57 <clarkb> but I've confirmed the fail case above has an old version of xen like we saw before
15:57 <fungi> where did you find the xen version reported?
15:58 <clarkb> fungi: in the host-info.ubuntu-noble.yaml file linked above, it's near the top of the file
15:58 <clarkb> (I would link to it if I had a link via zuul, but I'm not going to work backwards as it's too much work)
15:59 <ykarel> clarkb, ack, will keep in mind next time, just more used to the other interface
15:59 <clarkb> and then this file https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/inventory.yaml gives us the hypervisor host id
15:59 <fungi> got it
16:00 <fungi> i didn't realize we got the xen version from host-info
16:00 <ykarel> https://zuul.openstack.org/build/5c1b0433c263448c9c66ead7ff84866f/log/zuul-info/host-info.ubuntu-noble.yaml#1
16:01 <ykarel> ^ the failed build zuul link
16:01 <ykarel> host_id: 7eca1835ed13e21e6a6b3c7bba861f314865eb616acfeaf63911026b
16:03 <clarkb> fungi: it's reported as the "bios version"
16:04 <fungi> that makes more sense, yep
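
A minimal sketch of pulling those two facts out of a downloaded host-info file; the filename, the flat fact layout, and the ansible_* key names are assumptions (the real file may nest the facts differently):

    # Sketch: flag the old-Xen symptom in a downloaded host-info yaml
    # (Xen release showing up as the BIOS version plus a low CPU count).
    # File name and key layout are illustrative assumptions.
    import yaml

    with open("host-info.ubuntu-noble.yaml") as f:
        facts = yaml.safe_load(f)

    bios = str(facts.get("ansible_bios_version", "unknown"))
    nproc = facts.get("ansible_processor_nproc", 0)

    print(f"bios version: {bios}, nproc: {nproc}")
    if nproc == 1:
        print("suspect node: single CPU reported, compare the bios/Xen version")
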
17:22 <haleyb> frickler: can you take a look at https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/937194 ? I think it's sane based on my testing in the charms repo, thanks
18:55 <tonyb> I wonder if there is enough data in logsearch to create a query to prove or disprove that hypothesis
19:05 <clarkb> tonyb: I think frickler tried, but it doesn't seem like there are many occurrences
19:06 <tonyb> ahh okay.
19:06 <clarkb> tonyb: I can clean up your gerrit autohold in zuul, right? that's from a while back
19:23 <tonyb> oh probably. I thought I'd cleaned them all up
19:23 <clarkb> tonyb: I still see one using zuul-client autohold-list, I can clean it up
19:27 <tonyb> please do
19:27 <clarkb> done
19:32 <tonyb> perfect
20:00 <sean-k-mooney> was there a zuul restart today?
20:01 <fungi> sean-k-mooney: no, last zuul restart was saturday (automated weekly upgrades)
20:01 <sean-k-mooney> actually no, i think the canceled jobs in https://review.opendev.org/c/openstack/nova/+/924593/4?tab=change-view-tab-header-zuul-results-summary are because the py312 job timed out
20:01 <fungi> sean-k-mooney: find a very lengthy deep dive into what caused that in the #opendev channel log
20:01 <sean-k-mooney> that almost never happens so i bet it was a package install issue or something like that
20:02 <sean-k-mooney> well i know we have the fast fail thing to cancel the buildset
20:02 <fungi> short answer, a job on the change ahead of those started to fail early, which caused all the jobs for changes after it to get cancelled, but then the failing job hit another issue which caused it to be marked unreachable and zuul retried the job, so the change ended up merging
20:02 <sean-k-mooney> but we almost never have the tox jobs time out, so it's relatively novel to see it cancel like that
20:03 <fungi> the plan is to fix zuul to not retry builds which are indicating early failure pattern matches
20:03 <sean-k-mooney> fungi: did we fix that already
20:03 <sean-k-mooney> i thought we fixed it the other way
20:03 <sean-k-mooney> to try and have zuul not abort if it could be retried
20:04 <sean-k-mooney> fungi: didn't you write a patch to address that previously
20:05 <fungi> sean-k-mooney: no, the problem is the job hit a legitimate failure in a tempest test and so indicated to zuul that it was in the process of failing but would finish running the remaining tempest tests first, but then at the end something happened to make the node unreachable, which triggered zuul's retry of the build
20:05 <fungi> but by then the in-progress builds for all the changes behind that one had already been canceled and the changes set to report as failed
20:05 <sean-k-mooney> oh ok
20:06 <sean-k-mooney> i thought unreachable was translated to error without a retry
20:06 <sean-k-mooney> so that's a different failure mode than we had before
20:06 <fungi> an unreachable job node makes zuul think it probably failed due to an issue in the underlying cloud provider unrelated to the job
20:07 <sean-k-mooney> right, but i thought the retry only happened if it failed in a pre-playbook
20:07 <fungi> but really the job should have failed due to the earlier tempest failure and not been retried automatically
20:07 <sean-k-mooney> whereas tempest would be in run or post
20:07 <fungi> any fails in pre-run playbooks trigger a retry, and only unreachable errors after pre-run trigger a retry
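
An illustrative sketch of that retry decision as standalone Python; this is not Zuul's actual implementation, just the rules fungi describes written out so the corner case is easy to see (the names here are made up):

    # Illustrative only: the retry rules described above, not Zuul code.
    from dataclasses import dataclass

    @dataclass
    class BuildResult:
        phase: str          # "pre-run", "run", or "post-run"
        unreachable: bool   # node became unreachable
        failed: bool        # an ordinary task/test failure was reported

    def should_retry(result: BuildResult) -> bool:
        # Any failure during pre-run playbooks triggers a retry.
        if result.phase == "pre-run" and (result.failed or result.unreachable):
            return True
        # After pre-run, only an unreachable node triggers a retry; an
        # ordinary failure (e.g. a failing tempest test) does not.
        if result.phase in ("run", "post-run") and result.unreachable:
            return True
        return False

    # The corner case discussed above: the job had already signalled an
    # early failure, but the node then went unreachable, so it retried.
    print(should_retry(BuildResult(phase="run", unreachable=True, failed=True)))  # True
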
20:08 <sean-k-mooney> ok, i trust that ye have that in hand then
20:09 <sean-k-mooney> and yes, the previous change that merged did have a retry of some of the jobs https://zuul.opendev.org/t/openstack/buildset/e0429ce698fa427b9dfdc2ed2682b578
20:09 <fungi> yeah, short-short summary is that this is a corner case in zuul's retry logic, first time we've observed it, but the fix is probably easy
20:09 <sean-k-mooney> it is similar but not quite the same as the previous time
20:10 <sean-k-mooney> previously it was something to do with the post playbook, i think in this case it's related to run
20:11 <fungi> sean-k-mooney: hairy details including job log links can be found starting at https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-12-09.log.html#t2024-12-09T17:22:52
20:12 <fungi> took us a while to hit on the cause
20:13 <sean-k-mooney> i have enough to feel ok with rechecking the patch
20:13 <sean-k-mooney> i wanted to determine why it aborted before doing that
20:13 <fungi> yeah, that's the way to go. thanks for bringing it up!
20:13 <fungi> clarkb was the one who spotted it, and thought it looked odd enough to dig into
20:14 <sean-k-mooney> we are currently in the process of merging 2 different 30+ patch series, which is fun :)
20:14 <fungi> you're finding all the bugs!
20:14 <sean-k-mooney> they are both making progress but most are taking at least one recheck
20:15 <sean-k-mooney> we really do have to eventually address the cinder volume issues
20:16 <sean-k-mooney> tl;dr cinder volume attach/detach can trigger a kernel panic or just never complete at the qemu level
20:16 <sean-k-mooney> depending on the version of qemu, kernel, phase of the moon, that are in use in any given job...
20:17 <fungi> setUpClass (tempest.api.compute.admin.test_volume.AttachSCSIVolumeTestJSON) is what failed in the job that ultimately got retried, so possibly the same
20:17 <fungi> https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8/log/job-output.txt#34945
20:20 <sean-k-mooney> oh, we don't have the post playbook logs
20:20 <sean-k-mooney> ya, we can't see exactly, but it may be the same issue
20:22 <sean-k-mooney> i kind of wish we had a better way to annotate tests as flaky (i.e. to pass in a set of regexes to match on known error messages, or just a list of flaky tests) so that in any given job we could declare which tests to allow to be retried
20:24 <sean-k-mooney> i know we want to have stable tests, but sometimes we just can't fix the instability without reducing coverage
20:24 <sean-k-mooney> the detach issue only happens with the lvm backend in ci as far as i am aware, but we don't want to lose all testing of iscsi by only testing with ceph
20:25 <sean-k-mooney> we use the lvm driver as a stand-in for all the vendor specific sans like netapp that use iscsi
20:27 <fungi> at one time there was the openstackhealth service which used data from subunit2sql to determine fail rates for specific tempest tests
20:28 <fungi> but maintaining solutions like that definitely requires people spending time on it, which is hard for our contributors as a whole to prioritize
20:32 <sean-k-mooney> tempest has a way for us to manually mark specific tests as flaky
20:33 <sean-k-mooney> but they have not really wanted us to use that, because once a test is marked as such it will likely never get removed
20:33 <sean-k-mooney> that's why i wish there was a way to do that externally in the jobs
20:34 <sean-k-mooney> then nova could declare them flaky in our gate but cinder could keep them "non-flaky" in theirs
20:34 <sean-k-mooney> we are not sure why they fail more often in our gates than theirs, but probably load related
20:35 <fungi> without knowing much about the mechanism, it doesn't sound like it would be too hard to add a project-local override list
20:44 <sean-k-mooney> fungi: i think we are using stestr underneath to run the tests, so without code changes to tempest to annotate tests as flaky i think we would need support from stestr to pass a flaky test regex or list
20:44 <sean-k-mooney> that's not something it currently supports
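
A hypothetical sketch of the job-local flaky-test filter being described; nothing like this exists in stestr or tempest today, and the regex list and input format are made up for illustration:

    # Hypothetical: given the failing test ids from a run (one per line
    # on stdin), treat the run as passing if every failure matches a
    # project-declared "known flaky" pattern. Not a real stestr/tempest
    # feature; names and patterns are illustrative.
    import re
    import sys

    KNOWN_FLAKY = [
        r"tempest\.api\.compute\.admin\.test_volume\..*",  # volume attach/detach
    ]

    def only_known_flaky(failed_tests):
        patterns = [re.compile(p) for p in KNOWN_FLAKY]
        return all(any(p.search(t) for p in patterns) for t in failed_tests)

    if __name__ == "__main__":
        failures = [line.strip() for line in sys.stdin if line.strip()]
        if failures and only_known_flaky(failures):
            print("all failures match the known-flaky list; treating as pass")
            sys.exit(0)
        sys.exit(1 if failures else 0)
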
20:44 <sean-k-mooney> anyway, i'm going to call it a day o/
20:47 <fungi> aha, got it. have a good evening!
