Monday, 2024-12-09

07:24 *** __ministry is now known as Guest2506
09:21 *** ralonsoh_ is now known as ralonsoh
09:52 <opendevreview> Jaromír Wysoglad proposed openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/937339
10:42 *** __ministry is now known as Guest2522
13:29 <opendevreview> Merged openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/937339
13:30 *** darmach5 is now known as darmach
14:48 <ykarel> Hi, noticed Provider: rax-ord, Label: ubuntu-noble has only 1 cpu (ansible_processor_nproc: 1), is that normal?
14:48 <ykarel> noticed when a neutron unit test job timed out
14:48 <ykarel> same provider on a successful job has 8 cpus
14:49 <ykarel> pass: https://66dcde1d9aeaad4e1862-35f3f5b682b7c3e09c41881ae7991d96.ssl.cf2.rackcdn.com/936850/6/check/openstack-tox-py312/28b1f63/zuul-info/host-info.ubuntu-noble.yaml
14:49 <ykarel> fail: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml
14:59 <fungi> ykarel: not normal, no
15:13 <frickler> that's weird indeed, I don't see how that could happen with the set of labels we use
15:15 <frickler> or maybe ansible is collecting bogus data for some reason? possibly we should also dump /proc/cpuinfo somewhere? could also be interesting to track the cpu flags situation a bit
15:25 <fungi> i'm trying to get a snapshot of the server instances currently booted there to see if any report using an incorrect flavor
15:25 <fungi> but the api is hanging
15:26 <ykarel> doesn't seem like ansible is bogus, as the failed job also ran the tests with concurrency 1
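
For reference, a minimal sketch (not an existing opendev role or script) of the kind of /proc/cpuinfo dump frickler suggests, counting processors and collecting cpu flags so they can be compared against what Ansible reports in ansible_processor_nproc:

    # Sketch: read /proc/cpuinfo (Linux only) and report processor count
    # and cpu flags; illustrative, not part of any existing zuul job.
    from collections import Counter

    def read_cpuinfo(path="/proc/cpuinfo"):
        cpus, flags = 0, Counter()
        with open(path) as f:
            for line in f:
                if line.startswith("processor"):
                    cpus += 1
                elif line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
        return cpus, flags

    if __name__ == "__main__":
        cpus, flags = read_cpuinfo()
        print(f"processors: {cpus}")
        print("flags:", " ".join(sorted(flags)))
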
15:27 <frickler> we also seem to have been kind of running at capacity since about 11:30, see https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1 but also other providers
15:28 <frickler> although nowhere completely using the expected max, still suspiciously flatlining
15:33 <fungi> the rackspace cloud status page doesn't indicate any outages or maintenance in ord
15:38 <clarkb> fungi: frickler ykarel iirc that's the issue caused by the old version of xen
15:39 <clarkb> the ansible facts capture the xen version, and noble with an older xen is sad
15:39 <fungi> several attempts now to perform `openstack server list` there, and it just hangs indefinitely. presumably zuul is having a similar experience right now
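
A hedged sketch of that flavor check with openstacksdk; the cloud name "rax-ord" and the timeout value are assumptions here, not the actual opendev clouds.yaml configuration:

    # Sketch: list server instances in one provider and report their
    # flavors, with an api timeout so a hung API call doesn't block
    # forever. Cloud name and timeout are illustrative assumptions.
    import openstack

    conn = openstack.connect(cloud="rax-ord", api_timeout=60)

    for server in conn.compute.servers():
        # server.flavor layout varies by SDK version/microversion, so
        # fall back to printing the whole object if there's no name.
        flavor = getattr(server.flavor, "original_name", None) or server.flavor
        print(server.name, server.status, flavor)
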
15:47 <ykarel> clarkb, you mean ^ can happen inconsistently on the same provider?
15:48 <clarkb> ykarel: on the same hypervisor(s) on the same provider, yes I think so
15:48 <clarkb> I'm still digging through logs to find where this was discussed previously, but it came up on october 2, 2024 and I was really hoping that those directly affected would dig into it more
15:49 <clarkb> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13
15:51 <ykarel> thx clarkb, now i see some discussion in #opendev, so will watch there
15:56 <clarkb> ykarel: fwiw I'm still trying to encourage people to link to the logs inside of zuul so that we can link specific lines more easily
15:57 <clarkb> but I've confirmed the fail case above has an old version of xen like we saw before
15:57 <fungi> where did you find the xen version reported?
15:58 <clarkb> fungi: in the host-info.ubuntu-noble.yaml file linked above, it's near the top of the file
15:58 <clarkb> (I would link to it if I had a link via zuul, but I'm not going to work backwards as it's too much work)
15:59 <ykarel> clarkb, ack, will keep in mind next time, just more used to the other interface
15:59 <clarkb> and then this file https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/inventory.yaml gives us the hypervisor host id
15:59 <fungi> got it
16:00 <fungi> i didn't realize we got the xen version from host-info
16:00 <ykarel> https://zuul.openstack.org/build/5c1b0433c263448c9c66ead7ff84866f/log/zuul-info/host-info.ubuntu-noble.yaml#1
16:01 <ykarel> ^ the failed build zuul link
16:01 <ykarel> host_id: 7eca1835ed13e21e6a6b3c7bba861f314865eb616acfeaf63911026b
16:03 <clarkb> fungi: it's reported as the "bios version"
16:04 <fungi> that makes more sense, yep
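
A minimal sketch of pulling those two facts out of a downloaded host-info file; the filename, the flat fact layout, and the ansible_* key names are assumptions (the real file may nest the facts differently):

    # Sketch: flag the old-Xen symptom in a downloaded host-info yaml
    # (Xen release showing up as the BIOS version plus a low CPU count).
    # File name and key layout are illustrative assumptions.
    import yaml

    with open("host-info.ubuntu-noble.yaml") as f:
        facts = yaml.safe_load(f)

    bios = str(facts.get("ansible_bios_version", "unknown"))
    nproc = facts.get("ansible_processor_nproc", 0)

    print(f"bios version: {bios}, nproc: {nproc}")
    if nproc == 1:
        print("suspect node: single CPU reported, compare the bios/Xen version")
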
17:22 <haleyb> frickler: can you take a look at https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/937194 ? I think it's sane based on my testing in the charms repo, thanks
18:55 <tonyb> I wonder if there is enough data in logsearch to create a query to prove or disprove that hypothesis
19:05 <clarkb> tonyb: I think frickler tried, but it doesn't seem like there are many occurrences
19:06 <tonyb> ahh okay.
19:06 <clarkb> tonyb: I can clean up your gerrit autohold in zuul, right? that's from a while back
19:23 <tonyb> oh probably. I thought I'd cleaned them all up
19:23 <clarkb> tonyb: I still see one using zuul-client autohold-list, I can clean it up
19:27 <tonyb> please do
19:27 <clarkb> done
19:32 <tonyb> perfect
20:00 <sean-k-mooney> was there a zuul restart today?
20:01 <fungi> sean-k-mooney: no, last zuul restart was saturday (automated weekly upgrades)
20:01 <sean-k-mooney> actually no, i think the canceled jobs in https://review.opendev.org/c/openstack/nova/+/924593/4?tab=change-view-tab-header-zuul-results-summary are because the py312 job timed out
20:01 <fungi> sean-k-mooney: find a very lengthy deep dive into what caused that in the #opendev channel log
20:01 <sean-k-mooney> that almost never happens so i bet it was a package install issue or something like that
20:02 <sean-k-mooney> well i know we have the fast fail thing to cancel the buildset
20:02 <fungi> short answer, a job on the change ahead of those started to fail early, which caused all the jobs for changes after it to get cancelled, but then the failing job hit another issue which caused it to be marked unreachable and zuul retried the job, so the change ended up merging
20:02 <sean-k-mooney> but we almost never have the tox jobs time out, so it's relatively novel to see it cancel like that
20:03 <fungi> the plan is to fix zuul to not retry builds which are indicating early failure pattern matches
20:03 <sean-k-mooney> fungi: did we fix that already
20:03 <sean-k-mooney> i thought we fixed it the other way
20:03 <sean-k-mooney> to try and have zuul not abort if it could be retried
20:04 <sean-k-mooney> fungi: didn't you write a patch to address that previously
20:05 <fungi> sean-k-mooney: no, the problem is the job hit a legitimate failure in a tempest test and so indicated to zuul that it was in the process of failing but would finish running the remaining tempest tests first, but then at the end something happened to make the node unreachable, which triggered zuul's retry of the build
20:05 <fungi> but by then the in-progress builds for all the changes behind that one had already been canceled and the changes set to report as failed
20:05 <sean-k-mooney> oh ok
20:06 <sean-k-mooney> i thought unreachable was translated to error without a retry
20:06 <sean-k-mooney> so that's a different failure mode than we had before
20:06 <fungi> an unreachable job node makes zuul think it probably failed due to an issue in the underlying cloud provider unrelated to the job
20:07 <sean-k-mooney> right, but i thought the retry only happened if it failed in a pre-playbook
20:07 <fungi> but really the job should have failed due to the earlier tempest failure and not been retried automatically
20:07 <sean-k-mooney> whereas tempest would be in run or post
20:07 <fungi> any fails in pre-run playbooks trigger a retry, and only unreachable errors after pre-run trigger a retry
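
An illustrative sketch of that retry decision as standalone Python; this is not Zuul's actual implementation, just the rules fungi describes written out so the corner case is easy to see (the names here are made up):

    # Illustrative only: the retry rules described above, not Zuul code.
    from dataclasses import dataclass

    @dataclass
    class BuildResult:
        phase: str          # "pre-run", "run", or "post-run"
        unreachable: bool   # node became unreachable
        failed: bool        # an ordinary task/test failure was reported

    def should_retry(result: BuildResult) -> bool:
        # Any failure during pre-run playbooks triggers a retry.
        if result.phase == "pre-run" and (result.failed or result.unreachable):
            return True
        # After pre-run, only an unreachable node triggers a retry; an
        # ordinary failure (e.g. a failing tempest test) does not.
        if result.phase in ("run", "post-run") and result.unreachable:
            return True
        return False

    # The corner case discussed above: the job had already signalled an
    # early failure, but the node then went unreachable, so it retried.
    print(should_retry(BuildResult(phase="run", unreachable=True, failed=True)))  # True
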
20:08 <sean-k-mooney> ok, i trust that ye have that in hand then
20:09 <sean-k-mooney> and yes, the previous change that merged did have a retry of some of the jobs https://zuul.opendev.org/t/openstack/buildset/e0429ce698fa427b9dfdc2ed2682b578
20:09 <fungi> yeah, short-short summary is that this is a corner case in zuul's retry logic, first time we've observed it, but the fix is probably easy
20:09 <sean-k-mooney> it is similar but not quite the same as the previous time
20:10 <sean-k-mooney> previously it was something to do with the post playbook, i think in this case it's related to run
20:11 <fungi> sean-k-mooney: hairy details including job log links can be found starting at https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-12-09.log.html#t2024-12-09T17:22:52
20:12 <fungi> took us a while to hit on the cause
20:13 <sean-k-mooney> i have enough to feel ok with rechecking the patch
20:13 <sean-k-mooney> i wanted to determine why it aborted before doing that
20:13 <fungi> yeah, that's the way to go. thanks for bringing it up!
20:13 <fungi> clarkb was the one who spotted it, and thought it looked odd enough to dig into
20:14 <sean-k-mooney> we are currently in the process of merging 2 different 30+ patch series, which is fun :)
20:14 <fungi> you're finding all the bugs!
20:14 <sean-k-mooney> they are both making progress but most are taking at least one recheck
20:15 <sean-k-mooney> we really do have to eventually address the cinder volume issues
20:16 <sean-k-mooney> tl;dr cinder volume attach/detach can trigger a kernel panic or just never complete at the qemu level
20:16 <sean-k-mooney> depending on the version of qemu, kernel, phase of the moon, that are in use in any given job...
20:17 <fungi> setUpClass (tempest.api.compute.admin.test_volume.AttachSCSIVolumeTestJSON) is what failed in the job that ultimately got retried, so possibly the same
20:17 <fungi> https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8/log/job-output.txt#34945
20:20 <sean-k-mooney> oh, we don't have the post playbook logs
20:20 <sean-k-mooney> ya, we can't see exactly, but it may be the same issue
20:22 <sean-k-mooney> i kind of wish we had a better way to annotate tests as flaky (i.e. to pass in a set of regexes to match on known error messages, or just a list of flaky tests) so that in any given job we could declare which tests to allow to be retried
20:24 <sean-k-mooney> i know we want to have stable tests, but sometimes we just can't fix the instability without reducing coverage
20:24 <sean-k-mooney> the detach issue only happens with the lvm backend in ci as far as i am aware, but we don't want to lose all testing of iscsi by only testing with ceph
20:25 <sean-k-mooney> we use the lvm driver as a stand-in for all the vendor specific sans like netapp that use iscsi
20:27 <fungi> at one time there was the openstackhealth service which used data from subunit2sql to determine fail rates for specific tempest tests
20:28 <fungi> but maintaining solutions like that definitely requires people spending time on it, which is hard for our contributors as a whole to prioritize
20:32 <sean-k-mooney> tempest has a way for us to manually mark specific tests as flaky
20:33 <sean-k-mooney> but they have not really wanted us to use that, because once a test is marked as such it will likely never get removed
20:33 <sean-k-mooney> that's why i wish there was a way to do that externally in the jobs
20:34 <sean-k-mooney> then nova could declare them flaky in our gate but cinder could keep them "non-flaky" in theirs
20:34 <sean-k-mooney> we are not sure why they fail more often in our gates than theirs, but probably load related
20:35 <fungi> without knowing much about the mechanism, it doesn't sound like it would be too hard to add a project-local override list
20:44 <sean-k-mooney> fungi: i think we are using stestr underneath to run the tests, so without code changes to tempest to annotate tests as flaky i think we would need support from stestr to pass a flaky test regex or list
20:44 <sean-k-mooney> that's not something it currently supports
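
A hypothetical sketch of the job-local flaky-test filter being described; nothing like this exists in stestr or tempest today, and the regex list and input format are made up for illustration:

    # Hypothetical: given the failing test ids from a run (one per line
    # on stdin), treat the run as passing if every failure matches a
    # project-declared "known flaky" pattern. Not a real stestr/tempest
    # feature; names and patterns are illustrative.
    import re
    import sys

    KNOWN_FLAKY = [
        r"tempest\.api\.compute\.admin\.test_volume\..*",  # volume attach/detach
    ]

    def only_known_flaky(failed_tests):
        patterns = [re.compile(p) for p in KNOWN_FLAKY]
        return all(any(p.search(t) for p in patterns) for t in failed_tests)

    if __name__ == "__main__":
        failures = [line.strip() for line in sys.stdin if line.strip()]
        if failures and only_known_flaky(failures):
            print("all failures match the known-flaky list; treating as pass")
            sys.exit(0)
        sys.exit(1 if failures else 0)
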
20:44 <sean-k-mooney> anyway, i'm going to call it a day o/
20:47 <fungi> aha, got it. have a good evening!
