Tuesday, 2025-02-18

00:39 <sean-k-mooney> hi o/
00:40 <sean-k-mooney> we had a tox coverage job failure because the node provided only had 1 cpu
00:40 <sean-k-mooney> it caused the job to time out
00:40 <sean-k-mooney> it's also a really old cpu
00:40 <sean-k-mooney>   - Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
00:40 <sean-k-mooney>   ansible_processor_cores: 1
00:40 <sean-k-mooney>   ansible_processor_count: 1
00:40 <sean-k-mooney>   ansible_processor_nproc: 1
00:40 <sean-k-mooney>   ansible_processor_threads_per_core: 1
00:40 <sean-k-mooney>   ansible_processor_vcpus: 1
00:41 <sean-k-mooney> any idea why that might have happened? it should have at least 4 cpus, ideally 8
00:41 <sean-k-mooney> nova's unit tests take a long time to run serially
00:41 <clarkb> this is a known problem that occurs extremely infrequently. It appears to occur only when noble lands on an older xen version (if you look at the ansible vars the xen version is older than on the hosts where noble nodes get 8 vcpus from rax)
00:42 <clarkb> the problem is somewhere in nova/xen/noble kernel most likely
00:42 <sean-k-mooney> oh weird
00:42 <clarkb> we've tried reporting it to rax with minimal traction
00:42 <sean-k-mooney> i didn't even think that this could be related to xen not virtualising it properly
00:43 <sean-k-mooney> i did see HVM domU and think oh it's xen
00:43 <sean-k-mooney> but not that it would manifest like this
00:43 <sean-k-mooney> but ok i have only seen it once
00:43 <sean-k-mooney> so i'll keep an eye out and see if it starts happening more frequently
00:43 <clarkb> one idea was maybe to have jobs check if they are running on xen and error if the cpu count is 1. But it happens very infrequently
00:44 <clarkb> so not sure where the right amount of involvement is. Also if we paper over it the chances of it ever getting fixed seem very slim
00:44 <sean-k-mooney> you mean in a pre playbook so that zuul will retry it
00:44 <clarkb> yes
00:44 <sean-k-mooney> if it becomes problematic i'll keep that in mind
00:45 <clarkb> probably in https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/pre.yaml so that it runs as early as possible
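
(As a rough illustration of the idea being discussed, a pre-run playbook task along these lines could fail fast on an undersized Xen node so that Zuul retries the job on another node. This is only a sketch, not what opendev actually runs; the task name and failure message are made up, though ansible_processor_vcpus and ansible_virtualization_type are standard Ansible facts.)

    - hosts: all
      tasks:
        # Abort early if the node came up with a single vCPU on Xen; a failure
        # in a pre-run playbook causes Zuul to retry the job elsewhere.
        - name: Fail early on undersized Xen nodes
          fail:
            msg: "Node has only 1 vCPU on a Xen hypervisor; likely the provider issue discussed above"
          when:
            - ansible_virtualization_type | default('') == 'xen'
            - ansible_processor_vcpus | default(0) | int == 1
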
00:45 <sean-k-mooney> i was wondering if that would be too early but maybe
00:45 <sean-k-mooney> i was not sure that this is something we would want in all jobs
00:46 <sean-k-mooney> clarkb: the fact it's a known issue is reassuring
00:46 <clarkb> for something that is inherent to the test node like that, discarding it as early as possible will be the most efficient solution
00:46 <sean-k-mooney> i was actually wondering if there had been a regression in stestr or if the nodeset used for the tox jobs had been changed recently
00:46 <clarkb> you could detect it later but then you're doing all of the other setup first
00:47 <clarkb> ya the first thing I did when tracking this down before was do my best to check that we used the correct flavor
00:47 <clarkb> and as far as I can tell the problem is not on our side but something in the cloud providing the wrong setup, and it only shows for the old version of xen
00:48 <sean-k-mooney> i was kind of debating if having a semi-standard job variable like "min_cpu" would make sense, which the pre playbook would use so you could override the min count in later jobs if needed
00:48 <clarkb> no
00:48 <clarkb> we shouldn't have test jobs reject nodes because they aren't big enough
00:48 <clarkb> we should only have them reject nodes if the node is broken
00:48 <sean-k-mooney> now that you have explained it i understand why that is not desired
00:48 <sean-k-mooney> ya
00:49 <clarkb> selecting a large enough node should be accomplished via the nodeset
00:49 <clarkb> if nova/xen/qemu/linux get it wrong later, that is the only thing we should work around
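
(For context, picking a bigger node in Zuul is done by pointing the job at a different nodeset rather than filtering nodes at run time. A simplified, hypothetical sketch follows; the nodeset, label, and job names are invented, and the parent is assumed to be the generic tox job from zuul-jobs.)

    - nodeset:
        name: large-unit-test-node        # hypothetical name
        nodes:
          - name: ubuntu-noble
            label: ubuntu-noble-8GB       # hypothetical label whose flavor guarantees enough vCPUs

    - job:
        name: my-project-tox-cover        # hypothetical job
        parent: tox
        nodeset: large-unit-test-node
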
00:49 <sean-k-mooney> right, and to be fair if that was placement or os-vif 1 core might have worked
00:49 <sean-k-mooney> nova just has a lot of tests
00:50 <sean-k-mooney> nothing was actually failing, it just timed out
00:50 <clarkb> it was first detected in numa jobs iirc because that failed due to lack of cpus
00:50 <sean-k-mooney> that was a different issue i think
00:50 <clarkb> but this is only the third report I've seen about it since october? so it really does seem to be infrequent
00:50 <sean-k-mooney> that failed because i had written that explicitly assuming we would have 8 and it got 4
00:51 <clarkb> that was a separate issue
00:51 <sean-k-mooney> but that was just because ye performance match it
00:51 <clarkb> then later we got a similar thing that is this issue, but also in jobs that are explicit about cpu counts
00:51 <clarkb> maybe it wasn't numa but something else that does something similar
00:51 <sean-k-mooney> ack
00:51 <sean-k-mooney> there are some numa nodesets i think but i didn't think those were available on xen
00:52 <clarkb> though back in october we were using much less ubuntu noble. It's possible we'll see more of this due to expanded use of noble
00:52 <clarkb> but not all hypervisors in that same provider have the old xen version (I confirmed that) so it seems you have to be in the one provider and get lucky with the hypervisor
00:53 <sean-k-mooney> i think this was rax-ord
00:53 <sean-k-mooney> isn't that being replaced with the new kvm rax-flex provider over time
00:53 <sean-k-mooney> i think you mentioned we will eventually not have xen hosts
00:53 <clarkb> there is a new kvm rax-flex provider
00:54 <clarkb> I don't know that one will replace the other or not. Rax classic is still the bulk of our available quota and 10x the rax flex quota
00:54 <sean-k-mooney> oh ok no worries
00:55 <sean-k-mooney> it's odd/worrying that the topology would change based on the image
00:55 <sean-k-mooney> the only reason i can think that that would happen is if we are using the libosinfo feature when uploading the image to rax
00:56 <clarkb> https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2024-12-09.log.html#t2024-12-09T14:48:23
00:56 <clarkb> but I debugged it prior to this time too
00:57 <sean-k-mooney> our nodepool config is not setting any image properties that i can see when uploading the images
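
(If such image properties were being set, they would typically appear as a "meta" section on the provider's diskimage entry in the nodepool config, roughly like the hypothetical snippet below; as noted above, nothing like this appears to be set in the actual opendev config, so treat this purely as an illustration of what was being looked for.)

    providers:
      - name: rax-ord
        diskimages:
          - name: ubuntu-noble
            # Hypothetical glance image properties that libosinfo-aware tooling
            # could use to tune the guest; not present in the real config.
            meta:
              os_distro: ubuntu
              os_version: '24.04'
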
00:57 <clarkb> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13
00:57 <sean-k-mooney> so it's not what i thought it was
00:58 <sean-k-mooney> oh it was a cinder job that needed 2 cpus
00:58 <sean-k-mooney> based on the old logs
01:00 <clarkb> https://meetings.opendev.org/irclogs/%23openstack-cinder/%23openstack-cinder.2024-10-02.log.html#t2024-10-02T15:20:13 this was the original place I saw it
01:00 <clarkb> and then ya happened again in december and now your report
01:01 <sean-k-mooney> they must be running a very old version of nova/xen
01:01 <sean-k-mooney> nova dropped all support for xen in wallaby
01:01 <clarkb> yes though this is an even older version of xen because newer xen works
01:02 <sean-k-mooney> well we dropped xen via libvirt in wallaby but xenserver was dropped in victoria and we deprecated all of them in train ish
01:02 <sean-k-mooney> so ya thanks for the context
13:10 *** tkajinam is now known as Guest9449
15:58 -opendevstatus- NOTICE: nominations for the OpenStack PTL and TC positions are closing soon, for details see https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/message/7DKEV7IEHOTHED7RVEFG7WIDVUC4MY3Z/
17:33 <kozhukalov> Hi team. I use the single node nodeset with 32GB Ubuntu Jammy for one of the jobs. I know that the number of such huge nodes is extremely limited, so I used it for just one job. I use the label ubuntu-jammy-32GB which is provided by Vexxhost in ca-ymq-1 region. At the moment I see the job fails with the error `Error: Node(set) request 300-0026339961 failed`. Does this mean that this label is not available any more? Should we remove this label (and probably older such labels) from here https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L171-L182 ?
17:35 <fungi> we do still have that label configured: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L76
17:37 <fungi> i'll see if that flavor still exists in the provider
17:37 <fungi> yeah, still shows up in flavor list
17:37 <fungi> (v3-standard-8)
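
(For readers unfamiliar with the launcher config being referenced: a label like ubuntu-jammy-32GB is tied to a specific flavor in a specific provider pool, roughly as in the simplified sketch below. Provider and pool names are approximations, not the exact contents of nl03.opendev.org.yaml.)

    providers:
      - name: vexxhost-ca-ymq-1
        pools:
          - name: main
            labels:
              # A label only exists where a provider pool defines it, which is
              # why a single-provider label fails outright if that provider
              # cannot satisfy the node request.
              - name: ubuntu-jammy-32GB
                flavor-name: v3-standard-8
                diskimage: ubuntu-jammy
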
17:39 <fungi> we do seem to be booting nodes in that provider at the moment: https://grafana.opendev.org/d/b283670153/nodepool3a-vexxhost
17:39 <clarkb> vexxhost had sdk issues from ~Thursday to today?
17:39 <clarkb> er to yesterday
17:39 <fungi> kozhukalov: how recently did tiy see the error
17:39 <fungi> er, s/tiy/you/
17:40 <kozhukalov> most recent yesterday https://zuul.opendev.org/t/openstack/build/9a578caa5443422787d60a1e2e3f1a98
17:40 <clarkb> could be related to ^ or maybe there are capacity issues
17:40 <clarkb> but this is the risk of using labels with a single provider
17:42 <kozhukalov> Yes, I understand. This is the only job that uses such 32GB nodes. Thanks for the information.
17:42 <fungi> it should have been fixed after https://review.opendev.org/941933 deployed yesterday (2025-02-17) at 17:27:13 utc
