Tuesday, 2025-02-18

00:39 <sean-k-mooney> hi o/
00:40 <sean-k-mooney> we had a tox coverage job failure because the node provided only had 1 cpu
00:40 <sean-k-mooney> it caused the job to time out
00:40 <sean-k-mooney> it's also a really old cpu
00:40 <sean-k-mooney>   - Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
00:40 <sean-k-mooney>   ansible_processor_cores: 1
00:40 <sean-k-mooney>   ansible_processor_count: 1
00:40 <sean-k-mooney>   ansible_processor_nproc: 1
00:40 <sean-k-mooney>   ansible_processor_threads_per_core: 1
00:40 <sean-k-mooney>   ansible_processor_vcpus: 1
00:41 <sean-k-mooney> any idea why that might have happened? it should have at least 4 cpus, ideally 8
00:41 <sean-k-mooney> nova's unit tests take a long time to run serially
00:41 <clarkb> this is a known problem that occurs extremely infrequently. It appears to occur only when noble lands on an older xen version (if you look at the ansible vars the xen version is older than on the hosts where noble nodes get 8 vcpus from rax)
00:42 <clarkb> the problem is somewhere in nova/xen/noble kernel most likely
00:42 <sean-k-mooney> oh weird
00:42 <clarkb> we've tried reporting it to rax with minimal traction
00:42 <sean-k-mooney> i didn't even think that this could be related to xen not virtualising it properly
00:43 <sean-k-mooney> i did see HVM domU and think oh it's xen
00:43 <sean-k-mooney> but not that it would manifest like this
00:43 <sean-k-mooney> but ok i have only seen it once
00:43 <sean-k-mooney> so i'll keep an eye out and see if it starts happening more frequently
00:43 <clarkb> one idea was maybe to have jobs check if they are running on xen and error if the cpu count is 1. But it happens very infrequently
00:44 <clarkb> so not sure where the right amount of involvement is. Also if we paper over it the chances of it ever getting fixed seem very slim
00:44 <sean-k-mooney> you mean in a pre playbook so that zuul will retry it
00:44 <clarkb> yes
00:44 <sean-k-mooney> if it becomes problematic i'll keep that in mind
00:45 <clarkb> probably in https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/pre.yaml so that it runs as early as possible
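
(As a rough illustration of the idea being discussed, a pre-run playbook task along these lines could fail fast on an undersized Xen node so that Zuul retries the job on another node. This is only a sketch, not what opendev actually runs; the task name and failure message are made up, though ansible_processor_vcpus and ansible_virtualization_type are standard Ansible facts.)

    - hosts: all
      tasks:
        # Abort early if the node came up with a single vCPU on Xen; a failure
        # in a pre-run playbook causes Zuul to retry the job elsewhere.
        - name: Fail early on undersized Xen nodes
          fail:
            msg: "Node has only 1 vCPU on a Xen hypervisor; likely the provider issue discussed above"
          when:
            - ansible_virtualization_type | default('') == 'xen'
            - ansible_processor_vcpus | default(0) | int == 1
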
00:45 <sean-k-mooney> i was wondering if that would be too early but maybe
00:45 <sean-k-mooney> i was not sure that this is something we would want in all jobs
00:46 <sean-k-mooney> clarkb: the fact it's a known issue is reassuring
00:46 <clarkb> for something that is inherent to the test node like that, discarding it as early as possible will be the most efficient solution
00:46 <sean-k-mooney> i was actually wondering if there had been a regression in stestr or if the nodeset used for the tox jobs had been changed recently
00:46 <clarkb> you could detect it later but then you're doing all of the other setup first
00:47 <clarkb> ya the first thing I did when tracking this down before was do my best to check that we used the correct flavor
00:47 <clarkb> and as far as I can tell the problem is not on our side but something in the cloud providing the wrong setup, and it only shows for the old version of xen
00:48 <sean-k-mooney> i was kind of debating if having a semi-standard job variable like "min_cpu" would make sense, which the pre playbook would use so you could override the min count in later jobs if needed
00:48 <clarkb> no
00:48 <clarkb> we shouldn't have test jobs reject nodes because they aren't big enough
00:48 <clarkb> we should only have them reject nodes if the node is broken
00:48 <sean-k-mooney> now that you have explained it i understand why that is not desired
00:48 <sean-k-mooney> ya
00:49 <clarkb> selecting a large enough node should be accomplished via the nodeset
00:49 <clarkb> if nova/xen/qemu/linux get it wrong later, that is the only thing we should work around
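
(For context, picking a bigger node in Zuul is done by pointing the job at a different nodeset rather than filtering nodes at run time. A simplified, hypothetical sketch follows; the nodeset, label, and job names are invented, and the parent is assumed to be the generic tox job from zuul-jobs.)

    - nodeset:
        name: large-unit-test-node        # hypothetical name
        nodes:
          - name: ubuntu-noble
            label: ubuntu-noble-8GB       # hypothetical label whose flavor guarantees enough vCPUs

    - job:
        name: my-project-tox-cover        # hypothetical job
        parent: tox
        nodeset: large-unit-test-node
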
00:49 <sean-k-mooney> right, and to be fair if that was placement or os-vif 1 core might have worked
00:49 <sean-k-mooney> nova just has a lot of tests
00:50 <sean-k-mooney> nothing was actually failing, it just timed out
00:50 <clarkb> it was first detected in numa jobs iirc because that failed due to lack of cpus
00:50 <sean-k-mooney> that was a different issue i think
00:50 <clarkb> but this is only the third report I've seen about it since october? so it really does seem to be infrequent
00:50 <sean-k-mooney> that failed because i had written that explicitly assuming we would have 8 and it got 4
00:51 <clarkb> that was a separate issue
00:51 <sean-k-mooney> but that was just because ye performance match it
00:51 <clarkb> then later we got a similar thing that is this issue, but also in jobs that are explicit about cpu counts
00:51 <clarkb> maybe it wasn't numa but something else that does something similar
00:51 <sean-k-mooney> ack
00:51 <sean-k-mooney> there are some numa nodesets i think but i didn't think those were available on xen
00:52 <clarkb> though back in october we were using much less ubuntu noble. It's possible we'll see more of this due to expanded use of noble
00:52 <clarkb> but not all hypervisors in that same provider have the old xen version (I confirmed that) so it seems you have to be in the one provider and get lucky with the hypervisor
00:53 <sean-k-mooney> i think this was rax-ord
00:53 <sean-k-mooney> isn't that being replaced with the new kvm rax-flex provider over time
00:53 <sean-k-mooney> i think you mentioned we will eventually not have xen hosts
00:53 <clarkb> there is a new kvm rax-flex provider
00:54 <clarkb> I don't know that one will replace the other or not. Rax classic is still the bulk of our available quota and 10x the rax flex quota
00:54 <sean-k-mooney> oh ok no worries
00:55 <sean-k-mooney> it's odd/worrying that the topology would change based on the image
00:55 <sean-k-mooney> the only reason i can think that that would happen is if we are using the libosinfo feature when uploading the image to rax
00:56 <clarkb> https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2024-12-09.log.html#t2024-12-09T14:48:23
00:56 <clarkb> but I debugged it prior to this time too
00:57 <sean-k-mooney> our nodepool config is not setting any image properties that i can see when uploading the images
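
(If such image properties were being set, they would typically appear as a "meta" section on the provider's diskimage entry in the nodepool config, roughly like the hypothetical snippet below; as noted above, nothing like this appears to be set in the actual opendev config, so treat this purely as an illustration of what was being looked for.)

    providers:
      - name: rax-ord
        diskimages:
          - name: ubuntu-noble
            # Hypothetical glance image properties that libosinfo-aware tooling
            # could use to tune the guest; not present in the real config.
            meta:
              os_distro: ubuntu
              os_version: '24.04'
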
00:57 <clarkb> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13
00:57 <sean-k-mooney> so it's not what i thought it was
00:58 <sean-k-mooney> oh it was a cinder job that needed 2 cpus
00:58 <sean-k-mooney> based on the old logs
01:00 <clarkb> https://meetings.opendev.org/irclogs/%23openstack-cinder/%23openstack-cinder.2024-10-02.log.html#t2024-10-02T15:20:13 this was the original place I saw it
01:00 <clarkb> and then ya happened again in december and now your report
01:01 <sean-k-mooney> they must be running a very old version of nova/xen
01:01 <sean-k-mooney> nova dropped all support for xen in wallaby
01:01 <clarkb> yes though this is an even older version of xen because newer xen works
01:02 <sean-k-mooney> well we dropped xen via libvirt in wallaby but xenserver was dropped in victoria and we deprecated all of them in train ish
01:02 <sean-k-mooney> so ya thanks for the context
13:10 *** tkajinam is now known as Guest9449
15:58 -opendevstatus- NOTICE: nominations for the OpenStack PTL and TC positions are closing soon, for details see https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/message/7DKEV7IEHOTHED7RVEFG7WIDVUC4MY3Z/
17:33 <kozhukalov> Hi team. I use the single node nodeset with 32GB Ubuntu Jammy for one of the jobs. I know that the number of such huge nodes is extremely limited, so I used it for just one job. I use the label ubuntu-jammy-32GB which is provided by Vexxhost in ca-ymq-1 region. At the moment I see the job fails with the error `Error: Node(set) request 300-0026339961 failed`. Does this mean that this label is not available any more? Should we remove this label (and probably older such labels) from here https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L171-L182 ?
17:35 <fungi> we do still have that label configured: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L76
17:37 <fungi> i'll see if that flavor still exists in the provider
17:37 <fungi> yeah, still shows up in flavor list
17:37 <fungi> (v3-standard-8)
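
(For readers unfamiliar with the launcher config being referenced: a label like ubuntu-jammy-32GB is tied to a specific flavor in a specific provider pool, roughly as in the simplified sketch below. Provider and pool names are approximations, not the exact contents of nl03.opendev.org.yaml.)

    providers:
      - name: vexxhost-ca-ymq-1
        pools:
          - name: main
            labels:
              # A label only exists where a provider pool defines it, which is
              # why a single-provider label fails outright if that provider
              # cannot satisfy the node request.
              - name: ubuntu-jammy-32GB
                flavor-name: v3-standard-8
                diskimage: ubuntu-jammy
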
17:39 <fungi> we do seem to be booting nodes in that provider at the moment: https://grafana.opendev.org/d/b283670153/nodepool3a-vexxhost
17:39 <clarkb> vexxhost had sdk issues from ~Thursday to today?
17:39 <clarkb> er to yesterday
17:39 <fungi> kozhukalov: how recently did tiy see the error
17:39 <fungi> er, s/tiy/you/
17:40 <kozhukalov> most recent yesterday https://zuul.opendev.org/t/openstack/build/9a578caa5443422787d60a1e2e3f1a98
17:40 <clarkb> could be related to ^ or maybe there are capacity issues
17:40 <clarkb> but this is the risk of using labels with a single provider
17:42 <kozhukalov> Yes, I understand. This is the only job that uses such 32GB nodes. Thanks for the information.
17:42 <fungi> it should have been fixed after https://review.opendev.org/941933 deployed yesterday (2025-02-17) at 17:27:13 utc
