sean-k-mooney | hi o/ | 00:39 |
---|---|---|
sean-k-mooney | we had a tox coverage job failure because the node provided only had 1 cpu | 00:40 |
sean-k-mooney | it caused the job to time out | 00:40 |
sean-k-mooney | it's also a really old cpu | 00:40 |
sean-k-mooney | - Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz | 00:40 |
sean-k-mooney | ansible_processor_cores: 1 | 00:40 |
sean-k-mooney | ansible_processor_count: 1 | 00:40 |
sean-k-mooney | ansible_processor_nproc: 1 | 00:40 |
sean-k-mooney | ansible_processor_threads_per_core: 1 | 00:40 |
sean-k-mooney | ansible_processor_vcpus: 1 | 00:40 |
sean-k-mooney | any idea why that might have happened? it should have at least 4 cpus, ideally 8 | 00:41 |
sean-k-mooney | nova's unit tests take a long time to run serially | 00:41 |
clarkb | this is a known problem that occurs extremely infrequently. It only appears to occur when noble lands on an older xen version (if you look at the ansible vars the xen version is older than where noble nodes get 8 vcpus from rax) | 00:41 |
clarkb | the problem is somewhere in nova/xen/noble_kernel most likely | 00:42 |
sean-k-mooney | oh weird | 00:42 |
clarkb | we've tried reporting it to rax with minimal traction | 00:42 |
sean-k-mooney | i didn't even think that this could be related to xen not virtualising it properly | 00:42 |
sean-k-mooney | i did see HVM domU and think oh it's xen | 00:43 |
sean-k-mooney | but not that it would manifest like this | 00:43 |
sean-k-mooney | but ok i have only seen it once | 00:43 |
sean-k-mooney | so i'll keep an eye out and see if it starts happening more frequently | 00:43 |
clarkb | one idea was to maybe have jobs check if they're running on xen and error if the cpu count is 1. But it happens very infrequently | 00:43 |
clarkb | so not sure where the right amount of involvement is. Also if we paper over it the chances of it ever getting fixed seem very slim | 00:44 |
sean-k-mooney | you mean in a pre playbook so that zuul will retry it | 00:44 |
clarkb | yes | 00:44 |
sean-k-mooney | if it becomes problematic i'll keep that in mind | 00:44 |
clarkb | probably in https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/pre.yaml so that it runs as early as possible | 00:45 |
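A minimal sketch of what such an early check could look like in a base pre playbook, assuming the affected Rackspace Xen guests report `HVM domU` as their product name; the play, task, and message wording here are illustrative rather than the actual base-jobs implementation, and a failure in a pre playbook would cause Zuul to retry the job on a fresh node:

```yaml
# Hypothetical pre-playbook check: abort early (so Zuul retries on another
# node) when a Xen guest has come up with only a single vCPU.
- hosts: all
  tasks:
    - name: Reject broken single-vCPU Xen nodes
      ansible.builtin.fail:
        msg: >
          Node booted with {{ ansible_processor_vcpus }} vCPU on a Xen
          guest; the flavor should provide more, so fail and retry.
      when:
        - ansible_product_name is defined
        - "'HVM domU' in ansible_product_name"
        - ansible_processor_vcpus | int == 1
```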
sean-k-mooney | i was wondering if that would be too early but maybe | 00:45 |
sean-k-mooney | i wasn't sure that this is something we would want in all jobs | 00:45 |
sean-k-mooney | clarkb: the fact it's a known issue is reassuring | 00:46 |
clarkb | for something that is inherent to the test node like that discarding it as early as possible will be the most efficient solution | 00:46 |
sean-k-mooney | i was actually wondering if there had been a regression in stestr or if the nodeset used for the tox jobs had been changed recently | 00:46 |
clarkb | you could detect it later but then you're doing all of the other setup first | 00:46 |
clarkb | ya the first thing I did when tracking this down before was do my best to check that we used the correct flavor | 00:47 |
clarkb | and as far as I can tell the problem is not on our side but something in the cloud providing the wrong setup and it only shows for the old version of xen | 00:47 |
sean-k-mooney | i was kind of debating if having a semi-standard job variable "min_cpu" or something like that would make sense, which the pre playbook would use, so you could override the min count in later jobs if needed | 00:48 |
clarkb | no | 00:48 |
clarkb | we shouldn't have test jobs reject nodes because they aren't big enough | 00:48 |
clarkb | we should only have them reject nodes if the node is broken | 00:48 |
sean-k-mooney | now that you have explained it i understand why that is not desired | 00:48 |
sean-k-mooney | ya | 00:48 |
clarkb | selecting a large enough node should be accomplished via the nodeset | 00:49 |
clarkb | if nova/xen/qemu/linux get it wrong later that is the only thing we should work around | 00:49 |
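As a sketch of the approach clarkb describes, a job that genuinely needs more resources would request a larger label up front via its nodeset rather than rejecting small nodes at run time; the job name, parent, and label below are illustrative, not real definitions:

```yaml
# Hypothetical Zuul job: the size requirement is expressed by picking an
# appropriately sized label in the nodeset, not by a runtime check.
- job:
    name: example-unit-tests-large
    parent: openstack-tox-py312
    nodeset:
      nodes:
        - name: test-node
          label: ubuntu-noble-8GB  # illustrative label name
```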
sean-k-mooney | right and to be fair if that was placement or os-vif 1 core might have worked | 00:49 |
sean-k-mooney | nova just has a lot of tests | 00:49 |
sean-k-mooney | nothing was actually failing, it just timed out | 00:50 |
clarkb | it was first detected in numa jobs iirc because that failed due to lack of cpus | 00:50 |
sean-k-mooney | that was a different issue i think | 00:50 |
clarkb | but this is only the third report I've seen about it since october? so it really does seem to be infrequent | 00:50 |
sean-k-mooney | that failed because i had written that explicitly assuming we would have 8 and it got 4 | 00:50 |
clarkb | that was a separate issue | 00:51 |
sean-k-mooney | but that was just because ye performance match it | 00:51 |
clarkb | then later we got a similar thing that is this issue but also in jobs that are explicit about cpu counts | 00:51 |
clarkb | maybe it wasn't numa but something else that does similar | 00:51 |
sean-k-mooney | ack | 00:51 |
sean-k-mooney | there are some numa nodesets i think but i didn't think those were available on xen | 00:51 |
clarkb | though back in october we were using much less ubuntu noble. It's possible we'll see more of this due to expanded use of noble | 00:52 |
clarkb | but not all hypervisors in that same provider have the old xen version (I confirmed that) so it seems you have to be in the one provider and get lucky with the hypervisor | 00:52 |
sean-k-mooney | i think this was rax-ord | 00:53 |
sean-k-mooney | isn't that being replaced with the new kvm rax-flex provider over time | 00:53 |
sean-k-mooney | i think you mentioned we will eventually not have xen hosts | 00:53 |
clarkb | there is a new kvm rax-flex provider | 00:53 |
clarkb | I don't know that one will replace the other or not. Rax classic is still the bulk of our available quota and 10x the rax flex quota | 00:54 |
sean-k-mooney | oh ok no worries | 00:54 |
sean-k-mooney | it's odd/worrying that the topology would change based on the image | 00:55 |
sean-k-mooney | the only reason i can think that would happen is if we are using the libosinfo feature when uploading the image to rax | 00:55 |
clarkb | https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2024-12-09.log.html#t2024-12-09T14:48:23 | 00:56 |
clarkb | but I debugged it prior to this time too | 00:56 |
sean-k-mooney | our nodepool config is not setting any image properties that i can see when uploading the images | 00:57 |
clarkb | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13 | 00:57 |
sean-k-mooney | so it's not what i thought it was | 00:57 |
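For context on the libosinfo theory, guest hints of that kind would have to be attached as image metadata in the nodepool provider configuration; a hedged sketch of what that would look like if it were being set (which, per the config sean-k-mooney checked, it is not), with illustrative provider, image, and property values:

```yaml
# Hypothetical nodepool provider snippet: libosinfo-style guest hints would
# appear as image metadata under "meta" if they were being set on upload.
providers:
  - name: rax-ord
    diskimages:
      - name: ubuntu-noble
        meta:
          os_distro: ubuntu      # illustrative os hint
          hw_vif_model: virtio   # illustrative guest hardware hint
```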
sean-k-mooney | oh it was a cinder job that needed 2 cpus | 00:58 |
sean-k-mooney | based on the old logs | 00:58 |
clarkb | https://meetings.opendev.org/irclogs/%23openstack-cinder/%23openstack-cinder.2024-10-02.log.html#t2024-10-02T15:20:13 this was the original place I saw it | 01:00 |
clarkb | and then ya happened again in december and now your report | 01:00 |
sean-k-mooney | they must be running a very old version of nova/xen | 01:01 |
sean-k-mooney | nova dropped all support for xen in wallaby | 01:01 |
clarkb | yes though this is an even older version of xen because newer xen works | 01:01 |
sean-k-mooney | well we dropped xen via libvirt in wallaby but xenserver was dropped in victoria and we deprecated all of them in train-ish | 01:02 |
sean-k-mooney | so ya thanks for the context | 01:02 |
*** tkajinam is now known as Guest9449 | 13:10 | |
-opendevstatus- NOTICE: nominations for the OpenStack PTL and TC positions are closing soon, for details see https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/message/7DKEV7IEHOTHED7RVEFG7WIDVUC4MY3Z/ | 15:58 | |
kozhukalov | Hi team. I use the single node nodeset with 32GB Ubuntu Jammy for one of the jobs. I know that the number of such huge nodes is extremely limited, so I used it for just one job. I use the label ubuntu-jammy-32GB which is provided by Vexxhost in ca-ymq-1 region. At the moment I see the job fails with the error `Error: Node(set) request 300-0026339961 failed`. Does this mean that this label is not available any more? Should | 17:33 |
kozhukalov | we remove this label (and probably older such labels) from here https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L171-L182 ? | 17:33 |
fungi | we do still have that label configured: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L76 | 17:35 |
fungi | i'll see if that flavor still exists in the provider | 17:37 |
fungi | yeah, still shows up in flavor list | 17:37 |
fungi | (v3-standard-8) | 17:37 |
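For reference, the label-to-flavor mapping fungi is checking looks roughly like the following in a nodepool launcher config; this is a simplified sketch of the structure, not a verbatim copy of nl03.opendev.org.yaml, and the provider and pool names are illustrative:

```yaml
# Simplified sketch of a nodepool launcher mapping a big-memory label to a
# provider flavor; the real definition lives in project-config.
labels:
  - name: ubuntu-jammy-32GB
    min-ready: 0
providers:
  - name: vexxhost-ca-ymq-1
    pools:
      - name: main
        labels:
          - name: ubuntu-jammy-32GB
            flavor-name: v3-standard-8
            diskimage: ubuntu-jammy
```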
fungi | we do seem to be booting nodes in that provider at the moment: https://grafana.opendev.org/d/b283670153/nodepool3a-vexxhost | 17:39 |
clarkb | vexxhost had sdk issues from ~Thursday to today? | 17:39 |
clarkb | er to yesterday | 17:39 |
fungi | kozhukalov: how recently did you see the error? | 17:39 |
kozhukalov | most recent yesterday https://zuul.opendev.org/t/openstack/build/9a578caa5443422787d60a1e2e3f1a98 | 17:40 |
clarkb | could be related to ^ or maybe there are capacity issues | 17:40 |
clarkb | but this is the risk of using labels with a single provider | 17:40 |
kozhukalov | Yes, I understand. This is the only job that uses such 32GB nodes. Thanks for the information. | 17:42 |
fungi | it should have been fixed after https://review.opendev.org/941933 deployed yesterday (2025-02-17) at 17:27:13 utc | 17:42 |