opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add san support to growvols https://review.opendev.org/c/openstack/diskimage-builder/+/903265 | 01:08 |
amorin | hello, do you have an idea why this is not merged? https://review.opendev.org/c/openstack/mistral/+/899245 | 07:11 |
frickler | amorin: yes, needs a rebase, since a newer version of the patch below it was merged. see the red color of the "merged" text in the relation chain | 07:49 |
amorin | Ah! Gerrit UI is sometimes giving me headaches ;) | 08:24 |
frickler | infra-root: a couple of things I noticed while checking the grafana AFS page: mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months | 08:55 |
frickler | ubuntu mirrors are also getting close, but we might have another couple of months time there | 08:56 |
frickler | mirror.centos-stream seems to have a steep increase in the last two months and might also run into quota limits soon | 08:56 |
frickler | project.zuul with the latest releases is getting close to its tight limit of 1GB (sic); I suggest simply doubling that | 08:57 |
frickler | then the wheel builds for centos >=8 seem broken; with nobody maintaining them, it might be better to drop them? | 08:59 |
frickler | (context was the recurring discussion of whether we'd have enough space to mirror rocky repos) | 09:00 |
frickler | I guess I'll add all of these topics to the meeting agenda so we can follow up after the holidays or so | 09:01 |
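As context for the quota numbers being read off the grafana AFS page, here is a minimal sketch of how the same headroom check could be scripted on a host with an OpenAFS client. The mirror names, mount path, and 90% threshold are illustrative assumptions; it relies only on the standard column output of `fs listquota`.

```python
# Sketch: flag AFS mirror volumes that are close to their quota.
# Assumes an OpenAFS client with `fs listquota` on PATH and volumes mounted
# under /afs/openstack.org/mirror/<name>; the names below are illustrative.
import subprocess

MIRRORS = ["openeuler", "ubuntu", "centos-stream"]  # example set, not exhaustive
THRESHOLD = 90  # warn when usage exceeds this percentage

for name in MIRRORS:
    path = f"/afs/openstack.org/mirror/{name}"
    out = subprocess.run(["fs", "listquota", path],
                         capture_output=True, text=True, check=True).stdout
    # Second line holds: volume name, quota (KB), used (KB), %used, partition
    fields = out.splitlines()[1].split()
    volume, quota_kb, used_kb = fields[0], int(fields[1]), int(fields[2])
    pct = 100 * used_kb / quota_kb
    if pct >= THRESHOLD:
        print(f"{volume}: {used_kb} of {quota_kb} KB used ({pct:.0f}%)")
```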
opendevreview | James Page proposed openstack/project-config master: sunbeam: retire all single charm repositories https://review.opendev.org/c/openstack/project-config/+/903666 | 11:04 |
opendevreview | James Page proposed openstack/project-config master: Fix the ACL associated with charm-keystone-ldap-k8s https://review.opendev.org/c/openstack/project-config/+/903667 | 11:14 |
fungi | yeah, for openeuler it might be that we simply need to add some filters for things jobs won't need, like we do with other rsync mirrors | 14:34 |
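A rough sketch of the filtering approach mentioned above, driving rsync from Python in the spirit of the existing mirror update scripts; the upstream URL and the exclude patterns are hypothetical placeholders, not the real openeuler mirror configuration.

```python
# Sketch: mirror an upstream repo while filtering content jobs won't need.
# The upstream URL and exclude patterns are hypothetical examples.
import subprocess

UPSTREAM = "rsync://repo.openeuler.example.org/openeuler/"  # placeholder
DEST = "/afs/openstack.org/mirror/openeuler/"

excludes = [
    "--exclude=*.src.rpm",        # source packages
    "--exclude=*-debuginfo-*",    # debug packages
    "--exclude=ISO/",             # installer images
]

subprocess.run(
    ["rsync", "-rltvz", "--delete", "--delete-excluded", *excludes, UPSTREAM, DEST],
    check=True,
)
```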
*** jamesdenton_ is now known as jamesdenton | 15:37 |
frickler | infra-root: I don't have time to dig now, but we're seeing 100% node_failures in kolla for arm nodes currently | 15:58 |
fungi | frickler: i'll take a look, probably both providers are offline or full of leaked nodes | 15:59 |
fungi | i like that node failures now indicate the node request id that failed to be satisfied. saves having to hunt it down in the scheduler logs | 16:05 |
fungi | 2023-12-14 15:40:30,787 INFO nodepool.driver.NodeRequestHandler[nl03.opendev.org-PoolWorker.osuosl-regionone-main-0a52d0ebcb6146c2aaf61729723e3ffa]: [e: abd9cd79035c43e3a1f6a20313ffa157] [node_request: 300-0022991976] Not enough quota remaining to satisfy request | 16:07 |
fungi | 2023-12-14 15:41:06,147 INFO nodepool.driver.NodeRequestHandler[nl03.opendev.org-PoolWorker.linaro-regionone-main-ec6303cd8d4e4785a3ada55c7d750d53]: [e: abd9cd79035c43e3a1f6a20313ffa157] [node_request: 300-0022991976] Not enough quota remaining to satisfy request | 16:08 |
fungi | so both providers were tried for https://zuul.opendev.org/t/openstack/build/68f8cea90b234953a4942a444d027a13 and neither had sufficient quota even after multiple retries | 16:09 |
fungi | i'll see if we can get things cleaned up | 16:09 |
fungi | nodepool reports 8 arm64 nodes in use in osuosl and 7 in linaro | 16:13 |
fungi | openstack server list shows that many active nodes in each provider too | 16:15 |
fungi | neither provider has nodes in other states besides active | 16:15 |
fungi | i'm at a loss to explain | 16:15 |
fungi | openstack limits show --absolute also doesn't indicate either one is anywhere near capacity | 16:18 |
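A sketch of the same cross-check done with openstacksdk instead of the CLI, comparing roughly what `openstack server list` and `openstack limits show --absolute` report; the cloud names are assumptions about how the clouds.yaml entries might be labeled.

```python
# Sketch: compare active/error servers against absolute compute limits for
# the two arm64 providers. Cloud names are assumed clouds.yaml entries;
# attribute names follow openstacksdk's AbsoluteLimits mapping.
import openstack

for cloud in ("osuosl", "linaro"):
    conn = openstack.connect(cloud=cloud)
    servers = list(conn.compute.servers())
    active = sum(1 for s in servers if s.status == "ACTIVE")
    error = sum(1 for s in servers if s.status == "ERROR")
    limits = conn.compute.get_limits().absolute
    print(f"{cloud}: {active} active, {error} error, "
          f"{limits.instances_used}/{limits.instances} instances, "
          f"{limits.total_cores_used}/{limits.total_cores} cores")
```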
fungi | not all arm jobs are failing, https://zuul.opendev.org/t/openstack/build/7d18e32410704563ad81c2ba28181a8b just succeeded a few minutes ago | 16:19 |
fungi | aha, these seem to be what's causing the actual failures: | 16:21 |
fungi | nodepool.exceptions.LaunchStatusException: Server in error state | 16:21 |
fungi | seeing them come from both osuosl and linaro | 16:22 |
fungi | the linaro ones seem to be this: | 16:25 |
fungi | 2023-12-14 15:40:45,416 ERROR nodepool.StateMachineNodeLauncher.linaro-regionone: [e: 4c20fc054eaa4392ae50aec65c7bb6e4] [node_request: 300-0022991970] [node: 0036041570] Error in creating the server. Compute service reports fault: No valid host was found. | 16:25 |
fungi | i think what's happening is that it's trying osuosl and getting a softfail (insufficient quota) so then it goes on to linaro and gets a hardfail (no valid host was found) | 16:26 |
fungi | corvus: ^ does that sound right? if you have two providers, one says "not now" because it has insufficient quota and then the other gets api errors back for all its retries, the result is node_failure not just waiting for available quota in the first provider? | 16:28 |
fungi | i can propose a patch to temporarily lower max-servers in linaro until someone has time to investigate what's happened to some of its compute nodes | 16:30 |
fungi | https://grafana.opendev.org/d/391eb7bb3c/nodepool3a-linaro seems to show it topping out at 16 in-use so that seems like a good number for now | 16:32 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Temporarily lower max-servers for linaro https://review.opendev.org/c/openstack/project-config/+/903708 | 16:36 |
fungi | infra-root: ^ | 16:36 |
fungi | i've gone ahead and self-approved it, since i'm probably the only sysadmin around today | 16:58 |
corvus | fungi: looking into your q now | 17:00 |
fungi | corvus: no rush, mainly just making sure i understand the process flow that leads to that situation | 17:02 |
corvus | fungi: close -- the "not enough quota" is not an error (note the log is at info level); that's explaining why it's about to pause request handling without attempting a launch. once there is quota available, it proceeds to attempt a launch, and that fails. this repeats for both providers (yes, both providers were at quota, they waited, they launched, they failed), then request is deemed node_failure | 17:08 |
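To make the flow described above concrete, here is a simplified, self-contained sketch (not the actual nodepool code): quota exhaustion only pauses the handler, launch failures are retried, and a request becomes NODE_FAILURE only after every provider has given up on it.

```python
# Simplified illustration of the request-handling flow, not real nodepool code.
# A provider at quota pauses (the "not enough quota remaining" log line is
# informational); once quota frees up it attempts the launch; if every attempt
# fails, that provider declines; once all providers decline, NODE_FAILURE.
class LaunchError(Exception):
    pass

class FakeProvider:
    def __init__(self, name, fails):
        self.name, self.fails = name, fails

    def wait_for_quota(self):
        pass  # real handler pauses here while quota is exhausted

    def launch(self):
        if self.fails:
            raise LaunchError("server in error state")
        return f"node from {self.name}"

def handle_request(providers, launch_retries=3):
    for provider in providers:
        provider.wait_for_quota()
        for _ in range(launch_retries):
            try:
                return provider.launch()  # request fulfilled
            except LaunchError:
                continue
        # all retries failed: this provider declines the request
    return "NODE_FAILURE"  # every provider declined

print(handle_request([FakeProvider("osuosl", True), FakeProvider("linaro", True)]))
```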
corvus | fungi: it looks like one provider is failing with "no valid host" and the other is failing with "server in error state". | 17:09 |
corvus | fungi: and yeah, lowering the max-servers is a reasonable way to compensate for the cloud lying to us about its capacity :) | 17:10 |
fungi | oh, i got confused and thought the "no valid host" was the api detail for the "server in error state" | 17:10 |
fungi | thanks | 17:10 |
corvus | fungi: maybe something similar should be done for the other provider? | 17:10 |
fungi | yeah, i'll see if i can tell what's going on there. the error state nodes may be random and not related to capacity | 17:11 |
fungi | thanks again! | 17:11 |
corvus | fungi: yeah, i think neither of us completely characterized the error messages -- let's try again :) it looks like they both put servers in error state, but additionally linaro says "no valid host found" but osuosl doesn't give us the extra info | 17:12 |
corvus | here's an excerpt from each: https://paste.opendev.org/show/bqiZJwIbOpcDFeJHQ3w6/ | 17:12 |
opendevreview | Merged openstack/project-config master: Temporarily lower max-servers for linaro https://review.opendev.org/c/openstack/project-config/+/903708 | 17:12 |
fungi | corvus: aha, that explains my confusion. thanks | 17:15 |
fungi | Ramereth: if you're around, any idea why server creation in the osuosl openstack arm cloud is sometimes ending with instances in an error state? | 17:16 |
fungi | will reducing our utilization help? | 17:16 |
fungi | or is it unrelated to available resources/capacity? | 17:16 |
Clark[m] | frickler: for the centos-stream mirror growth, it looks like some packages get new versions but the old packages don't get cleaned up, which leads to the growth. Some of the packages are quite large too iirc. I want to say things like thunderbird? | 17:36 |
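A small sketch of how that kind of growth could be measured on the mirror tree itself; the path is an assumption and the name/version split is approximate, but it is enough to rank which packages retain the most versions on disk.

```python
# Sketch: spot packages in a mirror tree that pile up old versions.
# The mirror path is an assumption; RPM names are split on the last two
# dashes (name-version-release.arch.rpm), which is approximate but good
# enough for a rough report.
import collections
import pathlib

MIRROR = pathlib.Path("/afs/openstack.org/mirror/centos-stream")  # assumed path

counts = collections.Counter()
sizes = collections.Counter()
for rpm in MIRROR.rglob("*.rpm"):
    name = rpm.name.rsplit("-", 2)[0]  # drop version-release.arch.rpm
    counts[name] += 1
    sizes[name] += rpm.stat().st_size

for name, count in counts.most_common(20):
    print(f"{name}: {count} versions, {sizes[name] / 2**20:.0f} MiB")
```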
corvus | dockerhub appears to be having intermittent issues again. just fyi. | 18:59 |
opendevreview | Merged openstack/diskimage-builder master: Remove cloud-init when using simple-init https://review.opendev.org/c/openstack/diskimage-builder/+/899885 | 19:05 |
Ramereth | fungi: do you have some uuids and timestamps that I can look at? I made a recent change which might be related | 19:08 |
fungi | Ramereth: sorry, disappeared for a late lunch, but i can dig some samples up for sure, just a sec | 20:50 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add san support to growvols https://review.opendev.org/c/openstack/diskimage-builder/+/903265 | 20:56 |
fungi | Ramereth: these are the uuids of error-state nodes we created between 15:29:38 and 18:07:27 utc today: https://paste.opendev.org/show/bM1nBxM62nsDeEYuF95H/ | 21:19 |
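For reference, a sketch of how a list like that paste could be assembled with openstacksdk, filtering on ERROR status and a creation-time window; the cloud name is an assumed clouds.yaml entry and the date is taken from the log timestamps above.

```python
# Sketch: collect UUIDs of ERROR-state servers created in a given UTC window.
# Cloud name is an assumption; created_at is an ISO 8601 string from the API.
import openstack

conn = openstack.connect(cloud="osuosl")  # assumed clouds.yaml entry
start, end = "2023-12-14T15:29:38Z", "2023-12-14T18:07:27Z"

for server in conn.compute.servers(status="ERROR"):
    if start <= server.created_at <= end:
        print(server.id, server.created_at)
```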
fungi | just a sample, i haven't looked to see how far back this goes but can if it's relevant | 21:20 |
Ramereth | fungi: thanks, I'll take a look later and get back to you | 21:27 |
fungi | Ramereth: at your convenience, it's not at all urgent. thanks! | 21:27 |
*** dmellado2 is now known as dmelladoo | 21:55 | |
*** dmelladoo is now known as dmellado | 21:58 |