| opendevreview | yatin proposed opendev/irc-meetings master: Move neutron team and CI meeting by 1 hour https://review.opendev.org/c/opendev/irc-meetings/+/965906 | 10:47 |
|---|---|---|
| ykarel | corvus, fungi started seeing the mixed node provider issue again, was something changed again? | 12:39 |
| ykarel | https://zuul.openstack.org/build/8afc3ba76b6a437d82bb37a0cea06370 | 12:39 |
| ykarel | https://zuul.openstack.org/build/9a8dc417c1164b3482d4daecf0f0af67 | 12:39 |
| opendevreview | Merged opendev/irc-meetings master: Move neutron team and CI meeting by 1 hour https://review.opendev.org/c/opendev/irc-meetings/+/965906 | 12:52 |
| ykarel | another from today https://zuul.opendev.org/t/openstack/build/facb5a9deebe4af3b1f7170618338b47 | 13:28 |
| frickler | ykarel: looks like something is amiss with rax in general since two days, very likely related to the latest updates https://grafana.opendev.org/d/fd44466e7f/zuul-launcher3a-rackspace?orgId=1&from=now-7d&to=now&timezone=utc&var-region=$__all | 13:35 |
| frickler | infra-root: ^^ | 13:35 |
| frickler | although other graphs look similarly broken and if we were really running only 5 instances in total I think we'd know? | 13:37 |
| frickler | yes, global zuul status graph looks more like I'd expect it to. so the multinode issue will likely be something different, will defer to corvus I guess | 13:41 |
| ykarel | frickler, ack | 13:51 |
| ykarel | yes the last one above was on rax itself and that's not on graphs so something odd there | 13:52 |
| *** | dhill is now known as Guest30469 | 13:53 |
| corvus | hrm, i wonder if we're letting "ready/building node" take priority over provider locality after a failure.... | 14:49 |
| corvus | the graphs are an easy fix, we just changed some of the escaping in the metric names. i'll look at both things later today. | 14:49 |
| opendevreview | James E. Blair proposed openstack/project-config master: Update zuul-launcher metrics https://review.opendev.org/c/openstack/project-config/+/965940 | 15:04 |
| fungi | so... someone reached out to me via irc privmsg over the weekend asking to have their third-party ci account added to the "Zuul Summary" tab in gerrit, which is controlled by inclusion in the "Third-Party CI" group in gerrit i think? | 15:23 |
| fungi | that group is owned by the "Third-Party Coordinators" group, for which i think the only currently active member is frickler | 15:23 |
| fungi | is this a workflow, or even feature, that we want to reconsider? | 15:24 |
| clarkb | fungi: https://gerrit.googlesource.com/plugins/zuul-results-summary/+/refs/heads/main/web/zuul-summary-status-tab.ts#270 I think it only looks at message content | 15:47 |
| fungi | aha, okay so it's a matter of formatting the results appropriately? | 15:48 |
| clarkb | fungi: https://gerrit.googlesource.com/plugins/zuul-results-summary/+/refs/heads/main/web/zuul-summary-status-tab.ts#223 looks like it checks other attributes too but its all message content | 15:48 |
| clarkb | so yes I think you need to configure the commenter name and the message correctly then it is automatically picked up | 15:49 |
| fungi | makes sense. thanks for the clarification! i expect we have outdated user-facing documentation about a lot of this stuff | 15:51 |
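
An aside on what "message content" matching means in practice: upstream Zuul normally reports a buildset summary followed by a bullet list of per-build result lines, and the plugin keys off that text plus the commenter. The snippet below is only an illustrative sketch of parsing that style of comment; the sample comment format and the regex are assumptions, and the authoritative patterns are the ones in zuul-summary-status-tab.ts.

```python
import re

# Illustrative only: upstream Zuul usually reports builds as a bullet list of
# "- <job> <log-url> : <RESULT> in <duration>" lines under a buildset summary.
# The real matching lives in zuul-summary-status-tab.ts and may differ.
BUILD_LINE = re.compile(
    r"^- (?P<job>\S+) (?P<url>https?://\S+) : (?P<result>[A-Z_]+) in (?P<duration>.+)$"
)

sample_comment = """\
Build succeeded (check pipeline).
https://zuul.example.org/t/example/buildset/abc123

- tox-py311 https://zuul.example.org/t/example/build/def456 : SUCCESS in 4m 12s
- docs-build https://zuul.example.org/t/example/build/789abc : FAILURE in 2m 03s
"""

for line in sample_comment.splitlines():
    match = BUILD_LINE.match(line)
    if match:
        print(match.group("job"), match.group("result"))
```
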
| clarkb | infra-root I believe https://review.opendev.org/c/opendev/system-config/+/965866 will address the zuul restart stoppage when a service is already shutdown. The message check there is based on the stderr captured by the failed run from the weekend before last | 16:06 |
| clarkb | and then I was going to double check no one else wanted to try the gitea 1.25 update before I started working on that. Oh and if we think we can land the etherpad upgrade with the launcher situation then I'd like to try and do that soon too (please test on the held node too, it is at 50.56.157.144 and I have a clarkb-test pad up there you can check) | 16:08 |
| corvus | clarkb: fungi i think https://review.opendev.org/965940 warrants quick review to fix the graphs | 16:42 |
| corvus | has a +2 from frickler already but i want us all to be aware of that | 16:43 |
| frickler | I am "Third-Party Coordinators"? good to know ;) should I add infra - root or some other similar group? I think this isn't something really openstack specific anymore, right? or is that group really obsolete, then? | 16:43 |
| clarkb | done | 16:43 |
| clarkb | frickler: I suspect that the way the group is used is fairly openstack specific, but yes if some other projects (like zuul with software factory third party ci) wanted to allow voting +/-1 we'd need to sort that out | 16:44 |
| frickler | the other option would be to add the TC. I'd just prefer not to be left on my own there once I remove the inactive members | 16:47 |
| corvus | fungi: just a guess: opendev puts the pipeline name in its results, which is not standard zuul. that's the first thing i'd check for what may not match. | 16:52 |
| fungi | thanks. in this case i have a feeling they're not using zuul at all, or it's a very old zuul. for example, no buildset result text/link, and individual build result entries aren't in a bullet list | 17:04 |
| opendevreview | Merged openstack/project-config master: Update zuul-launcher metrics https://review.opendev.org/c/openstack/project-config/+/965940 | 17:10 |
| mnasiadka | frickler: is that third party thing OpenStack or OpenDev? ;-) (judging if TC should be even involved) | 17:20 |
| clarkb | I think the friction here is that opendev doesn't want to police third party ci for projects because the policies are varied and it puts us closer to the position of supporting people with the ci itself, which is something we really don't have the time to do | 17:21 |
| clarkb | but at the same time third party ci isn't inherently openstack specific (zuul has third party ci for example) | 17:21 |
| clarkb | I'm trying to remember what exactly that group all does. I think I'm mistaken that it allows +/-1 voting and instead it merely identifies the bots as bots in the gerrit system | 17:22 |
| clarkb | it's probably fine for opendev to manage that | 17:22 |
| frickler | https://docs.opendev.org/opendev/system-config/latest/gerrit.html has some hints. like "capability.emailReviewers = deny group Third-Party CI", but not sure if that is still current? | 17:26 |
| fungi | i hope it's still current, since we don't normally make changes to the global gerrit configuration without documenting it there | 17:27 |
| clarkb | ya it should be in sync | 17:27 |
| corvus | ykarel: remote: https://review.opendev.org/c/zuul/zuul/+/965954 Launcher: honor provider when selecting ready nodes [NEW] | 17:36 |
| corvus | that should take care of it -- sorry that slipped through. that was the "reuse nodes that were already built for a canceled request" code path, but we forgot to limit it by provider. | 17:36 |
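
For readers following along, a minimal sketch (not the actual Zuul code in 965954) of the described fix: when reusing ready/building nodes left over from a canceled request, restrict the candidates to the provider serving the current request.

```python
from dataclasses import dataclass

@dataclass
class Node:
    id: str
    provider: str
    state: str  # e.g. "ready", "building", "in-use"

def select_reusable_nodes(nodes, provider, count):
    """Pick leftover ready/building nodes, but only from the requesting provider.

    The pre-fix behaviour (as described above) skipped the provider check and
    could hand a request nodes from a different provider, producing
    mixed-provider multinode node sets.
    """
    candidates = [
        n for n in nodes
        if n.state in ("ready", "building") and n.provider == provider
    ]
    return candidates[:count]

# Example: two leftover nodes from one provider plus one from another.
pool = [
    Node("np001", "rax-dfw", "ready"),
    Node("np002", "raxflex-iad3", "ready"),
    Node("np003", "rax-dfw", "building"),
]
print(select_reusable_nodes(pool, "rax-dfw", 2))  # only the rax-dfw nodes
```
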
| fungi | clarkb: stephenfin: looks like pypi has pbr 7.0.3 as of 35 minutes ago | 17:39 |
| clarkb | corvus: in that test case you set the instances limit to 2, is that per provider not global right? | 17:39 |
| clarkb | fungi: yup I just looked at that | 17:39 |
| clarkb | so theoretically we're ready now from the pbr side of things? | 17:39 |
| stephenfin | there's still pyproject.toml work to be done, but hopefully that's the end of the panic | 17:40 |
| corvus | clarkb: yep | 17:40 |
| stephenfin | guess we'll know at some point in the next month or so | 17:40 |
| clarkb | corvus: thanks +2 but with a small testing suggestion | 17:41 |
| clarkb | stephenfin: I think upstream had some ability to test it. Maybe we ask them to check now? | 17:41 |
| stephenfin | I already left notes for jaraco on the two tickets he opened | 17:42 |
| clarkb | perfect thanks | 17:43 |
| corvus | clarkb: ++ done | 17:44 |
| frickler | fungi: so it seems to me that the 3rd party thing is better suited to be in opendev hands then rather than openstack. how about I add you to it and then you can follow-up however you think is appropriate? | 17:44 |
| corvus | (and yes, it was implicit due to quotas, but belts and suspenders) | 17:44 |
| fungi | frickler: that's fine, though in this case it turned out there was no followup required since what the individual was asking about wasn't actually controlled by membership in that group anyway | 17:46 |
| frickler | fungi: yeah, I was more thinking about following up with further member additions and/or cleanup | 17:47 |
| clarkb | corvus: considering that launcher bug I would guess that things generally work as it's still a fallback case? In that case moving forward with https://review.opendev.org/c/opendev/system-config/+/956593 is probably fine as long as the held etherpad looks good | 18:15 |
| corvus | yes, i don't think we need to change any plans due to it. | 18:16 |
| clarkb | I just wanted to make sure it's not a 100% fail case for multinode jobs that rely on shared networks | 18:17 |
| corvus | it's more likely to be hit after gate resets when there are canceled node requests | 18:17 |
| corvus | if things are proceeding smoothly, less so. | 18:18 |
| corvus | i'd like to get it merged today if possible; so i'm trying to figure out what's causing the timeouts. | 18:18 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update Gitea to 1.25 https://review.opendev.org/c/opendev/system-config/+/965960 | 18:30 |
| clarkb | that is a first pass at gitea 1.25. | 18:31 |
| clarkb | Looks like we also need to prune the vexxhost backup server. We can potentially purge backups for eavesdrop01, paste01, refstack01, and review02 from that server as well if we're comfortable doing that (maybe a subset or all of them?) | 18:35 |
| fungi | i just started | 18:36 |
| fungi | it's been running in a root screen session for about 5 minutes | 18:36 |
| clarkb | ack thanks | 18:38 |
| clarkb | any thoughts on whether we should be purging any of those four sets of backups at this point? | 18:38 |
| fungi | no thoughts. my head is currently full of other thoughts | 18:39 |
| fungi | also see #openstack-infra where i pinged dmsimard (since he doesn't seem to be in here) about our ovh credits running out | 18:40 |
| clarkb | ya just saw that. We can ping amorin too if necessary | 18:40 |
| clarkb | good news on the backup front is we get regular reminders about it so if we don't make a decision now we'll be reminded in a month or so :) | 18:40 |
| fungi | yeah, he seems to be in here (and not there) | 18:40 |
| clarkb | https://zuul.opendev.org/t/openstack/build/6b4b75604b4b49c3b4431327b3a1c25e something is angry | 18:49 |
| clarkb | afs: disk cache read error in CacheItems slot 1562498 off 124999860/125000020 code -5/80 pid 493006 (apache2) | 18:50 |
| clarkb | from dmesg | 18:50 |
| fungi | urg | 18:50 |
| opendevreview | James E. Blair proposed openstack/project-config master: Further fixes to zuul-launcher metrics https://review.opendev.org/c/openstack/project-config/+/965966 | 18:50 |
| clarkb | spot checking another mirror the problem does not appear to be global | 18:50 |
| fungi | i guess that's on an executor? | 18:50 |
| clarkb | fungi: no that is mirror.dfw3.raxflex.opendev.org | 18:50 |
| fungi | oh | 18:50 |
| corvus | maybe stop apache there and clean the cache? | 18:51 |
| fungi | ah, the apache2 there should have been a giveaway | 18:51 |
| corvus | actually, maybe just fs flushall first without even stopping apache | 18:51 |
| clarkb | ya looking at the logs it's the same slot each time | 18:52 |
| clarkb | ack will try that | 18:52 |
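
For reference, roughly the recovery sequence being discussed, sketched with subprocess. `fs flushall` is a standard OpenAFS client command; the systemd unit names and cache path in the commented-out heavier path are assumptions about how the mirror is laid out.

```python
import subprocess

def run(cmd):
    """Run a command, echoing it first so the steps are visible."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True)

# Non-disruptive first step: flush the AFS cache without touching apache.
run(["fs", "flushall"])

# If the kernel module has already oopsed (as it had here), the heavier path is
# to stop apache and the AFS client, wipe the on-disk cache contents (not the
# mountpoint itself, since it is a separate logical volume), and restart.
# Unit names and cache path are assumptions:
# run(["systemctl", "stop", "apache2"])
# run(["systemctl", "stop", "openafs-client"])
# run(["sh", "-c", "rm -rf /var/cache/openafs/*"])
# run(["systemctl", "start", "openafs-client"])
# run(["systemctl", "start", "apache2"])
```
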
| fungi | checking dmesg to see if there are any lower level errors mentioned for that block device | 18:52 |
| fungi | though these errors have completely filled the ring buffer | 18:53 |
| clarkb | fungi: I couldn't find any and the volume appears to still be rw | 18:53 |
| clarkb | fs flushall is running and not returning quickly. But I guess that may be expected | 18:53 |
| fungi | slot 1562498 osm | 18:53 |
| corvus | i think it's a GC based thing | 18:54 |
| fungi | slot 1562498 isn't the only one | 18:54 |
| clarkb | system load is really high now | 18:54 |
| fungi | up around 18:52:05 i see it complain about slots 1404560, 1523212, 1363181, and 1432463 | 18:54 |
| clarkb | but the system seems mostly idle. Maybe it is something with writing to that disk? | 18:55 |
| fungi | [Mon Nov 3 18:52:05 2025] afs: Cannot open cache file (code -5). Trying to continue, but AFS accesses may return errors or panic the system | 18:55 |
| fungi | and then there's a kernel oops | 18:55 |
| corvus | reboot may be in order then | 18:56 |
| fungi | kernel BUG at /var/lib/dkms/openafs/1.8.13/build/src/libafs/MODLOAD-6.8.0-85-generic-SP/afs_dcache.c:1209! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI | 18:56 |
| clarkb | should we stop apache and disable it then reboot so we can try clearing the cache manually after it comes back up? | 18:56 |
| fungi | looks like we mount /var/cache/openafs from a logical volume on a cinder volume | 18:56 |
| clarkb | fungi: yes, and that cinder volume should be split between the openafs cache and apache2's cache | 18:57 |
| fungi | so something may have happened to the cinder volume, e.g. iscsi network interruption | 18:57 |
| clarkb | should I reboot now? | 18:57 |
| corvus | ++ | 18:58 |
| clarkb | I'll disable apache2 as well | 18:58 |
| clarkb | fungi: you're ssh'd into the host is it ok if I reboot now? | 18:59 |
| fungi | yes, i'm logged out, sorry! | 19:00 |
| clarkb | I've asked the server to reboot itself. Pings have not stopped so I worry it may be hung up on some shutdown routine. But I've got a local machine with something like a 60 second timeout before it proceeds, so I'll wait for a couple of minutes to see if that is the case here before asking openstack to shut down and start it again | 19:01 |
| clarkb | after 176 pings it appears to have stopped pinging. Just as I was sorting out how to ask openstack to shut it down | 19:04 |
| clarkb | now we wait to see if it boots up properly | 19:04 |
| clarkb | ok console log show shows it waiting on a number of processes. I'll give them a bit longer but I think I may need to hard stop it | 19:05 |
| clarkb | Failed to unmount /afs: Device or resource busy is the most recent message | 19:06 |
| fungi | yeah, perhaps unsurprising | 19:07 |
| clarkb | ok I think it may be time to ask for the "physical" power button | 19:08 |
| amorin | hey, something wrong? | 19:08 |
| fungi | the shiny, candy-red button | 19:08 |
| clarkb | amorin: the issue we're debugging is unrelated to ovh. But we are near the end of our ovh credits | 19:09 |
| fungi | amorin: unrelated to what we're working on right now, just ovh credits expiring. got the "unable to process payment" notice today | 19:09 |
| amorin | oh, damn | 19:09 |
| fungi | i pinged dmsimard about it already over in #openstack-infra since he's hanging out in there | 19:09 |
| amorin | ok | 19:10 |
| fungi | i'll bother you with it if he's not around | 19:10 |
| clarkb | mirror.dfw3 is in a powering off state but is not yet powered off | 19:10 |
| amorin | will double chck with him then and keep you in touch | 19:10 |
| fungi | thanks amorin! | 19:10 |
| amorin | good luck with the debug session | 19:10 |
| clarkb | aha there it goes, it is shut down now. I'll try booting it | 19:10 |
| fungi | #status log Pruned /opt/backups-202010 on backup02.ca-ymq-1.vexxhost reducing utilization from 92% to 67% | 19:12 |
| opendevstatus | fungi: finished logging | 19:12 |
| clarkb | it won't power on | 19:12 |
| clarkb | I think we should disable that provider in zuul, and then we're either booting a new server like we've done in the past (replacing the whole thing) or engaging rax to help debug | 19:13 |
| fungi | are you trying server start or reboot? | 19:13 |
| clarkb | fungi: server start after a server stop | 19:13 |
| fungi | k | 19:13 |
| clarkb | should I try a server reboot? | 19:13 |
| fungi | no i don't think that'll work from a shutoff state | 19:13 |
| corvus | i'll write a config change to stop the provider | 19:13 |
| fungi | anyway, i agree on pausing the provider for now | 19:13 |
| clarkb | fungi: yup confirmed it is an error to try rebooting a stopped instance | 19:13 |
| fungi | we probably need dan_with or one of his colleagues to intercede | 19:14 |
| opendevreview | James E. Blair proposed opendev/zuul-providers master: Disable rax-flex due to mirror issue https://review.opendev.org/c/opendev/zuul-providers/+/965971 | 19:14 |
| clarkb | dan_with: fyi we had a server in DFW3 (mirror01.dfw3.raxflex.opendev.org) become unhappy with io to a filesystem backed by a cinder volume. We ended up asking nova to shut down the server and now it won't boot up again after that | 19:14 |
| opendevreview | James E. Blair proposed opendev/zuul-providers master: Revert "Disable rax-flex due to mirror issue" https://review.opendev.org/c/opendev/zuul-providers/+/965972 | 19:15 |
| clarkb | dan_with: we are capable of replacing the entire system (server, volume, and floating ip) and moving on from here with a replacement, but I figure, considering it may be related to volume behaviors, it may be worth having y'all debug it? | 19:15 |
| clarkb | server show also doesn't seem to record any reason for the boot failure | 19:16 |
| fungi | yeah, if nova thinks something's wrong with an attached volume it may refuse to start the instance | 19:16 |
| clarkb | volume show on the volume doesn't show anything problematic either fwiw, but it's not the first time nova may be withholding a full story from us | 19:17 |
| opendevreview | Merged opendev/zuul-providers master: Disable rax-flex due to mirror issue https://review.opendev.org/c/opendev/zuul-providers/+/965971 | 19:17 |
| fungi | right, i don't know if the cinder volume itself is actually the problem. corruption we saw before shutdown could have been entirely guest-side | 19:17 |
| fungi | also volume show is telling you what cinder thinks about the volume, which may not be the same as what nova thinks | 19:18 |
| clarkb | ya I just wish nova had a note for why something isn't booting | 19:19 |
| clarkb | task_state goes from powering-on to None and vm_state never changes from stopped | 19:24 |
| clarkb | (just skimming my terminal scrollback to see if I see anything useful, but that's about it. It does try to power on and then doesn't succeed apparently) | 19:25 |
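
A minimal openstacksdk sketch of the inspect-and-start loop being done by hand above; the cloud name is an assumption standing in for whatever this region is called in clouds.yaml.

```python
import time
import openstack

# Cloud name is an assumption for illustration.
conn = openstack.connect(cloud="raxflex-dfw3")

server = conn.compute.find_server("mirror01.dfw3.raxflex.opendev.org")
if server is None:
    raise SystemExit("server not found")

server = conn.compute.get_server(server.id)
print(server.status, server.task_state, server.vm_state)

if server.status == "SHUTOFF":
    # Reboot is only valid on a running instance; from SHUTOFF it has to be started.
    conn.compute.start_server(server)
    for _ in range(30):
        server = conn.compute.get_server(server.id)
        print(server.status, server.task_state, server.vm_state)
        if server.status == "ACTIVE":
            break
        time.sleep(10)
```
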
| opendevreview | Clark Boylan proposed opendev/system-config master: Update Gitea to 1.25 https://review.opendev.org/c/opendev/system-config/+/965960 | 19:38 |
| opendevreview | Clark Boylan proposed opendev/system-config master: Update Gitea to 1.25 https://review.opendev.org/c/opendev/system-config/+/965960 | 20:01 |
| corvus | clarkb: https://review.opendev.org/965966 is a quick oopsie fix that would be good to get in | 20:10 |
| fungi | that is indeed a lot of additional occurrences | 20:18 |
| clarkb | corvus: ack I've approved it | 20:21 |
| clarkb | (sorry was eating a sandwich) | 20:21 |
| opendevreview | Merged openstack/project-config master: Further fixes to zuul-launcher metrics https://review.opendev.org/c/openstack/project-config/+/965966 | 20:47 |
| clarkb | I'm going to start putting the meeting agenda together. Gitea 1.25, gitea load mitigations, trixie mirroring, and the rax dfw3 mirror issue are all on my list of items to add in | 21:39 |
| clarkb | if there is anything else just let me know | 21:40 |
| clarkb | ok I think I have all of those updates in now | 21:57 |
| cardoe | So looking at the openstack-helm job at https://zuul.opendev.org/t/openstack/status and wondering what I need to do to unstick the job that has been stuck for 52 hours. | 22:07 |
| clarkb | looks like that is request 9d53f1877a6648a38e11e064a3bdb096 | 22:07 |
| clarkb | at 2025-11-01 17:41:16,212 zl01 marks a node for that request as ready but then doesn't proceed from there | 22:09 |
| clarkb | that appears to be after corvus unstuck the upgrade process on zl01 and zl02 | 22:10 |
| clarkb | by about 2.5 hours | 22:10 |
| clarkb | looks like the request is for 5 nodes | 22:11 |
| clarkb | https://grafana.opendev.org/d/0172e0bb72/zuul-launcher3a-rackspace-flex?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all seems to reflect this: we have 3 ready nodes and 2 requested, but I'm not sure we're doing anything at this point for the other two requested nodes | 22:12 |
| corvus | i manually re-enqueued that. we should be able to continue to debug based on the ids from clark. | 22:12 |
| clarkb | corvus: cardoe I think this may be that nova api error that forces us to use a different microversion | 22:13 |
| clarkb | a normal server list against iad3 just failed for me | 22:13 |
| clarkb | now to remember how to specify the older version so that we can list servers and delete them | 22:14 |
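
With python-openstackclient, pinning the compute API microversion is what `--os-compute-api-version` does; here is a sketch driving the CLI via subprocess, with the cloud name as an assumption.

```python
import json
import subprocess

# Cloud name is an assumption for illustration.
CLOUD = "raxflex-iad3"

def server_list(microversion=None):
    cmd = ["openstack", "--os-cloud", CLOUD]
    if microversion:
        # Pin an older compute microversion so the listing does not trip over
        # the null-field validation enforced by newer microversions.
        cmd += ["--os-compute-api-version", microversion]
    cmd += ["server", "list", "-f", "json"]
    return json.loads(subprocess.check_output(cmd))

# Default (new) microversion: this is the call that was failing.
# servers = server_list()

# Workaround: an old microversion that predates the strict validation.
servers = server_list("2.1")
for s in servers:
    print(s["ID"], s["Name"], s["Status"])
```
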
| clarkb | corvus: all 5 servers show as active so none of them clearly indicate an error. I'm going to try and show each one without the microversion to see if I can determine which one is the problem and delete only that one | 22:18 |
| clarkb | corvus: is that safe enough with zuul launcher? | 22:18 |
| corvus | clarkb: those nodes might get reused | 22:19 |
| clarkb | 7f50209b-8224-4b3a-b192-171d4b6bd185 (npb29cdece0ec64) and ec872fa3-f349-44a5-8694-6d00becef9ea (npa24e1b4413c14) are bad | 22:19 |
| clarkb | corvus: ya I was worried about that. Do we need to mark both of those for deletion on the nodepool side first then I can delete them manually on the cloud side? | 22:20 |
| clarkb | s/nodepool/launcher | 22:20 |
| corvus | npb29cdece0ec64 is in hold | 22:20 |
| corvus | npa24e1b4413c14 is ready | 22:20 |
| clarkb | interesting that implies the node was good for long enough for us to use it | 22:21 |
| clarkb | but now it appears to be triggering the "you can't show this node with the new microversion" problem | 22:21 |
| corvus | a24 has been reassigned to 00d042a30e61445f90c498311c299797 | 22:22 |
| corvus | so it's going to be in use when that request is finished | 22:22 |
| clarkb | I guess once the node is booted and we ssh keyscan it we're not interacting with the nova api again until we try to delete the instance | 22:23 |
| corvus | yep | 22:23 |
| clarkb | I have a number of holds in place; if that server is one of mine (perhaps for etherpad?) I can recycle it | 22:23 |
| corvus | it is, but that's probably not important now | 22:23 |
| clarkb | corvus: I think we have to clear out the two bad nodes from nova to get things working again. Either that or we have to start forcing zuul launcher to use the old microversion | 22:24 |
| clarkb | which seems like a step backwards from a high level | 22:24 |
| corvus | so... lemme get this straight: | 22:25 |
| corvus | 1) everything works fine | 22:25 |
| corvus | 2) something happens; we don't know what | 22:25 |
| corvus | 3) all nova api calls start failing | 22:25 |
| corvus | 4) the only way to fix that is a) delete all servers or b) use an old microversion | 22:25 |
| corvus | is that right? | 22:25 |
| clarkb | roughly, for 2) I think nova believes it to be some sort of db field corruption replacing some field with an unexpected null value. that then caused 3) failures for any attribute listing of that node with a new microversion that enforces that field be not null | 22:26 |
| clarkb | the workaround works because old microversions didn't enforce that behavior. It seems likely to me that this bug has existed since forever and because there was no enforcement it was simply never noticed. Now that they enforce it, kaboom | 22:27 |
| corvus | but this affects the "list all servers" call? | 22:27 |
| clarkb | corvus: correct | 22:27 |
| corvus | so one server gets corrupted and it tanks the region. | 22:27 |
| clarkb | (as well as server show which is how I narrowed it down to those two) | 22:27 |
| clarkb | yes | 22:27 |
| corvus | clarkb: it looks like the new request is pretty stuck too.. honestly, i'd say sure just delete the nodes out from under the launcher and lets see what happens. the job will fail, but that should be the extent of the fallout. anything more is a bug. | 22:29 |
| clarkb | corvus: ok this is weird, now server listing works again | 22:29 |
| clarkb | without the microversion I mean | 22:30 |
| corvus | oO | 22:30 |
| clarkb | cardoe you didn't go and magic things on the backend did you? | 22:30 |
| cardoe | I don't know what magic things to do. | 22:30 |
| cardoe | Heck I'd love to figure out Zuul enough to clean up container builds and all of this stuff | 22:31 |
| cardoe | but I'll admit too much black magic and not enough cycles | 22:31 |
| clarkb | cardoe: this is all booting servers in rax flex iad3 fwiw. Not really anything to do with job payloads | 22:31 |
| cardoe | Now the PTL might have done some magic | 22:31 |
| clarkb | cardoe: I believe the problem is that listing servers broke in that region. But now it magically works again for some reason. Maybe by listing/showing things using the old microversion it uncorrupted the fields or something | 22:32 |
| clarkb | https://bugs.launchpad.net/nova/+bug/2095364 this is the underlying issue I think | 22:34 |
| clarkb | that new request is for the same job on a different openstack helm change. And it again has 3 available and 2 requested | 22:41 |
| clarkb | server list is working, but I don't see any new booting instances. So it could be that there are two problems here. Or one now but was two before when server listing was not working | 22:42 |
| clarkb | ok I think I see what may be part of problem 2 | 22:44 |
| clarkb | we have these quotas: 10 instances, 51200MB of memory, and 20 cores. Unfortunately only 5 four-core instances fit into the core limit and 6 fit into the memory limit | 22:45 |
| clarkb | so I think we need to drop the instance limit in that cloud region to 5 :/ | 22:45 |
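
Spelling out that arithmetic, assuming the usual 4 vCPU / 8 GB test node flavor:

```python
# Region quota as reported: 10 instances, 20 vCPUs, 51200 MB of RAM.
quota = {"instances": 10, "cores": 20, "ram_mb": 51200}

# Assumed per-node flavor for the standard test nodes: 4 vCPUs, 8 GB RAM.
flavor = {"cores": 4, "ram_mb": 8192}

fits = {
    "instances": quota["instances"],
    "cores": quota["cores"] // flavor["cores"],   # 20 // 4 = 5
    "ram": quota["ram_mb"] // flavor["ram_mb"],   # 51200 // 8192 = 6
}
print(fits)                 # {'instances': 10, 'cores': 5, 'ram': 6}
print(min(fits.values()))   # 5 -> effective cap, hence lowering max-servers to 5
```
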
| clarkb | I know that cloudnull intended to bump the quotas up for us but it hasn't happened | 22:45 |
| clarkb | cardoe: ^ not sure if that is something you are able to assist with. I'm not sure if it hasn't happened due to capacity of the cloud or just time constraints | 22:46 |
| clarkb | corvus: on the launcher side I suspect that the expectation is that the held nodes will eventually go away and we'll have enough room to fulfill the request. But maybe we need to consider nodes in a hold state as less ephemeral when calculating whether or not we have sufficient room to supply a request? | 22:47 |
| cardoe | Let me prod him. | 22:47 |
| opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 23:10 |
| corvus | clarkb: that is an interesting suggestion. i can see how it makes sense in this case, but i wonder whether we should generalize it (since in other cases, it could cause NODE_FAILUREs). it probably does make sense, but it's a tradeoff to consider carefully. | 23:10 |
| clarkb | corvus: I'm not sure I understand the note about node failures? Wouldn't we boot fewer nodes if we counted held nodes against "permanent" quota? | 23:13 |
| corvus | if we only had iad3 with quota for 5 nodes and received a request for 5 nodes while we had 1 held node and we implement your suggestion, that becomes a NODE_FAILURE in the future, but today it waits for the held node to free up. | 23:14 |
| corvus | (with your suggestion, in opendev, we fail over to another cloud, but not everyone has 6 clouds) | 23:14 |
| clarkb | ah I guess in my mind it would've never accepted the request in the first place | 23:14 |
| clarkb | so in the opendev scenario it falls back to another provider and in a single cloud deployment it would indeed produce node failures since there is no fallback | 23:15 |
| corvus | it fails because if we consider held nodes as unusable provider quota, then we have no provider that could possibly fulfill the request. | 23:15 |
| corvus | yeah. there's not really much of a distinction between "accepted" and "not accepted" for purposes of this conversation. it's more about deciding whether a provider can possibly fulfill a request. | 23:16 |
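
A toy illustration of the tradeoff being weighed (not Zuul's actual logic): if held nodes count against the quota used to decide whether a provider could ever satisfy a request, a full-size request gets declined up front instead of waiting for the hold to clear.

```python
def can_ever_fulfill(request_size, provider_limit, held_nodes,
                     count_held_as_permanent):
    """Decide whether a provider could ever satisfy a request.

    Today held nodes are treated as temporary, so the request just waits for
    them to be released. If they were counted as permanently consumed quota,
    the same request would be declined up front (and, with no other provider
    available, end in NODE_FAILURE).
    """
    usable = provider_limit - (held_nodes if count_held_as_permanent else 0)
    return request_size <= usable

# A 5-node request against a 5-node provider holding one node:
print(can_ever_fulfill(5, 5, 1, count_held_as_permanent=False))  # True  -> wait
print(can_ever_fulfill(5, 5, 1, count_held_as_permanent=True))   # False -> decline / fail over
```
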
| clarkb | other (probably bad) ideas that come to mind: we could cannibalize held nodes if necessary and force them to be deleted to return quota. I wonder too if we can annotate the request in the status page somehow as "waiting for quota" | 23:18 |
| clarkb | I think if we don't want to change the behavior of request handling then updating the status page to make it more clear why we're waiting is a decent compromise | 23:18 |
| corvus | i think cannibalizing held nodes is not worth it. but i definitely think we should expose more info. | 23:19 |
| corvus | (held nodes are generally held because a human needs time on them to fix something complex, so that's an expensive process we don't want to elongate. but we do, by default, set a time limit on held nodes) | 23:20 |
| clarkb | in the meantime I guess I can delete my held nodes so that the OSH jobs can complete? Or do you think it is helpful to keep things in this state for now? | 23:29 |
| corvus | yeah, i think fundamentally, we don't have the capacity in raxflex-iad for that many held nodes, so that's probably best. | 23:37 |
| corvus | we could probably fix this situation by deleting the request and the ready nodes; that might get it assigned to a different provider, but no guarantee it won't happen again. | 23:37 |
| clarkb | ok I've deleted the autohold corresponding to those two nodes in rax iad | 23:48 |
| clarkb | that should allow it to boot the two nodes required for the current request | 23:48 |