Monday, 2025-11-03

opendevreviewyatin proposed opendev/irc-meetings master: Move neutron team and CI meeting by 1 hour  https://review.opendev.org/c/opendev/irc-meetings/+/96590610:47
ykarelcorvus, fungi started seeing the mixed node provider issue again, was something changed again?12:39
ykarelhttps://zuul.openstack.org/build/8afc3ba76b6a437d82bb37a0cea0637012:39
ykarelhttps://zuul.openstack.org/build/9a8dc417c1164b3482d4daecf0f0af6712:39
opendevreviewMerged opendev/irc-meetings master: Move neutron team and CI meeting by 1 hour  https://review.opendev.org/c/opendev/irc-meetings/+/96590612:52
ykarelanother from today https://zuul.opendev.org/t/openstack/build/facb5a9deebe4af3b1f7170618338b4713:28
fricklerykarel: looks like something is amiss with rax in general for the past two days, very likely related to the latest updates https://grafana.opendev.org/d/fd44466e7f/zuul-launcher3a-rackspace?orgId=1&from=now-7d&to=now&timezone=utc&var-region=$__all13:35
fricklerinfra-root: ^^13:35
frickleralthough other graphs look similarly broken and if we were really running only 5 instances in total I think we'd know?13:37
frickleryes, global zuul status graph looks more like I'd expect it to. so the multinode issue will likely be something different, will defer to corvus I guess13:41
ykarelfrickler, ack13:51
ykarelyes the last one above was on rax itself and that's not on graphs so something odd there13:52
*** dhill is now known as Guest3046913:53
corvushrm, i wonder if we're letting "ready/building node" take priority over provider locality after a failure....14:49
corvusthe graphs are an easy fix, we just changed some of the escaping in the metric names.  i'll look at both things later today.14:49
opendevreviewJames E. Blair proposed openstack/project-config master: Update zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96594015:04
fungiso... someone reached out to me via irc privmsg over the weekend asking to have their third-party ci account added to the "Zuul Summary" tab in gerrit, which is controlled by inclusion in the "Third-Party CI" group in gerrit i think?15:23
fungithat group is owned by the "Third-Party Coordinators" group, for which i think the only currently active member is frickler15:23
fungiis this a workflow, or even feature, that we want to reconsider?15:24
clarkbfungi: https://gerrit.googlesource.com/plugins/zuul-results-summary/+/refs/heads/main/web/zuul-summary-status-tab.ts#270 I think it only looks at message content15:47
fungiaha, okay so it's a matter of formatting the results appropriately?15:48
clarkbfungi: https://gerrit.googlesource.com/plugins/zuul-results-summary/+/refs/heads/main/web/zuul-summary-status-tab.ts#223 looks like it checks other attributes too but it's all message content15:48
clarkbso yes I think you need to configure the commenter name and the message correctly then it is automatically picked up15:49
fungimakes sense. thanks for the clarification! i expect we have outdated user-facing documentation about a lot of this stuff15:51
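For reference, a rough sketch of the kind of commenter-name and message-content check described above. This is illustrative only: the actual patterns live in the linked zuul-summary-status-tab.ts and may differ, and the sample comment format shown here is an assumption.

    import re

    # Illustrative only -- the real matching lives in zuul-summary-status-tab.ts
    # and may use different patterns.  This just mimics the idea of recognizing
    # a Zuul results comment by the commenter's name plus per-build lines in the
    # message body.
    COMMENTER = re.compile(r"\bzuul\b", re.IGNORECASE)  # assumed name check
    BUILD_LINE = re.compile(
        r"^- (?P<job>\S+) (?P<url>https?://\S+) : (?P<result>[A-Z_]+)",
        re.MULTILINE,
    )

    def looks_like_zuul_results(author_name: str, message: str) -> bool:
        """Return True if a Gerrit comment resembles a Zuul results summary."""
        return bool(COMMENTER.search(author_name)) and bool(BUILD_LINE.search(message))

    sample = (
        "Build succeeded (check pipeline).\n"
        "- openstack-tox-py312 https://zuul.opendev.org/t/openstack/build/abc123 : SUCCESS in 5m 02s\n"
    )
    print(looks_like_zuul_results("Zuul CI", sample))  # True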
clarkbinfra-root I believe https://review.opendev.org/c/opendev/system-config/+/965866 will address the zuul restart stoppage when a service is already shut down. The message check there is based on the stderr captured by the failed run from the weekend before last16:06
clarkband then I was going to double check no one else wanted to try the gitea 1.25 update before I started working on that. Oh and if we think we can land the etherpad upgrade with the launcher situation then I'd like to try and do that soon too (please test on the held node too; it is at 50.56.157.144 and I have a clarkb-test pad up there you can check)16:08
corvusclarkb: fungi i think https://review.opendev.org/965940 warrants quick review to fix the graphs16:42
corvushas a +2 from frickler already but i want us all to be aware of that16:43
fricklerI am "Third-Party Coordinators"? good to know ;) should I add infra-root or some other similar group? I think this isn't something really openstack specific anymore, right? or is that group really obsolete, then?16:43
clarkbdone16:43
clarkbfrickler: I suspect that the way the group is used is fairly openstack specific, but yes if some other projects (like zuul with software factory third party ci) wanted to allow voting +/-1 we'd need to sort that out16:44
fricklerthe other option would be to add the TC. I'd just prefer not to be left on my own there once I remove the inactive members16:47
corvusfungi: just a guess: opendev puts the pipeline name in its results, which is not standard zuul.  that's the first thing i'd check for what may not match.16:52
fungithanks. in this case i have a feeling they're not using zuul at all, or it's a very old zuul. for example, no buildset result text/link, and individual build result entries aren't in a bullet list17:04
opendevreviewMerged openstack/project-config master: Update zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96594017:10
mnasiadkafrickler: is that third party thing OpenStack or OpenDev? ;-) (judging if TC should be even involved)17:20
clarkbI think the friction here is that opendev doesn't want to police third party ci for projects because the policies are varied and it puts us closer to the position of supporting people with the ci itself, which is something we really don't have the time to do17:21
clarkbbut at the same time third party ci isn't inherently openstack specific (zuul has third party ci for example)17:21
clarkbI'm trying to remember what exactly that group all does. I think I'm mistaken that it allows +/-1 voting and instead it merely identifies the bots as bots in the gerrit system17:22
clarkbit's probably fine for opendev to manage that17:22
fricklerhttps://docs.opendev.org/opendev/system-config/latest/gerrit.html has some hints, like "capability.emailReviewers = deny group Third-Party CI", but not sure if that is still current?17:26
fungii hope it's still current, since we don't normally make changes to the global gerrit configuration without documenting it there17:27
clarkbya it should be in sync17:27
corvusykarel: remote:   https://review.opendev.org/c/zuul/zuul/+/965954 Launcher: honor provider when selecting ready nodes [NEW]        17:36
corvusthat should take care of it -- sorry that slipped through.  that was the "reuse nodes that were already built for a canceled request" code path, but we forgot to limit it by provider.17:36
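A purely illustrative sketch of the fix being described (not Zuul's actual launcher code): when reusing ready nodes left over from a canceled request, only consider nodes belonging to the provider handling the current request. The node ids and provider names below are made up.

    from dataclasses import dataclass

    # Not Zuul's implementation -- just the shape of the provider-locality filter.
    @dataclass
    class Node:
        id: str
        provider: str
        state: str  # e.g. "ready", "building", "in-use"

    def reusable_ready_nodes(nodes, provider):
        """Select leftover ready nodes, limited to the handling provider."""
        return [n for n in nodes if n.state == "ready" and n.provider == provider]

    pool = [
        Node("np0000000001", "rax-dfw", "ready"),
        Node("np0000000002", "raxflex-iad3", "ready"),  # skipped: wrong provider
    ]
    print([n.id for n in reusable_ready_nodes(pool, "rax-dfw")])  # ['np0000000001']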
fungiclarkb: stephenfin: looks like pypi has pbr 7.0.3 as of 35 minutes ago17:39
clarkbcorvus: in that test case you set the instances limit to 2, is that per provider, not global, right?17:39
clarkbfungi: yup I just looked at that17:39
clarkbso theoretically we're ready now from the pbr side of things?17:39
stephenfinthere's still pyproject.toml work to be done, but hopefully that's the end of the panic17:40
corvusclarkb: yep17:40
stephenfinguess we'll know at some point in the next month or so17:40
clarkbcorvus: thanks +2 but with a small testing suggestion17:41
clarkbstephenfin: I think upstream had some ability to test it. Maybe we ask them to check now?17:41
stephenfinI already left notes for jaraco on the two tickets he opened17:42
clarkbperfect thanks17:43
corvusclarkb: ++ done17:44
fricklerfungi: so it seems to me that the 3rd party thing is better suited to be in opendev hands then, rather than openstack. how about I add you to it and then you can follow-up however you think is appropriate?17:44
corvus(and yes, it was implicit due to quotas, but belts and suspenders)17:44
fungifrickler: that's fine, though in this case it turned out there was no followup required since what the individual was asking about wasn't actually controlled by membership in that group anyway17:46
fricklerfungi: yeah, I was more thinking about following up with further member additions and/or cleanup17:47
clarkbcorvus: considering that launcher bug I would guess that things generally work as it's still a fallback case? In that case moving forward with https://review.opendev.org/c/opendev/system-config/+/956593 is probably fine as long as the held etherpad looks good18:15
corvusyes, i don't think we need to change any plans due to it.18:16
clarkbI just wanted to make sure it's not a 100% fail case for multinode jobs that rely on shared networks18:17
corvusit's more likely to be hit after gate resets when there are canceled node requests18:17
corvusif things are proceeding smoothly, less so.18:18
corvusi'd like to get it merged today if possible; so i'm trying to figure out what's causing the timeouts.18:18
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.25  https://review.opendev.org/c/opendev/system-config/+/96596018:30
clarkbthat is a first pass at gitea 1.25.18:31
clarkbLooks like we also need to prune the vexxhost backup server. We can potentially purge backups for eavesdrop01, paste01, restack01, and review02 from that server as well if we're comfortable doing that (maybe a subset or all of them?)18:35
fungii just started18:36
fungiit's been running in a root screen session for about 5 minutes18:36
clarkback thanks18:38
clarkbany thoughts on whether we should be purging any of those four sets of backups at this point?18:38
fungino thoughts. my head is currently full of other thoughts18:39
fungialso see #openstack-infra where i pinged dmsimard (since he doesn't seem to be in here) about our ovh credits running out18:40
clarkbya just saw that. We can ping amorin too if necessary18:40
clarkbgood news on the backup front is we get regular reminders about it so if we don't make a decision now we'll be reminded in a month or so :)18:40
fungiyeah, he seems to be in here (and not there)18:40
clarkbhttps://zuul.opendev.org/t/openstack/build/6b4b75604b4b49c3b4431327b3a1c25e something is angry18:49
clarkbafs: disk cache read error in CacheItems slot 1562498 off 124999860/125000020 code -5/80 pid 493006 (apache2)18:50
clarkbfrom dmesg18:50
fungiurg18:50
opendevreviewJames E. Blair proposed openstack/project-config master: Further fixes to zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96596618:50
clarkbspot checking another mirror the problem does not appear to be global18:50
fungii guess that's on an executor?18:50
clarkbfungi: no that is mirror.dfw3.raxflex.opendev.org18:50
fungioh18:50
corvusmaybe stop apache there and clean the cache?18:51
fungiah, the apache2 there should have been a giveaway18:51
corvusactually, maybe just fs flushall first without even stopping apache18:51
clarkbya looking at the logs it's the same slot each time18:52
clarkback will try that18:52
fungichecking dmesg to see if there are any lower level errors mentioned for that block device18:52
fungithough these errors have completely filled the ring buffer18:53
clarkbfungi: I couldn't find any and the volume appears to still be rw18:53
clarkbfs flushall is running and not returning quickly. But I guess that may be expected18:53
fungislot 1562498 osm18:53
corvusi think it's a GC based thing18:54
fungislot 1562498 isn't the only one18:54
clarkbsystem load is really high now18:54
fungiup around 18:52:05 i see it complain about slots 1404560, 1523212, 1363181, and 143246318:54
clarkbbut the system seems mostly idle. Maybe it is something with writing to that disk?18:55
fungi[Mon Nov  3 18:52:05 2025] afs: Cannot open cache file (code -5). Trying to continue, but AFS accesses may return errors or panic the system18:55
fungiand then there's a kernel oops18:55
corvusreboot may be in order then18:56
fungikernel BUG at /var/lib/dkms/openafs/1.8.13/build/src/libafs/MODLOAD-6.8.0-85-generic-SP/afs_dcache.c:1209! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI18:56
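A small helper of the sort that could be used to spot these errors quickly; the patterns are taken from the dmesg lines quoted above, and other deployments may log slightly different text.

    import re
    import subprocess

    # Patterns copied from the dmesg output quoted above; adjust as needed.
    PATTERNS = [
        re.compile(r"afs: disk cache read error in CacheItems slot (\d+)"),
        re.compile(r"afs: Cannot open cache file \(code -5\)"),
    ]

    def afs_cache_errors(dmesg_text: str):
        """Return dmesg lines that look like openafs disk cache failures."""
        return [
            line
            for line in dmesg_text.splitlines()
            if any(pat.search(line) for pat in PATTERNS)
        ]

    if __name__ == "__main__":
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        for line in afs_cache_errors(out):
            print(line)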
clarkbshould we stop apache and disable it then reboot so we can try clearing the cache manually after it comes back up?18:56
fungilooks like we mount /var/cache/openafs from a logical volume on a cinder volume18:56
clarkbfungi: yes and that cinder volume should be split with apache2's cache18:57
fungiso something may have happened to the cinder volume, e.g. iscsi network interruption18:57
clarkbshould I reboot now?18:57
corvus++18:58
clarkbI'll disable apache2 as well18:58
clarkbfungi: you're ssh'd into the host is it ok if I reboot now?18:59
fungiyes, i'm logged out, sorry!19:00
clarkbI've asked the server to reboot itself. Pings have not stopped so I worry it may be hung up on some shutdown routine. But I've got a local machine with something like a 60 second timeout before it proceeds so I'll wait for a couple of minutes to see if that is the case here before asking openstack to shut down and start again19:01
clarkbafter 176 pings it appears to have stopped pinging. Just as I was sorting out how to ask openstack to shut it down19:04
clarkbnow we wait to see if it boots up properly19:04
clarkbok console log show shows it waiting on a number of processes. I'll give them a bit longer but I think I may need to hard stop it19:05
clarkbFailed to unmount /afs: Device or resource busy is the most recent message19:06
fungiyeah, perhaps unsurprising19:07
clarkbok I think it may be time to ask for the "physical" power button19:08
amorinhey, something wrong?19:08
fungithe shiny, candy-red button19:08
clarkbamorin: the issue we're debugging is unrelated to ovh. But we are near the end of our ovh credits19:09
fungiamorin: unrelated to what we're working on right now, just ovh credits expiring. got the "unable to process payment" notice today19:09
amorinoh, damn19:09
fungii pinged dmsimard about it already over in #openstack-infra since he's hanging out in there19:09
amorinok19:10
fungii'll bother you with it if he's not around19:10
clarkbmirror.dfw3 is in a powering off state but is not yet powered off19:10
amorinwill double check with him then and keep you posted19:10
fungithanks amorin!19:10
amoringood luck with the debug session19:10
clarkbaha there it goes it is shutdown now. I'll try booting it now19:10
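For reference, roughly the same stop/wait/start sequence done here by hand, sketched with openstacksdk; the cloud name is a placeholder for whatever clouds.yaml defines, and the timeouts are arbitrary.

    import openstack

    # A sketch of the hard power cycle above.  "raxflex-dfw3" is a placeholder
    # cloud name; adjust to your clouds.yaml entry.
    conn = openstack.connect(cloud="raxflex-dfw3")
    server = conn.compute.find_server("mirror01.dfw3.raxflex.opendev.org")

    conn.compute.stop_server(server)
    conn.compute.wait_for_server(server, status="SHUTOFF", wait=600)

    conn.compute.start_server(server)
    conn.compute.wait_for_server(server, status="ACTIVE", wait=600)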
fungi#status log Pruned /opt/backups-202010 on backup02.ca-ymq-1.vexxhost reducing utilization from 92% to 67%19:12
opendevstatusfungi: finished logging19:12
clarkbit won't power on19:12
clarkbI think we should disable that provider in zuul and then either boot a new server like we've done in the past and replace the whole thing, or engage rax to help debug19:13
fungiare you trying server start or reboot?19:13
clarkbfungi: server start after a server stop19:13
fungik19:13
clarkbshould I try a server reboot?19:13
fungino i don't think that'll work from a shutoff state19:13
corvusi'll write a config change to stop the provider19:13
fungianyway, i agree on pausing the provider for now19:13
clarkbfungi: yup confirmed it is an error to try rebooting a stopped instance19:13
fungiwe probably need dan_with or one of his colleagues to intercede19:14
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Disable rax-flex due to mirror issue  https://review.opendev.org/c/opendev/zuul-providers/+/96597119:14
clarkbdan_with: fyi we had a server in DFW3 (mirror01.dfw3.raxflex.opendev.org) become unhappy with io to a filesystem backed by a cinder volume. We ended up asking nova to shut down the server and now it won't boot up again after that19:14
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Revert "Disable rax-flex due to mirror issue"  https://review.opendev.org/c/opendev/zuul-providers/+/96597219:15
clarkbdan_with: we are capable of replacing the entire system (server, volume, and floating ip) and moving on from here with a replacement, but I figure considering it may be related to volume behaviors it may be worth having ya'll debug it?19:15
clarkbserver show also doesn't seem to record any reason for the boot failure19:16
fungiyeah, if nova thinks something's wrong with an attached volume it may refuse to start the instance19:16
clarkbvolume show on the volume doesn't show anything problematic either fwiw, but it's not the first time nova may be withholding a full story from us19:17
opendevreviewMerged opendev/zuul-providers master: Disable rax-flex due to mirror issue  https://review.opendev.org/c/opendev/zuul-providers/+/96597119:17
fungiright, i don't know if the cinder volume itself is actually the problem. corruption we saw before shutdown could have been entirely guest-side19:17
fungialso volume show is telling you what cinder thinks about the volume, which may not be the same as what nova thinks19:18
clarkbya I just wish nova had a note for why something isn't booting19:19
clarkbtask_state goes from powering-on to None and vm_state never changes from stopped19:24
clarkb(just skimming my terminal scrollback to see if I see anything useful, but that's about it. It does try to power on and then doesn't succeed apparently)19:25
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.25  https://review.opendev.org/c/opendev/system-config/+/96596019:38
opendevreviewClark Boylan proposed opendev/system-config master: Update Gitea to 1.25  https://review.opendev.org/c/opendev/system-config/+/96596020:01
corvusclarkb: https://review.opendev.org/965966 is a quick oopsie fix that would be good to get in20:10
fungithat is indeed a lot of additional occurrences20:18
clarkbcorvus: ack I've approved it20:21
clarkb(sorry was eating a sandwich)20:21
opendevreviewMerged openstack/project-config master: Further fixes to zuul-launcher metrics  https://review.opendev.org/c/openstack/project-config/+/96596620:47
clarkbI'm going to start putting the meeting agenda together. Gitea 1.25, gitea load mitigations, trixie mirroring, and the rax dfw3 mirror issue are all on my list of items to add in21:39
clarkbif there is anything else just let me know21:40
clarkbok I think I have all of those updates in now21:57
cardoeSo looking at the openstack-helm job at https://zuul.opendev.org/t/openstack/status and wondering what I need to do to unstick the 52-hour-stuck job.22:07
clarkblooks like that is request 9d53f1877a6648a38e11e064a3bdb09622:07
clarkbat 2025-11-01 17:41:16,212 zl01 marks a node for that request as ready but then doesn't proceed from there22:09
clarkbthat appears to be after corvus unstuck the upgrade process on zl01 and zl0222:10
clarkbby about 2.5 hours22:10
clarkblooks like the request is for 5 nodes22:11
clarkbhttps://grafana.opendev.org/d/0172e0bb72/zuul-launcher3a-rackspace-flex?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all seems to reflect this: we have 3 ready nodes and 2 requested, but I'm not sure we're doing anything at this point for the other two requested nodes22:12
corvusi manually re-enqueued that.  we should be able to continue to debug based on the ids from clark.22:12
clarkbcorvus: cardoe  I think this may be that nova api error that forces us to use a different microversion22:13
clarkba normal server list against iad3 just failed for me22:13
clarkbnow to remember how to specify the older version so that we can list servers and delete them22:14
clarkbcorvus: all 5 servers show as active so none of them clearly indicate an error. I'm going to try and show each one without the microversion to see if I can determine which one is the problem and delete only that one22:18
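One way to pin the older microversion from Python, assuming the openstacksdk compute proxy honors default_microversion the way keystoneauth adapters do; the CLI equivalent is the --os-compute-api-version option. The cloud name is a placeholder.

    import openstack

    # Assumption: setting default_microversion on the compute proxy pins the
    # microversion for subsequent calls, letting server list/show work around
    # the null-field errors discussed above.
    conn = openstack.connect(cloud="raxflex-iad3")
    conn.compute.default_microversion = "2.1"

    for server in conn.compute.servers():
        print(server.id, server.name, server.status)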
clarkbcorvus: is that safe enough with zuul launcher?22:18
corvusclarkb: those nodes might get reused22:19
clarkb7f50209b-8224-4b3a-b192-171d4b6bd185 (npb29cdece0ec64) and ec872fa3-f349-44a5-8694-6d00becef9ea (npa24e1b4413c14) are bad22:19
clarkbcorvus: ya I was worried about that. Do we need to mark both of those for deletion on the nodepool side first then I can delete them manually on the cloud side?22:20
clarkbs/nodepool/launcher22:20
corvusnpb29cdece0ec64 is in hold22:20
corvusnpa24e1b4413c14 is ready22:20
clarkbinteresting that implies the node was good for long enough for us to use it22:21
clarkbbut now it appears to be creating the "you can't show this node with the new microversion" problem22:21
corvusa24 has been reassigned to 00d042a30e61445f90c498311c29979722:22
corvusso it's going to be in use when that request is finished22:22
clarkbI guess once the node is booted and we ssh keyscan it we're not interacting with the nova api again until we try to delete the instance22:23
corvusyep22:23
clarkbI have a number of holds in place if that server is one of mine (perhaps for etherpad?) I can recycle it22:23
corvusit is, but that's probably not important now22:23
clarkbcorvus: I think we have to clear out the two bad nodes from nova to get things working again. Either that or we have to start forcing zuul launcher to use the old microversion22:24
clarkbwhich seems like a step backwards from a high level22:24
corvusso... lemme get this straight:22:25
corvus1) everything works fine22:25
corvus2) something happens; we don't know what22:25
corvus3) all nova api calls start failing22:25
corvus4) the only way to fix that is a) delete all servers or b) use an old microversion22:25
corvusis that right?22:25
clarkbroughly, for 2) I think nova believes it to be some sort of db field corruption replacing some field with an unexpected null value. That then caused 3): failures for any attribute listing of that node with a new microversion that enforces that field be not null22:26
clarkbthe workaround works because old microversions didn't enforce that behavior. It seems likely to me that this bug has existed since forever and because there was no enforcement it was simply never noticed. Now that they enforce it, kaboom22:27
corvusbut this affects the "list all servers" call?22:27
clarkbcorvus: correct22:27
corvusso one server gets corrupted and it tanks the region.22:27
clarkb(as well as server show which is how I narrowed it down to those two)22:27
clarkbyes22:27
corvusclarkb: it looks like the new request is pretty stuck too.. honestly, i'd say sure just delete the nodes out from under the launcher and let's see what happens.  the job will fail, but that should be the extent of the fallout.  anything more is a bug.22:29
clarkbcorvus: ok this is weird, now server listing works again22:29
clarkbwithout the microversion I mean22:30
corvusoO22:30
clarkbcardoe you didn't go and magic things on the backend did you?22:30
cardoeI don't know what magic things to do.22:30
cardoeHeck I'd love to figure out Zuul enough to clean up container builds and all of this stuff22:31
cardoebut I'll admit too much black magic and not enough cycles22:31
clarkbcardoe: this is all booting servers in rax flex iad3 fwiw. Not really anything to do with job payloads22:31
cardoeNow the PTL might have done some magic22:31
clarkbcardoe: I believe the problem is that listing servers broke in that region. But now it magically works again for some reason. Maybe by listing/showing things using the old microversion it uncorrupted the fields or something22:32
clarkbhttps://bugs.launchpad.net/nova/+bug/2095364 this is the underlying issue I think22:34
clarkbthat new request is for the same job on a different openstack helm change. And it again has 3 available and 2 requested22:41
clarkbserver list is working, but I don't see any new booting instances. So it could be that there are two problems here. Or one now but was two before when server listing was not working22:42
clarkbok I think I see what may be part of problem 222:44
clarkbwe have these quotas: 10 instances, 51200MB of memory, and 20 cores. Unfortunately only 5 four-core instances fit into the core limit and 6 fit into the memory limit22:45
clarkbso I think we need to drop the instance limit in that cloud region to 5 :/22:45
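The arithmetic behind that conclusion, with the per-instance flavor size (4 vCPUs, 8192 MB) assumed from the figures above:

    # Quota figures from the log; the flavor size is an assumption.
    quota = {"instances": 10, "ram_mb": 51200, "cores": 20}
    flavor = {"vcpus": 4, "ram_mb": 8192}

    fit_by_cores = quota["cores"] // flavor["vcpus"]   # 20 // 4 = 5
    fit_by_ram = quota["ram_mb"] // flavor["ram_mb"]   # 51200 // 8192 = 6
    effective_cap = min(quota["instances"], fit_by_cores, fit_by_ram)

    print(effective_cap)  # 5 -- the core quota binds long before the 10-instance cap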
clarkbI know that cloudnull intended to bump the quotas up for us but it hasn't happened22:45
clarkbcardoe: ^ not sure if that is something you are able to assist with. I'm not sure if it hasn't happened due to capacity of the cloud or just time constraints22:46
clarkbcorvus: on the launcher side I suspect that the expectation is that the held nodes will eventually go away and we'll have enough room to fulfill the request. But maybe we need to consider nodes in a hold state as less ephemeral when calculating whether or not we have sufficient room to supply a request?22:47
cardoeLet me prod him.22:47
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional Gitea failure to hold a node  https://review.opendev.org/c/opendev/system-config/+/84818123:10
corvusclarkb: that is an interesting suggestion.  i can see how it makes sense in this case, but i wonder whether we should generalize it (since in other cases, it could cause NODE_FAILUREs).  it probably does make sense, but it's a tradeoff to consider carefully.23:10
clarkbcorvus: I'm not sure I understand the note about node failures? Wouldn't we boot fewer nodes if we counted held nodes against "permanent" quota?23:13
corvusif we only had iad3 with quota for 5 nodes and received a request for 5 nodes while we had 1 held node and we implement your suggestion, that becomes a NODE_FAILURE in the future, but today it waits for the held node to free up.23:14
corvus(with your suggestion, in opendev, we fail over to another cloud, but not everyone has 6 clouds)23:14
clarkbah I guess in my mind it would've never accepted the request in the first place23:14
clarkbso in the opendev scenario it falls back to another provider, and in a single cloud deployment it would indeed produce node failures since there is no fallback23:15
corvusit fails because if we consider held nodes as unusable provider quota, then we have no provider that could possibly fulfill the request.23:15
corvusyeah.  there's not really much of a distinction between "accepted" and "not accepted" for purposes of this conversation.  it's more about deciding whether a provider can possibly fulfill a request.23:16
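An illustrative sketch (not Zuul's launcher code) of the tradeoff being weighed here: if held nodes count as permanently consumed quota, a provider declines a request it could eventually serve once the hold expires, which becomes a NODE_FAILURE when no other provider can take it.

    # Not Zuul code -- just the decision being discussed.
    def provider_can_fulfill(request_nodes, quota_instances, in_use, held,
                             count_held_as_permanent=True):
        consumed = in_use + held if count_held_as_permanent else in_use
        return request_nodes <= quota_instances - consumed

    # iad3-like numbers: effective quota of 5, one held node, a 5-node request.
    print(provider_can_fulfill(5, 5, in_use=0, held=1))        # False: decline / fail over
    print(provider_can_fulfill(5, 5, in_use=0, held=1,
                               count_held_as_permanent=False)) # True: accept and wait for quota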
clarkbother (probably bad) ideas that come to mind: we could cannibalize held nodes if necessary and force them to be deleted to return quota. I wonder too if we can annotate the request in the status page somehow as "waiting for quota"23:18
clarkbI think if we don't want to change the behavior of request handling then updating the status page to make it more clear why we're waiting is a decent compromise23:18
corvusi think cannibalizing held nodes is not worth it.  but i definitely think we should expose more info.23:19
corvus(held nodes are generally held because a human needs time on them to fix something complex, so that's an expensive process we don't want to elongate.  but we do, by default, set a time limit on held nodes)23:20
clarkbin the meantime I guess I can delete my held nodes so that the OSH jobs can complete? Or do you think it is helpful to keep things in this state for now?23:29
corvusyeah, i think fundamentally, we don't have the capacity in raxflex-iad for that many held nodes, so that's probably best.23:37
corvuswe could probably fix this situation by deleting the request and the ready nodes; that might get it assigned to a different provider, but no guarantee it won't happen again.23:37
clarkbok I've deleted the autohold corresponding to those two nodes in rax iad23:48
clarkbthat should allow it to boot the two nodes required for the current request23:48
