janders | TheJulia testing BIOS update on R640. It's in service wait while "do not reboot" message is on the console. I figure if I tried to abort now, Ironic would reject the API call, right? I don't see SERVICEWAIT as a valid state for abort. | 00:19 |
janders | (I don't want to actually try cause Feature Freeze is next week, losing a day redoing a lab may cause things to slip) | 00:19 |
janders | scratch that, I tried AFTER it finished flashing but was still in wait. It accepted the request and put node in servicing failed. No reboot though. | 00:21 |
janders | https://paste.openstack.org/show/bZu1H3L9dvp0cz7V05jt/ | 00:22 |
janders | didn't expect this to work. But - if the flow is always like this (no reboot and putting the node to service failed) this shouldn't be dangerous in theory. I wonder what you folks think though. | 00:23 |
janders | we could have some really strong wording in the doco around this but in any case, in enterprise software it does have a feel of flying a jetliner with GPWS off, kind of. | 00:24 |
janders | the alternative I suppose would be to reject abort on the wait state and wait for the wait to time out. As a power user with a decade of DevOps background I prefer the current behaviour. But I would never abort service without eyeballing the console first. I bet there are people out there who would, just because they can. Question is whether we find this an | 00:26 |
janders | acceptable risk or not. | 00:26 |
janders | last relevant piece to the picture is this: to get out of this state I had to do two aborts: first interrupted servicing and put me in servicing failed (no reboot and no maintenance set). Second abort got me out of service failed to active WITH a reboot. So if someone did that (two aborts back to back) trying to get from service wait to active, | 00:32 |
janders | they could theoretically cause damage (if BMC doesn't have any protection against this). https://paste.openstack.org/show/b4EfuIcrYFkm2oHsmcLC/ | 00:32 |
TheJulia | wait being aborted to fail makes sense I guess | 01:46 |
TheJulia | oh, abort abort | 01:46 |
TheJulia | man... no idea | 01:47 |
TheJulia | the risk of someone aborting a node in servicewait while a step runs has always been there | 01:47 |
TheJulia | bmcs *should* have guards, but also some vendors mandate a/b style firmware | 01:48 |
TheJulia | so if A is bad, system reboots back to b and b gets promoted to a | 01:48 |
janders | Makes sense, thank you TheJulia | 02:06 |
janders | My take: if we're not adding any new risk, let's proceed. We're fixing something useful. | 02:07 |
janders | we can think of some extra guards separately from this if we decide we care. Strong phrasing in doco ("make sure to check console for running firmware upgrades prior to attempting abort") may be good enough | 02:07 |
janders | JayF iurygregory dtantsur WDYT? | 02:07 |
TheJulia | I think, in an ideal case, the overall process of applying the update can know when the world is safe to be able to proceed to the next step safely | 02:10 |
TheJulia | that lingering... != great | 02:10 |
iurygregory | I would be ok with it | 02:18 |
iurygregory | do we store the model of the server in properties? like I know we detect the vendor and we update the node properties with the information... | 02:34 |
iurygregory | humm seems like it would be in properties under capabilities 'server_model' if the information is available | 02:41 |
opendevreview | Kaifeng Wang proposed openstack/ironic master: Trivial: minor typo fix around network boot document https://review.opendev.org/c/openstack/ironic/+/957085 | 06:06 |
rpittau | good morning ironic! o/ | 06:45 |
queensly[m] | Good morning! | 07:13 |
dtantsur | JayF: the script-y WSGI approach is being removed from pbr, that's why | 10:42 |
dtantsur | JayF: on top of that, it seems that the world is moving towards running something like gunicorn as a WSGI server and then proxying it via apache if needed | 10:43 |
opendevreview | Verification of a change to openstack/ironic master failed: Add periodic cleanup of dead conductor entries https://review.opendev.org/c/openstack/ironic/+/956500 | 10:44 |
dtantsur | janders: I haven't been following the discussion, sorry. Could you summarize? | 10:52 |
janders | dtantsur sure. There are a couple different threads here but I think this one is the most important for now. Adding "abort" to servicing means that theoretically the operator can abort service wait state. | 10:56 |
janders | based on my testing, at least on my lab machine, once SimpleUpdate call completes and the machine reboots to apply the update, the node lands in service wait | 10:56 |
janders | service wait, being wait state, can be aborted | 10:56 |
janders | running abort in service wait results in service failed (no reboot) | 10:56 |
dtantsur | I think steps themselves have an "abortable" flag, no? | 10:56 |
janders | then, running abort again against service failed results in active AND reboot. Which can happen when there is a big "updating BIOS, do not power off" message in the console | 10:57 |
janders | I was wondering if documenting this along the lines "if aborting a service step, operator is strongly advised to check the console and make sure there is no in-progress updates before issuing the command" is sufficient safeguard here | 10:58 |
janders | or should we make abort not apply to service wait (or look at steps themselves as you just suggested) | 10:59 |
dtantsur | Hmm, we get a bit of a mix-up here. Traditionally, abort in Ironic applies to *wait states. | 10:59 |
janders | correct, and that is the case here as well | 11:00 |
dtantsur | And indeed, at least for cleaning, we don't allow abort for steps that are not marked as abortable | 11:00 |
dtantsur | (to be precise, we abort the process after the step is over) | 11:00 |
janders | makes sense | 11:00 |
dtantsur | Allowing abort on service fail, I now realize, would set an entirely new precedent in Ironic | 11:00 |
janders | perhaps the way forward here is to make redfish firmware update step non-abortable? | 11:00 |
dtantsur | yeah | 11:00 |
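The non-abortable-step idea being discussed can be sketched in plain Python. This is illustrative only, not Ironic's actual `service_step` decorator; the class and function names here are made up:

```python
# Hypothetical sketch of the "abortable" flag on steps: an abort request
# is only honored once the running step finishes, unless the step has
# explicitly opted in to being interrupted mid-run.
def service_step(abortable=False):
    def wrap(fn):
        fn._is_service_step = True
        fn._abortable = abortable
        return fn
    return wrap

class RedfishFirmwareInterface:
    @service_step(abortable=False)  # a firmware flash must never be cut short
    def simple_update(self, task):
        """Apply a firmware image via the BMC (body elided)."""

def may_abort_now(step_fn):
    # The conductor would consult this before tearing a running step down.
    return getattr(step_fn, '_abortable', False)

abort_allowed = may_abort_now(RedfishFirmwareInterface.simple_update)
```

With this shape, an abort issued during `service wait` would be deferred until the firmware step completes, matching how cleaning already treats non-abortable steps.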
dtantsur | Thinking more about it, the Ironic way out of service wait would probably be reusing the "active" verb | 11:01 |
dtantsur | Similarly to how clean failed is left by "manage" IIRC | 11:01 |
dtantsur | TheJulia: ^^^ | 11:01 |
janders | I don't disagree.. though allowing abort on "service failed" seems way slicker than previous approach of passing wait/hold step | 11:01 |
dtantsur | allowing "active" is exactly the same UX :) | 11:02 |
janders | so are you proposing "service failed" > manage > "managable" > provide or sth like this? | 11:02 |
dtantsur | another precedent is unrescue, which is closer to servicing in spirit | 11:02 |
dtantsur | wait, where is manageable coming from? | 11:02 |
janders | sorry I am a bit rusty with the manage verb | 11:03 |
dtantsur | Note: all we're talking about is which verb to use, not which path to take | 11:03 |
janders | if we "manage" a "clean failed" node what state does it go to? | 11:03 |
dtantsur | to manageable | 11:03 |
janders | (/me is looking at state transition diagram) | 11:03 |
dtantsur | All I'm pointing out is that "abort" feels weird for a state where no process is happening. It's a plausible interpretation, but it's not done this way anywhere in Ironic | 11:05 |
janders | makes sense | 11:05 |
janders | what would be an alternative? thinking how to translate clean-failed recovery to servicing | 11:05 |
dtantsur | We either do un<something> or reuse whatever verb normally leads to the expected state | 11:05 |
dtantsur | The former would be unservice, the latter would be active | 11:05 |
janders | so "service failed" > unservice > "active" potentially? | 11:05 |
dtantsur | yeah, for example | 11:06 |
janders | fair, let's see what Julia thinks | 11:06 |
dtantsur | or "service failed" > active > "active" if we want to reuse an already existing verb | 11:06 |
janders | this would mean consistency++ | 11:06 |
janders | so in terms of action items, we need to consider this PLUS making firmware update step non-abortable? | 11:07 |
dtantsur | yeah, I think so | 11:07 |
janders | I tend to agree | 11:07 |
janders | I will look into the latter (once I am finished with a customer bug) | 11:07 |
janders | I have no major issue with the current proposed way (it is a significant improvement already) but while we're at it why not consider going all the way | 11:08 |
janders | consistency is good | 11:08 |
janders | coding wise, unservice would almost be an alias for abort right? | 11:09 |
janders | (since service-abort codepaths already exist and that's pretty much what we need to do IIUC) | 11:10 |
janders | can we either abort rescue or unrescue and if so is there any significant difference between the flow of the two? | 11:11 |
janders | ( janders pulls out states diagram again) | 11:11 |
dtantsur | I don't think the hypothetical unservice and abort are very similar. The latter is "stop doing thing, do not clean-up". The former - "nothing is happening, but do clean-up". | 11:12 |
janders | then it feels like current abort-service is actually unservice | 11:13 |
janders | (as it flips things back to booting into instance) | 11:13 |
janders | but reality is likely a bit more vague | 11:14 |
janders | regarding my previous comment looks like there is no rescue-abort only unrescue | 11:14 |
JayF | dtantsur: so that change should probably undocument apache wsgi | 12:45 |
TheJulia | good morning | 13:00 |
TheJulia | dtantsur: I thought about taking the active verb, but that is also a very different code path and it seemed better to just back out of the operation than call the provisioning code path | 13:05 |
TheJulia | that also means a user needs to do something like baremetal node deploy again | 13:06 |
TheJulia | my grenade job fix finally works | 13:14 |
dtantsur | TheJulia: to be clear, I meant reusing the "active" verb, not the entire machinery behind it. Just like "abort" does different things depending on the initial state. | 13:15 |
dtantsur | good morning! | 13:15 |
TheJulia | and to be clear, I didn't want to go down the path of special casing it along that provisioning code path | 13:16 |
opendevreview | Nahian Pathan proposed openstack/ironic master: Reduce API calls when collecting sensor data with redfish https://review.opendev.org/c/openstack/ironic/+/955484 | 13:16 |
TheJulia | since there is way more there, abort felt more like an escape hatch | 13:16 |
dtantsur | I imagine, but now we assign an entire new meaning to the "abort" verb | 13:16 |
dtantsur | in which case we probably need to make sure it works on other *failed states? | 13:17 |
TheJulia | I don't really think so, but I can kind of see the point. Truthfully it feels like the discussion has shifted as well | 13:17 |
dtantsur | I'm coming from the perspective of "I don't understand which command I'm supposed to use" being a common complaint | 13:18 |
dtantsur | and I can see your logic that "abort" is the most intuitive thing. But it will be a disaster if it only works on some states. | 13:18 |
TheJulia | That is totally fair | 13:18 |
dtantsur | (also, our state machine is a mess..) | 13:19 |
TheJulia | unfortunately, active also feels less intuitive | 13:19 |
TheJulia | (we have strong guardrails) | 13:19 |
dtantsur | I agree. I only suggested it because it seems consistent with how "manage" returns you to "manageable". | 13:19 |
TheJulia | Yeah | 13:19 |
TheJulia | its just less intuitive and makes me worry people will have the same expectation | 13:20 |
TheJulia | "hey, I'm in clean fail, I want to be active, why can't I be active right now" | 13:20 |
dtantsur | lol, I see your point :D | 13:21 |
TheJulia | (its the same point!) :) | 13:21 |
TheJulia | I guess the thing we need to be careful about is setting expectations of "we're going to deploy it" again if you broke it in service | 13:26 |
TheJulia | then again, we have rebuild | 13:26 |
dtantsur | by the way, I don't think rebuild works out of service failed or any servicing parts? | 13:26 |
TheJulia | it does not, you'd need to get back to active | 13:26 |
TheJulia | really we should permit service fail -> rebuild verb | 13:27 |
TheJulia | separately | 13:27 |
* dtantsur nods | 13:27 | |
* TheJulia wheels in the "drink more coffee" coffee machine with a sticker which says "review eventlet removal patches!" | 13:28 | |
dtantsur | are we ready with them now? | 13:29 |
TheJulia | I need to sit down and spin up the novncproxy stuffs just to double check no_fork, which was my plan for this afternoon, but the other changes are just waiting for reviews at this point | 13:30 |
dtantsur | k gimme a few minutes | 13:30 |
TheJulia | no rush, just don't want to push towards the very end and be in the endless battle with CI to get stuff to merge | 13:30 |
TheJulia | We've all put a ton of effort in on this | 13:31 |
TheJulia | Also: https://etherpad.opendev.org/p/ironic-eventlet-removal line 56 for any additional testing per the discussion yesterday | 13:34 |
JayF | You can put my vote in the column of unservice | 14:01 |
JayF | Maybe this is a stupid question, but would it be annoying to have active be the state to escape a service fail from an RBAC standpoint? I definitely envision use cases for service where people would have access to perform service but not to do a deployment | 14:01 |
dtantsur | TheJulia: a bunch of comments on https://review.opendev.org/c/openstack/ironic/+/952939, one of them important-ish, but nothing truly blocking | 14:04 |
dtantsur | JayF: good point re "active" and RBAC | 14:04 |
TheJulia | dtantsur: w/r/t queuing, you do understand it's the rapid lock access which piles up (per the comment), right? | 14:06 |
TheJulia | the only way to prevent the case is for all work enqueuing to be held under the lock in futurist | 14:06 |
TheJulia | but that will also slow down thread launches | 14:06 |
dtantsur | TheJulia: I'm not sure I'm following. I suspect our average thread's lifetime greatly exceeds any potential lock contention time | 14:07 |
TheJulia | granted, I've functionally changed it more so to "will this exceed our overall limits" | 14:07 |
TheJulia | yeah, but the lock is around adding the entry to the list | 14:07 |
dtantsur | also, the lock is always there, whether with queue or not | 14:08 |
TheJulia | and so when launching 8 threads in rapid succession, something's going to hit the lock and see entries being launched | 14:08 |
dtantsur | could you take a step back and specify which concern you're addressing? | 14:09 |
dtantsur | the lock in submit(), if you mean it, applies always | 14:09 |
TheJulia | I'm having to look for it at this point, it was clear as day when I was bashing my head against it weeks ago | 14:12 |
TheJulia | I think it was https://github.com/openstack/futurist/blob/master/futurist/_futures.py#L173C9-L173C29 where it holds the lock to get the number to make the decision, but that requires the lock to make that decision; meanwhile submission of a new thread is also under a lock, so we get into a state where we haven't really gotten it started yet but we're releasing the lock on one side, and the other side grabs the lock before the | 14:16 |
TheJulia | "queue" is empty and it sees a value | 14:16 |
dtantsur | TheJulia: note that get_num_idle_workers is called from submit where the lock is already held, e.g. it's re-entrant | 14:17 |
dtantsur | https://github.com/openstack/futurist/blob/master/futurist/_futures.py#L219 | 14:17 |
dtantsur | so what you say only applies to logic that is executed outside of submit(). not to rejection or the growth logic. | 14:18 |
TheJulia | at this point, I've lost the context | 14:19 |
dtantsur | I don't think I've even gained it in the first place :D | 14:20 |
TheJulia | just I know at this point we see it stacking up when it is running and the default rejector basically knee-caps the conductor's operation | 14:20 |
TheJulia | my troubleshooting then yielded that it was intermittent... often, mid-sequence of starting workers, it would see one in the queue and reject the additional work | 14:20 |
TheJulia | so it would get 2-3 workers created and then see an entry in the queue from another launch | 14:21 |
TheJulia | and freak out | 14:21 |
TheJulia | and by freak out, halt the creation of new workers on power sync and suddenly it takes 20 minutes instead of 3 | 14:21 |
dtantsur | Right, because their rejection does not take into account that work does not disappear from the queue immediately | 14:22 |
TheJulia | yes, exactly | 14:22 |
dtantsur | it has little to do with locking and a lot to do with the fact that things are not immediate | 14:22 |
TheJulia | yup | 14:22 |
dtantsur | it was the case in eventlet too, by the way. I think that's why we have this "1" constant there | 14:22 |
dtantsur | But. I don't think I objected to having our own rejector, did it? hence my confusion about the context of this conversation | 14:23 |
TheJulia | What you expressed was more "oh no, we're queuing, I don't want queuing" | 14:23 |
dtantsur | Definitely not. In fact, I started musing if we need to *allow* queueing. Which, I guess you do, now that I think more about it? | 14:24 |
TheJulia | oh, okay, I misunderstood you | 14:24 |
TheJulia | I kind of do, really | 14:24 |
TheJulia | but, yeah | 14:24 |
dtantsur | Your rejector basically rejects work if the queue size + number of active threads exceeds the maximum number of workers, right? | 14:24 |
TheJulia | yeah | 14:25 |
dtantsur | Right, so you enable queueing, and this aligns with what I've been pondering :) | 14:25 |
dtantsur | I also think you've simply invented a better rejector than the one in futurist | 14:25 |
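The rejection rule just described can be sketched in plain Python. This is a simplified illustration, not the actual futurist or Ironic code; the function and its parameters are made up for clarity:

```python
class RejectedSubmission(Exception):
    """Raised instead of silently dropping work (mirrors futurist's idea)."""

def check_and_reject(queued, active, max_workers):
    # Reject only when queued-but-not-yet-started items plus live workers
    # would exceed the pool's overall limit, so a burst of submissions
    # mid-launch (work still sitting briefly in the queue) is not refused.
    if queued + active > max_workers:
        raise RejectedSubmission(
            f"{queued} queued + {active} active > {max_workers} max")

# 3 workers already running, 1 item not yet picked up: with max_workers=8
# this is accepted, where a naive "queue must be empty" rejector would balk.
check_and_reject(queued=1, active=3, max_workers=8)

try:
    check_and_reject(queued=5, active=4, max_workers=8)
    rejected = False
except RejectedSubmission:
    rejected = True
```

The key difference from a default "reject when the queue is non-empty" policy is that transient queue depth during thread startup no longer kneecaps a burst of work like a power sync.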
TheJulia | \o/ | 14:26 |
TheJulia | okay, it was clear I was kind of misunderstanding you as well | 14:26 |
dtantsur | My only concern with queuing is the existence of synchronous API that uses spawn_worker. But that's just life, we cannot change it overnight. | 14:26 |
dtantsur | The only important-ish comment there is about the possibility of someone setting very low workers number in their configuration. | 14:26 |
TheJulia | yeah, that... is an awful api endpoint | 14:26 |
TheJulia | ahhhhhh okay | 14:27 |
TheJulia | We've generally seen the opposite | 14:27 |
TheJulia | ... support case with 1000+ workers | 14:27 |
dtantsur | oh, wow, nice | 14:27 |
TheJulia | yeah, they tuned all the timeouts to like 300+ seconds | 14:27 |
dtantsur | still, I'd rather have some sort of a guardrail (in a follow-up if needed) | 14:27 |
dtantsur | heh | 14:27 |
TheJulia | and turned on as much retry logic as possible to get zero failures | 14:28 |
TheJulia | "no, really, trust us, let things fail, it will be fine" | 14:28 |
JayF | some people need to take a physics class and remember we're in the real world lol | 14:31 |
JayF | 'try to shove the electrons harder through the [for example] failed network cable' | 14:31 |
JayF | OK; I have a redfish issue I was trying to see if we had already put in a workaround for and I couldn't find it. | 14:32 |
JayF | I have a machine on which Ironic redfish-virtual-media mounts the iso as USB2, but then we set the next boot device to cdrom instead of usb ... and I'm not even sure we have support for setting next boot device to usb? | 14:33 |
TheJulia | uhhh... I thought we had code to deal with that edge case on that hardware | 14:34 |
TheJulia | sorry, brain semi-scrambled because gearing up for meetings and discussion :) | 14:34 |
JayF | hmm I'll look for it, this isn't on master so may just be "find the patch to backport" | 14:35 |
TheJulia | oh | 14:35 |
TheJulia | uhhh. | 14:36 |
TheJulia | https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L298 is the path I'm thinking | 14:38 |
TheJulia | https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L790 | 14:38 |
TheJulia | versus https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L903 you'll need to figure out a good way to detect it, my thought was dvd vs cd connotation in bmc response data | 14:39 |
JayF | you're linking two separate things afaict? | 14:40 |
JayF | one is talking about the iso, one is talking about configdrive ? | 14:40 |
TheJulia | yes | 14:40 |
TheJulia | but if you look at the very first one, the confusion I was sourcing was dvd vs cd in the api, not usb | 14:40 |
JayF | yeah | 14:40 |
TheJulia | I get why they likely chose it, its actually a better analog | 14:41 |
TheJulia | because the aspeed BMCs do the same thing, they present a volume $thing, and label it whatever they feel like... all over USB | 14:42 |
TheJulia | (and yes, I've seen it present a virtual cd, and a thumb drive, in response to a cd/dvd attach request in the past) | 14:44 |
JayF | so there's probably a hacky-fix to make it use usb to get them going | 14:45 |
JayF | and the "real fix" is to maybe ask the redfish what is inserted | 14:45 |
JayF | and use that info to set boot device? | 14:45 |
TheJulia | possibly | 14:46 |
JayF | ack; I suspect we'll hack a fix in downstream and I might address this as a longer-term bug since it'd be crunch to get it in sushy pre-release | 14:51 |
TheJulia | ack | 14:54 |
TheJulia | to be fair, its likely better to try and get a release out at some point sooner than waiting | 14:55 |
TheJulia | but really the difference is more in waiting for release process to complete | 14:55 |
JayF | When I originally proposed to you the idea of doing some dedicated QA | 14:55 |
JayF | I assumed it might come with an earlier release, too | 14:56 |
JayF | so we could +x.1 or +x.x.1 it if we needed before integrated release | 14:56 |
TheJulia | I don't think we need to *cut* a QA release, it's not like folks are going to jump on it | 14:56 |
TheJulia | and the release team historically pushes back on last minute releases for press release purposes | 14:56 |
JayF | I guess I personally like the idea of not testing a moving target; but that's more academic than a real risk | 14:56 |
TheJulia | There is also reality, as we move into the next few weeks, CI is going to get worse and worse | 14:57 |
dtantsur | JayF: asking sushy what's inserted may not work: we only consider devices that expose CD or DVD | 15:11 |
dtantsur | then you have a media_types list which may also contain USB, I guess.. but I don't think we have a way to know that the device is USB internally | 15:12 |
TheJulia | I think the best thing to do is to check media_types and lean towards USB | 15:23 |
TheJulia | ... Has anyone purchased a CD or USB recently?! | 15:23 |
TheJulia | err | 15:23 |
TheJulia | CD or DVD | 15:23 |
opendevreview | Jakub Jelinek proposed openstack/ironic-python-agent master: WIP: Fix skip block devices for RAID arrays https://review.opendev.org/c/openstack/ironic-python-agent/+/937342 | 15:25 |
jssfr | TheJulia, some time after 2020, but I don't know the exact year... probably 2022 or so. and then even four of them at once! | 15:50 |
TheJulia | I don't think we've directly purchased any since ?2019 or 2020? | 15:54 |
TheJulia | Granted, some have appeared. My wife has sponsored some indie movies and been in a couple as well, so we've gotten the DVDs for those. | 15:54 |
jssfr | (those were CDs though) | 15:56 |
JayF | dtantsur: we have an internal device labelled USB | 15:56 |
JayF | dtantsur: so likely adding support for such a device would be part of it | 15:56 |
jssfr | (we tend to lend DVDs from the library rather than buying them, but I like to maintain my music collection. and since I'm allergic to streaming services, I either buy CDs or from bandcamp or wherever else I get losslessly compressed audio) | 15:57 |
JayF | I took an action item to document this in a bug | 15:57 |
JayF | and will look before PTG to see how painful it would be to eliminate this class of issue/workaround | 15:57 |
dtantsur | TheJulia: media_types is not the device bus though. Most BMCs nowadays use CD or DVD regardless of how the virtual device is connected | 16:02 |
dtantsur | (except for the SMC's UsbCd...) | 16:02 |
TheJulia | definitely, but one could suggest another is what I'm thinking | 16:03 |
JayF | What I've heard reported is more that we use the "normal Ironic path" for inserting it, but it only boots to the correct device when you tell the BMC to boot from USB. | 16:04 |
dtantsur | Maybe. I don't know if we can assume that if media_types contains Usb, the boot device should also be Usb. Seems like a bit of a stretch. | 16:04 |
JayF | They have it working by racing the boot with a call to the BMC to boot from usb. | 16:04 |
dtantsur | We could have something like driver_info['redfish_virtual_media_device'] for crazy hardware.. | 16:05 |
TheJulia | dtantsur: I agree completely, I sort of suspect we might have to special case the decision | 16:05 |
JayF | dtantsur: that was my original thought, but when I realized it was inserting one place and showing another, I was curious if there was a "kill this class of bug" solution rather than whackamole | 16:05 |
dtantsur | Right. I don't know unfortunately. | 16:06 |
dtantsur | And as always with quirky hardware, we risk breaking one model by fixing another. | 16:06 |
JayF | yeah, me neither, but I don't wanna not-try, especially given there's a fallback position | 16:06 |
TheJulia | dtantsur: yup | 16:06 |
JayF | but I suspect over time we'll move more and more towards being able to tell our driver what goes where | 16:07 |
JayF | especially as we get more setups where one physical server == N redfish systems | 16:07 |
JayF | (I have one where N=3) | 16:07 |
dtantsur | Well, we finally got VirtualMedia linked through System instead of Manager :) | 16:07 |
JayF | and you have to walk all the systems if you wanna see all the hardware/attrs | 16:07 |
dtantsur | but it's still a PITA with firmware updates that all go through UpdateService | 16:07 |
TheJulia | did dmtf ever agree to have a "I want to identify the manager from the system" issue? | 16:09 |
dtantsur | A system links to its Manager, if that's your question. It's not a solution for both virtual media and update service though since the same Manager can be used by other Systems (and Manager is not linked to UpdateService) | 16:10 |
TheJulia | yeah, true | 16:11 |
opendevreview | Verification of a change to openstack/ironic master failed: Add periodic cleanup of dead conductor entries https://review.opendev.org/c/openstack/ironic/+/956500 | 16:11 |
JayF | cid: late review on ^ but if you could https://review.opendev.org/c/openstack/ironic/+/956500/3#message-de0261bc90108d86852cfc3fa71f7d86352fc2ee in a follow-up, it'd be awesome | 16:20 |
cid | JayF: okay, good catch. I'll push a follow up today | 16:35 |
JayF | make sure if you pick a number >1:1 ratio that we update the release note and config desc to indicate the requirement | 16:35 |
JayF | since right now it just says heartbeat_timeout should be > conductor-timeout-command | 16:36 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for allocations API (requests) https://review.opendev.org/c/openstack/ironic/+/945218 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for allocations API (responses) https://review.opendev.org/c/openstack/ironic/+/928921 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for bios API (versioning) https://review.opendev.org/c/openstack/ironic/+/952147 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for bios API (requests) https://review.opendev.org/c/openstack/ironic/+/952148 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for bios API (responses) https://review.opendev.org/c/openstack/ironic/+/952149 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for inspection rules API (versioning) https://review.opendev.org/c/openstack/ironic/+/954038 | 16:43 |
clif | Is the current release of ironic 31.0? I'm looking at ironic/common/release_mappings.py and the latest I see besides master is '29.0'. Should there be entries in here for '30.0' and '31.0'? | 17:05 |
clif | I checked and I'm rebased on latest master afaict, last commit is a zuul merge from yesterday | 17:06 |
dtantsur | ugh, we should have added them.. | 17:06 |
clif | Hopefully not difficult to fix? | 17:07 |
dtantsur | yeah, but not for the existing releases | 17:07 |
clif | I'm adding fields to Port and Portgroup, was going to bump the object versions | 17:08 |
clif | then a unit test screamed at me for there being gaps in the released object versions :) | 17:08 |
dtantsur | yeah, please add the missing ones (ideally, in a separate patch) | 17:08 |
clif | I'll see what I can do. | 17:17 |
opendevreview | Clif Houck proposed openstack/ironic master: Add RELEASE_MAPPING entry for 30.0 release https://review.opendev.org/c/openstack/ironic/+/957164 | 17:27 |
opendevreview | Clif Houck proposed openstack/ironic master: Add RELEASE_MAPPING entry for 31.0 release https://review.opendev.org/c/openstack/ironic/+/957166 | 17:34 |
clif | Created these by checking out 30.0.0 and 31.0.0 tags and copy-pasting the master entry into their corresponding version entry, lmk if this is good | 17:34 |
JayF | Looks accurate at a glance. I'll give it a thorough review after lunch. | 17:49 |
JayF | Plus we'll have to see what branches this needs backporting to | 17:49 |
JayF | clif: so basically we have to "bisect" the version, and ensure the values are right :( | 18:15 |
JayF | meaning I'd get the SHA 30.0 and 31.0 were cut from | 18:15 |
JayF | and take the HEAD copy advertised for master for those commits | 18:16 |
JayF | gerrit/bugfix/30.0 gerrit/bugfix/31.0 | 18:16 |
JayF | so they are bugfix branches | 18:16 |
JayF | hmm maybe you already did this? or just no object bumps at all since 30.0? Seems unlikely | 18:18 |
clif | I only know of the Port bump, but I haven't dug further to check | 18:18 |
JayF | I compared bugfix/30.0 to the update for 30.0 and it matched | 18:19 |
JayF | checking 31.0 and expecting the same result | 18:19 |
JayF | do you know what that is for? | 18:19 |
clif | the port bump? I think adding the description field | 18:19 |
JayF | no, the release_mappings.py generally | 18:19 |
JayF | essentially you can tell Ironic, via config, the "max version" you want it to ever use in a cluster, allowing for rolling updates. So you could set it to 30.0 and be assured no API/object above the versions in that dict would be used, then once the full cluster is upgraded (say to master or 31.0) you remove the config and let everyone use new objects now that everyone can process them | 18:20 |
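The pinning mechanism JayF describes can be sketched like this. The version numbers and the helper function are purely illustrative, not the real contents of `ironic/common/release_mappings.py`:

```python
# Illustrative shape only: the real RELEASE_MAPPING also carries API and
# RPC versions per release, and the object versions shown are made up.
RELEASE_MAPPING = {
    '30.0': {'objects': {'Port': ['1.11']}},
    '31.0': {'objects': {'Port': ['1.12']}},
}

def serialization_version(pin, obj_name, latest):
    """Version to emit on the wire: a pin caps it during rolling upgrades."""
    if pin is None:
        return latest
    return RELEASE_MAPPING[pin]['objects'][obj_name][0]

# While the operator keeps the pin set in ironic.conf, Port objects are
# sent at the older version so not-yet-upgraded conductors can still
# deserialize them; removing the pin lets the latest version flow.
pinned = serialization_version('30.0', 'Port', latest='1.12')
unpinned = serialization_version(None, 'Port', latest='1.12')
```

This is also why the missing 30.0 and 31.0 entries matter: without a dict entry for a release, an operator cannot pin to it during a rolling upgrade.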
clif | I think I understand | 18:21 |
JayF | +2 on both, that's a good fix | 18:21 |
clif | ty | 18:22 |
clif | is the release version tracked or exposed anywhere else in the codebase? | 18:22 |
clif | I didn't find references to 30.0 or 31.0 elsewhere | 18:22 |
JayF | pbr derives it from git metadata in-repo | 18:22 |
JayF | out-of-repo it's in releases repo | 18:22 |
JayF | let me link ya | 18:23 |
JayF | https://opendev.org/openstack/releases/src/branch/master/deliverables/flamingo/ironic.yaml each deliverable:release_cycle mapping has a yaml file | 18:24 |
JayF | with specific shas and details on the release | 18:24 |
JayF | this is the *source of truth* -- e.g. PR to this repo merged is actually how the release is cut | 18:24 |
JayF | so that says the SHA from the repo for each release, then indicates that bugfix/30.0 (for instance) was cut off the same SHA as 30.0 release version | 18:25 |
JayF | we could later add a 30.1 release but that bugfix/30.0-branch-create-stanza at the bottom doesn't change | 18:25 |
JayF | all the actual automation is in repo there, but really unless you wanna get in the weeds, the yaml is readable | 18:25 |
* TheJulia needs a giant cup-o-brains | 19:49 | |
TheJulia | dtantsur: it's fine to backport the mapping fix, as long as it is a super early action before we cut any other release off the branch | 19:53 |
opendevreview | Julia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor https://review.opendev.org/c/openstack/ironic/+/952939 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Set the backend to threading https://review.opendev.org/c/openstack/ironic/+/953683 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Remove direct mapping from API -> DB https://review.opendev.org/c/openstack/ironic/+/956512 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Optional indirection API use https://review.opendev.org/c/openstack/ironic/+/956504 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Revert "ci: temporary metal3 integration job disable" https://review.opendev.org/c/openstack/ironic/+/956953 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Clean-up misc eventlet references https://review.opendev.org/c/openstack/ironic/+/955632 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Launch vnc proxy with no_fork https://review.opendev.org/c/openstack/ironic/+/957044 | 20:16 |
TheJulia | fyi, it looks like our jobs have been failing due to bad qcow2 images | 21:12 |
TheJulia | but, latest I just pulled seems to be good, so... I've rechecked a job to see | 21:12 |
opendevreview | Merged openstack/ironic master: Trivial: minor typo fix around network boot document https://review.opendev.org/c/openstack/ironic/+/957085 | 21:17 |
TheJulia | Going back to janders and the whole "escape hatch" out of "service fail" state, seems like we're 2 votes "active", 1 for a new unservice verb. Was that it for our options? | 21:18 |
JayF | I think I got the active vote from dmitry flipped with my rbac concern? | 21:21 |
JayF | since we may want to be able to allow someone to `unservice` without permitting `active` | 21:22 |
opendevreview | Julia Kreger proposed openstack/ironic master: Set the backend to threading https://review.opendev.org/c/openstack/ironic/+/953683 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Launch vnc proxy with no_fork https://review.opendev.org/c/openstack/ironic/+/957044 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Remove direct mapping from API -> DB https://review.opendev.org/c/openstack/ironic/+/956512 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Optional indirection API use https://review.opendev.org/c/openstack/ironic/+/956504 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Revert "ci: temporary metal3 integration job disable" https://review.opendev.org/c/openstack/ironic/+/956953 | 21:49 |
TheJulia | so now we're talking about building an rbac model for individual verbs? To be clear, we do have a policy rule like baremetal:node:set_provision_state:service_steps now, but it is around the ability to pass service_steps in general, not to invoke the verb. In general, the set-provision-state endpoint has a single policy applied to it for all logic. At some point we need to get back to the original problem, I guess... | 21:56 |
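The rule TheJulia names can be overridden in an oslo.policy file; the snippet below only illustrates that shape, and the per-verb `unservice` rule is hypothetical (it is the very thing being debated), not something Ironic ships:

```yaml
# policy.yaml override sketch. The first rule name is the real one
# discussed above; it governs passing service_steps, not invoking a verb.
"baremetal:node:set_provision_state:service_steps": "role:admin"
# A hypothetical per-verb rule, if one were ever added:
# "baremetal:node:set_provision_state:unservice": "role:member"
```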
JayF | TheJulia: then my question-comment above resolves to "it doesn't matter" and I'm back to not having a strong opinion | 21:57 |
JayF | ;D | 21:57 |
JayF | :D | 21:57 |
TheJulia | okay | 21:57 |
JayF | hmm actually | 21:57 |
JayF | I do still prefer unservice, because unrescue already set the precedent of un[verb]->active | 21:58 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for allocations API (requests) https://review.opendev.org/c/openstack/ironic/+/945218 | 22:31 |
janders | TheJulia w/r/t service-failed recovery I vote "unservice" | 22:42 |
janders | it feels like if we s/abort/unservice/ the verb in the current implementation, we're not far off work complete; it seems to work pretty well | 22:44 |
janders | I will look into making redfish update step non-abortable to address my other concern (feels like separate change material, likely super-trivial) | 22:45 |
janders | iurygregory any thoughts from you on the choice of "unservice" verb? | 22:46 |
janders | TheJulia I had a look at the code, abortable=False is already set: https://opendev.org/openstack/ironic/src/commit/d7cea2a8bc60e19453e588e602fae527af9b3c68/ironic/drivers/modules/redfish/firmware.py#L181 | 22:51 |
TheJulia | Cool | 22:52 |
janders | would you expect this to cause abort/unservice requests to be rejected while node is in "service wait" or would that only cause them to be rejected in "servicing"? | 22:52 |
janders | I was able to abort it in "service wait", trying to understand if this is expected | 22:52 |
janders | (will try interrupt "servicing" now) | 22:52 |
janders | update: the API won't allow that. So I suspect abortable=False means -ing can't be interrupted; -wait is probably another beast | 22:54 |
JayF | -ing means, typically, the conductor is *actively working* | 22:55 |
janders | agreed | 22:55 |
JayF | and we never (I'm pretty sure?) allow aborting from an -ing | 22:55 |
JayF | but -ing will always resolve to a wait, fail, or the target state | 22:55 |
JayF | even if it takes ${action}_timeout to do so | 22:55 |
janders | would you expect abortable flag in step's decorator to affect whether we can interrupt the wait state after a certain step? | 22:56 |
JayF | If step X is not abortable, and step X is the "running" step, you should not be able to abort from that | 22:56 |
JayF | the canonical example for that behavior is an in-band bios update in progress | 22:57 |
JayF | and ironic lol-rebooting your computer into a brick | 22:57 |
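The abortable semantics JayF describes can be sketched as follows; the decorator and helper here are illustrative stand-ins, not Ironic's real base.clean_step/base.service_step code:

```python
# Sketch (not Ironic's actual code) of how a step's abortable flag can
# gate an abort request. All names here are illustrative stand-ins.

def service_step(priority=0, abortable=False):
    """Minimal stand-in for Ironic's step decorator: like the real one,
    it just records metadata on the decorated function."""
    def decorator(func):
        func._is_service_step = True
        func._step_priority = priority
        func._step_abortable = abortable
        return func
    return decorator

class FirmwareInterface:
    @service_step(priority=0, abortable=False)
    def simple_update(self, task):
        # Kicks off an asynchronous BMC firmware flash; the node then
        # sits in a wait state while the BMC does the actual work.
        pass

def may_abort(step_func):
    """An abort request should be rejected while a non-abortable step
    is the one the node is running or waiting on."""
    return getattr(step_func, "_step_abortable", False)

print(may_abort(FirmwareInterface.simple_update))  # -> False
```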
janders | yeah I am trying to protect against similar case just with OOB/SimpleUpdate update | 22:57 |
janders | so what happens is I run the step, API takes the update, node reboots and starts writing new BIOS image | 22:57 |
janders | and then it sits in "service wait" as this happens | 22:57 |
janders | with current iteration of the service-abort patch, I can issue abort twice (first: from service wait to service failed, then from service failed to active) and that will reboot the node, potentially causing damage | 22:58 |
JayF | AIUI, you should *not* be able to abort from service wait in that case | 22:58 |
janders | we were discussing acceptable risk around that | 22:59 |
JayF | but I am more of an in-band cleaning expert than an OOB cleaning expert | 22:59 |
janders | OK so it seems like we may be hitting a bug? | 22:59 |
JayF | janders: acceptable risk, in context of this, also means "can the user with the access to do this (maybe a lessee member for servicing) be trusted to take this risk" | 22:59 |
JayF | whereas in all other abort scenarios, at least thru an integrated-openstack lens, it would be an admin/manager level person doing the abort | 23:00 |
JayF | which IMO implies we would have to be more careful (or at a minimum, the same amount of careful) | 23:00 |
janders | I see two obvious ways out of this personally: 1) if an abort request against a "service wait" node shouldn't be accepted, we need to find out why it is accepted and fix it. 2) if abort in wait states is always allowed regardless of the step's abortable flag, I would make a very clear documentation addition stating that if abort | 23:01 |
janders | is used in servicing states, the user must monitor the console and ensure there is no indication of an ongoing firmware update. | 23:01 |
JayF | abort in wait states always allowed is (arguably?) a bug or design misfeature IMO; but I obviously have a specific perspective :D | 23:01 |
janders | I got slack with my editing because... slack | 23:01 |
JayF | I hope it's not that :D | 23:02 |
janders | I wonder what TheJulia thinks | 23:02 |
janders | I will have a quick browse through the code | 23:02 |
JayF | It's my EOD, but I'll check back tomorrow if you leave any messages you want me to check just ping me on em | 23:02 |
JayF | have a good one! | 23:02 |
janders | JayF ack+thx | 23:02 |
janders | you too | 23:02 |
TheJulia | There is some validity to aborting in a -ing state, like deploying | 23:04 |
TheJulia | but... obviously that shouldn't work out of the box | 23:04 |
JayF | that makes sense that deploying would be an exception there | 23:05 |
janders | how about -wait states (and specifically if the wait state happens after a step with abortable=False)? | 23:05 |
TheJulia | I don't think we even permit that, but we might | 23:05 |
TheJulia | there is some weirdness there | 23:05 |
iurygregory | janders, I have no problem with it =) | 23:06 |
TheJulia | wait states *should* be abortable | 23:06 |
TheJulia | the issue is when we have steps which launch an asynchronous thing, like software updates on BMCs, and don't reflect whether it is done or not | 23:06 |
janders | OK I think I see what is happening | 23:07 |
janders | code reference coming shortly | 23:07 |
janders | https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/manager.py#L1474 | 23:09 |
janders | I am not sure if we have similar code for servicing, will go looking | 23:09 |
janders | but from ^^ it seems like in cleaning the value of abortable param for the step would determine if it can be aborted in wait state | 23:09 |
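The pattern janders points at can be sketched like this; the state names and step representation below are illustrative, not Ironic's actual conductor/manager.py code. The fix under discussion amounts to making the service path honor the same abortable check that cleaning already applies in its wait state:

```python
# Sketch: before honoring an abort in a wait state, consult the
# abortable flag of the step the node is currently waiting on.
# Names and structures are illustrative stand-ins.

CLEANWAIT = "clean wait"
SERVICEWAIT = "service wait"

def can_abort_wait(node_state, current_step):
    """Allow abort only in a wait state, and only if the step being
    waited on was declared abortable by its driver."""
    if node_state not in (CLEANWAIT, SERVICEWAIT):
        return False
    return bool(current_step.get("abortable", False))

# A non-abortable OOB firmware flash in service wait: abort refused.
print(can_abort_wait(SERVICEWAIT, {"step": "simple_update", "abortable": False}))  # -> False
# An abortable step in clean wait: abort honored.
print(can_abort_wait(CLEANWAIT, {"step": "erase_devices", "abortable": True}))  # -> True
```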
TheJulia | bahahahaha | 23:10 |
TheJulia | so back to my original change but add in a SERVICEWAIT | 23:10 |
TheJulia | I'm way past EOD and need to exercise. but another thing for tomorrow I guess | 23:10 |
janders | thank you TheJulia | 23:12 |
janders | have a good exercise session | 23:12 |
janders | it feels like we're getting there, "unservice" + addressing the service wait abort/unservice should be getting us close | 23:12 |
janders | once we sort out the service wait interruptions, as a bonus we fix one more issue: | 23:13 |
janders | right now, to get out of service wait to active we need to abort twice (service wait > service failed and then service failed > active) | 23:14 |
janders | if there is no abort/unservice from service wait this whole use case is gone | 23:14 |
janders | (and I tend to think it's better this way, safer; if we see things getting stuck in servicing and not timing out we may need to address that separately) | 23:15 |