Tuesday, 2025-08-12

jandersTheJulia testing BIOS update on R640. It's in service wait while a "do not reboot" message is on the console. I figure if I tried to abort now, Ironic would reject the API call, right? I don't see SERVICEWAIT as a valid state for abort.00:19
janders(I don't want to actually try cause Feature Freeze is next week, losing a day redoing a lab may cause things to slip)00:19
jandersscratch that, I tried AFTER it finished flashing but was still in wait. It accepted the request and put node in servicing failed. No reboot though.00:21
jandershttps://paste.openstack.org/show/bZu1H3L9dvp0cz7V05jt/00:22
jandersdidn't expect this to work. But - if the flow is always like this (no reboot and putting the node to service failed) this shouldn't be dangerous in theory. I wonder what you folks think though.00:23
janderswe could have some really strong wording in the doco around this but in any case in enterprise software it does have a feel of flying a jetliner with GPWS off, kind of.00:24
jandersalternative I suppose would be to reject abort on the wait state and wait for the wait to time out. As a power user with a decade of devops background I prefer the current behaviour. But I would never abort service without eyeballing the console first. I bet there are people out there who would, just because they can. Question is whether we find this an00:26
jandersacceptable risk or not.00:26
janderslast relevant piece to the picture is this: to get out of this state I had to do two aborts: first interrupted servicing and put me in servicing failed (no reboot and no maintenance set). Second abort got me out of service failed to active WITH a reboot. So if someone did that (two aborts back to back) trying to get from service wait to active,00:32
jandersthey could theoretically cause damage (if BMC doesn't have any protection against this). https://paste.openstack.org/show/b4EfuIcrYFkm2oHsmcLC/00:32
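A minimal reproduction of the sequence janders describes, sketched with openstacksdk (set_node_provision_state is real SDK API; the cloud and node names are placeholders):

    import openstack

    conn = openstack.connect(cloud='lab')   # placeholder clouds.yaml entry
    node = 'r640-test'                      # placeholder node name

    # First abort, issued in "service wait": accepted, node moves to
    # "service failed", no reboot observed.
    conn.baremetal.set_node_provision_state(node, 'abort')

    # Second abort, issued in "service failed": node returns to "active"
    # WITH a reboot -- dangerous if the BMC is still mid-flash.
    conn.baremetal.set_node_provision_state(node, 'abort')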
TheJuliawait being aborted to fail makes sense I guess01:46
TheJuliaoh, abort abort01:46
TheJuliaman... no idea01:47
TheJuliathe risk of someone aborting a node in servicewait while a step runs has always been there01:47
TheJuliabmcs *should* have guards, but also some vendors mandate a/b style firmware01:48
TheJuliaso if A is bad, system reboots back to b and b gets promoted to a01:48
jandersMakes sense, thank you TheJulia02:06
jandersMy take: if we're not adding any new risk, let's proceed. We're fixing something useful.02:07
janderswe can think of some extra guards separately from this if we decide we care. Strong phrasing in doco ("make sure to check console for running firmware upgrades prior to attempting abort") may be good enough02:07
jandersJayF iurygregory dtantsur WDYT?02:07
TheJuliaI think, in an ideal case, the overall process of applying the update can know when the world is safe to be able to proceed to the next step safely02:10
TheJuliathat lingering... != great02:10
iurygregoryI would be ok with it02:18
iurygregorydo we store the model of the server in properties? like I know we detect the vendor and we update the node properties with the information...02:34
iurygregoryhumm seems like it would be in properties under capabilities 'server_model' if the information is available02:41
opendevreviewKaifeng Wang proposed openstack/ironic master: Trivial: minor typo fix around network boot document  https://review.opendev.org/c/openstack/ironic/+/95708506:06
rpittaugood morning ironic! o/06:45
queensly[m]Good morning!07:13
dtantsurJayF: the script-y WSGI approach is being removed from pbr, that's why10:42
dtantsurJayF: on top of that, it seems that the world is moving towards running something like gunicorn as a WSGI server and then proxying it via apache if needed10:43
opendevreviewVerification of a change to openstack/ironic master failed: Add periodic cleanup of dead conductor entries  https://review.opendev.org/c/openstack/ironic/+/95650010:44
dtantsurjanders: I haven't been following the discussion, sorry. Could you summarize?10:52
jandersdtantsur sure. There are a couple different threads here but I think this one is the most important for now. Adding "abort" to servicing means that theoretically the operator can abort service wait state.10:56
jandersbased on my testing, at least on my lab machine, once SimpleUpdate call completes and the machine reboots to apply the update, the node lands in service wait10:56
jandersservice wait, being wait state, can be aborted 10:56
jandersrunning abort in service wait results in service failed (no reboot)10:56
dtantsurI think steps themselves have an "abortable" flag, no?10:56
jandersthen, running abort again against service failed results in active AND reboot. Which can happen when there is a big "updating BIOS, do not power off" message in the console10:57
jandersI was wondering if documenting this along the lines "if aborting a service step, operator is strongly advised to check the console and make sure there is no in-progress updates before issuing the command" is sufficient safeguard here10:58
jandersor should we make abort not apply to service wait (or look at steps themselves as you just suggested)10:59
dtantsurHmm, we get a bit of a mix-up here. Traditionally, abort in Ironic applies to *wait states.10:59
janderscorrect, and that is the case here as well11:00
dtantsurAnd indeed, at least for cleaning, we don't allow abort for steps that are not marked as abortable11:00
dtantsur(to be precise, we abort the process after the step is over)11:00
jandersmakes sense11:00
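For context, a hedged sketch of how a step is marked non-abortable via the decorators in ironic.drivers.base (the abortable kwarg is real for clean_step; the class, step name, and priority here are illustrative, and abstract methods are omitted):

    from ironic.drivers import base

    class ExampleManagement(base.ManagementInterface):
        # (abstract interface methods omitted for brevity)

        @base.clean_step(priority=0, abortable=False)
        def flash_firmware(self, task):
            # With abortable=False, an abort requested while the node waits
            # on this step is deferred until the step completes, instead of
            # being acted on mid-flight.
            pass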
dtantsurAllowing abort on service fail, I now realize, would set an entirely new precedent in Ironic11:00
jandersperhaps the way forward here is to make redfish firmware update step non-abortable?11:00
dtantsuryeah11:00
dtantsurThinking more about it, the Ironic way out of service wait would probably be reusing the "active" verb11:01
dtantsurSimilarly to how clean failed is left by "manage" IIRC11:01
dtantsurTheJulia: ^^^11:01
jandersI don't disagree... though allowing abort on "service failed" seems way slicker than the previous approach of passing a wait/hold step11:01
dtantsurallowing "active" is exactly the same UX :)11:02
jandersso are you proposing "service failed" > manage > "manageable" > provide or sth like this?11:02
dtantsuranother precedent is unrescue, which is closer to servicing in spirit11:02
dtantsurwait, where is manageable coming from?11:02
janderssorry I am a bit rusty with the manage verb11:03
dtantsurNote: all we're talking about is which verb to use, not which path to take11:03
jandersif we "manage" a "clean failed" note what state does it go to?11:03
dtantsurto manageable11:03
janders(/me is looking at state transition diagram)11:03
dtantsurAll I'm pointing out is that "abort" feels weird for a state where no process is happening. It's a plausible interpretation, but it's not done this way anywhere in Ironic11:05
jandersmakes sense11:05
janderswhat would be an alternative? thinking how to translate clean-failed recovery to servicing11:05
dtantsurWe either do un<something> or reuse whatever verb normally leads to the expected state11:05
dtantsurThe former would be unservice, the latter would be active11:05
jandersso "service failed" > unservice > "active" potentially?11:05
dtantsuryeah, for example11:06
jandersfair, let's see what Julia thinks11:06
dtantsuror "service failed" > active > "active" if we want to reuse an already existing verb11:06
jandersthis would mean consistency++11:06
jandersso in terms of action items, we need to consider this PLUS making firmware update step non-abortable?11:07
dtantsuryeah, I think so11:07
jandersI tend to agree11:07
jandersI will look into the latter (once I am finished with a customer bug)11:07
jandersI have no major issue with the current proposed way (it is a significant improvement already) but while we're at it why not consider going all the way 11:08
jandersconsistency is good11:08
janderscoding wise, unservice would almost be an alias for abort right?11:09
janders(since service-abort codepaths already exist and that's pretty much what we need to do IIUC)11:10
janderscan we either abort rescue or unrescue and if so is there any significant difference between the flow of the two?11:11
janders( janders pulls out states diagram again)11:11
dtantsurI don't think the hypothetical unservice and abort are very similar. The latter is "stop doing thing, do not clean-up". The former - "nothing is happening, but do clean-up".11:12
jandersthen it feels like current abort-service is actually unservice11:13
janders(as it flips things back to booting into instance)11:13
jandersbut reality is likely a bit more vague11:14
jandersregarding my previous comment looks like there is no rescue-abort only unrescue11:14
JayFdtantsur: so that change should probably undocument apache wsgi12:45
TheJuliagood morning13:00
TheJuliadtantsur: I thought about taking the active verb, but that is also a very different code path and it seemed better to just back out of the operation than call the provisioning code path13:05
TheJuliathat also means a user needs to do something like baremetal node deploy again13:06
TheJuliamy grenade job fix finally works13:14
dtantsurTheJulia: to be clear, I meant reusing the "active" verb, not the entire machinery behind it. Just like "abort" does different things depending on the initial state.13:15
dtantsurgood morning!13:15
TheJuliaand to be clear, I didn't want to go down the path of special casing it along that provisioning code path13:16
opendevreviewNahian Pathan proposed openstack/ironic master: Reduce API calls when collecting sensor data with redfish  https://review.opendev.org/c/openstack/ironic/+/95548413:16
TheJuliasince there is way more there, abort felt like more an escape hatch13:16
dtantsurI imagine, but now we assign an entire new meaning to the "abort" verb13:16
dtantsurin which case we probably need to make sure it works on other *failed states?13:17
TheJuliaI don't really think so, but I can kind of see the point. Truthfully it feels like the discussion has shifted as well13:17
dtantsurI"m coming from the perspective of "I don't understand which command I'm supposed to use" being a common complaint13:18
dtantsurand I can see your logic that "abort" is the most intuitive thing. But it will be a disaster if it only works on some states.13:18
TheJuliaThat is totally fair13:18
dtantsur(also, our state machine is a mess..)13:19
TheJuliaunfortunately, active also feels less intuitive13:19
TheJulia(we have strong guardrails)13:19
dtantsurI agree. I only suggested it because it seems consistent with how "manage" returns you to "manageable".13:19
TheJuliaYeah13:19
TheJuliait's just less intuitive and makes me worry people will have the same expectation13:20
TheJulia"hey, I'm in clean fail, I want to be active, why can't I be active right now"13:20
dtantsurlol, I see your point :D13:21
TheJulia(it's the same point!) :)13:21
TheJuliaI guess the thing we need to be careful about is setting expectations of "we're going to deploy it" again if you broke it in service13:26
TheJuliathen again, we have rebuild13:26
dtantsurby the way, I don't think rebuild works out of service failed or any servicing parts?13:26
TheJuliait does not, you'd need to get back to active13:26
TheJuliareally we should permit service fail -> rebuild verb13:27
TheJuliaseparately13:27
* dtantsur nods13:27
* TheJulia wheels in the "drink more coffee" coffee machine with a sticker which says "review eventlet removal patches!"13:28
dtantsurare we ready with them now?13:29
TheJuliaI need to sit down and spin up the novncproxy stuffs just to double check no_fork, which was my plan for this afternoon, but the other changes are just waiting for reviews at this point13:30
dtantsurk gimme a few minutes13:30
TheJuliano rush, just don't want to push towards the very end and be in the endless battle with CI to get stuff to merge13:30
TheJuliaWe've all put a ton of effort in on this13:31
TheJuliaAlso: https://etherpad.opendev.org/p/ironic-eventlet-removal line 56 for any additional testing per the discussion yesterday13:34
JayFYou can put my vote in the column of unservice14:01
JayFMaybe this is a stupid question, but would it be annoying to have active be the state to escape a service fail from an RBAC standpoint? I definitely envision use cases for service where people would have access to perform service but not to do a deployment14:01
dtantsurTheJulia: a bunch of comments on https://review.opendev.org/c/openstack/ironic/+/952939, one of them important-ish, but nothing truly blocking14:04
dtantsurJayF: good point re "active" and RBAC14:04
TheJuliadtantsur: w/r/t queuing, you do understand it's the rapid lock access which piles up, per the comment, right?14:06
TheJuliathe only way to prevent the case is for all work enqueuing to be held under the lock in futurist14:06
TheJuliabut that will also slow down thread launches14:06
dtantsurTheJulia: I'm not sure I'm following. I suspect our average thread's lifetime greatly exceeds any potential lock contention time14:07
TheJuliagranted, I've functionally changed it more so to "will this exceed our overall limits"14:07
TheJuliayeah, but the lock is around adding the entry to the list14:07
dtantsuralso, the lock is always there, whether with queue or not14:08
TheJuliaand so when launching 8 threads in rapid succession, something's going to hit the lock and see entries being launched14:08
dtantsurcould you take a step back and specify which concern you're addressing?14:09
dtantsurthe lock in submit(), if you mean it, applies always14:09
TheJuliaI'm having to look for it at this point, it was clear as day when I was bashing my head against it weeks ago14:12
TheJuliaI think it was https://github.com/openstack/futurist/blob/master/futurist/_futures.py#L173C9-L173C29 where it holds the lock to get the number to make the decision, but that requires the lock to make that decision, meanwhile submission of a new thread is also under a lock so we get into a state where we haven't really gotten it started yet but we're releasing the lock on one side, the other side grabs the lock before the 14:16
TheJulia"queue" is empty and it sees a value14:16
dtantsurTheJulia: note that get_num_idle_workers is called from submit where the lock is already held, e.g. it's re-entrant14:17
dtantsurhttps://github.com/openstack/futurist/blob/master/futurist/_futures.py#L21914:17
dtantsurso what you say only applies to logic that is executed outside of submit(). not to rejection or the growth logic.14:18
TheJuliaat this point, I've lost the context14:19
dtantsurI don't think I've even gained it in the first place :D14:20
TheJuliajust I know at this point we see it stacking up when it is running and the default rejector basically knee-caps the conductor's operation14:20
TheJuliamy troubleshooting then showed that it was intermittent... often, mid-sequence of starting workers, it would see one in the queue and reject the additional work14:20
TheJuliaso it would get 2-3 workers created and then see an entry in the queue from another launch14:21
TheJuliaand freak out14:21
TheJuliaand by freak out, halt the creation of new workers on power sync and suddenly it takes 20 minutes instead of 314:21
dtantsurRight, because their rejection does not take into account that work does not disappear from the queue immediately14:22
TheJuliayes, exactly14:22
dtantsurit has little to do with locking and a lot to do with the fact that things are not immediate14:22
TheJuliayup14:22
dtantsurit was the case in eventlet too, by the way. I think that's why we have this "1" constant there14:22
dtantsurBut. I don't think I objected to having our own rejector, did I? hence my confusion about the context of this conversation14:23
TheJuliaWhat you expressed was more "oh no, we're queuing, I don't want queuing"14:23
dtantsurDefinitely not. In fact, I started musing if we need to *allow* queueing. Which, I guess you do, now that I think more about it?14:24
TheJuliaoh, okay, I misunderstood you14:24
TheJuliaI kind of do, really14:24
TheJuliabut, yeah14:24
dtantsurYour rejector basically rejects work if the queue size + number of active threads exceeds the maximum number of workers, right?14:24
TheJuliayeah14:25
dtantsurRight, so you enable queueing, and this aligns with what I've been pondering :)14:25
dtantsurI also think you've simply invented a better rejector than the one in futurist14:25
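A hedged sketch of such a rejector, using futurist's real check_and_reject hook and RejectedSubmission exception; the threshold logic below approximates what is described above (the in-tree version also folds in the active thread count), not the actual patch:

    import futurist

    MAX_WORKERS = 300  # illustrative; think [conductor]workers_pool_size

    def reject_when_saturated(executor, backlog):
        # futurist's stock reject_when_reached(1) trips as soon as a single
        # item sits in the queue, e.g. while several workers are still being
        # spun up in rapid succession. Instead, allow transient queueing and
        # only reject once queued work exceeds the overall worker budget.
        if backlog > MAX_WORKERS:
            raise futurist.RejectedSubmission(
                'worker pool saturated: %d queued items' % backlog)

    executor = futurist.ThreadPoolExecutor(
        max_workers=MAX_WORKERS, check_and_reject=reject_when_saturated)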
TheJulia\o/14:26
TheJuliaokay, it was clear I was kind of misunderstanding you as well14:26
dtantsurMy only concern with queuing is the existence of synchronous API that uses spawn_worker. But that's just life, we cannot change it overnight.14:26
dtantsurThe only important-ish comment there is about the possibility of someone setting very low workers number in their configuration.14:26
TheJuliayeah, that... is an awful api endpoint14:26
TheJuliaahhhhhh okay14:27
TheJuliaWe've generally seen the opposite14:27
TheJulia... support case with 1000+ workers14:27
dtantsuroh, wow, nice14:27
TheJuliayeah, they tuned all the timeouts to like 300+ seconds14:27
dtantsurstill, I'd rather have some sort of a guardrail (in a follow-up if needed)14:27
dtantsurheh14:27
TheJuliaand turned on as much retry logic as possible to get zero failures14:28
TheJulia"no, really, trust us, let things fail, it will be fine"14:28
JayFsome people need to take a physics class and remember we're in the real world lol14:31
JayF'try to shove the electrons harder through the [for example] failed network cable'14:31
JayFOK; I have a redfish issue I was trying to see if we had already put in a workaround for and I couldn't find it.14:32
JayFI have a machine which Ironic redfish-virtual-media mounts the iso USB2, but then we set the next boot device to cdrom instead of usb ... and I'm not even sure we have support for setting next boot device to usb?14:33
TheJuliauhhh... I thought we had code to deal with that edge case on that hardware14:34
TheJuliasorry, brain semi-scrambled because gearing up for meetings and discussion :)14:34
JayFhmm I'll look for it, this isn't on master so may just be "find the patch to backport"14:35
TheJuliaoh14:35
TheJuliauhhh.14:36
TheJuliahttps://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L298 is the path I'm thinking14:38
TheJuliahttps://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L79014:38
TheJuliaversus https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L903 you'll need to figure out a good way to detect it, my thought was dvd vs cd connotation in bmc response data14:39
JayFyou're linking two separate things afaict?14:40
JayFone is talking about the iso, one is talking about configdrive ?14:40
TheJuliayes14:40
TheJuliabut if you look at the very first one, the confusion I was sourcing was dvd vs cd in the api, not usb14:40
JayFyeah14:40
TheJuliaI get why they likely chose it, it's actually a better analog14:41
TheJuliabecause the aspeed BMCs do the same thing, they present a volume $thing, and label it whatever they feel like... all over USB14:42
TheJulia(and yes, I've seen it present a virtual cd, and a thumb drive in response to a cd/dvd attach request in the past)14:44
JayFso there's probably a hacky-fix to make it use usb to get them going14:45
JayFand the "real fix" is to maybe ask the redfish what is inserted14:45
JayFand use that info to set boot device?14:45
TheJuliapossibly14:46
JayFack; I suspect we'll hack a fix in downstream and I might address this as a longer-term bug since it'd be crunch to get it in sushy pre-release14:51
TheJuliaack14:54
TheJuliato be fair, it's likely better to try and get a release out at some point sooner than waiting14:55
TheJuliabut really the difference is more in waiting for release process to complete14:55
JayFWhen I originally proposed to you the idea of doing some dedicated QA14:55
JayFI assumed it might come with an earlier release, too14:56
JayFso we could +x.1 or +x.x.1 it if we needed before integrated release14:56
TheJuliaI don't think we need to *cut* a QA release, it's not like folks are going to jump on it14:56
TheJuliaand the release team historically pushes back on last minute releases for press release purposes14:56
JayFI guess I personally like the idea of not testing a moving target; but that's more academic than a real risk14:56
TheJuliaThere is also reality, as we move into the next few weeks, CI is going to get worse and worse14:57
dtantsurJayF: asking sushy what's inserted may not work: we only consider devices that expose CD or DVD15:11
dtantsurthen you have a media_types list which may also contain USB, I guess.. but I don't think we have a way to know that the device is USB internally15:12
TheJuliaI think the best thing to do is to check media_types and lean towards USB15:23
TheJulia... Has anyone purchased a CD or USB recently?!15:23
TheJuliaerr15:23
TheJuliaCD or DVD15:23
opendevreviewJakub Jelinek proposed openstack/ironic-python-agent master: WIP: Fix skip block devices for RAID arrays  https://review.opendev.org/c/openstack/ironic-python-agent/+/93734215:25
jssfrTheJulia, some time after 2020, but I don't know the exact year... probably 2022 or so. and then even four of them at once!15:50
TheJuliaI don't think we've directly purchased any since ?2019 or 2020?15:54
TheJuliaGranted, some have appeared. My wife has sponsored some indie movies and been in a couple as well, so we've gotten the DVDs for those.15:54
jssfr(those were CDs though)15:56
JayFdtantsur: we have an internal device labelled USB15:56
JayFdtantsur: so likely adding support for such a device would be part of it15:56
jssfr(we tend to lend DVDs from the library rather than buying them, but I like to maintain my music collection. and since I'm allergic to streaming services, I either buy CDs or from bandcamp or wherever else I get losslessly compressed audio)15:57
JayFI took an action item to document this in a bug15:57
JayFand will look before PTG to see how painful it would be to eliminate this class of issue/workaround15:57
dtantsurTheJulia: media_types is not the device bus though. Most BMCs nowadays use CD or DVD regardless of how the virtual device is connected16:02
dtantsur(except for the SMC's UsbCd...)16:02
TheJuliadefinitely, but one could suggest another is what I'm thinking16:03
JayFWhat I've heard reported is more that we use the "normal Ironic path" for inserting it, but it only boots to the correct device when you tell the BMC to boot from USB.16:04
dtantsurMaybe. I don't know if we can assume that if media_types contains Usb, the boot device should also be Usb. Seems like a bit of a stretch.16:04
JayFThey have it working by racing the boot with a call to the BMC to boot from usb.16:04
dtantsurWe could have something like driver_info['redfish_virtual_media_device'] for crazy hardware..16:05
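A hedged sketch of that override idea; the redfish_virtual_media_device key is hypothetical (proposed just above, not an existing Ironic option), while the sushy boot-source constants are real:

    import sushy

    _BOOT_TARGETS = {
        'cd': sushy.BOOT_SOURCE_TARGET_CD,
        'usb': sushy.BOOT_SOURCE_TARGET_USB,
    }

    def pick_vmedia_boot_target(node):
        # Let quirky hardware declare which boot device actually matches the
        # virtual media it exposes; default to CD, as the redfish boot
        # interface effectively does today.
        device = node.driver_info.get('redfish_virtual_media_device', 'cd')
        return _BOOT_TARGETS[device]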
TheJuliadtantsur: I agree completely, I sort of suspect we might have to special case the decision16:05
JayFdtantsur: that was my original thought, but when I realized it was inserting one place and showing another, I was curious if there was a "kill this class of bug" solution rather than whackamole16:05
dtantsurRight. I don't know unfortunately.16:06
dtantsurAnd as always with quirky hardware, we risk breaking one model by fixing another.16:06
JayFyeah, me neither, but I don't wanna not-try, especially given there's a fallback position16:06
TheJuliadtantsur: yup16:06
JayFbut I suspect over time we'll move more and more towards being able to tell our driver what goes where16:07
JayFespecially as we get more setups where one physical server == N redfish systems16:07
JayF(I have one where N=3)16:07
dtantsurWell, we finally got VirtualMedia linked through System instead of Manager :)16:07
JayFand you have to walk all the systems if you wanna see all the hardware/attrs16:07
dtantsurbut it's still a PITA with firmware updates that all go through UpdateService16:07
TheJuliadid dmtf ever agree to have an "I want to identify the manager from the system" issue?16:09
dtantsurA system links to its Manager, if that's your question. It's not a solution for both virtual media and update service though since the same Manager can be used by other Systems (and Manager is not linked to UpdateService)16:10
TheJuliayeah, true16:11
opendevreviewVerification of a change to openstack/ironic master failed: Add periodic cleanup of dead conductor entries  https://review.opendev.org/c/openstack/ironic/+/95650016:11
JayFcid: late review on ^ but if you could https://review.opendev.org/c/openstack/ironic/+/956500/3#message-de0261bc90108d86852cfc3fa71f7d86352fc2ee in a follow-up, it'd be awesome16:20
cidJayF: okay, good catch. I'll push a follow up today 16:35
JayFmake sure if you pick a number >1:1 ratio that we update the release note and config desc to indicate the requirement16:35
JayFsince right now it just says heartbeat_timeout should be > conductor-timeout-command16:36
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for allocations API (requests)  https://review.opendev.org/c/openstack/ironic/+/94521816:39
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for allocations API (responses)  https://review.opendev.org/c/openstack/ironic/+/92892116:39
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for bios API (versioning)  https://review.opendev.org/c/openstack/ironic/+/95214716:39
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for bios API (requests)  https://review.opendev.org/c/openstack/ironic/+/95214816:39
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for bios API (responses)  https://review.opendev.org/c/openstack/ironic/+/95214916:39
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for inspection rules API (versioning)  https://review.opendev.org/c/openstack/ironic/+/95403816:43
clifIs the current release of ironic 31.0? I'm looking at ironic/common/release_mappings.py and the latest I see besides master is '29.0'. Should there be entries in here for '30.0' and '31.0'?17:05
clifI checked and I'm rebased on latest master afaict, last commit is a zuul merge from yesterday17:06
dtantsurugh, we should have added them..17:06
clifHopefully not difficult to fix?17:07
dtantsuryeah, but not for the existing releases17:07
clifI'm adding fields to Port and Portgroup, was going to bump the object versions17:08
clifthen a unit test screamed at me for there being gaps in the released object versions :)17:08
dtantsuryeah, please add the missing ones (ideally, in a separate patch)17:08
clifI'll see what I can do.17:17
opendevreviewClif Houck proposed openstack/ironic master: Add RELEASE_MAPPING entry for 30.0 release  https://review.opendev.org/c/openstack/ironic/+/95716417:27
opendevreviewClif Houck proposed openstack/ironic master: Add RELEASE_MAPPING entry for 31.0 release  https://review.opendev.org/c/openstack/ironic/+/95716617:34
clifCreated these by checking out 30.0.0 and 31.0.0 tags and copy-pasting the master entry into their corresponding version entry, lmk if this is good17:34
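For anyone following along, the rough shape of what those patches add to ironic/common/release_mappings.py (the version numbers below are illustrative, not the real 30.0 values):

    RELEASE_MAPPING['30.0'] = {
        'api': '1.96',    # maximum REST API microversion for the release
        'rpc': '1.61',    # conductor RPC API version
        'objects': {
            # maximum supported version(s) of each versioned object, e.g.:
            'Node': ['1.41'],
            'Port': ['1.12'],
            'Portgroup': ['1.5'],
        },
    }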
JayFLooks accurate at a glance. I'll give it a thorough review after lunch.17:49
JayFPlus we'll have to see what branches this needs backporting to17:49
JayFclif: so basically we have to "bisect" the version, and ensure the values are right :( 18:15
JayFmeaning I'd get the SHA 30.0 and 31.0 were cut from18:15
JayFand take the HEAD copy advertised for master for those commits18:16
JayFgerrit/bugfix/30.0 gerrit/bugfix/31.018:16
JayFso they are bugfix branches18:16
JayFhmm maybe you already did this? or just no object bumps at all since 30.0? Seems unlikely18:18
clifI only know of the Port bump, but I haven't dug further to check18:18
JayFI compared bugfix/30.0 to the update for 30.0 and it matched18:19
JayFchecking 31.0 and expecting the same result18:19
JayFdo you know what that is for?18:19
clifthe port bump? I think adding the description field18:19
JayFno, the release_mappings.py generally18:19
JayFessentially you can tell Ironic, via config, the "max version" you want it to ever use in a cluster, allowing for rolling updates. So you could set it to 30.0 and be assured no API/object above the versions in that dict would be used, then once the full cluster is upgraded (say to master or 31.0) you remove the config and let everyone use new objects now that everyone can process them18:20
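Concretely, that knob is the pin_release_version option in ironic.conf (a real option; the value shown simply matches this discussion):

    [DEFAULT]
    # Cap RPC and object versions at the 30.0 release until every service in
    # the cluster has been upgraded; remove the pin afterwards.
    pin_release_version = 30.0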
clifI think I understand18:21
JayF+2 on both, that's a good fix18:21
clifty18:22
clifis the release version tracked or exposed anywhere else in the codebase?18:22
clifI didn't find references to 30.0 or 31.0 elsewhere18:22
JayFpbr derives it from git metadata in-repo18:22
JayFout-of-repo it's in releases repo18:22
JayFlet me link ya18:23
JayFhttps://opendev.org/openstack/releases/src/branch/master/deliverables/flamingo/ironic.yaml each deliverable:release_cycle mapping has a yaml file18:24
JayFwith specific shas and details on the release18:24
JayFthis is the *source of truth* -- e.g. PR to this repo merged is actually how the release is cut18:24
JayFso that says the SHA from the repo for each release, then indicates that bugfix/30.0 (for instance) was cut off the same SHA as 30.0 release version18:25
JayFwe could later add a 30.1 release but that bugfix/30.0-branch-create-stanza at the bottom doesn't change18:25
JayFall the actual automation is in repo there, but really unless you wanna get in the weeds, the yaml is readable18:25
* TheJulia needs a giant cup-o-brains19:49
TheJuliadtantsur: it's fine to backport the mapping fix, as long as it is a super early action before we cut any other release off the branch19:53
opendevreviewJulia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor  https://review.opendev.org/c/openstack/ironic/+/95293920:10
opendevreviewJulia Kreger proposed openstack/ironic master: Set the backend to threading  https://review.opendev.org/c/openstack/ironic/+/95368320:10
opendevreviewJulia Kreger proposed openstack/ironic master: Remove direct mapping from API -> DB  https://review.opendev.org/c/openstack/ironic/+/95651220:10
opendevreviewJulia Kreger proposed openstack/ironic master: Optional indirection API use  https://review.opendev.org/c/openstack/ironic/+/95650420:10
opendevreviewJulia Kreger proposed openstack/ironic master: Revert "ci: temporary metal3 integration job disable"  https://review.opendev.org/c/openstack/ironic/+/95695320:10
opendevreviewJulia Kreger proposed openstack/ironic master: Clean-up misc eventlet references  https://review.opendev.org/c/openstack/ironic/+/95563220:10
opendevreviewJulia Kreger proposed openstack/ironic master: Launch vnc proxy with no_fork  https://review.opendev.org/c/openstack/ironic/+/95704420:16
TheJuliafyi, it looks like our jobs have been failing due to bad qcow2 images21:12
TheJuliabut, latest I just pulled seems to be good, so... I've rechecked a job to see21:12
opendevreviewMerged openstack/ironic master: Trivial: minor typo fix around network boot document  https://review.opendev.org/c/openstack/ironic/+/95708521:17
TheJuliaGoing back to janders and the whole "escape hatch" out of "service fail" state, seems like we're at 2 votes for "active", 1 for a new unservice verb. Was that it for our options?21:18
JayFI think I got the active vote from dmitry flipped with my rbac concern?21:21
JayFsince we may want to be able to allow someone to `unservice` without permitting `active`21:22
opendevreviewJulia Kreger proposed openstack/ironic master: Set the backend to threading  https://review.opendev.org/c/openstack/ironic/+/95368321:49
opendevreviewJulia Kreger proposed openstack/ironic master: Launch vnc proxy with no_fork  https://review.opendev.org/c/openstack/ironic/+/95704421:49
opendevreviewJulia Kreger proposed openstack/ironic master: Remove direct mapping from API -> DB  https://review.opendev.org/c/openstack/ironic/+/95651221:49
opendevreviewJulia Kreger proposed openstack/ironic master: Optional indirection API use  https://review.opendev.org/c/openstack/ironic/+/95650421:49
opendevreviewJulia Kreger proposed openstack/ironic master: Revert "ci: temporary metal3 integration job disable"  https://review.opendev.org/c/openstack/ironic/+/95695321:49
TheJuliaso now we're talking about building an rbac model for individual verbs? To be clear, we do have a policy rule like baremetal:node:set_provision_state:service_steps now, but it is around the ability to pass service_steps in general, not invoke the verb. In general, the set provision state endpoint has a single policy applied to it for all logic? At some point, we need to get back to the original problem I guess...21:56
JayFTheJulia: then my question-comment above resolves to "it doesn't matter" and I'm back to not having a strong opinion21:57
JayF;D 21:57
JayF:D 21:57
TheJuliaokay21:57
JayFhmm actually21:57
JayFI do still prefer unservice; because unrescue already set the precedent of un[verb]->active21:58
opendevreviewVerification of a change to openstack/ironic master failed: api: Add schema for allocations API (requests)  https://review.opendev.org/c/openstack/ironic/+/94521822:31
jandersTheJulia w/r/t service-failed recovery I vote "unservice"22:42
jandersit feels like if we s/abort/unservice/ the verb in the current implementation, we're not far off work complete; it seems to work pretty well22:44
jandersI will look into making redfish update step non-abortable to address my other concern (feels like separate change material, likely super-trivial)22:45
jandersiurygregory any thoughts from you on the choice of "unservice" verb?22:46
jandersTheJulia I had a look at the code, abortable=False is already set: https://opendev.org/openstack/ironic/src/commit/d7cea2a8bc60e19453e588e602fae527af9b3c68/ironic/drivers/modules/redfish/firmware.py#L18122:51
TheJuliaCool22:52
janderswould you expect this to cause abort/unservice requests to be rejected while node is in "service wait" or would that only cause them to be rejected in "servicing"?22:52
jandersI was able to abort it in "service wait", trying to understand if this is expected22:52
janders(will try interrupt "servicing" now)22:52
jandersupdate: API won't allow that. So I suspect abortable=False means -ing can't be interrupted, -wait is probably another beast22:54
JayF-ing means, typically, the conductor is *actively working*22:55
jandersagreed22:55
JayFand we never (I'm pretty sure?) allow aborting from an -ing22:55
JayFbut -ing will always resolve to a wait, fail, or the target state22:55
JayFeven if it takes ${action}_timeout to do so22:55
janderswould you expect abortable flag in step's decorator to affect whether we can interrupt the wait state after a certain step?22:56
JayFIf step X is not abortable, and step X is the "running" step, you should not be able to abort from that22:56
JayFthe canonical example for that behavior is an in-band bios update in progress22:57
JayFand ironic lol-rebooting your computer into a brick22:57
jandersyeah I am trying to protect against similar case just with OOB/SimpleUpdate update22:57
jandersso what happens is I run the step, API takes the update, node reboots and starts writing new BIOS image 22:57
jandersand then it sits in "service wait" as this happens22:57
janderswith current iteration of the service-abort patch, I can issue abort twice (first: from service wait to service failed, then from service failed to active) and that will reboot the node, potentially causing damage22:58
JayFAIUI, you should *not* be able to abort from service wait in that case22:58
janderswe were discussing acceptable risk around that22:59
JayFbut I am more of an in-band cleaning expert than an OOB cleaning expert22:59
jandersOK so it seems like we may be hitting a bug?22:59
JayFjanders: acceptable risk, in context of this, also means "can the user with the access to do this (maybe a lessee member for servicing) be trusted to take this risk" 22:59
JayFwhereas in all other abort scenarios, at least thru an integrated-openstack lens, it would be an admin/manager level person doing the abort23:00
JayFwhich IMO implies we would have to be more careful (or at a minimum, the same amount of careful)23:00
jandersI see two obvious ways out of this personally: 1) if an abort request against a "service wait" node shouldn't be accepted, we need to find out why it is accepted and fix it. If abort in wait states is always allowed regardless of the abortable flag, I would make a very clear documentation addition stating that if abort is used in servicing states, the user23:01
jandersmust monitor the console and ensure there is no indication of an ongoing firmware update.23:01
janders(sorry, missed the 2) after "and fix it" :)23:01
JayFabort in wait states always allowed is (arguably?) a bug or design misfeature IMO; but I obviously have a specific perspective :D 23:01
jandersI got slack with my editing because... slack23:01
JayFI hope it's not that :D23:02
jandersI wonder what TheJulia thinks23:02
jandersI will have a quick browse through the code23:02
JayFIt's my EOD, but I'll check back tomorrow if you leave any messages you want me to check just ping me on em23:02
JayFhave a good one!23:02
jandersJayF ack+thx23:02
jandersyou too23:02
TheJuliaThere is some validity to aborting in a -ing state, like deploying23:04
TheJuliabut... obviously that shouldn't work out of the box23:04
JayFthat makes sense that deploying would be an exception there23:05
jandershow about -wait states (and specifically if the wait state happens after a step with abortable=False)?23:05
TheJuliaI don't think we even permit that, but we might23:05
TheJuliathere is some weirdness there23:05
iurygregoryjanders, I have no problem with it =)23:06
TheJuliawaits states *should* be abortable23:06
TheJuliathe issue is when we have steps which launch an asynchronous thing, like software updates on bmcs, and don't reflect whether it is done or not23:06
jandersOK I think I see what is happening23:07
janderscode reference coming shortly23:07
jandershttps://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/manager.py#L147423:09
jandersI am not sure if we have similar code for servicing, will go looking23:09
jandersbut from ^^ it seems like in cleaning the value of abortable param for the step would determine if it can be aborted in wait state23:09
TheJuliabahahahaha23:10
TheJuliaso back to my original change but add in a SERVICEWAIT23:10
TheJuliaI'm way past EOD and need to exercise. but another thing for tomorrow I guess23:10
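A hedged sketch of what that could look like: the CLEANWAIT check janders links above, extended with SERVICEWAIT per TheJulia's comment (the helper name is invented and the logic approximates the real conductor code, not a verbatim patch):

    from ironic.common import states

    def abort_allowed_now(node):
        # An abort in a *wait state may proceed immediately only if the
        # current step did not declare abortable=False; otherwise defer it
        # until the step (e.g. an OOB firmware flash) has finished.
        if node.provision_state in (states.CLEANWAIT, states.SERVICEWAIT):
            step = node.clean_step or node.service_step
            return not step or bool(step.get('abortable'))
        return True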
jandersthank you TheJulia23:12
jandershave a good exercise session23:12
jandersit feels like we're getting there, "unservice" + addressing the service wait abort/unservice should be getting us close23:12
jandersonce we sort out the service wait interruptions, as a bonus we fix one more issue:23:13
jandersright now, to get out of service wait to active we need to abort twice (service wait > service failed and then service failed > active)23:14
jandersif there is no abort/unservice from service wait this whole use case is gone23:14
janders(and I tend to think it's better this way, safer; if we see things getting stuck in servicing and not timing out we may need to address that separately)23:15
