janders | TheJulia testing BIOS update on R640. It's in service wait while "do not reboot" message is on the console. I figure if I tried to abort now, Ironic would reject the API call, right? I don't see SERVICEWAIT as a valid state for abort. | 00:19 |
janders | (I don't want to actually try cause Feature Freeze is next week, losing a day redoing a lab may cause things to slip) | 00:19 |
janders | scratch that, I tried AFTER it finished flashing but was still in wait. It accepted the request and put node in servicing failed. No reboot though. | 00:21 |
janders | https://paste.openstack.org/show/bZu1H3L9dvp0cz7V05jt/ | 00:22 |
janders | didn't expect this to work. But - if the flow is always like this (no reboot and putting the node to service failed) this shouldn't be dangerous in theory. I wonder what you folks think though. | 00:23 |
janders | we could have some really strong wording in the doco around this but in any case, in enterprise software it does have a feel of flying a jetliner with GPWS off, kind of. | 00:24 |
janders | the alternative I suppose would be to reject abort on the wait state and wait for the wait to time out. As a power user with a decade of DevOps background I prefer the current behaviour. But I would never abort service without eyeballing the console first. I bet there are people out there who would, just because they can. Question is whether we find this an | 00:26 |
janders | acceptable risk or not. | 00:26 |
janders | last relevant piece to the picture is this: to get out of this state I had to do two aborts: first interrupted servicing and put me in servicing failed (no reboot and no maintenance set). Second abort got me out of service failed to active WITH a reboot. So if someone did that (two aborts back to back) trying to get from service wait to active, | 00:32 |
janders | they could theoretically cause damage (if BMC doesn't have any protection against this). https://paste.openstack.org/show/b4EfuIcrYFkm2oHsmcLC/ | 00:32 |
TheJulia | wait being aborted to fail makes sense I guess | 01:46 |
TheJulia | oh, abort abort | 01:46 |
TheJulia | man... no idea | 01:47 |
TheJulia | the risk of someone aborting a node in servicewait while a step runs has always been there | 01:47 |
TheJulia | bmcs *should* have guards, but also some vendors mandate a/b style firmware | 01:48 |
TheJulia | so if A is bad, system reboots back to b and b gets promoted to a | 01:48 |
janders | Makes sense, thank you TheJulia | 02:06 |
janders | My take: if we're not adding any new risk, let's proceed. We're fixing something useful. | 02:07 |
janders | we can think of some extra guards separately from this if we decide we care. Strong phrasing in doco ("make sure to check console for running firmware upgrades prior to attempting abort") may be good enough | 02:07 |
janders | JayF iurygregory dtantsur WDYT? | 02:07 |
TheJulia | I think, in an ideal case, the overall process of applying the update can know when the world is safe to be able to proceed to the next step safely | 02:10 |
TheJulia | that lingering... != great | 02:10 |
iurygregory | I would be ok with it | 02:18 |
iurygregory | do we store the model of the server in properties? like I know we detect the vendor and we update the node properties with the information... | 02:34 |
iurygregory | humm seems like it would be in properties under capabilities 'server_model' if the information is available | 02:41 |
opendevreview | Kaifeng Wang proposed openstack/ironic master: Trivial: minor typo fix around network boot document https://review.opendev.org/c/openstack/ironic/+/957085 | 06:06 |
rpittau | good morning ironic! o/ | 06:45 |
queensly[m] | Good morning! | 07:13 |
dtantsur | JayF: the script-y WSGI approach is being removed from pbr, that's why | 10:42 |
dtantsur | JayF: on top of that, it seems that the world is moving towards running something like gunicorn as a WSGI server and then proxying it via apache if needed | 10:43 |
opendevreview | Verification of a change to openstack/ironic master failed: Add periodic cleanup of dead conductor entries https://review.opendev.org/c/openstack/ironic/+/956500 | 10:44 |
dtantsur | janders: I haven't been following the discussion, sorry. Could you summarize? | 10:52 |
janders | dtantsur sure. There are a couple different threads here but I think this one is the most important for now. Adding "abort" to servicing means that theoretically the operator can abort service wait state. | 10:56 |
janders | based on my testing, at least on my lab machine, once SimpleUpdate call completes and the machine reboots to apply the update, the node lands in service wait | 10:56 |
janders | service wait, being wait state, can be aborted | 10:56 |
janders | running abort in service wait results in service failed (no reboot) | 10:56 |
dtantsur | I think steps themselves have an "abortable" flag, no? | 10:56 |
janders | then, running abort again against service failed results in active AND reboot. Which can happen when there is a big "updating BIOS, do not power off" message in the console | 10:57 |
janders | I was wondering if documenting this along the lines "if aborting a service step, operator is strongly advised to check the console and make sure there is no in-progress updates before issuing the command" is sufficient safeguard here | 10:58 |
janders | or should we make abort not apply to service wait (or look at steps themselves as you just suggested) | 10:59 |
dtantsur | Hmm, we get a bit of a mix-up here. Traditionally, abort in Ironic applies to *wait states. | 10:59 |
janders | correct, and that is the case here as well | 11:00 |
dtantsur | And indeed, at least for cleaning, we don't allow abort for steps that are not marked as abortable | 11:00 |
dtantsur | (to be precise, we abort the process after the step is over) | 11:00 |
janders | makes sense | 11:00 |
dtantsur | Allowing abort on service fail, I now realize, would set an entirely new precedent in Ironic | 11:00 |
janders | perhaps the way forward here is to make redfish firmware update step non-abortable? | 11:00 |
dtantsur | yeah | 11:00 |
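The non-abortable-step idea being discussed can be sketched in plain Python. This is illustrative only, not Ironic's actual `service_step` decorator; the class and function names here are made up:

```python
# Hypothetical sketch of the "abortable" flag on steps: an abort request
# is only honored once the running step finishes, unless the step has
# explicitly opted in to being interrupted mid-run.
def service_step(abortable=False):
    def wrap(fn):
        fn._is_service_step = True
        fn._abortable = abortable
        return fn
    return wrap

class RedfishFirmwareInterface:
    @service_step(abortable=False)  # a firmware flash must never be cut short
    def simple_update(self, task):
        """Apply a firmware image via the BMC (body elided)."""

def may_abort_now(step_fn):
    # The conductor would consult this before tearing a running step down.
    return getattr(step_fn, '_abortable', False)

abort_allowed = may_abort_now(RedfishFirmwareInterface.simple_update)
```

With this shape, an abort issued during `service wait` would be deferred until the firmware step completes, matching how cleaning already treats non-abortable steps.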
dtantsur | Thinking more about it, the Ironic way out of service wait would probably be reusing the "active" verb | 11:01 |
dtantsur | Similarly to how clean failed is left by "manage" IIRC | 11:01 |
dtantsur | TheJulia: ^^^ | 11:01 |
janders | I don't disagree.. though allowing abort on "service failed" seems way slicker than previous approach of passing wait/hold step | 11:01 |
dtantsur | allowing "active" is exactly the same UX :) | 11:02 |
janders | so are you proposing "service failed" > manage > "managable" > provide or sth like this? | 11:02 |
dtantsur | another precedent is unrescue, which is closer to servicing in spirit | 11:02 |
dtantsur | wait, where is manageable coming from? | 11:02 |
janders | sorry I am a bit rusty with the manage verb | 11:03 |
dtantsur | Note: all we're talking about is which verb to use, not which path to take | 11:03 |
janders | if we "manage" a "clean failed" node what state does it go to? | 11:03 |
dtantsur | to manageable | 11:03 |
janders | (/me is looking at state transition diagram) | 11:03 |
dtantsur | All I'm pointing out is that "abort" feels weird for a state where no process is happening. It's a plausible interpretation, but it's not done this way anywhere in Ironic | 11:05 |
janders | makes sense | 11:05 |
janders | what would be an alternative? thinking how to translate clean-failed recovery to servicing | 11:05 |
dtantsur | We either do un<something> or reuse whatever verb normally leads to the expected state | 11:05 |
dtantsur | The former would be unservice, the latter would be active | 11:05 |
janders | so "service failed" > unservice > "active" potentially? | 11:05 |
dtantsur | yeah, for example | 11:06 |
janders | fair, let's see what Julia thinks | 11:06 |
dtantsur | or "service failed" > active > "active" if we want to reuse an already existing verb | 11:06 |
janders | this would mean consistency++ | 11:06 |
janders | so in terms of action items, we need to consider this PLUS making firmware update step non-abortable? | 11:07 |
dtantsur | yeah, I think so | 11:07 |
janders | I tend to agree | 11:07 |
janders | I will look into the latter (once I am finished with a customer bug) | 11:07 |
janders | I have no major issue with the current proposed way (it is a significant improvement already) but while we're at it why not consider going all the way | 11:08 |
janders | consistency is good | 11:08 |
janders | coding wise, unservice would almost be an alias for abort right? | 11:09 |
janders | (since service-abort codepaths already exist and that's pretty much what we need to do IIUC) | 11:10 |
janders | can we either abort rescue or unrescue and if so is there any significant difference between the flow of the two? | 11:11 |
janders | ( janders pulls out states diagram again) | 11:11 |
dtantsur | I don't think the hypothetical unservice and abort are very similar. The latter is "stop doing thing, do not clean-up". The former - "nothing is happening, but do clean-up". | 11:12 |
janders | then it feels like current abort-service is actually unservice | 11:13 |
janders | (as it flips things back to booting into instance) | 11:13 |
janders | but reality is likely a bit more vague | 11:14 |
janders | regarding my previous comment looks like there is no rescue-abort only unrescue | 11:14 |
JayF | dtantsur: so that change should probably undocument apache wsgi | 12:45 |
TheJulia | good morning | 13:00 |
TheJulia | dtantsur: I thought about taking the active verb, but that is also a very different code path and it seemed better to just back out of the operation than call the provisioning code path | 13:05 |
TheJulia | that also means a user needs to do something like baremetal node deploy again | 13:06 |
TheJulia | my grenade job fix finally works | 13:14 |
dtantsur | TheJulia: to be clear, I meant reusing the "active" verb, not the entire machinery behind it. Just like "abort" does different things depending on the initial state. | 13:15 |
dtantsur | good morning! | 13:15 |
TheJulia | and to be clear, I didn't want to go down the path of special casing it along that provisioning code path | 13:16 |
opendevreview | Nahian Pathan proposed openstack/ironic master: Reduce API calls when collecting sensor data with redfish https://review.opendev.org/c/openstack/ironic/+/955484 | 13:16 |
TheJulia | since there is way more there, abort felt more like an escape hatch | 13:16 |
dtantsur | I imagine, but now we assign an entire new meaning to the "abort" verb | 13:16 |
dtantsur | in which case we probably need to make sure it works on other *failed states? | 13:17 |
TheJulia | I don't really think so, but I can kind of see the point. Truthfully it feels like the discussion has shifted as well | 13:17 |
dtantsur | I'm coming from the perspective of "I don't understand which command I'm supposed to use" being a common complaint | 13:18 |
dtantsur | and I can see your logic that "abort" is the most intuitive thing. But it will be a disaster if it only works on some states. | 13:18 |
TheJulia | That is totally fair | 13:18 |
dtantsur | (also, our state machine is a mess..) | 13:19 |
TheJulia | unfortunately, active also feels less intuitive | 13:19 |
TheJulia | (we have strong guardrails) | 13:19 |
dtantsur | I agree. I only suggested it because it seems consistent with how "manage" returns you to "manageable". | 13:19 |
TheJulia | Yeah | 13:19 |
TheJulia | its just less intuitive and makes me worry people will have the same expectation | 13:20 |
TheJulia | "hey, I'm in clean fail, I want to be active, why can't I be active right now" | 13:20 |
dtantsur | lol, I see your point :D | 13:21 |
TheJulia | (its the same point!) :) | 13:21 |
TheJulia | I guess the thing we need to be careful about is setting expectations of "we're going to deploy it" again if you broke it in service | 13:26 |
TheJulia | then again, we have rebuild | 13:26 |
dtantsur | by the way, I don't think rebuild works out of service failed or any servicing parts? | 13:26 |
TheJulia | it does not, you'd need to get back to active | 13:26 |
TheJulia | really we should permit service fail -> rebuild verb | 13:27 |
TheJulia | separately | 13:27 |
* dtantsur nods | 13:27 | |
* TheJulia wheels in the "drink more coffee" coffee machine with a sticker which says "review eventlet removal patches!" | 13:28 | |
dtantsur | are we ready with them now? | 13:29 |
TheJulia | I need to sit down and spin up the novncproxy stuffs just to double check no_fork, which was my plan for this afternoon, but the other changes are just waiting for reviews at this point | 13:30 |
dtantsur | k gimme a few minutes | 13:30 |
TheJulia | no rush, just don't want to push towards the very end and be in the endless battle with CI to get stuff to merge | 13:30 |
TheJulia | We've all put a ton of effort in on this | 13:31 |
TheJulia | Also: https://etherpad.opendev.org/p/ironic-eventlet-removal line 56 for any additional testing per the discussion yesterday | 13:34 |
JayF | You can put my vote in the column of unservice | 14:01 |
JayF | Maybe this is a stupid question, but would it be annoying to have active be the state to escape a service fail from an RBAC standpoint? I definitely envision use cases for service where people would have access to perform service but not to do a deployment | 14:01 |
dtantsur | TheJulia: a bunch of comments on https://review.opendev.org/c/openstack/ironic/+/952939, one of them important-ish, but nothing truly blocking | 14:04 |
dtantsur | JayF: good point re "active" and RBAC | 14:04 |
TheJulia | dtantsur: w/r/t queuing, you do understand it's the rapid lock access which piles up (per the comment), right? | 14:06 |
TheJulia | the only way to prevent the case is for all work enqueuing to be held under the lock in futurist | 14:06 |
TheJulia | but that will also slow down thread launches | 14:06 |
dtantsur | TheJulia: I'm not sure I'm following. I suspect our average thread's lifetime greatly exceeds any potential lock contention time | 14:07 |
TheJulia | granted, I've functionally changed it more so to "will this exceed our overall limits" | 14:07 |
TheJulia | yeah, but the lock is around adding the entry to the list | 14:07 |
dtantsur | also, the lock is always there, whether with queue or not | 14:08 |
TheJulia | and so when launching 8 threads in rapid succession, something's going to hit the lock and see entries being launched | 14:08 |
dtantsur | could you take a step back and specify which concern you're addressing? | 14:09 |
dtantsur | the lock in submit(), if you mean it, applies always | 14:09 |
TheJulia | I'm having to look for it at this point, it was clear as day when I was bashing my head against it weeks ago | 14:12 |
TheJulia | I think it was https://github.com/openstack/futurist/blob/master/futurist/_futures.py#L173C9-L173C29 where it holds the lock to get the number to make the decision, but that requires the lock to make that decision; meanwhile submission of a new thread is also under a lock, so we get into a state where we haven't really gotten it started yet but we're releasing the lock on one side, and the other side grabs the lock before the | 14:16 |
TheJulia | "queue" is empty and it sees a value | 14:16 |
dtantsur | TheJulia: note that get_num_idle_workers is called from submit where the lock is already held, e.g. it's re-entrant | 14:17 |
dtantsur | https://github.com/openstack/futurist/blob/master/futurist/_futures.py#L219 | 14:17 |
dtantsur | so what you say only applies to logic that is executed outside of submit(). not to rejection or the growth logic. | 14:18 |
TheJulia | at this point, I've lost the context | 14:19 |
dtantsur | I don't think I've even gained it in the first place :D | 14:20 |
TheJulia | just I know at this point we see it stacking up when it is running and the default rejector basically knee-caps the conductor's operation | 14:20 |
TheJulia | my troubleshooting then yielded that it was intermittent... often, mid-sequence of starting workers, it would see one in the queue and reject the additional work | 14:20 |
TheJulia | so it would get 2-3 workers created and then see an entry in the queue from another launch | 14:21 |
TheJulia | and freak out | 14:21 |
TheJulia | and by freak out, halt the creation of new workers on power sync and suddenly it takes 20 minutes instead of 3 | 14:21 |
dtantsur | Right, because their rejection does not take into account that work does not disappear from the queue immediately | 14:22 |
TheJulia | yes, exactly | 14:22 |
dtantsur | it has little to do with locking and a lot to do with the fact that things are not immediate | 14:22 |
TheJulia | yup | 14:22 |
dtantsur | it was the case in eventlet too, by the way. I think that's why we have this "1" constant there | 14:22 |
dtantsur | But. I don't think I objected to having our own rejector, did it? hence my confusion about the context of this conversation | 14:23 |
TheJulia | What you expressed was more "oh no, we're queuing, I don't want queuing" | 14:23 |
dtantsur | Definitely not. In fact, I started musing if we need to *allow* queueing. Which, I guess you do, now that I think more about it? | 14:24 |
TheJulia | oh, okay, I misunderstood you | 14:24 |
TheJulia | I kind of do, really | 14:24 |
TheJulia | but, yeah | 14:24 |
dtantsur | Your rejector basically rejects work if the queue size + number of active threads exceeds the maximum number of workers, right? | 14:24 |
TheJulia | yeah | 14:25 |
dtantsur | Right, so you enable queueing, and this aligns with what I've been pondering :) | 14:25 |
dtantsur | I also think you've simply invented a better rejector than the one in futurist | 14:25 |
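The rejection rule just described can be sketched in plain Python. This is a simplified illustration, not the actual futurist or Ironic code; the function and its parameters are made up for clarity:

```python
class RejectedSubmission(Exception):
    """Raised instead of silently dropping work (mirrors futurist's idea)."""

def check_and_reject(queued, active, max_workers):
    # Reject only when queued-but-not-yet-started items plus live workers
    # would exceed the pool's overall limit, so a burst of submissions
    # mid-launch (work still sitting briefly in the queue) is not refused.
    if queued + active > max_workers:
        raise RejectedSubmission(
            f"{queued} queued + {active} active > {max_workers} max")

# 3 workers already running, 1 item not yet picked up: with max_workers=8
# this is accepted, where a naive "queue must be empty" rejector would balk.
check_and_reject(queued=1, active=3, max_workers=8)

try:
    check_and_reject(queued=5, active=4, max_workers=8)
    rejected = False
except RejectedSubmission:
    rejected = True
```

The key difference from a default "reject when the queue is non-empty" policy is that transient queue depth during thread startup no longer kneecaps a burst of work like a power sync.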
TheJulia | \o/ | 14:26 |
TheJulia | okay, it was clear I was kind of misunderstanding you as well | 14:26 |
dtantsur | My only concern with queuing is the existence of synchronous API that uses spawn_worker. But that's just life, we cannot change it overnight. | 14:26 |
dtantsur | The only important-ish comment there is about the possibility of someone setting very low workers number in their configuration. | 14:26 |
TheJulia | yeah, that... is an awful api endpoint | 14:26 |
TheJulia | ahhhhhh okay | 14:27 |
TheJulia | We've generally seen the opposite | 14:27 |
TheJulia | ... support case with 1000+ workers | 14:27 |
dtantsur | oh, wow, nice | 14:27 |
TheJulia | yeah, they tuned all the timeouts to like 300+ seconds | 14:27 |
dtantsur | still, I'd rather have some sort of a guardrail (in a follow-up if needed) | 14:27 |
dtantsur | heh | 14:27 |
TheJulia | and turned on as much retry logic as possible to get zero failures | 14:28 |
TheJulia | "no, really, trust us, let things fail, it will be fine" | 14:28 |
JayF | some people need to take a physics class and remember we're in the real world lol | 14:31 |
JayF | 'try to shove the electrons harder through the [for example] failed network cable' | 14:31 |
JayF | OK; I have a redfish issue I was trying to see if we had already put in a workaround for and I couldn't find it. | 14:32 |
JayF | I have a machine on which Ironic redfish-virtual-media mounts the iso as USB2, but then we set the next boot device to cdrom instead of usb ... and I'm not even sure we have support for setting next boot device to usb? | 14:33 |
TheJulia | uhhh... I thought we had code to deal with that edge case on that hardware | 14:34 |
TheJulia | sorry, brain semi-scrambled because gearing up for meetings and discussion :) | 14:34 |
JayF | hmm I'll look for it, this isn't on master so may just be "find the patch to backport" | 14:35 |
TheJulia | oh | 14:35 |
TheJulia | uhhh. | 14:36 |
TheJulia | https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L298 is the path I'm thinking | 14:38 |
TheJulia | https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L790 | 14:38 |
TheJulia | versus https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/redfish/boot.py#L903 you'll need to figure out a good way to detect it, my thought was dvd vs cd connotation in bmc response data | 14:39 |
JayF | you're linking two separate things afaict? | 14:40 |
JayF | one is talking about the iso, one is talking about configdrive ? | 14:40 |
TheJulia | yes | 14:40 |
TheJulia | but if you look at the very first one, the confusion I was sourcing was dvd vs cd in the api, not usb | 14:40 |
JayF | yeah | 14:40 |
TheJulia | I get why they likely chose it, its actually a better analog | 14:41 |
TheJulia | because the aspeed BMCs do the same thing, they present a volume $thing, and label it whatever they feel like... all over USB | 14:42 |
TheJulia | (and yes, I've seen it present a virtual cd, and a thumb drive, in response to a cd/dvd attach request in the past) | 14:44 |
JayF | so there's probably a hacky-fix to make it use usb to get them going | 14:45 |
JayF | and the "real fix" is to maybe ask the redfish what is inserted | 14:45 |
JayF | and use that info to set boot device? | 14:45 |
TheJulia | possibly | 14:46 |
JayF | ack; I suspect we'll hack a fix in downstream and I might address this as a longer-term bug since it'd be crunch to get it in sushy pre-release | 14:51 |
TheJulia | ack | 14:54 |
TheJulia | to be fair, its likely better to try and get a release out at some point sooner than waiting | 14:55 |
TheJulia | but really the difference is more in waiting for release process to complete | 14:55 |
JayF | When I originally proposed to you the idea of doing some dedicated QA | 14:55 |
JayF | I assumed it might come with an earlier release, too | 14:56 |
JayF | so we could +x.1 or +x.x.1 it if we needed before integrated release | 14:56 |
TheJulia | I don't think we need to *cut* a QA release, it's not like folks are going to jump on it | 14:56 |
TheJulia | and the release team historically pushes back on last minute releases for press release purposes | 14:56 |
JayF | I guess I personally like the idea of not testing a moving target; but that's more academic than a real risk | 14:56 |
TheJulia | There is also reality, as we move into the next few weeks, CI is going to get worse and worse | 14:57 |
dtantsur | JayF: asking sushy what's inserted may not work: we only consider devices that expose CD or DVD | 15:11 |
dtantsur | then you have a media_types list which may also contain USB, I guess.. but I don't think we have a way to know that the device is USB internally | 15:12 |
TheJulia | I think the best thing to do is to check media_types and lean towards USB | 15:23 |
TheJulia | ... Has anyone purchased a CD or USB recently?! | 15:23 |
TheJulia | err | 15:23 |
TheJulia | CD or DVD | 15:23 |
opendevreview | Jakub Jelinek proposed openstack/ironic-python-agent master: WIP: Fix skip block devices for RAID arrays https://review.opendev.org/c/openstack/ironic-python-agent/+/937342 | 15:25 |
jssfr | TheJulia, some time after 2020, but I don't know the exact year... probably 2022 or so. and then even four of them at once! | 15:50 |
TheJulia | I don't think we've directly purchased any since ?2019 or 2020? | 15:54 |
TheJulia | Granted, some have appeared. My wife has sponsored some indie movies and been in a couple as well, so we've gotten the DVDs for those. | 15:54 |
jssfr | (those were CDs though) | 15:56 |
JayF | dtantsur: we have an internal device labelled USB | 15:56 |
JayF | dtantsur: so likely adding support for such a device would be part of it | 15:56 |
jssfr | (we tend to lend DVDs from the library rather than buying them, but I like to maintain my music collection. and since I'm allergic to streaming services, I either buy CDs or from bandcamp or wherever else I get losslessly compressed audio) | 15:57 |
JayF | I took an action item to document this in a bug | 15:57 |
JayF | and will look before PTG to see how painful it would be to eliminate this class of issue/workaround | 15:57 |
dtantsur | TheJulia: media_types is not the device bus though. Most BMCs nowadays use CD or DVD regardless of how the virtual device is connected | 16:02 |
dtantsur | (except for the SMC's UsbCd...) | 16:02 |
TheJulia | definitely, but one could suggest another is what I'm thinking | 16:03 |
JayF | What I've heard reported is more that we use the "normal Ironic path" for inserting it, but it only boots to the correct device when you tell the BMC to boot from USB. | 16:04 |
dtantsur | Maybe. I don't know if we can assume that if media_types contains Usb, the boot device should also be Usb. Seems like a bit of a stretch. | 16:04 |
JayF | They have it working by racing the boot with a call to the BMC to boot from usb. | 16:04 |
dtantsur | We could have something like driver_info['redfish_virtual_media_device'] for crazy hardware.. | 16:05 |
TheJulia | dtantsur: I agree completely, I sort of suspect we might have to special case the decision | 16:05 |
JayF | dtantsur: that was my original thought, but when I realized it was inserting one place and showing another, I was curious if there was a "kill this class of bug" solution rather than whackamole | 16:05 |
dtantsur | Right. I don't know unfortunately. | 16:06 |
dtantsur | And as always with quirky hardware, we risk breaking one model by fixing another. | 16:06 |
JayF | yeah, me neither, but I don't wanna not-try, especially given there's a fallback position | 16:06 |
TheJulia | dtantsur: yup | 16:06 |
JayF | but I suspect over time we'll move more and more towards being able to tell our driver what goes where | 16:07 |
JayF | especially as we get more setups where one physical server == N redfish systems | 16:07 |
JayF | (I have one where N=3) | 16:07 |
dtantsur | Well, we finally got VirtualMedia linked through System instead of Manager :) | 16:07 |
JayF | and you have to walk all the systems if you wanna see all the hardware/attrs | 16:07 |
dtantsur | but it's still a PITA with firmware updates that all go through UpdateService | 16:07 |
TheJulia | did dmtf ever agree to have a "I want to identify the manager from the system" issue? | 16:09 |
dtantsur | A system links to its Manager, if that's your question. It's not a solution for both virtual media and update service though since the same Manager can be used by other Systems (and Manager is not linked to UpdateService) | 16:10 |
TheJulia | yeah, true | 16:11 |
opendevreview | Verification of a change to openstack/ironic master failed: Add periodic cleanup of dead conductor entries https://review.opendev.org/c/openstack/ironic/+/956500 | 16:11 |
JayF | cid: late review on ^ but if you could https://review.opendev.org/c/openstack/ironic/+/956500/3#message-de0261bc90108d86852cfc3fa71f7d86352fc2ee in a follow-up, it'd be awesome | 16:20 |
cid | JayF: okay, good catch. I'll push a follow up today | 16:35 |
JayF | make sure if you pick a number >1:1 ratio that we update the release note and config desc to indicate the requirement | 16:35 |
JayF | since right now it just says heartbeat_timeout should be > conductor-timeout-command | 16:36 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for allocations API (requests) https://review.opendev.org/c/openstack/ironic/+/945218 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for allocations API (responses) https://review.opendev.org/c/openstack/ironic/+/928921 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for bios API (versioning) https://review.opendev.org/c/openstack/ironic/+/952147 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for bios API (requests) https://review.opendev.org/c/openstack/ironic/+/952148 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for bios API (responses) https://review.opendev.org/c/openstack/ironic/+/952149 | 16:39 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for inspection rules API (versioning) https://review.opendev.org/c/openstack/ironic/+/954038 | 16:43 |
clif | Is the current release of ironic 31.0? I'm looking at ironic/common/release_mappings.py and the latest I see besides master is '29.0'. Should there be entries in here for '30.0' and '31.0'? | 17:05 |
clif | I checked and I'm rebased on latest master afaict, last commit is a zuul merge from yesterday | 17:06 |
dtantsur | ugh, we should have added them.. | 17:06 |
clif | Hopefully not difficult to fix? | 17:07 |
dtantsur | yeah, but not for the existing releases | 17:07 |
clif | I'm adding fields to Port and Portgroup, was going to bump the object versions | 17:08 |
clif | then a unit test screamed at me for there being gaps in the released object versions :) | 17:08 |
dtantsur | yeah, please add the missing ones (ideally, in a separate patch) | 17:08 |
clif | I'll see what I can do. | 17:17 |
opendevreview | Clif Houck proposed openstack/ironic master: Add RELEASE_MAPPING entry for 30.0 release https://review.opendev.org/c/openstack/ironic/+/957164 | 17:27 |
opendevreview | Clif Houck proposed openstack/ironic master: Add RELEASE_MAPPING entry for 31.0 release https://review.opendev.org/c/openstack/ironic/+/957166 | 17:34 |
clif | Created these by checking out 30.0.0 and 31.0.0 tags and copy-pasting the master entry into their corresponding version entry, lmk if this is good | 17:34 |
JayF | Looks accurate at a glance. I'll give it a thorough review after lunch. | 17:49 |
JayF | Plus we'll have to see what branches this needs backporting to | 17:49 |
JayF | clif: so basically we have to "bisect" the version, and ensure the values are right :( | 18:15 |
JayF | meaning I'd get the SHA 30.0 and 31.0 were cut from | 18:15 |
JayF | and take the HEAD copy advertised for master for those commits | 18:16 |
JayF | gerrit/bugfix/30.0 gerrit/bugfix/31.0 | 18:16 |
JayF | so they are bugfix branches | 18:16 |
JayF | hmm maybe you already did this? or just no object bumps at all since 30.0? Seems unlikely | 18:18 |
clif | I only know of the Port bump, but I haven't dug further to check | 18:18 |
JayF | I compared bugfix/30.0 to the update for 30.0 and it matched | 18:19 |
JayF | checking 31.0 and expecting the same result | 18:19 |
JayF | do you know what that is for? | 18:19 |
clif | the port bump? I think adding the description field | 18:19 |
JayF | no, the release_mappings.py generally | 18:19 |
JayF | essentially you can tell Ironic, via config, the "max version" you want it to ever use in a cluster, allowing for rolling updates. So you could set it to 30.0 and be assured no API/object above the versions in that dict would be used, then once the full cluster is upgraded (say to master or 31.0) you remove the config and let everyone use new objects now that everyone can process them | 18:20 |
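The pinning mechanism JayF describes can be sketched like this. The version numbers and the helper function are purely illustrative, not the real contents of `ironic/common/release_mappings.py`:

```python
# Illustrative shape only: the real RELEASE_MAPPING also carries API and
# RPC versions per release, and the object versions shown are made up.
RELEASE_MAPPING = {
    '30.0': {'objects': {'Port': ['1.11']}},
    '31.0': {'objects': {'Port': ['1.12']}},
}

def serialization_version(pin, obj_name, latest):
    """Version to emit on the wire: a pin caps it during rolling upgrades."""
    if pin is None:
        return latest
    return RELEASE_MAPPING[pin]['objects'][obj_name][0]

# While the operator keeps the pin set in ironic.conf, Port objects are
# sent at the older version so not-yet-upgraded conductors can still
# deserialize them; removing the pin lets the latest version flow.
pinned = serialization_version('30.0', 'Port', latest='1.12')
unpinned = serialization_version(None, 'Port', latest='1.12')
```

This is also why the missing 30.0 and 31.0 entries matter: without a dict entry for a release, an operator cannot pin to it during a rolling upgrade.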
clif | I think I understand | 18:21 |
JayF | +2 on both, that's a good fix | 18:21 |
clif | ty | 18:22 |
clif | is the release version tracked or exposed anywhere else in the codebase? | 18:22 |
clif | I didn't find references to 30.0 or 31.0 elsewhere | 18:22 |
JayF | pbr derives it from git metadata in-repo | 18:22 |
JayF | out-of-repo it's in releases repo | 18:22 |
JayF | let me link ya | 18:23 |
JayF | https://opendev.org/openstack/releases/src/branch/master/deliverables/flamingo/ironic.yaml each deliverable:release_cycle mapping has a yaml file | 18:24 |
JayF | with specific shas and details on the release | 18:24 |
JayF | this is the *source of truth* -- e.g. PR to this repo merged is actually how the release is cut | 18:24 |
JayF | so that says the SHA from the repo for each release, then indicates that bugfix/30.0 (for instance) was cut off the same SHA as 30.0 release version | 18:25 |
JayF | we could later add a 30.1 release but that bugfix/30.0-branch-create-stanza at the bottom doesn't change | 18:25 |
JayF | all the actual automation is in repo there, but really unless you wanna get in the weeds, the yaml is readable | 18:25 |
* TheJulia needs a giant cup-o-brains | 19:49 | |
TheJulia | dtantsur: it's fine to backport the mapping fix, as long as it is a super early action before we cut any other release off the branch | 19:53 |
opendevreview | Julia Kreger proposed openstack/ironic master: Replace GreenThreadPoolExecutor in conductor https://review.opendev.org/c/openstack/ironic/+/952939 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Set the backend to threading https://review.opendev.org/c/openstack/ironic/+/953683 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Remove direct mapping from API -> DB https://review.opendev.org/c/openstack/ironic/+/956512 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Optional indirection API use https://review.opendev.org/c/openstack/ironic/+/956504 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Revert "ci: temporary metal3 integration job disable" https://review.opendev.org/c/openstack/ironic/+/956953 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Clean-up misc eventlet references https://review.opendev.org/c/openstack/ironic/+/955632 | 20:10 |
opendevreview | Julia Kreger proposed openstack/ironic master: Launch vnc proxy with no_fork https://review.opendev.org/c/openstack/ironic/+/957044 | 20:16 |
TheJulia | fyi, it looks like our jobs have been failing due to bad qcow2 images | 21:12 |
TheJulia | but, latest I just pulled seems to be good, so... I've rechecked a job to see | 21:12 |
opendevreview | Merged openstack/ironic master: Trivial: minor typo fix around network boot document https://review.opendev.org/c/openstack/ironic/+/957085 | 21:17 |
TheJulia | Going back to janders and the whole "escape hatch" out of "service fail" state, seems like we're 2 votes "active", 1 for a new unservice verb. Was that it for our options? | 21:18 |
JayF | I think I got the active vote from dmitry flipped with my rbac concern? | 21:21 |
JayF | since we may want to be able to allow someone to `unservice` without permitting `active` | 21:22 |
opendevreview | Julia Kreger proposed openstack/ironic master: Set the backend to threading https://review.opendev.org/c/openstack/ironic/+/953683 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Launch vnc proxy with no_fork https://review.opendev.org/c/openstack/ironic/+/957044 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Remove direct mapping from API -> DB https://review.opendev.org/c/openstack/ironic/+/956512 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Optional indirection API use https://review.opendev.org/c/openstack/ironic/+/956504 | 21:49 |
opendevreview | Julia Kreger proposed openstack/ironic master: Revert "ci: temporary metal3 integration job disable" https://review.opendev.org/c/openstack/ironic/+/956953 | 21:49 |
TheJulia | so now we're talking about building an rbac model for individual verbs? To be clear, we do have a policy rule like baremetal:node:set_provision_state:service_steps now, but it is around the ability to pass service_steps in general, not to invoke the verb. In general, the set-provision-state endpoint has a single policy applied to it for all logic. At some point we need to get back to the original problem, I guess... | 21:56 |
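The rule TheJulia names can be overridden in an oslo.policy file; the snippet below only illustrates that shape, and the per-verb `unservice` rule is hypothetical (it is the very thing being debated), not something Ironic ships:

```yaml
# policy.yaml override sketch. The first rule name is the real one
# discussed above; it governs passing service_steps, not invoking a verb.
"baremetal:node:set_provision_state:service_steps": "role:admin"
# A hypothetical per-verb rule, if one were ever added:
# "baremetal:node:set_provision_state:unservice": "role:member"
```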
JayF | TheJulia: then my question-comment above resolves to "it doesn't matter" and I'm back to not having a strong opinion | 21:57 |
JayF | ;D | 21:57 |
JayF | :D | 21:57 |
TheJulia | okay | 21:57 |
JayF | hmm actually | 21:57 |
JayF | I do still prefer unservice, because unrescue already set the precedent of un[verb]->active | 21:58 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Add schema for allocations API (requests) https://review.opendev.org/c/openstack/ironic/+/945218 | 22:31 |
janders | TheJulia w/r/t service-failed recovery I vote "unservice" | 22:42 |
janders | it feels like if we s/abort/unservice/ the verb in the current implementation, we're not far off work complete; it seems to work pretty well | 22:44 |
janders | I will look into making redfish update step non-abortable to address my other concern (feels like separate change material, likely super-trivial) | 22:45 |
janders | iurygregory any thoughts from you on the choice of "unservice" verb? | 22:46 |
janders | TheJulia I had a look at the code, abortable=False is already set: https://opendev.org/openstack/ironic/src/commit/d7cea2a8bc60e19453e588e602fae527af9b3c68/ironic/drivers/modules/redfish/firmware.py#L181 | 22:51 |
TheJulia | Cool | 22:52 |
janders | would you expect this to cause abort/unservice requests to be rejected while node is in "service wait" or would that only cause them to be rejected in "servicing"? | 22:52 |
janders | I was able to abort it in "service wait", trying to understand if this is expected | 22:52 |
janders | (will try interrupt "servicing" now) | 22:52 |
janders | update: the API won't allow that. So I suspect abortable=False means -ing can't be interrupted; -wait is probably another beast | 22:54 |
JayF | -ing means, typically, the conductor is *actively working* | 22:55 |
janders | agreed | 22:55 |
JayF | and we never (I'm pretty sure?) allow aborting from an -ing | 22:55 |
JayF | but -ing will always resolve to a wait, fail, or the target state | 22:55 |
JayF | even if it takes ${action}_timeout to do so | 22:55 |
janders | would you expect abortable flag in step's decorator to affect whether we can interrupt the wait state after a certain step? | 22:56 |
JayF | If step X is not abortable, and step X is the "running" step, you should not be able to abort from that | 22:56 |
JayF | the canonical example for that behavior is an in-band bios update in progress | 22:57 |
JayF | and ironic lol-rebooting your computer into a brick | 22:57 |
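The abortable semantics JayF describes can be sketched as follows; the decorator and helper here are illustrative stand-ins, not Ironic's real base.clean_step/base.service_step code:

```python
# Sketch (not Ironic's actual code) of how a step's abortable flag can
# gate an abort request. All names here are illustrative stand-ins.

def service_step(priority=0, abortable=False):
    """Minimal stand-in for Ironic's step decorator: like the real one,
    it just records metadata on the decorated function."""
    def decorator(func):
        func._is_service_step = True
        func._step_priority = priority
        func._step_abortable = abortable
        return func
    return decorator

class FirmwareInterface:
    @service_step(priority=0, abortable=False)
    def simple_update(self, task):
        # Kicks off an asynchronous BMC firmware flash; the node then
        # sits in a wait state while the BMC does the actual work.
        pass

def may_abort(step_func):
    """An abort request should be rejected while a non-abortable step
    is the one the node is running or waiting on."""
    return getattr(step_func, "_step_abortable", False)

print(may_abort(FirmwareInterface.simple_update))  # -> False
```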
janders | yeah I am trying to protect against similar case just with OOB/SimpleUpdate update | 22:57 |
janders | so what happens is I run the step, API takes the update, node reboots and starts writing new BIOS image | 22:57 |
janders | and then it sits in "service wait" as this happens | 22:57 |
janders | with current iteration of the service-abort patch, I can issue abort twice (first: from service wait to service failed, then from service failed to active) and that will reboot the node, potentially causing damage | 22:58 |
JayF | AIUI, you should *not* be able to abort from service wait in that case | 22:58 |
janders | we were discussing acceptable risk around that | 22:59 |
JayF | but I am more of an in-band cleaning expert than an OOB cleaning expert | 22:59 |
janders | OK so it seems like we may be hitting a bug? | 22:59 |
JayF | janders: acceptable risk, in context of this, also means "can the user with the access to do this (maybe a lessee member for servicing) be trusted to take this risk" | 22:59 |
JayF | whereas in all other abort scenarios, at least thru an integrated-openstack lens, it would be an admin/manager level person doing the abort | 23:00 |
JayF | which IMO implies we would have to be more careful (or at a minimum, the same amount of careful) | 23:00 |
janders | I see two obvious ways out of this personally: 1) if an abort request against a "service wait" node shouldn't be accepted, we need to find out why it is accepted and fix it. 2) if abort in wait states is always allowed regardless of the step's abortable flag, I would make a very clear documentation addition stating that if abort | 23:01 |
janders | is used in servicing states, the user must monitor the console and ensure there is no indication of an ongoing firmware update. | 23:01 |
JayF | abort in wait states always allowed is (arguably?) a bug or design misfeature IMO; but I obviously have a specific perspective :D | 23:01 |
janders | I got slack with my editing because... slack | 23:01 |
JayF | I hope it's not that :D | 23:02 |
janders | I wonder what TheJulia thinks | 23:02 |
janders | I will have a quick browse through the code | 23:02 |
JayF | It's my EOD, but I'll check back tomorrow if you leave any messages you want me to check just ping me on em | 23:02 |
JayF | have a good one! | 23:02 |
janders | JayF ack+thx | 23:02 |
janders | you too | 23:02 |
TheJulia | There is some validity to aborting in a -ing state, like deploying | 23:04 |
TheJulia | but... obviously that shouldn't work out of the box | 23:04 |
JayF | that makes sense that deploying would be an exception there | 23:05 |
janders | how about -wait states (and specifically if the wait state happens after a step with abortable=False)? | 23:05 |
TheJulia | I don't think we even permit that, but we might | 23:05 |
TheJulia | there is some weirdness there | 23:05 |
iurygregory | janders, I have no problem with it =) | 23:06 |
TheJulia | wait states *should* be abortable | 23:06 |
TheJulia | the issue is when we have steps which launch an asynchronous thing, like software updates on BMCs, and don't reflect whether it is done or not | 23:06 |
janders | OK I think I see what is happening | 23:07 |
janders | code reference coming shortly | 23:07 |
janders | https://opendev.org/openstack/ironic/src/branch/master/ironic/conductor/manager.py#L1474 | 23:09 |
janders | I am not sure if we have similar code for servicing, will go looking | 23:09 |
janders | but from ^^ it seems like in cleaning the value of abortable param for the step would determine if it can be aborted in wait state | 23:09 |
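The pattern janders points at can be sketched like this; the state names and step representation below are illustrative, not Ironic's actual conductor/manager.py code. The fix under discussion amounts to making the service path honor the same abortable check that cleaning already applies in its wait state:

```python
# Sketch: before honoring an abort in a wait state, consult the
# abortable flag of the step the node is currently waiting on.
# Names and structures are illustrative stand-ins.

CLEANWAIT = "clean wait"
SERVICEWAIT = "service wait"

def can_abort_wait(node_state, current_step):
    """Allow abort only in a wait state, and only if the step being
    waited on was declared abortable by its driver."""
    if node_state not in (CLEANWAIT, SERVICEWAIT):
        return False
    return bool(current_step.get("abortable", False))

# A non-abortable OOB firmware flash in service wait: abort refused.
print(can_abort_wait(SERVICEWAIT, {"step": "simple_update", "abortable": False}))  # -> False
# An abortable step in clean wait: abort honored.
print(can_abort_wait(CLEANWAIT, {"step": "erase_devices", "abortable": True}))  # -> True
```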
TheJulia | bahahahaha | 23:10 |
TheJulia | so back to my original change but add in a SERVICEWAIT | 23:10 |
TheJulia | I'm way past EOD and need to exercise. but another thing for tomorrow I guess | 23:10 |
janders | thank you TheJulia | 23:12 |
janders | have a good exercise session | 23:12 |
janders | it feels like we're getting there, "unservice" + addressing the service wait abort/unservice should be getting us close | 23:12 |
janders | once we sort out the service wait interruptions, as a bonus we fix one more issue: | 23:13 |
janders | right now, to get out of service wait to active we need to abort twice (service wait > service failed and then service failed > active) | 23:14 |
janders | if there is no abort/unservice from service wait this whole use case is gone | 23:14 |
janders | (and I tend to think it's better this way, safer; if we see things getting stuck in servicing and not timing out we may need to address that separately) | 23:15 |