opendevreview | Iury Gregory Melo Ferreira proposed openstack/ironic master: RedfishFirmware Interface https://review.opendev.org/c/openstack/ironic/+/885425 | 02:47 |
---|---|---|
dtantsur | JayF, I cannot do it either, I wonder if hashtags are not enabled there at all | 08:48 |
opendevreview | Dmitry Tantsur proposed openstack/bifrost stable/yoga: Remove Fedora from the CI https://review.opendev.org/c/openstack/bifrost/+/893754 | 08:52 |
opendevreview | Merged openstack/ironic master: [releasenotes] Prelude for 2023.2/bobcat https://review.opendev.org/c/openstack/ironic/+/895007 | 08:58 |
opendevreview | Merged openstack/python-ironic-inspector-client stable/2023.2: Update .gitreview for stable/2023.2 https://review.opendev.org/c/openstack/python-ironic-inspector-client/+/895093 | 09:30 |
opendevreview | Merged openstack/python-ironic-inspector-client stable/2023.2: Update TOX_CONSTRAINTS_FILE for stable/2023.2 https://review.opendev.org/c/openstack/python-ironic-inspector-client/+/895094 | 09:30 |
opendevreview | Merged openstack/python-ironic-inspector-client master: Update master for stable/2023.2 https://review.opendev.org/c/openstack/python-ironic-inspector-client/+/895095 | 09:30 |
opendevreview | Adam Rozman proposed openstack/ironic-python-agent master: implement basic-auth support for user-image download process https://review.opendev.org/c/openstack/ironic-python-agent/+/890272 | 10:19 |
opendevreview | Adam Rozman proposed openstack/ironic-python-agent master: implement basic-auth support for user-image download process https://review.opendev.org/c/openstack/ironic-python-agent/+/890272 | 10:30 |
opendevreview | Verification of a change to openstack/networking-generic-switch master failed: Do not make actual device changes in bind_port() https://review.opendev.org/c/openstack/networking-generic-switch/+/847592 | 10:52 |
iurygregory | good morning Ironic | 11:25 |
iurygregory | dtantsur, JayF hashtags need to be enabled in the ACL of each project | 11:34 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: Remove most prints for unit tests https://review.opendev.org/c/openstack/ironic/+/895269 | 13:01 |
dtantsur | FYI TheJulia ^^^ | 13:01 |
TheJulia | that that will make debugging api stuff harder, but c'est la vie | 13:11 |
opendevreview | Julia Kreger proposed openstack/ironic stable/2023.1: DB: Streamline allocation interactions https://review.opendev.org/c/openstack/ironic/+/895038 | 13:17 |
opendevreview | Julia Kreger proposed openstack/ironic stable/2023.1: DB: Select upon delete for allocations https://review.opendev.org/c/openstack/ironic/+/895039 | 13:21 |
TheJulia | ^ are a result of https://github.com/openstack-k8s-operators/ironic-operator/pull/333#issuecomment-1721103705 | 13:35 |
dtantsur | moar database issues \o/ | 14:04 |
dtantsur | iurygregory, TheJulia, could we try merging https://review.opendev.org/c/openstack/ironic/+/895269 please? We're in a pretty bad place downstream with the new packages failing to build and old packages affected by the sqlite issue | 14:08 |
iurygregory | totally forgot to add the +2, I checked when you mentioned in slack the patch | 14:09 |
dtantsur | also needs a 2nd +2 https://review.opendev.org/c/openstack/bifrost/+/893754 | 14:14 |
TheJulia | dtantsur: well, it was months ago, really | 15:14 |
TheJulia | 2-ish | 15:14 |
opendevreview | Merged openstack/ironic master: Remove most prints for unit tests https://review.opendev.org/c/openstack/ironic/+/895269 | 15:29 |
* TheJulia raises at a stable job which failed on ipmi commands and a master branch job which failed on ipmi commands | 15:55 | |
opendevreview | Merged openstack/bifrost stable/yoga: Remove Fedora from the CI https://review.opendev.org/c/openstack/bifrost/+/893754 | 16:06 |
JayF | dtantsur: I think they set some kind of env var in upstream CI to prevent the error you saw on print() | 16:38 |
JayF | dtantsur: I think the removal of prints is a fine fix; just saying if you still have pain you might wanna go looking | 16:38 |
clarkb | JayF: dtantsur: many projects capture stdout and stderr and logging into fixtures. Some projects got a step further and only save that data on failure which keeps things clean on success but when you fail you get all the debugging info you want | 16:40 |
JayF | TheJulia: dtantsur: iurygregory: stevebaker[m]: Unless one of you squeals loudly otherwise, I'm going to start doing work to cut stable/2023.2 of Ironic projects first thing Monday. Are there any items that need to land before we do that? | 16:46 |
JayF | Release is due in 2 weeks and even though we don't cut RCs I'd prefer not cut it close | 16:46 |
iurygregory | JayF,in two weeks you mean Sep 28? | 16:47 |
iurygregory | or something like this | 16:47 |
JayF | I mean I want to cut the releases literally Monday | 16:47 |
JayF | We have to do it no later than > Final RC deadline: September 28th, 2023 (R-1 week) AIUI | 16:47 |
iurygregory | yeah, I'm asking about the due date in 2 weeks =) | 16:47 |
iurygregory | yup! can we wait till 25? :D | 16:47 |
JayF | Yes but I view that as a deadline for us to have everything already done | 16:47 |
JayF | and I'd strongly prefer not frantically cutting a release with a day or two before it's due | 16:48 |
iurygregory | it would be on the week of 28 which can work I would say | 16:48 |
JayF | iurygregory: I'd strongly prefer *not* putting it off that long, lets talk about it this way: what changes were you hoping to get in that aren't ready? What can me, or others in the community do to help you get them ready sooner? | 16:49 |
iurygregory | the Redfish Firmware is almost there (it updates the firmware, but the clean step says it failed to power off) | 16:49 |
iurygregory | so I'm trying to figure out what is happening there | 16:49 |
iurygregory | see my last comment in https://review.opendev.org/c/openstack/ironic/+/885425 there | 16:50 |
JayF | iurygregory: makes sense, I'd like to have that in as well although we put in cycle highlights we didn't have redfish firmware in this release | 16:50 |
JayF | iurygregory: I thought we talked in IRC yesterday (?) about creating a separate, longer timeout for the power action in the firmware upgrade case only? | 16:50 |
JayF | iurygregory: was that not enough? It still didn't get the power flipped? | 16:51 |
iurygregory | JayF, maybe I missed ? | 16:51 |
JayF | iurygregory: yeah b/c I thought you were talking about extending the existing power timeout | 16:51 |
TheJulia | I think it might have been the day before | 16:51 |
JayF | dtantsur: said not in all cases iirc | 16:51 |
TheJulia | I don’t remember, sinus headache atm | 16:51 |
JayF | and I thought that's where it was left | 16:51 |
iurygregory | so, in the management update_firmware clean step has a wait parameter that would handle this | 16:51 |
iurygregory | but in my case it didn't work (and it uses the same logic .-.) | 16:51 |
JayF | If I were experiencing this set of symptoms, I'd probably be out of band of Ironic querying the redfish | 16:52 |
iurygregory | well, directly to redfish it works | 16:52 |
JayF | because it sounds like you're dealing with hardware weirdness moreso than software weirdness, even if the difference is ... minimal in our world | 16:52 |
TheJulia | I bet it says it is done, and then reboots | 16:52 |
iurygregory | TheJulia, correct | 16:52 |
TheJulia | And then we think it is done, but it is not really really done | 16:52 |
iurygregory | because the iLO needs to restart etc so it takes some time | 16:53 |
JayF | ++ We just need to make the code expect the behavior whatever your test machine is doing, and we can tweak from there as we test against other hardware. | 16:53 |
JayF | iurygregory: so is there no timeout long enough to make it work then? I'm confused why it's an issue if we know that waiting it eventually recovers | 16:53 |
JayF | iurygregory: just a matter of Ironic not respecting the timeout? | 16:53 |
iurygregory | the second, (doesn't seem like ironic is accepting the fact I've passed the timeout and just ignores) | 16:53 |
TheJulia | BMC can drop for minutes afterwards, we should likely expect a power action to fail | 16:53 |
iurygregory | I'm going to test the update via management to see | 16:54 |
JayF | yeah I suspect you'll need to loop with a try/catch | 16:54 |
JayF | and just expect power status calls to fail for $long_time | 16:54 |
JayF | > Sep 13 14:37:20 dl380gen10-u25.dev3 ironic[86542]: 2023-09-13 14:37:20.121 86542 ERROR ironic.conductor.utils oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 79.93 seconds # this in your error logs implies there's a ~60-75 second timeout somewhere that probably needs a zero added lol | 16:55 |
JayF | ah that's node_wait_for_power_state | 16:55 |
JayF | will that take a timeout? | 16:55 |
* JayF is looking at this now | 16:56 | |
* iurygregory brb on a customer call | 16:59 | |
JayF | iurygregory: okay, I have a very specific suggestion | 16:59 |
iurygregory | go ahead, I will look in a few =) | 16:59 |
JayF | https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/deploy_utils.py#L1539 should take a timeout kwarg and pass it thru to manager_utils.node_power_action | 16:59 |
JayF | you should be able to make that long enough to work, then if it does, put that in config for like a [redfish]firmware_update_power_timeout or wherever it'd fit | 17:00 |
iurygregory | interesting, I will tesk | 17:02 |
iurygregory | test* | 17:02 |
JayF | hell that timeout might even be something we want overridable per node, but that can be a follow-on if we need it | 17:07 |
JayF | "this terrible hardware needs 10x as long as my normal gear" sorta stuff | 17:07 |
TheJulia | ++ | 17:12 |
opendevreview | Merged openstack/ironic stable/2023.1: DB: Streamline allocation interactions https://review.opendev.org/c/openstack/ironic/+/895038 | 17:29 |
iurygregory | JayF, I couldn't find the config you mentioned O.o | 17:39 |
JayF | iurygregory: that's because you haven't created it yet | 17:39 |
iurygregory | LOL | 17:39 |
JayF | iurygregory: I'm saying, plumb that kwarg thru, create a config to set arbitrarily high timeouts | 17:39 |
JayF | the default is whatever it takes to get your test to pass :) | 17:39 |
iurygregory | this will be the second thing I will try | 17:40 |
iurygregory | let me test the clean step update_firmware in management =D | 17:40 |
iurygregory | JayF, bad news ... seems like the update_firmware clean step also doesn't work LOL | 19:08 |
* iurygregory cries | 19:08 | |
iurygregory | but the firmware was updated LOL | 19:08 |
iurygregory | https://paste.opendev.org/show/brqU8qBgdeDPuSpfGJfs/ | 19:08 |
JayF | Yep, just gotta get Ironic recognizing it | 19:08 |
JayF | I've been in that house of pain before too :) | 19:09 |
iurygregory | oh jesus | 19:09 |
JayF | that is without the change I suggested, yes? | 19:09 |
iurygregory | correct | 19:09 |
iurygregory | I was testing the clean step we have via management interface | 19:09 |
iurygregory | now let me try to work on what you mentioned | 19:09 |
JayF | ah, yeah, I think some of what you're dealing with is the hardware being funky | 19:09 |
iurygregory | and see how ironic will react | 19:09 |
JayF | so I think you've got sorta two things mixed here: | 19:10 |
JayF | 1) it's extremely likely your update_firmware stuff works on some hardware that exists in the world | 19:10 |
JayF | 2) the machine *you are on* has annoying behavior around reboots taking longer after firmware updates | 19:10 |
iurygregory | yeah | 19:10 |
iurygregory | and ironic doesn't like 2 | 19:10 |
iurygregory | lol | 19:10 |
JayF | so when you solve #2 just try to do it in a way that fixes the clean step, too? | 19:10 |
iurygregory | yeah | 19:10 |
iurygregory | fix the error we get, because it triggers the right thing | 19:11 |
iurygregory | and the bmc does what we asked | 19:11 |
JayF | Yeah, I think that the suggestion from earlier is still the right route | 19:11 |
JayF | you just may have found another place you need to plumb that new config value thru to for making timeouts much longer | 19:11 |
iurygregory | let me read now to understand things and make sure I'm not doing crazy things :D | 19:11 |
JayF | iurygregory: I can jump on a call if you want to walk thru it with words, I think I understand the moving parts | 19:12 |
iurygregory | oh manager_utils.node_power_action already has a timeout arg :D | 19:12 |
JayF | yeah you have to hook it in a layer up | 19:13 |
iurygregory | yeah! | 19:13 |
JayF | in like reboot_for_node_clean | 19:13 |
iurygregory | gotcha | 19:13 |
JayF | or something like that | 19:13 |
JayF | just plumb it back as far as you need for the driver to be able to dictate the timeout | 19:13 |
TheJulia | JayF: thinking about it, it would be nice to get https://review.opendev.org/c/openstack/ironic-python-agent/+/890864 and it's child patch merged before releasing | 19:13 |
iurygregory | time to hack some code in the venv here | 19:13 |
iurygregory | :D | 19:13 |
JayF | and in the redfish case, it should likely be configurable | 19:13 |
JayF | (and per-node configurable but that can be a follow-on if it's hard) | 19:13 |
iurygregory | well, since we have the config wait we could check if there is a value and pass I think | 19:14 |
TheJulia | Iw ouldn't name it reboot_for_node_clean, maybe more "reboot_for_firmware_update" | 19:14 |
JayF | TheJulia: there's already a method | 19:14 |
TheJulia | since we in theory should be able to do on deploy firmware | 19:14 |
JayF | TheJulia: I'm just saying he adds a flag to it | 19:14 |
iurygregory | and we wouldn't add new configs.. | 19:14 |
TheJulia | ahh, okay | 19:14 |
iurygregory | it's already configurable in the clean step | 19:14 |
iurygregory | wdyt? | 19:14 |
JayF | iurygregory: I trust you to know how to plumb that thru; just make sure that an operator could have a machine that takes $ridiculously_long and could still get the firmware update to work | 19:14 |
iurygregory | gotcha :D | 19:15 |
JayF | TheJulia: fwiw we could remove hardware manager versioning support now that we don't support manage_boot=false | 19:20 |
JayF | TheJulia: that was working around a bug that only occurred when an operator changed the image w/o ironic being aware, which is ~impossible nowadays | 19:20 |
JayF | TheJulia: +2 on both of those | 19:21 |
TheJulia | Well, you could do it in standalone easily, but yeah | 19:22 |
TheJulia | it just feels like extra code we don't really need | 19:22 |
JayF | TheJulia: oooh, yeah, because we just point to an http url, there's no guarantee the payload doesn't change | 19:23 |
TheJulia | yup | 19:23 |
JayF | TheJulia: so yeah, we need it, thank you for the correction | 19:23 |
iurygregory | good news it worked | 19:51 |
iurygregory | but the database wasn't updated XD | 19:51 |
iurygregory | now I have a good point where to look for things I think | 19:51 |
iurygregory | tks JayF | 19:51 |
iurygregory | time to get more coffee and go to another meeting .-. | 19:52 |
JayF | woohoo | 19:56 |
JayF | I always say that changing the error is a victory, now you start a new problem :D | 19:56 |
TheJulia | Interesting: https://bugs.launchpad.net/ironic/+bug/2034953 | 20:23 |
JayF | TheJulia: I'll let you respond there; I suspect you have more of an idea of hte moving parts? If you don't wanna I will though :) | 20:26 |
TheJulia | Already responded | 20:26 |
TheJulia | And put it on the etherpad. Realistically, it should be an easy win | 20:27 |
JayF | are you sure? seems like a lot of work, adding two fields to get access to an entire new ecosystem of switch integrations /s | 20:32 |
JayF | :D | 20:32 |
JayF | TheJulia: we are both on board, tagging that rfe-approved and I'll mention it in the next meeting | 20:33 |
iurygregory | ok, I'm finally done with the meetings today | 20:41 |
iurygregory | time to code to forget about the problems they mentioned in the meeting :D | 20:42 |
JayF | iurygregory: maybe before you get started on that, take a look at Julia's IPA service steps? | 20:43 |
JayF | https://review.opendev.org/c/openstack/ironic-python-agent/+/890864 | 20:43 |
iurygregory | let me grab coffee first | 20:43 |
JayF | it's the other thing we've ID'd we want to get in for release, so getting it in the gate today would be nice | 20:43 |
JayF | I'm going to be stepping away in about 15 minutes or so, but will be around a little over the weekend so if we get it in the gate I can kick it thru | 20:44 |
iurygregory | ack | 20:44 |
opendevreview | Merged openstack/ironic stable/2023.1: DB: Select upon delete for allocations https://review.opendev.org/c/openstack/ironic/+/895039 | 20:45 |
TheJulia | so w/r/t that bug about not being able to clean using vmedia | 20:58 |
TheJulia | after deploying with a config drive | 20:58 |
TheJulia | (dhcpless mode) | 20:58 |
TheJulia | ... what if we re-trigger config ourselves | 20:58 |
TheJulia | we can detect the condition, and backport that | 20:58 |
JayF | I don't understand "what is we re-trigger config ourselves" | 20:58 |
TheJulia | the thing would be "hey, $network_thingiee, please start interfaces again" | 20:59 |
JayF | and I'll note that anything that involves evaluating contents of a tenant's unwiped disk is a major security risk | 20:59 |
JayF | so how would we get to your spot without pathing through that? | 20:59 |
TheJulia | no, we can explicitly skip it | 20:59 |
TheJulia | we can detect if it is what we came in with boot wise | 20:59 |
TheJulia | just by path checking | 20:59 |
TheJulia | errrr | 20:59 |
TheJulia | no | 20:59 |
TheJulia | shoot | 20:59 |
JayF | Yeah, and there's another point to consider here too | 20:59 |
TheJulia | well, we know the device, we can still do it | 20:59 |
JayF | any path that is "the first step of cleaning is 'execute random code from configdrive'" is SCARY AF | 21:00 |
TheJulia | we just have to trigger python-glean with the right args and then re-trigger networking | 21:00 |
JayF | (and yes I know that's the current condition) | 21:00 |
JayF | I wonder if we could do something like ... if a config opt is set, IPA boots, if it finds a configdrive it automatically removes it and reboots, or something like that | 21:00 |
TheJulia | I'm fairly sure we have the knowledge of the disk we actually booted from | 21:00 |
JayF | like an implicit clean step of "wipe out configdrive and forbid glean/cloud-init from running" | 21:00 |
JayF | if a flag is set | 21:01 |
JayF | that way we don't just recover from the bad behavior; we avoid it altogether | 21:01 |
TheJulia | i guess, yeah, we could detect the condition, determine what is the local disk config drive, nuke it, and reboot | 21:02 |
TheJulia | servicing could be PAIN | 21:02 |
JayF | Well, detecting the condition is hard because we intentionally support network-data | 21:02 |
JayF | servicing can't be supported in this case | 21:03 |
JayF | not in a secure manner | 21:03 |
TheJulia | I'm still fairly sure we can determine what our intended is | 21:03 |
TheJulia | we've got guarding logic on even considering some other aspects when vmedia booting | 21:03 |
JayF | you sorta understand the defensive approach I think we should take though, right? like for me it's *alarms go off* at the possibility of a configdrive from a tenant being evaluated | 21:03 |
TheJulia | yeah, and I think we can bypass that entirely because we should know which one we should use outright | 21:04 |
TheJulia | just glean doesn't | 21:04 |
TheJulia | but we can re-trigger glean | 21:04 |
JayF | I think we should do that *unconditionally* | 21:04 |
JayF | glean/cloud-init, in an IPA ramdisk, never runs against anything but configdrives it's explicitly told are coming | 21:04 |
JayF | e.g. ipa_configdrive_uuid=XXXX in the kernel cli | 21:04 |
TheJulia | might actually make the experience better then, and they could drop simple-init, just have glean present | 21:04 |
JayF | exactly | 21:05 |
JayF | and we cut off, at the pass, an entire class of tenant-escape-to-prov-network bugs | 21:05 |
TheJulia | bahahahahahaha | 21:06 |
JayF | TheJulia: that's either really good or really bad haha | 21:08 |
TheJulia | we already have detection logic in ipa, it is more so just telling glean "use the supplied iso content explicitly" | 21:08 |
JayF | I'm curious to see what code comes out | 21:10 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: WIP/DNM: Get rid of simple-init (but keep glean) https://review.opendev.org/c/openstack/ironic-python-agent/+/895519 | 22:04 |
TheJulia | an idea | 22:05 |
TheJulia | There really doesnt' seem to support all of the legacy interface/configuration logic to make static files at this point | 22:08 |
TheJulia | and then try to init them | 22:08 |
TheJulia | just restart the networking subsystem and hopefully magic() | 22:08 |
JayF | That doesn't address the actual security concern though AFAICT | 22:09 |
JayF | because it doesn't ever prevent us running against the wrong configdrive in the first place .... right? | 22:10 |
JayF | (or is that going to happen in ipa-b later?) | 22:10 |
TheJulia | Do you remember when I added the logic to detect if the config drive came from the system itself ? | 22:15 |
TheJulia | that would have to be in ipa-b | 22:15 |
TheJulia | at least, prevent it oturight | 22:15 |
TheJulia | outright | 22:15 |
JayF | okay I thought so but when you said IPA already had the detection logic | 22:15 |
JayF | I didn't grok fully what you meant | 22:16 |
JayF | now I know: you just reused the "is it vmedia" method | 22:16 |
JayF | and the other stuff is all outside of ipa | 22:16 |
TheJulia | basically, yeah | 22:16 |
TheJulia | *should* work | 22:16 |
TheJulia | and doesn't seem horribly awful | 22:16 |
TheJulia | Like, I don't think adam savage is going to appear and say "That is horrible!" | 22:17 |
JayF | it feels sorta like what we should've done from the start for this feature tbh | 22:17 |
TheJulia | yeah, we resisted for some reason | 22:17 |
TheJulia | I don't remember anymore | 22:17 |
TheJulia | hind sight is also always 20-20 | 22:18 |
JayF | very much | 22:19 |
TheJulia | I bet we were focused on classical interface init | 22:22 |
TheJulia | and I bet tinycore doesn't have systemd :) | 22:22 |
TheJulia | our vmedia job uses tinycore, but does dhcp by default | 22:25 |
JayF | I'm sure we can work out the kinks; this will be an improvement. Not sure it's upstream-backportable though because of the coordination that'll be required in ramdisk and ipa | 22:27 |
JayF | (a new IPA on a not-updated ramdisk build is likely to break; and I don't think we should assume people will bump both) | 22:28 |
JayF | might even wanna ensure that our downstream packagers know what's up for distros like debian that will build their own ramdisk | 22:28 |
TheJulia | as is, that shouldn't break existing ramdisks or users of it, and prior builders should already have glean | 22:29 |
opendevreview | Merged openstack/ironic-python-agent master: Add get_service_steps logic to the agent https://review.opendev.org/c/openstack/ironic-python-agent/+/890864 | 22:29 |
TheJulia | and if we check if it is there and only then invoke, that seems sane | 22:30 |
TheJulia | the element change for simple-init, or for the base default, that... is the conundrum I guess | 22:30 |
JayF | exactly | 22:30 |
JayF | It's one of those things where I'm not sure the combo of: 1) Cases that this can happen in and 2) Risks from backporting it really is worth it? but I could be wrong | 22:31 |
TheJulia | dib itself, actually, which is where afaik simple-init comes from, is also not stable branched, if you install it, you get the latest | 22:43 |
TheJulia | so inadvertently it might just break upon any change | 22:43 |
JayF | So you'd hide the behavior behind and environment variable, defaulted to the old way, and have ipa-b set that variable | 22:44 |
TheJulia | to be determiend | 22:50 |
clarkb | skimming what you've said I feel like you want something other than simple-init | 22:50 |
clarkb | let simple-init continue to be simple and work for existing users then have a different thing that installs glean in a "trigger glean manually" sort of way alongside simple init | 22:51 |
clarkb | then ipa or whatever can trigger the appropriate scripts when logic dictates rather than udev | 22:51 |
TheJulia | that could also work, as long as we disable it on ramdisks from auto-starting, we should be fine, and even then the basic path should also be able to recover things from the unhappy path | 22:54 |
JayF | 👍 the key is just to make sure that nothing is going to launch and do something with the config drive until IPA explicitly tells it to. | 23:14 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!