opendevreview | Jacob Anders proposed openstack/sushy-tools master: [WIP] Replace hardcoded BiosVersion with an updatable field https://review.opendev.org/c/openstack/sushy-tools/+/909487 | 03:16 |
opendevreview | Jacob Anders proposed openstack/sushy-tools master: Replace hardcoded BiosVersion with an updatable field https://review.opendev.org/c/openstack/sushy-tools/+/909487 | 03:20 |
opendevreview | Ebbex proposed openstack/bifrost master: Refactor the use of include_vars https://review.opendev.org/c/openstack/bifrost/+/888448 | 08:31 |
iurygregory | good morning Ironic | 11:22 |
opendevreview | Jacob Anders proposed openstack/sushy-tools master: Replace hardcoded BiosVersion with an updatable field https://review.opendev.org/c/openstack/sushy-tools/+/909487 | 11:25 |
iurygregory | JayF, hey I just saw I have a 1-1 with my manager today and it will conflict with the presentation <facepalm> | 11:32 |
iurygregory | yay 360sec to update firmware is not enough... =X | 11:51 |
TheJulia | iurygregory: this is an hpe machine, does it have system bios firmware 2024.04 ? | 13:17 |
iurygregory | TheJulia, let me get you the info | 13:18 |
TheJulia | I ask because I have a report of a machine which just came in which is reporting UnableToModifyDuringSystemPOST | 13:18 |
iurygregory | TheJulia, U30 v2.54 (09/03/2021) | 13:19 |
iurygregory | at least from redfish this is the info I got for BiosVersion | 13:19 |
TheJulia | oh well, different cases then | 13:19 |
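(For context, a minimal sketch of how the BiosVersion field quoted above can be read over Redfish with the sushy library; the BMC URL, system path, and credentials are placeholders, not taken from this log.)

```python
# Illustrative only: reading the Redfish BiosVersion field.
# Endpoint, credentials, and the system path are placeholders.
import sushy

root = sushy.Sushy('https://bmc.example.com/redfish/v1',
                   username='admin', password='secret', verify=False)
system = root.get_system('/redfish/v1/Systems/1')
print(system.bios_version)  # e.g. "U30 v2.54 (09/03/2021)"
```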
iurygregory | but I remember seeing some messages related to POST in the interface: "The server might be in POST. Information on this page is current as of last power off. This information is updated after the server is powered on and POST is complete." | 13:21 |
iurygregory | but I haven't seen UnableToModifyDuringSystemPOST | 13:21 |
TheJulia | I've seen it sporadically, like when someone retries introspection or something silly on a machine which was just powered up or rebooting | 13:43 |
iurygregory | I'm re-configuring my bifrost setup to add an HPE machine and test things | 13:49 |
TheJulia | ok | 15:06 |
JayF | iurygregory: no big deal, it happens. I'll move forward with just Julia today, if you wanna look over the slides for feedback lmk | 15:22 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: CI: dial back inspctor voting https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/920113 | 15:54 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: CI: dial back inspector voting https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/920113 | 15:58 |
iurygregory | JayF, ack, happy to provide feedback in the slides | 17:37 |
TheJulia | Thoughts on bringing back SPUC? | 17:56 |
dtantsur | never against | 18:04 |
JayF | I'm always in | 18:35 |
JayF | iurygregory: https://docs.google.com/presentation/d/1U_I0XC_A5RaZxUotqKGf02hWgMYuPyfSXhirxh9clcI/edit?usp=sharing just a caveat that it's intentionally vague in places so the talk isn't two hours long :D | 18:36 |
iurygregory | First slide is amazing :D | 18:48 |
TheJulia | I really like the bear images | 18:59 |
TheJulia | including the bear on a vacuum | 18:59 |
JayF | it's just copilot-generated images | 18:59 |
JayF | I had to focus on content to get the slides completely done, adding more fun bear photos might be a way I spend my evenings while travelling | 18:59 |
JayF | lol | 18:59 |
TheJulia | The cleaning bear demands servers to clean | 19:01 |
JayF | honestly my favorite is the giant bear racking the servers | 19:01 |
dtantsur | JayF: slide 10, I always make a mistake in this word, but I believe it's dependEnt? | 19:03 |
JayF | you should have comment access if you wanna leave it inline | 19:03 |
JayF | both work for the noun form | 19:04 |
dtantsur | this is why I was not sure whether to leave a comment :) | 19:04 |
JayF | looks like "e" is usually the adjective, I'll check my usage | 19:04 |
JayF | e is probably a better fit | 19:04 |
JayF | that's one of those words I probably spell differently each time I type it and never realized since both are ~valid :D | 19:04 |
dtantsur | lol true | 19:05 |
dtantsur | Bear pictures are absolutely great | 19:05 |
JayF | So I launched a thing today: https://podcast.gr-oss.io or https://youtube.com/@oss-gr under the podcasts tab :D | 19:05 |
dtantsur | There are 20 slides in total, right? Bloody google is misbehaving, so I want to double-check | 19:07 |
JayF | 50+ | 19:07 |
dtantsur | Heck | 19:08 |
JayF | 55 | 19:08 |
dtantsur | Nope, it won't load past 20 :( | 19:10 |
JayF | dtantsur: you're missing the best slide https://usercontent.irccloud-cdn.com/file/qPAHKGSH/image.png | 19:10 |
JayF | (rpittau is presenting on metal3 directly beforehand so I am purposefully skipping this) | 19:11 |
dtantsur | JayF: scary! :D | 19:12 |
JayF | I started typing the slide and all the sudden I heard a whisper from the inky shadows telling me to "go" | 19:13 |
dtantsur | Okay, I got a pdf of the whole thing. No commenting for me though. | 19:13 |
TheJulia | JayF: by chance from our discussion yesterday, did you create a bug? | 19:15 |
JayF | going to be honest, I've been in a whirlwind of meetings all morning and forgot about that discussion entirely until now | 19:16 |
JayF | lol | 19:16 |
JayF | no bug for it yet, I wasn't sure exactly how/what to bug | 19:16 |
TheJulia | ahh, okay | 19:16 |
JayF | it seems like some mix of expected behavior adding on top of expected behavior until the VM melts | 19:16 |
TheJulia | I'll take care of it then, I have an idea, and maybe some extra logging might also help provide the needed clarity | 19:16 |
dtantsur | Jira also doesn't open normally. I guess it's a sign for me to stop working past 9pm | 19:16 |
JayF | Probably root_device_hint of md0 when we create the configuration = boom | 19:16 |
TheJulia | dtantsur: I just had to reboot my router, but you're like a third of a planet away so *shrug* | 19:17 |
dtantsur | WiFi here is so dense that I seem to feel it while walking through the room :D especially after I added a repeater | 19:19 |
dtantsur | but no, jira loaded in the end, it's just being jira | 19:19 |
TheJulia | Well, jira not loading is a sign it is time to call it a day :) | 19:20 |
dtantsur | And that I will :) See you tomorrow folks | 19:20 |
cid | \o | 19:21 |
cid | JayF: thorough (to my limited knowledge :D), beautiful images, scanty enough that it's not intimidating. | 19:22 |
JayF | ty o/ | 19:25 |
opendevreview | cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration https://review.opendev.org/c/openstack/ironic/+/917229 | 19:44 |
opendevreview | cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration https://review.opendev.org/c/openstack/ironic/+/917229 | 19:49 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: WIP: Correct ramdisk from re-scanning disks on step step collection https://review.opendev.org/c/openstack/ironic-python-agent/+/920151 | 21:33 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Add Node Servicing to api-ref https://review.opendev.org/c/openstack/ironic/+/920152 | 21:34 |
TheJulia | JayF: it wasn't the cache action, it was the evaluate_hardware_support | 21:35 |
JayF | well, we also recache the node on get_deploy_steps | 21:35 |
JayF | so we're still doing a *lot* of scanning | 21:36 |
JayF | I think there's a potentially better fix than the one you offer, curious what you think | 21:36 |
JayF | dispatch_to_all_managers need to filter out things with hardware support == 0 (disabled/unsupported) | 21:36 |
TheJulia | oh, we only trigger it *if* and only *if* what ironic has on the node has changed since the original lookup | 21:36 |
JayF | and cache that result | 21:36 |
JayF | then we avoid the call to evaluate_hardware_support at a lower level, and more closely obey the API contract of hardware managers (imo) | 21:37 |
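(A rough sketch of the filtering JayF proposes here, assuming a cache that already holds each manager's evaluate_hardware_support() result; the names and the (manager, support) cache shape are illustrative, not the actual ironic-python-agent code.)

```python
# Skip managers that reported hardware support 0 when collecting steps, so
# get_*_steps never returns steps that could not run on the node.
HW_SUPPORT_NONE = 0


def collect_steps(cached_managers, step_method):
    """cached_managers: iterable of (manager, cached_support_value) pairs."""
    steps = []
    for manager, support in cached_managers:
        if support == HW_SUPPORT_NONE:
            continue  # disabled/unsupported manager contributes nothing
        collector = getattr(manager, step_method, None)
        if collector is not None:
            steps.extend(collector())
    return steps
```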
TheJulia | I can see it in the logs, the node didn't change, it didn't trigger there, it was evaluate, but if it did, it would have been double scanned | 21:37 |
TheJulia | I dunno, I suspect evaluate_hardware_support is kind of misused | 21:38 |
JayF | I don't suspect it, I kinda know it | 21:38 |
TheJulia | and so it got some of these one off actions | 21:38 |
TheJulia | one kind of might make sense, even the existing disk scan... maybe | 21:38 |
JayF | and my suggested change would have the identical result of yours (running evaluate_hardware_support less) in addition to fixing our result for get_*_steps to not include steps that'll never run on a node | 21:38 |
* JayF has an IDE open and 5 minutes, will push a prototype | 21:40 | |
TheJulia | dunno, that feels like a lot of mechanics to change when this seems sort of clean and at less risk of unexpected things | 21:40 |
TheJulia | cool cool | 21:40 |
JayF | IMHO the get_*_steps stuff is a pretty gnarly bug, too | 21:40 |
TheJulia | it might also be because I'm not seeing it | 21:40 |
JayF | TheJulia: it's because the bug that I thought existed doesn't, and instead another nastier one does. | 21:43 |
JayF | TheJulia: the managers are supposed to be cached -- line 3287 in get_managers() | 21:43 |
JayF | both dispatch_to_managers & dispatch_to_all_managers call get_managers() | 21:43 |
JayF | so theoretically, we should have those results *cached already* and evaluate_hardware_support should not run again | 21:43 |
JayF | the fact it is indicates to me that caching in get_managers() is busted | 21:43 |
JayF | the caching: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3287 | 21:45 |
JayF | we call get_managers: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3335 | 21:45 |
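(The pattern being described, reduced to a sketch: a module-level cache that keeps evaluate_hardware_support() to exactly one call per process, where a cache miss on any path re-triggers the expensive, disk-scanning evaluation. Symbol names are illustrative, not the real hardware.py ones.)

```python
_cached_managers = None  # module scope, like the _global_managers discussed


def get_managers(discover_managers):
    """discover_managers: callable returning every loaded manager instance."""
    global _cached_managers
    if _cached_managers is None:
        # evaluate_hardware_support() is the expensive call; with the cache
        # working as intended, it only ever runs on this first pass
        _cached_managers = [m for m in discover_managers()
                            if m.evaluate_hardware_support() > 0]
    return _cached_managers
```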
* JayF put these notes on your proposed change | 21:48 | |
JayF | I suspect this is a side-effect of get_deploy_steps being synchronous | 21:48 |
TheJulia | ... by design, it is supposed to be | 21:51 |
JayF | well I'm saying like, we execute that command in context of the wsgi server | 21:52 |
JayF | which I wonder if it makes the execution path different [hand waves] | 21:52 |
JayF | that last line is me guessing at a root cause without enough data (yet) | 21:52 |
TheJulia | yes, likely, and which means my fix might only kick in on the second call | 21:52 |
TheJulia | (might* | 21:52 |
JayF | but I'm fairly certain about what I'm describing being the root cause bad behavior | 21:52 |
JayF | and fixing it would likely help agent in more ways than just this | 21:53 |
TheJulia | I don't see how given the call flow | 21:53 |
TheJulia | unless the intent with dispatch is only ever call once | 21:53 |
TheJulia | and if that is really truly the case, then it has been broken for years | 21:53 |
JayF | > TheJulia> I don't see how given the call flow | 21:53 |
JayF | I don't follow | 21:53 |
TheJulia | or never worked as expected | 21:53 |
JayF | **or** the behavior changed with newer eventlet or the removal of greendns | 21:53 |
TheJulia | quite possible as well | 21:53 |
JayF | I never trust eventlet to behave sanely in IPA because we only partially monkey patch | 21:53 |
TheJulia | the flow, as I read it: because we need to get the steps, we still end up triggering explicit calls to load and evaluate; they are not coded today as one-time operations, but as operations to be triggered across the board | 21:54 |
JayF | zoom out from that and zoom in just to the flow between dispatch_to{,_all}_managers and get_managers | 21:54 |
TheJulia | okay | 21:55 |
JayF | that is where the bad interaction is AIUI | 21:55 |
JayF | because _global_managers /should be/ keeping that evaluate_hardware_support() call to *exactly once* | 21:55 |
JayF | and there's some reason I don't grok yet why not | 21:55 |
opendevreview | cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration https://review.opendev.org/c/openstack/ironic/+/917229 | 21:56 |
TheJulia | ahh | 21:57 |
TheJulia | hmmm yes, it *should* | 21:57 |
TheJulia | ... I guess my worry is "fixing" it breaks things even more | 21:58 |
TheJulia | oh | 21:58 |
JayF | We're going to be changing a lot of code in IPA to excise eventlet, we can't be afraid of this kinda structural change IMO | 21:58 |
TheJulia | no | 21:58 |
TheJulia | once per invocation | 21:58 |
TheJulia | it is getting invoked every single time it is called | 21:58 |
JayF | _global_managers is initialized in module scope | 21:58 |
JayF | hardware.py line 54 | 21:59 |
JayF | (yes?) | 21:59 |
TheJulia | https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3504 | 21:59 |
TheJulia | but it only checks in response | 21:59 |
JayF | oh hell | 21:59 |
TheJulia | s/response/manager/ | 22:00 |
TheJulia | with getattr, which makes sense for what it is doing | 22:00 |
JayF | so I suggest we add the value of the hw support number provided by the manager in _global_managers | 22:01 |
JayF | add it to the dict response, maybe optionally if needed for backwards compat, from get_managers() | 22:01 |
TheJulia | shoot | 22:01 |
TheJulia | we need a dry run call or something | 22:01 |
TheJulia | not execute | 22:02 |
TheJulia | jeeeeeeze | 22:02 |
JayF | then refactor that code to use get_managers() directly | 22:02 |
JayF | so it gets a cached copy | 22:02 |
JayF | basically this is just a cache miss that's killing us | 22:02 |
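(A sketch of the fix JayF is floating, under the assumption that get_managers() now returns entries carrying the cached support value: code further down the stack, like the check around hardware.py line 3504, looks the number up instead of calling evaluate_hardware_support() again. The entry shape is made up for illustration.)

```python
def support_for(manager, cached_entries):
    """cached_entries: e.g. [{'manager': mgr, 'support': 3}, ...] as returned
    by a get_managers()-style call that evaluated support exactly once."""
    for entry in cached_entries:
        if entry['manager'] is manager:
            return entry['support']
    return 0  # unknown manager: treat as unsupported rather than re-evaluate
```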
* JayF is whipping a thing up | 22:03 | |
TheJulia | it is not that the method even really needs to be called, but that it triggers the other actions further down the rabbit hole | 22:03 |
JayF | if it's explicitly calling it to get a side effect, we shouldn't have merged that code IMO | 22:04 |
JayF | I sure as hell hope we aren't relying on that behavior, if so it's extra gross | 22:04 |
JayF | our documentation says it gets called exactly once iirc | 22:04 |
TheJulia | JayF: I think it getting merged is a side effect of a lack of lower-level understanding of the dispatch logic; once it suddenly stops working the expected way (a cache miss, as you frame it), this problem gets created | 22:08 |
JayF | well, I'm going to propose a fix which gets rid of the side effect, and then maybe root out the fallout from that and go from there | 22:09 |
JayF | I am -1 to making IPA have more workaround logic because we can't follow our own API spec | 22:09 |
TheJulia | sounds good | 22:09 |
TheJulia | I agree, now that I better understand what you were talking about | 22:10 |
JayF | (to be clear: I've always reviewed almost all changes into IPA; not meaning this as blamey-annoyance towards anyone; I have no idea who wrote this but there's a great chance I blindly +2'd it) | 22:10 |
TheJulia | well, if it cached, it is safe | 22:10 |
TheJulia | if not, well then.... | 22:10 |
JayF | like, I'm going to cache and return the number to it | 22:11 |
JayF | but if the code relies on some down-the-call-stack hwm getting evaluate_hardware_support called "N" times | 22:11 |
JayF | that's not going to happen anymore | 22:11 |
JayF | heh | 22:17 |
JayF | get_managers has a double-call of evaluate_hardware_support, too | 22:17 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:31 |
JayF | TheJulia: ^ I haven't touched unit tests yet, but I'm curious to see what tempest thinks about this | 22:32 |
TheJulia | only 9001 :) | 22:32 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:33 |
JayF | I noticed I swapped the sort key, would've been sorted in reverse prio order, that would've been fun :| | 22:33 |
TheJulia | feels like 9002 now ;) | 22:33 |
JayF | how's the structure of that? I think I like it | 22:33 |
TheJulia | yeah, so very not fun :) | 22:33 |
JayF | we resort on every call to _get_managers but I think that's probably approximately zero python time | 22:34 |
JayF | sorting a dict of probably <10 entries! Oh no! | 22:34 |
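(A tiny illustration of the sort being discussed, with made-up support values: managers should be dispatched in descending hardware-support order, so a swapped key or reverse flag silently inverts the priority, as JayF noted above.)

```python
support_by_manager = {'GenericHardwareManager': 1, 'ExampleVendorManager': 3}
ordered = sorted(support_by_manager.items(),
                 key=lambda item: item[1], reverse=True)
# -> [('ExampleVendorManager', 3), ('GenericHardwareManager', 1)]
```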
TheJulia | I am wrapping up my day, but at a glance, without really digging deep into it, I think it makes basic sense. I guess all that really remains at the moment is to see how much explodes, or not | 22:35 |
JayF | I suspect this is going to have a measurable impact on CI runtimes | 22:35 |
JayF | given how slow our VMs are | 22:35 |
JayF | if you think it looks vaguely shaped correctly, I'll get unit tests passing if I can in my last 25 minutes | 22:35 |
TheJulia | yeah, actual execution logs are sort of what I'm interested in seeing | 22:36 |
TheJulia | which will likely be tomorrow() | 22:36 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:37 |
JayF | yeah changes in dedupe steps are wrong, I'll try to have something working well enough for CI before I leave | 22:40 |
TheJulia | okay, have a good evening when you finally call it a day | 22:41 |
JayF | o/ | 22:43 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:48 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 23:08 |
JayF | unit tests pass locally on this now, although it wouldn't be unreasonable for a reviewer to ask for more | 23:08 |
JayF | nn o/ | 23:08 |
NobodyCam | Good Afternoon, OpenStack folks. As usual, another off-the-wall question: is there an ironic conductor decommission guide? I'm looking to remove several ironic conductors from one of our regions | 23:52 |
TheJulia | NobodyCam: guide, no. But question before I answer, are there other conductors in the conductor group? | 23:54 |
NobodyCam | I have both, a few that are not in a group and about 10 conductor groups with 2 conductors each | 23:55 |
TheJulia | Okay, so if there is another conductor alive, you can just stop the process, newer versions of ironic ensure work is drained out, fwiw | 23:55 |
NobodyCam | I am manually migrating all the nodes off them | 23:56 |
TheJulia | When you have the last conductor in a conductor group, you must ensure *all* nodes have been moved out of the conductor group | 23:56 |
NobodyCam | +++ | 23:56 |
TheJulia | yeah, hash ring updates, because it will change the query as soon as it is noticed by the other conductors | 23:56 |
TheJulia | migration out should just be changing the conductor_group value | 23:58 |
NobodyCam | once all nodes are removed, then it's just power off; how about the nova compute? | 23:59 |
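(A hedged sketch of the drain step TheJulia describes above, using openstacksdk; the cloud name and conductor group names are placeholders. It only covers re-homing nodes out of the group being retired; the conductor process itself is stopped afterwards, once the hash ring has settled.)

```python
import openstack

conn = openstack.connect(cloud='my-cloud')  # assumes a configured clouds.yaml

# Move every node still assigned to the retiring group onto a surviving one
# by updating conductor_group, as described above.
for node in conn.baremetal.nodes(details=True):
    if node.conductor_group == 'rack-to-retire':
        conn.baremetal.update_node(node, conductor_group='rack-to-keep')
```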