Tuesday, 2024-05-21

03:16 <opendevreview> Jacob Anders proposed openstack/sushy-tools master: [WIP] Replace hardcoded BiosVersion with an updatable field  https://review.opendev.org/c/openstack/sushy-tools/+/909487
03:20 <opendevreview> Jacob Anders proposed openstack/sushy-tools master: Replace hardcoded BiosVersion with an updatable field  https://review.opendev.org/c/openstack/sushy-tools/+/909487
08:31 <opendevreview> Ebbex proposed openstack/bifrost master: Refactor the use of include_vars  https://review.opendev.org/c/openstack/bifrost/+/888448
11:22 <iurygregory> good morning Ironic
11:25 <opendevreview> Jacob Anders proposed openstack/sushy-tools master: Replace hardcoded BiosVersion with an updatable field  https://review.opendev.org/c/openstack/sushy-tools/+/909487
11:32 <iurygregory> JayF, hey I just saw I have a 1-1 with my manager today and it will conflict with the presentation <facepalm>
11:51 <iurygregory> yay, 360sec to update firmware is not enough... =X
13:17 <TheJulia> iurygregory: this is an hpe machine, does it have system bios firmware 2024.04?
13:18 <iurygregory> TheJulia, let me get you the info
13:18 <TheJulia> I ask because I have a report of a machine which just came in which is reporting UnableToModifyDuringSystemPOST
13:19 <iurygregory> TheJulia, U30 v2.54 (09/03/2021)
13:19 <iurygregory> at least from redfish this is the info I got for BiosVersion
13:19 <TheJulia> oh well, different cases then
13:21 <iurygregory> but I remember seeing some messages related to POST in the interface: "The server might be in POST. Information on this page is current as of last power off. This information is updated after the server is powered on and POST is complete."
13:21 <iurygregory> but I haven't seen UnableToModifyDuringSystemPOST
13:43 <TheJulia> I've seen it sporadically, like when someone retries introspection or something silly on a machine which was just powered up or rebooting
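
For reference, checking the BiosVersion a BMC reports is a short sushy script; a minimal sketch, assuming placeholder BMC endpoint, credentials, and system path:

    # Minimal sketch: read the BiosVersion a BMC reports over Redfish via sushy.
    # The endpoint, credentials, and system path below are placeholders.
    import sushy

    root = sushy.Sushy('https://bmc.example.com/redfish/v1',
                       username='admin', password='password',
                       verify=False)
    system = root.get_system('/redfish/v1/Systems/1')
    # On the HPE box discussed above this returned "U30 v2.54 (09/03/2021)".
    print(system.bios_version)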
13:49 <iurygregory> I'm re-configuring my bifrost setup to add an HPE machine and test things
15:06 <TheJulia> ok
15:22 <JayF> iurygregory: no big deal, it happens. I'll move forward with just Julia today, if you wanna look over the slides for feedback lmk
15:54 <opendevreview> Julia Kreger proposed openstack/ironic-tempest-plugin master: CI: dial back inspctor voting  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/920113
15:58 <opendevreview> Julia Kreger proposed openstack/ironic-tempest-plugin master: CI: dial back inspector voting  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/920113
17:37 <iurygregory> JayF, ack, happy to provide feedback on the slides
17:56 <TheJulia> Thoughts on bringing back SPUC?
18:04 <dtantsur> never against
18:35 <JayF> I'm always in
18:36 <JayF> iurygregory: https://docs.google.com/presentation/d/1U_I0XC_A5RaZxUotqKGf02hWgMYuPyfSXhirxh9clcI/edit?usp=sharing just a caveat that it's intentionally vague in places so the talk isn't two hours long :D
18:48 <iurygregory> First slide is amazing :D
18:59 <TheJulia> I really like the bear images
18:59 <TheJulia> including the bear on a vacuum
18:59 <JayF> it's just copilot-generated images
18:59 <JayF> I had to focus on content to get the slides completely done, adding more fun bear photos might be a way I spend my evenings while travelling
18:59 <JayF> lol
19:01 <TheJulia> The cleaning bear demands servers to clean
19:01 <JayF> honestly my favorite is the giant bear racking the servers
19:03 <dtantsur> JayF: slide 10, I always make a mistake in this word, but I believe it's dependEnt?
19:03 <JayF> you should have comment access if you wanna leave it inline
19:04 <JayF> both work for the noun form
19:04 <dtantsur> this is why I was not sure whether to leave a comment :)
19:04 <JayF> looks like "e" is usually the adjective, I'll check my usage
19:04 <JayF> e is probably a better fit
19:04 <JayF> that's one of those words I probably spell differently each time I type it and never realized since both are ~valid :D
19:05 <dtantsur> lol true
19:05 <dtantsur> Bear pictures are absolutely great
19:05 <JayF> So I launched a thing today: https://podcast.gr-oss.io or https://youtube.com/@oss-gr under the podcasts tab :D
19:07 <dtantsur> There are 20 slides in total, right? Bloody google is misbehaving, so I want to double-check
19:07 <JayF> 50+
19:08 <dtantsur> Heck
19:08 <JayF> 55
19:10 <dtantsur> Nope, it won't load past 20 :(
19:10 <JayF> dtantsur: you're missing the best slide https://usercontent.irccloud-cdn.com/file/qPAHKGSH/image.png
19:11 <JayF> (rpittau is presenting on metal3 directly beforehand so I am purposefully skipping this)
19:12 <dtantsur> JayF: scary! :D
19:13 <JayF> I started typing the slide and all of a sudden I heard a whisper from the inky shadows telling me to "go"
19:13 <dtantsur> Okay, I got a pdf of the whole thing. No commenting for me though.
19:15 <TheJulia> JayF: by chance from our discussion yesterday, did you create a bug?
19:16 <JayF> going to be honest, I've been in a whirlwind of meetings all morning and forgot about that discussion entirely until now
19:16 <JayF> lol
19:16 <JayF> no bug for it yet, I wasn't sure exactly how/what to bug
19:16 <TheJulia> ahh, okay
19:16 <JayF> it seems like some mix of expected behavior adding on top of expected behavior until the VM melts
19:16 <TheJulia> I'll take care of it then, I have an idea, and maybe some extra logging might also help provide the needed clarity
19:16 <dtantsur> Jira also doesn't open normally. I guess it's a sign for me to stop working past 9pm
19:16 <JayF> Probably root_device_hint of md0 when we create the configuration = boom
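
For illustration, the configuration JayF is hypothesizing about would look roughly like the following; the hint and RAID layout here are illustrative, not taken from the actual bug:

    # Hypothetical illustration of the suspected trigger: a software RAID node
    # whose root_device hint points at the md array IPA itself assembles.
    node = {
        'properties': {
            'root_device': {'name': '/dev/md0'},
        },
        'target_raid_config': {
            'logical_disks': [
                {'size_gb': 'MAX', 'raid_level': '1', 'controller': 'software'},
            ],
        },
    }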
19:17 <TheJulia> dtantsur: I just had to reboot my router, but you're like a third of a planet away so *shrug*
19:19 <dtantsur> WiFi here is so dense that I seem to feel it while walking through the room :D especially after I added a repeater
19:19 <dtantsur> but no, jira loaded in the end, it's just being jira
19:20 <TheJulia> Well, jira not loading is a sign it is time to call it a day :)
19:20 <dtantsur> And that I will :) See you tomorrow folks
19:21 <cid> \o
19:22 <cid> JayF: thorough (to my limited knowledge :D), beautiful images, scanty enough that it's not intimidating.
19:25 <JayF> ty o/
19:44 <opendevreview> cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration  https://review.opendev.org/c/openstack/ironic/+/917229
19:49 <opendevreview> cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration  https://review.opendev.org/c/openstack/ironic/+/917229
21:33 <opendevreview> Julia Kreger proposed openstack/ironic-python-agent master: WIP: Correct ramdisk from re-scanning disks on step step collection  https://review.opendev.org/c/openstack/ironic-python-agent/+/920151
21:34 <opendevreview> Jay Faulkner proposed openstack/ironic master: Add Node Servicing to api-ref  https://review.opendev.org/c/openstack/ironic/+/920152
21:35 <TheJulia> JayF: it wasn't the cache action, it was the evaluate_hardware_support
21:35 <JayF> well, we also recache the node on get_deploy_steps
21:36 <JayF> so we're still doing a *lot* of scanning
21:36 <JayF> I think there's a potentially better fix than the one you offer, curious what you think
21:36 <JayF> dispatch_to_all_managers needs to filter out things with hardware support == 0 (disabled/unsupported)
21:36 <TheJulia> oh, we only trigger it *if* and only *if* what ironic has on the node has changed since the original lookup
21:36 <JayF> and cache that result
21:37 <JayF> then we avoid the call to evaluate_hardware_support at a lower level, and more closely obey the API contract of hardware managers (imo)
21:37 <TheJulia> I can see it in the logs: the node didn't change, it didn't trigger there, it was evaluate, but if it had, it would have been double scanned
21:38 <TheJulia> I dunno, I suspect evaluate_hardware_support is kind of misused
21:38 <JayF> I don't suspect it, I kinda know it
21:38 <TheJulia> and so it got some of these one-off actions
21:38 <TheJulia> one kind of might make sense, even the existing disk scan... maybe
21:38 <JayF> and my suggested change would have the identical result of yours (running evaluate_hardware_support less) in addition to fixing our result for get_*_steps to not include steps that'll never run on a node
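
A rough sketch of the filtering JayF is proposing, with a simplified signature rather than IPA's real one: `managers_with_support` is a hypothetical pre-computed list of (manager, support) pairs cached at load time, so nothing here re-runs evaluate_hardware_support:

    # Rough sketch of the proposed filter: never dispatch to managers whose
    # (cached) evaluate_hardware_support() value was 0, i.e. disabled or
    # unsupported on this hardware.
    def dispatch_to_all_managers(method, managers_with_support, *args, **kwargs):
        responses = {}
        for manager, support in managers_with_support:
            if support == 0:
                continue  # filtered out: this manager can never run here
            func = getattr(manager, method, None)
            if func is not None:
                responses[type(manager).__name__] = func(*args, **kwargs)
        return responses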
21:40 * JayF has an IDE open and 5 minutes, will push a prototype
21:40 <TheJulia> dunno, that feels like a lot of mechanics to change when this seems sort of clean and at less risk of unexpected things
21:40 <TheJulia> cool cool
21:40 <JayF> IMHO the get_*_steps stuff is a pretty gnarly bug, too
21:40 <TheJulia> it might also be because I'm not seeing it
21:43 <JayF> TheJulia: it's because the bug that I thought existed doesn't, and instead another nastier one does.
21:43 <JayF> TheJulia: the managers are supposed to be cached -- line 3287 in get_managers()
21:43 <JayF> both dispatch_to_managers & dispatch_to_all_managers call get_managers()
21:43 <JayF> so theoretically, we should have those results *cached already* and evaluate_hardware_support should not run again
21:43 <JayF> the fact it does indicates to me that caching in get_managers() is busted
21:45 <JayF> the caching: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3287
21:45 <JayF> we call get_managers: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3335
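
For context, the intended pattern here is module-level memoization, roughly like this simplified, self-contained sketch (ExampleHardwareManager and load_managers stand in for IPA's real managers and entry-point loading; this is not the literal hardware.py code):

    # Simplified sketch of the caching get_managers() is meant to provide:
    # evaluate_hardware_support() runs once per manager per process, and both
    # dispatch helpers reuse the cached list on every later call.
    _global_managers = None

    class ExampleHardwareManager:
        """Stand-in for a real IPA hardware manager."""
        def evaluate_hardware_support(self):
            return 1  # placeholder support level

        def get_clean_steps(self, node, ports):
            return []

    def load_managers():
        # Stand-in for IPA's real stevedore/entry-point loading.
        return [ExampleHardwareManager()]

    def get_managers():
        global _global_managers
        if _global_managers is None:
            _global_managers = [m for m in load_managers()
                                if m.evaluate_hardware_support() > 0]
        return _global_managers

    def dispatch_to_managers(method, *args, **kwargs):
        for manager in get_managers():  # cache hit after the first call
            if hasattr(manager, method):
                return getattr(manager, method)(*args, **kwargs)
        raise RuntimeError('no manager implements %s' % method)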
21:48 * JayF put these notes on your proposed change
21:48 <JayF> I suspect this is a side-effect of get_deploy_steps being synchronous
21:51 <TheJulia> ... by design, it is supposed to be
21:52 <JayF> well I'm saying like, we execute that command in the context of the wsgi server
21:52 <JayF> which I wonder if it makes the execution path different [hand waves]
21:52 <JayF> that last line is me guessing at a root cause without enough data (yet)
21:52 <TheJulia> yes, likely, and which means my fix might only kick in on the second call
21:52 <TheJulia> (might*
21:52 <JayF> but I'm fairly certain about what I'm describing being the root cause bad behavior
21:53 <JayF> and fixing it would likely help agent in more ways than just this
21:53 <TheJulia> I don't see how given the call flow
21:53 <TheJulia> unless the intent with dispatch is to only ever call once
21:53 <TheJulia> and if that is really truly the case, then it has been broken for years
21:53 <JayF> > TheJulia> I don't see how given the call flow
21:53 <JayF> I don't follow
21:53 <TheJulia> or never worked as expected
21:53 <JayF> **or** the behavior changed with newer eventlet or the removal of greendns
21:53 <TheJulia> quite possible as well
21:53 <JayF> I never trust eventlet to behave sanely in IPA because we only partially monkey-patch
21:54 <TheJulia> the flow, as I read it: because we need to get the steps, we still end up triggering explicit calls to load and evaluate; they are not coded today as one-time operations, but as operations to be triggered across the board
21:54 <JayF> zoom out from that and zoom in just to the flow between dispatch_to{,_all}_managers and get_managers
21:55 <TheJulia> okay
21:55 <JayF> that is where the bad interaction is AIUI
21:55 <JayF> because _global_managers /should be/ keeping that evaluate_hardware_support() call to *exactly once*
21:55 <JayF> and there's some reason I don't grok yet why not
21:56 <opendevreview> cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration  https://review.opendev.org/c/openstack/ironic/+/917229
21:57 <TheJulia> ahh
21:57 <TheJulia> hmmm yes, it *should*
21:58 <TheJulia> ... I guess my worry is "fixing" it breaks things even more
21:58 <TheJulia> oh
21:58 <JayF> We're going to be changing a lot of code in IPA to excise eventlet, we can't be afraid of this kinda structural change IMO
21:58 <TheJulia> no
21:58 <TheJulia> once per invocation
21:58 <TheJulia> it is getting invoked every single time it is called
21:58 <JayF> _global_managers is initialized in module scope
21:59 <JayF> hardware.py line 54
21:59 <JayF> (yes?)
21:59 <TheJulia> https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3504
21:59 <TheJulia> but it only checks in response
21:59 <JayF> oh hell
22:00 <TheJulia> s/response/manager/
22:00 <TheJulia> with getattr, which makes sense for what it is doing
22:01 <JayF> so I suggest we add the value of the hw support number provided by the manager in _global_managers
22:01 <JayF> add it to the dict response, maybe optionally if needed for backwards compat, from get_managers()
22:01 <TheJulia> shoot
22:01 <TheJulia> we need a dry run call or something
22:02 <TheJulia> not execute
22:02 <TheJulia> jeeeeeeze
22:02 <JayF> then refactor that code to use get_managers() directly
22:02 <JayF> so it gets a cached copy
22:02 <JayF> basically this is just a cache miss that's killing us
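
A sketch of the shape of the fix being discussed: cache each manager together with the support value it reported, so nothing downstream re-runs evaluate_hardware_support. Names are illustrative, not the merged patch; load_managers() is the same hypothetical loader as in the earlier sketch:

    # Illustrative sketch of the fix under discussion (not the merged patch):
    # remember each manager's hardware-support value at load time so later
    # callers read the cached number instead of re-running the evaluation.
    _global_managers = None

    def get_managers_detail():
        """Return [{'manager': obj, 'support': int}, ...], evaluating once."""
        global _global_managers
        if _global_managers is None:
            entries = []
            for mgr in load_managers():
                support = mgr.evaluate_hardware_support()
                if support > 0:  # drop disabled/unsupported managers entirely
                    entries.append({'manager': mgr, 'support': support})
            # highest support value should win dispatch, so sort descending
            entries.sort(key=lambda entry: entry['support'], reverse=True)
            _global_managers = entries
        return _global_managers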
22:03 * JayF is whipping a thing up
22:03 <TheJulia> it is not that the method even really needs to be called, but that it triggers the other actions further down the rabbit hole
22:04 <JayF> if it's explicitly calling it to get a side effect, we shouldn't have merged that code IMO
22:04 <JayF> I sure as hell hope we aren't relying on that behavior, if so it's extra gross
22:04 <JayF> our documentation says it gets called exactly once iirc
22:08 <TheJulia> JayF: I think it getting merged is a side effect of a lack of lower-level understanding of the dispatch logic; once it suddenly stopped working the expected way and became, as you frame it, a cache miss, this problem got created
22:09 <JayF> well, I'm going to propose a fix which gets rid of the side effect, and then maybe root out the fallout from that and go from there
22:09 <JayF> I am -1 to making IPA have more workaround logic because we can't follow our own API spec
22:09 <TheJulia> sounds good
22:10 <TheJulia> I agree, now that I better understand what you were talking about
22:10 <JayF> (to be clear: I've always reviewed almost all changes into IPA; not meaning this as blamey-annoyance towards anyone; I have no idea who wrote this but there's a great chance I blindly +2'd it)
22:10 <TheJulia> well, if it cached, it is safe
22:10 <TheJulia> if not, well then....
22:11 <JayF> like, I'm going to cache and return the number to it
22:11 <JayF> but if the code relies on some down-the-call-stack hwm getting evaluate_hardware_support called "N" times
22:11 <JayF> that's not going to happen anymore
22:17 <JayF> heh
22:17 <JayF> get_managers has a double-call of evaluate_hardware_support, too
22:31 <opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times  https://review.opendev.org/c/openstack/ironic-python-agent/+/920153
22:32 <JayF> TheJulia: ^ I haven't touched unit tests yet, but I'm curious to see what tempest thinks about this
22:32 <TheJulia> only 9001 :)
22:33 <opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times  https://review.opendev.org/c/openstack/ironic-python-agent/+/920153
22:33 <JayF> I noticed I swapped the sort key, would've been sorted in reverse prio order, that would've been fun :|
22:33 <TheJulia> feels like 9002 now ;)
22:33 <JayF> how's the structure of that? I think I like it
22:33 <TheJulia> yeah, so very not fun :)
22:34 <JayF> we resort on every call to _get_managers but I think that's probably approximately zero python time
22:34 <JayF> sorting a dict of probably <10 entries! Oh no!
22:35 <TheJulia> I am wrapping up my day, but at a glance, without really digging deep into it, I think it makes basic sense; I guess all that really remains at the moment is to see how much explodes, or not
22:35 <JayF> I suspect this is going to have a measurable impact on CI runtimes
22:35 <JayF> given how slow our VMs are
22:35 <JayF> if you think it looks vaguely shaped correctly, I'll get unit tests passing if I can in my last 25 minutes
22:36 <TheJulia> yeah, actual execution logs are sort of what I'm interested in seeing
22:36 <TheJulia> which will likely be tomorrow()
22:37 <opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times  https://review.opendev.org/c/openstack/ironic-python-agent/+/920153
22:40 <JayF> yeah, changes in dedupe steps are wrong, I'll try to have something working well enough for CI before I leave
22:41 <TheJulia> okay, have a good evening when you finally call it a day
22:43 <JayF> o/
22:48 <opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times  https://review.opendev.org/c/openstack/ironic-python-agent/+/920153
23:08 <opendevreview> Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times  https://review.opendev.org/c/openstack/ironic-python-agent/+/920153
23:08 <JayF> unit tests pass locally on this now, although it wouldn't be unreasonable for a reviewer to ask for more
23:08 <JayF> nn o/
23:52 <NobodyCam> Good Afternoon, OpenStack folks. As usual, another off-the-wall question: is there an ironic conductor decommission guide? I'm looking to remove several ironic conductors from one of our regions
23:54 <TheJulia> NobodyCam: guide, no. But question before I answer: are there other conductors in the conductor group?
23:55 <NobodyCam> I have both, a few that are not in a group and about 10 conductor groups with 2 conductors each
23:55 <TheJulia> Okay, so if there is another conductor alive, you can just stop the process; newer versions of ironic ensure work is drained out, fwiw
23:56 <NobodyCam> I am manually migrating all the nodes off them
23:56 <TheJulia> When you have the last conductor in a conductor group, you must ensure *all* nodes have been moved out of the conductor group
23:56 <NobodyCam> +++
23:56 <TheJulia> yeah, hash ring updates, because it will change the query as soon as it is noticed by the other conductors
23:58 <TheJulia> migration out should just be changing the conductor_group value
23:59 <NobodyCam> once all nodes are removed, then it's just power off; how about the nova compute
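
For reference, that conductor_group migration is a single field update per node; a minimal sketch using openstacksdk, where the cloud name, node UUID, and group name are placeholders (equivalent to `baremetal node set <node> --conductor-group <group>` on the CLI):

    # Minimal sketch: move a node to another conductor group ahead of
    # decommissioning its conductor. Cloud name, node UUID, and group name
    # below are placeholders.
    import openstack

    conn = openstack.connect(cloud='mycloud')  # credentials from clouds.yaml
    conn.baremetal.update_node('NODE-UUID', conductor_group='surviving-group')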
