Tuesday, 2025-06-03

TheJuliarm_work: ... I guess it would sort of depend on what the interface to convey it across is04:35
TheJuliato the UEFI firmware04:35
TheJuliareally, I'd expect a runtime loader as an intermediary like ipxe or grub, but yeah04:36
TheJuliaJayF: pass in a config drive, yeah04:36
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic stable/2025.1: Control port updates with update_pxe_enabled flag  https://review.opendev.org/c/openstack/ironic/+/95163108:30
opendevreviewcid proposed openstack/python-ironicclient master: Cast string boolean from CLI  https://review.opendev.org/c/openstack/python-ironicclient/+/95160009:33
*** masghar is now known as Mahnoor09:39
*** Mahnoor is now known as masghar09:40
opendevreviewQueensly Kyerewaa Acheampongmaa proposed openstack/sushy-tools master: Add PATCH support for Redfish DateTime fields in Manager resource  https://review.opendev.org/c/openstack/sushy-tools/+/95092511:10
dtantsurTheJulia: yeah, there is nothing so critical on the conductor itself, it's more about RPC and API11:50
TheJuliaoh yeah, definitely12:08
TheJuliaAlso, good morning!12:09
dtantsurmorning!12:09
TheJuliait feels like api and to a similar extent rpc are also sort of blocked though. :\12:13
TheJuliaWell, https://review.opendev.org/c/openstack/ironic/+/951054 :)12:19
TheJuliastevebaker[m]: you mentioned recently that you noticed an issue shutting down the conductor. Did you have the jsonrpc service running?12:22
jrosserthe nova driver for ironic doesn't implement `get_host_uptime`, which causes quite a few `NotImplementedError` exceptions in my ironic nova-compute instance - is that known about?12:32
TheJulialast call on reviews of https://review.opendev.org/c/openstack/networking-generic-switch/+/95102613:10
TheJuliajrosser: known, yeah. I think we tried to get nova to explicitly treat ironic's driver specially and just ignore the error, but I think they chose to keep the exception surfacing in the log.13:11
TheJuliaThere is not a parallel line to draw to real hardware which has been deployed.13:12
jrosserTheJulia: thanks for the info, that's a shame to leave the exceptions in the log13:21
TheJuliaYeah, what they want is like a process aliveness time for the compute service if memory serves13:22
TheJuliaand... yeah, that is not how the virt driver is designed.13:22
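For context on why the exceptions appear at all: the nova virt driver base class defines get_host_uptime() and simply raises when a driver does not override it. Roughly (a paraphrase, not an exact copy of nova's code):

    # Paraphrase of the nova ComputeDriver contract; any driver that does not
    # override get_host_uptime() surfaces NotImplementedError to its caller.
    class ComputeDriver:
        def get_host_uptime(self):
            """Return the uptime of the compute host."""
            raise NotImplementedError()

Since the ironic driver fronts many baremetal nodes rather than a single hypervisor host, there is no single meaningful uptime to report, which is the design mismatch being described above.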
TheJuliaI guess you're running at a debug level?13:22
jrossernope, debug = False13:23
TheJuliawhat version?13:24
jrossercaracal13:24
TheJuliaInteresting, because latest CI job log I had handy doesn't show nova-compute logging that https://1659f7110238be299cb2-d967962f17ef2a051378ed1d5b3db360.ssl.cf2.rackcdn.com/openstack/cf098c89d7f6456d8990d6f90360b127/controller/logs/screen-n-cpu.txt13:25
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: ci: dial back check intervals  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/94222013:35
opendevreviewJulia Kreger proposed openstack/ironic master: DNM/Ignore: Science! trim out cirros partition...  https://review.opendev.org/c/openstack/ironic/+/93122213:39
opendevreviewMerged openstack/ironic-ui master: Removing regular from localization  https://review.opendev.org/c/openstack/ironic-ui/+/95121514:19
alegacy_question regarding how the various networks (e.g., rescuing, cleaning, provisioning, etc) are managed in a standalone case.  It looks like those _could_ be each on a different VLAN, and in a full openstack deployment it looks like Neutron would be capable of (re-)programming the switch to set those up properly before booting the node in the right mode.  In that case I assume the IPA would need to be built 14:28
alegacy_knowing which VLAN to use in each scenario based on the info stored in the Neutron network?  ...but in a standalone case, where Neutron isn't in the mix, how would Ironic know which VLAN to use for each...I don't see that defined in any config file?14:28
JayFhttps://review.opendev.org/c/openstack/ironic/+/946741 could use a review if someone else has a sec (API-call action addition to inspector)14:28
JayFalegacy_: IPA either uses DHCP or (rare) static IP configdrive14:29
JayFalegacy_: so the IPA ramdisk is just configured to dhcp on all interfaces and/or read and apply that configdrive14:29
alegacy_but if those networks (e.g., cleaning) are meant to be on a VLAN, the IPA would have to create that VLAN prior to sending out a DHCP request, no?14:30
TheJuliabrrrraaains14:30
TheJuliaJayF: I'm seeing more configdrive hinting via IPA recently, specifically in the context of metal3, where it is being hinted because folks just don't have/want/need dhcp in those disjointed standalone cases14:31
TheJuliaalegacy_: generally the advice is to isolate to a specific access port and not try to have vlan tagging native on the interface, because when you offer an entire trunk without restrictions to a host, it is a security nightmare14:32
TheJuliaaccess port in access mode, that is14:32
alegacy_so then all of those networks (rescuing, cleaning, inspecting, etc...) are always on the same NIC and always on the same VLAN (because it is an access port)?14:33
TheJuliaIn an integrated context, they are service networks which can all be the same or be distinctly different networks14:34
TheJuliain a standalone context, it is expected to be static as long as the traffic can somehow reach the endpoints14:34
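If a standalone deployment really did need the ramdisk to come up on a tagged VLAN without DHCP, the usual way to convey that is a config drive carrying a network_data.json, per the configdrive hinting mentioned above. A minimal sketch, assuming the standard OpenStack network_data format (MACs, IDs, and addresses purely illustrative):

    {
      "links": [
        {"id": "eth0", "type": "phy", "ethernet_mac_address": "52:54:00:12:34:56"},
        {"id": "eth0.101", "type": "vlan", "vlan_link": "eth0", "vlan_id": 101,
         "vlan_mac_address": "52:54:00:12:34:56"}
      ],
      "networks": [
        {"id": "provisioning", "type": "ipv4", "link": "eth0.101",
         "ip_address": "192.0.2.10", "netmask": "255.255.255.0", "routes": []}
      ],
      "services": []
    }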
opendevreviewMithun Krishnan Umesan proposed openstack/networking-generic-switch master: Adds a Sphinx directive to parse each file in the netmiko devices folder and return documentation containing the switch, command modules capable of being executed by the switch, and the CLI commands sent to the switch when the command module is selected.  https://review.opendev.org/c/openstack/networking-generic-switch/+/95165914:34
TheJuliaIn ironic, the overall behavior is governed by the network_interface selection on the node object14:35
alegacy_ya, that's why I'm asking because in the network interface base class I'm seeing a lot of specific handling for those networks and I'm a bit confused as to how that information would be available in a standalone case14:36
cardoeWell I'm back now. All I remember is that Ironic does stuff with computers. How am I doing so far?14:42
TheJuliaalegacy_: I guess the challenge is *what* along with when, and maybe walking through in discussion might help? dunno. A good starting point from a non-integrated case is just keep in mind there are no networks, no neutron, you only have what you know and supply and some of what is known can be config as well14:44
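On the network_interface point: in a standalone deployment the common choice is the noop (or flat) interface, so Ironic never expects Neutron to reprogram switch ports and simply uses whatever network the node is already cabled to. A hedged sketch of the relevant settings (option and interface names are real; values illustrative):

    # ironic.conf
    [DEFAULT]
    enabled_network_interfaces = noop,flat
    default_network_interface = noop

    # or per node
    baremetal node set <node-uuid> --network-interface noop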
TheJuliacardoe: clearly we need a bot which randomly delivers tasty beverages.14:45
JayFcardoe: Ironic /tries/ to do stuff with computers14:46
TheJuliaAnd potentially make coffee14:52
alegacy_Thanks TheJulia... I'll chew on that for a bit.  A short call/discussion might be good to clarify some basic assumptions before I head off into the weeds with some half-baked assumptions that are wrong.14:57
TheJulialast call on https://review.opendev.org/c/openstack/networking-baremetal/+/94798515:03
opendevreviewVerification of a change to openstack/ironic-python-agent unmaintained/wallaby failed: Update .gitreview for unmaintained/wallaby, fix CI  https://review.opendev.org/c/openstack/ironic-python-agent/+/91294915:05
opendevreviewMithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation  https://review.opendev.org/c/openstack/networking-generic-switch/+/95165915:09
cardoeTheJulia: +1 from me15:24
cardoeI wanted to ask what I should do about https://review.opendev.org/c/openstack/ironic/+/951631 ? Since the original change was merged. Should I make a change to the note in master to be bugfix?15:24
cardoeOr should I fix up the backport?15:26
cardoeCause it really IMHO is a bugfix. Since it's making the inspect code behave how it's documented.15:26
opendevreviewMerged openstack/ironic-tempest-plugin master: Adding better error messages to microversion tests  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/94594515:37
TheJuliaI'd make a fix to the release note on master, and go ahead and incorporate that into the backport15:42
opendevreviewJay Faulkner proposed openstack/ironic master: Fix minor devstack issues  https://review.opendev.org/c/openstack/ironic/+/95167215:54
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95167415:55
TheJulialast call for https://review.opendev.org/c/openstack/networking-generic-switch/+/95102615:57
JayFland those so we can pump the "eventlet-free ironic projects" number up :D 15:57
TheJuliaI already workflowed the networking-baremetal change15:58
TheJulia;)15:58
* TheJulia takes a break since she is at the 4 hour mark for the day16:03
cardoeJayF: wrt to pre-commit and stuff.. I did networking-baremetal right? I had forgotten networking-generic-switch only right?16:12
JayFI would have to check to be sure16:12
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95167416:14
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95167416:32
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95168016:38
opendevreviewVerification of a change to openstack/ironic master failed: api: Ensure parameter transform happens early  https://review.opendev.org/c/openstack/ironic/+/94879516:39
opendevreviewMithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation  https://review.opendev.org/c/openstack/networking-generic-switch/+/95165916:40
cardoeTheJulia: like https://review.opendev.org/c/openstack/ironic/+/951680 ?16:42
cardoeBut once that's good I squish it into his backport.16:42
TheJuliayes16:49
TheJulia++16:49
TheJuliaso, we could revise text like you've noted16:50
TheJuliait would make it even more clear16:50
TheJuliaso I can workflow it, or not16:51
opendevreviewJulia Kreger proposed openstack/ironic master: ci: combine networking multinode tests with shard tests  https://review.opendev.org/c/openstack/ironic/+/95159316:53
opendevreviewJulia Kreger proposed openstack/ironic-tempest-plugin master: trivial: fix execution error on timeout  https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/95168116:59
opendevreviewJulia Kreger proposed openstack/ironic master: Remove the partition image upload  https://review.opendev.org/c/openstack/ironic/+/93122217:03
opendevreviewJulia Kreger proposed openstack/ironic master: ci: remove the partition image upload  https://review.opendev.org/c/openstack/ironic/+/93122217:03
cardoeTheJulia: Haseeb's going to reword it.17:05
TheJuliaok17:05
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95168017:08
TheJuliaJayF: I've clarified my comment on https://etherpad.opendev.org/p/ironic-eventlet-removal. I think step 0 is sort of do we keep trying to use sslutils configuration as is and copy it, or do we break into individual section configs as we move parts. See: https://github.com/openstack/oslo.service/blob/master/oslo_service/sslutils.py#L2917:10
JayFTheJulia: okay, that's what I thought you meant. Was being explicit because the idea had been tossed around about migrating off oslo.service entirely17:10
JayFwhich at this point would seem like re-solving problems other people solved for us17:10
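For reference, the sslutils options being discussed all register under a single [ssl] group today, so "copy it as-is" would look roughly like this (a hedged sketch; file paths illustrative):

    [ssl]
    cert_file = /etc/ironic/ssl/ironic.crt
    key_file = /etc/ironic/ssl/ironic.key
    ca_file = /etc/ironic/ssl/ca.crt
    # version and ciphers belong to the same group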
cardoeShould we backport https://review.opendev.org/c/openstack/ironic/+/948301 ?17:11
cardoeI'd love to see some of that land into oslo.service17:11
TheJuliacardoe:  likely17:11
TheJuliacardoe: what specifically do you want to see land there?17:12
cardoeTrying to discuss "developer experience" type things at the TC. Like don't make it hard for people.17:12
cardoeTheJulia: not having people re-invent fixes themselves.17:12
JayFcardoe: you want a hot take?17:12
cardoeyes.17:12
cardoeI don't have a good answer.17:12
JayFcardoe: the worst DX in the whole project is the stuff happening in the DCO governance change17:12
TheJuliaI think there is a trap we've fallen into: we have a single ssl configuration section which might actually be wrong17:12
cardoeCause oslo.* is something that people push stuff to but then nobody maintains17:12
JayFcardoe: trying to do something with reasonable scope, and then trying to boil the entire ocean with it17:12
cardoeYeah I feel like this DCO stuff should be broken up.17:13
JayFoslo having low contribution, especially around bugs and responsiveness, is an artifact of the "we contribute to OPENSTACK" attitudes shifting around to be "we contribute to $subPROJECT" or "we contribute $FEATURE"17:13
TheJuliai think I spotted another option regarding oslo.service which is deprecated which we use, or maybe it was vice versa; it's getting a little blurry the more I look at the eventlet stuffs17:14
opendevreviewDoug Goldstein proposed openstack/ironic stable/2025.1: Allow to unprovision instance from service wait states  https://review.opendev.org/c/openstack/ironic/+/95168217:14
TheJuliaI think there is an evolution aspect where oslo made sense, but in some areas might not make sense for everything and everyone. There is a fine line to walk there and it is sort of situational17:15
TheJuliabut that worsens it when orgs shift more to feature focused work17:15
TheJuliabecause there is inherently less motivation and technical reason to have that centralization for smaller bits of common-ish looking code17:16
opendevreviewDoug Goldstein proposed openstack/ironic master: allow running inspection hooks on redfish interface  https://review.opendev.org/c/openstack/ironic/+/93306617:16
TheJulia(again, fine line, there are totally places where one *should* focus on that instead of the specific case, again, situational)17:16
cardoeokay so bad example then. :-D17:17
JayFIt's interesting that eventlet is kinda demonstrating some of the value oslo still provides17:17
JayFalthough I agree in general that openstack would be well-served to be ... less of our own ecosystem and more a part of the larger python/cloud ecosystems17:18
JayFand oslo is an isolating factor there (as is eventlet, really)17:18
TheJuliayup17:18
cardoeokay https://review.opendev.org/c/openstack/ironic/+/951680 got touched after your +W TheJulia 17:19
cardoeI think that wording is better17:20
TheJuliayeah, formatting fixups will likely occur later, but it's all good17:21
TheJuliaIt gets the point across17:21
cardoeHaseeb is one of the devs on my team that I had previously mentioned would be working on release goals that we put forward.17:24
TheJuliacool cool17:26
TheJuliaso going back to the sslutils discussion, https://review.opendev.org/c/openstack/ironic/+/951054 is top of mind17:31
JayFyeah, I owe cid some time looking into those failures with him, too17:31
cid++, that would be very helpful17:39
* JayF just realized he owes cid a devstack VM he never provisioned yesterday 🤦‍♂️17:44
cardoeI think https://review.opendev.org/c/openstack/ironic/+/946741 is ready for a +W17:46
JayFI've been waiting to +A that one because I was hoping Julia or Dmitry might take a look17:49
JayFjust not extremely confident in my ability to review new API endpoints in inspection17:49
opendevreviewMerged openstack/networking-baremetal master: Remove explicit use of eventlet  https://review.opendev.org/c/openstack/networking-baremetal/+/94798517:50
* cardoe golf claps.17:51
TheJuliaI sort of wish that had a test, but... it really is so simple...17:53
TheJuliacardoe: is the use pattern "hey, just call this url" but with no actual specific info regarding what ?17:55
cardoeyup17:55
cardoeNot our change but I can see the usefulness of it.17:55
cardoeThe only guardrail I'd want added is to make sure that timeout is set and not 017:56
cardoeBut that would really require a stupid operator config to set it to 0 right now.17:56
opendevreviewSyed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95168017:57
TheJuliashouldn't it have a check that timeout is not 0 then ?18:00
TheJuliaor a semi-malicious operator who knows how to hang socket handling18:00
JayFthat's a good call18:05
JayFpart of why I waited was it felt too simple18:05
JayFbut I couldn't put my finger on what it was missing18:05
rm_workOh I gained some new insight into our random server shutdown / power sync problem. I guess the issue is not actually that our connectivity to BMC via ipmi is flaky anymore, it’s that the power checks in ironic queue up and eventually stuff queues so long that they time out… lol18:05
rm_workLikely we need cells but we do not have cells yet <_<18:05
TheJuliaBTW, my change to excise tinyipa passes CI https://review.opendev.org/c/openstack/ironic/+/950206 I'd <3 a review18:05
JayFrm_work: add conductors and/or conductor groups, and tweak the power status loop timing + number of threads18:06
TheJuliarm_work: how many baremetal nodes do you have per conductor?18:06
JayFrm_work: there are lots of ways to tune it18:06
rm_workYeah I’ll look at our existing config, I think we already tried18:07
JayFwhat is your node:cond ratio like Julia asked18:07
JayFthe other possibility is one or two bad BMCs used to be able to cause this behavior, because ipmitool would go out to lunch; but I think we timeout those now18:07
rm_workWe have three conductors and … approximately 300 nodes I think18:09
rm_workI’m trying to look up our config quick18:09
rm_workAh just over 400 actually, so like 136 nodes per conductor18:10
TheJuliaand all ipmi ? and is it the execution of ipmitool which is timing out OR the conductor task thread in the queue?18:11
JayFthat is more than enough conductors:node18:11
JayFI usually suggest a ratio of more like 500:1 without taking HA into account18:11
rm_workHmm let me look at logs and try to figure that out18:11
rm_workI was really just intending to drive by and give an update, but if you’re gonna throw possible solutions at me I’ll take some time to follow up on this stuff quick 😁18:12
TheJuliaso we have some operators who are running upwards of like 600+ nodes a conductor, but not all ipmi18:12
TheJuliait really depends on the state, settings and exact behavior18:12
TheJulia(and if the bmcs are well behaved as well, which is never a guarantee with ipmi)18:13
rm_work100% ipmi18:13
rm_workIt is possible we have some misbehaving still, but I was corrected as to it no longer being a common occurrence18:14
JayFI would only believe "these IPMI BMCs are not acting up commonly" when backed up with logs :D 18:14
rm_workI believe we attempted to increase parallelism and timeouts, but do you know offhand which config vars are most relevant18:14
rm_workOr what I can search logs for to prove definitively one way or the other18:15
TheJuliaI have seen ipmitool commands just time out at times, but generally that is outside ipmitool itself running. That gets returned as an error, and if you make the parallelism and queue too deep then I seem to think you end up sort of focusing on those nodes, because we want to focus on recovery/management of nodes which are being problematic as opposed to ones in a happy state.18:20
TheJuliaat least, that is what my memory recall and thoughts are leaning towards18:20
rm_workOk, well I have debug on here now so should see that if it happens18:21
opendevreviewMerged openstack/networking-generic-switch master: Remove explicit use of eventlet  https://review.opendev.org/c/openstack/networking-generic-switch/+/95102618:21
rm_work[m]excuse my 6-line paste18:21
rm_work[m]periodic_max_workers = 16... (full message at <https://matrix.org/oftc/media/v1/media/download/ASpmI2zdTVrOaO4Yy4dDGmFkPqRY1YMNh8XaY0VOBo9dsH4L-Y278upYx10zhsmypIA8G2CFr0z5BEs9f1gjjlNCeXfyBY4gAG1hdHJpeC5vcmcvV2pWRVdKS0hveVRsdHNjaFJOYmJHd0Zs>)18:21
rm_work[m]I think that is all the relevant config18:22
JayFrm_work[m]: sync_power_state_interval should be higher18:23
JayFrm_work[m]: and the retries being 30 is your problem18:23
JayF30 retries is absolutely bananas18:24
rm_work[m]ah I thought we already tuned that up... maybe they just tuned up the retries and not the interval? which would only exacerbate the issue I suppose lol18:24
JayFif there's a 60 second timeout (unsure) a single failing machine will overrun your interval18:24
rm_work[m]what is a safe power_state_interval?18:24
JayFthat's 100% a situational question18:24
rm_work[m]I mean should we 10x it?18:24
cardoeTheJulia: ugh sorry... https://review.opendev.org/c/openstack/ironic/+/951680 once more :/18:25
rm_work[m]hmm18:25
JayFI've seen it completely disabled in some worlds18:25
JayFbasically: how much lag time is OK if the node controls its own power18:25
JayFe.g. shutdown from CLI or error or something18:25
JayFbefore ironic sees it's off18:25
rm_work[m]so like retries=5 interval=1200?18:25
JayFretries=3 is the default18:25
JayFI would tune that down down down down 18:25
JayFyou don't wanna retry, you want the bad ipmi instances to identify themselves, go to maintenance, and get outta your loop18:26
TheJuliaoh yeah, 30 retries is super bananas since ipmitool will sit and run forever18:26
rm_work[m]lol that sounds... fair18:26
JayFin onmetal18:26
JayFwe had something like, 3 conductors and 800 nodes in a region18:26
JayF3 failing nodes was enough to make the power status loop croak18:26
rm_work[m]ok so i'll see if i can tweak it down to default retries and put the interval up to 30min18:26
JayFI think we've improved the story a lot since then, but IPMI BMCs *really* are talented at going out to lunch18:27
rm_work[m]parallelism seem ok?18:27
rm_work[m]32 workers?18:27
JayFyou won't see much pain from that being too high unless you increase your node:conductor ratio18:27
TheJuliahonestly, I've seen folks lean toward retries=2 or even 1 since the loop pattern will focus on those problem nodes18:29
JayF+++18:29
rm_work[m]oh interesting, k18:29
JayFyeah basically the loop is your canary18:29
JayFif you can't get power status18:29
rm_work[m]I will attempt to fix this config and report back in two weeks after they actually let me make a config change >_<18:29
JayFthen if a customer tries to provision that node, it'll also not work18:30
TheJuliayeah, this is one of those perfection is the enemy of good and unhappy nodes can make your power sync loop unhappy quickly18:30
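The knobs in this exchange all live under [conductor] in ironic.conf; a hedged example of the more conservative tuning being suggested (option names are real, values illustrative, check the sample config for current defaults):

    [conductor]
    # how often the power sync loop runs, in seconds (default 60)
    sync_power_state_interval = 600
    # consecutive failures before a node is flagged and put in maintenance (default 3)
    power_state_sync_max_retries = 2
    # threads drawn from the conductor worker pool for the sync (default 8)
    sync_power_state_workers = 8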
rm_work[m]I ... miss yahoo, rofl18:30
rm_work[m]at least there they let us deploy literally constantly18:30
JayFperfection is only possible by letting the loop fail and pulling the bad machines outta the rotation18:30
JayFrm_work[m]: ...18:30
TheJuliarm_work[m]: feel free to link to eavesdrop and this conversation ;) "I talked with the maintainers/experts" ;)18:30
rm_work[m]heh yeah18:30
JayFrm_work[m]: I am trying to think of a polite way to respond to that comment18:30
rm_work[m]lol18:30
JayFrm_work[m]: maybe they shoulda tried not needing to deploy all the time ;) 18:30
rm_work[m]I mean the sentiment behind that was "unexpectedly"18:31
rm_work[m]I do not normally expect to miss it :D18:31
TheJuliaI could totally see focusing on the old pattern to try and get perfection on the sync18:31
TheJuliabut we did make some intentional changes to the loop... 6-7 years ago18:32
TheJuliato focus the pattern18:32
rm_work[m]we definitely have flaky nodes, it is just less than I thought, but if 1-2 could bring shit down then... definitely possible18:32
JayFwith 30 replies18:32
JayFone could do it18:32
JayF**retries18:32
TheJuliayeah, that subprocess exec will just spin and spin18:32
TheJuliabecause you end up with 30 retries with ?10? second timeouts; it quickly compounds out of control18:33
cardoeI gave that multinode tinyipa thing a +2... took me a couple to understand the switch_info change but it's good.18:33
TheJuliaa single node, 300 seconds, etc18:33
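Spelling out that arithmetic: at roughly 10 seconds per failed ipmitool attempt, 30 retries ties up on the order of 30 x 10 s = 300 s of sync time on one unreachable BMC, versus about 3 x 10 s = 30 s at the default of 3 retries before the node is flagged and taken out of the loop.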
cardoeI've also +2'd the backports for the agent get_XXX_steps stuff. the backport to 27.0 failed but it seems like python crashed so I retried that one.18:34
rm_work[m]thanks a ton, will attempt to get these changes in18:34
TheJuliaoh, yeah, interface name length restrictions too ;)18:34
TheJuliacardoe: thanks!18:34
* TheJulia spends some time reviewing n-g-s changes18:35
cardoeThat release note change failed CI and Haseeb made another push and now it passed... sorry for the wasted +W efforts but it's good now18:36
cardoehttps://review.opendev.org/c/openstack/ironic/+/95168018:40
rm_work[m]just one follow-up question -- how would a single node take up so many workers if it fails? like can multiple workers be retrying the same node at the same time?18:46
opendevreviewDoug Goldstein proposed openstack/ironic master: allow running inspection hooks on redfish interface  https://review.opendev.org/c/openstack/ironic/+/93306618:47
rm_work[m]logically with 30 retries and 300 interval, if the timeout is 10s, then 30 retries would put you right at that threshold (I am guessing someone did that exact math to pick those)18:47
rm_work[m]but why would it block the other ... 31/32 workers18:47
rm_work[m]interesting, the default sync interval is 6018:50
cardoeI created https://review.opendev.org/c/openstack/ironic/+/951682 as a backport (it'll pass CI since the only thing left is a non-voting job). I think that's good because it'll make the behavior consistent in 2025.1. The fix to make unprovision work landed in 2025.1 for all states except for the one that this corrects. I wouldn't backport past that since it would involve other backports to older versions. I gave it a +2 18:51
cardoesince I didn't write the change.18:51
TheJuliarm_work[m]: yeah, so the way it works is the nodes are distributed in lists across the threads to invoke ipmitool18:59
TheJuliaso *one* node spending time being timed out can cause the rest of the nodes in the queue to never be reached, which starts to create a cascading problem of sorts19:00
opendevreviewMerged openstack/ironic master: re-framing this as an explicit bugfix to backport  https://review.opendev.org/c/openstack/ironic/+/95168019:00
opendevreviewMithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation  https://review.opendev.org/c/openstack/networking-generic-switch/+/95165919:02
rm_work[m]TheJulia: wait so a single node fails once... then 32 workers spin up simultaneously to try to sync it??19:04
TheJuliano19:04
TheJuliaso...19:04
TheJuliathink of it this way19:04
rm_work[m]I just don't understand how less than 32 nodes being bad could tie up 32 workers19:04
TheJuliayou end up with a list of like 25-30 nodes to check the power state19:04
TheJuliaacross say 8 threads19:05
rm_work[m]hmm19:05
TheJuliaso, 20-30 seconds pass, it is half way through the list19:05
TheJuliait hits a node that just stalls out the entire check19:05
TheJuliaso everything behind it, and that node itself, gets prioritized for being checked on the next time the entire power sync loop triggers to allocate sync work19:05
rm_work[m]well we have 32 workers, I feel like for it to be a problem we'd need a LOT of failing BMCs not just 1-219:05
opendevreviewDoug Goldstein proposed openstack/ironic stable/2025.1: Control port updates with update_pxe_enabled flag  https://review.opendev.org/c/openstack/ironic/+/95163119:06
JayFThere is some weirdness about how those variables interact, isn't there? I can't remember exactly, and quite frankly I'm already deep enough in my to-do list that I shouldn't look it up, but I think there's something about that number maybe only being the number of overall conductor threads that are dedicated to it19:06
TheJuliayeah, but then it is the failing node the next time, and depending on the split you end up with 1-2 threads which are entirely stuck power-sync-wise because their very first node is the oldest unchecked node19:06
JayFSo like if the overall conductor thread number is lower you won't have 25 threads to dedicate to power status loops19:06
TheJuliabecause we don't have an up to date power state19:07
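A toy illustration of the behavior being described here (not Ironic's actual implementation; purely to show how one stalled BMC delays every node queued behind it in the same worker's slice):

    import time
    from concurrent.futures import ThreadPoolExecutor

    NODES = [f"node-{i}" for i in range(24)]
    BAD = {"node-10"}  # pretend this BMC never answers

    def check_power(node):
        # a healthy check returns quickly; a bad one eats the whole timeout
        time.sleep(10 if node in BAD else 0.5)
        return node

    def worker(chunk):
        # nodes in a chunk are checked in order, so everything behind a
        # stalled node waits for it before being looked at
        return [check_power(n) for n in chunk]

    chunks = [NODES[i::8] for i in range(8)]  # split the node list across 8 threads
    with ThreadPoolExecutor(max_workers=8) as pool:
        for result in pool.map(worker, chunks):
            print(result)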
rm_work[m]also, apparently people internally think we should NEVER shut down a node because of power checks, so ... maybe we should just disable the thing and set interval to 0 lol19:07
cardoeIs https://review.opendev.org/c/openstack/ironic/+/951631 correct for backporting those two together?19:07
TheJuliathose nodes never really get checked, they get focused in on again, but now every sync you have some number of threads which will timeout their overall check operation19:07
TheJuliarm_work[m]: so, the issue is sort of a management issue19:08
TheJuliarm_work[m]: what is happening then is the node state goes to none19:08
TheJulianova eventually picks up on that and flags it out and then ends up telling ironic to power it down19:08
JayFhttps://docs.openstack.org/ironic/latest/configuration/sample-config.html check [conductor]/workers_pool_size19:09
JayFthe 25 threads for power status loop come from that pool19:10
rm_work[m]yeah I think we would also set nova sync_power_state_interval = -119:10
rm_work[m]yeah our workers_pool_size is 500 lol19:10
JayFso if the number is too low you might see weird behavior if you're trying to spawn more threads than you have19:10
JayFokay 19:10
rm_work[m]IDK if that is stupidly high tho19:10
TheJuliawe're sort of talking about changing the default with eventlet19:10
TheJuliabut.... that is not a now() change19:11
JayFI would suggest generally19:11
JayFit looks like the knobs in your deployment have kinda been indiscriminately turned up higher19:12
JayFconsider restoring defaults for the worker pool size and power state workers and the like19:12
rm_work[m]hmmm k19:12
JayFmainly because 1) you're well below any known scaling thresholds and 2) it's much harder to reason about performance when your config is weirdly tuned19:12
TheJuliaYeah, tons of retries have bitten folks in the past.19:12
rm_work[m]ah default is 300 for the workers_pool_size so not THAT far off19:13
JayFI mean, it's 166% of the default value19:13
JayFthat's not trivial ):19:13
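For completeness, the pool in question is also a [conductor] option; restoring the default as suggested would look like this (hedged; the power-sync threads above are drawn from this pool):

    [conductor]
    workers_pool_size = 300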
rm_work[m]but yes I am returning retries to 3, but turning up the interval to ... maybe just 60019:13
JayFalso force_power_state_during_sync may be impactful19:13
rm_work[m]I just don't know how much it matters for the syncs to be that fast19:14
JayFbecause if you can get status but can't power on... I think that'll maint the machine, nevermind19:14
JayFrm_work[m]: ironic really doesn't care much, ironic knows the power state unless people are powering on/off servers willy-nilly19:14
rm_work[m]yeah I don't know that anyone is really turning off their servers...19:14
JayFrm_work[m]: the biggest impact IMO is that if a user shuts down the instance inside the OS, and nova doesn't know b/c ironic hasn't sync'd that power state to its DB (and then to nova)19:14
JayFIDK if the person would be able to `start` the instance to power it back up19:15
JayFbecause nova would think it's on already19:15
rm_work[m]hmm19:15
JayFthat is a mild impact at most19:15
JayFand if you disable nova power sync, you're already paying that price19:15
rm_work[m]they could just tell it to reboot? heh19:15
rm_work[m]I mean I assume if nova thinks it is on, but it is off, they could tell nova to ... turn it off19:15
JayFor they could use the nova reboot command :)19:15
rm_work[m]and that would no-op, yes19:16
JayFrm_work[m]: that's outside of the box thinking, nice job19:16
JayFlol19:16
rm_work[m]T_T19:16
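If the team does decide to stop nova from acting on power-state drift, as floated above, the nova-side knob is the one rm_work named earlier (hedged; a negative value disables the periodic comparison):

    # nova.conf on the ironic nova-compute service
    [DEFAULT]
    sync_power_state_interval = -1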
TheJuliastevebaker[m]: I reviewed some of your n-g-s changes, I didn't leave a vote on the change adding security group support into the actual netmiko model because it has me thinking19:21
TheJuliacomments on the change, mostly more perception related19:21
cardoealright gonna wander away for a bit. I threw out a pile of +2s on backports stuff. Pile of stuff in some state of ready for maintainers in https://review.opendev.org/q/hashtag:%22ironic-week-prio%22+AND+status:open19:27
stevebaker[m]TheJulia: ok thanks I'll take a look19:28
cardoeI'll note that some of those changes in that queue have 4 different people +2ing them so they're probably pretty good.19:33
JayFIt's kinda a little crazy how different our cleaning flow is with disable_ramdisk=True20:09
JayFI had tested this entire automated cleaning by runbook change with it set true in the runbook, and basically there's an entire separate place to update for not-disabled ramdisk :| 20:10
JayFI think I have it figured out now, but it's not awesome20:10
opendevreviewJay Faulkner proposed openstack/ironic master: Automated cleaning by runbook  https://review.opendev.org/c/openstack/ironic/+/94525920:26
cardoeWell I hope it passes.20:28
JayFCI won't catch this20:31
JayFI have to test in my devstack again20:32
cardoeIs https://review.opendev.org/c/openstack/ironic/+/951631 the right way to squish two commits together which should be one for a backport?20:35
cardoeThe second being a fix up of the releasenote20:35
cardoeI referenced both commits in the cherry-pick comments20:36
JayF+2 lgtm20:40
cardoeyour automated cleaning failed pep820:40
JayFI'm not worried about that if it passes devstack :) 20:40
JayFI use gerrit as my remote code store, not just pushing when it's ready20:40
cardoeah just wanted to let ya know. I'm watching Zuul intently.20:41
JayFTheJulia: IDK if today was just an announcement, but congrats on OIF->LF closing20:43
opendevreviewDoug Goldstein proposed openstack/ironic stable/2024.2: Control port updates with update_pxe_enabled flag  https://review.opendev.org/c/openstack/ironic/+/95171620:46
opendevreviewDoug Goldstein proposed openstack/ironic stable/2024.1: Control port updates with update_pxe_enabled flag  https://review.opendev.org/c/openstack/ironic/+/95171720:52
opendevreviewDoug Goldstein proposed openstack/ironic bugfix/28.0: Control port updates with update_pxe_enabled flag  https://review.opendev.org/c/openstack/ironic/+/95171820:53
opendevreviewDoug Goldstein proposed openstack/ironic bugfix/27.0: Control port updates with update_pxe_enabled flag  https://review.opendev.org/c/openstack/ironic/+/95171920:53
opendevreviewMerged openstack/ironic master: CI: remove legacy devstack baremetal admin and observer role usage  https://review.opendev.org/c/openstack/ironic/+/95144521:06
rm_work[m]offhand, any major bugs/issues that have been fixed in the last year that I can reference as possible reasons why we'd need to make sure we get to latest? we're running like... 2024.121:16
opendevreviewVerification of a change to openstack/ironic master failed: api: Ensure parameter transform happens early  https://review.opendev.org/c/openstack/ironic/+/94879521:18
JayFrm_work[m]: how about "so we continue getting security updates past halloween" https://usercontent.irccloud-cdn.com/file/dUebTDYh/image.png21:24
rm_work[m]lol21:24
rm_work[m]I suppose that might be relevant21:25
opendevreviewcid proposed openstack/networking-generic-switch master: Cast numeric Netmiko kwargs to native types.  https://review.opendev.org/c/openstack/networking-generic-switch/+/95172422:21
opendevreviewMerged openstack/ironic master: api: Ensure parameter transform happens early  https://review.opendev.org/c/openstack/ironic/+/94879522:36
opendevreviewMichal Nasiadka proposed openstack/networking-generic-switch master: doc: Rework support matrix for trunk driver  https://review.opendev.org/c/openstack/networking-generic-switch/+/94233822:55
opendevreviewMerged openstack/python-ironicclient master: Cast string boolean from CLI  https://review.opendev.org/c/openstack/python-ironicclient/+/95160023:33
