TheJulia | rm_work: ... I guess it would sort of depend on what the interface to convey it across is | 04:35 |
TheJulia | to the UEFI firmware | 04:35 |
TheJulia | really, I'd expect a runtime loader as an intermediary like ipxe or grub, but yeah | 04:36 |
TheJulia | JayF: pass in a config drive, yeah | 04:36 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951631 | 08:30 |
opendevreview | cid proposed openstack/python-ironicclient master: Cast string boolean from CLI https://review.opendev.org/c/openstack/python-ironicclient/+/951600 | 09:33 |
*** masghar is now known as Mahnoor | 09:39 | |
*** Mahnoor is now known as masghar | 09:40 | |
opendevreview | Queensly Kyerewaa Acheampongmaa proposed openstack/sushy-tools master: Add PATCH support for Redfish DateTime fields in Manager resource https://review.opendev.org/c/openstack/sushy-tools/+/950925 | 11:10 |
dtantsur | TheJulia: yeah, there is nothing so critical on the conductor itself, it's more about RPC and API | 11:50 |
TheJulia | oh yeah, definitely | 12:08 |
TheJulia | Also, good morning! | 12:09 |
dtantsur | morning! | 12:09 |
TheJulia | it feels like api and to a similar extent rpc are also sort of blocked though. :\ | 12:13 |
TheJulia | Well, https://review.opendev.org/c/openstack/ironic/+/951054 :) | 12:19 |
TheJulia | stevebaker[m]: you mentioned recently that you noticed an issue shutting down the conductor. Did you have the jsonrpc service running? | 12:22 |
jrosser | the nova driver for ironic doesnt implement `get_host_uptime` which causes quite some `NotImplementedError` exceptions in my ironic nova-compute instance - is that known about? | 12:32 |
TheJulia | last call on reviews of https://review.opendev.org/c/openstack/networking-generic-switch/+/951026 | 13:10 |
TheJulia | jrosser: known, yeah. I think we tried to get nova to special-case ironic's driver and just ignore the error, but I think they chose to keep the exception surfacing in the log. | 13:11 |
TheJulia | There is not a parallel line to draw to real hardware which has been deployed. | 13:12 |
jrosser | TheJulia: thanks for the info, that's a shame to leave the exceptions in the log | 13:21 |
TheJulia | Yeah, what they want is like a process aliveness time for the compute service if memory serves | 13:22 |
TheJulia | and... yeah, that is not how the virt driver is designed. | 13:22 |
TheJulia | I guess you're running at a debug level? | 13:22 |
jrosser | nope, debug = False | 13:23 |
TheJulia | what version? | 13:24 |
jrosser | caracal | 13:24 |
TheJulia | Interesting, because latest CI job log I had handy doesn't show nova-compute logging that https://1659f7110238be299cb2-d967962f17ef2a051378ed1d5b3db360.ssl.cf2.rackcdn.com/openstack/cf098c89d7f6456d8990d6f90360b127/controller/logs/screen-n-cpu.txt | 13:25 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: ci: dial back check intervals https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/942220 | 13:35 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM/Ignore: Science! trim out cirros partition... https://review.opendev.org/c/openstack/ironic/+/931222 | 13:39 |
opendevreview | Merged openstack/ironic-ui master: Removing regular from localization https://review.opendev.org/c/openstack/ironic-ui/+/951215 | 14:19 |
alegacy_ | question regarding how the various networks (e.g., rescuing, cleaning, provisioning, etc) are managed in a standalone case. It looks like those _could_ be each on a different VLAN, and in a full openstack deployment it looks like Neutron would be capable of (re-)programming the switch to set those up properly before booting the node in the right mode. In that case I assume the IPA would need to be built | 14:28 |
alegacy_ | knowing which VLAN to use in each scenario based on the info stored in the Neutron network? ...but in a standalone case, where Neutron isn't in the mix, how would Ironic know which VLAN to use for each...I don't see that defined in any config file? | 14:28 |
JayF | https://review.opendev.org/c/openstack/ironic/+/946741 could use a review if someone else has a sec (API-call action addition to inspector) | 14:28 |
JayF | alegacy_: IPA either uses DHCP or (rare) static IP configdrive | 14:29 |
JayF | alegacy_: so the IPA ramdisk is just configured to dhcp on all interfaces and/or read and apply that configdrive | 14:29 |
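(For reference: a minimal sketch of the static network config a config drive can carry for the ramdisk, in the standard OpenStack network_data.json format that tools like glean or cloud-init apply; the interface id, MAC, and addresses below are illustrative assumptions, not values from this discussion.)

```json
{
  "links": [
    {"id": "eth0", "type": "phy", "ethernet_mac_address": "52:54:00:12:34:56"}
  ],
  "networks": [
    {
      "id": "provisioning",
      "type": "ipv4",
      "link": "eth0",
      "ip_address": "192.0.2.10",
      "netmask": "255.255.255.0",
      "routes": [
        {"network": "0.0.0.0", "netmask": "0.0.0.0", "gateway": "192.0.2.1"}
      ]
    }
  ],
  "services": [
    {"type": "dns", "address": "192.0.2.53"}
  ]
}
```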
alegacy_ | but if those networks (e.g., cleaning) are meant to be on a VLAN, the IPA would have to create that VLAN prior to sending out a DHCP request, no? | 14:30 |
TheJulia | brrrraaains | 14:30 |
TheJulia | JayF: I'm seeing more configdrive hinting via IPA recently, specifically in the context of metal3, where it is being hinted because folks just don't have/want/need dhcp in those disjointed standalone cases | 14:31 |
TheJulia | alegacy_: generally the advice is to isolate to a specific access port and not try to have vlan tagging native on the interface, because when you offer an entire trunk without restrictions to a host, it is a security nightmare | 14:32 |
TheJulia | access port in access mode, that is | 14:32 |
alegacy_ | so then all of those networks (rescuing, cleaning, inspecting, etc...) are always on the same NIC and always on the same VLAN (because it is an access port)? | 14:33 |
TheJulia | In an integrated context, they are service networks which can all be the same or be distinctly different networks | 14:34 |
TheJulia | in a standalone context, it is expected to be static as long as the traffic can somehow reach the endpoints | 14:34 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Adds a Sphinx directive to parse each file in the netmiko devices folder and return documentation containing the switch, command modules capable of being executed by the switch, and the CLI commands sent to the switch when the command module is selected. https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 14:34 |
TheJulia | In ironic, the overall behavior is governed by the network_interface selection on the node object | 14:35 |
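(As a hedged illustration of that knob: `noop` and `flat` are the network interfaces typically used standalone, while `neutron` is typical in an integrated deployment; the chosen interface also needs to be listed in `[DEFAULT]enabled_network_interfaces` in ironic.conf. The node placeholder below is made up.)

```shell
# standalone: Ironic does no switch programming, networking is whatever you wired up
baremetal node set <node-uuid> --network-interface noop

# integrated: let Neutron (re)program the switch for provisioning/cleaning/etc.
baremetal node set <node-uuid> --network-interface neutron
```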
alegacy_ | ya, that's why I'm asking because in the network interface base class I'm seeing a lot of specific handling for those networks and I'm a bit confused as to how that information would be available in a standalone case | 14:36 |
cardoe | Well I'm back now. All I remember is that Ironic does stuff with computers. How am I doing so far? | 14:42 |
TheJulia | alegacy_: I guess the challenge is *what* along with when, and maybe walking through in discussion might help? dunno. A good starting point for a non-integrated case is just to keep in mind there are no networks, no neutron, you only have what you know and supply, and some of what is known can be config as well | 14:44 |
TheJulia | cardoe: clearly we need a bot which randomly delivers tasty beverages. | 14:45 |
JayF | cardoe: Ironic /tries/ to do stuff with computers | 14:46 |
TheJulia | And potentially make coffee | 14:52 |
alegacy_ | Thanks TheJulia... I'll chew on that for a bit. A short call/discussion might be good to clarify some basic assumptions before I head off into the weeds with some half-baked assumptions that are wrong. | 14:57 |
TheJulia | last call on https://review.opendev.org/c/openstack/networking-baremetal/+/947985 | 15:03 |
opendevreview | Verification of a change to openstack/ironic-python-agent unmaintained/wallaby failed: Update .gitreview for unmaintained/wallaby, fix CI https://review.opendev.org/c/openstack/ironic-python-agent/+/912949 | 15:05 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 15:09 |
cardoe | TheJulia: +1 from me | 15:24 |
cardoe | I wanted to ask what I should do about https://review.opendev.org/c/openstack/ironic/+/951631 ? Since the original change was merged. Should I make a change to the note in master to be bugfix? | 15:24 |
cardoe | Or should I fix up the backport? | 15:26 |
cardoe | Cause it really IMHO is a bugfix. Since it's making the inspect code behave how it's documented. | 15:26 |
opendevreview | Merged openstack/ironic-tempest-plugin master: Adding better error messages to microversion tests https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/945945 | 15:37 |
TheJulia | I'd make a fix to the release note on master, and go ahead and incorporate that into the backport | 15:42 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Fix minor devstack issues https://review.opendev.org/c/openstack/ironic/+/951672 | 15:54 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951674 | 15:55 |
TheJulia | last call for https://review.opendev.org/c/openstack/networking-generic-switch/+/951026 | 15:57 |
JayF | land those so we can pump the "eventlet-free ironic projects" number up :D | 15:57 |
TheJulia | I already workflowed the networking-baremetal change | 15:58 |
TheJulia | ;) | 15:58 |
* TheJulia takes a break since she is at the 4 hour mark for the day | 16:03 | |
cardoe | JayF: wrt pre-commit and stuff... I did networking-baremetal, right? I had only forgotten networking-generic-switch, right? | 16:12 |
JayF | I would have to check to be sure | 16:12 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951674 | 16:14 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951674 | 16:32 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 16:38 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Ensure parameter transform happens early https://review.opendev.org/c/openstack/ironic/+/948795 | 16:39 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 16:40 |
cardoe | TheJulia: like https://review.opendev.org/c/openstack/ironic/+/951680 ? | 16:42 |
cardoe | But once that's good I squish it into his backport. | 16:42 |
TheJulia | yes | 16:49 |
TheJulia | ++ | 16:49 |
TheJulia | so, we could revise text like you've noted | 16:50 |
TheJulia | it would make it even more clear | 16:50 |
TheJulia | so I can workflow it, or not | 16:51 |
opendevreview | Julia Kreger proposed openstack/ironic master: ci: combine networking multinode tests with shard tests https://review.opendev.org/c/openstack/ironic/+/951593 | 16:53 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: trivial: fix execution error on timeout https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/951681 | 16:59 |
opendevreview | Julia Kreger proposed openstack/ironic master: Remove the partition image upload https://review.opendev.org/c/openstack/ironic/+/931222 | 17:03 |
opendevreview | Julia Kreger proposed openstack/ironic master: ci: remove the partition image upload https://review.opendev.org/c/openstack/ironic/+/931222 | 17:03 |
cardoe | TheJulia: Haseeb's going to reword it. | 17:05 |
TheJulia | ok | 17:05 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 17:08 |
TheJulia | JayF: I've clarified my comment on https://etherpad.opendev.org/p/ironic-eventlet-removal. I think step 0 is sort of do we keep trying to use sslutils configuration as is and copy it, or do we break into individual section configs as we move parts. See: https://github.com/openstack/oslo.service/blob/master/oslo_service/sslutils.py#L29 | 17:10 |
JayF | TheJulia: okay, that's what I thought you meant. Was being explicit because the idea had been tossed around about migrating off oslo.service entirely | 17:10 |
JayF | which at this point would seem like re-solving problems other people solved for us | 17:10 |
cardoe | Should we backport https://review.opendev.org/c/openstack/ironic/+/948301 ? | 17:11 |
cardoe | I'd love to see some of that land into oslo.service | 17:11 |
TheJulia | cardoe: likely | 17:11 |
TheJulia | cardoe: what specifically do you want to see land there? | 17:12 |
cardoe | Trying to discuss "developer experience" type things at the TC. Like don't make it hard for people. | 17:12 |
cardoe | TheJulia: not having people re-invent fixes themselves. | 17:12 |
JayF | cardoe: you want a hot take? | 17:12 |
cardoe | yes. | 17:12 |
cardoe | I don't have a good answer. | 17:12 |
JayF | cardoe: the worst DX in the whole project is the stuff happening in the DCO governance change | 17:12 |
TheJulia | I think there is a trap we've fallen into, we have a single ssl configuration section which might actually be wrong | 17:12 |
cardoe | Cause oslo.* is something that people push stuff to but then nobody maintains | 17:12 |
JayF | cardoe: trying to do something with reasonable scope, and then trying to boil the entire ocean with it | 17:12 |
cardoe | Yeah I feel like this DCO stuff should be broken up. | 17:13 |
JayF | oslo having low contribution, especially around bugs and responsiveness, is an artifact of the "we contribute to OPENSTACK" attitudes shifting around to be "we contribute to $subPROJECT" or "we contribute $FEATURE" | 17:13 |
TheJulia | i think I spotted another option regarding oslo.service which is deprecated and which we use, or maybe it was vice versa; it's getting a little blurry the more I look at the eventlet stuffs | 17:14 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2025.1: Allow to unprovision instance from service wait states https://review.opendev.org/c/openstack/ironic/+/951682 | 17:14 |
TheJulia | I think there is an evolution aspect where oslo made sense, but in some areas might not make sense for everything and everyone. There is a fine line to walk there and it is sort of situational | 17:15 |
TheJulia | but that worsens it when orgs shift more to feature focused work | 17:15 |
TheJulia | because there is inherently less motivation and technical reason to have that centralization for smaller bits of common-ish looking code | 17:16 |
opendevreview | Doug Goldstein proposed openstack/ironic master: allow running inspection hooks on redfish interface https://review.opendev.org/c/openstack/ironic/+/933066 | 17:16 |
TheJulia | (again, fine line, there are totally places where one *should* focus on that instead of the specific case, again, situational) | 17:16 |
cardoe | okay so bad example then. :-D | 17:17 |
JayF | It's interesting that eventlet is kinda demonstrating some of the value oslo still provides | 17:17 |
JayF | although I agree in general that openstack would be well-served to be ... less of our own ecosystem and more a part of the larger python/cloud ecosystems | 17:18 |
JayF | and oslo is an isolating factor there (as is eventlet, really) | 17:18 |
TheJulia | yup | 17:18 |
cardoe | okay https://review.opendev.org/c/openstack/ironic/+/951680 got touched after your +W TheJulia | 17:19 |
cardoe | I think that wording is better | 17:20 |
TheJulia | yeah, formatting fixups will likely occur later, but it's all good | 17:21 |
TheJulia | It gets the point across | 17:21 |
cardoe | Haseeb is one of the devs on my team that I had previously mentioned would be working on release goals that we put forward. | 17:24 |
TheJulia | cool cool | 17:26 |
TheJulia | so going back to the sslutils discussion, https://review.opendev.org/c/openstack/ironic/+/951054 is top of mind | 17:31 |
JayF | yeah, I owe cid some time looking into those failures with him, too | 17:31 |
cid | ++, that would be very helpful | 17:39 |
* JayF just realized he owes cid a devstack VM he never provisioned yesterday 🤦♂️ | 17:44 | |
cardoe | I think https://review.opendev.org/c/openstack/ironic/+/946741 is ready for a +W | 17:46 |
JayF | I've been waiting to +A that one because I was hoping Julia or Dmitry might take a look | 17:49 |
JayF | just not extremely confident in my ability to review new API endpoints in inspection | 17:49 |
opendevreview | Merged openstack/networking-baremetal master: Remove explicit use of eventlet https://review.opendev.org/c/openstack/networking-baremetal/+/947985 | 17:50 |
* cardoe golf claps. | 17:51 | |
TheJulia | I sort of wish that had a test, but... it really is so simple... | 17:53 |
TheJulia | cardoe: is the use pattern "hey, just call this url" but with no actual specific info regarding what? | 17:55 |
cardoe | yup | 17:55 |
cardoe | Not our change but I can see the usefulness of it. | 17:55 |
cardoe | The only guardrail I'd want added is to make sure that timeout is set and not 0 | 17:56 |
cardoe | But that would really require a stupid operator config to set it to 0 right now. | 17:56 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 17:57 |
TheJulia | shouldn't it have a check that timeout is not 0 then ? | 18:00 |
TheJulia | or a semi-malicious operator who knows how to hang socket handling | 18:00 |
JayF | that's a good call | 18:05 |
JayF | part of why I waited was it felt too simple | 18:05 |
JayF | but I couldn't put my finger on what it was missing | 18:05 |
rm_work | Oh I gained some new insight into our random server shutdown / power sync problem. I guess the issue is not actually that our connectivity to the BMCs via ipmi is flaky anymore, it's that the power checks in ironic queue up and eventually stuff queues so long that it times out… lol | 18:05 |
rm_work | Likely we need cells but we do not have cells yet <_< | 18:05 |
TheJulia | BTW, my change to excise tinyipa passes CI https://review.opendev.org/c/openstack/ironic/+/950206 I'd <3 a review | 18:05 |
JayF | rm_work: add conductors and/or conductor groups, and tweak the power status loop timing + number of threads | 18:06 |
TheJulia | rm_work: how many baremetal nodes do you have per conductor? | 18:06 |
JayF | rm_work: there are lots of ways to tune it | 18:06 |
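(For illustration, the conductor-group pinning mentioned above looks roughly like this; `rack-a` is a made-up group name.)

```ini
# ironic.conf on the conductor(s) that should own a subset of nodes
[conductor]
conductor_group = rack-a
```

```shell
# pin nodes to that conductor group
baremetal node set <node-uuid> --conductor-group rack-a
```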
rm_work | Yeah I’ll look at our existing config, I think we already tried | 18:07 |
JayF | what is your node:cond ratio like Julia asked | 18:07 |
JayF | the other possibility is one or two bad BMCs used to be able to cause this behavior, because ipmitool would go out to lunch; but I think we timeout those now | 18:07 |
rm_work | We have three conductors and … approximately 300 nodes I think | 18:09 |
rm_work | I’m trying to look up our config quick | 18:09 |
rm_work | Ah just over 400 actually, so like 136 nodes per conductor | 18:10 |
TheJulia | and all ipmi ? and is it the execution of ipmitool which is timing out OR the conductor task thread in the queue? | 18:11 |
JayF | that is more than enough conductors:node | 18:11 |
JayF | I usually suggest a ratio of more like 500:1 without taking HA into account | 18:11 |
rm_work | Hmm let me look at logs and try to figure that out | 18:11 |
rm_work | I was really just intending to drive by and give an update, but if you’re gonna throw possible solutions at me I’ll take some time to follow up on this stuff quick 😁 | 18:12 |
TheJulia | so we have some operators who are running upwards of like 600+ nodes a conductor, but not all ipmi | 18:12 |
TheJulia | it really depends on the state, settings and exact behavior | 18:12 |
TheJulia | (and if the bmcs are well behaved as well, which is never a guarantee with ipmi) | 18:13 |
rm_work | 100% ipmi | 18:13 |
rm_work | It is possible we still have some misbehaving, but I was corrected: it's not a common occurrence anymore | 18:14 |
JayF | I would only believe "these IPMI BMCs are not acting up commonly" when backed up with logs :D | 18:14 |
rm_work | I believe we attempted to increase parallelism and timeouts, but do you know offhand which config vars are most relevant | 18:14 |
rm_work | Or what I can search logs for to prove definitively one way or the other | 18:15 |
TheJulia | I have seen ipmitool commands just timeout at times, but generally that is outside of ipmitool itself running. That gets returned as an error, and if you make the parallelism and queue too deep, I seem to think you end up sort of focusing on those nodes, because we want to focus on recovery/management of nodes which are being problematic as opposed to ones in a happy state. | 18:20 |
TheJulia | at least, that is what my memory recall and thoughts are leaning towards | 18:20 |
rm_work | Ok, well I have debug on here now so should see that if it happens | 18:21 |
opendevreview | Merged openstack/networking-generic-switch master: Remove explicit use of eventlet https://review.opendev.org/c/openstack/networking-generic-switch/+/951026 | 18:21 |
rm_work[m] | excuse my 6-line paste | 18:21 |
rm_work[m] | periodic_max_workers = 16... (full message at <https://matrix.org/oftc/media/v1/media/download/ASpmI2zdTVrOaO4Yy4dDGmFkPqRY1YMNh8XaY0VOBo9dsH4L-Y278upYx10zhsmypIA8G2CFr0z5BEs9f1gjjlNCeXfyBY4gAG1hdHJpeC5vcmcvV2pWRVdKS0hveVRsdHNjaFJOYmJHd0Zs>) | 18:21 |
rm_work[m] | I think that is all the relevant config | 18:22 |
JayF | rm_work[m]: sync_power_state_interval should be higher | 18:23 |
JayF | rm_work[m]: and the retries being 30 is your problem | 18:23 |
JayF | 30 retries is absolutely bananas | 18:24 |
rm_work[m] | ah I thought we already tuned that up... maybe they just tuned up the retries and not the interval? which would only exacerbate the issue I suppose lol | 18:24 |
JayF | if there's a 60 second timeout (unsure) a single failing machine will overrun your interval | 18:24 |
rm_work[m] | what is a safe power_state_interval? | 18:24 |
JayF | that's 100% a situational question | 18:24 |
rm_work[m] | I mean should we 10x it? | 18:24 |
cardoe | TheJulia: ugh sorry... https://review.opendev.org/c/openstack/ironic/+/951680 once more :/ | 18:25 |
rm_work[m] | hmm | 18:25 |
JayF | I've seen it completely disabled in some worlds | 18:25 |
JayF | basically; how much lag time is OK if the node controls its own power | 18:25 |
JayF | e.g. shutdown from CLI or error or something | 18:25 |
JayF | before ironic sees it's off | 18:25 |
rm_work[m] | so like retries=5 interval=1200? | 18:25 |
JayF | retries=3 is the default | 18:25 |
JayF | I would tune that down down down down | 18:25 |
JayF | you don't wanna retry, you want the bad ipmi instances to identify themselves, go to maintenance, and get outta your loop | 18:26 |
TheJulia | oh yeah, 30 retries is super bananas since ipmitool will sit and run forever | 18:26 |
rm_work[m] | lol that sounds... fair | 18:26 |
JayF | in onmetal | 18:26 |
JayF | we had something like, 3 conductors and 800 nodes in a region | 18:26 |
JayF | 3 failing nodes was enough to make the power status loop croak | 18:26 |
rm_work[m] | ok so i'll see if i can tweak it down to default retries and put the interval up to 30min | 18:26 |
JayF | I think we've improved the story a lot since then, but IPMI BMCs *really* are talented at going out to lunch | 18:27 |
rm_work[m] | parallelism seem ok? | 18:27 |
rm_work[m] | 32 workers? | 18:27 |
JayF | you won't see much pain from that being too high unless you increase your node:conductor ratio | 18:27 |
TheJulia | honestly, I've seen folks lean toward retries=2 or even 1 since the loop pattern will focus on those problem nodes | 18:29 |
JayF | +++ | 18:29 |
rm_work[m] | oh interesting, k | 18:29 |
JayF | yeah basically the loop is your canary | 18:29 |
JayF | if you can't get power status | 18:29 |
rm_work[m] | I will attempt to fix this config and report back in two weeks after they actually let me make a config change >_< | 18:29 |
JayF | then if a customer tries to provision that node, it'll also not work | 18:30 |
TheJulia | yeah, this is one of those perfection is the enemy of good and unhappy nodes can make your power sync loop unhappy quickly | 18:30 |
rm_work[m] | I ... miss yahoo, rofl | 18:30 |
rm_work[m] | at least there they let us deploy literally constantly | 18:30 |
JayF | perfection is only possible by letting the loop fail and pulling the bad machines outta the rotation | 18:30 |
JayF | rm_work[m]: ... | 18:30 |
TheJulia | rm_work[m]: feel free to link to eavesdrop and this conversation ;) "I talked with the maintainers/experts" ;) | 18:30 |
rm_work[m] | heh yeah | 18:30 |
JayF | rm_work[m]: I am trying to think of a polite way to respond to that comment | 18:30 |
rm_work[m] | lol | 18:30 |
JayF | rm_work[m]: maybe they shoulda tried not needing to deploy all the time ;) | 18:30 |
rm_work[m] | I mean the sentiment behind that was "unexpectedly" | 18:31 |
rm_work[m] | I do not normally expect to miss it :D | 18:31 |
TheJulia | I could totally see focusing on the old pattern to try and get perfection on the sync | 18:31 |
TheJulia | but we did make some intentional changes to the loop... 6-7 years ago | 18:32 |
TheJulia | to focus the pattern | 18:32 |
rm_work[m] | we definitely have flaky nodes, it is just less than I thought, but if 1-2 could bring shit down then... definitely possible | 18:32 |
JayF | with 30 replies | 18:32 |
JayF | one could do it | 18:32 |
JayF | **retries | 18:32 |
TheJulia | yeah, that subprocess exec will just spin and spin | 18:32 |
TheJulia | because you end up with 30 retries with ?10? second timeouts, it quickly compounds out of control | 18:33 |
cardoe | I gave that multinode tinyipa thing a +2... took me a couple to understand the switch_info change but it's good. | 18:33 |
TheJulia | a single node, 300 seconds, etc | 18:33 |
cardoe | I've also +2'd the backports for the agent get_XXX_steps stuff. the backport to 27.0 failed but it seems like python crashed so I retried that one. | 18:34 |
rm_work[m] | thanks a ton, will attempt to get these changes in | 18:34 |
TheJulia | oh, yeah, interface name length restrictions too ;) | 18:34 |
TheJulia | cardoe: thanks! | 18:34 |
* TheJulia spends some time reviewing n-g-s changes | 18:35 | |
cardoe | That release note change failed CI and Haseeb made another push and now it passed... sorry for the wasted +W efforts but it's good now | 18:36 |
cardoe | https://review.opendev.org/c/openstack/ironic/+/951680 | 18:40 |
rm_work[m] | just one follow-up question -- how would a single node take up so many workers if it fails? like can multiple workers be retrying the same node at the same time? | 18:46 |
opendevreview | Doug Goldstein proposed openstack/ironic master: allow running inspection hooks on redfish interface https://review.opendev.org/c/openstack/ironic/+/933066 | 18:47 |
rm_work[m] | logically with 30 retries and 300 interval, if the timeout is 10s, then 30 retries would put you right at that threshold (I am guessing someone did that exact math to pick those) | 18:47 |
rm_work[m] | but why would it block the other ... 31/32 workers | 18:47 |
rm_work[m] | interesting, the default sync interval is 60 | 18:50 |
cardoe | I created https://review.opendev.org/c/openstack/ironic/+/951682 as a backport (it'll pass CI since the only thing left is a non-voting job). I think that's good because it'll make the behavior consistent in 2025.1. The fix to make unprovision work landed in 2025.1 for all states except for the one that this corrects. I wouldn't backport past that since it would involve other backports to older versions. I gave it a +2 | 18:51 |
cardoe | since I didn't write the change. | 18:51 |
TheJulia | rm_work[m]: yeah, so the way it works is the nodes are distributed in lists across the threads to invoke ipmitool | 18:59 |
TheJulia | so *one* node spending time being timed out can cause the rest of the nodes in the queue to never be reached, which starts to create a cascading problem of sorts | 19:00 |
opendevreview | Merged openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 19:00 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 19:02 |
rm_work[m] | TheJulia: wait so a single node fails once... then 32 workers spin up simultaneously to try to sync it?? | 19:04 |
TheJulia | no | 19:04 |
TheJulia | so... | 19:04 |
TheJulia | think of it this way | 19:04 |
rm_work[m] | I just don't understand how less than 32 nodes being bad could tie up 32 workers | 19:04 |
TheJulia | you end up with a list of like 25-30 nodes to check the power state | 19:04 |
TheJulia | across say 8 threads | 19:05 |
rm_work[m] | hmm | 19:05 |
TheJulia | so, 20-30 seconds pass, it is half way through the list | 19:05 |
TheJulia | it hits a node that just stalls out the entire check | 19:05 |
TheJulia | so everything behind it, and that node itself, gets prioritized for being checked on the next time the entire power sync loop triggers to allocate sync work | 19:05 |
rm_work[m] | well we have 32 workers, I feel like for it to be a problem we'd need a LOT of failing BMCs not just 1-2 | 19:05 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2025.1: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951631 | 19:06 |
JayF | There is some weirdness about how those variables interact, isn't there? I can't remember exactly, and quite frankly I'm already deep enough in my to-do list that I shouldn't look it up, but I think there's something about that number maybe only being the number of overall conductor threads that are dedicated to it | 19:06 |
TheJulia | yeah, but then that failing node is first the next time, and depending on the split you end up with 1-2 threads which are entirely stuck power-sync-wise because their very first node is the oldest unchecked node | 19:06 |
JayF | So like if the overall conductor thread number is lower you won't have 25 threads to dedicate to power status loops | 19:06 |
TheJulia | because we don't have an up to date power state | 19:07 |
rm_work[m] | also, apparently people internally think we should NEVER shut down a node because of power checks, so ... maybe we should just disable the thing and set interval to 0 lol | 19:07 |
cardoe | Is https://review.opendev.org/c/openstack/ironic/+/951631 correct for backporting those two together? | 19:07 |
TheJulia | those nodes never really get checked, they get focused in on again, but now every sync you have some number of threads which will timeout their overall check operation | 19:07 |
TheJulia | rm_work[m]: so, the issue is sort of a management issue | 19:08 |
TheJulia | rm_work[m]: what is happening then is the node state goes to none | 19:08 |
TheJulia | nova eventually picks up on that and flags it out and then ends up telling ironic to power it down | 19:08 |
JayF | https://docs.openstack.org/ironic/latest/configuration/sample-config.html check [conductor]/workers_pool_size | 19:09 |
JayF | the 25 threads for power status loop come from that pool | 19:10 |
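(A toy sketch of the behaviour being described here, not Ironic's actual code: each worker walks its own slice of nodes sequentially, so one BMC that burns its full retry budget delays every node behind it in that slice; the sleeps are scaled down so the demo runs instantly.)

```python
import time
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node-{i}" for i in range(24)]
STALLED = {"node-3"}          # one misbehaving BMC
WORKERS = 8

def check_power(node):
    # a stalled BMC eats retries * per-attempt timeout before giving up
    time.sleep(0.5 if node in STALLED else 0.01)
    return node, node not in STALLED

def walk_chunk(chunk):
    # nodes in a chunk are checked one after another, so a stall blocks the rest
    return [check_power(node) for node in chunk]

# split the node list across a small pool, the way a periodic sync pass might
chunks = [NODES[i::WORKERS] for i in range(WORKERS)]
start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(walk_chunk, chunks))
print(f"sync pass took {time.monotonic() - start:.2f}s; "
      "the chunk containing node-3 finished last")
```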
rm_work[m] | yeah I think we would also set nova sync_power_state_interval = -1 | 19:10 |
rm_work[m] | yeah our workers_pool_size is 500 lol | 19:10 |
JayF | so if the number is too low you might see weird behavior if you're trying to spawn more threads than you have | 19:10 |
JayF | okay | 19:10 |
rm_work[m] | IDK if that is stupidly high tho | 19:10 |
TheJulia | we're sort of talking about changing the default with eventlet | 19:10 |
TheJulia | but.... that is not a now() change | 19:11 |
JayF | I would suggest generally | 19:11 |
JayF | it looks like the knobs in your deployment have kinda been indiscriminately turned up higher | 19:12 |
JayF | consider restoring defaults for the worker pool size and power state workers and the like | 19:12 |
rm_work[m] | hmmm k | 19:12 |
JayF | mainly because 1) you're well below any known scaling thresholds and 2) it's much harder to reason about performance when your config is weirdly tuned | 19:12 |
TheJulia | Yeah, tons of retries have bitten folks in the past. | 19:12 |
rm_work[m] | ah default is 300 for the workers_pool_size so not THAT far off | 19:13 |
JayF | I mean, it's 166% of the default value | 19:13 |
JayF | that's not trivial ): | 19:13 |
rm_work[m] | but yes I am returning retries to 3, but turning up the interval to ... maybe just 600 | 19:13 |
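(A sketch of the [conductor] tuning being settled on; the option names are real ironic.conf options, but the values are just the ones floated in this conversation rather than universal recommendations. Nova's own loop is disabled separately via `sync_power_state_interval = -1` in nova.conf, as noted above.)

```ini
[conductor]
# how often the power sync loop runs; default is 60 seconds
sync_power_state_interval = 600
# per-node retry budget; default is 3, high values let one bad BMC eat the interval
power_state_sync_max_retries = 3
# threads used for the power sync loop, drawn from the general pool; default is 8
sync_power_state_workers = 8
# overall conductor worker pool; default is 300
workers_pool_size = 300
```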
JayF | also force_power_state_during_retries may be impactful | 19:13 |
rm_work[m] | I just don't know how much it matters for the syncs to be that fast | 19:14 |
JayF | because if you can get status but can't power on... I think that'll put the machine in maintenance, nevermind | 19:14 |
JayF | rm_work[m]: ironic really doesn't care much, ironic knows the power state unless people are powering on/off servers willy-nilly | 19:14 |
rm_work[m] | yeah I don't know that anyone is really turning off their servers... | 19:14 |
JayF | rm_work[m]: the biggest impact IMO is that if a user shut down the instance inside the OS, and nova doesn't know b/c ironic hasn't sync'd that power state to its DB (and then to nova) | 19:14 |
JayF | IDK if the person would be able to `start` the instance to power it back up | 19:15 |
JayF | because nova would think it's on already | 19:15 |
rm_work[m] | hmm | 19:15 |
JayF | that is a mild impact at most | 19:15 |
JayF | and if you disable nova power sync, you're already paying that price | 19:15 |
rm_work[m] | they could just tell it to reboot? heh | 19:15 |
rm_work[m] | I mean I assume if nova thinks it is on, but it is off, they could tell nova to ... turn it off | 19:15 |
JayF | or they could use the nova reboot command :) | 19:15 |
rm_work[m] | and that would no-op, yes | 19:16 |
JayF | rm_work[m]: that's outside of the box thinking, nice job | 19:16 |
JayF | lol | 19:16 |
rm_work[m] | T_T | 19:16 |
TheJulia | stevebaker[m]: I reviewed some of your n-g-s changes, I didn't leave a vote on the change adding security group support into the actual netmiko model because it has me thinking | 19:21 |
TheJulia | comments on the change, mostly more perception related | 19:21 |
cardoe | alright gonna wander away for a bit. I threw out a pile of +2s on backports stuff. Pile of stuff in some state of ready for maintainers in https://review.opendev.org/q/hashtag:%22ironic-week-prio%22+AND+status:open | 19:27 |
stevebaker[m] | TheJulia: ok thanks I'll take a look | 19:28 |
cardoe | I'll note that some of those changes in that queue have 4 different people +2ing them so they're probably pretty good. | 19:33 |
JayF | It's kinda a little crazy how different our cleaning flow is with disable_ramdisk=True | 20:09 |
JayF | I had tested this entire automated cleaning by runbook change with it set true in the runbook, and basically there's an entire separate place to update for not-disabled ramdisk :| | 20:10 |
JayF | I think I have it figured out now, but it's not awesome | 20:10 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Automated cleaning by runbook https://review.opendev.org/c/openstack/ironic/+/945259 | 20:26 |
cardoe | Well I hope it passes. | 20:28 |
JayF | CI won't catch this | 20:31 |
JayF | I have to test in my devstack again | 20:32 |
cardoe | Is https://review.opendev.org/c/openstack/ironic/+/951631 the right way to squish two commits together which should be one for a backport? | 20:35 |
cardoe | The second being a fix-up of the release note | 20:35 |
cardoe | I referenced both commits in the cherry-pick comments | 20:36 |
JayF | +2 lgtm | 20:40 |
cardoe | your automated cleaning failed pep8 | 20:40 |
JayF | I'm not worried about that if it passes devstack :) | 20:40 |
JayF | I use gerrit as my remote code store, not just pushing when it's ready | 20:40 |
cardoe | ah just wanted to let ya know. I'm watching Zuul intently. | 20:41 |
JayF | TheJulia: IDK if today was just an announcement, but congrats on OIF->LF closing | 20:43 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2024.2: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951716 | 20:46 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2024.1: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951717 | 20:52 |
opendevreview | Doug Goldstein proposed openstack/ironic bugfix/28.0: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951718 | 20:53 |
opendevreview | Doug Goldstein proposed openstack/ironic bugfix/27.0: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951719 | 20:53 |
opendevreview | Merged openstack/ironic master: CI: remove legacy devstack baremetal admin and observer role usage https://review.opendev.org/c/openstack/ironic/+/951445 | 21:06 |
rm_work[m] | offhand, any major bugs/issues that have been fixed in the last year that I can reference as possible reasons why we'd need to make sure we get to latest? we're running like... 2024.1 | 21:16 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Ensure parameter transform happens early https://review.opendev.org/c/openstack/ironic/+/948795 | 21:18 |
JayF | rm_work[m]: how about "so we continue getting security updates past halloween" https://usercontent.irccloud-cdn.com/file/dUebTDYh/image.png | 21:24 |
rm_work[m] | lol | 21:24 |
rm_work[m] | I suppose that might be relevant | 21:25 |
opendevreview | cid proposed openstack/networking-generic-switch master: Cast numeric Netmiko kwargs to native types. https://review.opendev.org/c/openstack/networking-generic-switch/+/951724 | 22:21 |
opendevreview | Merged openstack/ironic master: api: Ensure parameter transform happens early https://review.opendev.org/c/openstack/ironic/+/948795 | 22:36 |
opendevreview | Michal Nasiadka proposed openstack/networking-generic-switch master: doc: Rework support matrix for trunk driver https://review.opendev.org/c/openstack/networking-generic-switch/+/942338 | 22:55 |
opendevreview | Merged openstack/python-ironicclient master: Cast string boolean from CLI https://review.opendev.org/c/openstack/python-ironicclient/+/951600 | 23:33 |