opendevreview | Jacob Anders proposed openstack/sushy-tools master: [WIP] Replace hardcoded BiosVersion with an updatable field https://review.opendev.org/c/openstack/sushy-tools/+/909487 | 03:16 |
opendevreview | Jacob Anders proposed openstack/sushy-tools master: Replace hardcoded BiosVersion with an updatable field https://review.opendev.org/c/openstack/sushy-tools/+/909487 | 03:20 |
opendevreview | Ebbex proposed openstack/bifrost master: Refactor the use of include_vars https://review.opendev.org/c/openstack/bifrost/+/888448 | 08:31 |
iurygregory | good morning Ironic | 11:22 |
opendevreview | Jacob Anders proposed openstack/sushy-tools master: Replace hardcoded BiosVersion with an updatable field https://review.opendev.org/c/openstack/sushy-tools/+/909487 | 11:25 |
iurygregory | JayF, hey I just saw I have a 1-1 with my manager today and it will conflict with the presentation <facepalm> | 11:32 |
iurygregory | yay 360sec to update firmware is not enough... =X | 11:51 |
TheJulia | iurygregory: this is an hpe machine, does it have system bios firmware 2024.04 ? | 13:17 |
iurygregory | TheJulia, let me get you the info | 13:18 |
TheJulia | I ask because I have a report of a machine which just came in which is reporting UnableToModifyDuringSystemPOST | 13:18 |
iurygregory | TheJulia, U30 v2.54 (09/03/2021) | 13:19 |
iurygregory | at least from redfish this is the info I got for BiosVersion | 13:19 |
TheJulia | oh well, different cases then | 13:19 |
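(For context, a minimal sketch of how the BiosVersion field quoted above can be read over Redfish with the sushy library; the BMC URL, system path, and credentials are placeholders, not taken from this log.)

```python
# Illustrative only: reading the Redfish BiosVersion field.
# Endpoint, credentials, and the system path are placeholders.
import sushy

root = sushy.Sushy('https://bmc.example.com/redfish/v1',
                   username='admin', password='secret', verify=False)
system = root.get_system('/redfish/v1/Systems/1')
print(system.bios_version)  # e.g. "U30 v2.54 (09/03/2021)"
```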
iurygregory | but I remember seeing some messages related to POST in the interface: "The server might be in POST. Information on this page is current as of last power off. This information is updated after the server is powered on and POST is complete." | 13:21 |
iurygregory | but I haven't seen UnableToModifyDuringSystemPOST | 13:21 |
TheJulia | I've seen it sporadically, like when someone retries introspection or something silly on a machine which was just powered up or rebooting | 13:43 |
iurygregory | I'm re-configuring my bifrost setup to add an HPE machine and test things | 13:49 |
TheJulia | ok | 15:06 |
JayF | iurygregory: no big deal, it happens. I'll move forward with just Julia today, if you wanna look over the slides for feedback lmk | 15:22 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: CI: dial back inspctor voting https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/920113 | 15:54 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: CI: dial back inspector voting https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/920113 | 15:58 |
iurygregory | JayF, ack, happy to provide feedback in the slides | 17:37 |
TheJulia | Thoughts on bringing back SPUC? | 17:56 |
dtantsur | never against | 18:04 |
JayF | I'm always in | 18:35 |
JayF | iurygregory: https://docs.google.com/presentation/d/1U_I0XC_A5RaZxUotqKGf02hWgMYuPyfSXhirxh9clcI/edit?usp=sharing just a caveat that it's intentionally vague in places so the talk isn't two hours long :D | 18:36 |
iurygregory | First slide is amazing :D | 18:48 |
TheJulia | I really like the bear images | 18:59 |
TheJulia | including the bear on a vacuum | 18:59 |
JayF | it's just copilot-generated images | 18:59 |
JayF | I had to focus on content to get the slides completely done, adding more fun bear photos might be a way I spend my evenings while travelling | 18:59 |
JayF | lol | 18:59 |
TheJulia | The cleaning bear demands servers to clean | 19:01 |
JayF | honestly my favorite is the giant bear racking the servers | 19:01 |
dtantsur | JayF: slide 10, I always make a mistake in this word, but I believe it's dependEnt? | 19:03 |
JayF | you should have comment access if you wanna leave it inline | 19:03 |
JayF | both work for the noun form | 19:04 |
dtantsur | this is why I was not sure whether to leave a comment :) | 19:04 |
JayF | looks like "e" is usually the adjective, I'll check my usage | 19:04 |
JayF | e is probably a better fit | 19:04 |
JayF | that's one of those words I probably spell differently each time I type it and never realized since both are ~valid :D | 19:04 |
dtantsur | lol true | 19:05 |
dtantsur | Bear pictures are absolutely great | 19:05 |
JayF | So I launched a thing today: https://podcast.gr-oss.io or https://youtube.com/@oss-gr under the podcasts tab :D | 19:05 |
dtantsur | There are 20 slides in total, right? Bloody google is misbehaving, so I want to double-check | 19:07 |
JayF | 50+ | 19:07 |
dtantsur | Heck | 19:08 |
JayF | 55 | 19:08 |
dtantsur | Nope, it won't load past 20 :( | 19:10 |
JayF | dtantsur: you're missing the best slide https://usercontent.irccloud-cdn.com/file/qPAHKGSH/image.png | 19:10 |
JayF | (rpittau is presenting on metal3 directly beforehand so I am purposefully skipping this) | 19:11 |
dtantsur | JayF: scary! :D | 19:12 |
JayF | I started typing the slide and all the sudden I heard a whisper from the inky shadows telling me to "go" | 19:13 |
dtantsur | Okay, I got a pdf of the whole thing. No commenting for me though. | 19:13 |
TheJulia | JayF: by chance from our discussion yesterday, did you create a bug? | 19:15 |
JayF | going to be honest, I've been in a whirlwind of meetings all morning and forgot about that discussion entirely until now | 19:16 |
JayF | lol | 19:16 |
JayF | no bug for it yet, I wasn't sure exactly how/what to bug | 19:16 |
TheJulia | ahh, okay | 19:16 |
JayF | it seems like some mix of expected behavior adding on top of expected behavior until the VM melts | 19:16 |
TheJulia | I'll take care of it then, I have an idea, and maybe some extra logging might also help provide the needed clarity | 19:16 |
dtantsur | Jira also doesn't open normally. I guess it's a sign for me to stop working past 9pm | 19:16 |
JayF | Probably root_device_hint of md0 when we create the configuration = boom | 19:16 |
TheJulia | dtantsur: I just had to reboot my router, but you're like a third of a planet away so *shrug* | 19:17 |
dtantsur | WiFi here is so dense that I seem to feel it while walking through the room :D especially after I added a repeater | 19:19 |
dtantsur | but no, jira loaded in the end, it's just being jira | 19:19 |
TheJulia | Well, jira not loading is a sign it is time to call it a day :) | 19:20 |
dtantsur | And that I will :) See you tomorrow folks | 19:20 |
cid | \o | 19:21 |
cid | JayF: thorough (to my limited knowledge :D), beautiful images, scanty enough that it's not intimidating. | 19:22 |
JayF | ty o/ | 19:25 |
opendevreview | cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration https://review.opendev.org/c/openstack/ironic/+/917229 | 19:44 |
opendevreview | cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration https://review.opendev.org/c/openstack/ironic/+/917229 | 19:49 |
opendevreview | Julia Kreger proposed openstack/ironic-python-agent master: WIP: Correct ramdisk from re-scanning disks on step step collection https://review.opendev.org/c/openstack/ironic-python-agent/+/920151 | 21:33 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Add Node Servicing to api-ref https://review.opendev.org/c/openstack/ironic/+/920152 | 21:34 |
TheJulia | JayF: it wasn't the cache action, it was the evaluate_hardware_support | 21:35 |
JayF | well, we also recache the node on get_deploy_steps | 21:35 |
JayF | so we're still doing a *lot* of scanning | 21:36 |
JayF | I think there's a potentially better fix than the one you offer, curious what you think | 21:36 |
JayF | dispatch_to_all_managers need to filter out things with hardware support == 0 (disabled/unsupported) | 21:36 |
TheJulia | oh, we only trigger it *if* and only *if* what ironic has on the node has changed since the original lookup | 21:36 |
JayF | and cache that result | 21:36 |
JayF | then we avoid the call to evaluate_hardware_support at a lower level, and more closely obey the API contract of hardware managers (imo) | 21:37 |
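(A rough sketch of the filtering JayF proposes here, assuming a cache that already holds each manager's evaluate_hardware_support() result; the names and the (manager, support) cache shape are illustrative, not the actual ironic-python-agent code.)

```python
# Skip managers that reported hardware support 0 when collecting steps, so
# get_*_steps never returns steps that could not run on the node.
HW_SUPPORT_NONE = 0


def collect_steps(cached_managers, step_method):
    """cached_managers: iterable of (manager, cached_support_value) pairs."""
    steps = []
    for manager, support in cached_managers:
        if support == HW_SUPPORT_NONE:
            continue  # disabled/unsupported manager contributes nothing
        collector = getattr(manager, step_method, None)
        if collector is not None:
            steps.extend(collector())
    return steps
```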
TheJulia | I can see it in the logs, the node didn't change, it didn't trigger there, it was evaluate, but if it did, it would have been double scanned | 21:37 |
TheJulia | I dunno, I suspect evaluate_hardware_support is kind of misused | 21:38 |
JayF | I don't suspect it, I kinda know it | 21:38 |
TheJulia | and so it got some of these one off actions | 21:38 |
TheJulia | one kind of might make sense, even the existing disk scan... maybe | 21:38 |
JayF | and my suggested change would have the identical result of yours (running evaluate_hardware_support less) in addition to fixing our result for get_*_steps to not include steps that'll never run on a node | 21:38 |
* JayF has an IDE open and 5 minutes, will push a prototype | 21:40 | |
TheJulia | dunno, that feels like a lot of mechanics to change when this seems sort of clean and at less risk of unexpected things | 21:40 |
TheJulia | cool cool | 21:40 |
JayF | IMHO the get_*_steps stuff is a pretty gnarly bug, too | 21:40 |
TheJulia | it might also be because I'm not seeing it | 21:40 |
JayF | TheJulia: it's because the bug that I thought existed doesn't, and instead another nastier one does. | 21:43 |
JayF | TheJulia: the managers are supposed to be cached -- line 3287 in get_managers() | 21:43 |
JayF | both dispatch_to_managers & dispatch_to_all_managers call get_managers() | 21:43 |
JayF | so theoretically, we should have those results *cached already* and evaluate_hardware_support should not run again | 21:43 |
JayF | the fact it is indicates to me that caching in get_managers() is busted | 21:43 |
JayF | the caching: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3287 | 21:45 |
JayF | we call get_managers: https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3335 | 21:45 |
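(The pattern being described, reduced to a sketch: a module-level cache that keeps evaluate_hardware_support() to exactly one call per process, where a cache miss on any path re-triggers the expensive, disk-scanning evaluation. Symbol names are illustrative, not the real hardware.py ones.)

```python
_cached_managers = None  # module scope, like the _global_managers discussed


def get_managers(discover_managers):
    """discover_managers: callable returning every loaded manager instance."""
    global _cached_managers
    if _cached_managers is None:
        # evaluate_hardware_support() is the expensive call; with the cache
        # working as intended, it only ever runs on this first pass
        _cached_managers = [m for m in discover_managers()
                            if m.evaluate_hardware_support() > 0]
    return _cached_managers
```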
* JayF put these notes on your proposed change | 21:48 | |
JayF | I suspect this is a side-effect of get_deploy_steps being synchronous | 21:48 |
TheJulia | ... by design, it is supposed to be | 21:51 |
JayF | well I'm saying like, we execute that command in context of the wsgi server | 21:52 |
JayF | which I wonder if it makes the execution path different [hand waves] | 21:52 |
JayF | that last line is me guessing at a root cause without enough data (yet) | 21:52 |
TheJulia | yes, likely, and which means my fix might only kick in on the second call | 21:52 |
TheJulia | (might* | 21:52 |
JayF | but I'm fairly certain about what I'm describing being the root cause bad behavior | 21:52 |
JayF | and fixing it would likely help agent in more ways than just this | 21:53 |
TheJulia | I don't see how given the call flow | 21:53 |
TheJulia | unless the intent with dispatch is only ever call once | 21:53 |
TheJulia | and if that is really truly the case, then it has been broken for years | 21:53 |
JayF | > TheJulia> I don't see how given the call flow | 21:53 |
JayF | I don't follow | 21:53 |
TheJulia | or never worked as expected | 21:53 |
JayF | **or** the behavior changed with newer eventlet or the removal of greendns | 21:53 |
TheJulia | quite possible as well | 21:53 |
JayF | I never trust eventlet to behave sanely in IPA because we only partially monkey patch | 21:53 |
TheJulia | the flow, as I read it: because we need to get the steps, we still end up triggering explicit calls to load and evaluate; they are not coded today as one-time operations, but as operations to be triggered across the board | 21:54 |
JayF | zoom out from that and zoom in just to the flow between dispatch_to{,_all}_managers and get_managers | 21:54 |
TheJulia | okay | 21:55 |
JayF | that is where the bad interaction is AIUI | 21:55 |
JayF | because _global_managers /should be/ keeping that evaluate_hardware_support() call to *exactly once* | 21:55 |
JayF | and there's some reason I don't grok yet why not | 21:55 |
opendevreview | cid proposed openstack/ironic master: Flexible IPMI credential persistence method configuration https://review.opendev.org/c/openstack/ironic/+/917229 | 21:56 |
TheJulia | ahh | 21:57 |
TheJulia | hmmm yes, it *should* | 21:57 |
TheJulia | ... I guess my worry is "fixing" it breaks things even more | 21:58 |
TheJulia | oh | 21:58 |
JayF | We're going to be changing a lot of code in IPA to excise eventlet, we can't be afraid of this kinda structural change IMO | 21:58 |
TheJulia | no | 21:58 |
TheJulia | once per invocation | 21:58 |
TheJulia | it is getting invoked every single time it is called | 21:58 |
JayF | _global_managers is initialized in module scope | 21:58 |
JayF | hardware.py line 54 | 21:59 |
JayF | (yes?) | 21:59 |
TheJulia | https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/hardware.py#L3504 | 21:59 |
TheJulia | but it only checks in response | 21:59 |
JayF | oh hell | 21:59 |
TheJulia | s/response/manager/ | 22:00 |
TheJulia | with getattr, which makes sense for what it is doing | 22:00 |
JayF | so I suggest we add the value of the hw support number provided by the manager in _global_managers | 22:01 |
JayF | add it to the dict response, maybe optionally if needed for backwards compat, from get_managers() | 22:01 |
TheJulia | shoot | 22:01 |
TheJulia | we need a dry run call or something | 22:01 |
TheJulia | not execute | 22:02 |
TheJulia | jeeeeeeze | 22:02 |
JayF | then refactor that code to use get_managers() directly | 22:02 |
JayF | so it gets a cached copy | 22:02 |
JayF | basically this is just a cache miss that's killing us | 22:02 |
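(A sketch of the fix JayF is floating, under the assumption that get_managers() now returns entries carrying the cached support value: code further down the stack, like the check around hardware.py line 3504, looks the number up instead of calling evaluate_hardware_support() again. The entry shape is made up for illustration.)

```python
def support_for(manager, cached_entries):
    """cached_entries: e.g. [{'manager': mgr, 'support': 3}, ...] as returned
    by a get_managers()-style call that evaluated support exactly once."""
    for entry in cached_entries:
        if entry['manager'] is manager:
            return entry['support']
    return 0  # unknown manager: treat as unsupported rather than re-evaluate
```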
* JayF is whipping a thing up | 22:03 | |
TheJulia | it is not that the method even really needs to be called, but that it triggers the other actions further down the rabbit hole | 22:03 |
JayF | if it's explicitly calling it to get a side effect, we shouldn't have merged that code IMO | 22:04 |
JayF | I sure as hell hope we aren't relying on that behavior, if so it's extra gross | 22:04 |
JayF | our documentation says it gets called exactly once iirc | 22:04 |
TheJulia | JayF: I think it getting merged is a side effect of a lack of lower-level understanding of the dispatch logic; once it suddenly stops working the expected way (a cache miss, as you frame it), this problem gets created | 22:08 |
JayF | well, I'm going to propose a fix which gets rid of the side effect, and then maybe root out the fallout from that and go from there | 22:09 |
JayF | I am -1 to making IPA have more workaround logic because we can't follow our own API spec | 22:09 |
TheJulia | sounds good | 22:09 |
TheJulia | I agree, now that I better understand what you were talking about | 22:10 |
JayF | (to be clear: I've always reviewed almost all changes into IPA; not meaning this as blamey-annoyance towards anyone; I have no idea who wrote this but there's a great chance I blindly +2'd it) | 22:10 |
TheJulia | well, if it cached, it is safe | 22:10 |
TheJulia | if not, well then.... | 22:10 |
JayF | like, I'm going to cache and return the number to it | 22:11 |
JayF | but if the code relies on some down-the-call-stack hwm getting evaluate_hardware_support called "N" times | 22:11 |
JayF | that's not going to happen anymore | 22:11 |
JayF | heh | 22:17 |
JayF | get_managers has a double-call of evaluate_hardware_support, too | 22:17 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:31 |
JayF | TheJulia: ^ I haven't touched unit tests yet, but I'm curious to see what tempest thinks about this | 22:32 |
TheJulia | only 9001 :) | 22:32 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:33 |
JayF | I noticed I swapped the sort key, would've been sorted in reverse prio order, that would've been fun :| | 22:33 |
TheJulia | feels like 9002 now ;) | 22:33 |
JayF | how's the structure of that? I think I like it | 22:33 |
TheJulia | yeah, so very not fun :) | 22:33 |
JayF | we resort on every call to _get_managers but I think that's probably approximately zero python time | 22:34 |
JayF | sorting a dict of probably <10 entries! Oh no! | 22:34 |
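(A tiny illustration of the sort being discussed, with made-up support values: managers should be dispatched in descending hardware-support order, so a swapped key or reverse flag silently inverts the priority, as JayF noted above.)

```python
support_by_manager = {'GenericHardwareManager': 1, 'ExampleVendorManager': 3}
ordered = sorted(support_by_manager.items(),
                 key=lambda item: item[1], reverse=True)
# -> [('ExampleVendorManager', 3), ('GenericHardwareManager', 1)]
```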
TheJulia | I am wrapping up my day, but at a glance, without really digging deep into it, I think it makes basic sense. I guess all that really remains at the moment is to see how much explodes, or not | 22:35 |
JayF | I suspect this is going to have a measurable impact on CI runtimes | 22:35 |
JayF | given how slow our VMs are | 22:35 |
JayF | if you think it looks vaguely shaped correctly, I'll get unit tests passing if I can in my last 25 minutes | 22:35 |
TheJulia | yeah, actual execution logs are sort of what I'm interested in seeing | 22:36 |
TheJulia | which will likely be tomorrow() | 22:36 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:37 |
JayF | yeah changes in dedupe steps are wrong, I'll try to have something working well enough for CI before I leave | 22:40 |
TheJulia | okay, have a good evening when you finally call it a day | 22:41 |
JayF | o/ | 22:43 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 22:48 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Don't call evaluate_hardware_support 9001 times https://review.opendev.org/c/openstack/ironic-python-agent/+/920153 | 23:08 |
JayF | unit tests pass locally on this now, although it wouldn't be unreasonable for a reviewer to ask for more | 23:08 |
JayF | nn o/ | 23:08 |
NobodyCam | Good Afternoon, OpenStack folks. As usual, another off-the-wall question: is there an ironic conductor decommission guide? I'm looking to remove several ironic conductors from one of our regions | 23:52 |
TheJulia | NobodyCam: guide, no. But question before I answer, are there other conductors in the conductor group? | 23:54 |
NobodyCam | I have both, a few that are not in a group and about 10 conductor groups with 2 conductors each | 23:55 |
TheJulia | Okay, so if there is another conductor alive, you can just stop the process, newer versions of ironic ensure work is drained out, fwiw | 23:55 |
NobodyCam | I am manually migrating all the nodes off them | 23:56 |
TheJulia | When you have the last conductor in a conductor group, you must ensure *all* nodes have been moved out of the conductor group | 23:56 |
NobodyCam | +++ | 23:56 |
TheJulia | yeah, hash ring updates, because it will change the query as soon as it is noticed by the other conductors | 23:56 |
TheJulia | migration out should just be changing the conductor_group value | 23:58 |
NobodyCam | once all nodes are removed, then it's just power off; how about the nova compute? | 23:59 |
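(A hedged sketch of the drain step TheJulia describes above, using openstacksdk; the cloud name and conductor group names are placeholders. It only covers re-homing nodes out of the group being retired; the conductor process itself is stopped afterwards, once the hash ring has settled.)

```python
import openstack

conn = openstack.connect(cloud='my-cloud')  # assumes a configured clouds.yaml

# Move every node still assigned to the retiring group onto a surviving one
# by updating conductor_group, as described above.
for node in conn.baremetal.nodes(details=True):
    if node.conductor_group == 'rack-to-retire':
        conn.baremetal.update_node(node, conductor_group='rack-to-keep')
```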