TheJulia | rm_work: ... I guess it would sort of depend on what the interface to convey it across is | 04:35 |
TheJulia | to the UEFI firmware | 04:35 |
TheJulia | really, I'd expect a runtime loader as an intermediary like ipxe or grub, but yeah | 04:36 |
TheJulia | JayF: pass in a config drive, yeah | 04:36 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951631 | 08:30 |
opendevreview | cid proposed openstack/python-ironicclient master: Cast string boolean from CLI https://review.opendev.org/c/openstack/python-ironicclient/+/951600 | 09:33 |
*** masghar is now known as Mahnoor | 09:39 | |
*** Mahnoor is now known as masghar | 09:40 | |
opendevreview | Queensly Kyerewaa Acheampongmaa proposed openstack/sushy-tools master: Add PATCH support for Redfish DateTime fields in Manager resource https://review.opendev.org/c/openstack/sushy-tools/+/950925 | 11:10 |
dtantsur | TheJulia: yeah, there is nothing so critical on the conductor itself, it's more about RPC and API | 11:50 |
TheJulia | oh yeah, definitely | 12:08 |
TheJulia | Also, good morning! | 12:09 |
dtantsur | morning! | 12:09 |
TheJulia | it feels like api and to a similar extent rpc are also sort of blocked though. :\ | 12:13 |
TheJulia | Well, https://review.opendev.org/c/openstack/ironic/+/951054 :) | 12:19 |
TheJulia | stevebaker[m]: you mentioned recently that you noticed an issue shutting down the conductor. Did you have the jsonrpc service running? | 12:22 |
jrosser | the nova driver for ironic doesnt implement `get_host_uptime` which causes quite some `NotImplementedError` exceptions in my ironic nova-compute instance - is that known about? | 12:32 |
TheJulia | last call on reviews of https://review.opendev.org/c/openstack/networking-generic-switch/+/951026 | 13:10 |
TheJulia | jrosser: known, yeah. I think we tried to get nova to special-case ironic's driver and just ignore the error, but I think they chose to keep the exception surfacing in the log. | 13:11 |
TheJulia | There is not a parallel line to draw to real hardware which has been deployed. | 13:12 |
jrosser | TheJulia: thanks for the info, that's a shame to leave the exceptions in the log | 13:21 |
TheJulia | Yeah, what they want is like a process aliveness time for the compute service if memory serves | 13:22 |
TheJulia | and... yeah, that is not how the virt driver is designed. | 13:22 |
TheJulia | I guess you're running at a debug level? | 13:22 |
jrosser | nope, debug = False | 13:23 |
TheJulia | what version? | 13:24 |
jrosser | caracal | 13:24 |
TheJulia | Interesting, because latest CI job log I had handy doesn't show nova-compute logging that https://1659f7110238be299cb2-d967962f17ef2a051378ed1d5b3db360.ssl.cf2.rackcdn.com/openstack/cf098c89d7f6456d8990d6f90360b127/controller/logs/screen-n-cpu.txt | 13:25 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: ci: dial back check intervals https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/942220 | 13:35 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM/Ignore: Science! trim out cirros partition... https://review.opendev.org/c/openstack/ironic/+/931222 | 13:39 |
opendevreview | Merged openstack/ironic-ui master: Removing regular from localization https://review.opendev.org/c/openstack/ironic-ui/+/951215 | 14:19 |
alegacy_ | question regarding how the various networks (e.g., rescuing, cleaning, provisioning, etc) are managed in a standalone case. It looks like those _could_ be each on a different VLAN, and in a full openstack deployment it looks like Neutron would be capable of (re-)programming the switch to set those up properly before booting the node in the right mode. In that case I assume the IPA would need to be built | 14:28 |
alegacy_ | knowing which VLAN to use in each scenario based on the info stored in the Neutron network? ...but in a standalone case, where Neutron isn't in the mix, how would Ironic know which VLAN to use for each...I don't see that defined in any config file? | 14:28 |
JayF | https://review.opendev.org/c/openstack/ironic/+/946741 could use a review if someone else has a sec (API-call action addition to inspector) | 14:28 |
JayF | alegacy_: IPA either uses DHCP or (rare) static IP configdrive | 14:29 |
JayF | alegacy_: so the IPA ramdisk is just configured to dhcp on all interfaces and/or read and apply that configdrive | 14:29 |
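(For reference: a minimal sketch of the static network config a config drive can carry for the ramdisk, in the standard OpenStack network_data.json format that tools like glean or cloud-init apply; the interface id, MAC, and addresses below are illustrative assumptions, not values from this discussion.)

```json
{
  "links": [
    {"id": "eth0", "type": "phy", "ethernet_mac_address": "52:54:00:12:34:56"}
  ],
  "networks": [
    {
      "id": "provisioning",
      "type": "ipv4",
      "link": "eth0",
      "ip_address": "192.0.2.10",
      "netmask": "255.255.255.0",
      "routes": [
        {"network": "0.0.0.0", "netmask": "0.0.0.0", "gateway": "192.0.2.1"}
      ]
    }
  ],
  "services": [
    {"type": "dns", "address": "192.0.2.53"}
  ]
}
```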
alegacy_ | but if those networks (e.g., cleaning) are meant to be on a VLAN, the IPA would have to create that VLAN prior to sending out a DHCP request, no? | 14:30 |
TheJulia | brrrraaains | 14:30 |
TheJulia | JayF: I'm seeing more configdrive hinting via IPA recently, specifically in the context of metal3, where it is being hinted because folks just don't have/want/need dhcp in those disjointed standalone cases | 14:31 |
TheJulia | alegacy_: generally the advice is to isolate to a specific access port and not try to have vlan tagging native on the interface, because when you offer an entire trunk without restrictions to a host, it is a security nightmare | 14:32 |
TheJulia | access port in access mode, that is | 14:32 |
alegacy_ | so then all of those networks (rescuing, cleaning, inspecting, etc...) are always on the same NIC and always on the same VLAN (because it is an access port)? | 14:33 |
TheJulia | In an integrated context, they are service networks which can all be the same or be distinctly different networks | 14:34 |
TheJulia | in a standalone context, it is expected to be static as long as the traffic can somehow reach the endpoints | 14:34 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Adds a Sphinx directive to parse each file in the netmiko devices folder and return documentation containing the switch, command modules capable of being executed by the switch, and the CLI commands sent to the switch when the command module is selected. https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 14:34 |
TheJulia | In ironic, the overall behavior is governed by the network_interface selection on the node object | 14:35 |
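(As a hedged illustration of that knob: `noop` and `flat` are the network interfaces typically used standalone, while `neutron` is typical in an integrated deployment; the chosen interface also needs to be listed in `[DEFAULT]enabled_network_interfaces` in ironic.conf. The node placeholder below is made up.)

```shell
# standalone: Ironic does no switch programming, networking is whatever you wired up
baremetal node set <node-uuid> --network-interface noop

# integrated: let Neutron (re)program the switch for provisioning/cleaning/etc.
baremetal node set <node-uuid> --network-interface neutron
```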
alegacy_ | ya, that's why I'm asking because in the network interface base class I'm seeing a lot of specific handling for those networks and I'm a bit confused as to how that information would be available in a standalone case | 14:36 |
cardoe | Well I'm back now. All I remember is that Ironic does stuff with computers. How am I doing so far? | 14:42 |
TheJulia | alegacy_: I guess the challenge is *what* along with when, and maybe walking through in discussion might help? dunno. A good starting point for a non-integrated case is just to keep in mind there are no networks, no neutron, you only have what you know and supply, and some of what is known can be config as well | 14:44 |
TheJulia | cardoe: clearly we need a bot which randomly delivers tasty beverages. | 14:45 |
JayF | cardoe: Ironic /tries/ to do stuff with computers | 14:46 |
TheJulia | And potentially make coffee | 14:52 |
alegacy_ | Thanks TheJulia... I'll chew on that for a bit. A short call/discussion might be good to clarify some basic assumptions before I head off into the weeds with some half-baked assumptions that are wrong. | 14:57 |
TheJulia | last call on https://review.opendev.org/c/openstack/networking-baremetal/+/947985 | 15:03 |
opendevreview | Verification of a change to openstack/ironic-python-agent unmaintained/wallaby failed: Update .gitreview for unmaintained/wallaby, fix CI https://review.opendev.org/c/openstack/ironic-python-agent/+/912949 | 15:05 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 15:09 |
cardoe | TheJulia: +1 from me | 15:24 |
cardoe | I wanted to ask what I should do about https://review.opendev.org/c/openstack/ironic/+/951631 ? Since the original change was merged. Should I make a change to the note in master to be bugfix? | 15:24 |
cardoe | Or should I fix up the backport? | 15:26 |
cardoe | Cause it really IMHO is a bugfix. Since it's making the inspect code behave how it's documented. | 15:26 |
opendevreview | Merged openstack/ironic-tempest-plugin master: Adding better error messages to microversion tests https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/945945 | 15:37 |
TheJulia | I'd make a fix to the release note on master, and go ahead and incorporate that into the backport | 15:42 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Fix minor devstack issues https://review.opendev.org/c/openstack/ironic/+/951672 | 15:54 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951674 | 15:55 |
TheJulia | last call for https://review.opendev.org/c/openstack/networking-generic-switch/+/951026 | 15:57 |
JayF | land those so we can pump the "eventlet-free ironic projects" number up :D | 15:57 |
TheJulia | I already workflowed the networking-baremetal change | 15:58 |
TheJulia | ;) | 15:58 |
* TheJulia takes a break since she is at the 4 hour mark for the day | 16:03 | |
cardoe | JayF: wrt pre-commit and stuff... I did networking-baremetal, right? I had only forgotten networking-generic-switch, right? | 16:12 |
JayF | I would have to check to be sure | 16:12 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951674 | 16:14 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic stable/2025.1: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951674 | 16:32 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 16:38 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Ensure parameter transform happens early https://review.opendev.org/c/openstack/ironic/+/948795 | 16:39 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 16:40 |
cardoe | TheJulia: like https://review.opendev.org/c/openstack/ironic/+/951680 ? | 16:42 |
cardoe | But once that's good I squish it into his backport. | 16:42 |
TheJulia | yes | 16:49 |
TheJulia | ++ | 16:49 |
TheJulia | so, we could revise text like you've noted | 16:50 |
TheJulia | it would make it even more clear | 16:50 |
TheJulia | so I can workflow it, or not | 16:51 |
opendevreview | Julia Kreger proposed openstack/ironic master: ci: combine networking multinode tests with shard tests https://review.opendev.org/c/openstack/ironic/+/951593 | 16:53 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: trivial: fix execution error on timeout https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/951681 | 16:59 |
opendevreview | Julia Kreger proposed openstack/ironic master: Remove the partition image upload https://review.opendev.org/c/openstack/ironic/+/931222 | 17:03 |
opendevreview | Julia Kreger proposed openstack/ironic master: ci: remove the partition image upload https://review.opendev.org/c/openstack/ironic/+/931222 | 17:03 |
cardoe | TheJulia: Haseeb's going to reword it. | 17:05 |
TheJulia | ok | 17:05 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 17:08 |
TheJulia | JayF: I've clarified my comment on https://etherpad.opendev.org/p/ironic-eventlet-removal. I think step 0 is sort of do we keep trying to use sslutils configuration as is and copy it, or do we break into individual section configs as we move parts. See: https://github.com/openstack/oslo.service/blob/master/oslo_service/sslutils.py#L29 | 17:10 |
JayF | TheJulia: okay, that's what I thought you meant. Was being explicit because the idea had been tossed around about migrating off oslo.service entirely | 17:10 |
JayF | which at this point would seem like re-solving problems other people solved for us | 17:10 |
cardoe | Should we backport https://review.opendev.org/c/openstack/ironic/+/948301 ? | 17:11 |
cardoe | I'd love to see some of that land into oslo.service | 17:11 |
TheJulia | cardoe: likely | 17:11 |
TheJulia | cardoe: what specifically do you want to see land there? | 17:12 |
cardoe | Trying to discuss "developer experience" type things at the TC. Like don't make it hard for people. | 17:12 |
cardoe | TheJulia: not having people re-invent fixes themselves. | 17:12 |
JayF | cardoe: you want a hot take? | 17:12 |
cardoe | yes. | 17:12 |
cardoe | I don't have a good answer. | 17:12 |
JayF | cardoe: the worst DX in the whole project is the stuff happening in the DCO governance change | 17:12 |
TheJulia | I think there is a trap we've fallen into, we have a single ssl configuration section which might actually be wrong | 17:12 |
cardoe | Cause oslo.* is something that people push stuff to but then nobody maintains | 17:12 |
JayF | cardoe: trying to do something with reasonable scope, and then trying to boil the entire ocean with it | 17:12 |
cardoe | Yeah I feel like this DCO stuff should be broken up. | 17:13 |
JayF | oslo having low contribution, especially around bugs and responsiveness, is an artifact of the "we contribute to OPENSTACK" attitudes shifting around to be "we contribute to $subPROJECT" or "we contribute $FEATURE" | 17:13 |
TheJulia | i think I spotted another option regarding oslo.service which is deprecated and which we use, or maybe it was vice versa; it's getting a little blurry the more I look at the eventlet stuffs | 17:14 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2025.1: Allow to unprovision instance from service wait states https://review.opendev.org/c/openstack/ironic/+/951682 | 17:14 |
TheJulia | I think there is an evolution aspect where oslo made sense, but in some areas might not make sense for everything and everyone. There is a fine line to walk there and it is sort of situational | 17:15 |
TheJulia | but that worsens it when orgs shift more to feature focused work | 17:15 |
TheJulia | because there is inherently less motivation and technical reason to have that centralization for smaller bits of common-ish looking code | 17:16 |
opendevreview | Doug Goldstein proposed openstack/ironic master: allow running inspection hooks on redfish interface https://review.opendev.org/c/openstack/ironic/+/933066 | 17:16 |
TheJulia | (again, fine line, there are totally places where one *should* focus on that instead of the specific case, again, situational) | 17:16 |
cardoe | okay so bad example then. :-D | 17:17 |
JayF | It's interesting that eventlet is kinda demonstrating some of the value oslo still provides | 17:17 |
JayF | although I agree in general that openstack would be well-served to be ... less of our own ecosystem and more a part of the larger python/cloud ecosystems | 17:18 |
JayF | and oslo is an isolating factor there (as is eventlet, really) | 17:18 |
TheJulia | yup | 17:18 |
cardoe | okay https://review.opendev.org/c/openstack/ironic/+/951680 got touched after your +W TheJulia | 17:19 |
cardoe | I think that wording is better | 17:20 |
TheJulia | yeah, formatting fixups will likely occur later, but it's all good | 17:21 |
TheJulia | It gets the point across | 17:21 |
cardoe | Haseeb is one of the devs on my team that I had previously mentioned would be working on release goals that we put forward. | 17:24 |
TheJulia | cool cool | 17:26 |
TheJulia | so going back to the sslutils discussion, https://review.opendev.org/c/openstack/ironic/+/951054 is top of mind | 17:31 |
JayF | yeah, I owe cid some time looking into those failures with him, too | 17:31 |
cid | ++, that would be very helpful | 17:39 |
* JayF just realized he owes cid a devstack VM he never provisioned yesterday 🤦♂️ | 17:44 | |
cardoe | I think https://review.opendev.org/c/openstack/ironic/+/946741 is ready for a +W | 17:46 |
JayF | I've been waiting to +A that one because I was hoping Julia or Dmitry might take a look | 17:49 |
JayF | just not extremely confident in my ability to review new API endpoints in inspection | 17:49 |
opendevreview | Merged openstack/networking-baremetal master: Remove explicit use of eventlet https://review.opendev.org/c/openstack/networking-baremetal/+/947985 | 17:50 |
* cardoe golf claps. | 17:51 | |
TheJulia | I sort of wish that had a test, but... it really is so simple... | 17:53 |
TheJulia | cardoe: is the use pattern "hey, just call this url" but with no actual specific info regarding what? | 17:55 |
cardoe | yup | 17:55 |
cardoe | Not our change but I can see the usefulness of it. | 17:55 |
cardoe | The only guardrail I'd want added is to make sure that timeout is set and not 0 | 17:56 |
cardoe | But that would really require a stupid operator config to set it to 0 right now. | 17:56 |
opendevreview | Syed Haseeb Ahmed proposed openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 17:57 |
TheJulia | shouldn't it have a check that timeout is not 0 then ? | 18:00 |
TheJulia | or a semi-malicious operator who knows how to hang socket handling | 18:00 |
JayF | that's a good call | 18:05 |
JayF | part of why I waited was it felt too simple | 18:05 |
JayF | but I couldn't put my finger on what it was missing | 18:05 |
rm_work | Oh I gained some new insight into our random server shutdown / power sync problem. I guess the issue is not actually that our connectivity to the BMCs via ipmi is flaky anymore, it's that the power checks in ironic queue up and eventually stuff queues so long that it times out… lol | 18:05 |
rm_work | Likely we need cells but we do not have cells yet <_< | 18:05 |
TheJulia | BTW, my change to excise tinyipa passes CI https://review.opendev.org/c/openstack/ironic/+/950206 I'd <3 a review | 18:05 |
JayF | rm_work: add conductors and/or conductor groups, and tweak the power status loop timing + number of threads | 18:06 |
TheJulia | rm_work: how many baremetal nodes do you have per conductor? | 18:06 |
JayF | rm_work: there are lots of ways to tune it | 18:06 |
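(For illustration, the conductor-group pinning mentioned above looks roughly like this; `rack-a` is a made-up group name.)

```ini
# ironic.conf on the conductor(s) that should own a subset of nodes
[conductor]
conductor_group = rack-a
```

```shell
# pin nodes to that conductor group
baremetal node set <node-uuid> --conductor-group rack-a
```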
rm_work | Yeah I’ll look at our existing config, I think we already tried | 18:07 |
JayF | what is your node:cond ratio like Julia asked | 18:07 |
JayF | the other possibility is one or two bad BMCs used to be able to cause this behavior, because ipmitool would go out to lunch; but I think we timeout those now | 18:07 |
rm_work | We have three conductors and … approximately 300 nodes I think | 18:09 |
rm_work | I’m trying to look up our config quick | 18:09 |
rm_work | Ah just over 400 actually, so like 136 nodes per conductor | 18:10 |
TheJulia | and all ipmi ? and is it the execution of ipmitool which is timing out OR the conductor task thread in the queue? | 18:11 |
JayF | that is more than enough conductors:node | 18:11 |
JayF | I usually suggest a ratio of more like 500:1 without taking HA into account | 18:11 |
rm_work | Hmm let me look at logs and try to figure that out | 18:11 |
rm_work | I was really just intending to drive by and give an update, but if you’re gonna throw possible solutions at me I’ll take some time to follow up on this stuff quick 😁 | 18:12 |
TheJulia | so we have some operators who are running upwards of like 600+ nodes a conductor, but not all ipmi | 18:12 |
TheJulia | it really depends on the state, settings and exact behavior | 18:12 |
TheJulia | (and if the bmcs are well behaved as well, which is never a guarantee with ipmi) | 18:13 |
rm_work | 100% ipmi | 18:13 |
rm_work | It is possible we still have some misbehaving, but I was corrected: it's not a common occurrence anymore | 18:14 |
JayF | I would only believe "these IPMI BMCs are not acting up commonly" when backed up with logs :D | 18:14 |
rm_work | I believe we attempted to increase parallelism and timeouts, but do you know offhand which config vars are most relevant | 18:14 |
rm_work | Or what I can search logs for to prove definitively one way or the other | 18:15 |
TheJulia | I have seen ipmitool commands just timeout at times, but generally that is outside of ipmitool itself running. That gets returned as an error, and if you make the parallelism and queue too deep, I seem to think you end up sort of focusing on those nodes, because we want to focus on recovery/management of nodes which are being problematic as opposed to ones in a happy state. | 18:20 |
TheJulia | at least, that is what my memory recall and thoughts are leaning towards | 18:20 |
rm_work | Ok, well I have debug on here now so should see that if it happens | 18:21 |
opendevreview | Merged openstack/networking-generic-switch master: Remove explicit use of eventlet https://review.opendev.org/c/openstack/networking-generic-switch/+/951026 | 18:21 |
rm_work[m] | excuse my 6-line paste | 18:21 |
rm_work[m] | periodic_max_workers = 16... (full message at <https://matrix.org/oftc/media/v1/media/download/ASpmI2zdTVrOaO4Yy4dDGmFkPqRY1YMNh8XaY0VOBo9dsH4L-Y278upYx10zhsmypIA8G2CFr0z5BEs9f1gjjlNCeXfyBY4gAG1hdHJpeC5vcmcvV2pWRVdKS0hveVRsdHNjaFJOYmJHd0Zs>) | 18:21 |
rm_work[m] | I think that is all the relevant config | 18:22 |
JayF | rm_work[m]: sync_power_state_interval should be higher | 18:23 |
JayF | rm_work[m]: and the retries being 30 is your problem | 18:23 |
JayF | 30 retries is absolutely bananas | 18:24 |
rm_work[m] | ah I thought we already tuned that up... maybe they just tuned up the retries and not the interval? which would only exacerbate the issue I suppose lol | 18:24 |
JayF | if there's a 60 second timeout (unsure) a single failing machine will overrun your interval | 18:24 |
rm_work[m] | what is a safe power_state_interval? | 18:24 |
JayF | that's 100% a situational question | 18:24 |
rm_work[m] | I mean should we 10x it? | 18:24 |
cardoe | TheJulia: ugh sorry... https://review.opendev.org/c/openstack/ironic/+/951680 once more :/ | 18:25 |
rm_work[m] | hmm | 18:25 |
JayF | I've seen it completely disabled in some worlds | 18:25 |
JayF | basically; how much lag time is OK if the node controls its own power | 18:25 |
JayF | e.g. shutdown from CLI or error or something | 18:25 |
JayF | before ironic sees it's off | 18:25 |
rm_work[m] | so like retries=5 interval=1200? | 18:25 |
JayF | retries=3 is the default | 18:25 |
JayF | I would tune that down down down down | 18:25 |
JayF | you don't wanna retry, you want the bad ipmi instances to identify themselves, go to maintenance, and get outta your loop | 18:26 |
TheJulia | oh yeah, 30 retries is super bananas since ipmitool will sit and run forever | 18:26 |
rm_work[m] | lol that sounds... fair | 18:26 |
JayF | in onmetal | 18:26 |
JayF | we had something like, 3 conductors and 800 nodes in a region | 18:26 |
JayF | 3 failing nodes was enough to make the power status loop croak | 18:26 |
rm_work[m] | ok so i'll see if i can tweak it down to default retries and put the interval up to 30min | 18:26 |
JayF | I think we've improved the story a lot since then, but IPMI BMCs *really* are talented at going out to lunch | 18:27 |
rm_work[m] | parallelism seem ok? | 18:27 |
rm_work[m] | 32 workers? | 18:27 |
JayF | you won't see much pain from that being too high unless you increase your node:conductor ratio | 18:27 |
TheJulia | honestly, I've seen folks lean toward retries=2 or even 1 since the loop pattern will focus on those problem nodes | 18:29 |
JayF | +++ | 18:29 |
rm_work[m] | oh interesting, k | 18:29 |
JayF | yeah basically the loop is your canary | 18:29 |
JayF | if you can't get power status | 18:29 |
rm_work[m] | I will attempt to fix this config and report back in two weeks after they actually let me make a config change >_< | 18:29 |
JayF | then if a customer tries to provision that node, it'll also not work | 18:30 |
TheJulia | yeah, this is one of those perfection is the enemy of good and unhappy nodes can make your power sync loop unhappy quickly | 18:30 |
rm_work[m] | I ... miss yahoo, rofl | 18:30 |
rm_work[m] | at least there they let us deploy literally constantly | 18:30 |
JayF | perfection is only possible by letting the loop fail and pulling the bad machines outta the rotation | 18:30 |
JayF | rm_work[m]: ... | 18:30 |
TheJulia | rm_work[m]: feel free to link to eavesdrop and this conversation ;) "I talked with the maintainers/experts" ;) | 18:30 |
rm_work[m] | heh yeah | 18:30 |
JayF | rm_work[m]: I am trying to think of a polite way to respond to that comment | 18:30 |
rm_work[m] | lol | 18:30 |
JayF | rm_work[m]: maybe they shoulda tried not needing to deploy all the time ;) | 18:30 |
rm_work[m] | I mean the sentiment behind that was "unexpectedly" | 18:31 |
rm_work[m] | I do not normally expect to miss it :D | 18:31 |
TheJulia | I could totally see focusing on the old pattern to try and get perfection on the sync | 18:31 |
TheJulia | but we did make some intentional changes to the loop... 6-7 years ago | 18:32 |
TheJulia | to focus the pattern | 18:32 |
rm_work[m] | we definitely have flaky nodes, it is just less than I thought, but if 1-2 could bring shit down then... definitely possible | 18:32 |
JayF | with 30 replies | 18:32 |
JayF | one could do it | 18:32 |
JayF | **retries | 18:32 |
TheJulia | yeah, that subprocess exec will just spin and spin | 18:32 |
TheJulia | because you end up with 30 retries with ?10? second timeouts, it quickly compounds out of control | 18:33 |
cardoe | I gave that multinode tinyipa thing a +2... took me a couple to understand the switch_info change but it's good. | 18:33 |
TheJulia | a single node, 300 seconds, etc | 18:33 |
cardoe | I've also +2'd the backports for the agent get_XXX_steps stuff. the backport to 27.0 failed but it seems like python crashed so I retried that one. | 18:34 |
rm_work[m] | thanks a ton, will attempt to get these changes in | 18:34 |
TheJulia | oh, yeah, interface name length restrictions too ;) | 18:34 |
TheJulia | cardoe: thanks! | 18:34 |
* TheJulia spends some time reviewing n-g-s changes | 18:35 | |
cardoe | That release note change failed CI and Haseeb made another push and now it passed... sorry for the wasted +W efforts but it's good now | 18:36 |
cardoe | https://review.opendev.org/c/openstack/ironic/+/951680 | 18:40 |
rm_work[m] | just one follow-up question -- how would a single node take up so many workers if it fails? like can multiple workers be retrying the same node at the same time? | 18:46 |
opendevreview | Doug Goldstein proposed openstack/ironic master: allow running inspection hooks on redfish interface https://review.opendev.org/c/openstack/ironic/+/933066 | 18:47 |
rm_work[m] | logically with 30 retries and 300 interval, if the timeout is 10s, then 30 retries would put you right at that threshold (I am guessing someone did that exact math to pick those) | 18:47 |
rm_work[m] | but why would it block the other ... 31/32 workers | 18:47 |
rm_work[m] | interesting, the default sync interval is 60 | 18:50 |
cardoe | I created https://review.opendev.org/c/openstack/ironic/+/951682 as a backport (it'll pass CI since the only thing left is a non-voting job). I think that's good because it'll make the behavior consistent in 2025.1. The fix to make unprovision work landed in 2025.1 for all states except for the one that this corrects. I wouldn't backport past that since it would involve other backports to older versions. I gave it a +2 | 18:51 |
cardoe | since I didn't write the change. | 18:51 |
TheJulia | rm_work[m]: yeah, so the way it works is the nodes are distributed in lists across the threads to invoke ipmitool | 18:59 |
TheJulia | so *one* node spending time being timed out can cause the rest of the nodes in the queue to never be reached, which starts to create a cascading problem of sorts | 19:00 |
opendevreview | Merged openstack/ironic master: re-framing this as an explicit bugfix to backport https://review.opendev.org/c/openstack/ironic/+/951680 | 19:00 |
opendevreview | Mithun Krishnan Umesan proposed openstack/networking-generic-switch master: Improve Netmiko Device Commands Documentation https://review.opendev.org/c/openstack/networking-generic-switch/+/951659 | 19:02 |
rm_work[m] | TheJulia: wait so a single node fails once... then 32 workers spin up simultaneously to try to sync it?? | 19:04 |
TheJulia | no | 19:04 |
TheJulia | so... | 19:04 |
TheJulia | think of it this way | 19:04 |
rm_work[m] | I just don't understand how less than 32 nodes being bad could tie up 32 workers | 19:04 |
TheJulia | you end up with a list of like 25-30 nodes to check the power state | 19:04 |
TheJulia | across say 8 threads | 19:05 |
rm_work[m] | hmm | 19:05 |
TheJulia | so, 20-30 seconds pass, it is half way through the list | 19:05 |
TheJulia | it hits a node that just stalls out the entire check | 19:05 |
TheJulia | so everything behind it, and that node itself, gets prioritized for being checked on the next time the entire power sync loop triggers to allocate sync work | 19:05 |
rm_work[m] | well we have 32 workers, I feel like for it to be a problem we'd need a LOT of failing BMCs not just 1-2 | 19:05 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2025.1: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951631 | 19:06 |
JayF | There is some weirdness about how those variables interact, isn't there? I can't remember exactly, and quite frankly I'm already deep enough in my to-do list that I shouldn't look it up, but I think there's something about that number maybe only being the number of overall conductor threads that are dedicated to it | 19:06 |
TheJulia | yeah, but then that failing node is first the next time, and depending on the split you end up with 1-2 threads which are entirely stuck power-sync-wise because their very first node is the oldest unchecked node | 19:06 |
JayF | So like if the overall conductor thread number is lower you won't have 25 threads to dedicate to power status loops | 19:06 |
TheJulia | because we don't have an up to date power state | 19:07 |
rm_work[m] | also, apparently people internally think we should NEVER shut down a node because of power checks, so ... maybe we should just disable the thing and set interval to 0 lol | 19:07 |
cardoe | Is https://review.opendev.org/c/openstack/ironic/+/951631 correct for backporting those two together? | 19:07 |
TheJulia | those nodes never really get checked, they get focused in on again, but now every sync you have some number of threads which will timeout their overall check operation | 19:07 |
TheJulia | rm_work[m]: so, the issue is sort of a management issue | 19:08 |
TheJulia | rm_work[m]: what is happening then is the node state goes to none | 19:08 |
TheJulia | nova eventually picks up on that and flags it out and then ends up telling ironic to power it down | 19:08 |
JayF | https://docs.openstack.org/ironic/latest/configuration/sample-config.html check [conductor]/workers_pool_size | 19:09 |
JayF | the 25 threads for power status loop come from that pool | 19:10 |
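(A toy sketch of the behaviour being described here, not Ironic's actual code: each worker walks its own slice of nodes sequentially, so one BMC that burns its full retry budget delays every node behind it in that slice; the sleeps are scaled down so the demo runs instantly.)

```python
import time
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node-{i}" for i in range(24)]
STALLED = {"node-3"}          # one misbehaving BMC
WORKERS = 8

def check_power(node):
    # a stalled BMC eats retries * per-attempt timeout before giving up
    time.sleep(0.5 if node in STALLED else 0.01)
    return node, node not in STALLED

def walk_chunk(chunk):
    # nodes in a chunk are checked one after another, so a stall blocks the rest
    return [check_power(node) for node in chunk]

# split the node list across a small pool, the way a periodic sync pass might
chunks = [NODES[i::WORKERS] for i in range(WORKERS)]
start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(walk_chunk, chunks))
print(f"sync pass took {time.monotonic() - start:.2f}s; "
      "the chunk containing node-3 finished last")
```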
rm_work[m] | yeah I think we would also set nova sync_power_state_interval = -1 | 19:10 |
rm_work[m] | yeah our workers_pool_size is 500 lol | 19:10 |
JayF | so if the number is too low you might see weird behavior if you're trying to spawn more threads than you have | 19:10 |
JayF | okay | 19:10 |
rm_work[m] | IDK if that is stupidly high tho | 19:10 |
TheJulia | we're sort of talking about changing the default with eventlet | 19:10 |
TheJulia | but.... that is not a now() change | 19:11 |
JayF | I would suggest generally | 19:11 |
JayF | it looks like the knobs in your deployment have kinda been indiscriminately turned up higher | 19:12 |
JayF | consider restoring defaults for the worker pool size and power state workers and the like | 19:12 |
rm_work[m] | hmmm k | 19:12 |
JayF | mainly because 1) you're well below any known scaling thresholds and 2) it's much harder to reason about performance when your config is weirdly tuned | 19:12 |
TheJulia | Yeah, tons of retries have bitten folks in the past. | 19:12 |
rm_work[m] | ah default is 300 for the workers_pool_size so not THAT far off | 19:13 |
JayF | I mean, it's 166% of the default value | 19:13 |
JayF | that's not trivial ): | 19:13 |
rm_work[m] | but yes I am returning retries to 3, but turning up the interval to ... maybe just 600 | 19:13 |
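(A sketch of the [conductor] tuning being settled on; the option names are real ironic.conf options, but the values are just the ones floated in this conversation rather than universal recommendations. Nova's own loop is disabled separately via `sync_power_state_interval = -1` in nova.conf, as noted above.)

```ini
[conductor]
# how often the power sync loop runs; default is 60 seconds
sync_power_state_interval = 600
# per-node retry budget; default is 3, high values let one bad BMC eat the interval
power_state_sync_max_retries = 3
# threads used for the power sync loop, drawn from the general pool; default is 8
sync_power_state_workers = 8
# overall conductor worker pool; default is 300
workers_pool_size = 300
```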
JayF | also force_power_state_during_retries may be impactful | 19:13 |
rm_work[m] | I just don't know how much it matters for the syncs to be that fast | 19:14 |
JayF | because if you can get status but can't power on... I think that'll put the machine in maintenance, nevermind | 19:14 |
JayF | rm_work[m]: ironic really doesn't care much, ironic knows the power state unless people are powering on/off servers willy-nilly | 19:14 |
rm_work[m] | yeah I don't know that anyone is really turning off their servers... | 19:14 |
JayF | rm_work[m]: the biggest impact IMO is that if a user shut down the instance inside the OS, and nova doesn't know b/c ironic hasn't sync'd that power state to its DB (and then to nova) | 19:14 |
JayF | IDK if the person would be able to `start` the instance to power it back up | 19:15 |
JayF | because nova would think it's on already | 19:15 |
rm_work[m] | hmm | 19:15 |
JayF | that is a mild impact at most | 19:15 |
JayF | and if you disable nova power sync, you're already paying that price | 19:15 |
rm_work[m] | they could just tell it to reboot? heh | 19:15 |
rm_work[m] | I mean I assume if nova thinks it is on, but it is off, they could tell nova to ... turn it off | 19:15 |
JayF | or they could use the nova reboot command :) | 19:15 |
rm_work[m] | and that would no-op, yes | 19:16 |
JayF | rm_work[m]: that's outside of the box thinking, nice job | 19:16 |
JayF | lol | 19:16 |
rm_work[m] | T_T | 19:16 |
TheJulia | stevebaker[m]: I reviewed some of your n-g-s changes, I didn't leave a vote on the change adding security group support into the actual netmiko model because it has me thinking | 19:21 |
TheJulia | comments on the change, mostly more perception related | 19:21 |
cardoe | alright gonna wander away for a bit. I threw out a pile of +2s on backports stuff. Pile of stuff in some state of ready for maintainers in https://review.opendev.org/q/hashtag:%22ironic-week-prio%22+AND+status:open | 19:27 |
stevebaker[m] | TheJulia: ok thanks I'll take a look | 19:28 |
cardoe | I'll note that some of those changes in that queue have 4 different people +2ing them so they're probably pretty good. | 19:33 |
JayF | It's kinda a little crazy how different our cleaning flow is with disable_ramdisk=True | 20:09 |
JayF | I had tested this entire automated cleaning by runbook change with it set true in the runbook, and basically there's an entire separate place to update for not-disabled ramdisk :| | 20:10 |
JayF | I think I have it figured out now, but it's not awesome | 20:10 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Automated cleaning by runbook https://review.opendev.org/c/openstack/ironic/+/945259 | 20:26 |
cardoe | Well I hope it passes. | 20:28 |
JayF | CI won't catch this | 20:31 |
JayF | I have to test in my devstack again | 20:32 |
cardoe | Is https://review.opendev.org/c/openstack/ironic/+/951631 the right way to squish two commits together which should be one for a backport? | 20:35 |
cardoe | The second being a fix-up of the release note | 20:35 |
cardoe | I referenced both commits in the cherry-pick comments | 20:36 |
JayF | +2 lgtm | 20:40 |
cardoe | your automated cleaning failed pep8 | 20:40 |
JayF | I'm not worried about that if it passes devstack :) | 20:40 |
JayF | I use gerrit as my remote code store, not just pushing when it's ready | 20:40 |
cardoe | ah just wanted to let ya know. I'm watching Zuul intently. | 20:41 |
JayF | TheJulia: IDK if today was just an announcement, but congrats on OIF->LF closing | 20:43 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2024.2: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951716 | 20:46 |
opendevreview | Doug Goldstein proposed openstack/ironic stable/2024.1: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951717 | 20:52 |
opendevreview | Doug Goldstein proposed openstack/ironic bugfix/28.0: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951718 | 20:53 |
opendevreview | Doug Goldstein proposed openstack/ironic bugfix/27.0: Control port updates with update_pxe_enabled flag https://review.opendev.org/c/openstack/ironic/+/951719 | 20:53 |
opendevreview | Merged openstack/ironic master: CI: remove legacy devstack baremetal admin and observer role usage https://review.opendev.org/c/openstack/ironic/+/951445 | 21:06 |
rm_work[m] | offhand, any major bugs/issues that have been fixed in the last year that I can reference as possible reasons why we'd need to make sure we get to latest? we're running like... 2024.1 | 21:16 |
opendevreview | Verification of a change to openstack/ironic master failed: api: Ensure parameter transform happens early https://review.opendev.org/c/openstack/ironic/+/948795 | 21:18 |
JayF | rm_work[m]: how about "so we continue getting security updates past halloween" https://usercontent.irccloud-cdn.com/file/dUebTDYh/image.png | 21:24 |
rm_work[m] | lol | 21:24 |
rm_work[m] | I suppose that might be relevant | 21:25 |
opendevreview | cid proposed openstack/networking-generic-switch master: Cast numeric Netmiko kwargs to native types. https://review.opendev.org/c/openstack/networking-generic-switch/+/951724 | 22:21 |
opendevreview | Merged openstack/ironic master: api: Ensure parameter transform happens early https://review.opendev.org/c/openstack/ironic/+/948795 | 22:36 |
opendevreview | Michal Nasiadka proposed openstack/networking-generic-switch master: doc: Rework support matrix for trunk driver https://review.opendev.org/c/openstack/networking-generic-switch/+/942338 | 22:55 |
opendevreview | Merged openstack/python-ironicclient master: Cast string boolean from CLI https://review.opendev.org/c/openstack/python-ironicclient/+/951600 | 23:33 |