| opendevreview | Jacob Anders proposed openstack/ironic master: Fix intermittent Redfish firmware update failures with BMC validation https://review.opendev.org/c/openstack/ironic/+/960230 | 01:03 |
|---|---|---|
| opendevreview | Jacob Anders proposed openstack/ironic master: Fix intermittent Redfish firmware update failures with BMC validation https://review.opendev.org/c/openstack/ironic/+/960230 | 01:05 |
| opendevreview | Jacob Anders proposed openstack/ironic master: Fix intermittent Redfish firmware update failures with BMC validation https://review.opendev.org/c/openstack/ironic/+/960230 | 01:06 |
| opendevreview | Jacob Anders proposed openstack/ironic master: [WIP] Make cache_firmware_components more resilient during upgrades https://review.opendev.org/c/openstack/ironic/+/960711 | 03:24 |
| opendevreview | Steve Baker proposed openstack/ironic master: Replace Chrome/Selenium console with Firefox extension https://review.opendev.org/c/openstack/ironic/+/961434 | 05:05 |
| opendevreview | Steve Baker proposed openstack/ironic master: Replace Chrome/Selenium console with Firefox extension https://review.opendev.org/c/openstack/ironic/+/961434 | 05:17 |
| dtantsur | Error while preparing to deploy to node 050264d4-5439-44f0-a9d7-c34f855a8f0a: Image oci://quay.io/dtantsur/cirros:0.6.3 | 12:01 |
| dtantsur | could not be found.: ironic.common.exception.ImageNotFound: Image oci://quay.io/dtantsur/cirros:0.6.3 could not be found | 12:01 |
| dtantsur | I don't think this feature even works... | 12:01 |
| opendevreview | Dmitry Tantsur proposed openstack/ironic master: OCI: accept both content types when requesting a manifest https://review.opendev.org/c/openstack/ironic/+/960283 | 12:12 |
| dtantsur | more debugging ^^^ | 12:12 |
| opendevreview | Dmitry Tantsur proposed openstack/bifrost master: WIP add an OCI artifact registry https://review.opendev.org/c/openstack/bifrost/+/961388 | 12:33 |
| opendevreview | Dmitry Tantsur proposed openstack/bifrost master: Remove the ability to install and use ironic-inspector https://review.opendev.org/c/openstack/bifrost/+/887934 | 13:02 |
| dtantsur | JayF: revived the old patch of mine ^^ let me know if you think it's too early | 13:02 |
| JayF | I think for the sake of the release team's sanity, we should wait to do the paperwork until after the official release date. As far as prep work like this goes, I think it's okay. | 13:24 |
| JayF | If someone wanted to get really industrious, they could also propose all of the "delete everything and replace it with a README" patches 😂 | 13:24 |
| TheJulia | good morning | 13:42 |
| opendevreview | Jakub Jelinek proposed openstack/ironic-python-agent master: Fix skip block devices for RAID arrays https://review.opendev.org/c/openstack/ironic-python-agent/+/937342 | 13:49 |
| opendevreview | Jakub Jelinek proposed openstack/ironic-python-agent master: Fix erasable devices check https://review.opendev.org/c/openstack/ironic-python-agent/+/961485 | 14:09 |
| TheJulia | well, dib seems broken on fedora | 14:10 |
| TheJulia | le-sigh | 14:10 |
| dtantsur | TheJulia: which registries have you tested the OCI feature with? | 14:13 |
| TheJulia | quay.io and the image registry included with OpenShift | 14:16 |
| dtantsur | hmm, quay.io definitely does not work in the current version | 14:18 |
| TheJulia | dtantsur: can you try something like oci://quay.io/podman/machine-os:5.3-amd64 ? | 14:18 |
| TheJulia | it will be a way bigger payload, but that was the working example I was using | 14:19 |
| TheJulia | ugh, it looks like dib+ipa-b is super broken | 14:20 |
| dtantsur | trying now | 14:20 |
| dtantsur | TheJulia: Image oci://quay.io/podman/machine-os:5.3-amd64 could not be found. | 14:20 |
| TheJulia | wut?! | 14:21 |
| dtantsur | I'm now adding logging everywhere, I'm not sure where this one instance is coming from | 14:21 |
| TheJulia | ok | 14:21 |
| TheJulia | ugh | 14:21 |
| TheJulia | actually, try | 14:22 |
| TheJulia | oci://quay.io/podman/machine-os:5.3 | 14:22 |
| opendevreview | Jakub Jelinek proposed openstack/ironic-python-agent master: Fix erasable devices check https://review.opendev.org/c/openstack/ironic-python-agent/+/961485 | 14:23 |
| TheJulia | Tag structure wise, 5.3 should take it down the most complex path as well with most possibilities of being happy | 14:23 |
| dtantsur | so far so good | 14:23 |
| dtantsur | ah, we're probably restarting IPA because of the previous error | 14:24 |
| dtantsur | no, wait, it works | 14:25 |
| TheJulia | AHH! | 14:25 |
| dtantsur | TheJulia: this one works... but we need this feature to work with more than one image :) | 14:25 |
| TheJulia | so it's something in the matching logic then | 14:25 |
| dtantsur | I'm trying to find where this ImageNotFound is coming from. It's not oci_registry.py apparently? | 14:26 |
| TheJulia | give me a minute to pull that up | 14:27 |
| TheJulia | I'm trying to help iury with drac10 stuffs | 14:27 |
| dtantsur | Now I'm looking at OciImageService and.. it does not support tags at all?? Then what supports them? | 14:27 |
| TheJulia | so https://github.com/openstack/ironic/blob/master/ironic/common/oci_registry.py#L527 should be attempting to resolve the tags and what *should* be happening is it gets a list of possible tags back based upon what was supplied | 14:31 |
| TheJulia | https://github.com/openstack/ironic/blob/master/ironic/common/oci_registry.py#L535 should be getting a whole index back based upon the tag and resolution | 14:31 |
| dtantsur | It's not coming from oci_registry.py but I have another case | 14:31 |
| TheJulia | well, hold on | 14:31 |
| TheJulia | so the entry point into the call sequence should be https://github.com/openstack/ironic/blob/master/ironic/common/oci_registry.py#L596 | 14:32 |
| TheJulia | https://github.com/openstack/ironic/blob/master/ironic/common/image_service.py#L640-L804 should be taking the resulting data and then figuring out/matching out what is there | 14:33 |
| TheJulia | I'm wondering if your tag matching doesn't line up with what it is searching for?! Or are you not even making it there ? | 14:33 |
| opendevreview | Dmitry Tantsur proposed openstack/ironic master: OCI: accept both content types when requesting a manifest https://review.opendev.org/c/openstack/ironic/+/960283 | 14:35 |
| dtantsur | I'll run with ^^ and hopefully see | 14:35 |
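(Editor's note: a minimal Python sketch, not the actual ironic code, of the idea behind the "accept both content types" patch above. Per the OCI distribution spec, a manifest is fetched from `/v2/<name>/manifests/<reference>`, and the `Accept` header controls whether the registry may answer with an image index, a single image manifest, or either. The function name `manifest_request` is hypothetical.)

```python
# Media types from the OCI image spec: an index (multi-manifest) and a
# single image manifest.
OCI_INDEX = 'application/vnd.oci.image.index.v1+json'
OCI_MANIFEST = 'application/vnd.oci.image.manifest.v1+json'


def manifest_request(registry, name, reference):
    """Build the URL and headers for an OCI distribution-spec manifest GET.

    Hypothetical helper for illustration only.
    """
    url = 'https://%s/v2/%s/manifests/%s' % (registry, name, reference)
    # Advertise both content types so registries like quay.io can return
    # whichever document the tag actually points at, instead of a 404.
    headers = {'Accept': ', '.join([OCI_INDEX, OCI_MANIFEST])}
    return url, headers
```

The result would be handed to an HTTP client, e.g. `requests.get(url, headers=headers)` (plus whatever auth token the registry requires).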
| TheJulia | k | 14:35 |
| TheJulia | iurygregory: I'm going to build a centos machine to try and reproduce your build issue | 14:35 |
| TheJulia | Looks like dib's current code expects the same version of python in the build environment | 14:38 |
| TheJulia | which is wrong | 14:38 |
| *** | dansmith_ is now known as dansmith | 14:39 |
| *** | JakubJelnek[m] is now known as kubajj | 14:41 |
| dtantsur | TheJulia: the problem is: I simply did `oras push <tag> <file>`. I expect roughly 100% of users trying this feature to do the same (and fail the same way) :( | 14:44 |
| TheJulia | Fair, uhh... I'm trying to remember what it does under the hood and if that has changed at all | 14:44 |
| TheJulia | dtantsur: I'm trying to remember, is that what was written in the docs, or did I frame the docs much more towards explicitly using a sha-digest-style URL? | 14:52 |
| TheJulia | iurygregory: now updating centos | 14:53 |
| dtantsur | TheJulia: you're not really prescriptive there, but we absolutely need both to work IMO | 14:53 |
| dtantsur | https://docs.openstack.org/ironic/latest/admin/oci-container-registry.html#available-url-formats | 14:53 |
| TheJulia | okay, yeah | 14:54 |
| TheJulia | So I never did tag creation in quay, since I started to wholly discount the idea thanks to openshift's image registry | 14:54 |
| TheJulia | so, distinctly possible oras is doing something we don't expect and is not matching in the path I noted | 14:55 |
| TheJulia | That is where I would expect it to be breaking without being in any position to reproduce/try at the moment | 14:55 |
| dtantsur | Don't worry, I can iterate on it further. I just need to find the exact place where ImageNotFound appears | 14:55 |
| dtantsur | Unfortunately, rebuilding dev-scripts with a new revision takes like half an hour :( | 14:56 |
| dtantsur | and I'm still unable to rebuild my bifrost environment with CS10 | 14:56 |
| TheJulia | CS9 or ubuntu+devstack? I mean, it's not *that* difficult to do and it's not like you're doing anything special aside from setting an image_source | 14:57 |
| TheJulia | WOW | 14:57 |
| TheJulia | 9-stream firmware-images package is like over 1GB | 14:58 |
| TheJulia | 75% done and at 880 MB | 14:58 |
| TheJulia | at some point, we're just going to flip the code around, copy the couple of NIC firmwares worth keeping, wax the entire folder, and rebuild it | 14:59 |
| dtantsur | CS9 is not supported by bifrost | 14:59 |
| TheJulia | oh well | 14:59 |
| dtantsur | and bloody virt-builder still does not support CS10, so I need to change my tooling | 14:59 |
| TheJulia | 1.1 GB at 95%, WOW | 15:00 |
| dtantsur | replace https://github.com/dtantsur/config/blob/85facb18464a96bc995632f6b2f9da713b40490e/virt-install.sh with something else | 15:00 |
| iurygregory | TheJulia, ack tks | 15:09 |
| TheJulia | so I had to pass DIB_RELEASE as well on it, but it's getting started | 15:10 |
| TheJulia | loops attached, extracting | 15:12 |
| TheJulia | doing the magic | 15:12 |
| JayF | https://superuser.openinfra.org/articles/2025-superuser-awards-nominee-g-research/ just went live. Not pasting it here so my Ironic colleagues can stuff ballots in. Nope not at all... ;) | 15:19 |
| opendevreview | Clif Houck proposed openstack/ironic master: WIP: Trait Based Networking Filter Expression Parsing and Base Models https://review.opendev.org/c/openstack/ironic/+/961498 | 15:31 |
| TheJulia | iurygregory: how did you install diskimage-builder && ironic-python-agent-builder ? | 15:31 |
| TheJulia | regardless, I think your build is out of date because currently it is horribly broken, specifically: https://tarballs.opendev.org/openstack/ironic-python-agent-builder/dib/ | 15:37 |
| TheJulia | We never put a project.toml file in ironic-python-agent-builder FWIW | 15:38 |
| iurygregory | TheJulia, basically it was via bifrost and after pulling the new code I've run pip install . for ipa-b | 15:45 |
| iurygregory | going to prepare lunch, will be back in ~2hrs | 15:45 |
| dtantsur | TheJulia: finally useful logging | 15:48 |
| dtantsur | Cannot use image oci://quay.io/dtantsur/cirros:0.6.3: the artifact index does not contain a list of manifests: | 15:48 |
| dtantsur | {'schemaVersion': 2, 'mediaType': 'application/vnd.oci.image.manifest.v1+json', 'artifactType': 'application/vnd.unknown.artifact.v1', 'config': {'mediaType': 'application/vnd.oci.empty.v1+json', 'digest': 'sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a', 'size': 2, 'data': 'e30='}, 'layers': [{'mediaType': 'application/vnd.oci.image.layer.v1.tar', 'digest': 'sha256:7d6355852aeb6dbcd191bcda7cd74f1536cfe5cbf8a10495a7283a8396e4b75b', 'size': 21692416, 'annotations': {'org.opencontainers.image.title': 'cirros-0.6.3-x86_64-disk.img'}}], 'annotations': {'org.opencontainers.image.created': '2025-09-09T13:50:07Z'}} | 15:48 |
| TheJulia | woohoo | 15:49 |
| dtantsur | I think our code expects one more layer of indirection where here it's already THE manifest? | 15:49 |
| TheJulia | just draw a direct line in that code!? | 15:49 |
| dtantsur | which is probably why the Accept header does not work? | 15:49 |
| TheJulia | Yeah, we do, and yeah | 15:49 |
| TheJulia | I'd just try and draw the direct line, but using the length | 15:49 |
| TheJulia | if there is one or more | 15:49 |
| TheJulia | because then we know it's a composite tag | 15:49 |
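(Editor's note: a hedged sketch of the "draw a direct line" idea, not the actual OciImageService code. A composite tag yields an image index with a `manifests` list; a plain `oras push <file>` yields a document that is already the manifest, so its `layers` should be used directly. The function name `resolve_artifact` is hypothetical.)

```python
OCI_INDEX = 'application/vnd.oci.image.index.v1+json'
OCI_MANIFEST = 'application/vnd.oci.image.manifest.v1+json'


def resolve_artifact(document):
    """Return ('index', manifests) or ('manifest', layers).

    Hypothetical dispatch on the OCI media type, for illustration only.
    """
    media_type = document.get('mediaType')
    if media_type == OCI_INDEX and document.get('manifests'):
        # Composite tag: a list of per-platform manifests to match against.
        return 'index', document['manifests']
    if media_type == OCI_MANIFEST and document.get('layers'):
        # Simple `oras push <file>` case: this document *is* the manifest,
        # so take its layers directly instead of doing another lookup.
        return 'manifest', document['layers']
    raise ValueError('unsupported OCI document: %s' % media_type)
```

Run against the cirros manifest pasted above, this would take the single-manifest branch rather than failing with "does not contain a list of manifests".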
| TheJulia | iurygregory: finally got it to build, I think it's environment variable differences passed in; ramdisk of 487MB on a local build. | 16:00 |
| dtantsur | sigh, there is also mandatory disktype that needs fixing.. | 16:12 |
| TheJulia | yeah, just make it not mandatory in such a case: if it's there, try and use it; if not, don't | 16:16 |
| TheJulia | the existing modeling assumed compound object, such as podman's machine-os which was the prior art we based upon | 16:17 |
| TheJulia | The super simple path of 1-1 mapping... yeah | 16:17 |
| dtantsur | `'OciImageService.identify_specific_image' is too complex` jesus how I hate everything... | 16:18 |
| dtantsur | so I'm also on the hook for a refactoring, nice | 16:18 |
| * | dtantsur gets a snack first | 16:19 |
| TheJulia | you could always bump complexity, create a bug, and wait for me to be able to cycle to it | 16:20 |
| TheJulia | (I'm awful, I know) | 16:20 |
| dtantsur | I *think* there is a simple subroutine I can extract into a new method | 16:49 |
| * | TheJulia falls over dddead | 17:06 |
| * | TheJulia just had a long-ish discussion regarding bootc | 17:08 |
| opendevreview | Dmitry Tantsur proposed openstack/ironic master: Fix OCI artifacts pointing to a single manifest https://review.opendev.org/c/openstack/ironic/+/960283 | 17:19 |
| dtantsur | The patch is getting out of hand ^^ | 17:19 |
| dtantsur | needs testing still | 17:19 |
| dtantsur | TheJulia: it's surprisingly quiet in my world about bootc. although somebody in metal3 upstream has already asked if we can somehow build IPA with it :D | 17:29 |
| JayF | I got some feedback that ironic-weekly-prio was tagged on some stuff, and was autocompleting in that field. I dropped that invalid tag (with the LY) from the merged/abandoned patches that had it in an attempt to correct | 17:30 |
| TheJulia | JayF: the auto pull of items with open/merged state? | 17:31 |
| JayF | I'm saying simply: many older patches had hashtag:ironic-weekly-prio | 17:31 |
| TheJulia | JayF: yeah, okay | 17:31 |
| JayF | when I asked clif1 to h/t his change, he noted it autocompleted with the ly version | 17:31 |
| JayF | so I deleted them | 17:31 |
| TheJulia | oh, ironic-week-prio vs ironic-weekly-prio (which I don't think I've ever tagged, but I could see how it gets there) | 17:32 |
| JayF | yeah exactly | 17:32 |
| TheJulia | dtantsur: interesting in that we added logic to support it in the agent, but to use it to run the agent is... Interesting. | 17:32 |
| *** | clif1 is now known as clif | 17:32 |
| dtantsur | I'm not sure it was a well-thought idea :) we had a brainstorming session about distributing IPA images | 17:33 |
| dtantsur | The immediate desire is to switch to oci:// by default and ship IPA images via our Quay | 17:35 |
| dtantsur | And the next question was of course "oh, and can we layer them for easy modifications"? :) | 17:35 |
| TheJulia | Yeah, that is a plus | 17:36 |
| TheJulia | I mean, its more like just running a container | 17:36 |
| TheJulia | Uhh, lets see, where did I put my brain | 17:38 |
| dtantsur | Cats were playing with it, so now it's under some furniture? | 17:38 |
| * | iurygregory is back | 17:42 |
| iurygregory | TheJulia, what command did you use? | 17:43 |
| TheJulia | dtantsur: oh noes... the orange cat has it.. Rutro! | 17:44 |
| dtantsur | forget about it then :D | 17:44 |
| dtantsur | (what would an orange cat do with a brain??) | 17:44 |
| TheJulia | iurygregory: DIB_RELEASE=9-stream ironic-python-agent-builder centos -o ironic-python-agent | 17:44 |
| iurygregory | ack, trying now | 17:45 |
| TheJulia | dtantsur: given they universally share a single braincell.... take a nap?! | 17:45 |
| dtantsur | true | 17:45 |
| TheJulia | Yes, orange cat's eyes are closed. He occupies the dog bed. | 17:47 |
| dtantsur | A good position to be in! (my condolences to the dog) | 17:48 |
| TheJulia | (He is used to it...) | 17:53 |
| dtantsur | Okay, my testing environment won't rebuild for some time more, I guess I'll test the new revision tomorrow. | 17:53 |
| dtantsur | If that works, I'll try ORAS+docker's registry, I guess | 17:54 |
| iurygregory | 595MB ironic-python-agent.initramfs | 17:57 |
| TheJulia | ... I guess you're going to need to take it apart | 18:13 |
| TheJulia | and look inside | 18:13 |
| iurygregory | yeah | 18:14 |
| iurygregory | I'm going to also test building in my laptop just to see what size will be | 18:14 |
| TheJulia | iurygregory: by chance, did you put forward a change around https://github.com/iurygregory/openstack-sushy/commit/8039b9f8a42e0600a467849b3dc77f79397422e3 into gerrit? | 18:24 |
| iurygregory | TheJulia, nope, still testing downstream to see if really makes sense | 18:26 |
| TheJulia | based upon your notes, it seems to, fwiw. | 18:26 |
| iurygregory | ack | 18:27 |
| TheJulia | FWIW, I'm looking at starting to move some of these fixes downstream into openstack | 18:38 |
| iurygregory | ok | 18:56 |
| JayF | I didn't realize y'all had any sushy compat code in downstream places :-O | 18:59 |
| TheJulia | well, we have the older releases we've consumed downstream | 19:05 |
| JayF | honestly am more just curious how your end of the world works overall | 19:06 |
| iurygregory | on OCP it's mostly to make it easier to consume from source so we can map the content for each OCP release we have | 19:35 |
| iurygregory | so we don't have to be strict with an upstream release, since the upstream/downstream cycles are quite different (6 and 4 months, kinda) | 19:36 |
| JayF | sensible | 20:01 |
| TheJulia | osp/rhoso is release tied, so we have downstream git mirrors which have additional branches and additional processes we need to walk through. From there we go through build/release processes | 20:21 |
| TheJulia | iurygregory: so looking at your change again, I'm thinking that might be wrong because it's going to raise an exception regardless, since there won't be a location. What needs to happen, I think, is to first check if we have a 200 status code, then hand over to that location handling code and the reference lookup using that | 20:28 |
| TheJulia | thoughts? | 20:29 |
| *** | mnaser_ is now known as mnaser | 20:36 |
| TheJulia | I guess I'm mentally trying to avoid raising ExtensionError | 20:36 |
| mnaser | whos ready for some FUN | 20:37 |
| TheJulia | define fun? and type of fun? | 20:38 |
| TheJulia | mnaser: whats going on? | 20:38 |
| mnaser | i have a f(riggin)antastic environment that takes maybe up to 60s for a port to go up | 20:38 |
| JayF | hopefully the fun where he approves adamcarthur5's openstack-exporter changes ;) | 20:38 |
| mnaser | https://paste.openstack.org/show/bDbiC2ZIPAB1u6YBRnoE/ | 20:38 |
| * | JayF hands mnaser a "stp off" /s | 20:38 |
| mnaser | i have a feeling that ironic-conductor is in a busy loop | 20:38 |
| TheJulia | mnaser: line carrier up or for packets to actually be forwarded? | 20:39 |
| JayF | so you're saying the *BMC* port goes out to lunch? | 20:39 |
| JayF | because that's querying the BMC port | 20:39 |
| mnaser | right, but isn't oslo_service.loopingcall.LoopingCallTimeOut seeming to imply that the loop never ran for 82s? | 20:39 |
| * | TheJulia twitches about eventlet | 20:40 |
| mnaser | our bff eventlet indeed | 20:40 |
| JayF | This looks like Ironic is querying /your BMC/ for power state and it's not up | 20:40 |
| JayF | so again I ask: do the *BMC* ports flap on power change? | 20:40 |
| JayF | because we might have a correlation!=causation issue but I'm not 100% sure | 20:40 |
| TheJulia | mnaser: ipmi or redfish? | 20:40 |
| mnaser | redfish | 20:41 |
| TheJulia | uhh.... | 20:41 |
| TheJulia | presently searching for a word | 20:41 |
| TheJulia | is it a shared port with the OS? or a dedicated BMC port? | 20:41 |
| mnaser | dedicated | 20:41 |
| TheJulia | And it just doesn't respond, times out, etc? Do we see packets? | 20:42 |
| mnaser | let me double check but for example | 20:43 |
| TheJulia | The looping call only calling back to it after so many seconds is... 8| | 20:43 |
| mnaser | https://www.irccloud.com/pastebin/H2ZNJSpW/ | 20:43 |
| * | TheJulia blinks | 20:44 |
| TheJulia | ooookay | 20:44 |
| TheJulia | okay | 20:44 |
| TheJulia | so the overall loopingcall I guess is wrapping the neutron interaction and you're getting the thread slayed basically | 20:45 |
| TheJulia | eventlet is standing there going "no, you cannot proceed" | 20:45 |
| JayF | is this how needing more workers presents?! | 20:45 |
| TheJulia | or the overall port operation is hanging for a super long time, finally returns | 20:46 |
| mnaser | so the port was plugged "2025-09-17 20:22:05.057" | 20:46 |
| TheJulia | oh. wow. | 20:46 |
| mnaser | which is fine, you see it 3 minutes less in the logs that the ports are there | 20:46 |
| TheJulia | how many nodes is this conductor managing? | 20:46 |
| mnaser | a whopping 80 something split by 3 conductors | 20:47 |
| TheJulia | ... Are there a bunch of override timeouts? | 20:47 |
| mnaser | only for neutron since these _FUN_ cumulus switches take ~30s to actually apply their configs * two switches for a bond | 20:47 |
| TheJulia | (We've seen some folks try to tune things like timeouts to insane values and break things in super weird ways, just trying to understand the scope) | 20:47 |
| TheJulia | okay okay | 20:48 |
| TheJulia | so, uhhhhhh | 20:48 |
| mnaser | and to avoid massive races, we have locking in place, so this is with me trying to put a 60s sleep between every manage | 20:48 |
| TheJulia | okay, I think I understand what is going on, give me a few minutes while I dig through the code | 20:49 |
| JayF | I'm very curious to see :D | 20:49 |
| mnaser | (i understand i am dealing with a turd here.. but i do think we can get away if the behaviour is indeed blocking and tenacity is not yielding for some reason) | 20:49 |
| JayF | mnaser: the amount I'm curious what the behavior would be on post-eventlet ironic is maximum right now :D | 20:50 |
| mnaser | are ya suggesting an upgrade on a thursday | 20:50 |
| mnaser | so right now, with a 60s sleep, i managed to get 21 clean failed, 7 managable | 20:51 |
| TheJulia | crazy question | 20:52 |
| TheJulia | what python version is this? | 20:52 |
| mnaser | 3.10.12 | 20:53 |
| TheJulia | so flow wise, you're asking for the node to be provided, and they are all going down this path | 20:55 |
| mnaser | running a runbook more specifically, but i think it ends up the same issue if it was a normal provide too | 20:55 |
| mnaser | https://www.irccloud.com/pastebin/vtB9qfWc/ | 20:55 |
| mnaser | they all seem to be floating around 75-80s | 20:55 |
| TheJulia | yeah | 20:55 |
| TheJulia | Curious | 20:56 |
| mnaser | https://opendev.org/openstack/oslo.service/src/branch/master/oslo_service/backend/_eventlet/loopingcall.py#L54-L60 | 20:56 |
| mnaser | i mean it sounds/feels like the system didn't power on for 75s "in theory" | 20:57 |
| mnaser | but that just seems off | 20:57 |
| TheJulia | mnaser: any chance we can get a peek at a sanitized ironic.conf ? | 20:58 |
| TheJulia | this feels super bizarre | 21:00 |
| mnaser | https://www.irccloud.com/pastebin/IZn2Imum/ | 21:01 |
| TheJulia | secondary question: what kind of gear is this? | 21:01 |
| mnaser | guilherme told me your favorite kind | 21:01 |
| mnaser | the novo | 21:02 |
| TheJulia | novo?! | 21:04 |
| JayF | fwiw cc: kubajj I'm working on that IPA Hardware manager refactor | 21:04 |
| TheJulia | so, the power sync interval stands out to me in the configuration | 21:04 |
| JayF | figured if anyone should be responsible to clean that up it should be the original spiller :) | 21:04 |
| mnaser | lenovo => le novo => the novo | 21:04 |
| TheJulia | set to every 30 seconds | 21:04 |
| TheJulia | lol | 21:04 |
| TheJulia | now I get it! | 21:04 |
| mnaser | i'm all bad jokes after digging at this for the past few hours | 21:04 |
| mnaser | mush brain | 21:04 |
| TheJulia | so, every 30 seconds for every bmc is super aggressive | 21:05 |
| TheJulia | I'd back it to 60 since the base standards expect 1 interaction every 60 seconds, specifically around ipmi, but the vendors get weird about redfish and sessions as well | 21:05 |
| TheJulia | uhhhhh. speaking of redfish, is it session auth ? | 21:05 |
| mnaser | session auth.. is that just username/pw? | 21:06 |
| TheJulia | this would be governed with driver_info parameters | 21:06 |
| TheJulia | session auth is username+password, but where the conductor saves a session token in ram | 21:06 |
| JayF | every 30 seconds is MASSIVE compared to what I've ever run at scale | 21:06 |
| TheJulia | so it doesn't re-auth | 21:06 |
| JayF | I think we run 5 minutes+ in current downstraem | 21:06 |
| TheJulia | you could be exhausting the session limits of the BMCs | 21:06 |
| TheJulia | some vendors only allow so many distinct logins, fwiw | 21:06 |
| TheJulia | the novo gear, I don't know about | 21:07 |
| TheJulia | BUT, I've seen/heard grumbling about some gear *also* taking *forever* to reflect/update power state changes too | 21:07 |
| mnaser | if i have username+pw that is session, right? or is there an extra knob to flip? | 21:07 |
| TheJulia | Like... 2.5 minutes which forces the sync interval to also be raised upwards | 21:07 |
| TheJulia | mnaser: we can use it to create a session, or we use it for basic re-auth every time | 21:08 |
| TheJulia | uhhh | 21:08 |
| mnaser | https://www.irccloud.com/pastebin/jHChM3Vs/ | 21:08 |
| TheJulia | auth_type in ironic.conf | 21:08 |
| TheJulia | so sessions auth it *should* be | 21:09 |
| TheJulia | based upon the info you've provided | 21:09 |
| TheJulia | well, it would be "auto", where it tries/prefers session auth and falls back | 21:09 |
| TheJulia | Okay, what else... | 21:09 |
| mnaser | TheJulia: ok so i got a fun one for you, i looked at the docs for why it's 30s | 21:10 |
| mnaser | looks like we had it at default.. | 21:10 |
| TheJulia | oh? | 21:10 |
| mnaser | 2025-07-18 14:10:01.932 1 ERROR oslo.service.loopingcall [-] Dynamic backoff interval looping call 'ironic.conductor.utils.node_wait_for_power_state.<locals>._wait' failed: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 186.12 seconds | 21:10 |
| TheJulia | WUT | 21:11 |
| mnaser | okay hold | 21:11 |
| mnaser | this runbook includes bios.apply_configuration | 21:11 |
| mnaser | i wonder if that's mucking about | 21:11 |
| TheJulia | oh | 21:11 |
| TheJulia | OH | 21:11 |
| TheJulia | yeah | 21:11 |
| mnaser | so.. do we need a sleep clean step? | 21:12 |
| TheJulia | Technically, version wise, we might have one | 21:12 |
| TheJulia | I'd just do a sleep in a playbook first to let things settle down post-apply because things may be autonmously rebooting | 21:12 |
| mnaser | well after the apply bios config, we have some erase and raid stuff we do after | 21:13 |
| mnaser | so i wonder if we split that into two runbooks.. | 21:13 |
| mnaser | or if we can put a sleep in there, we have 2024.2 here i think | 21:14 |
| TheJulia | I'd give it a shot. I know we've long seen vendors do weird things when bios settings get changed | 21:14 |
| TheJulia | yeah, I think the code which give you a sleep option is in 2025.2, but first step to just delineate if one is causing another issue | 21:15 |
| TheJulia | The whole thing very much looks like we're sitting there aggressively trying to engage with the bmc and not getting a response | 21:15 |
| TheJulia | the *other* issue is the connection timeouts are also 60 seconds, so you're basically constantly trying to check power state, and if the BMC is not responding, then bad things will start to happen | 21:15 |
| mnaser | i think post apply configuration maybe the bmc is not happy | 21:15 |
| TheJulia | I'd packet capture to verify somewhere in there | 21:16 |
| TheJulia | Yeah, I'd concur with that | 21:16 |
| TheJulia | and I bet that is starting a sort of cascade of unhappiness | 21:16 |
| TheJulia | sort of like a portable storm cloud | 21:16 |
| mnaser | so maybe two runbooks, one for bios, and one for the normal raid cleanup | 21:16 |
| mnaser | and go for a nap inbetween | 21:16 |
| TheJulia | sandwich for the novo ;) | 21:17 |
| opendevreview | Jay Faulkner proposed openstack/ironic master: Increase default sync_power_state_interval https://review.opendev.org/c/openstack/ironic/+/961554 | 21:17 |
| JayF | did somebody say bad default? | 21:17 |
| TheJulia | eek, 60 is okay-ish | 21:18 |
| TheJulia | just... the interval is exceeding the timeouts | 21:18 |
| TheJulia | default timeouts are also 60 seconds | 21:18 |
| TheJulia | (I... think.) | 21:18 |
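(Editor's note: a hedged ironic.conf fragment summarizing the tuning advice above: poll power state no faster than the BMC can answer, give slow gear time to reflect a power change, and prefer session auth so the conductor reuses a token instead of re-authenticating on every poll. Option names are as recalled from the ironic configuration reference; verify them against your release's sample config before applying.)

```ini
[conductor]
# Poll once a minute rather than every 30s; BMCs that rate-limit or cap
# concurrent sessions cope much better (the discussion suggests 60s as a
# floor, with some deployments running 5 minutes or more).
sync_power_state_interval = 60
# Give slow hardware time to actually reflect a power change before the
# wait loop gives up; some gear takes minutes to report the new state.
power_state_change_timeout = 120

[redfish]
# Prefer session auth (token reuse) over basic re-auth on every request;
# "auto" tries session auth first and falls back to basic.
auth_type = session
```

The key relationship: the sync interval should not be shorter than the connection/response timeouts, or the conductor ends up perpetually re-polling a BMC that has not yet answered the previous request.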
| mnaser | oh also i managed to get a sqlalchemy trace from the api in messing with this too! | 21:19 |
| mnaser | runbook update without "args" in the json blob | 21:20 |
| * | TheJulia vomits | 21:20 |
| TheJulia | wecanhasbugpls? | 21:20 |
| mnaser | yes i will write that down | 21:21 |
| opendevreview | Steve Baker proposed openstack/ironic master: Replace Chrome/Selenium console with Firefox extension https://review.opendev.org/c/openstack/ironic/+/961434 | 21:22 |
| TheJulia | mnaser: thanks | 21:24 |
| TheJulia | stevebaker[m]: impressive | 21:33 |
| stevebaker[m] | TheJulia: ty! | 21:33 |
| JayF | I am so close to getting all unit tests passing on this IPA HWM refactor | 21:35 |
| mnaser | so i guess.. technically... do we need to reboot or even boot up the system for the redfish bios interface? | 22:31 |
| mnaser | https://github.com/openstack/ironic/blob/247ae57a22d1b48173398e940d62e531c237d66e/ironic/drivers/modules/redfish/bios.py#L318 | 22:33 |
| mnaser | it seems we're rebooting right meow | 22:33 |
| JayF | iurygregory has some work ongoing in that space iirc, we have lots of unnecessary reboots in redfish-only step actions | 22:34 |
| opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: WIP: Refactor ironic_python_agent/hardware.py into multiple modules https://review.opendev.org/c/openstack/ironic-python-agent/+/961559 | 22:36 |
| * | JayF wonders if mnaser is seeing his messages? | 22:38 |
| mnaser | po9=]o | 22:39 |
| mnaser | sorry | 22:39 |
| mnaser | kid attacking laptop | 22:39 |
| mnaser | im still trying to figure out how to run this bios thing :( | 22:39 |
| JayF | Yeah, I think the reboots are unavoidable at this time | 22:41 |
| TheJulia | If memory serves the bmc can also say "i need to reboot for this to take effect" | 22:44 |
| mnaser | power_state_change_timeout .. gonna try playing with that, since that's really my issue, maaaybe | 23:08 |
| iurygregory | I think for bios settings we need to reboot, some firmware updates we can try to avoid rebooting | 23:19 |
| iurygregory | TheJulia, by any chance can you share the .initramfs and .kernel you generated so I can test in the idrac10? | 23:20 |
| cardoe | Yeah, there's a bit we get back when we upload the firmware or set BIOS settings that, if memory serves me correctly, we're just throwing out the window. | 23:25 |
| cardoe | Like the BMC will give you back a Job ID if it wants you to reboot it or do something further. | 23:26 |
| cardoe | I had a WIP patch a while back that I tried to save that ID and make some choices from there | 23:26 |
| cardoe | But I couldn't figure out how to make the operation conditionally do a reboot and ensure the IPA was set up to boot again or not. | 23:27 |
| *** | ex_tnode7 is now known as ex_tnode | 23:40 |
| iurygregory | TheJulia, oh I only saw this now: so looking at your change again, I'm thinking that might be wrong because it's going to raise an exception regardless, since there won't be a location. What needs to happen, I think, is to first check if we have a 200 status code, then hand over to that location handling code and the reference lookup using that | 23:46 |
| iurygregory | I think it makes sense | 23:46 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!