dxterslab | TheJulia: pyghmi SOL to supermicro and then proxied to a websocket and using xterm.js for interactive shell works | 00:49 |
---|---|---|
TheJulia | impressive | 00:50 |
dxterslab | the good thing is that sol is generic and not oem specific. I will be trying the same approach with dell soon | 01:19 |
TheJulia | cool | 01:36 |
JayF | Thanks for the new xterm-ipmi console interface ;) | 02:51 |
dxterslab | The current approach I believe I will be taking is having a proxy that does IPMI SOL in the backend and websocket on the front end. I heavily rely on metal3, so my assumption is that ironic would own this container/app, and metal3 should proxy the websocket for Kubernetes? | 02:57 |
JayF | Look at the recent redfish graphical console code, it follows a similar pattern | 03:05 |
JayF | I also suspect we need to write different ways to spawn the proxy container there, too. | 03:05 |
JayF | https://opendev.org/openstack/ironic/commit/25a3dd076a0a8d3f4bbb5886252f6d08d78e33f9 Is the last commit in that series. | 03:08 |
dxterslab | I saw the graphical-console. It is a selenium browser wrapper. I tried to interact with the websockets provided by redfish, first challenge is the custom RFB used by SMC and their BMC hardware, second challenge, Dell websocket didn’t easily work with vanilla novnc… because of time I have parked this option, but will revisit in the future. Also, to my understanding, novnc doesn’t allow for copy paste for text nor | 03:09 |
dxterslab | keeps shell history | 03:09 |
JayF | Well I'm saying it uses a sidebar container | 03:21 |
JayF | You could follow the pattern, and just put your stuff into the container instead of the VNC stuff | 03:21 |
JayF | **sidecar | 03:22 |
dxterslab | Ah! Gotcha | 03:28 |
rpittau | good morning ironic! o/ | 07:15 |
queensly[m] | Good morning | 08:23 |
AmarachiOrdor[m] | Good morning everyone! | 08:26 |
freemanboss[m] | Good afternoon everyone | 11:17 |
opendevreview | Jay Faulkner proposed openstack/ironic stable/2024.2: OSSA-2025-001: Disallow unsafe image file:// paths https://review.opendev.org/c/openstack/ironic/+/949174 | 14:05 |
opendevreview | Jay Faulkner proposed openstack/ironic stable/2024.1: OSSA-2025-001: Disallow unsafe image file:// paths https://review.opendev.org/c/openstack/ironic/+/949175 | 14:05 |
JayF | Anyone poked at vmedia+openbmc based BMCs yet? | 14:45 |
JayF | I found what looks like an old spec from their GitHub that looks like they may have chosen a cifs-based implementation at least as an option, I'm really hoping that's not the case | 14:58 |
TheJulia | I've had a few off-hand conversations about doing cifs on weird vendors in the past | 15:01 |
JayF | irmc driver supports it | 15:02 |
TheJulia | and the consensus was maybe having a flag in ironic which could know to "send it a cifs url" would be ideal, but yeah.... | 15:02 |
JayF | for their mc | 15:02 |
TheJulia | fun | 15:02 |
TheJulia | yeah | 15:02 |
TheJulia | I think that is sort of predicated upon the idea that we change the output url for where to grab the artifact from, but don't change where we write it; in other words, make it the expectation that the operator sets up a cifs share with it | 15:03 |
JayF | yeah, what I'm looking at is just rejecting an http url outright (not even trying to GET/HEAD it) so I think I might have hit the reverse lottery :| | 15:03 |
JayF | makes sense | 15:03 |
TheJulia | or a cifs share which points to the vmedia folder, or something | 15:03 |
JayF | and gives me ideas on what to try with sushy directly | 15:03 |
TheJulia | I think it makes a ton of sense to detect the failure, and then try submitting a changed url | 15:03 |
TheJulia | fwiw | 15:03 |
JayF | probably trivial to hack in for seeing if I can get it to work | 15:04 |
JayF | ack, thanks for the suggestion, I wouldn't have considered just doing a basic fallback | 15:04 |
okamitok[m] | So I've got a just general question on flow that maybe I'm not understanding.... (full message at <https://matrix.org/oftc/media/v1/media/download/AcoWRZe3OD7jr9wyHvi2mt8-77bes5aYjqOQzgAtD2vD-iC81itD_sAZP-6GYZt2am593eTX5PkQiLs-tq_ZjbRCeW_bLnzwAG1hdHJpeC5vcmcvR3ROT3NMZHBzbUdma01rTWVrQW9QQ1Rq>) | 15:11 |
JayF | to your last line; yes | 15:12 |
JayF | that's exactly what happens in flat or neutron network interface | 15:12 |
okamitok[m] | Got it so that's where my issue is then, everything is working except that last step. | 15:14 |
okamitok[m] | I'll need to dive into the logs, thanks. | 15:14 |
* TheJulia attempts to put brain into policy writing mode | 15:24 | |
*** MichaelSherman[m] is now known as shermanm[m] | 15:38 | |
shermanm[m] | now that service_steps are a thing, are there any plans around snapshot support? I've got a student working on our internal mechanism for it this summer, and thought it would be nice to try and align with an approach that could get upstreamed | 15:41 |
opendevreview | Jay Faulkner proposed openstack/ironic bugfix/26.0: OSSA-2025-001: Disallow unsafe image file:// paths https://review.opendev.org/c/openstack/ironic/+/949186 | 15:44 |
TheJulia | JayF: it occurs to me someone came in with the exact same issue recently | 16:18 |
JayF | My current plan is to hack in ironic/drivers/modules/redfish/boot.py:288 and last-minute swap to an smb url in this test env | 16:19 |
JayF | to see if it will take samba | 16:19 |
TheJulia | shermanm[m]: would love to do it and get it into place, that being said I don't have time to do it. I think we could loosely collaborate/mentor if that might help on an upstream-friendly change | 16:19 |
TheJulia | JayF: I think that is exactly what that person did based upon the failure from the bmc | 16:19 |
JayF | if not, I will likely email a vendor to get more info | 16:19 |
opendevreview | Julia Kreger proposed openstack/ironic master: Patch configdrive metadata https://review.opendev.org/c/openstack/ironic/+/946677 | 16:20 |
TheJulia | ^ was painful. | 16:20 |
TheJulia | But more so from stop/start multiple times | 16:20 |
cardoe | So I think we should look at implementing the upload to BMC instead of instructing the BMC to download it. | 16:27 |
opendevreview | Julia Kreger proposed openstack/ironic master: Patch configdrive metadata https://review.opendev.org/c/openstack/ironic/+/946677 | 16:27 |
cardoe | I mean sure extend how ya need. | 16:27 |
cardoe | But we should add the upload to BMC and nudge folks to try that first. | 16:27 |
JayF | At this point I'm not thinking in theory, I wanna get a thing working on a specific piece of hardware | 16:29 |
cardoe | Yeah that's why I said "yes extend how ya need" | 16:30 |
cardoe | Just tossing out a suggestion for a longer term thing. | 16:30 |
opendevreview | Jay Faulkner proposed openstack/ironic bugfix/26.0: [bugfix-only] Further docs build fixes https://review.opendev.org/c/openstack/ironic/+/949373 | 16:39 |
JayF | The docs build failures that continue are incredibly perplexing | 16:40 |
* JayF going to maybe try one alternate approach | 16:40 | |
opendevreview | Jay Faulkner proposed openstack/ironic bugfix/26.0: [bugfix-only] Ensure u-c is applied https://review.opendev.org/c/openstack/ironic/+/949374 | 16:42 |
JayF | two different approaches, if the second works it'll be better | 16:42 |
JayF | alternatively if cores are +1, I can ask infra to mash force merge on these security patches and stop the urgency-clock on fixing docs | 16:43 |
shermanm[m] | TheJulia: that's exactly what I was hoping for, just broad guidance on footguns to avoid, and a "shape" that might be acceptable. And maybe it just turns into a spec after we've discovered some tradeoffs | 16:49 |
Sandzwerg[m] | hello ironic. Has anyone heard about race conditions with nvme naming? I see a case where the IPA identified the root disk correctly as the smaller disk (nvme1n1). According to the IPA log the device is <900GiB. But in the booted instance the OS is installed on nvme1n1 (6TB), which was detected as nvme0n1 in the IPA log. I mean I get that different OSes might have different naming schemes but it seems to me right now that IPA is | 16:50 |
Sandzwerg[m] | writing to the bigger disk but detects that disk as the other, smaller one. | 16:50 |
TheJulia | shermanm[m]: Yeah, a spec is a great starting point to try and get to the same place | 16:51 |
TheJulia | ... That is weird, since if I'm remembering the logic correctly the OS should be targeted to the smallest disk greater than 4 GB | 16:54 |
TheJulia | Sandzwerg[m]: ^ | 16:54 |
Sandzwerg[m] | We set root hints, but the hint is correct if I can believe the log. According to the log the OS is installed on the correct, so the smaller, disk. But when booting it appears the OS was installed on the wrong disk. But the device name in IPA and OS is swapped. So IPA: nvme0 big disk, nvme1 small disk, OS: nvme0 small disk, nvme1 big disk | 16:57 |
Sandzwerg[m] | I only have a couple of nodes, all from the same order, that show this behavior. | 16:58 |
TheJulia | oh wow | 16:59 |
TheJulia | could it be initialization order differences? | 16:59 |
TheJulia | Going back to.. ?2022? distributions started shipping kernels which did async device initialization | 17:00 |
Sandzwerg[m] | I'm not sure what it is but it feels like something like that yeah | 17:01 |
TheJulia | which makes actual device by name matching/ordering unreliable across reboots | 17:01 |
TheJulia | unless you only have a single such device | 17:01 |
Sandzwerg[m] | I would expect that even if both OS order/name the devices differently if they get the facts from these devices these should be stable. But right now it feels like IPA mixes the naming at first, then continues to use the mixed names but the device naming itself is changed while that is happening | 17:03 |
TheJulia | JayF: I'm good with force merging fixes in for broken doc builds, as long as we work the doc issues | 17:03 |
TheJulia | ordering for the kernel is purely based upon which device responds first | 17:04 |
TheJulia | what is your root device hint set to? | 17:04 |
JayF | TheJulia: the thing is, I only see the docs job failing on the patch changes, and IDK why | 17:06 |
Sandzwerg[m] | The root device hint is <900 and the smaller disk fits to that (894GiB) | 17:08 |
TheJulia | so no hint? | 17:11 |
JayF | GB vs GiB shenanigans? | 17:11 |
TheJulia | no, one boot the device will be /dev/nvme0, reboot it might be /dev/nvme1 or /dev/nvme0 | 17:12 |
TheJulia | JayF: docs run on a different nodeset | 17:13 |
Sandzwerg[m] | > so no hint? | 17:13 |
Sandzwerg[m] | No the hint is "root_device: size <=900". That is correct. | 17:13 |
Sandzwerg[m] | Should FS UUIDs be stable? Maybe this is an old install, because the FS UUIDs I see in the IPA log and on the OS are different. But then the install to the smaller disk is completely gone | 17:14 |
Sandzwerg[m] | ohhhhh, the metadata uses the date & time of deployment as UUID and that shows that the partitions on the bigger disk are two months old. It does not explain why all the partitions that IPA wrote to the smaller disk are no longer visible in lsblk but maybe wiping everything is enough. (Yes I know I reaaally need cleaning but we had issues with it last time we tried it) | 17:20 |
TheJulia | Sandzwerg[m]: UUIDs/WWID/Serial numbers are stable (except on some raid controllers which fake a serial number) | 17:25 |
Sandzwerg[m] | Yeah these were also not matching, that was the other hint. I'll wipe both disks and try again | 17:26 |
shermanm[m] | Sandzwerg: if wiping fixes it, I'd try to get cleaning enabled if at all possible. we were having similar issues here: https://bugs.launchpad.net/ironic/+bug/2084565, https://bugs.launchpad.net/ironic/+bug/2084852 . worst case, you should be able to use deploy templates to trigger erase_devices_metadata during deployment, instead of as a separate cleaning step. | 17:29 |
Sandzwerg[m] | The issue is that for some of our deployments we don't want to wipe all disks normally as in these cases the bigger disks hold data that ideally shouldn't get deleted during (re)deployments. I think the issue here was that the root device hint at first was wrong so the first deployment went to a wrong disk and since then it didn't really recover. But I will need to look at cleaning again. I'll look into the links thanks. Maybe | 17:33 |
Sandzwerg[m] | deploy templates work for us, need to look them up. | 17:33 |
shermanm[m] | one thing that I did discover on the last go-around, the rebuild action doesn't trigger automated cleaning | 17:36 |
shermanm[m] | not sure off the top of my head if it's possible to blacklist some disks from cleaning | 17:36 |
JayF | TheJulia: in case the vmedia question comes up again: http:// urls (nopearoni, fast error); https:// urls, it actually tries to hit | 17:37 |
JayF | TheJulia: so I think it's requiring https | 17:37 |
TheJulia | oh, nice | 17:39 |
JayF | (it also liked smb:// urls, but I was unable to get one to work) | 17:39 |
TheJulia | cifs:// perhaps? | 17:40 |
JayF | cifs is rejected like http is | 17:40 |
JayF | http/cifs: fast, non-retryable error | 17:40 |
JayF | smb/https: appears to try something (we couldn't get smb:// to ever connect, but we got https:// to connect) | 17:41 |
Sandzwerg[m] | <shermanm[m]> "one thing that I did discover on..." <- hmm somewhere in the back of my head I've heard of that command but never used it. Might be something I need to look into as well. Thanks for the idea. | 17:44 |
Sandzwerg[m] | <shermanm[m]> "not sure off the top of my..." <- I think back then the only way I found was to write our own hardware(device?)manager and I never got around to do that. But the issues we had were more with ports that were created but not at the correct place (project) and then cleaning would fail. | 17:46 |
Sandzwerg[m] | BTW I met Aeva last weekend and she mentioned she worked on ironic back then. I'm sure I should greet you all :) | 17:58 |
JayF | Aeva is wonderful people | 18:04 |
TheJulia | cool cool | 18:27 |
TheJulia | shermanm[m]: yeah, rebuild was originally modeled on just redeploy the node, don't clean the state because in the partition image days we had this preserve_ephemeral context | 18:29 |
* TheJulia twitches | 18:29 | |
* JayF uses rebuild downstream for "get a new OS image but keep the same physical hardware" | 18:31 | |
JayF | no nonsensical preserving of ephemeral here :D | 18:32 |
JayF | like trying to capture steam in a butterfly net; the preserved ephemeral :D | 18:32 |
TheJulia | heh | 18:36 |
TheJulia | Why would anyone want to capture a butterfly?! | 18:36 |
* TheJulia tries to not draw lines and things there | 18:40 | |
Sandzwerg[m] | <JayF> "Aeva is wonderful people" <- I agree, so are you all. I always enjoy talking to you all <3 | 18:50 |
Sandzwerg[m] | <JayF> "uses rebuild downstream for "get..." <- We currently do that with a new deployment but rebuild might be better suited | 18:51 |
TheJulia | Sandzwerg[m]: you'll want a bit more specifics on the root device hint, just in case since you have multiple devices | 19:00 |
Sandzwerg[m] | For us it works as we only have two (or three) disks in a single node. A small(ish) disk for the OS (usually a raid from some onboard raid controller) and one or more bigger disks for data, which all have the same size. The OS disk is usually <1TiB and I think the data disks are usually >2TiB. So it's relatively clear. | 19:03 |
Sandzwerg[m] | Adding other properties might also be hard, because of the raid controller there is no manufacturer reported, and since all disks on new hardware are SSD/NVMe some of the other properties one could use to differentiate are not helpful anymore | 19:06 |
TheJulia | yeah | 19:22 |
TheJulia | that is a common problem with hardware raid controllers | 19:22 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Inspection throws exception on CPU-less systems https://review.opendev.org/c/openstack/ironic/+/949090 | 19:53 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Inspection throws exception on CPU-less systems https://review.opendev.org/c/openstack/ironic/+/949090 | 19:54 |
opendevreview | Julia Kreger proposed openstack/ironic master: Patch configdrive metadata https://review.opendev.org/c/openstack/ironic/+/946677 | 20:21 |
opendevreview | Jay Faulkner proposed openstack/ironic bugfix/26.0: OSSA-2025-001: Disallow unsafe image file:// paths https://review.opendev.org/c/openstack/ironic/+/949186 | 20:52 |
opendevreview | Jay Faulkner proposed openstack/ironic bugfix/26.0: OSSA-2025-001: Disallow unsafe image file:// paths https://review.opendev.org/c/openstack/ironic/+/949186 | 21:03 |
opendevreview | Julia Kreger proposed openstack/ironic master: provide host_id to neutron early on https://review.opendev.org/c/openstack/ironic/+/946378 | 21:06 |
opendevreview | Julia Kreger proposed openstack/ironic master: Patch configdrive metadata https://review.opendev.org/c/openstack/ironic/+/946677 | 21:06 |
opendevreview | Julia Kreger proposed openstack/ironic master: Consider missing MTU invalid metadata https://review.opendev.org/c/openstack/ironic/+/949385 | 21:06 |
JayF | I am at my wits end: https://review.opendev.org/c/openstack/ironic/+/949186 fails the docs job but https://review.opendev.org/c/openstack/ironic/+/949373 passes it. I've not touched *any* of the config in question in the backported patch | 21:22 |
JayF | It almost feels like it's running different code in the docs job for my change | 21:22 |
JayF | Anyone with any ideas, please share them. I'm to the point where Monday I'll get a held node and look at what's going on | 21:31 |
okamitok[m] | Hey everyone, so I did some more digging and one thing I'm not sure about is when the inspection should happen? | 21:48 |
okamitok[m] | I was reading some documentation that says to do a baremetal node inspect before provide to make it available for Nova. | 21:48 |
okamitok[m] | That throws an error about not being supported by ipmi. | 21:48 |
okamitok[m] | But if I do a baremetal introspection start node_id it powers on and removes the ignore from the ports in dnsmasq. | 21:49 |
okamitok[m] | Once it finishes though the ignore gets added back and even doing a server create it stays ignored. | 21:49 |
JayF | I suggest paying less attention to the DHCP config behind the stage :D | 21:50 |
JayF | So inspection, like a lot of things in ironic, can be done different ways | 21:50 |
JayF | in your config file and in the nodes you'll have an inspect_interface referenced | 21:50 |
JayF | for an IPMI node, that'd likely be set to "agent" which means we boot a ramdisk and perform inspection when you call `baremetal node $name inspect` | 21:51 |
JayF | that's the dhcp actions you saw | 21:51 |
JayF | did the node go back into manageable once complete? | 21:51 |
JayF | most of the information you need will be in the node, fields of last_error, provision_state, target_provision_state among others | 21:51 |
JayF | also https://docs.openstack.org/ironic/latest/admin/node-history.html can perhaps give you insight into previous failures if you've not been checking last_error | 21:52 |
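
A minimal sketch of the triage JayF describes above, assuming the node has already been dumped to JSON (for example with `baremetal node show <node> -f json`) and loaded as a dict. The helper name `summarize_inspection` and the sample values are hypothetical; only the field names (`provision_state`, `target_provision_state`, `last_error`) come from the discussion, and the state strings are the ones Ironic uses for inspection.

```python
import json

# Provision states Ironic reports while inspection is still running.
IN_PROGRESS = {"inspecting", "inspect wait"}


def summarize_inspection(node: dict) -> str:
    """Return a one-line summary of where a node is in the inspection flow."""
    state = node.get("provision_state")
    target = node.get("target_provision_state")
    error = node.get("last_error")

    if state == "inspect failed":
        return f"inspection failed: {error or 'no last_error recorded'}"
    if state in IN_PROGRESS:
        return f"inspection in progress (target: {target})"
    if state == "manageable" and not error:
        return "node is manageable with no last_error"
    # Anything else: show the raw fields so the operator can dig further.
    return f"state={state}, target={target}, last_error={error}"


if __name__ == "__main__":
    # Hypothetical sample; real output contains many more fields.
    sample = json.loads(
        '{"provision_state": "inspect failed",'
        ' "target_provision_state": null,'
        ' "last_error": "inspection timed out"}'
    )
    print(summarize_inspection(sample))
```

If the node is back in `manageable` with an empty `last_error`, the remaining step before Nova can schedule to it is `provide`, as discussed above; otherwise the node history page linked above is the next place to look.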
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!