opendevreview | Steve Baker proposed openstack/ironic-python-agent master: Switch to utf-8 for parsing efibootmgr -v https://review.opendev.org/c/openstack/ironic-python-agent/+/902214 | 01:59 |
---|---|---|
stevebaker[m] | TheJulia: alright, here is my take | 02:01 |
stevebaker[m] | oh no, this would need to be backported all the way | 02:03 |
TheJulia | Wheeee :( | 02:16 |
rpittau | good morning ironic! o/ | 07:53 |
opendevreview | Verification of a change to openstack/ironic master failed: Fix *_by_arch documentation and un-deprecate the options without it https://review.opendev.org/c/openstack/ironic/+/901958 | 08:17 |
adam-metal3 | Hi all, this change is now breaking the virtualmedia boot in the Metal3 CI https://review.opendev.org/c/openstack/ironic-python-agent/+/895519/1 I have pasted the actual IPA error here with a link to the job https://paste.openstack.org/show/bZj140OMuzw5ZYtQcEc2/, we are running vmedia in a DHCP enabled env without glean in these tests so proably that is the cause of the confusion | 08:53 |
adam-metal3 | TheJulia: would it be possible to put this "glean restart" logic behind an option? It is quite limiting for any alternative network condfig approach that is not glean + networkd, or glean + NetworkManager with the rhifcfg plugin | 08:55 |
opendevreview | OpenStack Release Bot proposed openstack/ironic-inspector bugfix/11.8: Update .gitreview for bugfix/11.8 https://review.opendev.org/c/openstack/ironic-inspector/+/902244 | 09:12 |
adam-metal3 | or actually it is enough if this wouldn't be active if gelan is not present | 09:12 |
adam-metal3 | I would be happy to push the fix | 09:13 |
opendevreview | OpenStack Release Bot proposed openstack/ironic-python-agent bugfix/9.8: Update .gitreview for bugfix/9.8 https://review.opendev.org/c/openstack/ironic-python-agent/+/902247 | 09:17 |
opendevreview | OpenStack Release Bot proposed openstack/ironic bugfix/23.1: Update .gitreview for bugfix/23.1 https://review.opendev.org/c/openstack/ironic/+/902251 | 09:17 |
rpittau | adam-metal3: let me check quickly, can you please open a bug in launchpad? | 09:37 |
rpittau | trigger_glean_network_refresh should be triggered only if we actually have the backup in place,we gave as assumed that that is always the case | 09:45 |
rpittau | adam-metal3: feel free to propose a fix for that | 09:45 |
dtantsur | adam-metal3, rpittau, TheJulia: that absolutely should not have merged... | 10:02 |
dtantsur | okay, it's morning, and I'm grumpy, and I'm going to call out people | 10:02 |
dtantsur | TheJulia, JayF, stevebaker[m], you've merged a change on the critical path in IPA with 0 unit tests. Please don't do that any more. | 10:03 |
dtantsur | ah, gerrit UI shows me the wrong thing. Apologies then. | 10:04 |
dtantsur | (gerrit loves to insert a changeset number in all links, sigh) | 10:05 |
dtantsur | Yet, I believe this change should be urgently reworked or reverted | 10:12 |
TheJulia | Sigh, revert it with links to logs and hopefully I’ll have something to be able to look at next year even if I have time and spoons | 10:13 |
dtantsur | That's supposed to be waaaay too early for you | 10:13 |
TheJulia | And the gerrit ui really likes showing cached data now… :( | 10:13 |
dtantsur | My biggest beef with that is that it has no checks if glean is even present | 10:14 |
dtantsur | If someone uses cloud-init or a self-written script (I'm aware of at least attempts to do both, the result is a failure) | 10:14 |
TheJulia | Reasonable, just not part of the original scope we as a group were expected | 10:15 |
dtantsur | True. But what we want to support is one thing, what people have is a whole different one.. | 10:16 |
TheJulia | Cloud init itself has to be modified or use non-stock config to do anything but dhcp too | 10:16 |
TheJulia | Indeed | 10:16 |
TheJulia | Only way to find that out sometimes is to break something else | 10:16 |
dtantsur | I'll leave comments on the patch to make re-introducing it easier. But I'm afraid I also don't have spoons to properly fix and test it. | 10:17 |
TheJulia | It took me like six weeks, I t is not a easy place in the path to fix/korify | 10:18 |
TheJulia | Modify | 10:18 |
* TheJulia tries to go back to sleep | 10:19 | |
dtantsur | Yep, get rest still | 10:19 |
adam-metal3 | Okay, so how to go about this now, are you reverting and reworking, or should I just start fixing and opening the bug in launchpad (I was having lunch during the time you have discussed)? | 10:21 |
dtantsur | adam-metal3: I'll revert the change for now, but please do file the bug. | 10:21 |
adam-metal3 | dtantsur: okay sounds great, I will open the bug | 10:22 |
dtantsur | once you have it, I'll propose the revert with Closes-Bug | 10:23 |
adam-metal3 | dtantsur: https://bugs.launchpad.net/ironic-python-agent/+bug/2045255 | 10:32 |
opendevreview | Dmitry Tantsur proposed openstack/ironic-python-agent master: Revert "Fix vmedia network config drive handling" https://review.opendev.org/c/openstack/ironic-python-agent/+/902231 | 10:33 |
dtantsur | adam-metal3, rpittau, TheJulia ^^ | 10:34 |
adam-metal3 | +1 | 10:48 |
iurygregory | good morning Ironic | 11:18 |
dtantsur | adam-metal3, iurygregory, rpittau, another CI blocker: https://github.com/metal3-io/ironic-image/pull/454 | 11:48 |
dtantsur | JayF: I think you mentioned this issue previously ^^ | 11:48 |
iurygregory | I remember something related to this | 11:51 |
dtantsur | I think it went unnoticed because the generated version of ironic-lib matched the version in upper-constraints | 11:52 |
dtantsur | Then we landed something - and boom! | 11:52 |
iurygregory | to many booms this week | 11:53 |
iurygregory | at least tomorrow is friday | 11:53 |
iurygregory | so I just need to survive two more days :D | 11:53 |
dtantsur | yeah, I'm completely overwhelmed (and becoming visibly snappy, which is not great) | 11:54 |
rpittau | dtantsu adam-metal3 +W the revert | 12:22 |
* rpittau goes back to lunch | 12:22 | |
adam-metal3 | rpittau: thanks! | 12:23 |
adam-metal3 | dtantsur, JayF: related to yesterday's discussion about Dmitry's ironic-operator demo https://youtu.be/0ScIghaBUhY?t=1351 | 12:26 |
opendevreview | Dmitry Tantsur proposed openstack/ironic-python-agent-builder master: [PoC] An element to hijack Glean for a configurable configdrive label https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/902291 | 13:17 |
dtantsur | TheJulia: I had something like this in mind ^^^ (absolutely untested) | 13:17 |
opendevreview | Merged openstack/ironic master: Generic API for attaching/detaching virtual media https://review.opendev.org/c/openstack/ironic/+/894918 | 14:02 |
dtantsur | woohooo | 14:04 |
rpittau | \o/ | 14:33 |
dtantsur | iurygregory: fancy an lgtm on https://github.com/metal3-io/ironic-image/pull/453? | 14:50 |
iurygregory | dtantsur, yes sir! | 14:52 |
dtantsur | thx :) | 14:52 |
iurygregory | BMO will probably need some changes I think | 14:52 |
dtantsur | iurygregory: I've checked it briefly, and it seems like it does not use 'idrac' explicitly other than for the driver | 14:53 |
iurygregory | nice! | 14:53 |
dtantsur | If you could take another look, would be great | 14:53 |
dtantsur | (just to double-check me) | 14:53 |
iurygregory | sure, will do | 14:53 |
TheJulia | dtantsur: I need something backportable | 14:56 |
TheJulia | not-backportable, is a total no-go. | 14:56 |
dtantsur | TheJulia: do we follow the stable policy in IPA-builder even? | 14:56 |
TheJulia | rpittau: does metal3-integration not test vmedia? | 14:56 |
dtantsur | also, the change I suggest has a lower risk than what you tried to land | 14:57 |
dtantsur | so, if we backport either, I'd vote for IPA-builder | 14:57 |
TheJulia | Well, to be fair, It landed after working on making sure it works and working to fix other issues in the entire ecosystem of virtual media boot | 14:57 |
dtantsur | TheJulia: re metal3-integarion: due to resource constraints, it's very VERY simple | 14:57 |
TheJulia | so basically, there was no way to detect it before we merged anyway | 14:57 |
TheJulia | sigh | 14:58 |
rpittau | TheJulia: not really, just basic stuff :/ | 14:58 |
TheJulia | what is super weird, is we have glean-less vmedia it was run against. I wonder why that broke | 14:58 |
TheJulia | hell, it successfully ran with cloud-init dhcp'ed vmedia | 14:59 |
dtantsur | TheJulia: ironic will inject stuff into a CD regardless of whether IPA has glean working or at all | 14:59 |
TheJulia | eh, so cloud init might have broken the chain then | 14:59 |
TheJulia | what is the ramdisk which is used in metal3 specifically? | 15:01 |
TheJulia | (Trying to figure out why I didn't see this in testing in our CI) | 15:02 |
dtantsur | TheJulia: well.. let's ask Adam once he's available. Metal3 uses our upstream ramdisks from tarballs.o.o, but I know that at least some CI jobs use something that Ericsson build with IPA-builder. | 15:03 |
TheJulia | yeah, was hoping adam would have stayed around a little longer today | 15:04 |
dtantsur | Regardless, the patch makes some dangerous assumptions and I sorta -1.5 on landing it (and -2 on ever backporting something like that) | 15:04 |
TheJulia | c'est la vie | 15:04 |
dtantsur | You can email him. He should be at work still, probably just left IRC. | 15:05 |
dtantsur | (or use k8s slack) | 15:05 |
rpittau | in ipa-downloader we just use our tarball from upstream | 15:05 |
dtantsur | Yes, but there is this downstream nordix build, I'm not sure which role it plays | 15:05 |
rpittau | I believe the modified ramdisk is used only from Ericsson internally at the moment | 15:06 |
dtantsur | ah, okay | 15:07 |
dtantsur | TheJulia: in addition to assuming files and the presence of glean, you also assume that glean can handle the situation where networking could have been injected by a malicious 3rd party. This is risky IMO. | 15:12 |
dtantsur | Also that nothing critical runs before IPA (like CERN downloading a tarball with plugins). | 15:12 |
opendevreview | Merged openstack/ironic-python-agent master: Revert "Fix vmedia network config drive handling" https://review.opendev.org/c/openstack/ironic-python-agent/+/902231 | 15:14 |
TheJulia | I commented on the change set | 15:16 |
TheJulia | dtantsur: well, the entire idea was also disable glean from ever running upfront too | 15:16 |
TheJulia | so we no longer *need* simple-init at all | 15:16 |
dtantsur | It's not something you can assume for backports though... | 15:17 |
TheJulia | true, which is why it cleaned up the state because it flat out fails with glean today, identifies the attached vmedia, unmounts the folder, and copies the files back for glean to run | 15:17 |
TheJulia | which is something I don't think you groked from your comment | 15:18 |
TheJulia | That being said, glean just not working today in the double cases was also involved with Centos9 cloud images including cloud-init by default | 15:20 |
TheJulia | Glean would try, and then be completely squashed by cloud init trying DHCP | 15:20 |
TheJulia | .... because cloud-init will only look locally if the data source is explicitly ConfigDrive *and* dhcp is turned off | 15:21 |
dtantsur | I guess what worried me is the transition of "We provide network_data, Glean is recommended" to "We provide network_data, and you MUST run Glean in the exact way we prescribe". Especially on stable branches. | 15:22 |
TheJulia | I mean that is navigatable | 15:23 |
TheJulia | I broached just reading the data and dropping glean and the push back I got was that we only supported glean | 15:23 |
TheJulia | and just to use it | 15:23 |
TheJulia | Challenge is, out of the gate on a stock path, that just doesn't work either | 15:23 |
TheJulia | Adam likely has a magical line and a patch or soemthing in the Nordix line that fixes that though | 15:24 |
dtantsur | What is preventing it from working? It definitely did work at some point. | 15:24 |
TheJulia | (and I kind of started owrking on fixing it dib outputted images anwyay) | 15:24 |
dtantsur | https://github.com/Nordix/metal3-dev-tools/tree/main/ci/scripts/image_scripts/ipa_builder_elements | 15:24 |
TheJulia | dtantsur: cloud-init, in initial phases now always triggers DHCP, even *if* you have the data source configured as ConfigDrive | 15:25 |
dtantsur | yeah, I guess assuming that cloud-init is kicked out | 15:25 |
TheJulia | except, glean always also ran before cloud-init based upon networking defaults | 15:26 |
TheJulia | so glean would run far in advance, get things ready | 15:26 |
dtantsur | For the record: I don't suggest running Glean together with cloud-init | 15:26 |
TheJulia | then cloud init would fire up, attempt to do it's thing, stomp upon fallback and wipe out glean's files | 15:26 |
dtantsur | This is a stupid thing to do that we shouldn't spend effort on working around | 15:26 |
TheJulia | oh, yeah, no. Please no, no mixing | 15:26 |
dtantsur | I do recognize that some people may try to hack together something based on cloud-init or something completely downstream | 15:27 |
TheJulia | yeah, which is why I originally just wanted to go "forget these helpers1" | 15:27 |
TheJulia | s/helpers1/helpers!/ | 15:27 |
dtantsur | I still don't quite get the benefits over something like https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/902291?usp=dashboard | 15:28 |
dtantsur | (which is closer to what Nordix is doing AFAIK) | 15:28 |
TheJulia | we have to make something that is opted in, instead of something that is able to manage/recover on older deployments without logic to add a bootloader argument as well | 15:30 |
dtantsur | There is a fall back to /dev/sr0 if that's what you're looking for | 15:30 |
dtantsur | (could be a smarter way to detect a CD, I'm just showing the idea) | 15:31 |
TheJulia | more likely a usb device in real word instead of sr0 | 15:31 |
TheJulia | CI only gives us SATA attached CDs | 15:31 |
TheJulia | uhh, I'm not sure what you mean by fallback to /dev/sr0 though | 15:32 |
dtantsur | https://review.opendev.org/c/openstack/ironic-python-agent-builder/+/902291/1/dib/ironic-ramdisk-network-config/static/usr/local/bin/ipa-glean-early.sh#16 | 15:32 |
dtantsur | which, as you say, should be something slightly smarter, still excluding local disks | 15:33 |
dtantsur | Can be: anything that is not SATA, SCSI, NVME or a partition | 15:33 |
TheJulia | yeah, which is why I tried to solve this in IPA since we had existing guarding logic in IPA to help us hone in and identify the actual device instead of guess | 15:34 |
dtantsur | Yeah, but two of my three concerns here revolve around IPA being simply too late | 15:35 |
TheJulia | in my testing, it just failed to configure the addressing or handle it at all | 15:36 |
TheJulia | and the hope was ultimately, explicitly remove glean as a default invocation moving forward | 15:36 |
TheJulia | so glean becomes a tool we use, we don't fire it up on start at all | 15:36 |
dtantsur | Sure, but IPA is too late anyway. | 15:36 |
dtantsur | I mean, the current IPA even depends on network-online, which won't work any more :) | 15:37 |
TheJulia | well, not if moving forward, glean just doesn't have any configuration to run on boot | 15:37 |
TheJulia | eh, still goes network-online if it fails to actually address anything aiui | 15:37 |
TheJulia | it tried, nothing was there | 15:37 |
dtantsur | okay, so we no longer allow any services that require networking to run before IPA? | 15:38 |
TheJulia | That was kind of my hope, get rid of them all so we can make the decisions upon the information which we have and trigger the appropriate configuration | 15:38 |
dtantsur | nor really any glean alternatives? | 15:38 |
TheJulia | which can be further enhanced as time goes on | 15:39 |
dtantsur | wdym "get rid of them", people may need them? | 15:39 |
dtantsur | Fortunately OpenShift won't be affected, because that's life-critical for us | 15:39 |
TheJulia | from a utility standpoint, my thought was we encourage getting rid of cloud-init and glean from running entirely | 15:39 |
dtantsur | I also used CERN as an example | 15:39 |
JayF | Quite some time ago we sorta made the decision that IPA isn't software, it's the full ramdisk | 15:40 |
TheJulia | they have DHCP in their environment, do they use cloud-init for that? | 15:40 |
JayF | So it's not unreasonable for us to dictate things about the environment IPA runs in, and what can/can't be in the ramdisk | 15:40 |
dtantsur | JayF, TheJulia, I was not included in either "we" that you just used | 15:40 |
JayF | but such things can't be backported | 15:40 |
dtantsur | Like, I'm actively very much NOT one of these we to the extent that my career depends on it. | 15:40 |
JayF | dtantsur: the "we" in that case is strongly implied by code we have changed | 15:40 |
JayF | dtantsur: and I don't mean like, last month, I mean over the course of years | 15:40 |
JayF | I'm going to be honest, after re-reading/reviewing the patch, and reading the discussion this morning | 15:41 |
JayF | this feature just scares the crap out of me | 15:41 |
JayF | I keep trying to convince myself that none of the failure modes are security vulnerabilities so it's probably okay | 15:42 |
dtantsur | People downloading software before IPA or even downloading the IPA itself from the internet is a very-very much a reality. | 15:42 |
JayF | yes, I agree, but I'm not sure I see an architecture where the combo of that + node network data + configdrive support on instances we are cleaning can combine | 15:43 |
dtantsur | At least two major Ironic consumers rely on that. While none will presumably be affected by this discussion, it's a sign that we cannot just assume to know everything about the environment IPA runs in. | 15:43 |
dtantsur | JayF: well, I'm happy to present you with such an architecture. I've even just proposed an IPA-builder side of it. | 15:43 |
dtantsur | Randomize the label name, teach Ironic to provide it, teach (or force) Glean to understand that. | 15:44 |
dtantsur | Boom, done. | 15:44 |
TheJulia | and backport the feature on both sides | 15:44 |
JayF | I thought there was just a whole thing about how this couldn't be glean-specific? | 15:44 |
* TheJulia starts to eyeball vacation early | 15:44 | |
dtantsur | JayF: yep, and this is not. You can still swap in your favorite thing that does ~ the same. | 15:45 |
* JayF notes he is sick and likely won't make it a full 8 hours today, just trying to keep himself productive+distracted until brain runs outta juice | 15:45 | |
TheJulia | JayF: I was just going to suggest soup, and coffee | 15:45 |
TheJulia | but then I realized I have had no coffee today | 15:45 |
dtantsur | TheJulia: feature backports are bad except when they aren't. As long as we don't break the default case, I'm happy to justify backporting a feature that fixes serious security limitations. | 15:46 |
dtantsur | Not that we have never done that... | 15:46 |
JayF | I'd have to see the IPA side to have a stronger opinino on backporting it | 15:46 |
dtantsur | No IPA side in what I propose; an Ironic side though | 15:47 |
JayF | as long as an operator who doesn't care about this feature, even if they're doing a custom ramdisk, has no action, it should be OK though | 15:47 |
TheJulia | I was just trying to do it in a clean non-breaking way requiring agent and ironic to both be upgraded with additional external logic, but I don't see a path forward | 15:47 |
JayF | ^ criterion stays the same tho | 15:47 |
dtantsur | It will need to stay opt-in on stable branches. Kinda what we did with the agent token. | 15:48 |
TheJulia | eh, I don't think we backported that actively | 15:49 |
JayF | not upstream | 15:50 |
TheJulia | also not on our downstream, we just let it flow in | 15:50 |
dtantsur | I'm more referring to the state where we cannot be sure that IPA and Ironic are upgraded together | 15:50 |
opendevreview | Will Szumski proposed openstack/bifrost stable/2023.1: Restore discovery for dnsmasq dhcp provider https://review.opendev.org/c/openstack/bifrost/+/902233 | 15:50 |
TheJulia | oh, we've been having to deal with that reality for far longer than agent token :) | 15:51 |
TheJulia | we had to put a line in the sand and go "no, you must upgrade" for that though | 15:51 |
dtantsur | I think the agent token was quite prominent :) | 15:51 |
TheJulia | lots of warning, and nasty mean errors | 15:51 |
* TheJulia takes pride in that one | 15:51 | |
JayF | we handled it well, but that line did demark the first time when basically master IPA wouldn't work on arbitrarily old Ironic | 15:52 |
TheJulia | we also didn't lock out old agents until after we were outside of the established support window | 15:52 |
TheJulia | (but we definitely did give people the knob! and we had some people us it! | 15:53 |
TheJulia | s/us/use/ | 15:53 |
dtantsur | The case in question is a bit easier though because before TheJulia's patch, DHCP-less is opt-in on the image building level | 15:53 |
TheJulia | it could have continued to be | 15:53 |
dtantsur | yeah, I'm pointing out that the number of potentially broken operators is much smaller than with the agent token | 15:54 |
TheJulia | Then again, I've been working on it for so long I had sort of forgotten where I was at with things | 15:54 |
TheJulia | On a way less stressful topic: How about this one https://review.opendev.org/c/openstack/ironic/+/896570 | 16:05 |
TheJulia | great points dtantsur! | 16:05 |
TheJulia | Did I get the reasoning behind target_power_state correctly? | 16:06 |
dtantsur | 11 October was an eternity ago, did I say something smart? | 16:06 |
TheJulia | yes, you did, you raised two excellent questions | 16:06 |
dtantsur | \o/ | 16:06 |
TheJulia | The second one, If I'm understanding why, I suspect we just need to reach a consensus and document it, or we need to add more complex logic (and maybe the thing here is just fail fast) | 16:07 |
dtantsur | Mmm, upgrade_lock will most likely fail indeed, so mostly irrelevant.. | 16:07 |
TheJulia | The first one, dunno, I guess it *might* be good to turn the cards off, but maybe that could also just be a "presently we don't do this, we've considered it, if you have an opinion, please let us know" | 16:08 |
dtantsur | On that note, I wonder if we need to get locks beforehand. So that we don't end up with something half-powreed-on | 16:08 |
TheJulia | I'm a little worried since it is a fragmented vendor ecosystem to begin with | 16:08 |
TheJulia | dunno, it could also be it could partially power on | 16:08 |
dtantsur | my worry there is less about what the vendors do, more about what ironic will try to do | 16:09 |
TheJulia | some of these devices have like a "we're only going to run a single core until we're in full power mode" logic path | 16:09 |
dtantsur | will we document that you must power off child nodes manually before powering off the parent one? | 16:09 |
dtantsur | will it work in any of our automated processes? | 16:09 |
TheJulia | I think we should document it, and just note, you may create pain if you do not | 16:09 |
TheJulia | and leave the door to automating it | 16:10 |
TheJulia | s/door/door open/ | 16:10 |
dtantsur | E.g. when we reboot the parent node during deployment, how will the child nodes react? What will Ironic do about it? | 16:10 |
TheJulia | with our power off, power on, they will loose power completely and it will be restored | 16:11 |
TheJulia | so they too, will fresh boot | 16:11 |
TheJulia | unless they have an external PSU | 16:11 |
TheJulia | if you trigger the CPU interrupt to reboot, the cards will ignore it | 16:12 |
dtantsur | Or, say, a node gets deprovisioned. It has DPUs running. Ironic powers it off. Children nodes suddenly go off too. Ironic says "hmm, this is wrong" and powers everything on again. | 16:12 |
TheJulia | yeah, that is the sort of case this creates | 16:12 |
TheJulia | and one where we are kind of in an impossible situation without documenting $something | 16:13 |
dtantsur | So, will we expect the users do power DPUs off themselves before deprovisioning? Will they be able to do it if they don't have direct access to Ironic? | 16:13 |
dtantsur | Maybe we're smelling something like node.is_power_state_inherited | 16:15 |
TheJulia | yeah, they won't power them off | 16:15 |
TheJulia | but we may need them online to unprovision the node at all | 16:16 |
TheJulia | since all networking may pass through | 16:16 |
dtantsur | *nod* | 16:16 |
dtantsur | it all sounds to me like maybe we should be more prescriptive in the beginning | 16:16 |
TheJulia | hmm | 16:16 |
TheJulia | yeah | 16:16 |
TheJulia | "Here is what is going to happen, if you don't have an external power supply" | 16:17 |
TheJulia | I guess, there are a list of states where it makes sense to power the host on *if* it is off | 16:17 |
dtantsur | It really sounds to me like has_external_power should be an option in driver_info (or on the node, which I like better) | 16:17 |
TheJulia | we should only make such a change *in* those states, outside of that, "nope" | 16:17 |
TheJulia | I think the external power was proposed as a driver_info field | 16:18 |
TheJulia | if we add it to node, the snmp operators are going to demand to use it :) | 16:18 |
dtantsur | we can consider their demand in the due time ;) | 16:18 |
opendevreview | Julia Kreger proposed openstack/ironic master: Redfish UefiHttp boot support https://review.opendev.org/c/openstack/ironic/+/900964 | 16:18 |
opendevreview | Julia Kreger proposed openstack/ironic master: Add HTTP versions of network boot interfaces https://review.opendev.org/c/openstack/ironic/+/900965 | 16:19 |
TheJulia | rebases due to merge conflict on ironic/common/boot_devices.py | 16:19 |
TheJulia | enjoy! | 16:19 |
TheJulia | well, demand, exepct it just does things | 16:20 |
* dtantsur just wants to boast that metal3 is on the verge of making Inspector optional https://github.com/metal3-io/ironic-image/pull/443 | 16:20 | |
TheJulia | \o/ | 16:20 |
dtantsur | unfortunately, the rest of the inspector work is stuck for the time being.. | 16:22 |
TheJulia | okay, so the thing really needed is if I'm powering off a host which has DPUs, we should likely just turn off child nodes first. Which is fine. Internally the vendors are each working out the mechanism to figure out when to let firmware *start* to proceed with booting once power is activated, and it looks like it will be vendor specific-ish until DMTF standardizes on the ?MCT? communication path | 16:23 |
TheJulia | because today, we don't have bi-directional comms between the BMC and the Add-in Management Controller (the bmc on the card) | 16:24 |
TheJulia | so, docs saying "if you do this, this other thing will happen, and the thing we need to head off is just re-powering the host up | 16:26 |
TheJulia | So we can just take the same power state requested, and roll across child devices | 16:27 |
TheJulia | that sort of makes us opinionated, resets their state | 16:27 |
TheJulia | and if someone comes along and tries to power one on, the host will, indeed, return to life too | 16:28 |
TheJulia | which is just a thing we need to document. | 16:28 |
dtantsur | yeah | 16:28 |
TheJulia | That seems, reasonable sans vendor standardization | 16:28 |
TheJulia | because the AMC's can't call home, yet. | 16:28 |
TheJulia | s/home/home to the chassis bmc and say, give me power!" | 16:28 |
* dtantsur has no idea who AMC is :) | 16:29 | |
TheJulia | Add-in Management Controller | 16:29 |
TheJulia | The next generation of hosts will have multiple BMCs | 16:29 |
dtantsur | Bright future | 16:29 |
TheJulia | So, about opening a distillery! | 16:29 |
dtantsur | now we're talking business | 16:30 |
TheJulia | sigh, I think this might be contrary to the inline docs in the plugin: 2023-11-29 17:35:01.962851 | controller | {0} setUpClass (ironic_tempest_plugin.tests.scenario.ironic_standalone.test_basic_ops.BaremetalIPXEBootTestClass) ... SKIPPED: The driver: None used in test is not in the list of enabled_drivers [] or enabled_hardware_types ['ipmi'] in the tempest config. | 16:37 |
dtantsur | Our tempest tests are probably easier to rewrite than to fix. Just saying. | 16:38 |
TheJulia | (actually, that is sort of what this attempt is to do) | 16:38 |
TheJulia | oh, it is the other fields | 16:39 |
TheJulia | easy enough to bypass *grin* | 16:39 |
TheJulia | adam-metal3: o/ what OS are you running the downstream metal3 IPA images with? | 16:43 |
adam-metal3 | o/ centos 9 stream, in the metal3 community, even more downstream we have SLES also | 16:44 |
TheJulia | weird | 16:45 |
TheJulia | I wonder if there is some special casing we ended up with how the ipa image gets built. I didn't look at all of the elements dtantsur linked me to | 16:45 |
dtantsur | the official metal3 one is the one we build | 16:46 |
adam-metal3 | yes | 16:46 |
adam-metal3 | that is the same in the XI | 16:46 |
adam-metal3 | CI | 16:46 |
adam-metal3 | I have a pipeline that builds our flavor of centos 9 stream IPA but that we don't use for our regular tests | 16:46 |
dtantsur | adam-metal3: the IOError failure, which image did it happen with? | 16:47 |
TheJulia | adam-metal3: is ipa coming from a package in that pipeline, or source? | 16:47 |
adam-metal3 | it comes from whatever is pulled from upstream by IPA downloader | 16:48 |
rpittau | which is our tarball :) | 16:48 |
adam-metal3 | yes | 16:48 |
adam-metal3 | https://tarballs.opendev.org/openstack/ironic-python-agent/dib/ | 16:48 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Test multiple boot interfaces as part of one CI job https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/902171 | 16:48 |
TheJulia | Interesting, oh well | 16:49 |
dtantsur | I think I know where you're coming from. The glean code only handles config-2 which is the label we use when we add network_data. | 16:49 |
dtantsur | The IPA side you modified also handles vmedia_iso or something like that, which is our normal label. | 16:49 |
dtantsur | This is how you ended up in the Glean code without having Glean or network_data. | 16:49 |
TheJulia | yup, and why I didn't hit it in our local CI | 16:50 |
TheJulia | there is likely a couple other things involved there, but all water under the bridge as it were | 16:51 |
dtantsur | What worried me is why our CI does not catch it.. | 16:51 |
TheJulia | we had the target folder in our CI jobs with resulting images | 16:52 |
dtantsur | We don't have any jobs with DIB any more, do we? | 16:52 |
TheJulia | but we build entirely from source | 16:52 |
TheJulia | dtantsur: I developed the change with one, so dunno | 16:53 |
rpittau | mmm in ipa? we do have dib as base ramdisk type | 16:53 |
dtantsur | I'm trying to understand why only metal3 caught the failure | 16:53 |
TheJulia | it is, if we run on rax, we may change that to tinyipa otherwise we end up spending 10 minutes uncompressing the initrd | 16:54 |
dtantsur | Yeah, AND we probably don't use virtual media in the IPA gate | 16:54 |
rpittau | oh right | 16:54 |
TheJulia | but the change I had explicitly force us to use dib for that change | 16:54 |
TheJulia | I think the change I had was on ironic since that was where the ipa jobs were | 16:54 |
rpittau | we havwe ipa-tempest-uefi-redfish-vmedia-src | 16:55 |
rpittau | that should be virtualmedia | 16:55 |
dtantsur | ** WARNING ** - DIB based IPA images have been defined, however we are running devstack on an environment which does not support nested VMs. Due to virtualization constraints, we are automatically falling back to TinyIPA to ensure CI job passage. | 16:55 |
rpittau | hehhh | 16:56 |
dtantsur | Maybe we shouldn't "ensure CI job passage" at the expense of regressions... | 16:56 |
dtantsur | But yeah, this is why we missed it. The redfish job used tinyIPA. | 16:56 |
rpittau | yep | 16:56 |
TheJulia | yeah | 16:57 |
TheJulia | ripping it out means re-tuning all job timing for running on rax all the time. | 16:58 |
JayF | Can we setup a job like that, which requires DIB in order to perform it's test, to never run on rax? | 16:59 |
dtantsur | ++ | 16:59 |
rpittau | that would be ideal | 17:00 |
TheJulia | that is entirely doable | 17:00 |
TheJulia | well, maybe not the very last part | 17:00 |
TheJulia | I'm not sure if we can exclude an entire provider | 17:00 |
dtantsur | A question for infra folks probably.. | 17:02 |
* dtantsur has a feeling that has been discussed in the past | 17:02 | |
*** tkajinam is now known as Guest8666 | 17:02 | |
opendevreview | Julia Kreger proposed openstack/ironic master: Change snmp job to not use a focal node https://review.opendev.org/c/openstack/ironic/+/893824 | 17:04 |
TheJulia | dtantsur: likely, and gets lost on a side commentary | 17:05 |
rpittau | good night! o/ | 17:15 |
JayF | dtantsur: our docs say, for Ironic, to not use the inspector-inside-ironic support yet, right? | 17:19 |
JayF | until it's done? | 17:19 |
* JayF thinks he's confused about something | 17:20 | |
JayF | on the surface I was wondering if https://github.com/metal3-io/ironic-image/pull/443 is exposing Ironic-says-dont-use-this-its-not-done-yet functionality in metal3 | 17:21 |
JayF | but now I realize that may not be the case | 17:21 |
dtantsur | Let's say, we like to live on the bleeding edge :D | 17:21 |
dtantsur | Then again, my intention is to expose the functionality in question in this cycle | 17:21 |
dtantsur | https://review.opendev.org/c/openstack/ironic/+/898237 starts adding docs, for instance, and is awaiting reviews ;) | 17:22 |
JayF | I was confused by the statement earlier about the inspector work not getting finished this cycle | 17:22 |
JayF | which I thought was the point we were going to flip that swithc | 17:22 |
JayF | I'm slightly worried if we get inspector-in-ironic "good enough" for metal3, it'll stay in limbo ~forever | 17:23 |
dtantsur | I think we'll have to stabilize it with only a subset of features migrated :( | 17:23 |
dtantsur | That does not depend on what metal3 does, it depends on having people to work on it.. | 17:23 |
JayF | Makes sense. I do prefer metal3 operates on the what-we-consider-public API of Ironic though in general | 17:27 |
JayF | but it sounds like we're just sorta landing them togetherish | 17:28 |
dtantsur | Yep. The default is not switched yet. | 17:28 |
dtantsur | I need it first and foremost for ironic-operator, which itself is a PoC | 17:29 |
opendevreview | Julia Kreger proposed openstack/ironic master: DNM: CI test for httpboot jobs https://review.opendev.org/c/openstack/ironic/+/901182 | 17:29 |
JayF | ++ | 17:29 |
dtantsur | But yeah. I want to finish the docs. We got all processing hooks. So it's a decent feature set already, we can tell people to use that. | 17:30 |
dtantsur | Maybe someone from my OSP counterparts can figure out the PXE filter ;) That will give us what 90% of people using inspection need. | 17:32 |
dtantsur | Inspection rules can be left as an exercise for the next internship or a new hire. | 17:32 |
JayF | Inspection rules is a specific place I've heard as negative feedback for Ironic recently from a couple of folks | 17:32 |
JayF | (I don't think it's related to this change, just a note) | 17:33 |
dtantsur | Interesting, any details? | 17:33 |
* dtantsur is reading the Itamar's eventlet email with a great interest in the meantime | 17:35 | |
JayF | "the documentation seems lacking" was the substance I had from that conversation; I'm trying to help get this person into IRC | 17:35 |
JayF | Itamar has been trickling these scary eventlet news bits into my slack nearly daily for weeks lol | 17:35 |
dtantsur | This is feedback we get for every part of OpenStack, always :) | 17:35 |
JayF | oh yeah, apparently wheels are turning on the docs pro from GR side, too | 17:36 |
dtantsur | \o/ | 17:36 |
JayF | they found a person and are just figuring out remaining details | 17:36 |
dtantsur | I'm asking because if the rules themselves suck, we can rewrite them now. | 17:37 |
dtantsur | If it's only about good docs.. yeah, inspector docs are not amazing (and them being separate from ironic is not helping either) | 17:37 |
* dtantsur needs to go, will catch up with you tomorrow | 17:38 | |
JayF | o/ | 17:39 |
TheJulia | zigo: o/ you around? | 17:50 |
opendevreview | Merged openstack/ironic-python-agent bugfix/9.8: Update .gitreview for bugfix/9.8 https://review.opendev.org/c/openstack/ironic-python-agent/+/902247 | 17:51 |
TheJulia | zigo: for when you appear, this question is likely better suited in the #openstack-tc room, but your not there right now, so I figured anywhere might work because I also didn't want to introduce possible noise to the eventlet thread. When you said the ship had sailed, what specifically were you referring to? Or maybe yet what were you thinking that caused you to think that? Just where debian is at on the 3.12 timeline? | 17:53 |
JayF | That's how I read it | 17:54 |
TheJulia | I was thinking so as well, but was wondering if there was more context or if it was coming from someplace else. | 17:54 |
JayF | honestly, given the findings of Itamar, I'm slightly worried about how well things would run at high contention in Py 3.11 | 18:04 |
TheJulia | sort of feels like someone needs to chase down the tests just to see | 18:04 |
TheJulia | or at least, get a broad idea | 18:05 |
TheJulia | Then again, the hypothosis is interesting | 18:05 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Test multiple boot interfaces as part of one CI job https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/902171 | 18:08 |
JayF | Well, I am getting off the computer and going to go find a doctor, and hope they have some kind of medication to help me out | 18:08 |
TheJulia | feel better JayF! | 18:08 |
JayF | ty o/ | 18:08 |
iurygregory | get well JayF o/ | 18:35 |
opendevreview | Merged openstack/ironic master: CI: Remove deprecated devstack method https://review.opendev.org/c/openstack/ironic/+/901211 | 19:25 |
opendevreview | Merged openstack/ironic-inspector bugfix/11.8: Update .gitreview for bugfix/11.8 https://review.opendev.org/c/openstack/ironic-inspector/+/902244 | 19:25 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: WIP: Test multiple boot interfaces as part of one CI job https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/902171 | 19:44 |
opendevreview | Steve Baker proposed openstack/ironic-python-agent master: Switch to utf-8 for parsing efibootmgr -v https://review.opendev.org/c/openstack/ironic-python-agent/+/902214 | 20:15 |
opendevreview | Merged openstack/ironic master: Fix *_by_arch documentation and un-deprecate the options without it https://review.opendev.org/c/openstack/ironic/+/901958 | 20:23 |
jamesdenton_ | hello all - working with idrac-redfish and curious to know if redfish_verify_ca=False is expected to be functional...Conductor appears to ignore the suggestion | 21:35 |
TheJulia | I believe it should be, but have not looked at it specifically | 21:49 |
jamesdenton_ | thanks TheJulia - actually just did a Zed->2023.1 upgrade and something in that fixed it | 21:57 |
*** zbitter is now known as zaneb | 22:23 | |
iurygregory | GREEN \o/ https://review.opendev.org/c/openstack/ironic/+/893824 | 22:45 |
iurygregory | if we don't have other cores around I will +W since is a CI change only | 22:45 |
iurygregory | and we really need to move out of focal in the snmp... | 22:45 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!