15:00:28 <TheJulia> #startmeeting ironic
15:00:28 <opendevmeet> Meeting started Mon Mar  3 15:00:28 2025 UTC and is due to finish in 60 minutes.  The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 <opendevmeet> The meeting name has been set to 'ironic'
15:00:29 <TheJulia> o/
15:00:31 <JayF> o/
15:00:54 <frickler> \o
15:01:13 <TheJulia> Good morning folks, let's see if we have a quorum of contributors this morning.
15:01:55 * TheJulia makes more cocfeeeeee
15:01:57 <TheJulia> coffeeeee
15:02:04 * TheJulia clearly needs more coffeeeeeeeee
15:03:08 <kubajj> o/
15:03:29 <TheJulia> I'm sensing we might not have a quorum for today
15:03:42 <JayF> That's really sad, I had a couple of RFEs I wanted to advance
15:04:07 <JayF> Can we discuss them anyway? One of them is for satoshi's MLH project and we'd love to get feedback if not full approval
15:04:11 <TheJulia> RFEs are not required to be triaged in a meeting :)
15:04:28 <JayF> I thought we approved/needs-specs them in a meeting with quorum, generally
15:04:38 <TheJulia> nope
15:04:40 <JayF> not that it matters that much, but the feedback is an important part anyway
15:04:46 <TheJulia> ++
15:05:11 <TheJulia> Let's do abbreviated reminders, then jump to the RFEs
15:05:13 <TheJulia> sound good?
15:05:15 <JayF> ++
15:05:38 <TheJulia> #info Reminder, please review items on the weekly review dashboard.
15:05:41 <TheJulia> #link v
15:05:44 <TheJulia> #undo
15:05:44 <opendevmeet> Removing item from minutes: #link v
15:05:48 <TheJulia> #link https://tinyurl.com/ironic-weekly-prio-dash
15:06:04 <TheJulia> #info Epoxy release schedule has been posted.
15:06:06 <TheJulia> #link https://releases.openstack.org/epoxy/schedule.html
15:06:24 <TheJulia> #info Flamingo PTG will take place April 7-11, 2025
15:06:30 <TheJulia> #link https://etherpad.opendev.org/p/ironic-ptg-april-2025
15:06:49 <TheJulia> #info We're officially a DPL project!
15:07:13 <TheJulia> JayF: do you know why the ironic-lib topic is a discussion topic?
15:07:26 <JayF> mainly for \o/ purposes
15:07:28 <JayF> it's gone
15:07:38 <TheJulia> Cool cool
15:07:41 <JayF> I think there are a couple of perfunctory patches still remaining
15:07:46 <TheJulia> #topic RFEs
15:07:47 <JayF> but nothing ironic side
15:07:57 <TheJulia> First one: https://bugs.launchpad.net/ironic/+bug/2100556
15:09:26 <JayF> So this proposes a feature for IPA of a ContainerHardwareManager, the idea is to run cleaning steps via container. There's a generic method useful with API-driven flows (needs args), and a proposed configuration mechanism to add steps for automated cleaning usage
15:09:35 <JayF> The next RFE on the list is sorta a cousin to this, if approved
15:09:45 <cid> o/
15:09:45 <opendevreview> Kaifeng Wang proposed openstack/python-ironicclient master: Add sort support for node history  https://review.opendev.org/c/openstack/python-ironicclient/+/943183
15:09:50 <TheJulia> And the list of available steps appears to be entirely driven by conductor side configuration?
15:10:07 <JayF> well, we have a big generic step that you can provide args and run arbitrary stuff if you can give args
15:10:14 <JayF> but the available *automated* steps are config driven
15:10:48 <JayF> so you could interface: deploy, step: run_container (made up name/args), args: url: oci://registry/container:tag
15:10:52 <TheJulia> I guess that makes sense
15:11:07 <JayF> the next RFE is spicier and sorta came outta an ask from my downstream
15:11:16 <TheJulia> as long as the available parameters are restricted on the input.
15:11:36 <JayF> the end goal is to be able to change steps in automated cleaning without changing configuration and/or deploying new ramdisk (rfe #2 gets us there)
15:11:47 <TheJulia> and really for the step, it seems like it is just a pass-through to a container
15:11:50 <JayF> TheJulia: I told satoshi that I would suggest we might want to lock the "run any container whatsoever" method behind config
15:12:02 <TheJulia> yeah, reasonable
15:12:09 <TheJulia> I think that is reasonable
15:12:09 <JayF> that, and us not using the infra you just made for images
15:12:18 <JayF> are the only two things I could anticipate being concerns here
15:12:21 <JayF> otherwise it's super straightforward
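A minimal sketch of the two guards raised in this exchange: restricting the accepted parameters on input, and locking the "run any container" method behind config. The step name `run_container` is the made-up name JayF used above; `ALLOWED_ARG_NAMES`, `validate_container_step`, and the `allow_arbitrary_containers` flag are hypothetical names for illustration, not part of IPA or Ironic.

```python
# Hypothetical sketch only: none of these names exist in IPA or Ironic.
ALLOWED_ARG_NAMES = frozenset({"url"})


def validate_container_step(step, allow_arbitrary_containers=False):
    """Return the step unchanged if it passes both guards, else raise."""
    # Guard 1: the generic method is disabled unless config enables it.
    if not allow_arbitrary_containers:
        raise ValueError("arbitrary container steps are disabled by config")

    # Guard 2: only a fixed set of argument names is accepted.
    args = step.get("args", {})
    unknown = set(args) - ALLOWED_ARG_NAMES
    if unknown:
        raise ValueError("unsupported args: %s" % ", ".join(sorted(unknown)))

    # Assumption: container references use the oci:// scheme, as in the
    # example URL given in the meeting.
    if not args.get("url", "").startswith("oci://"):
        raise ValueError("url must use the oci:// scheme")
    return step


step = {
    "interface": "deploy",
    "step": "run_container",
    "args": {"url": "oci://registry/container:tag"},
}
validated = validate_container_step(step, allow_arbitrary_containers=True)
```

With the flag left at its default of False, the same call raises instead of returning the step, which is the "locked behind config" behavior discussed.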
15:12:31 <TheJulia> go ahead
15:13:00 <JayF> does that mean 2100556 is approved? Unsure what you mean by go ahead
15:14:40 <TheJulia> I don't think it needs a spec, but it is right on that line where it makes sense but seems like a ton to bite off.
15:14:49 <TheJulia> so I would feel fine taking an rfe-approved approach for it
15:15:02 <JayF> FWIW, we already have a PoC in agent of everything but the config :)
15:15:11 <JayF> well, "we" == satoshi
15:15:31 <JayF> so problems running containers in ramdisks (which do exist!) have been worked thru
15:15:39 <JayF> okay so the next RFE is under my name
15:15:40 <JayF> https://bugs.launchpad.net/ironic/+bug/2100545
15:15:55 <JayF> Declarative automated cleaning via runbooks
15:16:20 <JayF> basically I want to add config, in the normal places (conductor .conf + overridable by node), to allow you to specify a runbook to run in lieu of imperative automated cleaning
15:16:47 <TheJulia> ... I guess I would need a better understanding of how we're going to guard an owner from being able to override system defaults as asserted by the overall system-admin
15:17:17 <JayF> ah, so maybe a flag to completely disable this feature if the system admin doesn't trust users?
15:17:26 <TheJulia> I think: overall, a decent idea, however I'm a little concerned about the security implication of being able to override the overall system
15:17:32 <TheJulia> I think that is reasonable
15:18:26 <JayF> the other piece that came in as a requirement, and I marked there
15:18:29 <JayF> but I'm kinda :-| about
15:18:36 <JayF> is making them configurable by resource class
15:18:52 <TheJulia> That actually makes a ton of sense to me if you have specific classes
15:18:58 <JayF> I *think* that's the right place to split them up, and it's what my downstream wants, but it'd be the first config we have afaik that is "by resource class"
15:19:05 <TheJulia> I'm not a fan of dict config fields though
15:19:08 <JayF> (we do have "by cpu arch" settings by the bushel)
15:19:43 <TheJulia> would it make sense to be a yaml file which is consulted by the conductor?
15:19:45 <JayF> I mean, we have the library we have, and I'd rather not re-invent the wheel? In the IPA/ContainerHWM case we actually proposed a separate yaml as it gets complex
15:19:53 <JayF> but that's because we need a list(dict())
15:20:01 <JayF> in this case, in the second rfe, we only need dict()
15:20:12 <JayF> which is 100% supported in oslo config and used in a lot of places in ironic
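A rough sketch of how a "runbook by resource class" dict setting plus the security knob could fit together. oslo.config dict options are written as comma-separated `key:value` pairs; the parser here is a plain-Python stand-in for that syntax, not the oslo.config implementation. The function names, the `cleaning_runbook` node field, the `allow_node_override` flag, and the lookup order (node override, then resource class, then default) are all assumptions for illustration; the RFE does not define them.

```python
# Hypothetical sketch only: names and precedence are assumptions.
def parse_dict_option(raw):
    """Parse 'key1:value1,key2:value2' into a dict."""
    result = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError("expected key:value, got %r" % pair)
        result[key.strip()] = value.strip()
    return result


def pick_cleaning_runbook(node, by_resource_class, default=None,
                          allow_node_override=False):
    """Return the runbook name to use for automated cleaning, or None."""
    # Security knob: a node-level override is honored only when the
    # system admin has explicitly allowed it.
    if allow_node_override:
        override = node.get("driver_info", {}).get("cleaning_runbook")
        if override:
            return override
    # Otherwise use the per-resource-class mapping, then the default.
    return by_resource_class.get(node.get("resource_class"), default)


mapping = parse_dict_option("gpu:rb-gpu,storage:rb-storage")
node = {"resource_class": "gpu",
        "driver_info": {"cleaning_runbook": "rb-owner"}}
chosen = pick_cleaning_runbook(node, mapping)
```

Returning None would mean "no runbook configured", in which case the existing imperative automated cleaning would presumably still apply.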
15:20:23 <TheJulia> fair enough, I guess one of the things I'm wondering is how often any of that config would change
15:20:34 <JayF> lemme put it this way: I'd rather see an oslo.config feature OR full Ironic overhaul allowing *any* of our dict fields to be yaml
15:20:42 <TheJulia> anyhow, add a security knob and I'll be happy as an RFE
15:20:56 <JayF> added a note to comments there about wanting a security knob
15:21:14 <TheJulia> at some point, for complex config which may change, we should just avoid forcing the service to be HUPed upon changes
15:21:15 * JayF proposes node.admin_info /s
15:21:27 <JayF> TheJulia: this could be a mutable config?
15:21:39 <JayF> mutable configs don't need a hup, right?
15:21:53 <TheJulia> mutable configs only take effect once the service is hupped
15:22:08 * TheJulia knows this far too well from changing automated_clean to true locally
15:22:23 <JayF> Are you sure that it's not a lazy-activation thing?
15:22:30 <JayF> that it would've taken effect over time
15:22:30 <TheJulia> 100% sure
15:23:13 <TheJulia> well, 99.95% sure, 0.05% someone might have slipped something in :)
15:23:47 <TheJulia> anyhow, one of the reasons I did the container registry authentication keys as an open file when needed approach was because that file can be regenerated
15:24:19 <TheJulia> while the service is running, and needing to have whatever manages ironic know to hup it upon changes is a burden. It's more a question of frequency of change and if that is not a concern then cool cool
15:24:32 <JayF> I think there might be some meat on this bone for making config better, but I'd prefer we take a "fix it all" approach (at least in ironic if not all of the stack) than introduce inconsistency
15:24:51 <JayF> but adding an optional yaml version of most of our dict configs might be really, really syntactically nice
15:25:13 <cardoe> sorry I'm late.
15:25:18 <TheJulia> I think you're taking my concern for flux a bit further than I was worried about
15:25:43 <JayF> I think it's more that I think your idea is so cool I wanna take it further :D
15:25:50 <JayF> I *hate* our dict config syntax
15:25:59 <cardoe> So I threw something on the PTG that I think is related to this ContainerHardwareManager piece.
15:26:01 <TheJulia> anyhow, just not a fan of dict config values because decoding them is not always the easiest
15:26:19 <TheJulia> yes, I'm with you there entirely
15:26:34 <cardoe> Basically what if we did away with IPA or the deploy drivers having a list of steps in there and instead always created "deploy templates" and used those.
15:26:37 <JayF> cardoe: do you have a link to the ptg pad at hand?
15:26:47 <cardoe> https://etherpad.opendev.org/p/ironic-ptg-april-2025
15:27:18 <TheJulia> It might be fair to do, I'm not sure we've ever *really* seen steps change in practice based upon hardware managers
15:27:34 <TheJulia> but downstream operators might be doing that today and such a change is an operational risk
15:27:43 <JayF> cardoe is one of those downstream operators
15:27:45 <TheJulia> which means, definitely ptg topic
15:27:47 <JayF> whether he knows it or not lol
15:27:52 <TheJulia> JayF: indeed.
15:28:21 <JayF> I am struggling to grasp at the value of plugging in all deploy steps as templates
15:28:26 <JayF> but that's what ptg is for :)
15:29:14 <TheJulia> yup
15:29:26 <TheJulia> So, anything else to discuss this week?
15:29:34 <cardoe> I had a question about the anaconda docs patch.
15:29:50 <cardoe> I threw in 2 TODOs that I wanna rip out... https://review.opendev.org/c/openstack/ironic/+/942839
15:30:17 <cardoe> If we should make those changes then I'll create bugs for enhancements. If not, I'll delete them.
15:31:03 <cardoe> dtantsur: Really hoping you can provide feedback on https://review.opendev.org/c/openstack/ironic/+/940333 as well.
15:31:14 <JayF> I'd suggest you check with the only other vocal user of that driver: kubajj and the friends at cern
15:31:32 <JayF> I don't have strong opinions around it other than "please don't break existing users or give them an annoying migration" :D
15:31:33 <TheJulia> for the second one, is it just lacking a default value today?
15:31:34 <cardoe> What do we need to do to unblock our CI? It seems like nothing is passing. It's all different jobs that fail.
15:31:51 <TheJulia> It seems networking is just toast
15:32:01 <TheJulia> and it seems entirely random :(
15:32:37 <cardoe> TheJulia: So if I don't want to ever use the generic ks_template provided by Ironic and require the user to supply a ks_template, that's not allowed.
15:32:39 <JayF> Is there anything we could nail down, like per provider or something?
15:32:48 <JayF> if it's an infra issue, we can maybe point them at it :/
15:33:10 <JayF> otherwise I've been thinking drastic things ... like -nv almost all integration jobs and communicate to cores to enforce that all jobs passed once
15:33:19 <frickler> iiuc it is mostly high load on the whole system
15:33:23 <TheJulia> cardoe: so ironic should have a reasonable default, I'm reading what you're saying as the value must be supplied regardless
15:33:37 <JayF> I mean, if high load on the whole system renders our CI useless, the system is broken for purposes of our CI
15:33:51 <JayF> IDK if that means CI is broken, the system is broken, or "yes"
15:33:58 <TheJulia> Unfortunately, our jobs are io intensive and we've seen this cycle after cycle where when the system is getting crushed our failure rate goes through the roof
15:34:21 <kubajj> cardoe: TODO 1 - we did set it up to load the kickstarts from glance
15:34:23 <TheJulia> cardoe: replied to your first question on the docs review
15:34:38 <cardoe> TheJulia: I'm cool with Ironic having a reasonable default out of the box. But as an operator I cannot set "default_ks_template" to "".
15:34:59 <cardoe> kubajj: yeah you can load it from glance if you set ks_template=glance:// on each image you upload.
15:35:02 <TheJulia> cardoe: and are you saying you need to?
15:36:06 <cardoe> TheJulia: So I want to require all images to specify their own ks_template. The code will check for that because it uses default_ks_template if a specific ks_template isn't set. errr lemme fake code something
15:36:09 <TheJulia> I guess, the call in validation would always expect it to be used, but if there is a *documented* path to avoid its use which works, I could be okay with "if set to an empty value, treat it as None and skip the validation on it"
15:36:34 <cardoe> ks_template = image_info.get("ks_template", CONF.anaconda.default_ks_template)
15:36:52 <TheJulia> okay, sounds good
15:36:57 <cardoe> if not this_exists(ks_template): print("user you did it wrong")
15:37:02 <cardoe> That's how the code works today.
15:37:14 <TheJulia> okay
15:37:23 <cardoe> BUT Ironic fails to start up if CONF.anaconda.default_ks_template isn't a real file.
15:37:32 <cardoe> Because of a check in another spot
15:38:13 <TheJulia> Ahh
15:38:15 <TheJulia> okay
15:38:18 <cardoe> kubajj: would love feedback on https://review.opendev.org/c/openstack/ironic/+/942839
15:38:37 <kubajj> cardoe: will do
15:38:44 <TheJulia> cardoe: so likely okay to separately change that logic since we should ideally not abort startup unless it is a horribly bad issue
15:39:07 <TheJulia> That itself might actually be a bug at this point
15:39:07 <cardoe> So basically if those use cases are valid, I'll make bugs to improve this.
15:39:26 <JayF> sgtm
15:39:38 <JayF> image handling more consistent is good, not crashing on startup is good
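A minimal sketch of the behavior discussed for the anaconda kickstart template: fall back to the configured default only when one is set, treat an empty default as None, and report a bad default at startup instead of aborting. `resolve_ks_template` and `check_default_at_startup` are hypothetical helpers, not the functions in Ironic; only the `ks_template` and `default_ks_template` names come from the discussion.

```python
# Hypothetical sketch only: helper names are not Ironic's.
import os


def resolve_ks_template(image_info, default_ks_template):
    """Pick the kickstart template for an image, or raise if none applies."""
    # An empty default ("") is treated as "no default configured".
    default = default_ks_template or None
    ks_template = image_info.get("ks_template") or default
    if ks_template is None:
        raise ValueError(
            "image must supply ks_template; no default is configured")
    return ks_template


def check_default_at_startup(default_ks_template):
    """Return a list of warnings rather than aborting service startup."""
    if not default_ks_template:
        # Operator opted out of a default; nothing to validate.
        return []
    if not os.path.isfile(default_ks_template):
        return ["default_ks_template %r is not a file" % default_ks_template]
    return []


per_image = resolve_ks_template({"ks_template": "glance://abc"}, "")
startup_warnings = check_default_at_startup("")
```

Under this sketch an operator who sets the default to an empty value gets a per-image error only for images that omit `ks_template`, and the conductor still starts.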
15:40:00 <cardoe> okay thanks. Just wanna start landing some of my docs patches rather than leaving them in this terrible WIP state.
15:40:13 <TheJulia> I recently... (like within the last year) did a similar check removal since it no longer made sense
15:40:40 <TheJulia> cool coo
15:40:42 <TheJulia> cool cool
15:41:00 <TheJulia> anything else to discuss other than CI performance sadness and IP networking failing across the FIP
15:41:40 <JayF> I probably won't have time to look at CI today/tomorrow, but if we find a quiet hour at OIF days might be interesting to IRL pair on it
15:42:47 <TheJulia> Yeah, I've looked enough times I stopped digging at connectivity failures since they also seem to be highly intermittent
15:43:06 <JayF> adamcarthur5 keeps looking for interesting ways to intersect AI tooling and OpenStack
15:43:12 <JayF> I had an epiphany this weekend to maybe point him at CI logs
15:44:11 <darmach> frickler Thank you!
15:45:26 <TheJulia> JayF: that... might not be a bad idea
15:45:40 <JayF> yep. no need to mechanical turk it as humans
15:45:42 <TheJulia> Anyway, closing meeting in 1 minute if nobody else has anything to discuss
15:45:50 <JayF> combined with AI that doesn't exhaust and can maybe find patterns we can't
15:46:04 <JayF> at this point "bad" AI ideas are maybe better than no ideas at all
15:49:00 <TheJulia> #endmeeting