15:00:28 <TheJulia> #startmeeting ironic
15:00:28 <opendevmeet> Meeting started Mon Mar 3 15:00:28 2025 UTC and is due to finish in 60 minutes. The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:28 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 <opendevmeet> The meeting name has been set to 'ironic'
15:00:29 <TheJulia> o/
15:00:31 <JayF> o/
15:00:54 <frickler> \o
15:01:13 <TheJulia> Good morning folks, let's see if we have a quorum of contributors this morning.
15:01:55 * TheJulia makes more cocfeeeeee
15:01:57 <TheJulia> coffeeeee
15:02:04 * TheJulia clearly needs more coffeeeeeeeee
15:03:08 <kubajj> o/
15:03:29 <TheJulia> I'm sensing we might not have a quorum for today
15:03:42 <JayF> That's really sad, I had a couple of RFEs I wanted to advance
15:04:07 <JayF> Can we discuss them anyway? One of them is for satoshi's MLH project and we'd love to get feedback if not full approval
15:04:11 <TheJulia> RFEs are not required to be triaged in a meeting :)
15:04:28 <JayF> I thought we approved/needs-specs them in a meeting with quorum, generally
15:04:38 <TheJulia> nope
15:04:40 <JayF> not that it matters that much, but the feedback is an important part anyway
15:04:46 <TheJulia> ++
15:05:11 <TheJulia> Let's do abbreviated reminders, then jump to the RFEs
15:05:13 <TheJulia> sound good?
15:05:15 <JayF> ++
15:05:38 <TheJulia> #info Reminder, please review items on the weekly review dashboard.
15:05:41 <TheJulia> #link v
15:05:44 <TheJulia> #undo
15:05:44 <opendevmeet> Removing item from minutes: #link v
15:05:48 <TheJulia> #link https://tinyurl.com/ironic-weekly-prio-dash
15:06:04 <TheJulia> #info Epoxy release schedule has been posted.
15:06:06 <TheJulia> #link https://releases.openstack.org/epoxy/schedule.html
15:06:24 <TheJulia> #info Flamingo PTG will take place April 7-11, 2025
15:06:30 <TheJulia> #link https://etherpad.opendev.org/p/ironic-ptg-april-2025
15:06:49 <TheJulia> #info We're officially a DPL project!
15:07:13 <TheJulia> JayF: do you know why the ironic-lib topic is a discussion topic?
15:07:26 <JayF> mainly for \o/ purposes
15:07:28 <JayF> it's gone
15:07:38 <TheJulia> Cool cool
15:07:41 <JayF> I think there are a couple of perfunctory patches still remaining
15:07:46 <TheJulia> #topic RFEs
15:07:47 <JayF> but nothing ironic side
15:07:57 <TheJulia> First one: https://bugs.launchpad.net/ironic/+bug/2100556
15:09:26 <JayF> So this proposes a feature for IPA of a ContainerHardwareManager, the idea is to run cleaning steps via container. There's a generic method useful with API-driven flows (needs args), and a proposed configuration mechanism to add steps for automated cleaning usage
15:09:35 <JayF> The next RFE on the list is sorta a cousin to this, if approved
15:09:45 <cid> o/
15:09:45 <opendevreview> Kaifeng Wang proposed openstack/python-ironicclient master: Add sort support for node history https://review.opendev.org/c/openstack/python-ironicclient/+/943183
15:09:50 <TheJulia> And the list of available steps appears to be entirely driven by conductor side configuration?
15:10:07 <JayF> well, we have a big generic step that you can provide args and run arbitrary stuff if you can give args
15:10:14 <JayF> but the available *automated* steps are config driven
15:10:48 <JayF> so you could interface: deploy, step: run_container (made up name/args), args: url: oci://registry/container:tag
15:10:52 <TheJulia> I guess that makes sense
15:11:07 <JayF> the next RFE is spicier and sorta came outta an ask from my downstream
15:11:16 <TheJulia> as long as the available parameters are restricted on the input.
15:11:36 <JayF> the end goal is to be able to change steps in automated cleaning without changing configuration and/or deploying new ramdisk (rfe #2 gets us there)
15:11:47 <TheJulia> and really for the step, it seems like it is just a pass-through to a container
15:11:50 <JayF> TheJulia: I told satoshi that I would suggest we might want to lock the "run any container whatsoever" method behind config
15:12:02 <TheJulia> yeah, reasonable
15:12:09 <TheJulia> I think that is reasonable
15:12:09 <JayF> that, and us not using the infra you just made for images
15:12:18 <JayF> are the only two things I could anticipate being concerns here
15:12:21 <JayF> otherwise it's super straightforward
15:12:31 <TheJulia> go ahead
15:13:00 <JayF> does that mean 2100556 is approved? Unsure what you mean by go ahead
15:14:40 <TheJulia> I don't think it needs a spec, but it is right on that line where it makes sense but seems like a ton to bite off.
15:14:49 <TheJulia> so I would feel fine taking an rfe-approved approach for it
15:15:02 <JayF> FWIW, we already have a PoC in agent of everything but the config :)
15:15:11 <JayF> well, "we" == satoshi
15:15:31 <JayF> so problems running containers in ramdisks (which do exist!) have been worked thru
15:15:39 <JayF> okay so the next RFE is under my name
15:15:40 <JayF> https://bugs.launchpad.net/ironic/+bug/2100545
15:15:55 <JayF> Declarative automated cleaning via runbooks
15:16:20 <JayF> basically I want to add config, in the normal places (conductor .conf + overridable by node), to allow you to specify a runbook to run in lieu of imperative automated cleaning
15:16:47 <TheJulia> ... I guess I would need a better understanding of how we're going to guard an owner from being able to override system defaults as asserted by the overall system-admin
15:17:17 <JayF> ah, so maybe a flag to completely disable this feature if the system admin doesn't trust users?
15:17:26 <TheJulia> I think: overall, a decent idea, however I'm a little concerned about the security implication of being able to override the overall system
15:17:32 <TheJulia> I think that is reasonable
15:18:26 <JayF> the other piece that came in as a requirement, and I marked there
15:18:29 <JayF> but I'm kinda :-| about
15:18:36 <JayF> is making them configurable by resource class
15:18:52 <TheJulia> That actually makes a ton of sense to me if you have specific classes
15:18:58 <JayF> I *think* that's the right place to split them up, and it's what my downstream wants, but it'd be the first config we have afaik that is "by resource class"
15:19:05 <TheJulia> I'm not a fan of dict config fields though
15:19:08 <JayF> (we do have "by cpu arch" settings by the bushel)
15:19:43 <TheJulia> would it make sense to be a yaml file which is consulted by the conductor?
15:19:45 <JayF> I mean, we have the library we have, and I'd rather not re-invent the wheel? In the IPA/ContainerHWM case we actually proposed a separate yaml as it gets complex
15:19:53 <JayF> but that's because we need a list(dict())
15:20:01 <JayF> in this case, in the second rfe, we only need dict()
15:20:12 <JayF> which is 100% supported in oslo config and used in a lot of places in ironic
15:20:23 <TheJulia> fair enough, I guess one of the things I'm wondering is how often any of that config would change
15:20:34 <JayF> lemme put it this way: I'd rather see an oslo.config feature OR full Ironic overhaul allowing *any* of our dict fields to be yaml
15:20:42 <TheJulia> anyhow, add a security knob and I'll be happy as an RFE
15:20:56 <JayF> added a note to comments there about wanting a security knob
15:21:14 <TheJulia> at some point, for complex config which may change, we should just avoid forcing the service to be HUPed upon changes
15:21:15 * JayF proposes node.admin_info /s
15:21:27 <JayF> TheJulia: this could be a mutable config?
15:21:39 <JayF> mutable configs don't need a hup, right?
15:21:53 <TheJulia> mutable configs only take effect once the service is hupped
15:22:08 * TheJulia knows this far too well from changing automated_clean to true locally
15:22:23 <JayF> Are you sure that it's not a lazy-activation thing?
15:22:30 <JayF> that it would've taken effect over time
15:22:30 <TheJulia> 100% sure
15:23:13 <TheJulia> well, 99.95% sure, 0.05% someone might have slipped something in :)
15:23:47 <TheJulia> anyhow, one of the reasons I did the container registry authentication keys as an open file when needed approach was because that file can be regenerated
15:24:19 <TheJulia> while the service is running, and needing to have whatever manages ironic know to hup it upon changes is a burden. It's more a question of frequency of change and if that is not a concern then cool cool
15:24:32 <JayF> I think there might be some meat on this bone for making config better, but I'd prefer we take a "fix it all" approach (at least in ironic if not all of the stack) than introduce inconsistency
15:24:51 <JayF> but adding an optional yaml version of most of our dict configs might be really, really syntactically nice
15:25:13 <cardoe> sorry I'm late.
15:25:18 <TheJulia> I think you're taking my concern for flux a bit further than I was worried about
15:25:43 <JayF> I think it's more that I think your idea is so cool I wanna take it further :D
15:25:50 <JayF> I *hate* our dict config syntax
15:25:59 <cardoe> So I threw something on the PTG that I think is related to this ContainerHardwareManager piece.
15:26:01 <TheJulia> anyhow, just not a fan of dict config values because decoding them is not always the easiest
15:26:19 <TheJulia> yes, I'm with you there entirely
15:26:34 <cardoe> Basically what if we did away with IPA or the deploy drivers having a list of steps in there and instead always created "deploy templates" and used those.
15:26:37 <JayF> cardoe: do you have a link to the ptg pad at hand?
15:26:47 <cardoe> https://etherpad.opendev.org/p/ironic-ptg-april-2025
15:27:18 <TheJulia> It might be fair to do, I'm not sure we've ever *really* seen steps change in practice based upon hardware managers
15:27:34 <TheJulia> but downstream operators might be doing that today and such a change is an operational risk
15:27:43 <JayF> cardoe is one of those downstream operators
15:27:45 <TheJulia> which means, definitely ptg topic
15:27:47 <JayF> whether he knows it or not lol
15:27:52 <TheJulia> JayF: indeed.
15:28:21 <JayF> I am struggling to grasp at the value of plugging in all deploy steps as templates
15:28:26 <JayF> but that's what ptg is for :)
15:29:14 <TheJulia> yup
15:29:26 <TheJulia> So, anything else to discuss this week?
15:29:34 <cardoe> I had a question about the anaconda docs patch.
15:29:50 <cardoe> I threw 2 TODOs they I wanna rip out... https://review.opendev.org/c/openstack/ironic/+/942839
15:30:17 <cardoe> If we should make those changes then I'll create bugs for enhancements. If not, I'll delete them.
15:31:03 <cardoe> dtantsur: Really hoping you can provide feedback on https://review.opendev.org/c/openstack/ironic/+/940333 as well.
15:31:14 <JayF> I'd suggest you check with the only other vocal user of that driver: kubajj and the friends at cern
15:31:32 <JayF> I don't have strong opinions around it other than "please don't break existing users or give them an annoying migration" :D
15:31:33 <TheJulia> for the second one, is it just lacking a default value today?
15:31:34 <cardoe> What do we need to do to unblock our CI? It seems like nothing is passing. It's all different jobs that fail.
15:31:51 <TheJulia> It seems networking is just toast
15:32:01 <TheJulia> and it seems entirely random :(
15:32:37 <cardoe> TheJulia: So if I don't want to ever use the generic ks_template provided by Ironic and require the user to supply a ks_template, that's not allowed.
15:32:39 <JayF> Is there anything we could nail down, like per provider or something?
15:32:48 <JayF> if it's an infra issue, we can maybe point them at it :/
15:33:10 <JayF> otherwise I've been thinking drastic things ... like -nv almost all integration jobs and communicate to cores to enforce that all jobs passed once
15:33:19 <frickler> iiuc it is mostly high load on the whole system
15:33:23 <TheJulia> cardoe: so ironic should have a reasonable default, I'm reading what you're saying as the value must be supplied regardless
15:33:37 <JayF> I mean, if high load on the whole system renders our CI useless, the system is broken for purposes of our CI
15:33:51 <JayF> IDK if that means CI is broken, the system is broken, or "yes"
15:33:58 <TheJulia> Unfortunately, our jobs are io intensive and we've seen this cycle after cycle where when the system is getting crushed our failure rate goes through the roof
15:34:21 <kubajj> cardoe: TODO 1 - we did set it up to load the kickstarts from glance
15:34:23 <TheJulia> cardoe: replied to your first question on the docs review
15:34:38 <cardoe> TheJulia: I'm cool with Ironic having a reasonable default out of the box. But as an operator I cannot set "default_ks_template" to "".
15:34:59 <cardoe> kubajj: yeah you can load it from glance if you set ks_template=glance:// on each image you upload.
15:35:02 <TheJulia> cardoe: and are you saying you need to?
15:36:06 <cardoe> TheJulia: So I want to require all images to specify their own ks_template. The code will check for that because it uses default_ks_template if a specific ks_template isn't set. errr lemme fake code something
15:36:09 <TheJulia> I guess, the call in validation would always expect it to be used, but if there is a *documented* path to avoid its use which works, I could be okay with "if set to an empty value, treat it as None and skip the validation on it"
15:36:34 <cardoe> ks_template = image_info.get("ks_template", CONF.anaconda.default_ks_template)
15:36:52 <TheJulia> okay, sounds good
15:36:57 <cardoe> if not this_exists(ks_template): print("user you did it wrong")
15:37:02 <cardoe> That's how the code works today.
15:37:14 <TheJulia> okay
15:37:23 <cardoe> BUT Ironic fails to start up if CONF.anaconda.default_ks_template isn't a real file.
15:37:32 <cardoe> Because of a check in another spot
15:38:13 <TheJulia> Ahh
15:38:15 <TheJulia> okay
15:38:18 <cardoe> kubajj: would love feedback on https://review.opendev.org/c/openstack/ironic/+/942839
15:38:37 <kubajj> cardoe: will do
15:38:44 <TheJulia> cardoe: so likely okay to separately change that logic since we should ideally not abort startup unless it is a horribly bad issue
15:39:07 <TheJulia> That itself might actually be a bug at this point
15:39:07 <cardoe> So basically if those use cases are valid, I'll make bugs to improve this.
15:39:26 <JayF> sgtm
15:39:38 <JayF> image handling more consistent is good, not crashing on startup is good
15:40:00 <cardoe> okay thanks. Just wanna start landing some of my docs patches rather than leaving them in this terrible WIP state.
15:40:13 <TheJulia> I recently... (like within the last year) did a similar check removal since it no longer made sense
15:40:40 <TheJulia> cool coo
15:40:42 <TheJulia> cool cool
15:41:00 <TheJulia> anything else to discuss other than CI performance sadness and IP networking failing across the FIP
15:41:40 <JayF> I probably won't have time to look at CI today/tomorrow, but if we find a quiet hour at OIF days might be interesting to IRL pair on it
15:42:47 <TheJulia> Yeah, I've looked enough times I stopped digging at failed connectivity failures since they also seem to be highly intermittent
15:43:06 <JayF> adamcarthur5 keeps looking for interesting ways to intersect AI tooling and OpenStack
15:43:12 <JayF> I had an epiphany this weekend to maybe point him at CI logs
15:44:11 <darmach> frickler Thank you!
15:45:26 <TheJulia> JayF: that... might not be a bad idea
15:45:40 <JayF> yep. no need to mechanical turk it as humans
15:45:42 <TheJulia> Anyway, closing meeting in 1 minute if nobody else has anything to discuss
15:45:50 <JayF> combined with AI that doesn't exhaust and can maybe find patterns we can't
15:46:04 <JayF> at this point "bad" AI ideas are maybe better than no ideas at all
15:49:00 <TheJulia> #endmeeting
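
Editor's note on the ks_template discussion (15:34 to 15:39): the fallback cardoe faked in chat, plus the "empty value means None" behavior TheJulia floated, might be sketched roughly as below. This is not Ironic's actual code. The function names `resolve_ks_template` and `validate_ks_template` and the injectable `exists` check are hypothetical, and treating an empty default as "no default" is the proposal from the meeting, not current behavior.

```python
import os
from typing import Callable, Optional


def resolve_ks_template(image_info: dict,
                        default_ks_template: Optional[str]) -> Optional[str]:
    """Prefer the image's own ks_template, else the configured default.

    Assumption: an empty default ("") is treated as None, i.e. the
    operator requires every image to supply its own ks_template.
    """
    return image_info.get("ks_template") or (default_ks_template or None)


def validate_ks_template(image_info: dict,
                         default_ks_template: Optional[str],
                         exists: Callable[[str], bool] = os.path.isfile) -> str:
    """Fail per image at validation time instead of at service startup."""
    ks_template = resolve_ks_template(image_info, default_ks_template)
    if ks_template is None:
        raise ValueError("image has no ks_template and no default is configured")
    if not exists(ks_template):
        raise ValueError("ks_template %r does not exist" % ks_template)
    return ks_template
```

With this shape an unset default only matters for images that lack their own ks_template, which matches the goal of not aborting startup over it.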