15:00:28 #startmeeting ironic
15:00:28 Meeting started Mon Mar 3 15:00:28 2025 UTC and is due to finish in 60 minutes. The chair is TheJulia. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:28 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:28 The meeting name has been set to 'ironic'
15:00:29 o/
15:00:31 o/
15:00:54 \o
15:01:13 Good morning folks, let's see if we have a quorum of contributors this morning.
15:01:55 * TheJulia makes more cocfeeeeee
15:01:57 coffeeeee
15:02:04 * TheJulia clearly needs more coffeeeeeeeee
15:03:08 o/
15:03:29 I'm sensing we might not have a quorum for today
15:03:42 That's really sad, I had a couple of RFEs I wanted to advance
15:04:07 Can we discuss them anyway? One of them is for satoshi's MLH project and we'd love to get feedback if not full approval
15:04:11 RFEs are not required to be triaged in a meeting :)
15:04:28 I thought we approved/needs-specs them in a meeting with quorum, generally
15:04:38 nope
15:04:40 not that it matters that much, but the feedback is an important part anyway
15:04:46 ++
15:05:11 Let's do abbreviated reminders, then jump to the RFEs
15:05:13 sound good?
15:05:15 ++
15:05:38 #info Reminder, please review items on the weekly review dashboard.
15:05:41 #link v
15:05:44 #undo
15:05:44 Removing item from minutes: #link v
15:05:48 #link https://tinyurl.com/ironic-weekly-prio-dash
15:06:04 #info Epoxy release schedule has been posted.
15:06:06 #link https://releases.openstack.org/epoxy/schedule.html
15:06:24 #info Flamingo PTG will take place April 7-11, 2025
15:06:30 #link https://etherpad.opendev.org/p/ironic-ptg-april-2025
15:06:49 #info We're officially a DPL project!
15:07:13 JayF: do you know why the ironic-lib topic is a discussion topic?
15:07:26 mainly for \o/ purposes
15:07:28 it's gone
15:07:38 Cool cool
15:07:41 I think there are a couple of perfunctory patches still remaining
15:07:46 #topic RFEs
15:07:47 but nothing ironic side
15:07:57 First one: https://bugs.launchpad.net/ironic/+bug/2100556
15:09:26 So this proposes a feature for IPA of a ContainerHardwareManager; the idea is to run cleaning steps via container. There's a generic method useful with API-driven flows (needs args), and a proposed configuration mechanism to add steps for automated cleaning usage
15:09:35 The next RFE on the list is sorta a cousin to this, if approved
15:09:45 o/
15:09:45 Kaifeng Wang proposed openstack/python-ironicclient master: Add sort support for node history https://review.opendev.org/c/openstack/python-ironicclient/+/943183
15:09:50 And the list of available steps appears to be entirely driven by conductor side configuration?
15:10:07 well, we have a big generic step that you can provide args and run arbitrary stuff if you can give args
15:10:14 but the available *automated* steps are config driven
15:10:48 so you could interface: deploy, step: run_container (made up name/args), args: url: oci://registry/container:tag
15:10:52 I guess that makes sense
15:11:07 the next RFE is spicier and sorta came outta an ask from my downstream
15:11:16 as long as the available parameters are restricted on the input.
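The generic run-a-container step discussed above could look roughly like the sketch below. This is a hypothetical illustration of the RFE, not actual ironic-python-agent code: the `build_command`/`run_container` names, the podman invocation, and the `allow_arbitrary_containers` knob (the "lock it behind config" idea raised later in the meeting) are all assumptions.

```python
import subprocess

class ContainerHardwareManager:
    """Hypothetical hardware manager that runs clean/deploy steps by
    launching an operator-supplied container image on the ramdisk."""

    # Security knob discussed in the meeting: lock the "run any container
    # whatsoever" step behind explicit agent configuration.
    allow_arbitrary_containers = False

    def build_command(self, url, command=None, runner="podman"):
        """Build the container invocation for an arbitrary image URL."""
        cmd = [runner, "run", "--rm", url]
        if command:
            cmd.extend(command)
        return cmd

    def run_container(self, url, command=None):
        """Generic step: pull and run an arbitrary container image."""
        if not self.allow_arbitrary_containers:
            raise PermissionError(
                "run_container is disabled; enable it in the agent config")
        return subprocess.run(self.build_command(url, command), check=True)
```

With the knob off (the default in this sketch), the generic step refuses to run, which matches the concern about restricting what input the step will accept.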
15:11:36 the end goal is to be able to change steps in automated cleaning without changing configuration and/or deploying a new ramdisk (rfe #2 gets us there)
15:11:47 and really for the step, it seems like it is just a pass-through to a container
15:11:50 TheJulia: I told satoshi that I would suggest we might want to lock the "run any container whatsoever" method behind config
15:12:02 yeah, reasonable
15:12:09 I think that is reasonable
15:12:09 that, and us not using the infra you just made for images
15:12:18 are the only two things I could anticipate being concerns here
15:12:21 otherwise it's super straightforward
15:12:31 go ahead
15:13:00 does that mean 2100556 is approved? Unsure what you mean by go ahead
15:14:40 I don't think it needs a spec, but it is right on that line where it makes sense but seems like a ton to bite off.
15:14:49 so I would feel fine taking an rfe-approved approach for it
15:15:02 FWIW, we already have a PoC in agent of everything but the config :)
15:15:11 well, "we" == satoshi
15:15:31 so problems running containers in ramdisks (which do exist!) have been worked thru
15:15:39 okay so the next RFE is under my name
15:15:40 https://bugs.launchpad.net/ironic/+bug/2100545
15:15:55 Declarative automated cleaning via runbooks
15:16:20 basically I want to add config, in the normal places (conductor .conf + overridable by node), to allow you to specify a runbook to run in lieu of imperative automated cleaning
15:16:47 ... I guess I would need a better understanding of how we're going to guard an owner from being able to override system defaults as asserted by the overall system-admin
15:17:17 ah, so maybe a flag to completely disable this feature if the system admin doesn't trust users?
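The "declarative automated cleaning via runbooks" idea above, with the disable-flag just proposed, might resolve roughly as follows. Every name here (`enable_runbook_cleaning`, `automated_clean_runbook`, `default_clean_runbook`, the precedence order) is an assumption for illustration, not Ironic's actual configuration surface.

```python
# Sketch of RFE 2100545: pick a runbook for automated cleaning, with the
# security knob from the discussion. Names are hypothetical.

def resolve_clean_runbook(node, conf):
    """Return the runbook to run for automated cleaning, or None to
    fall back to the classic imperative clean-step flow."""
    # Security knob: if the system admin doesn't trust node owners,
    # the whole feature is off and node-level overrides are ignored.
    if not conf.get("enable_runbook_cleaning", False):
        return None
    # Node-level override wins over the conductor-wide default.
    return (node.get("automated_clean_runbook")
            or conf.get("default_clean_runbook"))
```

The point of the knob is that an owner-settable field never takes effect unless the operator opted in conductor-side.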
15:17:26 I think: overall, a decent idea, however I'm a little concerned about the security implication of being able to override the overall system
15:17:32 I think that is reasonable
15:18:26 the other piece that came in as a requirement, and I marked there
15:18:29 but I'm kinda :-| about
15:18:36 is making them configurable by resource class
15:18:52 That actually makes a ton of sense to me if you have specific classes
15:18:58 I *think* that's the right place to split them up, and it's what my downstream wants, but it'd be the first config we have afaik that is "by resource class"
15:19:05 I'm not a fan of dict config fields though
15:19:08 (we do have "by cpu arch" settings by the bushel)
15:19:43 would it make sense to be a yaml file which is consulted by the conductor?
15:19:45 I mean, we have the library we have, and I'd rather not re-invent the wheel? In the IPA/ContainerHWM case we actually proposed a separate yaml as it gets complex
15:19:53 but that's because we need a list(dict())
15:20:01 in this case, in the second rfe, we only need dict()
15:20:12 which is 100% supported in oslo config and used in a lot of places in ironic
15:20:23 fair enough, I guess one of the things I'm wondering is how often any of that config would change
15:20:34 lemme put it this way: I'd rather see an oslo.config feature OR a full Ironic overhaul allowing *any* of our dict fields to be yaml
15:20:42 anyhow, add a security knob and I'll be happy as an RFE
15:20:56 added a note to the comments there about wanting a security knob
15:21:14 at some point, for complex config which may change, we should just avoid forcing the service to be HUPed upon changes
15:21:15 * JayF proposes node.admin_info /s
15:21:27 TheJulia: this could be a mutable config?
15:21:39 mutable configs don't need a hup, right?
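For context on the dict-config complaint above: oslo.config's dict-valued options use a flat `key1:value1,key2:value2` string syntax, so a by-resource-class mapping would be written on one line of ironic.conf. The option name below is made up; the parser is a minimal stand-in for illustrating the syntax, not oslo.config itself.

```python
# Hypothetical ironic.conf line using oslo.config's dict syntax:
#
#   [conductor]
#   clean_runbook_by_resource_class = gpu-node:CLEAN_GPU,storage:CLEAN_CEPH
#
# A stand-in parser for that 'key:value,key:value' format:

def parse_dict_opt(raw):
    """Parse a 'key1:val1,key2:val2' string into a dict."""
    result = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _sep, value = pair.partition(":")
        result[key.strip()] = value.strip()
    return result
```

The flatness of this format is exactly why a YAML file starts looking attractive once the values grow beyond simple strings.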
15:21:53 mutable configs only take effect once the service is hupped
15:22:08 * TheJulia knows this far too well from changing automated_clean to true locally
15:22:23 Are you sure that it's not a lazy-activation thing?
15:22:30 that it would've taken effect over time
15:22:30 100% sure
15:23:13 well, 99.95% sure, 0.05% someone might have slipped something in :)
15:23:47 anyhow, one of the reasons I did the container registry authentication keys as an open-file-when-needed approach was because that file can be regenerated
15:24:19 while the service is running, and needing to have whatever manages ironic know to hup it upon changes is a burden. It's more a question of frequency of change and if that is not a concern then cool cool
15:24:32 I think there might be some meat on this bone for making config better, but I'd prefer we take a "fix it all" approach (at least in ironic if not all of the stack) than introduce inconsistency
15:24:51 but adding an optional yaml version of most of our dict configs might be really, really syntactically nice
15:25:13 sorry I'm late.
15:25:18 I think you're taking my concern for flux a bit further than I was worried about
15:25:43 I think it's more that I think your idea is so cool I wanna take it further :D
15:25:50 I *hate* our dict config syntax
15:25:59 So I threw something on the PTG that I think is related to this ContainerHardwareManager piece.
15:26:01 anyhow, just not a fan of dict config values because decoding them is not always the easiest
15:26:19 yes, I'm with you there entirely
15:26:34 Basically what if we did away with IPA or the deploy drivers having a list of steps in there and instead always created "deploy templates" and used those.
15:26:37 cardoe: do you have a link to the ptg pad at hand?
15:26:47 https://etherpad.opendev.org/p/ironic-ptg-april-2025
15:27:18 It might be fair to do, I'm not sure we've ever *really* seen steps change in practice based upon hardware managers
15:27:34 but downstream operators might be doing that today and such a change is an operational risk
15:27:43 cardoe is one of those downstream operators
15:27:45 which means, definitely ptg topic
15:27:47 whether he knows it or not lol
15:27:52 JayF: indeed.
15:28:21 I am struggling to grasp the value of plugging in all deploy steps as templates
15:28:26 but that's what ptg is for :)
15:29:14 yup
15:29:26 So, anything else to discuss this week?
15:29:34 I had a question about the anaconda docs patch.
15:29:50 I threw in 2 TODOs that I wanna rip out... https://review.opendev.org/c/openstack/ironic/+/942839
15:30:17 If we should make those changes then I'll create bugs for enhancements. If not, I'll delete them.
15:31:03 dtantsur: Really hoping you can provide feedback on https://review.opendev.org/c/openstack/ironic/+/940333 as well.
15:31:14 I'd suggest you check with the only other vocal user of that driver: kubajj and the friends at cern
15:31:32 I don't have strong opinions around it other than "please don't break existing users or give them an annoying migration" :D
15:31:33 for the second one, is it just lacking a default value today?
15:31:34 What do we need to do to unblock our CI? It seems like nothing is passing. It's all different jobs that fail.
15:31:51 It seems networking is just toast
15:32:01 and it seems entirely random :(
15:32:37 TheJulia: So if I don't want to ever use the generic ks_template provided by Ironic and require the user to supply a ks_template, that's not allowed.
15:32:39 Is there anything we could nail down, like per provider or something?
15:32:48 if it's an infra issue, we can maybe point them at it :/
15:33:10 otherwise I've been thinking drastic things ... like -nv almost all integration jobs and communicate to cores to enforce that all jobs passed once
15:33:19 iiuc it is mostly high load on the whole system
15:33:23 cardoe: so ironic should have a reasonable default, I'm reading what you're saying as the value must be supplied regardless
15:33:37 I mean, if high load on the whole system renders our CI useless, the system is broken for purposes of our CI
15:33:51 IDK if that means CI is broken, the system is broken, or "yes"
15:33:58 Unfortunately, our jobs are io intensive and we've seen this cycle after cycle where when the system is getting crushed our failure rate goes through the roof
15:34:21 cardoe: TODO 1 - we did set it up to load the kickstarts from glance
15:34:23 cardoe: replied to your first question on the docs review
15:34:38 TheJulia: I'm cool with Ironic having a reasonable default out of the box. But as an operator I cannot set "default_ks_template" to "".
15:34:59 kubajj: yeah you can load it from glance if you set ks_template=glance:// on each image you upload.
15:35:02 cardoe: and are you saying you need to?
15:36:06 TheJulia: So I want to require all images to specify their own ks_template. The code will check for that because it uses default_ks_template if a specific ks_template isn't set. errr lemme fake-code something
15:36:09 I guess, the call in validation would always expect it to be used, but if there is a *documented* path to avoid its use which works, I could be okay with "if set to an empty value, treat it as None and skip the validation on it"
15:36:34 ks_template = image_info.get("ks_template", CONF.anaconda.default_ks_template)
15:36:52 okay, sounds good
15:36:57 if not this_exists(ks_template): print("user you did it wrong")
15:37:02 That's how the code works today.
15:37:14 okay
15:37:23 BUT Ironic fails to start up if CONF.anaconda.default_ks_template isn't a real file.
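The fake-code above, fleshed out with the behavior being proposed (empty `default_ks_template` treated as "no default", so operators can force per-image templates): this is a sketch of the *proposed* change from the discussion, not current Ironic code, and `validate_ks_template` is a made-up helper name.

```python
# Sketch of the proposed anaconda ks_template validation. An empty
# default_ks_template behaves like None instead of failing, so each
# image must carry its own ks_template when no default is configured.

def validate_ks_template(image_info, default_ks_template):
    """Return the kickstart template to use, or raise if none is usable."""
    # Proposed change: "" is treated as "no default configured" rather
    # than as a path that must exist at startup.
    default = default_ks_template or None
    ks_template = image_info.get("ks_template", default)
    if not ks_template:
        # Stands in for the "user you did it wrong" check in the log.
        raise ValueError(
            "no ks_template on the image and no default configured")
    return ks_template
```

The startup failure cardoe hit would then move to per-deploy validation instead of aborting the service.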
15:37:32 Because of a check in another spot
15:38:13 Ahh
15:38:15 okay
15:38:18 kubajj: would love feedback on https://review.opendev.org/c/openstack/ironic/+/942839
15:38:37 cardoe: will do
15:38:44 cardoe: so likely okay to separately change that logic since we should ideally not abort startup unless it is a horribly bad issue
15:39:07 That itself might actually be a bug at this point
15:39:07 So basically if those use cases are valid, I'll make bugs to improve this.
15:39:26 sgtm
15:39:38 making image handling more consistent is good, not crashing on startup is good
15:40:00 okay thanks. Just wanna start landing some of my docs patches rather than leaving them in this terrible WIP state.
15:40:13 I recently... (like within the last year) did a similar check removal since it no longer made sense
15:40:40 cool coo
15:40:42 cool cool
15:41:00 anything else to discuss other than CI performance sadness and IP networking failing across the FIP
15:41:40 I probably won't have time to look at CI today/tomorrow, but if we find a quiet hour at OIF days it might be interesting to IRL pair on it
15:42:47 Yeah, I've looked enough times that I stopped digging into the connectivity failures since they also seem to be highly intermittent
15:43:06 adamcarthur5 keeps looking for interesting ways to intersect AI tooling and OpenStack
15:43:12 I had an epiphany this weekend to maybe point him at CI logs
15:44:11 frickler Thank you!
15:45:26 JayF: that... might not be a bad idea
15:45:40 yep. no need to mechanical turk it as humans
15:45:42 Anyway, closing meeting in 1 minute if nobody else has anything to discuss
15:45:50 combined with AI that doesn't exhaust and can maybe find patterns we can't
15:46:04 at this point "bad" AI ideas are maybe better than no ideas at all
15:49:00 #endmeeting