15:00:05 <rpittau> #startmeeting ironic
15:00:05 <opendevmeet> Meeting started Mon Jul 17 15:00:05 2023 UTC and is due to finish in 60 minutes.  The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:05 <opendevmeet> The meeting name has been set to 'ironic'
15:00:08 <iurygregory> o/
15:00:08 <TheJulia> o/
15:00:31 <rpittau> welcome everyone to our weekly meeting!
15:00:31 <rpittau> I'll be your host for today :)
15:00:51 <rpittau> The meeting agenda can be found here:
15:00:51 <rpittau> #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
15:01:03 <rpittau> #topic Announcements/Reminder
15:01:37 <rpittau> we've announced last week that the next PTG will take place virtually October 23-27, 2023
15:02:20 <rpittau> going to remove the reminder after this meeting
15:02:20 <rpittau> we'll remind everyone when we're closer to the date
15:02:48 <rpittau> #note usual friendly reminder to review patches tagged #ironic-week-prio, and tag your patches for priority review
15:03:38 <rpittau> I'm leaving the bobcat timeline for reference in the reminder section
15:04:03 <rpittau> any other announcement/reminder today ?
15:04:28 <TheJulia> I worked stupidly late on Friday night on a decorator for sqlite issues
15:04:35 <rpittau> lol
15:04:49 <TheJulia> I *think* it works, but I've not seen CI give me a resource contention issue to generate a "database is locked" error yet today
15:05:00 <TheJulia> unit test wise, it definitely works!
15:05:06 <rpittau> \o/
15:05:11 <rpittau> let
15:05:19 <iurygregory> :D
15:05:36 <rpittau> I guess we need to review it and recheck a couple of times
15:05:38 <TheJulia> it has been tagged as a prio review, any reviews/feedback would be appreciated
15:05:40 <TheJulia> ++
15:05:42 <rpittau> great
15:05:51 <rpittau> thanks TheJulia :)
15:05:53 <TheJulia> yeah, I think I'm on the 3rd stupidly minor change today, so hopefully that is helping
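For context, the sqlite retry decorator discussed above might look roughly like this minimal sketch. This is illustrative only: the actual ironic patch wraps oslo.db exceptions rather than raw sqlite3 ones, and all names here (`retry_on_sqlite_lock`, `flaky_update`) are invented for the example.

```python
import functools
import sqlite3
import time


def retry_on_sqlite_lock(max_retries=5, delay=0.1):
    """Retry a DB call when SQLite raises 'database is locked'.

    Hypothetical sketch of the approach; the real patch targets
    oslo.db-wrapped exceptions, not raw sqlite3 errors.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    if 'database is locked' not in str(exc):
                        raise
                    attempts += 1
                    if attempts > max_retries:
                        raise
                    # Back off briefly so the competing writer can finish.
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {'n': 0}


@retry_on_sqlite_lock(max_retries=5, delay=0)
def flaky_update():
    # Simulates a write that hits lock contention twice before succeeding.
    calls['n'] += 1
    if calls['n'] < 3:
        raise sqlite3.OperationalError('database is locked')
    return 'ok'
```

This mirrors why the unit tests pass (the contention path is deterministic there) while CI needs a real lock collision to exercise it.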
15:06:33 <rpittau> (I'm silently skipping the action items review as there are none from last time)
15:06:55 <rpittau> since we started with that patch let's go to
15:06:55 <rpittau> #topic Review Ironic CI Status
15:07:18 <rpittau> so we're back to jammy, except for the snmp pxe job
15:07:36 <rpittau> and still battling the sqlite shenanigans
15:07:45 <TheJulia> yup, I have no idea why it is failing, but I don't remember the exact reason we held it back to begin with
15:08:23 <rpittau> I don't know, I'll try to make time this week to try and replicate, downstream permitting
15:08:42 <rpittau> also found an issue with the new tinycore 14.x in the standalone job
15:08:51 <rpittau> the one I mentioned before
15:09:23 <rpittau> output of efibootmgr "EFI variables are not supported on this system"
15:09:39 <rpittau> oh well
15:09:48 <rpittau> we're probably not in a rush for that
15:10:06 <rpittau> anything else to mention for the CI status?
15:10:47 <rpittau> ok, moving on
15:10:47 <rpittau> #topic 2023.2 Workstream
15:10:47 <rpittau> #link https://etherpad.opendev.org/p/IronicWorkstreams2023.2
15:11:08 <rpittau> any update to share?
15:11:38 <iurygregory> feedback would be appreciated in the Firmware Update patches =)
15:11:45 <TheJulia> I've updated the patch to support vendor passthru methods as steps
15:11:49 <TheJulia> and it now passes CI \o/
15:12:10 <rpittau> iurygregory: yeah, was going to mention that, I have some time this week, I'll put here on top of my list
15:12:23 <rpittau> TheJulia: great!
15:12:31 <TheJulia> hopefully get back to service steps this week
15:12:35 * TheJulia hopes
15:12:35 <iurygregory> tks rpittau o/
15:12:43 <TheJulia> but CI is obviously the priority
15:12:51 <iurygregory> ++
15:13:01 <iurygregory> I will take a look at patches today for CI
15:14:02 <rpittau> ok, great
15:14:02 <rpittau> I think we're good
15:14:14 <rpittau> #topic RFE review
15:14:25 <rpittau> JayF: left a couple of notes
15:14:42 <TheJulia> So, jay also put the same link in on both
15:14:42 <rpittau> so the first one is https://bugs.launchpad.net/ironic/+bug/2027688
15:14:48 <rpittau> ah yeah :D
15:14:59 * TheJulia hands JayF coffee
15:15:22 <TheJulia> I left a comment on the first one
15:15:25 <rpittau> the second one is https://bugs.launchpad.net/ironic/+bug/2027690
15:15:29 <rpittau> I'll update the agenda
15:15:50 <TheJulia> I think the idea is good overall, but we need to better understand where the delineation is, and even then, some of the things that fail are deep in the weeds
15:15:59 <TheJulia> and you only see the issue when you're at that step, deep in the weeds
15:16:16 <TheJulia> in other words, more discussion/clarification required
15:16:19 <rpittau> mmmm I agree
15:16:29 <TheJulia> since we can't  go check switches we know *nothing* about
15:16:51 <TheJulia> but if we can provide better failure clarity, or "identify networking failed outright" mid-deploy, then we don't have to time out completely
15:17:29 <iurygregory> ++, I like the rfe, but we will need further discussion about it
15:18:26 <TheJulia> so yeah, I agree with the second rfe, we'll need it as an explicit spec and we'll need to figure out how to handle permissions
15:18:44 <TheJulia> it might just be we pre-flight validate all permission rights and then cache the step data from the templates
15:18:48 * iurygregory is reading the second
15:19:06 <TheJulia> but, that requires a human putting their brain into the deploy templates code
15:19:07 <rpittau> checking early for the permissions would be best IMHO
15:19:19 <TheJulia> *also* custom clean steps would be a thing to consider
15:19:27 <TheJulia> maybe we just... permit them to be referenced, dunno
15:19:33 <TheJulia> that is a later idea I guess
15:19:48 <rpittau> yeah
15:20:44 <rpittau> once we define the templates, it shouldn't be too hard to expand with custom clean steps
15:20:54 <TheJulia> yeah
15:21:14 <JayF> I mainly wanted feedback on the interfaces I laid out there; I intend to specify the implementation just to keep my thoughts straight
15:21:14 <iurygregory> ++
15:21:46 <TheJulia> right now you need to have system privileges for the deploy template endpoint if memory serves, so if we take the same pattern as add fields, begin to wire everything together, and then change the rbac policy, it should be a good overall approach
15:22:20 <TheJulia> we have the patterns already spelled out in nodes fairly cleanly too
15:22:34 <TheJulia> and allocations
15:22:37 <rpittau> right!
15:22:54 <TheJulia> because you can have an allocation owner
15:23:01 <TheJulia> (but not lessee, since that model doesn't exist there)
15:24:29 <rpittau> we're probably going to discuss both RFEs further, but they both look good to me
15:25:17 <TheJulia> yeah, first one just needs some more "what are the specific failure cases we're trying to detect" to provide greater clarity
15:25:41 <rpittau> probably clarify some aspects and then finalize during the PTG ?
15:25:49 <TheJulia> because honestly, some areas we do a bad job with today, and we should make that better since they could still fail there even if neutron has a thing saying "yes, I can login to the switch"
15:27:20 <JayF> I don't think for the precheck idea, we're looking for anything perfect, just catch anything obviously broken
15:27:39 <JayF> it was suggested by johnthetubaguy that for ilo, for instance, there's an internal BMC status and you could fail if that wasn't 'green'
15:28:08 <TheJulia> I *suspect* the big one is "bmc is half broken" weird case, and I'm not sure we actually test a swift tempurl we create...
15:28:15 <JayF> and I know with the hardware I ran at Rackspace, even just an ipmi power status would've been enough to indicate if our most common failure modes were active
15:28:20 <iurygregory> yellow can be a thing depending on what you are looking at in iLO
15:28:21 <iurygregory> XD
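A pre-check like the one JayF describes could be as simple as parsing the output of `ipmitool ... power status` before starting a deploy or rebuild. The helper below is a hypothetical sketch, not an ironic API; the accepted-output strings match what ipmitool's `chassis power status` normally prints.

```python
def bmc_power_status_ok(ipmitool_output):
    """Return True if 'ipmitool ... power status' output looks sane.

    Hypothetical pre-flight check: a healthy BMC answers with a clear
    chassis power state; anything else (errors, empty output, session
    failures) suggests the BMC is in a half-broken state and the
    deploy/rebuild should fail fast instead of timing out mid-step.
    """
    status = ipmitool_output.strip().lower()
    return status in ('chassis power is on', 'chassis power is off')
```

In practice this would run as part of (or gated behind a flag on) the existing validate step, per the "no really, actually validate" idea above.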
15:30:27 <TheJulia> i guess one question might end up being, are there *really* operators still turning off power sync at scale? We went from 1 to 8 power sync workers. Of course, that is also ipmi
15:31:33 <TheJulia> Anyway, that is beyond the scope of the meeting, just a question to get a datapoint really
15:31:58 <rpittau> let's remember this (and more) during future discussions :)
15:32:06 <JayF> The more I think of it, the more I wonder if it's just a flag to validation to say "no really, actually validate"
15:32:26 <JayF> and have config to change default behavior for when Ironic does it
15:33:09 <TheJulia> to at least try and fail faster, I guess
15:33:17 <JayF> I also have a use case
15:33:21 <JayF> of failing a rebuild before the disk gets nuked
15:33:29 <rpittau> yeah, I guess that's the point (fail faster)
15:33:39 <JayF> so like e.g. a Ceph node wouldn't drop outta the cluster for longer than needed
15:34:02 <JayF> if we can prevent that even in 25% of failure cases, it reduces operator pain because they don't have to go $bmc_fix_period without their node in a usable state
15:34:20 <TheJulia> JayF: I think additional details or context might be needed there since that might also just be a defect
15:34:56 <TheJulia> or, maybe something we've done work on and a lack of a failure awareness is causing the wrong recovery path to be taken by humans
15:35:00 <JayF> that's sorta the theme of both of these rfes; allowing people with that kind of "I have a sensitive cluster" kind of use cases do their own scheduling of maintenance, and reduce the amount of times we'd have them in a  failed-down state
15:36:20 <TheJulia> anyway, lets keep talking about this one after the meeting, I think we need to better understand your use case and what spawned the issue to begin with, if that makes sense
15:36:53 <TheJulia> specifically, if I'm trying to rebuild [to update] a node, and somehow I'm wiping all of ceph volumes out
15:37:18 <rpittau> alright, moving on!
15:37:25 <rpittau> #topic Open Discussion
15:38:00 <TheJulia> I had https://bugzilla.redhat.com/show_bug.cgi?id=2222981 pop up in my inbox this morning!
15:38:06 <rpittau> it's been great running the meetings again this time and the last week :)
15:38:06 <rpittau> next week JayF will be back!
15:38:57 <rpittau> #note Overcloud deploy fails when mounting config drive on 4k disks
15:39:03 <rpittau> #link https://bugzilla.redhat.com/show_bug.cgi?id=2222981
15:41:02 <TheJulia> I'm looking at cloud-init code to see if there is a happy path forward, but I suspect it might be we may need to locally repack the config drive as an intermediate solution
15:44:06 <TheJulia> there might be, just depends on how long it has *really* been an "or", and we'll need to likely fix glean
15:44:37 <opendevreview> Merged openstack/bifrost master: Refactor use of include_vars  https://review.opendev.org/c/openstack/bifrost/+/855807
15:44:42 <iurygregory> I'm wondering if repack would cause performance issues (but it would be only in the 4k scenario right?)
15:45:40 <TheJulia> well, we would have to open the file on a loopback... if we even can...
15:45:48 <TheJulia> and then write out as a brand new filesystem
15:45:51 <TheJulia> so... dunno
15:46:00 <iurygregory> gotcha
15:46:13 <TheJulia> ¯\_(ツ)_/¯
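The "repack locally" idea above would roughly amount to loopback-mounting the original config drive ISO and regenerating it with genisoimage. As a dry-run sketch (the function and its command set are invented for illustration; a real implementation needs root for the mount and would have to preserve the `config-2` volume label that config-drive consumers look for):

```python
def build_repack_commands(src_iso, mountpoint, new_iso):
    """Return the shell commands a config-drive repack might run.

    Purely illustrative dry run: the commands are returned as lists
    rather than executed, so the sequence can be inspected/tested.
    """
    return [
        # Loopback-mount the original ISO read-only (this is the
        # "if we even can" part from the discussion above).
        ['mount', '-o', 'loop,ro', src_iso, mountpoint],
        # Write out a brand new filesystem from the mounted contents,
        # keeping the 'config-2' label that cloud-init/glean expect.
        ['genisoimage', '-o', new_iso, '-l', '-r',
         '-V', 'config-2', mountpoint],
        ['umount', mountpoint],
    ]
```

The performance question iurygregory raised would then only apply on the 4k-sector path, since the repack could be gated on detecting the mismatched block size.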
15:49:33 <rpittau> I don't see a "straight" solution honestly
15:50:41 <TheJulia> yeah, since we support binary payloads
15:51:05 <TheJulia> I'll dig more, and write an upstream bug up
15:51:12 <rpittau> thanks
15:51:53 <iurygregory> ack
15:52:59 <rpittau> alright, anything else for open discussion ?
15:53:34 <rpittau> thanks everyone!
15:53:34 <rpittau> #endmeeting