15:00:31 <rpittau> #startmeeting ironic
15:00:31 <opendevmeet> Meeting started Mon May  6 15:00:31 2024 UTC and is due to finish in 60 minutes.  The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:31 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:31 <opendevmeet> The meeting name has been set to 'ironic'
15:00:44 <iurygregory> o/
15:00:49 <masghar> o/
15:00:49 <TheJulia> good morning
15:00:52 <rpittau> Hello everyone!
15:00:52 <rpittau> Welcome to our weekly meeting!
15:00:52 <rpittau> The meeting agenda can be found here:
15:00:52 <rpittau> #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_May_6.2C_2024
15:00:53 <dtantsur> o/
15:01:33 <rpittau> #topic Announcements/Reminders
15:01:43 <rpittau> #info Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash
15:01:58 <rpittau> #info Work items for 2024.2 have been merged https://review.opendev.org/c/openstack/ironic-specs/+/916295
15:02:16 <rpittau> #info Ironic Meetup/BareMetal SIG June 5, OpenInfra Days June 6 @ CERN.
15:02:21 <rpittau> Signup is now closed
15:02:33 <rpittau> #info 2024.2 Dalmatian Release Schedule https://releases.openstack.org/dalmatian/schedule.html
15:02:50 <rpittau> anything else to announce/remind ?
15:02:53 <iurygregory> #info CFP for OIS Asia'24 closes on May 29
15:02:59 <rpittau> thanks iurygregory
15:03:28 <iurygregory> yw
15:04:16 <rpittau> I do have an announcement, I will be out for a couple of weeks, I'll be back on Monday May 26
15:04:16 <rpittau> I will skip two meetings, so I need one or two volunteer(s) to run the next 2 meetings :)
15:04:40 <TheJulia> I've got nothing planned so I guess I can
15:04:42 <dtantsur> I'll be out for 2 Mondays as well
15:04:49 <rpittau> thanks TheJulia :)
15:05:22 <rpittau> #info TheJulia will run the meetings on 13 and 20 of May
15:05:25 <iurygregory> I can run
15:05:36 <iurygregory> oh nvm =D
15:05:50 <rpittau> thanks iurygregory, I guess you can alternate with TheJulia :)
15:06:00 <iurygregory> yeah =) for sure
15:06:30 <rpittau> ok, moving on!
15:06:37 <rpittau> #topic Review Ironic CI status
15:07:03 <rpittau> haven't seen anything particular in CI last week
15:07:41 <JayF> I believe Julia landed some fixes for stable branches in IPA
15:07:48 <JayF> Other than that I've not seen anything new
15:08:01 <TheJulia> Yeah, CI seemed mostly okay last week with the exception being some stable branches being a little cranky
15:08:04 <rpittau> true, thanks TheJulia!
15:08:45 <rpittau> now that I think about it, I've seen some failures on grenade jobs in stable releases, maybe due to zed going into unmaintained mode
15:09:38 <rpittau> we may need to remove the job in zed to be able to land the latest changes in the branch, just simple branch updates
15:10:51 <rpittau> I don't see any discussion topics for this week, do we want to jump directly to Bug Deputy updates ?
15:10:53 <TheJulia> Yeah, oldest grenade always needs to be disabled
15:11:17 <rpittau> #action disable grenade in zed
15:11:53 <TheJulia> Bug Deputy update sounds like it is next
15:12:04 <rpittau> #topic Bug Deputy Updates
15:12:13 <rpittau> TheJulia: anything to report? :)
15:12:16 <TheJulia> Two items to note
15:12:44 <TheJulia> First, I cleaned up ironic-inspector bugs, mostly marked stuff that was super old to indicate we were not planning on fixing the items
15:13:16 <TheJulia> I also was able to close out a few bugs in ironic as already fixed or no longer applicable
15:13:29 <rpittau> great, thanks!
15:14:13 <JayF> I'll take it this week
15:14:18 <iurygregory> I can be the bug deputy after next week (i.e., the week of May 20 - 24)
15:14:24 <rpittau> thanks JayF I was going to ask :D
15:14:33 <rpittau> and thanks iurygregory :)
15:15:08 <rpittau> I guess we can move to
15:15:08 <rpittau> #topic RFE Review
15:15:36 <rpittau> we have 2 RFEs proposed by rozzi but I don't see him here today
15:16:06 <rpittau> they look both reasonable to me
15:16:26 <rpittau> and I saw the development already started
15:16:40 <TheJulia> I concur, I think the block device list combo one makes sense
15:16:47 <rpittau> yeah, absolutely
15:16:48 <TheJulia> that seems more like a bug to me, tbh
15:17:04 <TheJulia> the other, I have some mixed feelings on but I can see the reasoning
15:17:07 <rpittau> sure, it's on the edge
15:17:32 <rpittau> I guess the scenario proposed makes sense
15:17:35 * iurygregory looks at the RFEs
15:17:45 <JayF> Can we link the RFEs in here so that they are in the minutes? I also don't have the agenda up lol
15:17:53 <rpittau> #link https://bugs.launchpad.net/ironic-python-agent/+bug/2061437
15:18:02 <rpittau> #link https://bugs.launchpad.net/ironic-python-agent/+bug/2061362
15:18:46 <opendevreview> Verification of a change to openstack/ironic stable/2023.1 failed: ci: stable-only: explicitly pin centos build  https://review.opendev.org/c/openstack/ironic/+/918118
15:18:57 <JayF> 2061362, I'm not sure I'm on board with approving that
15:19:16 <JayF> We have methods to disable the step entirely which seems to be the correct move if you have a disc that you do not want cleaned
15:19:31 <dtantsur> Mixed feelings here too. We can disable the step or we can skip a disk by using hints.
15:19:43 <iurygregory> we can also skip cleaning a given disk
15:19:47 <iurygregory> yeah
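For reference, a minimal sketch of the skip-a-disk mechanism dtantsur and iurygregory mention, assuming openstacksdk with a clouds.yaml entry named "mycloud"; the node name and device hint below are illustrative only:

    import openstack

    # Connect using an assumed clouds.yaml entry named "mycloud".
    conn = openstack.connect(cloud="mycloud")

    # skip_block_devices is a list of device hints; ironic-python-agent skips
    # matching disks during the erase_devices* cleaning steps. Note that this
    # patch replaces the node's whole properties dict, so merge with the
    # existing values in real use.
    conn.baremetal.update_node(
        "example-node",
        properties={"skip_block_devices": [{"name": "/dev/sdb"}]},
    )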
15:20:00 <TheJulia> Based upon the described case, it sounds like they have shared luns across hosts
15:20:11 <rpittau> mmm I can see your point too, I'm leaning more toward accepting, honestly
15:20:30 <JayF> Is the actual feature a way to skip shared devices? Like a device hint that points to any shared devices that they can skip cleaning
15:20:35 <TheJulia> so, if we can better understand the exact case in which the device is locked, that might help us understand how this feature could be scoped and accepted as an "optional" thing
15:20:49 <JayF> I'm okay with a declarative way of saying don't clean this. I'm less okay with saying just ignore certain types of errors during cleaning
15:20:52 <TheJulia> can't necessarily know what the device is or if it will be in use in a shared lab
15:21:12 <dtantsur> I'd even be okay with "skip any remote device" as an option
15:21:27 <JayF> I do think it's not unreasonable to ask someone to use a custom hardware manager for a lab case if it is a unique use case
15:21:35 <JayF> Same dtantsur
15:22:14 <masghar> I think their trouble is having many disks and a random one of them failing
15:22:15 <TheJulia> skipping "remote" might make sense, not sure I'd ever let a customer use it though :)
15:22:20 <TheJulia> but that is always a documentation thing
15:22:30 <rpittau> JayF, dtantsur, when you have a minute please add a comment in the RFE, I guess we need more details on that?
15:22:47 <dtantsur> masghar: it's not a normal situation there? we're curious why they fail.
15:22:56 <JayF> masghar: another way of looking at that is: should ironic say a machine is available if it has a random disk failure?
15:23:34 <masghar> dtantsur, JayF, I see...but then their whole fleet can't be cleaned?
15:23:39 <rpittau> I will approve the first one
15:24:12 <dtantsur> masghar: I'd be really curious what is going on in this case
15:24:27 <dtantsur> Like, in my whole career in OpenShift, I don't remember a single such case
15:24:51 <masghar> dtantsur: true, I guess disk failures are too serious to ignore
15:25:39 <dtantsur> A harsher way to put that would be "if you don't care if some of your disks don't get cleaned, why enable cleaning at all?"
15:25:49 <TheJulia> to me, it reads as if there are quorum disks in the environment
15:26:07 <TheJulia> and we hit one while another node is up which shares the quorum disk, and we fail at that point
15:26:27 <masghar> oh I see!
15:26:43 <dtantsur> Sure but... why have cleaning enabled?
15:27:07 <dtantsur> Metadata cleaning (the only type supported in metal3) is designed to prevent software conflicts, such as Ceph not liking old Ceph partitions
15:27:13 <JayF> Or why not, as part of your process, skip cleaning on shared/remote disks? Even if the feature were oriented that way, we'd likely approve it as optionally skipping cleaning on remote devices
15:27:20 <TheJulia> dtantsur: I think because they are recycling the environment
15:27:36 <rpittau> the error is not going to be ignored though, it just wouldn't fail the entire cleanup
15:27:36 <rpittau> but I guess at that point excluding the disk (removing it physically or logically) and then redoing the cleaning would be better than leaving it there in any case
15:27:39 <dtantsur> Yeah, but if you leave random data around on a shared disk, you still hit software problems
15:27:40 <TheJulia> dtantsur: so if you have a host already cleaning, and it grabs the quorum disk first, it has the lock
15:27:45 <TheJulia> another one comes along and will fail
15:28:09 <TheJulia> dtantsur: if the other node is being cleaned, the data gets wiped out, but another node doesn't know about it
15:28:18 <dtantsur> Or maybe it uses the disk, and you try to clean it :) I think this environment should have no cleaning or some very custom cleaning
15:28:25 <dtantsur> exactly
15:28:31 <dtantsur> how is cleaning the way we do it safe there?
15:28:46 <TheJulia> we kind of need to take a step back from the problem
15:28:52 <TheJulia> and provide greater clarity
15:29:00 <TheJulia> is there a way to identify this as an "acceptable" failure
15:29:06 <TheJulia> for that device, or not
15:29:27 <TheJulia> for example, I ran 8GB quorum disks in clusters years ago shared across multiple nodes, the lock holder was always the leader
15:29:28 <dtantsur> Right, yeah. We're guessing already, we should get back to Adam
15:29:46 <TheJulia> ++
15:29:52 <rpittau> yep, I agree we need more clarification at this point
15:31:03 <dtantsur> I think the option to ignore errors should be our last resort
15:31:42 <JayF> Yeah basically we open ourselves up to a security vulnerability if someone can figure out how to break our agent using something crazy on disk
15:31:56 <dtantsur> Well, this is metadata cleaning
15:32:06 <dtantsur> For full cleaning, I'd be hard -2 on even considering that
15:32:25 <TheJulia> JayF: if a disk is locked, you might not be able to read it depending on the lock
15:32:39 <JayF> Yeah, I guess for metadata cleaning the worst case is that we'll get very low quality bugs from people disabling error checking and then being shocked that their unclean disk can't be deployed to
15:32:56 <TheJulia> yeah
15:33:27 <JayF> I still think we should round trip some questions about details. I'm not a hard and fast opposed no matter what with that nuance in mind though
15:33:41 <TheJulia> ++
15:33:53 <dtantsur> agreed
15:34:27 <rpittau> ++
15:35:44 <rpittau> please just add a comment in the RFE to get more info from Adam
15:35:51 <TheJulia> ++
15:36:33 <JayF> I'll toss that comment in here in the next 15 minutes
15:36:41 <rpittau> thanks JayF :)
15:36:46 <JayF> If y'all want to, just let me do it so that we don't crush Adam with five comments that say the same thing LOL
15:36:54 <rpittau> sounds good :D
15:37:00 * dtantsur nods
15:37:33 <rpittau> we can probably move on
15:37:55 <rpittau> is there anything for open discussion ?
15:38:37 <JayF> Ah, that Nova bug in the ironic driver that I linked earlier, it's worth a look if anyone has some time
15:39:05 <JayF> I'm mainly curious if we need to add tempest coverage on the nova driver side
15:39:09 <JayF> for things like reboot
15:39:09 <rpittau> I think I lost it
15:39:09 <rpittau> can you link it here?
15:39:21 <JayF> https://review.opendev.org/c/openstack/nova/+/918195
15:39:25 <rpittau> thanks
15:39:37 <JayF> #link https://launchpad.net/bugs/2064826
15:39:47 <JayF> tl;dr we broke nova reboot last cycle, CI didn't catch it
15:39:54 <rpittau> whoops
15:39:58 <JayF> it's being fixed but I wanted to see if it is an oversight or an explicit decision to not test this
15:40:05 <JayF> if an oversight, I'll see if I can swing a test for it
15:40:21 <JayF> but I always like asking since I know we are shaving our test surfaces, don't wanna re-add something that was intentionally removed
15:40:30 <TheJulia> afaik, our basic ops plugin doesn't exercise reboot
15:41:04 <TheJulia> doing reboot testing as part of it would also add some time to plugin execution
15:41:19 <TheJulia> ... Mileage may vary, of course as with all things tempest plugin wise
15:41:41 <JayF> It sorta goes back to my statement a cycle ago ... I wish we could just have a fake ironic node for some of these interactions/tests
15:41:57 <JayF> but I think that's more than I'm willing  to take on right now, so I might just swing at a reboot test and see what the actual impact is in minutes
15:42:14 <TheJulia> well, we sort of can but the test can't ping something entirely fake
15:42:30 <TheJulia> you're either doing something entirely fake, or you're doing stuff which is a full scenario
15:42:36 <TheJulia> integration testing is inherently scenario-based
15:43:09 <JayF> well, Nova<>Ironic driver, some of those things are just "did we tell Ironic to do the thing"
15:43:12 <TheJulia> doing stuff in the middle leads to fixes like https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/918001 having to come along
15:43:20 <JayF> heh, fair
15:43:41 <JayF> I'll keep wishing for utopia and living in the real world instead then :P
15:43:48 <TheJulia> heh
15:44:01 <TheJulia> adding to our basic ops likely isn't a huge issue, just add a knob for the time
15:44:07 <TheJulia> if a reasonable one doesn't already exist
15:44:27 <JayF> ack
15:44:58 <rpittau> thanks for reporting that JayF
15:45:26 <rpittau> anything else we want to discuss today?
15:45:37 <dtantsur> At some point, we should probably delineate stuff we test in Nova and stuff we test in Ironic
15:45:49 <dtantsur> otherwise, our testing matrix will keep being chaotic.
15:46:14 <JayF> dtantsur: that's a line of chat Sean Mooney and I had last cycle; a tempest scenario set specifically designed for nova driver, and to run it more frequently
15:46:16 <TheJulia> Yeah, I'm not a fan of doing an extra reboot when exercising rescue
15:46:32 <TheJulia> and then doing that across however many tests that ends up triggering on
15:46:34 <JayF> but realistically I'm already majorly overcommitted this cycle
15:47:37 <rpittau> I guess this needs a longer discussion involving the nova team
15:47:37 <rpittau> we need to see if we can find the resources during this cycle
15:47:46 <TheJulia> well
15:47:59 <TheJulia> They are not going to really grasp the current scale
15:48:10 <JayF> Sean is sorta our ambassador to the nova team in some ways these days
15:48:20 <JayF> but I don't wanna engage a chat at that level if I don't have time to take action items outta it
15:48:29 <rpittau> yep, exactly
15:48:35 <TheJulia> but we do need to likely take a look and sort out if we've got appropriate coverage with the plugin
15:48:40 * JayF has had a problem with saying yes more times than he delivered a "yes" result
15:50:12 <dtantsur> we know this feeling
15:50:27 <rpittau> we can probably reevaluate the situation during the cycle
15:51:50 <rpittau> looks like we're closing
15:51:50 <rpittau> thanks everyone
15:51:55 <rpittau> #endmeeting