15:00:31 #startmeeting ironic
15:00:31 Meeting started Mon May 6 15:00:31 2024 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:31 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:31 The meeting name has been set to 'ironic'
15:00:44 o/
15:00:49 o/
15:00:49 good morning
15:00:52 Hello everyone!
15:00:52 Welcome to our weekly meeting!
15:00:52 The meeting agenda can be found here:
15:00:52 #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_May_6.2C_2024
15:00:53 o/
15:01:33 #topic Announcements/Reminders
15:01:43 #info Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio: https://tinyurl.com/ironic-weekly-prio-dash
15:01:58 #info Work items for 2024.2 have been merged https://review.opendev.org/c/openstack/ironic-specs/+/916295
15:02:16 #info Ironic Meetup/BareMetal SIG June 5, OpenInfra Days June 6 @ CERN.
15:02:21 Signup is now closed
15:02:33 #info 2024.2 Dalmatian Release Schedule https://releases.openstack.org/dalmatian/schedule.html
15:02:50 anything else to announce/remind?
15:02:53 #info CFP for OIS Asia'24 closes on May 29
15:02:59 thanks iurygregory
15:03:28 yw
15:04:16 I do have an announcement, I will be out for a couple of weeks, I'll be back on Monday May 26
15:04:16 I will skip two meetings, so need one/two volunteer(s) to run the next 2 meetings :)
15:04:40 I've got nothing planned so I guess I can
15:04:42 I'll be out for 2 Mondays as well
15:04:49 thanks TheJulia :)
15:05:22 #info TheJulia will run the meetings on 13 and 20 of May
15:05:25 I can run
15:05:36 oh nvm =D
15:05:50 thanks iurygregory, I guess you can alternate with TheJulia :)
15:06:00 yeah =) for sure
15:06:30 ok, moving on!
15:06:37 #topic Review Ironic CI status
15:07:03 haven't seen anything particular in CI last week
15:07:41 I believe Julia landed some fixes for stable branches in IPA
15:07:48 Other than that I've not seen anything new
15:08:01 Yeah, CI seemed mostly okay last week with the exception being some stable branches being a little cranky
15:08:04 true, thanks TheJulia!
15:08:45 now that I think about it, I've seen some failures on grenade jobs in stable releases, maybe due to zed going into unmaintained mode
15:09:38 we may need to remove the job in zed to be able to land the latest changes in the branch, just simple branch updates
15:10:51 I don't see any discussion topics for this week, do we want to jump directly to Bug Deputy updates?
15:10:53 Yeah, oldest grenade always needs to be disabled
15:11:17 #action disable grenade in zed
15:11:53 Bug Deputy update sounds like it is next
15:12:04 #topic Bug Deputy Updates
15:12:13 TheJulia: anything to report? :)
15:12:16 Two items to note
15:12:44 First, I cleaned up ironic-inspector bugs, mostly marked stuff that was super old to indicate we were not planning on fixing the items
15:13:16 I also was able to close out a few bugs in ironic as already fixed or no longer applicable
15:13:29 great, thanks!
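The #action above — disabling grenade on stable/zed so branch updates can land — would typically be a small `.zuul.yaml` change on that branch. A hypothetical sketch (the surrounding job list is illustrative, not ironic's actual pipeline definition):

```yaml
# Hypothetical excerpt of ironic's stable/zed .zuul.yaml: drop the
# grenade job, since grenade upgrades from the now-unmaintained
# previous branch can no longer run.
- project:
    check:
      jobs:
        - ironic-tempest-functional-python3
        # ironic-grenade removed: oldest-branch grenade is disabled
    gate:
      jobs:
        - ironic-tempest-functional-python3
```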
15:14:13 I'll take it this week
15:14:18 I can be the bug dep after next week (ie, the week from May 20 - 24)
15:14:24 thanks JayF I was going to ask :D
15:14:33 and thanks iurygregory :)
15:15:08 I guess we can move to
15:15:08 #topic RFE Review
15:15:36 we have 2 RFEs proposed by rozzi but I don't see him here today
15:16:06 they both look reasonable to me
15:16:26 and I saw the development already started
15:16:40 I concur, I think the block device list combo one makes sense
15:16:47 yeah, absolutely
15:16:48 that seems more like a bug to me, tbh
15:17:04 the other, I have some mixed feelings on but I can see the reasoning
15:17:07 sure, it's on the edge
15:17:32 I guess the scenario proposed makes sense
15:17:35 * iurygregory looks at the RFEs
15:17:45 Can we link the RFEs in here so that they are in the minutes? I also don't have the agenda up lol
15:17:53 #link https://bugs.launchpad.net/ironic-python-agent/+bug/2061437
15:18:02 #link https://bugs.launchpad.net/ironic-python-agent/+bug/2061362
15:18:46 Verification of a change to openstack/ironic stable/2023.1 failed: ci: stable-only: explicitly pin centos build https://review.opendev.org/c/openstack/ironic/+/918118
15:18:57 2061362, I'm not sure I'm on board with approving that
15:19:16 We have methods to disable the step entirely which seems to be the correct move if you have a disk that you do not want cleaned
15:19:31 Mixed feelings here too. We can disable the step or we can skip a disk by using hints.
15:19:43 we can also skip cleaning a given disk
15:19:47 yeah
15:20:00 Based upon the described case, it sounds like they have shared LUNs across hosts
15:20:11 mmm I can see your point too, I'm more on the side of accepting honestly
15:20:30 Is the actual feature a way to skip shared devices? Like a device hint that points to any shared devices that they can skip cleaning
15:20:35 so, if we can better understand the exact case in which the device is locked, that might help us understand how this feature could be scoped and accepted as an "optional" thing
15:20:49 I'm okay with a declarative way of saying don't clean this. I'm less okay with saying just ignore certain types of errors during cleaning
15:20:52 can't necessarily know what the device is or if it will be in use in a shared lab
15:21:12 I'd even be okay with "skip any remote device" as an option
15:21:27 I do think it's not unreasonable to ask someone to use a custom hardware manager for a lab case if it is a unique use case
15:21:35 Same dtantsur
15:22:14 I think their trouble is having many disks and a random one randomly failing
15:22:15 skipping "remote" might make sense, not sure I'd ever let a customer use it though :)
15:22:20 but that is always a documentation thing
15:22:30 JayF, dtantsur, when you have a minute please add a comment in the RFE, I guess we need more details on that?
15:22:47 masghar: it's not a normal situation there? we're curious why they fail.
15:22:56 masghar: another way of looking at that is should ironic say a machine is available if it has a random disk failure
15:23:34 dtantsur, JayF, I see... but then their whole fleet can't be cleaned?
15:23:39 I will approve the first one
15:24:12 masghar: I'd be really curious what is going on in this case
15:24:27 Like, in my whole career in OpenShift, I don't remember a single such case
15:24:51 dtantsur: true, I guess disk failures are too serious to ignore
15:25:39 A harsher way to put that would be "if you don't care if some of your disks don't get cleaned, why enable cleaning at all?"
15:25:49 to me, it reads as if there are quorum disks in the environment
15:26:07 and we hit one while another node is up which shares the quorum disk, and we fail at that point
15:26:27 oh I see!
15:26:43 Sure but... why have cleaning enabled?
15:27:07 Metadata cleaning (the only type supported in metal3) is designed to prevent software conflicts, such as Ceph not liking old Ceph partitions
15:27:13 Or why not, as part of your process, skip cleaning on shared/remote disks? Even if the feature was oriented that way, we'd likely approve it as skipping cleaning on remote devices optionally
15:27:20 dtantsur: I think because they are recycling the environment
15:27:36 the error is not going to be ignored though, just not fail the entire cleanup
15:27:36 but I guess at that point excluding the disk (removing it physically or logically) would be better than leaving it there in any case, and then redoing the cleaning
15:27:39 Yeah, but if you leave random data around on a shared disk, you still hit software problems
15:27:40 dtantsur: so if you have a host already cleaning, and it grabs the quorum disk first, it has the lock
15:27:45 another one comes along and will fail
15:28:09 dtantsur: if the other node is being cleaned, the data gets wiped out, but another node doesn't know about it
15:28:18 Or maybe it uses the disk, and you try to clean it :) I think this environment should have no cleaning or some very custom cleaning
15:28:25 exactly
15:28:31 how is cleaning the way we do it safe there?
15:28:46 we kind of need to take a step back from the problem
15:28:52 and provide greater clarity
15:29:00 is there a way to identify this as an "acceptable" failure
15:29:06 for that device, or not
15:29:27 for example, I ran 8GB quorum disks in clusters years ago shared across multiple nodes, the lock holder was always the leader
15:29:28 Right, yeah. We're guessing already, we should get back to Adam
15:29:46 ++
15:29:52 yep, I agree we need more clarification at this point
15:31:03 I think the option to ignore errors should be our last resort
15:31:42 Yeah, basically we open ourselves up to a security vulnerability if someone can figure out how to break our agent using something crazy on disk
15:31:56 Well, this is metadata cleaning
15:32:06 For full cleaning, I'd be a hard -2 on even considering that
15:32:25 JayF: if a disk is locked, you might not be able to read it depending on the lock
15:32:39 Yeah, I guess for metadata cleaning the worst case is that we'll get very low quality bugs from people disabling error checking and then being shocked that their unclean disk can't be deployed to
15:32:56 yeah
15:33:27 I still think we should round trip some questions about details. I'm not a hard and fast opposed no matter what, with that nuance in mind though
15:33:41 ++
15:33:53 agreed
15:34:27 ++
15:35:44 please just add a comment in the RFE to get more info from Adam
15:35:50 ++
15:35:51 ++
15:36:33 I'll toss that comment in here in the next 15 minutes
15:36:41 thanks JayF :)
15:36:46 If y'all want to just let me do it so that we don't crush Adam with five things that say the same thing LOL
15:36:54 sounds good :D
15:37:00 * dtantsur nods
15:37:33 we can probably move on
15:37:55 is there anything for open discussion?
15:38:37 Ah, that Nova bug I linked earlier in the ironic driver, it's worth a look if anyone has some time
15:39:05 I'm mainly curious if we need to add tempest coverage on the nova driver side
15:39:09 for things like reboot
15:39:09 I think I lost it
15:39:09 can you link it here?
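Bridging back to the cleaning discussion above: a minimal sketch of the "declarative way of saying don't clean this" idea, shaped like the custom-hardware-manager approach suggested there. The `BlockDevice` model and `devices_to_clean` helper are hypothetical illustrations, not the real ironic-python-agent API.

```python
# Sketch only: how a custom hardware manager could skip shared/remote
# block devices during metadata cleaning, per the discussion above.
# BlockDevice and devices_to_clean are made-up names, not the IPA API.
from dataclasses import dataclass

@dataclass
class BlockDevice:
    name: str
    shared: bool  # e.g. a quorum/multipath LUN visible to several hosts

def devices_to_clean(devices, skip_shared=True):
    """Return only the devices a clean step should touch."""
    if not skip_shared:
        return list(devices)
    return [d for d in devices if not d.shared]

devs = [
    BlockDevice("/dev/sda", shared=False),
    BlockDevice("/dev/sdb", shared=True),  # locked by another cluster node
]
print([d.name for d in devices_to_clean(devs)])  # → ['/dev/sda']
```

For a single known disk, the existing hints mentioned at 15:19:31 (e.g. Ironic's `skip_block_devices` node property) may already cover the case without any custom code; the sketch is only about the "skip any shared device" generalization.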
15:39:21 https://review.opendev.org/c/openstack/nova/+/918195
15:39:25 thanks
15:39:37 #link https://launchpad.net/bugs/2064826
15:39:47 tl;dr we broke nova reboot last cycle, CI didn't catch it
15:39:54 whoops
15:39:58 it's being fixed but I wanted to see if it is an oversight or an explicit decision to not test this
15:40:05 if an oversight, I'll see if I can swing a test for it
15:40:21 but always like asking since I know we are shaving our test surfaces, don't wanna re-add something that was intentionally removed
15:40:30 afaik, our basic ops plugin doesn't exercise reboot
15:41:04 doing reboot testing as part of it would also add some time to plugin execution
15:41:19 ... Mileage may vary, of course, as with all things tempest plugin wise
15:41:41 It sorta goes back to my statement a cycle ago... I wish we could just have a fake ironic node for some of these interactions/tests
15:41:57 but I think that's more than I'm willing to take on right now, so I might just swing at a reboot test and see what the actual impact is in minutes
15:42:14 well, we sort of can, but the test can't ping something entirely fake
15:42:30 you're either doing something entirely fake, or you're doing stuff which is a full scenario
15:42:36 integration testing is inherently scenario testing
15:43:09 well, Nova<>Ironic driver, some of those things are just "did we tell Ironic to do the thing"
15:43:12 doing stuff in the middle leads to fixes like https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/918001 having to come along
15:43:20 heh, fair
15:43:41 I'll keep wishing for utopia and living in the real world instead then :P
15:43:48 heh
15:44:01 adding to our basic ops likely isn't a huge issue, just add a knob for the time
15:44:07 if a reasonable one doesn't already exist
15:44:27 ack
15:44:58 thanks for reporting that JayF
15:45:26 anything else we want to discuss today?
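The "did we tell Ironic to do the thing" style of test mentioned above can be sketched without any real hardware: assert on the request the driver makes rather than pinging an instance. `FakeIronicClient` and `nova_reboot` here are hypothetical stand-ins, not actual Nova or Ironic code.

```python
# Sketch of a fake-backed driver test: verify the Nova<>Ironic
# translation layer (where the reboot regression lived) asks Ironic
# for the right power-state transition. All names are hypothetical.
class FakeIronicClient:
    """Records power-state requests instead of touching hardware."""
    def __init__(self):
        self.requests = []

    def set_power_state(self, node_uuid, target):
        self.requests.append((node_uuid, target))

def nova_reboot(ironic, node_uuid, reboot_type="SOFT"):
    # The interesting logic is this mapping from Nova's reboot types
    # to Ironic's power targets; a unit-style test can pin it down.
    target = "soft rebooting" if reboot_type == "SOFT" else "rebooting"
    ironic.set_power_state(node_uuid, target)

client = FakeIronicClient()
nova_reboot(client, "node-0", reboot_type="HARD")
print(client.requests)  # → [('node-0', 'rebooting')]
```

As noted in the discussion, this only covers the "did we ask" half; a scenario test that actually pings the instance after reboot still needs a real (devstack) node.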
15:45:37 At some point, we should probably delineate stuff we test in Nova and stuff we test in Ironic
15:45:49 otherwise, our testing matrix will keep being chaotic.
15:46:14 dtantsur: that's a line of chat Sean Mooney and I had last cycle; a tempest scenario set specifically designed for the nova driver, and to run it more frequently
15:46:16 Yeah, I'm not a fan of doing an extra reboot when exercising rescue
15:46:32 and then doing that across however many tests that ends up triggering on
15:46:34 but realistically I'm already majorly overcommitted this cycle
15:47:37 I guess this needs a longer discussion and involving the nova team
15:47:37 we need to see if we can find the resources during this cycle
15:47:46 well
15:47:59 They are not going to really grasp the current scale
15:48:10 Sean is sorta our ambassador to the nova team in some ways these days
15:48:20 but I don't wanna engage a chat at that level if I don't have time to take action items outta it
15:48:29 yep, exactly
15:48:35 but we do need to likely take a look and sort out if we've got appropriate coverage with the plugin
15:48:40 * JayF has had a problem with saying yes more times than he delivered a "yes" result
15:50:12 we know this feeling
15:50:27 we can probably reevaluate the situation during the cycle
15:51:50 looks like we're closing
15:51:50 thanks everyone
15:51:55 #endmeeting