15:00:05 #startmeeting ironic
15:00:05 Meeting started Mon Jul 17 15:00:05 2023 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:05 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:05 The meeting name has been set to 'ironic'
15:00:08 o/
15:00:08 o/
15:00:31 welcome everyone to our weekly meeting!
15:00:31 I'll be your host for today :)
15:00:51 The meeting agenda can be found here:
15:00:51 #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
15:01:03 #topic Announcements/Reminder
15:01:37 we announced last week that the next PTG will take place virtually October 23-27, 2023
15:02:20 going to remove the reminder after this meeting
15:02:20 we'll remind everyone when we're closer to the date
15:02:48 #note usual friendly reminder to review patches tagged #ironic-week-prio, and tag your patches for priority review
15:03:38 I'm leaving the bobcat timeline for reference in the reminder section
15:04:03 any other announcements/reminders today?
15:04:28 I worked stupidly late on Friday night on a decorator for sqlite issues
15:04:35 lol
15:04:49 I *think* it works, but I've not seen CI give me a resource contention issue to generate a "database is locked" error yet today
15:05:00 unit test wise, it definitely works!
15:05:06 \o/
15:05:11 let
15:05:19 :D
15:05:36 I guess we need to review it and recheck a couple of times
15:05:38 it has been tagged as a prio review, any reviews/feedback would be appreciated
15:05:40 ++
15:05:42 great
15:05:51 thanks TheJulia :)
15:05:53 yeah, I think I'm on the 3rd stupidly minor change today, so hopefully that is helping
15:06:33 (I'm silently skipping the action items review as there are none from last time)
15:06:55 since we started with that patch let's go to
15:06:55 #topic Review Ironic CI Status
15:07:18 so we're back to jammy, except for the snmp pxe job
15:07:36 and still battling the sqlite shenanigans
15:07:45 yup, I have no idea why it is failing, but I don't remember the exact reason we held it back to begin with
15:08:23 I don't know, I'll try to make time this week to try and replicate, downstream permitting
15:08:42 also found an issue with the new tinycore 14.x in the standalone job
15:08:51 the one I mentioned before
15:09:23 output of efibootmgr: "EFI variables are not supported on this system"
15:09:39 oh well
15:09:48 we're probably not in a rush for that
15:10:06 anything else to mention for the CI status?
15:10:47 ok, moving on
15:10:47 #topic 2023.2 Workstream
15:10:47 #link https://etherpad.opendev.org/p/IronicWorkstreams2023.2
15:11:08 any updates to share?
15:11:38 feedback would be appreciated on the Firmware Update patches =)
15:11:45 I've updated the patch to support vendor passthru methods as steps
15:11:49 and it now passes CI \o/
15:12:10 iurygregory: yeah, was going to mention that, I have some time this week, I'll put it at the top of my list
15:12:23 TheJulia: great!
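The sqlite lock-handling decorator mentioned in the announcements above was not pasted into the meeting. A minimal, illustrative sketch of the general idea (assuming SQLAlchemy is in use; this is not the actual patch under review) would retry a database call when sqlite reports "database is locked" and re-raise anything else:

    # Illustrative only: retry a DB-touching callable when sqlite reports
    # a locked database; any other OperationalError is re-raised untouched.
    import functools
    import time

    from sqlalchemy import exc as sa_exc

    def retry_on_sqlite_lock(max_attempts=5, delay=0.5):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except sa_exc.OperationalError as e:
                        if ('database is locked' not in str(e)
                                or attempt == max_attempts):
                            raise
                        time.sleep(delay)
            return wrapper
        return decorator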
15:12:31 hopefully get back to service steps this week
15:12:35 * TheJulia hopes
15:12:35 tks rpittau o/
15:12:43 but CI is obviously the priority
15:12:51 ++
15:13:01 I will take a look at patches today for CI
15:14:02 ok, great
15:14:02 I think we're good
15:14:14 #topic RFE review
15:14:25 JayF left a couple of notes
15:14:42 So, Jay also put the same link in on both
15:14:42 so the first one is https://bugs.launchpad.net/ironic/+bug/2027688
15:14:48 ah yeah :D
15:14:59 * TheJulia hands JayF coffee
15:15:22 I left a comment on the first one
15:15:25 the second one is https://bugs.launchpad.net/ironic/+bug/2027690
15:15:29 I'll update the agenda
15:15:50 I think the idea is good overall, but we need to better understand where the delineation is, and even then, some of the things that fail are deep in the weeds
15:15:59 and you only see the issue when you're at that step, deep in the weeds
15:16:16 in other words, more discussion/clarification required
15:16:19 mmmm I agree
15:16:29 since we can't go check switches we know *nothing* about
15:16:51 but if we can provide better failure clarity, or "identify networking failed outright" mid-deploy, then we don't have to time out completely
15:17:29 ++, I like the rfe, but we will need further discussion about it
15:18:26 so yeah, I agree with the second rfe, we'll need it as an explicit spec and we'll need to figure out how to handle permissions
15:18:44 it might just be we pre-flight validate all permission rights and then cache the step data from the templates
15:18:48 * iurygregory is reading the second
15:19:06 but, that requires a human putting their brain into the deploy templates code
15:19:07 checking early for the permissions would be best IMHO
15:19:19 *also* custom clean steps would be a thing to consider
15:19:27 maybe we just... permit them to be referenced, dunno
15:19:33 that is a later idea I guess
15:19:48 yeah
15:20:44 once we define the templates, it shouldn't be too hard to expand with custom clean steps
15:20:54 yeah
15:21:14 I mainly wanted feedback on the interfaces I laid out there; I intend to specify the implementation just to keep my thoughts straight
15:21:14 ++
15:21:46 right now you need to have system privileges for the deploy template endpoint if memory serves, so if we take the same pattern, add fields, begin to wire everything together, and then change the RBAC policy, it should be a good overall approach
15:22:20 we have the patterns already spelled out in nodes fairly cleanly too
15:22:34 and allocations
15:22:37 right!
15:22:54 because you can have an allocation owner
15:23:01 (but not lessee, since that model doesn't exist there)
15:24:29 we're probably going to discuss both RFEs further, but they both look good to me
15:25:17 yeah, the first one just needs some more "what are the specific failure cases we're trying to detect" to provide greater clarity
15:25:41 probably clarify some aspects and then finalize during the PTG?
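For the second RFE, the owner/lessee pattern already spelled out for nodes and allocations roughly translates into policy rules like the following. This is a hypothetical sketch with illustrative rule names and check strings, assuming deploy templates grow owner/lessee fields as the RFE proposes; it is not the actual proposal:

    from oslo_policy import policy

    deploy_template_rules = [
        # Roughly today's behaviour: system-scoped access only.
        policy.DocumentedRuleDefault(
            name='baremetal:deploy_template:get',
            check_str='role:reader and system_scope:all',
            description='Retrieve a single deploy template record',
            operations=[{'path': '/deploy_templates/{deploy_template_id}',
                         'method': 'GET'}],
        ),
        # Hypothetical project-scoped variant mirroring the node and
        # allocation owner/lessee checks.
        policy.DocumentedRuleDefault(
            name='baremetal:deploy_template:get:project',
            check_str=('role:reader and '
                       '(project_id:%(deploy_template.owner)s or '
                       'project_id:%(deploy_template.lessee)s)'),
            description='Retrieve a deploy template owned or leased by '
                        'the requesting project (illustrative only)',
            operations=[{'path': '/deploy_templates/{deploy_template_id}',
                         'method': 'GET'}],
        ),
    ]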
15:25:49 because honestly, some areas we do a bad job with today, and we should make that better since they could still fail there even if neutron has a thing saying "yes, I can log in to the switch"
15:27:20 I don't think, for the precheck idea, we're looking for anything perfect, just to catch anything obviously broken
15:27:39 it was suggested by johnthetubaguy that for ilo, for instance, there's an internal BMC status and you could fail if that wasn't 'green'
15:28:08 I *suspect* the big one is the "bmc is half broken" weird case, and I'm not sure we actually test a swift tempurl we create...
15:28:15 and I know with the hardware I ran at Rackspace, even just an ipmi power status would've been enough to indicate if our most common failure modes were active
15:28:20 yellow can be a thing depending on what you are looking at in iLO
15:28:21 XD
15:30:27 I guess one question might end up being, are there *really* operators still turning off power sync at scale? We went from 1 to 8 power sync workers. Of course, that is also ipmi
15:31:33 Anyway, that is beyond the scope of the meeting, just a question to get a datapoint really
15:31:58 let's remember this (and more) during future discussions :)
15:32:06 The more I think of it, the more I wonder if it's just a flag to validation to say "no really, actually validate"
15:32:26 and have config to change default behavior for when Ironic does it
15:33:09 to at least try and fail faster, I guess
15:33:17 I also have a use case
15:33:21 of failing a rebuild before the disk gets nuked
15:33:29 yeah, I guess that's the point (fail faster)
15:33:39 so like e.g. a Ceph node wouldn't drop out of the cluster for longer than needed
15:34:02 if we can prevent that even in 25% of failure cases, it reduces operator pain because they don't have to go $bmc_fix_period without their node in a usable state
15:34:20 JayF: I think additional details or context might be needed there since that might also just be a defect
15:34:56 or, maybe something we've done work on, and a lack of failure awareness is causing the wrong recovery path to be taken by humans
15:35:00 that's sorta the theme of both of these rfes; allowing people with that kind of "I have a sensitive cluster" use case to do their own scheduling of maintenance, and reducing the number of times we'd have them in a failed-down state
15:36:20 anyway, let's keep talking about this one after the meeting, I think we need to better understand your use case and what spawned the issue to begin with, if that makes sense
15:36:53 specifically, if I'm trying to rebuild [to update] a node, and somehow I'm wiping all of the ceph volumes out
15:37:18 alright, moving on!
15:37:25 #topic Open Discussion
15:38:00 I had https://bugzilla.redhat.com/show_bug.cgi?id=2222981 pop up in my inbox this morning!
15:38:06 it's been great running the meetings again this time and the last week :)
15:38:06 next week JayF will be back!
15:38:57 #note Overcloud deploy fails when mounting config drive on 4k disks
15:39:03 #link https://bugzilla.redhat.com/show_bug.cgi?id=2222981
15:41:02 I'm looking at cloud-init code to see if there is a happy path forward, but I suspect we may need to locally repack the config drive as an intermediate solution
15:44:06 there might be, it just depends on how long it has *really* been an "or", and we'll likely need to fix glean
15:44:37 Merged openstack/bifrost master: Refactor use of include_vars https://review.opendev.org/c/openstack/bifrost/+/855807
15:44:42 I'm wondering if repack would cause performance issues (but it would only be in the 4k scenario, right?)
15:45:40 well, we would have to open the file on a loopback... if we even can...
15:45:48 and then write out as a brand new filesystem
15:45:51 so... dunno
15:46:00 gotcha
15:46:13 ¯\_(ツ)_/¯
15:49:33 I don't see a "straight" solution honestly
15:50:41 yeah, since we support binary payloads
15:51:05 I'll dig more, and write an upstream bug up
15:51:12 thanks
15:51:53 ack
15:52:59 alright, anything else for open discussion?
15:53:34 thanks everyone!
15:53:34 #endmeeting
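For reference on the "open the file on a loopback and write out a brand new filesystem" idea discussed above, a rough back-of-the-envelope sketch follows. It assumes the config drive arrives as a gzipped, base64-encoded ISO9660 image, that root is available for loop mounts, and that vfat is an acceptable target filesystem; it is illustrative only, not the fix being proposed:

    # Rough sketch: decode the existing ISO9660 config drive, loop-mount it
    # read-only, and copy its tree into a freshly created vfat image labelled
    # config-2 so that a 4k-sector disk can mount it.
    import base64
    import gzip
    import os
    import subprocess
    import tempfile

    def repack_configdrive_as_vfat(configdrive_b64, out_path, size_mb=64):
        iso_bytes = gzip.decompress(base64.b64decode(configdrive_b64))
        with tempfile.TemporaryDirectory() as workdir:
            iso_path = os.path.join(workdir, 'configdrive.iso')
            iso_mnt = os.path.join(workdir, 'iso')
            vfat_mnt = os.path.join(workdir, 'vfat')
            os.makedirs(iso_mnt)
            os.makedirs(vfat_mnt)
            with open(iso_path, 'wb') as f:
                f.write(iso_bytes)
            # Create and format the replacement image.
            subprocess.run(['truncate', '-s', '%dM' % size_mb, out_path],
                           check=True)
            subprocess.run(['mkfs.vfat', '-n', 'config-2', out_path],
                           check=True)
            subprocess.run(['mount', '-o', 'loop,ro', iso_path, iso_mnt],
                           check=True)
            subprocess.run(['mount', '-o', 'loop', out_path, vfat_mnt],
                           check=True)
            try:
                # Copy the metadata tree across unchanged.
                subprocess.run('cp -a %s/. %s/' % (iso_mnt, vfat_mnt),
                               shell=True, check=True)
            finally:
                subprocess.run(['umount', vfat_mnt], check=True)
                subprocess.run(['umount', iso_mnt], check=True)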