15:00:16 #startmeeting ironic
15:00:16 Meeting started Mon Jul 8 15:00:16 2024 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:16 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:16 The meeting name has been set to 'ironic'
15:00:21 o/
15:00:22 o/
15:00:26 Hello everyone!
15:00:26 Welcome to our weekly meeting!
15:00:26 The meeting agenda can be found here:
15:00:26 #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_July_08.2C_2024
15:00:29 o/
15:00:37 o/
15:00:47 o/
15:00:58 o/
15:00:58 #topic Announcements/Reminders
15:01:14 #info Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio
15:01:14 #link https://tinyurl.com/ironic-weekly-prio-dash
15:01:31 #info 2024.2 Dalmatian Release Schedule
15:01:31 #link https://releases.openstack.org/dalmatian/schedule.html
15:02:46 #info the next OpenInfra PTG will take place October 21-25, 2024, virtually! Registration is now open!
15:02:46 #link https://ptg.openinfra.dev/
15:02:58 So, we should make the 2nd set of bugfix releases soon(ish)?
15:03:15 dtantsur: beginning of August
15:03:29 oh? must have miscalculated
15:03:50 usually between -9 and -8 weeks to go
15:04:47 we're close :)
15:05:08 anything else to remind/announce ?
15:06:07 looks like we're good, moving on!
15:06:14 #topic Review Ironic CI status
15:06:40 any updates on CI? it looked stable last week, but I may have missed something
15:06:58 So I noticed one of the tempest tests, the standalone ones, was not doing exactly what we thought
15:07:16 I seemingly lost a line in a rebase somewhere along the way, so I'm working to fix that since it has uncovered other issues
15:07:29 related changes are tagged ironic-week-prio
15:07:35 ok, thanks!
15:07:45 the only other thing to note, anaconda is taking a bit longer than we expect and is timing out
15:08:00 ah
15:08:00 is this a recent thing?
15:08:21 yeah, looks like it. Looks like it might need a little more ram/cpu, but ultimately it is just a slow job to begin with
15:08:39 ok, something to keep an eye on then
15:08:57 on the plus side, if I get pxe grub2 running under the ironic-redfish-standalone jobs, we can drop a main job setup which will free resources
15:09:06 good
15:09:28 end of last week, it was fairly frequent, looks like it has been happier over the weekend. Worst comes to worst I'm okay with marking anaconda non-voting until we can sort it out
15:09:45 just wondering what could've changed recently
15:10:04 ok, thanks TheJulia
15:10:32 any more updates on CI ?
15:11:26 oh, it is already non-voting on ironic
15:11:32 right
15:11:33 it's still voting on ironic-tempest-plugin
15:11:40 and it has been doing it for about two weeks
15:11:42 that's why we probably haven't noticed much
15:11:46 I can look at tuning it
15:11:56 That might speed it up a little bit
15:11:59 thanks :)
15:12:31 alright, moving on
15:12:37 #topic Discussion topics
15:12:45 I have one thing
15:12:55 #info pysnmp update status
15:13:08 I haven't received any answer from the maintainer of pysnmp-lextudio so far
15:13:29 I think the last comment was Thursday/Friday of last week?
15:13:30 and I don't really have time to move forward with the changes
15:13:39 yeah... :(
15:13:47 oh so he replied ?
15:14:00 I think I saw your last comment was last Thursday/Friday
15:14:54 I left 2 comments
15:15:00 one 3 weeks ago and another one last week
15:15:07 no answers so far
15:15:08 oh, hmm
15:15:10 :(
15:15:33 Can we catch up on context here?
15:15:40 Was this a migration we were doing out of necessity?
15:15:46 my tests point to a problem with how asyncio interacts with Python threading, but I don't think I will have the time to look further into it in the next weeks
15:16:11 JayF: kind of yeah, Python version support mainly
15:16:31 this is the testing patch for virtualpdu
15:16:31 #link https://review.opendev.org/c/openstack/virtualpdu/+/922158
15:16:58 if anyone has time feel free to hijack it, probably some methods need to be reworked to be able to work with asyncio
15:17:52 I found a similar issue that was fixed here https://github.com/home-assistant/core/pull/112815
15:18:03 so the issue you're hitting is in virtualpdu specifically at this time?
15:18:11 yes
15:18:37 okay, that is good context to have
15:18:38 hmmm
15:19:36 I don't know how my availability really looks over the next two weeks, but I might be able to take a look. I'm dealing with folks' vacations and then traveling the week after
15:20:03 thanks TheJulia, let me know if you need any more info
15:20:30 k, I'll try to give it some time over the next couple fo days
15:20:31 of days
15:20:38 ack, thanks
15:21:19 any more comments/questions ?
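For reference on the asyncio discussion above: the snippet below is a minimal sketch, assuming the newer pysnmp API is asyncio-only as the discussion suggests, of one common way to call an asyncio coroutine from code that runs in ordinary threads (as virtualpdu and its tests do). The snmp_get() coroutine is a hypothetical placeholder for the real pysnmp call; only the threading bridge is the point here.

    # A minimal sketch, not taken from the virtualpdu patch: run a single
    # background event loop and hand coroutines to it from threaded callers,
    # instead of creating a new loop per thread, which is where asyncio and
    # threading tend to clash.
    import asyncio
    import threading

    _loop = asyncio.new_event_loop()
    threading.Thread(target=_loop.run_forever, daemon=True).start()

    async def snmp_get(oid):
        # Placeholder for the real asyncio-based pysnmp call.
        await asyncio.sleep(0)
        return 'value-for-%s' % oid

    def snmp_get_blocking(oid, timeout=10):
        # Safe to call from any thread; blocks until the coroutine finishes.
        future = asyncio.run_coroutine_threadsafe(snmp_get(oid), _loop)
        return future.result(timeout=timeout)

    print(snmp_get_blocking('1.3.6.1.2.1.1.1.0'))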
15:21:37 I have created a proposal - https://bugs.launchpad.net/ironic/+bug/2072307 , for this PR: https://review.opendev.org/c/openstack/ironic/+/921657
15:21:56 hroy: that is in the RFE discussion already I think
15:21:59 This is regarding the virtual media GET API
15:22:10 ah, okay
15:22:15 or maybe not :D
15:22:17 I forgot to add it
15:22:22 I'll add it now and we can discuss it later
15:22:32 no issues
15:22:35 got it, thanks!
15:22:41 That RFE looks reasonable to me fwiw
15:22:50 thanks for filing the bug hroy
15:23:24 I'm mildly worried about a synchronous action in the API..
15:23:28 but I also don't know how to avoid it
15:23:57 There are simply some cases we can't entirely avoid it without massive engineering efforts. I guess the question I have is "why is it needed"
15:24:16 and what piece of the puzzle owns the lower layer
15:24:27 TheJulia: that's the only reliable way to learn if the attachment via API actually succeeded
15:24:35 fair enough
15:24:37 otherwise, we're relying on the potentially racy last_error
15:24:41 yup
15:24:48 makes sense to expose status
15:25:35 This isn't the first sync request that goes back to a node is it?
15:25:44 no, boot mode does it today
15:26:03 and we've proposed a couple other light touch aspects in the past which sort of have stalled out on the same basic concern
15:26:13 I don't have any particular concern
15:26:33 as long as everything fails/times out properly in the unhappy cases, it's fine I think
15:26:48 Given it is inherently redfish, I have fewer concerns, since it is definitely not ipmi
15:27:20 if there's no objection I will just mark it as approved
15:27:59 none, just ask for a slightly more verbose RFE covering that whole "why" question, should it come up again in the future
15:28:14 yep, makes sense
15:28:28 alright, approved
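As a rough illustration of what the proposed virtual media GET would query behind the scenes, here is a minimal sketch using sushy directly; the BMC URL, credentials and system path are placeholders, and the attribute names are assumed from sushy's VirtualMedia resource rather than taken from the proposed patch.

    # A minimal sketch of reading live virtual media state over Redfish with
    # sushy; the endpoint, credentials and paths below are placeholders.
    import sushy

    client = sushy.Sushy('https://bmc.example.com/redfish/v1',
                         username='admin', password='secret', verify=False)
    system = client.get_system('/redfish/v1/Systems/1')

    for manager in system.managers:
        for vmedia in manager.virtual_media.get_members():
            # Reporting the live BMC state avoids depending on the potentially
            # racy last_error field to tell whether an attach succeeded.
            print(vmedia.identity, vmedia.media_types, vmedia.inserted, vmedia.image)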
15:28:44 do we want to have a look at the other RFE since we're here?
15:28:58 #link https://bugs.launchpad.net/ironic/+bug/2071741
15:29:24 no objections
15:30:06 we already have a patch proposed for this one and it's ready AFAICS
15:30:20 none, it is still possible for operators to run their own, I think the conditional got "slightly" mis-used in the depths of the agent driver as well, but removal of the option will clean that up
15:30:40 alright, approved
15:30:47 It's not clear to me that using [agent]manage_agent_boot=false is a valid setup at all at this point
15:31:04 and I think the original use case could be served by a custom networkmanager
15:31:46 yeah, nobody knows if it works at all
15:32:03 yeah
15:32:15 one more reason to deprecate it quickly :)
15:32:26 * dtantsur git push -f
15:32:30 :D
15:32:32 I already have the patch up that makes it deprecated
15:32:34 heh
15:32:39 we'll have to wait two cycles to remove
15:32:56 will we?
15:33:08 JayF: yep, thanks, mentioned that before
15:33:08 #link https://review.opendev.org/c/openstack/ironic/+/923328
15:33:08 we can remove the W-1
15:33:09 we have to have it billboarded+working in a SLURP release
15:33:26 Okay, let's not have this SLURP thing as a religion
15:33:34 dtantsur: ++
15:33:45 It's not a religion, it's OpenStack stable policy, and promises our users and operators rely upon.
15:33:49 This feature is not used and is very probably broken, and we won't fix any bugs with it
15:34:06 Marking it deprecated at least communicates that
15:34:09 Please don't make me regret giving up the fight against this SLURP thing...
15:34:13 That's a reasonable counterpoint, but we also let it sit there rotting for years, why does one more matter so much? lol
15:34:26 Complexity is not something we lack in the agent code
15:34:45 let's just make sure there are plenty of inline comments
15:34:47 What is a SLURP release
15:35:00 so we know what is going away in case we need to touch that code in the next year
15:35:05 ... who am I kidding, when
15:35:05 masghar: a very stupid name for longer term support releases that allow upgrades jumping over 1 release
15:35:08 masghar: https://docs.openstack.org/project-team-guide/release-cadence-adjustment.html
15:35:20 thanks!
15:35:22 dtantsur: masghar: the dumb name is because the good name was cratered by lawyers lol
15:35:31 those lawyers!
15:35:54 Hey, I like those lawyers
15:36:00 They make my job easier! :)
15:36:06 gosh :)
15:36:12 Realistically, we always have a certain freedom in interpreting the policies
15:36:29 And "we keep a broken feature because the paper says so" is not a strong argument in my book
15:36:45 I think the two questions, "does it really work?" and "should a feature that turns off a large part of ironic be subject to deprecation policy?", are both reasonable arguments
15:36:49 let's maybe keep that for open discussion :)
15:36:53 I guess we can move on now
15:36:57 but let's just approve the RFE, get in the deprecation warning
15:37:04 and argue about the removal patch next cycle when it exists
15:37:14 Well, true
15:37:15 I assume we want to at least wait *one* cycle
15:37:45 * dtantsur nods
15:38:03 if we find out it is horribly broken to begin with, just rip it out.
15:38:16 but next cycle
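For context, the deprecation patch linked above targets [agent]manage_agent_boot. The snippet below is a minimal, illustrative sketch of how such an option is typically marked for removal with oslo.config; it is not the content of the actual patch, and the help and reason strings are made up.

    # A minimal sketch of deprecating a boolean option with oslo.config.
    from oslo_config import cfg

    opts = [
        cfg.BoolOpt('manage_agent_boot',
                    default=True,
                    deprecated_for_removal=True,
                    deprecated_reason='Setting this option to False is '
                                      'believed to be unused and untested; '
                                      'it will be removed in a future release.',
                    help='Whether Ironic will manage booting of the agent '
                         'ramdisk.'),
    ]

    CONF = cfg.CONF
    CONF.register_opts(opts, group='agent')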
15:38:47 alright, let's move forward with the meeting
15:38:47 #topic Bug Deputy Updates
15:38:57 I was bug deputy last week
15:39:07 we got 2 new RFEs (that we've discussed already)
15:39:41 4 new bugs, with one in particular that I'd like help to triage, or rather, how to approach it
15:39:41 #link https://bugs.launchpad.net/ironic-python-agent/+bug/2071935
15:39:47 I'll also note: I filed a handful (14? 15?) of bugs that came out of the docs-audit. They are tagged "docs-audit-2024" and Reverbverbverb has a very small amount of time left on contract for reviews and clarifying comments if anyone has any questions about those bugs or wants to work on one.
15:40:07 It's my expectation they'll be the foundation for a future outreachy/mlh style project
15:40:28 ++
15:40:34 JayF: thanks
15:40:37 rpittau: that seems like a valid bug to me, probably medium?
15:40:41 and thanks again to Reverbverbverb
15:41:04 unicode in serial numbers, absolutely and perfectly cursed
15:41:10 JayF: yeah, just wondering how to approach it in a sane way
15:41:13 dtantsur: exactly
15:42:02 well, there are two cases right
15:42:19 well wait
15:42:19 wow
15:42:24 this is a serial number, baked into the disk
15:42:24 heh.... :)
15:42:27 yep
15:42:27 not like a label or something
15:42:28 the actual problem is in ironic-lib btw
15:42:52 dtantsur: yeah
15:43:03 it's the execute method
15:43:04 .... hmmmmmmm
15:43:11 that's why I'm puzzled
15:44:04 so, I guess there are two things
15:44:17 I wonder if it's also broken unicode...
15:44:38 1) unicode chars should be valid, and we're running the command as utf8, the data structure has utf-16... we likely need to fix the command invocation to handle utf16 in the data
15:44:44 well I would not expect anyone to put unicode or broken utf8 in a serial number honestly
15:44:46 2) That character itself is likely not valid
15:44:58 that log is ... podman running IPA?!
15:45:00 https://www.compart.com/en/unicode/U+DCFF is repeated
15:45:12 JayF: welcome to OpenShift, we have weird stuffs :D
15:45:15 :D
15:45:20 JayF I was wondering the same
15:45:33 https://github.com/openshift/ironic-agent-image
15:45:35 dtantsur: full circle, I used to be the IPA-container-weirdo
15:45:41 hehe
15:46:04 I think this would be one of those cases we likely need to verify "what is the actual serial number reported via other means" as well.
15:46:10 specifically, we get "2HC015KJ0000" before the unicode
15:46:14 4 zeros is... suspect
15:46:27 The odds are against that actually being the case
15:46:33 I was wondering if maybe the serial is not correct for some reason
15:46:43 Yeah, the unicode part is garbage most likely
15:46:48 My curiosity is: a newer version of the tool in question
15:46:55 or at least a cursory look at the changelog
15:47:24 wouldn't be the first time an lsblk behavior change benefited or harmed our use cases
15:47:31 definitely garbage characters, the question is how did they get there and why, but realistically we likely need to expect any command execution can return utf-16 at this point
15:47:47 and trying to force the encoding to 8 is going to have some bad side effects
15:47:49 ok, I think I have enough to move forward on 2 fronts
15:47:51 we saw similar with efibootmgr
15:47:57 since its data is technically utf-16
15:48:08 I guess the question is what can we do if an important tool returns junk
15:48:35 I would truncate the invalid characters if we recognize them as formatting/control chars
15:48:40 I haven't seen any other reports beside this one
15:49:01 but, running in utf-16 is sort of the first step along with identifying what is actually the correct serial
15:49:07 It'd also be nice to get written down the hardware that is doing this
15:49:09 ok
15:49:16 Validate reality versus our perception of reality
15:49:20 yeah, I guess we need details
15:49:25 e.g. if we get it reported again, maybe knowing it's WD Blacks released in May 2023 or something
15:49:29 JayF: LITEON IT ECE-12
15:49:39 wut
15:49:46 e.g. if we get it reported again, maybe knowing it's WD Blacks released in May 2023 or something/
15:49:51 ugh
15:49:52 embedded sata drive
15:49:55 is that a cd-rom from 2005?
15:49:58 :D
15:50:29 I found an image of that model, the serial number is only ASCII characters confirmed :P
15:50:42 time to implement label-OCR-inspection technology in IPA /s
15:50:50 I'll get more details anyway
15:51:07 thanks all
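One note on the U+DCFF characters in that bug: code points in the U+DC00-U+DFFF range are lone surrogates, which is what Python's surrogateescape error handler emits for undecodable bytes (0xFF becomes U+DCFF). Whether that is what actually happened in ironic-lib is an assumption, but the sketch below shows how such garbage can be detected and stripped from a decoded value like a serial number before it is used.

    # A minimal sketch, not the ironic-lib fix: drop unpaired surrogate code
    # points from a decoded string while keeping legitimate unicode.
    def sanitize(value):
        return ''.join(ch for ch in value if not 0xD800 <= ord(ch) <= 0xDFFF)

    raw = b'2HC015KJ0000' + b'\xff' * 4
    decoded = raw.decode('utf-8', errors='surrogateescape')
    print(repr(decoded))            # '2HC015KJ0000\udcff\udcff\udcff\udcff'
    print(repr(sanitize(decoded)))  # '2HC015KJ0000'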
15:51:16 ... rpittau I can't help but wonder if there is a raid controller in the middle which is doing "stuff"
15:51:23 mmmm
15:51:34 more info to get
15:51:36 I had a case recently where basically the serial/wwn and friends were all faked out by the raid controller
15:51:43 if you rebuild the reset, it all changed
15:51:49 s/reset/raid set/
15:51:54 so nice of it
15:51:58 very much so
15:52:01 oh gosh
15:52:21 The customer kept insisting on doing 2x raid0s and they got the same WWN reported by the raid controller
15:52:31 and of course they filed a bug saying it is the software's fault :)
15:52:43 ah, I think I've seen this discussion
15:52:45 yeah, the software on that raid controller :D
15:52:50 ++
15:52:53 yup
15:52:58 yeah
15:53:08 it's all quite a problem for root device hints..
15:53:13 And really, it is actually a valid thing for the raid controller to do
15:53:39 just yeah, it and human expectations of uniqueness are problematic for root device hints
15:53:47 well, if they put some thought into it, they could do ~ a hash of both WWNs and the RAID level or something like that
15:53:54 Since a SAN can say everything is the same WWN and have unique LUNs or path numbers
15:53:56 to make it predictable for the same combination of drives
15:54:07 * dtantsur is dreaming of a world where things make sense
15:54:11 heh
15:54:34 I think my guidance was hctl-based because it was at least consistent
15:54:49 everything else would change between reboots
15:54:57 or at least, could change
15:55:14 * TheJulia notes it is too early for whiskey
15:55:20 not here :D
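Related to the hctl guidance above: root device hints live in the node's properties, so switching a node from a WWN-based hint to hctl is a small node update. A minimal sketch against the bare REST API follows; the endpoint, node UUID, hint value and API microversion are placeholders, and authentication headers are omitted.

    # A minimal sketch of setting an hctl root device hint via Ironic's REST
    # API; values are placeholders and auth headers are omitted.
    import requests

    patch = [{
        'op': 'add',
        'path': '/properties/root_device',
        # hctl tends to stay consistent where WWNs may be faked or duplicated
        # by a RAID controller.
        'value': {'hctl': '1:0:0:0'},
    }]
    resp = requests.patch(
        'http://ironic.example.com:6385/v1/nodes/NODE_UUID',
        json=patch,
        headers={'X-OpenStack-Ironic-API-Version': '1.72'},
    )
    resp.raise_for_status()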
15:55:45 btw cid you're the bug deputy for this week, good luck! :)
15:56:03 Oh question!
15:56:07 sure!
15:56:07 tks
15:56:30 cid: feel free to ping me here or in slack if you have any bugs that have you perplexed
15:56:34 ... should we consider deprecating the ilo stuffs? specifically thinking that if we deprecate, removal is in a year
15:56:43 mmmm
15:56:50 ilo4 is still used in the wild, unfortunately
15:56:51 ilo4 is kind of dead
15:56:53 We said pretty clearly we'd keep support for those as long as hardware that used it was in place.
15:57:01 heh
15:57:03 JayF: I will
15:57:07 I know multiple different Ironic shops that use the ilo driver with ilo-driver-exclusive hardware.
15:57:08 Interesting perception differences
15:57:16 I mean technically
15:57:24 but in practice we still see usage
15:57:30 Metal3's quick start literally uses ilo4 as an example :D
15:57:42 ilo4 has been dead for more than 1 year
15:58:00 "dead" means what? Unsupported by HP? No longer sold?
15:58:15 Would someone on a 5-7 year hardware lifecycle still have them?
15:58:15 means unsupported, latest firmware version is from March 2023
15:58:21 Okay, I guess nevermind then, just trying to think out into the future far enough and it sort of made me wonder since we dropped the wsman stuffs, but a forward path and all is the unknown
15:58:41 rpittau: and that last firmware drop was like a year after EoS right?
15:58:42 TheJulia: if we deprecate in a slurp release (e.g. if we deprecated a driver next release), we can remove in the following
15:58:47 TheJulia: correct
15:58:52 TheJulia: it's only "wait a whole year" when you try to deprecate in a between-slurps release
15:59:18 Okay, well, just something we should be mindful of as time moves forward
15:59:19 This is the very case where I think "a whole year" may be justified
15:59:34 because people may not be able to buy new hardware in 6 months
15:59:40 Yeah, we have almost always treated driver removals as long deprecations
15:59:43 I could be +1 to *marking it deprecated*
15:59:48 with no timeline set on removal
15:59:53 as an informational thing
15:59:58 I think that is fair
16:00:07 Right. Just as a heads-up for operators: you're dealing with obsolete technology.
16:00:11 I think the whole "we still see some active use"
16:00:21 I'm absolutely and totally for deprecating it :)
16:00:29 it makes it a bit harder for us since the migration is not as clear cut perception wise
16:00:31 I think the weird bonus piece of this which we all include in the math indirectly
16:00:33 Though keep in mind: the ilo5 driver uses ilo4 bits
16:00:49 so we don't just get rid of proliantutils
16:00:50 yup, and proliantutils hides the details
16:00:52 is that the question isn't "who uses ilo4 hardware today?", it's "Who will be using ilo4 hardware when they upgrade into Dalmatian?"
it's "Who will be using ilo4 hardware when they upgrade into Dalmation?" 16:00:55 which are different questions 16:02:09 The question is also what we'll realistically deprecate given that ilo5 is a superset of ilo4 16:02:18 ref https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/ilo.py 16:02:43 When TheJulia said "deprecate ilo driver", I assumed she meant *all ilo drivers* and migrate ilo5 users to redfish 16:02:54 there is minor if any benefit in doing it piecemeal afaict 16:02:59 That's a loss of functionality for existing hardware 16:03:05 ilo5 is still actively supported though 16:03:07 existing as in supported and actively used 16:03:15 e.g. you lose RAID 16:03:47 I guess we're sort of stuck until post ilo5 then 16:04:00 Is it very difficult for ilo5 users to just switch to redfish? 16:04:05 we should "deprecate" ilo4 in terms of set expectations 16:04:14 we're not going to magically fix bugs there in proliantutils 16:04:20 masghar: ilo5 does not support standard Redfish RAID (I think you were the one who discovered it ;) 16:04:20 ++ absolutely, and I could even see that being a good move for ilo5 16:04:30 ilo5... that might be in sushy for all we know and we should at least "look" 16:04:40 And the explain what/why/etc 16:04:41 dtantsur: Yes I think so 16:04:41 we *could* extract what we need from proliantutils and build a new driver on top of redfish 16:04:47 I also think we should consider pulling some of proliantutils into Ironic/sushy, like we have considered for sushy-oem-drac 16:04:54 bingo! 16:05:11 possibly. proliantutils is not a simple plugin on top of sushy 16:05:21 definitely not a simple plugin 16:05:28 it has lots of logic and makes assumptions 16:05:42 and it stil luses python six :) 16:05:42 i.e. if you point it at an edgeline machine it acts differently even though it has an "ilo" 16:05:44 Yeah, and some of those assumptions are invalid in newer ironics 16:05:50 yup 16:05:51 I fixed a bug in one recently 16:05:54 there is a possibility that the RAID stuff specifically is possible to extract 16:06:00 https://opendev.org/x/proliantutils/src/branch/master/proliantutils/redfish/resources/system/storage maybe 16:06:10 basically if there's no separate entity maintaining proliantutils 16:06:13 we're 5 past, I'm closing the meeting but we can keep the discussion open 16:06:14 but we still need the code inside 16:06:16 #endmeeting