15:00:06 #startmeeting ironic
15:00:06 Meeting started Mon Dec 2 15:00:06 2024 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:06 The meeting name has been set to 'ironic'
15:00:22 mmm I wonder if we'll have quorum today
15:00:26 anyway
15:00:31 Hello everyone!
15:00:31 Welcome to our weekly meeting!
15:00:31 The meeting agenda can be found here:
15:00:31 https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_December_02.2C_2024
15:00:42 let's give it a couple of minutes for people to join
15:01:17 o/
15:01:28 o/
15:01:49 We likely need to figure out our holiday meeting schedule
15:01:59 yeah, I was thinking the same
15:02:05 o/
15:02:12 o/
15:02:22 ok let's start
15:02:26 #topic Announcements/Reminders
15:02:40 #topic Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio:
15:02:40 #link https://tinyurl.com/ironic-weekly-prio-dash
15:03:08 there are some patches needing +W when any approver has a moment
15:03:56 o/
15:04:04 #topic 2025.1 Epoxy Release Schedule
15:04:04 #link https://releases.openstack.org/epoxy/schedule.html
15:04:39 we're at R-17, nothing to mention except I'm wondering if we need to do some releases
15:04:39 we had ironic and ipa last week, I will go through the other repos and see where we are
15:05:36 I expect, given holidays and all, the next few weeks will mainly be focus time for myself
15:05:53 and I have one more thing, I won't be available for the meeting next week as I'm traveling, any volunteer to run the meeting?
15:06:35 I can't since I'm also traveling
15:06:42 I might be able to
15:06:43 My calendar is clear if you want me to
15:06:53 I guess a question might be, how many folks will be available next Monday?
15:07:03 TheJulia, JayF, thanks, either of you is great :)
15:07:07 oh yeah
15:07:17 I'll make a reminder to run it next Monday. Why wouldn't we expect many people around?
15:07:20 I guess it will be at least 3 fewer people
15:07:36 Oh that's a good point. But I wonder if it's our last chance to have a meeting before the holiday, and I think we're technically supposed to have at least one a month
15:07:39 JayF: dtantsur, iurygregory and myself are traveling
15:07:52 we can have a last meeting the week after
15:08:00 then I guess we skip 2 meetings
15:08:07 the 23rd and the 30th
15:08:17 and we get back on the 6th
15:08:19 I'll note, while next week will be the ?9th?, the following week will be the 16th, and I might not be around
15:08:21 I will personally be out ... For exactly those two meetings
15:08:28 I'll be here on 23rd and 30th if anyone needs me
15:08:32 but not the next 2 weeks
15:09:00 Safe travels!
15:09:04 thanks :)
15:09:42 so tentative last meeting the 16th ?
15:09:58 or the 23rd? I may be able to make it
15:10:50 Let's do the 16th
15:10:58 I may partial week it, I dunno
15:11:17 perfect
15:11:17 I'll send an email out also as reminder/announcement
15:11:21 Yeah I like the idea of just saying the 16th is our only remaining meeting of the month. +1
15:11:36 cool :)
15:11:54 I really like that idea, skip next week, meet on the 16th, take over world
15:11:55 etc
15:12:12 Also, enables time for folks to focus on feature/work items they need to move forward
15:13:29 alright moving on
15:13:29 #topic Discussion topics
15:13:29 I have only one for today
15:13:37 which is more an announcement
15:13:52 #info CI migration to ubuntu noble has been completed
15:14:04 so far so good :D
15:14:26 anything else to discuss today?
15:14:41 I've got one item, if there is time/interest
15:14:44 servicing related
15:14:52 (we also have a good crowd for this topic)
15:14:57 janders: please go ahead :)
15:15:05 great :)
15:15:15 (in EMEA this week so easier to join this meeting)
15:15:18 o/ janders
15:15:32 so - iurygregory and I ran into some issues with firmware updates during servicing
15:15:47 the kind of issues I wanted to talk about are related to BMC responsiveness during/immediately after the update
15:16:16 Okay, what sort of issues?
15:16:21 and what sort of firmware update?
15:16:43 * iurygregory thanks HPE for saying servicing failed because the bmc wasn't accessible
15:16:54 HTTP error codes in responses (400s, 500s, generally things making no sense)
15:17:16 I think BMC firmware update was the more problematic case (which makes sense)
15:17:23 i know idracs can start spewing 500s if the FQDN is not set properly
15:17:47 but then what happens is update succeeds but Ironic thinks it failed cause it got a 400/500 response when BMC was booting up and talking garbage in the process
15:17:58 (if it remained silent and not responding it would have been OK)
15:18:04 https://paste.opendev.org/show/bdrsgYzFECwvq5O3hQPb/
15:18:19 this was the error in case someone is interested =)
15:18:34 but TL;DR I wonder if we should have some logic saying "during/after BMC firmware upgrade, disregard any 'bad' BMC responses for X seconds"
15:18:35 There is sort of a weird similar issue NobodyCam has encountered with his downstream where after we power cycle, the BMCs sometimes also just seem to pack up and go on vacation for a minute or two
15:19:08 in this case it was about 3min for me
15:19:21 Step wise, we likely need to... either implicitly or have an explicit step which is "hey, we're going to get garbage responses, let's hold off on the current action until the $thing is ready"
15:19:22 but yeah, seems similar
15:19:24 it's not an entirely new problem but the impact of such BMC (mis)behaviour is way worse in day2 ops than day1
15:19:43 it is annoying when it happens on a new node being provisioned
15:19:54 or in service
15:20:03 I have seen similar related to checking power states
15:20:07 because these are known workflows and... odd things happening are the beginning of the derailment
15:20:08 it is disruptive if someone has prod nodes in scheduled downtime (and overshoots the scheduled downtime due to this)
15:20:48 we almost need a "okay, I've got a thing going on", give the node some grace or something flag
15:20:54 or "don't take new actions, or... dunno"
15:21:00 TheJulia++
15:21:15 I guess I'm semi-struggling to figure out how we would fit it into the model and avoid consuming a locking task, but maybe the answer *is* to lock it
15:21:21 and hold a task
15:21:25 let me re-read the error Iury posted to see what Conductor was exactly trying to do when it crapped out
15:22:01 we almost need a "it is all okay" once "xyz state is achieved"
15:22:24 OK so in this case it seems like the call to BMC came from within the step
15:22:37 nobodycam's power issue makes me want to hold a lock, and have a countdown timer of sorts
15:22:41 but I wonder if it is possible that we hit issues with a periodic task or something
15:22:58 Well, if the task holds a lock the entire time, the task can't run.
15:23:02 TheJulia I probably need to give it some more thought but this makes sense to me
15:23:03 until the lock releases it
15:23:19 iurygregory dtantsur WDYT?
15:23:28 it can't really be a background periodic short of adding a bunch more interlocking complexity
15:23:34 because then step flows need to resume
15:23:48 we kind of need to actually block in these "we're doing a thing" cases
15:24:10 and in nobodycam's case we could just figure out some middle ground which could be turned on for power actions
15:24:10 yeah it doesn't sound unreasonable
15:24:19 to do this
15:24:26 I *think* his issue is post-cleaning or just after deployment, like the very very very last step
15:24:27 I think it makes sense
15:24:41 I've got a bug in launchpad which lays out that issue
15:24:55 but I think someone triaged it as incomplete and it expired
15:25:02 oh =(
15:25:41 I think this time we'll need to get to the bottom of it cause when people start using servicing in anger (and also through metal3) this is going to cause real damage
15:26:00 (overshooting maintenance windows for already-deployed nodes is first scenario that comes to mind but there will likely be others)
15:26:09 https://bugs.launchpad.net/ironic/+bug/2069074
15:26:41 Overshooting maintenance windows is inevitable
15:26:52 the key is to keep the train of process from derailing
15:27:04 That way it is not the train which is the root cause
15:27:17 "if a ironic is unable to connect to a nodes power source" - power source == BMC in this case?
15:27:31 yes
15:27:37 I *think*
15:27:47 this rings a bell, I think this is what crapped out inside the service_step when iurygregory and I were looking at it
15:27:48 they also have SNMP PDUs in that environment, aiui
15:28:05 oh, so basically same type of issue
15:28:06 (this level of detail is hidden under that last_error)
15:28:08 yeah
15:28:09 not during service, but cleaning in an HPE
15:28:23 but yeah same type of issue indeed
15:28:26 thank you for clarifying iurygregory
15:28:35 np =)
15:28:50 yeah, I think I semi-pinned the issue that I thought
15:29:04 yeah it feels like we're missing the "don't depend on responses from BMC while mucking around with its firmware" bit
15:29:13 in a few different scenarios
15:29:29 well, or in cases where the bmc might also be taking a chance to reset/reboot itself
15:29:40 at which point, it is no longer a stable entity until it returns to stability
15:30:15 ok so from our discussion today it feels 1) the issue is real and 2) holding a lock could be a possible solution - am I right here?
15:30:38 Well, holding a lock prevents things from moving forward
15:30:46 and prevents others from making state assumptions
15:30:57 or other conflicting instructions coming in
15:31:02 yeah
15:31:38 iurygregory: was https://paste.opendev.org/show/bdrsgYzFECwvq5O3hQPb/'s 400 a result of power state checking?
15:31:41 post-action
15:31:42 ?
15:32:43 Need to double check
15:32:54 I can re-run things later and gather some logs
15:33:34 so the lock would be requested by code inside the step performing firmware operation in this case (regardless of whether day1 or day2) and if BMC doesn't resume returning valid responses after X seconds we fail the step and release the lock?
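[editor's note] A minimal sketch of the "disregard 'bad' BMC responses for X seconds, then fail the step" idea raised at 15:18:34 and 15:33:34. This is not Ironic or sushy code: `wait_for_stable_bmc`, `BMCNotStable`, and the `probe` callable (anything that returns an HTTP status code or raises on connection failure) are hypothetical names, and the 180 second default only mirrors the "about 3min" observation above.

```python
import time
from typing import Callable, Optional


class BMCNotStable(Exception):
    """Raised when the BMC does not settle within the grace period."""


def wait_for_stable_bmc(
    probe: Callable[[], int],
    grace_seconds: float = 180.0,
    interval: float = 5.0,
    required_ok: int = 3,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll ``probe`` until it returns 2xx ``required_ok`` times in a row.

    ``probe`` is a hypothetical callable returning an HTTP status code.
    Returns the number of probes made; raises BMCNotStable once the
    grace period has elapsed without a stable streak.
    """
    deadline = clock() + grace_seconds
    consecutive_ok = 0
    attempts = 0
    while True:
        attempts += 1
        status: Optional[int]
        try:
            status = probe()
        except OSError:
            status = None  # refused / timed out: BMC is likely still rebooting
        if status is not None and 200 <= status < 300:
            consecutive_ok += 1
            if consecutive_ok >= required_ok:
                return attempts
        else:
            consecutive_ok = 0  # a 4xx/5xx or silence resets the streak
        if clock() >= deadline:
            raise BMCNotStable(
                f"no {required_ok} consecutive good responses "
                f"within {grace_seconds}s"
            )
        sleep(interval)
```

Requiring several good responses in a row, rather than one, is a design choice aimed at the "talking garbage while booting" case: a single 200 in the middle of a reboot would not end the wait. The caller would be the step that already holds the node lock, so nothing else acts on the node while this blocks.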
15:33:37 Yeah, I think if this comes down to a "we're in some action like power state change in a workflow", we should be able to hold, or let the caller know we need to wait until we're getting stable responses
15:33:57 janders: the task triggering the step would already hold a lock (the node.reservation field) through the task.
15:34:03 I think we do a very similar thing with boot mode / secure boot
15:34:31 Yeah, if the BMC never returns from "lunch" we eventually fail
15:34:45 dtantsur are you thinking about the code we fixed together in sushy-oem-idrac?
15:34:54 yep
15:34:55 or is this in the more generic part?
15:35:04 OK, I understand, thank you
15:36:54 great :)
15:37:04 anything more on the topic? or other topics to discuss?
15:37:25 iurygregory and I could get some ideas
15:37:35 I think this discussion really helped my understanding of this challenge and gives me some ideas going forward, thank you!
15:37:43 dtantsur yeah
15:37:45 what on Earth could make IPA take 90 seconds to return any response for any API (including the static root)
15:38:18 yeah, I'm breaking my mind trying to figure out this one
15:38:48 even on localhost!
15:39:43 hmm it's always the DNS right? :)
15:39:55 it could be DNS..
15:39:59 dns for logging?
15:40:10 yeah, I recall this problem
15:40:31 saying that tongue-in-cheek since you said localhost but hey
15:40:38 maybe we're onto something
15:40:49 address of the caller :)
15:41:09 what would be default timeout on the DNS client in question?
15:41:22 This was also a thing which was "fixed" at one point ages ago by monkeypatching eventlet
15:41:31 err, using eventlet monkeypatching
15:41:55 I asked them to check if the response time was the same using the name and ip, and the problem always repeats, I also remember someone said some requests taking 120sec =)
15:42:34 That problem was more or less completely excised, the one that was fixed with more monkey patching
15:43:11 I really like the hypothesis of inconsistent or non-working DNS. There might even be some differences in behavior depending on what distribution you're using for the ramdisk in those cases.
15:43:32 It's a RHEL container inside CoreOS
15:43:35 could tcpdump help confirm this hypothesis?
15:43:40 janders: likely
15:43:45 (see if there are DNS queries on the wire)
15:43:48 janders: at a minimum, worth a try
15:46:02 anything else to discuss? :)
15:46:17 I have a question if I may
15:46:24 adam-metal3: sure thing
15:46:31 we still have some time
15:47:45 I have noticed an interesting behaviour with ProLiant DL360 Gen10 Plus servers, as you know IPA registers a UEFI boot record under the name of ironic- by default
15:48:29 Unless there is a hint file, yeah
15:48:35 what's going on?
15:48:42 On the machine type I have mentioned this record gets saved during all the deployments so if you deploy and clean 50 times you have 50 of these boot devices visible
15:48:55 oh
15:48:55 heh
15:48:57 uhhhhh
15:49:12 Steve wrote a thing for this
15:49:32 as far as tests done by downstream folks indicate, there is no serious issue
15:49:52 but it was confusing a lot of my downstream folks
15:50:14 * iurygregory needs to drop, lunch just arrived
15:50:40 https://review.opendev.org/c/openstack/ironic-python-agent/+/914563
15:50:49 Okay so I assume then it is a known issue, that is good!
15:51:03 Yeah, so... Ideally the image you're deploying has a loader hint
15:51:26 in that case, the image can say what to use, because some shim loaders will try to do record injection as well
15:51:39 and at one point, that was a super bad bug on some Intel hardware
15:51:52 or, triggered a bug... is the best way to describe it
15:52:39 Ironic, I *think* should be trying to clean those entries up in general, but I guess it would help to better understand what you're seeing, and compare to a deployment log since the code is *supposed* to dedupe those entries if memory serves
15:52:57 adam-metal3: we can continue to discuss more as time permits
15:53:06 sure
15:53:11 we don't need to hold the meeting for this, I can also send you a pointer to the hint file
15:53:25 thanks!
15:53:55 we have a couple of minutes left, anything else to discuss today? :(
15:53:57 errr
15:53:58 :)
15:55:21 alright I guess we can close here
15:55:32 thanks everyone!
15:55:32 #endmeeting
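[editor's note] The DNS hypothesis from 15:39 to 15:43 (slow name resolution, possibly a reverse lookup of the caller's address for logging) could be checked from inside the ramdisk alongside tcpdump with something like the sketch below. It is an assumption-laden illustration, not part of IPA: `time_lookup` and `report` are hypothetical helpers, and the targets are placeholders to be swapped for the real caller address and hostname.

```python
import socket
import time
from typing import Callable, Tuple


def time_lookup(resolver: Callable[[str], object], target: str) -> Tuple[float, bool]:
    """Return (elapsed_seconds, succeeded) for a single resolver call."""
    start = time.monotonic()
    try:
        resolver(target)
        succeeded = True
    except OSError:
        # socket.gaierror and socket.herror are both OSError subclasses.
        succeeded = False
    return time.monotonic() - start, succeeded


def report(hostname: str = "localhost", address: str = "127.0.0.1") -> None:
    """Print how long a forward and a reverse lookup take."""
    checks = (
        ("forward", socket.gethostbyname, hostname),
        ("reverse", socket.gethostbyaddr, address),
    )
    for label, resolver, target in checks:
        elapsed, succeeded = time_lookup(resolver, target)
        outcome = "ok" if succeeded else "failed"
        print(f"{label:7s} {target:15s} {outcome:6s} {elapsed:.3f}s")
```

Calling `report()` with the address of the machine making the API requests would show whether a reverse lookup alone accounts for a delay in the tens of seconds. A lookup that fails slowly is as interesting here as one that succeeds slowly.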