15:00:06 <rpittau> #startmeeting ironic
15:00:06 <opendevmeet> Meeting started Mon Dec 2 15:00:06 2024 UTC and is due to finish in 60 minutes. The chair is rpittau. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:06 <opendevmeet> The meeting name has been set to 'ironic'
15:00:22 <rpittau> mmm I wonder if we'll have quorum today
15:00:26 <rpittau> anyway
15:00:31 <rpittau> Hello everyone!
15:00:31 <rpittau> Welcome to our weekly meeting!
15:00:31 <rpittau> The meeting agenda can be found here:
15:00:31 <rpittau> https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_December_02.2C_2024
15:00:42 <rpittau> let's give it a couple of minutes for people to join
15:01:17 <iurygregory> o/
15:01:28 <TheJulia> o/
15:01:49 <TheJulia> We likely need to figure out our holiday meeting schedule
15:01:59 <rpittau> yeah, I was thinking the same
15:02:05 <kubajj> o/
15:02:12 <cid> o/
15:02:22 <rpittau> ok let's start
15:02:26 <rpittau> #topic Announcements/Reminders
15:02:40 <rpittau> #topic Standing reminder to review patches tagged ironic-week-prio and to hashtag any patches ready for review with ironic-week-prio:
15:02:40 <rpittau> #link https://tinyurl.com/ironic-weekly-prio-dash
15:03:08 <rpittau> there are some patches needing +W when any approver has a moment
15:03:56 <adam-metal3> o/
15:04:04 <rpittau> #topic 2025.1 Epoxy Release Schedule
15:04:04 <rpittau> #link https://releases.openstack.org/epoxy/schedule.html
15:04:39 <rpittau> we're at R-17, nothing to mention except I'm wondering if we need to do some releases
15:04:39 <rpittau> we had ironic and ipa last week, I will go through the other repos and see where we are
15:05:36 <TheJulia> I expect, given holidays and all, the next few weeks will largely be mainly focus time for myself
15:05:53 <rpittau> and I have one more thing, I won't be available for the meeting next week as I'm traveling, any volunteer to run the meeting?
15:06:35 <iurygregory> I can't since I'm also traveling
15:06:42 <TheJulia> I might be able to
15:06:43 <JayF> My calendar is clear if you want me to
15:06:53 <TheJulia> I guess a question might be, how many folks will be available next monday?
15:07:03 <rpittau> TheJulia, JayF, thanks, either of you is great :)
15:07:07 <rpittau> oh yeah
15:07:17 <JayF> I'll make a reminder to run it next Monday. Why wouldn't we expect many people around?
15:07:20 <rpittau> I guess it will be at least 3 less people
15:07:36 <JayF> Oh that's a good point. But I wonder if it's our last chance to have a meeting before the holiday, and I think we're technically supposed to have at least one a month
15:07:39 <rpittau> JayF: dtantsur, iurygregory and myself are traveling
15:07:52 <rpittau> we can have a last meeting the week after
15:08:00 <rpittau> then I guess we skip 2 meetings
15:08:07 <rpittau> the 23rd and the 30th
15:08:17 <rpittau> and we get back to the 6th
15:08:19 <TheJulia> I'll note, while next week will be the ?9th?, the following week will be the 16th, and I might not be around
15:08:21 <JayF> I will personally be out ... For exactly those two meetings
15:08:28 <dtantsur> I'll be here on 23rd and 30th if anyone needs me
15:08:32 <dtantsur> but not the next 2 weeks
15:09:00 <TheJulia> Safe travels!
15:09:04 <rpittau> thanks :)
15:09:42 <rpittau> so tentative last meeting the 16th ?
15:09:58 <rpittau> or the 23rd?
I may be able to make it
15:10:50 <TheJulia> Lets do the 16th
15:10:58 <TheJulia> I may partial week it, I dunno
15:11:17 <rpittau> perfect
15:11:17 <rpittau> I'll send an email out also as reminder/announcement
15:11:21 <JayF> Yeah I like the idea of just saying the 16th is our only remaining meeting of the month. +1
15:11:36 <rpittau> cool :)
15:11:54 <TheJulia> I really like that idea, skip next week, meet on the 16th, take over world
15:11:55 <TheJulia> etc
15:12:12 <TheJulia> Also, enables time for folks to focus on feature/work items they need to move forward
15:13:29 <rpittau> alright moving on
15:13:29 <rpittau> #topic Discussion topics
15:13:29 <rpittau> I have only one for today
15:13:37 <rpittau> which is more an announcement
15:13:52 <rpittau> #info CI migration to ubuntu noble has been completed
15:14:04 <rpittau> so far so good :D
15:14:26 <rpittau> anything else to discuss today?
15:14:41 <janders> I've got one item, if there is time/interest
15:14:44 <janders> servicing related
15:14:52 <janders> (we also have a good crowd for this topic)
15:14:57 <rpittau> janders: please go ahead :)
15:15:05 <janders> great :)
15:15:15 <janders> (in EMEA this week so easier to join this meeting)
15:15:18 <TheJulia> o/ janders
15:15:32 <janders> so - iurygregory and I ran into some issues with firmware updates during servicing
15:15:47 <janders> the kind of issues I wanted to talk about is related to BMC responsiveness issues during/immediately after
15:16:16 <TheJulia> Okay, what sort of issues?
15:16:21 <TheJulia> and what sort of firmware update?
15:16:43 * iurygregory thanks HPE for saying servicing failed because the bmc wasn't accessible
15:16:54 <janders> HTTP error codes in responses (400s, 500s, generally things making no sense)
15:17:16 <janders> I think BMC firmware update was the more problematic case (which makes sense)
15:17:23 <TheJulia> i know idracs can start spewing 500s if the FQDN is not set properly
15:17:47 <janders> but then what happens is the update succeeds but Ironic thinks it failed because it got a 400/500 response when the BMC was booting up and talking garbage in the process
15:17:58 <janders> (if it remained silent and not responding it would have been OK)
15:18:04 <iurygregory> https://paste.opendev.org/show/bdrsgYzFECwvq5O3hQPb/
15:18:19 <iurygregory> this was the error in case someone is interested =)
15:18:34 <janders> but TL;DR I wonder if we should have some logic saying "during/after BMC firmware upgrade, disregard any 'bad' BMC responses for X seconds"
15:18:35 <TheJulia> There is sort of a weird similar issue NobodyCam has encountered with his downstream where after we power cycle, the BMCs sometimes also just seem to pack up and go on vacation for a minute or two
15:19:08 <iurygregory> in this case it was about 3min for me
15:19:21 <TheJulia> Step-wise, we likely need to... either implicitly or have an explicit step which is "hey, we're going to get garbage responses, let's hold off on the current action until the $thing is ready"
15:19:22 <iurygregory> but yeah, seems similar
15:19:24 <janders> it's not an entirely new problem but the impact of such BMC (mis)behaviour is way worse in day2 ops than day1
15:19:43 <janders> it is annoying when it happens on a new node being provisioned
15:19:54 <TheJulia> or in service
15:20:03 <adam-metal3> I have seen similar related to checking power states
15:20:07 <TheJulia> because these are known workflows and... odd things happening are the beginning of the derailment
15:20:08 <janders> it is disruptive if someone has prod nodes in scheduled downtime (and overshoots the scheduled downtime due to this)
15:20:48 <TheJulia> we almost need an "okay, I've got a thing going on, give the node some grace" or something flag
15:20:54 <TheJulia> or "don't take new actions", or... dunno
15:21:00 <janders> TheJulia++
15:21:15 <TheJulia> I guess I'm semi-struggling to figure out how we would fit it into the model and avoid consuming a locking task, but maybe the answer *is* to lock it
15:21:21 <TheJulia> and hold a task
15:21:25 <janders> let me re-read the error Iury posted to see what the conductor was exactly trying to do when it crapped out
15:22:01 <TheJulia> we almost need a "it is all okay" once "xyz state is achieved"
15:22:24 <janders> OK so in this case it seems like the call to the BMC came from within the step
15:22:37 <TheJulia> nobodycam's power issue makes me want to hold a lock, and have a countdown timer of sorts
15:22:41 <janders> but I wonder if it is possible that we hit issues with a periodic task or something
15:22:58 <TheJulia> Well, if the task holds a lock the entire time, the task can't run.
15:23:02 <janders> TheJulia I probably need to give it some more thought but this makes sense to me
15:23:03 <TheJulia> until the lock releases it
15:23:19 <janders> iurygregory dtantsur WDYT?
15:23:28 <TheJulia> it can't really be a background periodic short of adding a bunch more interlocking complexity
15:23:34 <TheJulia> because then step flows need to resume
15:23:48 <TheJulia> we kind of need to actually block in these "we're doing a thing" cases
15:24:10 <TheJulia> and in nobodycam's case we could just figure out some middle ground which could be turned on for power actions
15:24:10 <janders> yeah it doesn't sound unreasonable
15:24:19 <janders> to do this
15:24:26 <TheJulia> I *think* his issue is post-cleaning or just after deployment, like the very very very last step
15:24:27 <iurygregory> I think it makes sense
15:24:41 <TheJulia> I've got a bug in launchpad which lays out that issue
15:24:55 <TheJulia> but I think someone triaged it as incomplete and it expired
15:25:02 <iurygregory> oh =(
15:25:41 <janders> I think this time we'll need to get to the bottom of it because when people start using servicing in anger (and also through metal3) this is going to cause real damage
15:26:00 <janders> (overshooting maintenance windows for already-deployed nodes is the first scenario that comes to mind but there will likely be others)
15:26:09 <TheJulia> https://bugs.launchpad.net/ironic/+bug/2069074
15:26:41 <TheJulia> Overshooting maintenance windows is inevitable
15:26:52 <TheJulia> the key is to keep the train of process from derailing
15:27:04 <TheJulia> That way it is not the train which is the root cause
15:27:17 <janders> "if a ironic is unable to connect to a nodes power source" - power source == BMC in this case?
15:27:31 <TheJulia> yes
15:27:37 <TheJulia> I *think*
15:27:47 <janders> this rings a bell, I think this is what crapped out inside the service_step when iurygregory and I were looking at it
15:27:48 <TheJulia> they also have SNMP PDUs in that environment, aiui
15:28:05 <TheJulia> oh, so basically same type of issue
15:28:06 <janders> (this level of detail is hidden under that last_error)
15:28:08 <janders> yeah
15:28:09 <iurygregory> not during service, but cleaning in an HPE
15:28:23 <iurygregory> but yeah same type of issue indeed
15:28:26 <janders> thank you for clarifying iurygregory
15:28:35 <iurygregory> np =)
15:28:50 <TheJulia> yeah, I think I semi-pinned the issue that I thought
15:29:04 <janders> yeah it feels like we're missing the "don't depend on responses from the BMC while mucking around with its firmware" bit
15:29:13 <janders> in a few different scenarios
15:29:29 <TheJulia> well, or in cases where the BMC might also be taking a chance to reset/reboot itself
15:29:40 <TheJulia> at which point, it is no longer a stable entity until it returns to stability
15:30:15 <janders> ok so from our discussion today it feels like 1) the issue is real and 2) holding a lock could be a possible solution - am I right here?
15:30:38 <TheJulia> Well, holding a lock prevents things from moving forward
15:30:46 <TheJulia> and prevents others from making state assumptions
15:30:57 <TheJulia> or other conflicting instructions coming in
15:31:02 <janders> yeah
15:31:38 <TheJulia> iurygregory: was https://paste.opendev.org/show/bdrsgYzFECwvq5O3hQPb/'s 400 a result of power state checking?
15:31:41 <TheJulia> post-action
15:31:42 <TheJulia> ?
15:32:43 <iurygregory> Need to double check
15:32:54 <iurygregory> I can re-run things later and gather some logs
15:33:34 <janders> so the lock would be requested by code inside the step performing the firmware operation in this case (regardless of whether day1 or day2) and if the BMC doesn't resume returning valid responses after X seconds we fail the step and release the lock?
15:33:37 <TheJulia> Yeah, I think if this comes down to a "we're in some action like power state change in a workflow", we should be able to hold, or let the caller know we need to wait until we're getting a stable response
15:33:57 <TheJulia> janders: the task triggering the step would already hold a lock (the node.reservation field) through the task.
15:34:03 <dtantsur> I think we do a very similar thing with boot mode / secure boot
15:34:31 <TheJulia> Yeah, if the BMC never returns from "lunch" we eventually fail
15:34:45 <janders> dtantsur are you thinking about the code we fixed together in sushy-oem-idrac?
15:34:54 <dtantsur> yep
15:34:55 <janders> or is this in the more generic part?
15:35:04 <janders> OK, I understand, thank you
15:36:54 <rpittau> great :)
15:37:04 <rpittau> anything more on the topic? or other topics to discuss?
15:37:25 <dtantsur> iurygregory and I could use some ideas
15:37:35 <janders> I think this discussion really helped my understanding of this challenge and gives me some ideas going forward, thank you!
15:37:43 <janders> dtantsur yeah
15:37:45 <dtantsur> what on Earth could make IPA take 90 seconds to return any response for any API (including the static root)
15:38:18 <iurygregory> yeah, I'm breaking my mind trying to figure out this one
15:38:48 <dtantsur> even on localhost!
15:39:43 <janders> hmm it's always the DNS right? :)
15:39:55 <dtantsur> it could be DNS..
15:39:59 <TheJulia> dns for logging?
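If reverse DNS on the caller's address is the culprit, that is cheap to confirm from inside the ramdisk before reaching for a packet capture. A minimal sketch, assuming only the Python standard library is available in the IPA environment; the address below is a hypothetical placeholder for whatever host is calling the IPA API, not a value taken from this discussion:

    # Time a single reverse-DNS lookup, the kind a server may perform when
    # logging the caller's hostname. A healthy resolver answers (or fails)
    # in well under a second; a broken or unreachable one typically stalls
    # for multiples of the resolver timeout, which would line up with API
    # responses taking 90-120 seconds.
    import socket
    import time

    CALLER_IP = "192.0.2.10"  # placeholder: the address hitting the IPA API

    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(CALLER_IP)[0]
    except OSError as exc:
        name = "<lookup failed: %s>" % exc
    elapsed = time.monotonic() - start

    print("reverse lookup of %s -> %s took %.1fs" % (CALLER_IP, name, elapsed))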
15:40:10 <dtantsur> yeah, I recall this problem
15:40:31 <janders> saying that tongue-in-cheek since you said localhost but hey
15:40:38 <janders> maybe we're onto something
15:40:49 <TheJulia> address of the caller :)
15:41:09 <janders> what would be the default timeout on the DNS client in question?
15:41:22 <TheJulia> This was also a thing which was "fixed" at one point ages ago by monkeypatching eventlet
15:41:31 <TheJulia> err, using eventlet monkeypatching
15:41:55 <iurygregory> I asked them to check if the response time was the same using the name and the IP, and the problem always repeats, I also remember someone said some requests taking 120 sec =)
15:42:34 <JayF> That problem was more or less completely excised, the one that was fixed with more monkey patching
15:43:11 <JayF> I really like the hypothesis of inconsistent or non-working DNS. There might even be some differences in behavior depending on which distribution you're using for the ramdisk in those cases.
15:43:32 <dtantsur> It's a RHEL container inside CoreOS
15:43:35 <janders> could tcpdump help confirm this hypothesis?
15:43:40 <TheJulia> janders: likely
15:43:45 <janders> (see if there are DNS queries on the wire)
15:43:48 <TheJulia> janders: at a minimum, worth a try
15:46:02 <rpittau> anything else to discuss? :)
15:46:17 <adam-metal3> I have a question if I may
15:46:24 <rpittau> adam-metal3: sure thing
15:46:31 <rpittau> we still have some time
15:47:45 <adam-metal3> I have noticed an interesting behaviour with ProLiant DL360 Gen10 Plus servers, as you know IPA registers a UEFI boot record under the name of ironic-<somenumber> by default
15:48:29 <TheJulia> Unless there is a hint file, yeah
15:48:35 <TheJulia> what's going on?
15:48:42 <adam-metal3> On the machine type I have mentioned this record gets saved during all the deployments so if you deploy and clean 50 times you have 50 of these boot devices visible
15:48:55 <TheJulia> oh
15:48:55 <TheJulia> heh
15:48:57 <TheJulia> uhhhhh
15:49:12 <TheJulia> Steve wrote a thing for this
15:49:32 <adam-metal3> as far as tests done by downstream folks indicate, there is no serious issue
15:49:52 <adam-metal3> but it was confusing a lot of my downstream folks
15:50:14 * iurygregory needs to drop, lunch just arrived
15:50:40 <TheJulia> https://review.opendev.org/c/openstack/ironic-python-agent/+/914563
15:50:49 <adam-metal3> Okay so I assume then it is a known issue, that is good!
15:51:03 <TheJulia> Yeah, so... Ideally the image you're deploying has a loader hint
15:51:26 <TheJulia> in that case, the image can say what to use, because some shim loaders will try to do record injection as well
15:51:39 <TheJulia> and at one point, that was a super bad bug on some intel hardware
15:51:52 <TheJulia> or, triggered a bug... is the best way to describe it
15:52:39 <TheJulia> Ironic, I *think*, should be trying to clean those entries up in general, but I guess it would help to better understand what you're seeing, and compare to a deployment log since the code is *supposed* to dedupe those entries if memory serves
15:52:57 <TheJulia> adam-metal3: we can continue to discuss more as time permits
15:53:06 <adam-metal3> sure
15:53:11 <TheJulia> we don't need to hold the meeting for this, I can also send you a pointer to the hint file
15:53:25 <adam-metal3> thanks!
15:53:55 <rpittau> we have a couple of minutes left, anything else to discuss today? :(
15:53:57 <rpittau> errr
15:53:58 <rpittau> :)
15:55:21 <rpittau> alright I guess we can close here
15:55:32 <rpittau> thanks everyone!
15:55:32 <rpittau> #endmeeting
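On the servicing thread above ("during/after BMC firmware upgrade, disregard any 'bad' BMC responses for X seconds"): a minimal sketch of what a bounded grace period could look like, assuming it runs inside a step whose task already holds the node lock (the node.reservation field). The helper name, the exception class, and the timeout values are hypothetical illustrations, not existing Ironic or sushy APIs:

    # Sketch of a bounded grace period after a BMC firmware update: while the
    # surrounding task keeps the node locked, treat 4xx/5xx garbage as "BMC
    # still rebooting" until either a sane response comes back or the
    # deadline expires. All names here are placeholders for illustration.
    import time


    class BMCStillBootingError(Exception):
        """Raised when the BMC does not stabilize within the grace period."""


    def wait_for_bmc_stability(check_bmc, grace_period=600, poll_interval=15):
        """Poll the BMC until it answers sensibly or the grace period ends.

        :param check_bmc: callable performing a cheap read-only request
            (e.g. fetching the manager resource); it raises on any error.
        :param grace_period: seconds to tolerate bad responses before failing.
        :param poll_interval: seconds to sleep between attempts.
        """
        deadline = time.monotonic() + grace_period
        last_error = None
        while time.monotonic() < deadline:
            try:
                check_bmc()
                return  # the BMC is answering sensibly again
            except Exception as exc:  # real code would catch narrower errors
                last_error = exc
                time.sleep(poll_interval)
        raise BMCStillBootingError(
            "BMC did not stabilize within %ss: %s" % (grace_period, last_error))

A firmware step could call something like this right after triggering a BMC update, so that 400/500 noise while the BMC reboots is absorbed instead of failing cleaning or servicing, while the step still fails cleanly if the BMC never comes back.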