iurygregory | TheJulia, tks for the review in the spec, I just answered some of your questions there o/ | 01:58 |
---|---|---|
iurygregory | tomorrow I will be looking at the open specs we have for bobcat | 01:58 |
TheJulia | Cool cool | 02:05 |
TheJulia | As long as we don’t do that whole strict tie to a cycle process of control nightmare. | 02:08 |
iurygregory | yeah | 02:09 |
rpittau | good morning ironic! o/ | 06:37 |
kaloyank | morning ironic o/ | 06:57 |
kubajj | good morning everyone | 09:23 |
dtantsur | morning kubajj! how are your studies going? | 09:42 |
kubajj | dtantsur: it's not that bad. Exams are slowly approaching though, so I'm basically spending 9-19 at the library. Slightly more than a month to go and then I'm free. 🥹 | 09:45 |
dtantsur | great, good luck! | 09:47 |
iurygregory | morning Ironic | 11:35 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: [DNM] test linters https://review.opendev.org/c/openstack/bifrost/+/880163 | 12:32 |
TheJulia | good morning | 12:33 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: [DNM] test linters https://review.opendev.org/c/openstack/bifrost/+/880163 | 12:33 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: Fix ansible-lint https://review.opendev.org/c/openstack/bifrost/+/880163 | 12:34 |
Sandzwerg[m] | TheJulia: Turns out having nodes stuck in deleting is not that hard if ironic has issues reaching the node via IPMI :) | 12:57 |
TheJulia | Sandzwerg[m]: intermittently? | 12:59 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: Fix ansible-lint https://review.opendev.org/c/openstack/bifrost/+/880163 | 13:00 |
Sandzwerg[m] | TheJulia: More or less. The node was put to maintenance because ironic fails the power sync even after some retries. Probably something on the node side is broken | 13:10 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: Fix ansible-lint https://review.opendev.org/c/openstack/bifrost/+/880163 | 13:44 |
TheJulia | Sandzwerg[m]: do you have another system talking to the BMC via ipmi? | 13:47 |
TheJulia | Sandzwerg[m]: are you using power sync? the default is yes :) | 13:48 |
TheJulia | Sandzwerg[m]: Also, have any timers been changed from the defaults? | 13:48 |
Sandzwerg[m] | We have a vendor console (lxca/openmanage/etc), not sure if that uses IPMI or something else. Monitoring is done via SNMP if I'm not mistaken. I don't think the vendor console had any tasks which would block the node | 13:49 |
Sandzwerg[m] | We sync the status to ironic, but no longer let ironic change the status if it does not match what it expects. Would need to check if we changed the default times, I assume not | 13:50 |
TheJulia | so depending on the console, it *might* and some of the vendor's BMCs are designed around no more than 1 request every so often | 13:50 |
TheJulia | I don't think we've changed the default behavior there, but it sounds like you're using the knobs | 13:51 |
TheJulia | s/knobs/correct knobs/ | 13:51 |
Sandzwerg[m] | Yeah, it's also not a big issue. I found two nodes in two days, which is a bit more, but we don't see it all the time. We know that sometimes the remoteboard gets unresponsive and will not respond to anything for a while, still haven't found the issue. Could be such a case. Interestingly only (older, ~4-5 years) Lenovos seem to be affected this time | 13:53 |
TheJulia | Older supermicro gear in particular, I think makes a login entry and logging into the web console would count as a login to the bmc and it would eventually time out the fact i touched it via ipmi | 13:53 |
TheJulia | but if i hit it too many times, I'd start having errors | 13:53 |
TheJulia | it might be that using snmp on those bmcs, whatever they are, could be doing the same basic thing in that it could be creating a session and eventually time it out which is counting possibly | 13:53 |
Sandzwerg[m] | It's not a big issue currently, mostly annoying. But yesterday you mentioned you can't imagine a case in which a node might get stuck in the "deleted" state, so I thought I'd describe how it could happen | 13:54 |
TheJulia | yeah, I guess I can see that happening then :) | 13:55 |
Sandzwerg[m] | Yeah, wouldn't be surprised, but I think it was too rare for that. We never saw it everywhere. But sometimes a whole block (~14 nodes in one or two racks) is affected at the same time | 13:55 |
TheJulia | oh joy | 13:55 |
TheJulia | :( | 13:55 |
Sandzwerg[m] | Most of our ironic infrastructure is pretty static and the support has a script to fix it. I think it tries in a loop to reboot the bmc or something, but could take a day or so till it succeeds. I'm not sad if these Lenovos get decommissioned, hopefully this year | 13:57 |
samuelkunkel[m] | Has anyone ever seen that in a HPE Node using ILO5 (redfish)? Node is being booted to clean, Redfish calls to boot the device and directly afterwards throws: Extended information: [{'MessageArgs': ['BootSourceOverrideTarget'], 'MessageId': 'iLO.2.15.UnableToModifyDuringSystemPOST. | 13:57 |
samuelkunkel[m] | If I retry it once or twice (starting back from "manage", "provide") it works | 13:58 |
TheJulia | Sandzwerg[m]: oh joy! Sounds a lot like the HP Gen ?6? IPMI BMCs we had at HP Cloud | 13:58 |
samuelkunkel[m] | Uff gen6? :D | 13:59 |
TheJulia | samuelkunkel[m]: oh my, yes! We've seen a report of that before | 13:59 |
TheJulia | samuelkunkel[m]: it was a very very long time ago | 13:59 |
samuelkunkel[m] | I don't get why it happens now. And only with these ARM HPE Nodes (I really start to hate them) | 13:59 |
Sandzwerg[m] | We also have some 8 Socket lenovo, where they basically just strap two 4 sockets together. These are also always "fun" for all their quirks | 13:59 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: Remove extra symbols accidentally added https://review.opendev.org/c/openstack/bifrost/+/879547 | 13:59 |
opendevreview | Maksim Malchuk proposed openstack/bifrost master: Remove extra symbols accidentally added https://review.opendev.org/c/openstack/bifrost/+/879547 | 14:00 |
TheJulia | samuelkunkel[m]: so, the report I believe was before any of those machines shipped, but it sounds like the window might be longer :\ | 14:00 |
samuelkunkel[m] | Hmm, for now I just retry it | 14:01 |
samuelkunkel[m] | But this sounds pretty inconvenient | 14:02 |
TheJulia | samuelkunkel[m]: with a full error and details, like how much time it takes, I suspect a patch could be created for sushy or ironic | 14:04 |
samuelkunkel[m] | The question is rather, should we handle this in sushy? | 14:04 |
TheJulia | depends on the details needed | 14:04 |
samuelkunkel[m] | If I recall this correctly we already have exponential backoff and retry? | 14:04 |
samuelkunkel[m] | I can provide at least the details and maybe work on a patch. But currently it only happens on ILO6 with the RL300 Nodes | 14:05 |
samuelkunkel[m] | So I will ask HPE first what they think of this :D | 14:06 |
TheJulia | looking at the sushy code | 14:06 |
opendevreview | Merged openstack/ironic stable/zed: Always fall back from hard linking to copying files https://review.opendev.org/c/openstack/ironic/+/879868 | 14:07 |
opendevreview | Merged openstack/ironic bugfix/21.3: Always fall back from hard linking to copying files https://review.opendev.org/c/openstack/ironic/+/879869 | 14:08 |
opendevreview | Merged openstack/ironic bugfix/21.2: Always fall back from hard linking to copying files https://review.opendev.org/c/openstack/ironic/+/880090 | 14:08 |
opendevreview | Merged openstack/ironic stable/2023.1: Always fall back from hard linking to copying files https://review.opendev.org/c/openstack/ironic/+/879867 | 14:08 |
samuelkunkel[m] | Shall I create a bug for that to provide the details? | 14:09 |
TheJulia | it might back off..... | 14:09 |
TheJulia | samuelkunkel[m]: please | 14:09 |
TheJulia | I just got off a call and there is a wait depending on the precise error code we get back | 14:09 |
TheJulia | http error code at that | 14:09 |
samuelkunkel[m] | I would create it for sushy repo? | 14:10 |
samuelkunkel[m] | yes, I have the logs from the conductor at least | 14:10 |
TheJulia | Yes, that should hopefully have the http error code | 14:10 |
samuelkunkel[m] | sushy bug also via bugs.launchpad? | 14:12 |
samuelkunkel[m] | never opened one for sushy | 14:12 |
TheJulia | I believe so, had to step away for moment | 14:12 |
TheJulia | https://bugs.launchpad.net/sushy | 14:14 |
dtantsur | TheJulia: morning! could you take a look at https://review.opendev.org/c/openstack/ironic-specs/+/878001 when you have a minute or let me know if you're fine with us just merging it (has 2x +2)? | 15:22 |
TheJulia | i can try and glance in a little bit, we're starting to reach the "time to be able to focus" window | 15:44 |
TheJulia | dtantsur: so my only concern in it is plugin migration. Since the method names are being changed, it is not highlighted, and it might not really need to be highlighted, but folks with custom plugins will need to modify code, which seems reasonable given the level of work | 16:01 |
dtantsur | TheJulia: yep, they'll also need to change the plugin entry point. So it's not going to be automatic either way. | 16:01 |
TheJulia | yup | 16:01 |
dtantsur | do we need anything other than good docs? | 16:01 |
TheJulia | I don't think we can do anything besides that | 16:02 |
TheJulia | I do like the callout of what should and should not be done in a plugin | 16:02 |
TheJulia | also, I made two notes on the new api additions, I would expect we would have the ability to just entirely disable the endpoints like lookup/heartbeat have for operators with API surfaces pointed towards untrusted or semi-trusted users | 16:03 |
rpittau | good night! o/ | 16:07 |
opendevreview | Merged openstack/ironic-specs master: Merge Inspector into Ironic https://review.opendev.org/c/openstack/ironic-specs/+/878001 | 16:20 |
dtantsur | TheJulia: thanks! I'll work on a follow-up, also taking into account our discussion with hjensas | 16:50 |
* TheJulia is unsure which or where | 16:50 | |
* TheJulia goes back to paperwork | 16:50 | |
opendevreview | Chris Krelle proposed openstack/ironic master: Add ablity to power off nodes in clean failed https://review.opendev.org/c/openstack/ironic/+/880165 | 18:02 |
NobodyCam | Good Morning OpenStack Folks! | 18:02 |
TheJulia | it is nearly afternoon | 18:08 |
TheJulia | :) | 18:09 |
opendevreview | Chris Krelle proposed openstack/ironic master: Add ablity to power off nodes in clean failed https://review.opendev.org/c/openstack/ironic/+/880165 | 19:46 |
opendevreview | Chris Krelle proposed openstack/ironic master: Add ablity to power off nodes in clean failed https://review.opendev.org/c/openstack/ironic/+/880165 | 20:03 |
TheJulia | NobodyCam: current version of gerrit dislikes `` | 20:03 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/880165 on the reno ``[config_group]option_name`` | 20:03 |
TheJulia | https://review.opendev.org/c/openstack/ironic/+/880165 on the reno ```[config_group]option_name``` | 20:03 |
NobodyCam | le sigh | 20:04 |
TheJulia | oh jeeze, also irccloud is mucking with it | 20:04 |
NobodyCam | will fix right after lunch | 20:04 |
NobodyCam | heheheeh | 20:04 |
* TheJulia looks for a cane to shake and to talk about the times before where we didn't have this new fangled stuff | 20:04 | |
NobodyCam | 😱 | 20:04 |
samuelkunkel[m] | TheJulia: you remember vaguely the case (I think it was yesterday or 2 days ago) where efibootmgr was not able to access the uefi? | 20:12 |
samuelkunkel[m] | yeah, seems like it's related to the image. | 20:12 |
samuelkunkel[m] | Stream-9 IPA works properly | 20:12 |
samuelkunkel[m] | Debian 12 IPA not | 20:12 |
TheJulia | samuelkunkel[m]: oh my... | 20:12 |
samuelkunkel[m] | so seems like I switch back to Stream-9. | 20:14 |
samuelkunkel[m] | But it seems like your fix about the image size works. At least with the latest version of the diskimage-builder / ironic-python-agent-builder the initramfs of a stream-9 is only around 350M (no longer ~800M) | 20:15 |
samuelkunkel[m] | thanks for that :) | 20:15 |
clarkb | note that you can adjust partition sizes via dib too if necessary | 20:17 |
TheJulia | well, in a ramdisk case there are no partitions | 20:29 |
TheJulia | :) | 20:29 |
clarkb | ah I mixed this up with the raid thing | 20:31 |
NobodyCam | TheJulia: no space between ] and the option name? `[conductor]poweroff_in_cleanfail` vs `[conductor] poweroff_in_cleanfail` | 20:32 |
NobodyCam | hey hey clarkb long Time no see | 20:32 |
clarkb | hello! | 20:33 |
TheJulia | NobodyCam: correct | 20:33 |
opendevreview | Chris Krelle proposed openstack/ironic master: Add ablity to power off nodes in clean failed https://review.opendev.org/c/openstack/ironic/+/880165 | 20:34 |
NobodyCam | okay that should tackle the dreaded pep8 error and the Reno note issue | 20:34 |
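
Note on the iLO retry discussion above (13:57 to 14:10): the manual "retry it once or twice" workaround can be sketched as a retry loop with exponential backoff. This is only an illustration and not sushy's or ironic's actual behavior; `BMCBusyError`, `retry_while_in_post`, and the delay values are hypothetical names and numbers chosen for the example. The only detail taken from the log is the `UnableToModifyDuringSystemPOST` message text.

```python
import time

# Marker taken from the error text quoted in the log; everything else here
# is a hypothetical illustration, not an existing sushy/ironic API.
POST_BUSY_MARKER = "UnableToModifyDuringSystemPOST"


class BMCBusyError(Exception):
    """Hypothetical stand-in for the error a Redfish client might raise."""


def retry_while_in_post(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action(), backing off exponentially while the BMC reports POST."""
    for attempt in range(attempts):
        try:
            return action()
        except BMCBusyError as exc:
            # Retry only the "system is in POST" condition, and give up
            # (re-raise) once the last attempt has failed.
            if POST_BUSY_MARKER not in str(exc) or attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting; a real fix would depend on the HTTP error code and timing details requested in the bug report.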