opendevreview | Merged openstack/ironic-tempest-plugin master: Test multiple boot interfaces as part of one CI job https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/902171 | 03:27 |
---|---|---|
rpittau | good morning ironic! o/ | 07:55 |
masghar | Good morning! | 09:14 |
opendevreview | Merged openstack/ironic stable/wallaby: stable only/ci: pin CI to dnsmasq 2.85/pin proliantutils https://review.opendev.org/c/openstack/ironic/+/911160 | 09:25 |
iurygregory | good morning Ironic | 11:07 |
Sandzwerg[m] | moin ironic | 12:10 |
Sandzwerg[m] | Has anyone ever heard of dell nodes getting frequently powered off shortly after they got powered on? It does not happen all the time and we're unsure if ironic/metal³ or something else is involved | 12:15 |
dtantsur | Sandzwerg[m]: what do you use to power them on? Ironic tends to force whatever it thinks is the right state. | 12:20 |
Sandzwerg[m] | metal³ ironic | 12:26 |
Sandzwerg[m] | and yeah it should force them to come on, but something forces them off when ironic forces them to PXE/on just before. | 12:29 |
Sandzwerg[m] | Feels a bit like ironic is confused but not entirely sure. Haven't found which user triggers the shutdown in the remoteboard log | 12:30 |
dtantsur | If Ironic does it, it has to log something about it | 13:31 |
drannou | Hello | 13:38 |
drannou | If you have a moment to review https://review.opendev.org/c/openstack/ironic-python-agent/+/902769 | 13:39 |
dking | Does anybody know off the top of their head in which part of the cleaning process, perhaps in hardware.py, LVM volume groups are cleared? | 13:42 |
dtantsur | dking: erase_devices/erase_device_metadata clean steps. For the latter, it propagates all the way into https://opendev.org/openstack/ironic-lib/src/branch/master/ironic_lib/disk_utils.py#L515 | 13:43 |
dking | dtantsur: Thank you very much! | 13:45 |
*** tosky_ is now known as tosky | 13:57 | |
TheJulia | Sandzwerg[m]: I'd check the console to see if reconfiguration jobs are running, that will cause a reboot and a dell node to appear on, go off, and then come back on. | 14:54 |
TheJulia | JayF: https://review.opendev.org/c/openstack/ironic/+/911158 won't pass until tempest tests are fixed | 14:57 |
JayF | I'm assuming since this pointed at me that must be a failure caused by the sharding test or something? | 14:58 |
JayF | I can look at it, I'm finally called up today | 14:58 |
JayF | **caught up | 14:58 |
TheJulia | yes :) | 14:58 |
TheJulia | That would be awesome if you could, I'm buried under piles of things right now and I need to work on a side deck | 14:59 |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: Invoke tests with fake interfaces https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/909939 | 14:59 |
TheJulia | ummm hmmm | 15:00 |
*** dansmith_ is now known as dansmith | 15:02 | |
opendevreview | Julia Kreger proposed openstack/ironic-tempest-plugin master: Invoke tests with fake interfaces https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/909939 | 15:08 |
dtantsur | TheJulia, hjensas, hi folks, would be great if you could put https://review.opendev.org/c/openstack/ironic/+/907991 (and thus https://review.opendev.org/c/openstack/ironic/+/910251) in your review queue. | 15:36 |
dtantsur | I'm actually the last person who needs that working :) | 15:36 |
JayF | TheJulia: I'm confused, I thought https://opendev.org/openstack/ironic-tempest-plugin/src/branch/master/ironic_tempest_plugin/tests/api/admin/test_shards.py#L24 was all that was needed to keep those tests from running | 15:38 |
JayF | https://opendev.org/openstack/ironic-tempest-plugin/src/branch/master/ironic_tempest_plugin/tests/api/base.py#L77 | 15:39 |
JayF | I'm assuming those branches are not configured properly with their max microversion? | 15:40 |
TheJulia | there are no branches for tempest | 15:41 |
TheJulia | I honestly don't know why they are trying to run | 15:41 |
TheJulia | it is clear it knows the difference from the errors I linked out | 15:41 |
JayF | 3142: iniset $TEMPEST_CONFIG baremetal max_microversion $TEMPEST_BAREMETAL_MAX_MICROVERSION | 15:41 |
JayF | we have to set that | 15:42 |
JayF | on older branches | 15:42 |
TheJulia | ahh! | 15:42 |
JayF | I'll add it to the patch in question | 15:42 |
TheJulia | so config on stable/zed then | 15:42 |
JayF | aye | 15:42 |
TheJulia | I guess that code path doesn't have real detection based skipping unlike scenario jobs | 15:42 |
JayF | does that imply I only need to configure it on functional test jobs? | 15:43 |
JayF | I suspect it's also possible for scenario jobs ... we just haven't added any that use new APIs | 15:43 |
JayF | so it's a moot point and not breaky | 15:43 |
TheJulia | likely just the default in devstack/lib/ironic | 15:43 |
opendevreview | Jay Faulkner proposed openstack/ironic stable/zed: stable only/ci: pin CI to dnsmasq 2.85/pin proliantutils/scciclient https://review.opendev.org/c/openstack/ironic/+/911158 | 15:48 |
*** dking is now known as Guest2031 | 15:55 | |
*** Guest2031 is now known as dking | 16:07 | |
dking | In the Ironic-Python-Agent, was ProtectedDeviceFound replaced by errors.ProtectedDeviceError? I see that ProtectedDeviceFound is only referenced in some docstrings, and I cannot find it implemented, but errors.ProtectedDeviceError exists and seems to be used similarly. | 16:20 |
JayF | you wanna link to those invalid docstrings, I'll clean em up | 16:28 |
JayF | eh, ripgrep can do that part I guess | 16:28 |
JayF | dking: yep, you are right | 16:32 |
opendevreview | Jay Faulkner proposed openstack/ironic-python-agent master: Correct invalid docstrings; s/Found/Error/ https://review.opendev.org/c/openstack/ironic-python-agent/+/911598 | 16:33 |
JayF | dking: ^ fyi it'll be fixed when that lands | 16:33 |
dking | JayF: Thanks! So, that can help clean up the code. So, I'm just wondering about it a little. The new error is thrown when looking at devices, but the places where it's mentioned in the docstrings are methods handling the node itself. The error requires a device. Is that still the intention? | 16:36 |
JayF | Lets back up a step | 16:36 |
JayF | what's your overall thing you're trying to accomplish? | 16:37 |
dking | For instance, if I wanted to override erase_devices() for an entire node for some reason, I would make my own erase_devices() method in my hardware manager and send an error up saying that this node is protected. Would ProtectedDeviceFound still be the best way to do that? | 16:37 |
dking | In my specific use case, I can adjust the code to grab the particular device, so I'll probably do that anyway, but it just made me wonder if there should be something more generic. | 16:39 |
dking | (In my use case, I'm looking for some specific volume group names which happen to belong to a ceph cluster and I'm planning to make those be removed only upon a manual clean step. So, I'm also happy to try something else if there's a better way.) | 16:40 |
JayF | ProtectedDeviceError would be the way to do that /on master branch/ which is all I've looked at | 16:42 |
dking | JayF: Okay, that's good, then. So, if somebody for some reason decided that they wanted their clean step to fail after checking the whole node, not a check for a specific device, they would throw that error and perhaps fill in device with just any text? | 16:45 |
JayF | Do you *care* that it's a ProtectedDeviceError | 16:45 |
JayF | or do you just want to stop cleaning? | 16:46 |
dking | More the latter. So, perhaps just a generic CleaningError? | 16:46 |
JayF | https://opendev.org/openstack/ironic-python-agent/src/branch/master/examples/business-logic/example_business_logic.py#L94 | 16:47 |
JayF | yep | 16:47 |
JayF | you're basically implementing the business logic pattern | 16:47 |
JayF | https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/errors.py#L362 ProtectedDeviceError *is* a CleaningError, which a preset exception message :) | 16:48 |
JayF | **with a preset exception message | 16:48 |
JayF | So either one of those works, just a question of if you want the premade message or you wanna write one yourself | 16:49 |
dking | Great. Then, that answers my question. I was wondering for a moment if the docstrings should be updated instead to the more generic CleaningError, but as those particular methods do specifically only raise ProtectedDeviceError, that seems appropriate. | 16:49 |
dking | Yeah, I see that. I might still use ProtectedDeviceError in my case also. | 16:50 |
JayF | I suspect that error name was changed and just got missed being updated in the docstring | 16:50 |
rpittau | good night! o/ | 16:50 |
JayF | o/ | 16:50 |
dking | BTW, I see that after about 4 years, you just updated the hardware manager examples to fix some spelling issues. I'm sure there's tons of cobwebs growing on them. | 16:51 |
JayF | The interface itself hasn't changed in .... ever? | 16:51 |
dking | rpittau: Good night | 16:51 |
JayF | We did one api change on hardware managers, back in like, 2014 or 2015 | 16:51 |
JayF | and it's been a stable API since | 16:51 |
dking | It probably might be helpful to add in a deploy step? I don't know if people are doing them much, though. | 16:52 |
JayF | I'd +1 such a change, but it's not something that's urgently on my list | 16:52 |
dking | I'm adding in my first locally because we're not ready to update BMO yet and so we're stuck with the old version that doesn't have an option for a "by_path" root device hint. | 16:52 |
TheJulia | dnsmasq gives me a migraine | 16:53 |
JayF | More breakage? Or you trying to fix the issue in C? | 16:53 |
TheJulia | oh, I know of a way, it just masks the actual problem | 16:54 |
*** clarkb1 is now known as clarkb | 16:55 | |
opendevreview | Verification of a change to openstack/ironic master failed: Split conductor-specific RPCService https://review.opendev.org/c/openstack/ironic/+/910251 | 16:56 |
* TheJulia wonders if I just found the root cause | 17:12 | |
dtantsur | dking: we have downstream deploy steps if you're curious: https://github.com/openshift/ironic-agent-image/blob/main/hardware_manager/ironic_coreos_install.py | 17:26 |
JayF | I was reviewing a change for cid and got a little confused: https://review.opendev.org/c/openstack/ironic/+/910973 | 17:26 |
JayF | re: if we can use mysql types in alembic migrations | 17:26 |
dtantsur | dking: but if you have issues with hints, won't it make more sense to override get_os_install_device in a downstream hardware manager? | 17:27 |
TheJulia | JayF: mysql has a nuance I'm trying to remember | 17:27 |
TheJulia | I think postgres doesn't actually enforce stringfield lengths | 17:28 |
TheJulia | it does it by types if memory serves | 17:28 |
TheJulia | easy enough to test in migration test code | 17:28 |
dking | dtantsur: Perhaps. I hadn't looked at that, but at the moment, I'm just trying to mimic what BMO is doing so that it will work the same way once we update our version. | 17:31 |
dtantsur | dking: I haven't put a ton of thoughts into this suggestion either, but it does sound like you need to override get_os_install_device | 17:32 |
TheJulia | gah, I have dnsmasq in a cpu consuming loop | 17:32 |
dtantsur | nom-nom | 17:32 |
TheJulia | 88.6% om nom! | 17:32 |
dtantsur | \o/ | 17:32 |
TheJulia | ... how in the world did I do that | 17:32 |
clarkb | if you give a dhcp server a cookie... | 17:34 |
opendevreview | Verification of a change to openstack/ironic master failed: Split conductor-specific RPCService https://review.opendev.org/c/openstack/ironic/+/910251 | 17:35 |
clarkb | I'm curious, have you checked if neutron runs into similar problems? And if not maybe you can isolate what triggers these things (you probably have, it just seems odd that it's a consistent problem for ironic but apparently not for neutron) | 17:35 |
TheJulia | I remember I could see evidence in some of the non-ironic jobs that dnsmasq was getting restarted | 17:36 |
TheJulia | but that was a while back | 17:36 |
TheJulia | I haven't looked recently because it doesn't seem anything bare metal specific | 17:37 |
clarkb | ok I did wonder if maybe pxe boot flags could be the problem for example | 17:38 |
clarkb | since neutron in normal VM operation wouldn't be setting those | 17:38 |
TheJulia | it is any option handling it seems | 17:38 |
TheJulia | and all ports get options inherently to do the base matching | 17:39 |
clarkb | ya neutron will also set dns servers and other flags too iirc | 17:39 |
TheJulia | yup | 17:39 |
TheJulia | I'm sort of at the end of my ability to figure out what is going on, I don't really understand the inner workings well enough to do more than get a feeling for it being in option response building most likely | 17:59 |
JayF | TheJulia: can you make sure your research is reflected in that dnsmasq ubuntu bug? | 18:10 |
JayF | TheJulia: and I will see if I can pull a C expert outta the hat | 18:10 |
JayF | looks like you already did that | 18:11 |
JayF | score | 18:11 |
TheJulia | JayF: I did like an hour ago | 18:12 |
TheJulia | yeah | 18:13 |
TheJulia | waiting on hopefully another message from petr, but even he is unsure of a next step in debugging | 18:13 |
jrosser | would you have a link to the bug out of interest? | 18:14 |
JayF | https://bugs.launchpad.net/dnsmasq/+bug/2026757 | 18:14 |
JayF | I put out a bat-signal in GR-OSS downstream slack, this is basically how I started the eventlet stuff and got Itamar's help there. No promises it'll yield anything, but I am trying :D | 18:15 |
jrosser | a trivial reproducer might be valuable, then the barrier to debugging is lowered | 18:23 |
TheJulia | I've not been able to figure one out really | 18:24 |
TheJulia | it is not just about HUP operations, it is about that *and* dhcp options processing being triggered which appears to work just as expected, and upon the next hup it crashes | 18:25 |
*** awb_ is now known as awb | 19:05 | |
opendevreview | Merged openstack/ironic master: Split conductor-specific RPCService https://review.opendev.org/c/openstack/ironic/+/910251 | 20:53 |
adam__metal3 | Hello Ironic, just wondering if you have experienced kernel issues on Dell servers with BOSS-N1 raid controllers when running centos IPA ? I am having "fun" (not really) with new hw combinations and I am wondering whether that device is known to cause issues for IPA? Not sure it causes the problem for me .... | 22:02 |
JayF | What do you mean by kernel issues, and what centos version? | 22:03 |
adam__metal3 | latest upstream centos-9-stream, kernel panic | 22:03 |
JayF | Most of these are just ... $distro issues, but some stuff around how we build the ramdisk can cause headaches | 22:03 |
JayF | what is the panic? | 22:03 |
JayF | you have a screenshot/log? | 22:03 |
adam__metal3 | yeah I can make a screenshot, where should I share it? it's most likely not the boss one, I have built a new networkd driver for some intel E810-XXV cards and then the issue has changed to just getting stuck at the penguins on the boot screen, but I was wondering if it might be BOSS that is doing something extra weird because the same cards on a different machine work | 22:06 |
JayF | BOSS? | 22:07 |
JayF | ah, BOSS-N1 raid | 22:07 |
JayF | I see | 22:07 |
JayF | I don't have a good spot, any random image sharing site | 22:08 |
TheJulia | so BOSS cards have always been a little weird, but generally they've just worked. The only sort of known issue is they can very much cause weirdness with the device ordering in the OS making /dev/sda /dev/sdb, etc being unreliable across reboots/kernel upgrades, but that is not a unique problem only to those devices and why hinting is preferred | 22:12 |
adam__metal3 | https://imgur.com/a/6E4ljs1 | 22:12 |
adam__metal3 | TheJulia, thanks good to know! | 22:12 |
TheJulia | you could try changing the iommu options on the command line | 22:12 |
TheJulia | to debug that sort of issue, you're going to need to attach to a serial port and debug out to that to get as much of the kernel boot/initialization as possible | 22:13 |
TheJulia | but yeah, you're deeeeeeeep inside of the kernel | 22:13 |
adam__metal3 | yup and ofc in a global organization there is no living soul who can go there with a serial to usb adapter :D I think this specific hw is on another continent | 22:14 |
JayF | Like, have you reproduced this on another raid card/system? | 22:14 |
TheJulia | heh | 22:14 |
TheJulia | adam__metal3: ipmitool sol?! | 22:14 |
JayF | I mainly am asking just because this sorta issue is triggering my "are you SURE all ram is good?" spidey sense | 22:15 |
TheJulia | yeah, in device attach for iommu is... interesting | 22:15 |
opendevreview | Steve Baker proposed openstack/sushy-tools master: Add virtual-media-boot to openstack driver https://review.opendev.org/c/openstack/sushy-tools/+/906768 | 22:15 |
adam__metal3 | so here comes the even more weird part: there are 3 identical blade servers, dell R660s. The issue happens with all of them, but if you run inspection for a night, 2 out of the 3 eventually reboot enough times that they get inspected and one of them never does :D also as I mentioned the same network cards work in a different dell server just fine with the same IPA | 22:16 |
TheJulia | https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt is a good starting place for options, fwiw | 22:17 |
adam__metal3 | great I will check this kernel doc thanks this iommu is the best tip so far :D | 22:17 |
JayF | adam__metal3: I will say, if I was in your shoes: this is call the vendor territory | 22:18 |
TheJulia | ... that is *weird* | 22:18 |
TheJulia | yeah | 22:18 |
TheJulia | ++ | 22:18 |
JayF | vendor being RH or hardware-vendor | 22:18 |
JayF | you can launder the failure from centos into redhat if you have a contract | 22:18 |
TheJulia | I wonder if the blades have some sort of weird/random initialization thing going on | 22:19 |
adam__metal3 | these dell machines are all around weird imo... iDRAC has this weird queue/job system; when we change something it takes 15 minutes to modify any meaningful option.... slow as hell, I would expect better for servers that cost 12K... | 22:20 |
TheJulia | wow | 22:20 |
TheJulia | so.. a long long time ago I was actually at dell's offices in one of the labs, and I had a couple identical systems, one of which acted sort of like that | 22:21 |
TheJulia | I ended up resetting the idrac and bios firmware back to defaults and it became a bit more consistent | 22:22 |
TheJulia | ... I also remember flashing firmware at one point | 22:22 |
TheJulia | It was a blur couple of days | 22:22 |
adam__metal3 | we did the same, got a bit more stable but still slow, but a bit more consistent indeed | 22:22 |
adam__metal3 | we also went through a few network card firmwares and as I mentioned I even switched the centos mainline ice driver to the official intel one | 22:23 |
TheJulia | Hmmm | 22:23 |
TheJulia | I've never been a fan of dell's blades, tbh | 22:23 |
TheJulia | Has the overall chassis manager firmware been updated? | 22:24 |
adam__metal3 | that I don't know because we are actually not touching that, we have a <redacted> between agents like Ironic and the BMCs :D soo I am not even sure our platform folks have access to the chassis, I will ask tomorrow | 22:25 |
adam__metal3 | but in any case I can ask around this is also a good tip | 22:26 |
TheJulia | I did have some dell blades when I worked for a place in Atlanta, which had a mismatch and exhibited some really weird behavior until they were all sorted out on aligning versions | 22:26 |
TheJulia | that was a very long time ago, but still sort of similar | 22:27 |
adam__metal3 | JayF, TheJulia, I will go to sleep now, thanks for the tips and discussion, I will bring these points up to downstream | 22:29 |
TheJulia | goodnight! | 22:30 |
JayF | o/ | 22:30 |
TheJulia | https://bugs.launchpad.net/ironic/+bug/2056248 is a good one :) | 22:32 |
TheJulia | I'll note https://review.opendev.org/c/openstack/ironic-tempest-plugin/+/909939 could use a review or two. It fixes one of the pains with running our tempest suite against an ironic deployment | 22:33 |
JayF | TheJulia: that bug sounds like someone running a rabbitmq that is underscaled and/or has a failover timeout set high enough to screw up their environment | 22:34 |
JayF | (I don't know the actual rabbitmq word for it, but I've seen clusters that took longer to failover in failure cases than the default conductor timeouts) | 22:35 |
TheJulia | rabbit is not the actual arbiter | 22:38 |
TheJulia | mysql is | 22:38 |
TheJulia | so rabbit is entirely independent in that case | 22:38 |
JayF | I was about to ask you "was that the case back in $release_before_you_started_stacking" | 22:39 |
JayF | lol | 22:39 |
JayF | Heads up: I'm going on some PTO the first week of April, going to completely disconnect that week. | 22:44 |
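
The erase_devices exchange between dking and JayF (16:37 to 16:49) can be sketched roughly as below. This is a self-contained illustration, not IPA code: `CleaningError` and `ProtectedDeviceError` are local stand-ins for the classes in `ironic_python_agent.errors` (the message text is invented), and `find_protected_vgs`, `guard_erase_devices`, the `manual_clean` flag, and the `ceph-` prefix are hypothetical names chosen for the example.

```python
class CleaningError(Exception):
    """Local stand-in for ironic_python_agent.errors.CleaningError."""


class ProtectedDeviceError(CleaningError):
    """Stand-in: a CleaningError with a preset message (text is invented)."""

    def __init__(self, device, what):
        self.device = device
        super().__init__(f"Protected device {device} contains {what}")


def find_protected_vgs(vg_names, protected_prefixes=("ceph-",)):
    """Return the volume group names that start with a protected prefix."""
    return [vg for vg in vg_names if vg.startswith(tuple(protected_prefixes))]


def guard_erase_devices(vg_names, manual_clean=False):
    """Raise before erasing if protected volume groups are present.

    A downstream hardware manager could run a check like this at the top of
    its own erase_devices() override; here the VG names are passed in
    directly instead of being discovered on the node.
    """
    protected = find_protected_vgs(vg_names)
    if protected and not manual_clean:
        # Node-wide refusal: a generic CleaningError with a custom message.
        raise CleaningError(
            "Refusing automated cleaning; protected volume groups: "
            + ", ".join(protected)
        )
    return protected
```

Raising the generic error suits a node-wide check, while the device-specific subclass fits when a single device can be named; as noted in the chat, either one stops cleaning.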
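
JayF's point about `max_microversion` (15:41 to 15:43) is that an API test can only skip itself when the configured maximum is known to be below what the test needs. A rough standalone sketch of that comparison follows. `parse_microversion` and `should_skip` are hypothetical helpers, the version strings are made up, and treating `latest` as "never skip" is an assumption; this is not the tempest plugin's actual implementation.

```python
def parse_microversion(value):
    """Turn a string such as '1.82' into a comparable (major, minor) tuple."""
    major, minor = value.split(".")
    return int(major), int(minor)


def should_skip(test_min, configured_max):
    """Return True when the test needs a newer API than the branch offers."""
    if configured_max == "latest":
        # Assumption: 'latest' means no upper bound, so nothing is skipped.
        return False
    return parse_microversion(test_min) > parse_microversion(configured_max)
```

Comparing integer tuples rather than raw strings matters because `"1.9"` sorts after `"1.10"` as text even though 1.10 is the newer microversion.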