vanou | good morning ironic | 02:28 |
---|---|---|
janders | ajya I have a question w/r/t https://review.opendev.org/c/openstack/sushy/+/865388 (SettingsURI boot attrs fix). From which iDRAC version does this become applicable to Dell machines? I think when we were collaborating on this change we were working against the behaviour of a future version. If my understanding is correct, I'd be keen to do some | 06:50 |
janders | testing against this version as soon as it's out so that we can validate the fix again and iron out any issues before users hit them. Good idea? | 06:50 |
vanou | TheJulia: Regarding your concern on https://review.opendev.org/c/openstack/ironic/+/865074. (periodic task): I think another approach is to implement a vendor passthru method through which the user explicitly triggers fetching and storing of the firmware version. (packaging module dependency): Can I implement a version comparison function in the irmc driver code? | 07:05 |
vanou | Hi janders | 07:05 |
ajya | Hi janders, it should be the next version, usually released in December. So it could be soon. | 07:18 |
janders | vanou hello! :) | 09:05 |
janders | ajya ACK and thank you! | 09:05 |
kubajj | Morning everyone | 09:26 |
rpittau | good morning ironic! o/ | 09:33 |
rpittau | TheJulia: re: 866780 thanks, I'll have a look ASAP | 09:34 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972 | 10:01 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866780 | 10:02 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972 | 10:08 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866780 | 10:09 |
opendevreview | Kirill proposed openstack/ironic-specs master: new spec: support of vnc console. https://review.opendev.org/c/openstack/ironic-specs/+/866537 | 10:32 |
opendevreview | Kirill proposed openstack/ironic-specs master: new spec: support of vnc console. https://review.opendev.org/c/openstack/ironic-specs/+/866537 | 11:09 |
opendevreview | Dmitry Tantsur proposed openstack/sushy stable/zed: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866899 | 11:16 |
opendevreview | Dmitry Tantsur proposed openstack/sushy stable/yoga: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866901 | 11:17 |
opendevreview | Dmitry Tantsur proposed openstack/sushy stable/xena: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866903 | 11:17 |
opendevreview | Dmitry Tantsur proposed openstack/sushy stable/wallaby: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866904 | 11:17 |
vanou | Hi rpittau | 12:27 |
opendevreview | Slawek Kaplonski proposed openstack/ironic master: [grenade] Explicitly enable Neutron ML2/OVS services in the CI job https://review.opendev.org/c/openstack/ironic/+/866993 | 13:53 |
rpittau | mmm lots of failures everywhere, anything broken at large scale? | 14:04 |
rpittau | oh looks like a tox issue | 14:04 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972 | 14:37 |
opendevreview | Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972 | 14:38 |
TheJulia | yeah... :( | 15:51 |
opendevreview | Verification of a change to openstack/ironic master failed: Fix debug log message argument formatting https://review.opendev.org/c/openstack/ironic/+/866856 | 16:00 |
TheJulia | stevebaker[m]: so I took a look at the proliantutils change you -1'ed on master branch, issue is upper-constraints has not incorporated the latest sushy release, and nobody rechecked them even though they are approved. I've done so, so hopefully they should merge in the next 24 hours | 16:14 |
opendevreview | Merged openstack/sushy stable/xena: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866903 | 16:42 |
opendevreview | Merged openstack/sushy stable/wallaby: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866904 | 16:43 |
rpittau | good night! o/ | 17:03 |
jrosser | i am closer to getting a system deploying - any tips for this would be welcome https://pastebin.com/raw/2W4Xbtfh | 17:04 |
TheJulia | jrosser: o/ | 18:05 |
jrosser | hello o/ | 18:05 |
TheJulia | reading the log | 18:06 |
jrosser | it seems to lsblk a partition which fails | 18:07 |
TheJulia | so the Mellanox failure is weird too | 18:09 |
TheJulia | fwiw | 18:09 |
jrosser | is there compatibility between my yoga controller and a newer ipa? | 18:09 |
jrosser | i am wondering if i have a too new set of hardware for centos8 kernel | 18:10 |
TheJulia | ewww, | 18:10 |
TheJulia | nvme0n1p2 not a block device.... | 18:10 |
TheJulia | wow | 18:11 |
jrosser | yes at that point i was wtf | 18:11 |
TheJulia | ... yeah | 18:11 |
jrosser | but i did manage to put it into maintenance, log in and reproduce that | 18:11 |
TheJulia | oh, really?! | 18:11 |
TheJulia | so.... does the nvme device show a partition table, and what does the nvme tools say? | 18:12 |
jrosser | you'll see there is a loooong time at the end of the log while i was poking around it | 18:12 |
TheJulia | yeah | 18:12 |
jrosser | i'll boot it again | 18:13 |
jrosser | hmm right the node has been cleaned so the partitions are not there any more | 18:22 |
jrosser | i do have the logs though from that deployment https://paste.opendev.org/show/bkRrxy1XiREu1f0KopWh/ | 18:25 |
jrosser | and now i've got the node back to the state of the log, and can reproduce it https://paste.opendev.org/show/b87JVhA7eVbdXGpsSE8L/ | 18:31 |
TheJulia | https://lists.zuul-ci.org/archives/list/zuul-announce@lists.zuul-ci.org/message/3NNATSUTSIGP5FE2MDY5X2KJ5X4NB4PT/ <-- JayF | 18:35 |
TheJulia | jrosser: what if you reboot? | 18:35 |
JayF | Yep; I'm aware. I helped edit that post a little bit :) | 18:35 |
TheJulia | okay, cool | 18:35 |
JayF | this is why our gate is busted | 18:35 |
JayF | there's a follow on thing that broke tempest related to the tox 3/4 fix aiui | 18:36 |
TheJulia | well it sounds like "we have just under 2 weeks otherwise we're SOL" | 18:36 |
TheJulia | janders: so you've done a lot with NVMes, is anything jrosser is experiencing above ringing bells for you? | 18:37 |
JayF | TheJulia: yep, pretty much, but the breakages are minor and are unlikely to impact us | 18:38 |
JayF | TheJulia: I already did some of the prework (removing install_command) | 18:39 |
TheJulia | and we then need to cherry-pick down ? | 18:39 |
TheJulia | jrosser: it is super weird, to me, that you can lsblk the first partition, but not the second | 18:40 |
TheJulia | I *have* seen something like this on a nvme device, but it had a super weird over-usb bridge and I kind of just dismissed it | 18:40 |
JayF | TheJulia: maybe. We'll see. The install_command stuff wasn't impactful everywhere. i'm also OK with our answer being "pin tox to 3.x" on stable branches | 18:40 |
TheJulia | seems reasonable to me | 18:41 |
JayF | TheJulia: jrosser: This smells like a bad NVMe controller or bad partition table. | 18:41 |
TheJulia | yeah | 18:41 |
jrosser | this should be somehow regularised with cleaning? | 18:41 |
JayF | TheJulia: jrosser: If the data on that disk is no good, I'd try a dd if=/dev/zero of=/dev/nvme0n1 count=1 bs=10M or something to kill the partition table | 18:41 |
TheJulia | well, part of the reason why I'm wondering about what happens with a reboot | 18:42 |
JayF | jrosser: you're 100% right that cleaning should handle it; we just have to suss out what exactly is the problem before we can make cleaning be OK with it | 18:42 |
JayF | jrosser: TheJulia: yeah, I'd do a reboot first, then try zeroing that disk if it doesn't work | 18:42 |
TheJulia | since the partition table caching code can be a little weird, maybe there is an edge case with NVMes | 18:42 |
TheJulia | and by caching code, I mean in the kernel | 18:42 |
JayF | yep that's a good thought | 18:43 |
jrosser | ok so after a reboot | 18:43 |
jrosser | [root@host-10-88-104-67 ~]# lsblk /dev/nvme0n1p2 --pairs --bytes --ascii --nodeps --output UUID,PARTUUID | 18:43 |
jrosser | UUID="" PARTUUID="ea590a46-9f74-4cf0-8ecd-dd17e3a14bab" | 18:43 |
TheJulia | okay | 18:44 |
TheJulia | so... hmmmmm | 18:44 |
TheJulia | this is a sign the table contents and the kernel are out of sync | 18:44 |
jrosser | so previously it has `Image streamed onto device /dev/nvme0n1p2 in 11.578590631484985 seconds` | 18:45 |
TheJulia | okay, that is freaky | 18:45 |
TheJulia | is there a kpartx command after that? | 18:45 |
jrosser | nothing about that in the log | 18:46 |
TheJulia | I wonder if you got it back into the bad state, and what `kpartx -l /dev/nvme0n1` would yield | 18:48 |
jrosser | i can do that | 18:49 |
TheJulia | it is super super weird that it is able to stream the image, but then when you turn around things go kaboom | 18:51 |
TheJulia | oh! | 18:51 |
TheJulia | i wonder if it has a nested partition table | 18:51 |
JayF | that's not a bad idea | 18:52 |
jrosser | would that be if i've built the image wrong? | 18:52 |
JayF | Potentially. You wanna find out? | 18:53 |
JayF | What format is the image in? | 18:53 |
JayF | I used to use `qemu-nbd` to mount up those images loopback, then you can inspect them | 18:53 |
TheJulia | oh! 'is_whole_disk_image': False | 18:53 |
TheJulia | so it thinks it is a partition image... but *is* it?!? | 18:53 |
JayF | TheJulia: yeah, I assumed this was all partition image | 18:53 |
JayF | we are coming to the same place from opposite directions lol | 18:53 |
TheJulia | jrosser: do you have a diskimage-builder command you can share for the image then? | 18:54 |
TheJulia | it might be that the image has a partition table, and it confuses the nvme device | 18:54 |
JayF | it doesn't help that linux nvme names these days look like the output of `pwgen` | 18:54 |
TheJulia | The config ironic gets has it thinking it is a partition image | 18:54 |
jrosser | `disk-image-create ubuntu vm block-device-efi dhcp-all-interfaces -o my-image` | 18:55 |
TheJulia | so that *should* be a whole disk image aiui | 18:55 |
TheJulia | so in glance, remove the kernel_id and ramdisk_id fields from the glance image uuid, and redeploy | 18:55 |
JayF | TheJulia: I wonder if we can detect that, in Ironic. | 18:56 |
TheJulia | so, it is actually valid on non-nvmes | 18:56 |
TheJulia | as... awful... as that seems | 18:56 |
JayF | it does seem awful | 18:56 |
TheJulia | nvme's have smarts in them that grok partition tables | 18:56 |
JayF | I mean, you can do partitions on raid, partitions on lvm, partitions on partitions why the hell not :| | 18:56 |
TheJulia | ... yeah | 18:57 |
TheJulia | This is definitely FAQ worthy if it works | 18:57 |
jrosser | i'm a bit confused about what you'd like me to try in glance | 18:58 |
TheJulia | well, let's take a look at `openstack image show` output for the image id you're using | 18:59 |
jrosser | https://paste.opendev.org/show/bD0pHFt1XMVTtQvKhEy8/ | 19:00 |
JayF | properties: img_type: partition | 19:00 |
JayF | but there is no kernel/ramdisk set? | 19:01 |
TheJulia | yeah, there are 2 ways | 19:01 |
TheJulia | aiui | 19:01 |
TheJulia | I think that needs to be whole-disk | 19:01 |
TheJulia | checking | 19:01 |
jrosser | oh i see | 19:02 |
TheJulia | yeah, set that to "whole-disk" and ironic will treat it as a whole disk image | 19:02 |
JayF | So in this case, the glance image is set to partition but has no kernel/ramdisk set | 19:02 |
TheJulia | since you have no properties/ramdisk_id and no properties/kernel_id | 19:03 |
TheJulia | yup | 19:03 |
JayF | how would Ironic have resolved that in the partition-on-partition (with no nvme) case? | 19:03 |
TheJulia | it would have just written it | 19:03 |
JayF | Is that a valid use case? | 19:03 |
TheJulia | it might not have worked | 19:03 |
JayF | I'm trying to figure out if this is a case we should check for | 19:03 |
TheJulia | valid in raid cases | 19:03 |
JayF | ah | 19:03 |
TheJulia | yeah :( | 19:03 |
jrosser | so how this has come about is we have an x86 non uefi image built the same way, and that has worked on systems with ssd | 19:04 |
jrosser | but that could be accidental | 19:04 |
opendevreview | Merged openstack/ironic stable/ussuri: CI: Pin ussuri to use only basic standalone tests https://review.opendev.org/c/openstack/ironic/+/861305 | 19:08 |
TheJulia | yeah, nvme's actually abstract the partitioning table | 19:11 |
TheJulia | so it can store things efficiently | 19:11 |
TheJulia | at least, as I understand it | 19:11 |
TheJulia | SSDs don't think too hard about their contents | 19:12 |
jrosser | i think you have to unset img_type | 19:21 |
jrosser | oh maybe not, the options parsing is complicated :) | 19:22 |
jrosser | TheJulia: JayF: it has worked! thank you so much! | 19:41 |
JayF | \o/ | 19:42 |
TheJulia | woot! | 19:51 |
TheJulia | jrosser: you're welcome to start an edit to our docs :) | 19:51 |
JayF | did the cleaning-states-issue just go kaput? | 19:51 |
JayF | was the environment rebuilt? did a conductor restart fix it? | 19:51 |
TheJulia | ++ | 19:52 |
* JayF really wants to get those fixes landed and backported because the failure mode is so nasty | 19:52 | |
TheJulia | yes | 19:52 |
opendevreview | Julia Kreger proposed openstack/ironic-specs master: Add a shard key https://review.opendev.org/c/openstack/ironic-specs/+/861803 | 20:42 |
jrosser | JayF: my troubles yesterday were fixed by removing my ironic venvs and redeploying them completely | 21:22 |
JayF | Julia and I were bandying around an idea that a conductor restart might have fixed it | 21:23 |
JayF | did you, at any point in the troubleshooting before doing the rebuild, reproduce the error after a conductor restart? | 21:23 |
JayF | if not, it's fine, just making sure we squeeze every last drop of info out :P | 21:23 |
jrosser | i was restarting one of the conductors, the one responsible for the node i am testing | 21:23 |
jrosser | but the rest were left alone (it's 3-way HA) | 21:24 |
JayF | Useful information. | 21:24 |
JayF | Can you, for fun, just do a pip freeze in the working venv? | 21:24 |
JayF | so we can diff with the other one you posted? See if there is a code difference | 21:24 |
JayF | or if you just scared a gremlin away | 21:24 |
jrosser | i can do that tomorrow, as yes that would be very interesting | 21:25 |
jrosser | i applied the two patches for exception handling as well https://github.com/bbc/ironic/commits/bbc-yoga-25.0.0 | 21:26 |
JayF | ack; good stuff | 21:27 |
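
The `pip freeze` comparison JayF asks for at the end of the log could be done with a few lines of Python. This is only a sketch: `parse_freeze` and `diff_freeze` are hypothetical helper names, not part of pip or Ironic, and only plain `name==version` lines are considered (editable installs and URL pins are skipped).

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze` output.

    Only plain `name==version` lines are kept; comments, blank lines,
    editable installs and URL pins are ignored (a simplifying assumption).
    """
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        pins[name.lower()] = version
    return pins


def diff_freeze(old_text, new_text):
    """Return {package: (old_version, new_version)} for every difference.

    A version of None means the package is missing from that environment.
    """
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes
```

Feeding it the saved output of `pip freeze` from the broken venv and from the rebuilt one would list only the packages whose versions differ, which is the "code difference" question raised in the chat.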
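
The fix jrosser landed on (changing `img_type` on the Glance image to `whole-disk`) follows from the rule TheJulia and JayF describe in the log: an explicit `img_type` decides the matter, and otherwise the absence of `kernel_id` and `ramdisk_id` implies a whole-disk image. The function below is a sketch of that rule as stated in the conversation; `treated_as_whole_disk` is a hypothetical name and this is not Ironic's actual implementation.

```python
def treated_as_whole_disk(properties):
    """Sketch of the image-type decision as described in the chat.

    `properties` is a dict of Glance image properties. This mirrors the
    discussion only; it is not copied from Ironic.
    """
    img_type = properties.get("img_type")
    # An explicit img_type wins, which is why the image in the log was
    # handled as a partition image despite having no kernel/ramdisk set.
    if img_type == "whole-disk":
        return True
    if img_type == "partition":
        return False
    # No explicit type: a partition image is expected to reference a
    # kernel and ramdisk, so their absence implies a whole-disk image.
    return not properties.get("kernel_id") and not properties.get("ramdisk_id")
```

Under this reading, the image in the log (`img_type: partition`, no kernel or ramdisk) is written to a partition even though `disk-image-create ... vm block-device-efi` produced an image with its own partition table.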