Thursday, 2022-12-08

vanougood morning ironic02:28
jandersajya I have a question w/r/t https://review.opendev.org/c/openstack/sushy/+/865388 (SettingsURI boot attrs fix). From which iDRAC version does this become applicable to Dell machines? I think when we were collaborating on this change we were working against the behaviour of a future version. If my understanding is correct, I'd be keen to do some06:50
janderstesting against this version as soon as it's out so that we can validate the fix again and iron out any issues before users hit them. Good idea?06:50
vanouTheJulia: Regarding your concerns on https://review.opendev.org/c/openstack/ironic/+/865074. (periodic task): I think another approach is to implement a vendor passthru method through which the user explicitly triggers fetching and storing of the firmware version. (packaging module dependency): Can I implement a version comparison function in the irmc driver code?07:05
vanouHi janders07:05
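Editor's note on vanou's question about avoiding the `packaging` dependency: a version comparison can be quite small. The following is a minimal sketch, assuming firmware versions are purely numeric dotted strings. The function names are hypothetical and are not taken from the iRMC driver.

```python
def parse_version(version):
    """Turn a dotted numeric string such as '1.25.3' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def version_at_least(current, minimum):
    """Return True if `current` is the same as or newer than `minimum`."""
    cur = parse_version(current)
    low = parse_version(minimum)
    # Pad the shorter tuple with zeros so that '1.2' compares equal to '1.2.0'.
    width = max(len(cur), len(low))
    cur += (0,) * (width - len(cur))
    low += (0,) * (width - len(low))
    return cur >= low
```

Comparing integer tuples avoids the string-ordering trap where "1.10" sorts before "1.9". A version containing letters would raise `ValueError` in this sketch, so a real driver would need to decide how to handle such strings.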
ajyaHi janders, it should be the next version, usually released in December. So it could be soon.07:18
jandersvanou hello! :)09:05
jandersajya ACK and thank you!09:05
kubajjMorning everyone 09:26
rpittaugood morning ironic! o/09:33
rpittauTheJulia: re: 866780 thanks, I'll have a look ASAP09:34
opendevreviewRiccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal  https://review.opendev.org/c/openstack/ironic/+/86697210:01
opendevreviewRiccardo Pittau proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode  https://review.opendev.org/c/openstack/ironic/+/86678010:02
opendevreviewRiccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal  https://review.opendev.org/c/openstack/ironic/+/86697210:08
opendevreviewRiccardo Pittau proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode  https://review.opendev.org/c/openstack/ironic/+/86678010:09
opendevreviewKirill proposed openstack/ironic-specs master: new spec: support of vnc console.  https://review.opendev.org/c/openstack/ironic-specs/+/86653710:32
opendevreviewKirill proposed openstack/ironic-specs master: new spec: support of vnc console.  https://review.opendev.org/c/openstack/ironic-specs/+/86653711:09
opendevreviewDmitry Tantsur proposed openstack/sushy stable/zed: Handle a different error code for missing TransferProtocolType  https://review.opendev.org/c/openstack/sushy/+/86689911:16
opendevreviewDmitry Tantsur proposed openstack/sushy stable/yoga: Handle a different error code for missing TransferProtocolType  https://review.opendev.org/c/openstack/sushy/+/86690111:17
opendevreviewDmitry Tantsur proposed openstack/sushy stable/xena: Handle a different error code for missing TransferProtocolType  https://review.opendev.org/c/openstack/sushy/+/86690311:17
opendevreviewDmitry Tantsur proposed openstack/sushy stable/wallaby: Handle a different error code for missing TransferProtocolType  https://review.opendev.org/c/openstack/sushy/+/86690411:17
vanouHi rpittau12:27
opendevreviewSlawek Kaplonski proposed openstack/ironic master: [grenade] Explicitly enable Neutron ML2/OVS services in the CI job  https://review.opendev.org/c/openstack/ironic/+/86699313:53
rpittaummm lots of failures everywhere, anything broken at large scale?14:04
rpittauoh looks like a tox issue14:04
opendevreviewRiccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal  https://review.opendev.org/c/openstack/ironic/+/86697214:37
opendevreviewRiccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal  https://review.opendev.org/c/openstack/ironic/+/86697214:38
TheJuliayeah... :(15:51
opendevreviewVerification of a change to openstack/ironic master failed: Fix debug log message argument formatting  https://review.opendev.org/c/openstack/ironic/+/86685616:00
TheJuliastevebaker[m]: so I took a look at the proliantutils change you -1'ed on master branch, issue is upper-constraints has not incorporated the latest sushy release, and nobody rechecked them even though they are approved. I've done so, so hopefully they should merge in the next 24 hours16:14
opendevreviewMerged openstack/sushy stable/xena: Handle a different error code for missing TransferProtocolType  https://review.opendev.org/c/openstack/sushy/+/86690316:42
opendevreviewMerged openstack/sushy stable/wallaby: Handle a different error code for missing TransferProtocolType  https://review.opendev.org/c/openstack/sushy/+/86690416:43
rpittaugood night! o/17:03
jrosseri am closer to getting a system deploying - any tips for this would be welcome https://pastebin.com/raw/2W4Xbtfh17:04
TheJuliajrosser: o/18:05
jrosserhello o/18:05
TheJuliareading the log18:06
jrosserit seems to lsblk a partition which fails18:07
TheJuliaso the Mellanox failure is weird too18:09
TheJuliafwiw18:09
jrosseris there compatibility between my yoga controller and a newer ipa?18:09
jrosseri am wondering if i have a too new set of hardware for centos8 kernel18:10
TheJuliaewww, 18:10
TheJulianvme0n1p2 not a block device....18:10
TheJuliawow18:11
jrosseryes at that point i was wtf18:11
TheJulia... yeah18:11
jrosserbut i did manage to put it into maintenance, log in and reproduce that18:11
TheJuliaoh, really?!18:11
TheJuliaso.... does the nvme device show a partition table, and what does the nvme tools say?18:12
jrosseryou'll see there is a loooong time at the end of the log while i was poking around it18:12
TheJuliayeah18:12
jrosseri'll boot it again18:13
jrosserhmm right the node has been cleaned so the partitions are not there any more18:22
jrosseri do have the logs though from that deployment https://paste.opendev.org/show/bkRrxy1XiREu1f0KopWh/18:25
jrosserand now i've got the node back to the state of the log, and can reproduce it https://paste.opendev.org/show/b87JVhA7eVbdXGpsSE8L/18:31
TheJuliahttps://lists.zuul-ci.org/archives/list/zuul-announce@lists.zuul-ci.org/message/3NNATSUTSIGP5FE2MDY5X2KJ5X4NB4PT/ <-- JayF18:35
TheJuliajrosser: what if you reboot?18:35
JayFYep; I'm aware. I helped edit that post a little bit :) 18:35
TheJuliaokay, cool18:35
JayFthis is why our gate is busted18:35
JayFthere's a follow on thing that broke tempest related to the tox 3/4 fix aiui18:36
TheJuliawell it sounds like "we have just under 2 weeks otherwise we're SOL"18:36
TheJuliajanders: so you've done a lot with NVMes, is anything jrosser is experiencing above ringing bells for you?18:37
JayFTheJulia: yep, pretty much, but the breakages are minor and are unlikely to impact us18:38
JayFTheJulia: I already did some of the prework (removing install_command)18:39
TheJuliaand we then need to cherry-pick down ?18:39
TheJuliajrosser: it is super weird, to me, that you can lsblk the first partition, but not the second18:40
TheJuliaI *have* seen something like this on a nvme device, but it had a super weird  over-usb bridge and I kind of just dismissed it18:40
JayFTheJulia: maybe. We'll see. The install_command stuff wasn't impactful everywhere. i'm also OK with our answer being "pin tox to 3.x" on stable branches18:40
TheJuliaseems reasonable to me18:41
JayFTheJulia: jrosser: This smells like a bad NVMe controller or bad partition table.18:41
TheJuliayeah18:41
jrosserthis should be somehow regularised with cleaning?18:41
JayFTheJulia: jrosser: If the data on that disk is no good, I'd try a dd if=/dev/zero of=/dev/nvme0n1 count=1 bs=10M or something to kill the partition table18:41
TheJuliawell, part of the reason why I'm wondering about what happens with a reboot18:42
JayFjrosser: you're 100% right that cleaning should handle it; we just have to suss out what exactly is the problem before we can make cleaning be OK with it18:42
JayFjrosser: TheJulia:yeah, I'd do a reboot first, then try zeroing that disk if it doesn't work18:42
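Editor's note: the `dd` invocation JayF suggests overwrites the first 10 MiB of the device with zeros. Below is a rough Python equivalent, shown on a scratch file rather than a real disk; the function name `zero_head` is hypothetical. It only zeros the start of the target, so a backup GPT header stored at the end of a disk would not be touched by it.

```python
import os
import tempfile


def zero_head(path, nbytes=10 * 1024 * 1024):
    """Overwrite the first `nbytes` of `path` with zeros, in place."""
    # "r+b" opens for update without truncating the rest of the file.
    with open(path, "r+b") as f:
        f.write(b"\x00" * nbytes)
        f.flush()
        os.fsync(f.fileno())


# Demonstration on a scratch file, never on a real device.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(b"\xff" * 4096)
    scratch_path = scratch.name

zero_head(scratch_path, nbytes=1024)
with open(scratch_path, "rb") as f:
    data = f.read()
os.unlink(scratch_path)
```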
TheJuliasince the partition table caching code can be a little weird, maybe there is an edge case with nvmes'18:42
TheJuliaand by caching code, I mean in the kernel18:42
JayFyep that's a good thought18:43
jrosserok so after a reboot18:43
jrosser[root@host-10-88-104-67 ~]# lsblk /dev/nvme0n1p2 --pairs --bytes --ascii --nodeps --output UUID,PARTUUID18:43
jrosserUUID="" PARTUUID="ea590a46-9f74-4cf0-8ecd-dd17e3a14bab"18:43
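Editor's note: the `--pairs` output above is a line of `KEY="value"` tokens. Here the partition has a PARTUUID from the partition table, while lsblk reports an empty filesystem UUID. A minimal sketch of parsing such a line, with a hypothetical helper name:

```python
import shlex


def parse_lsblk_pairs(line):
    """Parse one line of `lsblk --pairs` output into a dict."""
    result = {}
    # shlex handles the double quotes, including empty values like UUID="".
    for token in shlex.split(line):
        key, _, value = token.partition("=")
        result[key] = value
    return result


sample = 'UUID="" PARTUUID="ea590a46-9f74-4cf0-8ecd-dd17e3a14bab"'
info = parse_lsblk_pairs(sample)
```

With the dict in hand, a caller can tell "partition exists but has no filesystem UUID" (`info["UUID"] == ""`) apart from the earlier failure, where lsblk did not accept the partition as a block device at all.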
TheJuliaokay18:44
TheJuliaso... hmmmmm18:44
TheJuliathis is a sign the table contents and the kernel are out of sync18:44
jrosserso previously it has `Image streamed onto device /dev/nvme0n1p2 in 11.578590631484985 seconds`18:45
TheJuliaokay, that is freaky18:45
TheJuliais there a kpartx command after that?18:45
jrossernothing about that in the log18:46
TheJuliaI wonder if you got it back into the bad state, and what `kpartx -l /dev/nvme0n1` would yield18:48
jrosseri can do that18:49
TheJuliait is super super weird that it is able to stream the image, but then when you turn around things go kaboom18:51
TheJuliaoh!18:51
TheJuliai wonder if it has a nested partition table18:51
JayFthat's not a bad idea18:52
jrosserwould that be if i've built the image wrong?18:52
JayFPotentially. You wanna find out?18:53
JayFWhat format is the image in? 18:53
JayFI used to use `qemu-nbd` to mount up those images loopback, then you can inspect them18:53
TheJuliaoh! 'is_whole_disk_image': False18:53
TheJuliaso it thinks it is a partition image... but *is* it?!?18:53
JayFTheJulia: yeah, I assumed this was all partition image18:53
JayFwe are coming to the same place from opposite directions lol18:53
TheJuliajrosser: do you have a diskimage-builder command you can share for the image then?18:54
TheJuliait might be that the image has a partition table, and it confuses the nvme device18:54
JayFit doesn't help that linux nvme names these days look like the output of `pwgen`18:54
TheJuliaThe config ironic gets has it thinking it is a partition image18:54
jrosser`disk-image-create ubuntu vm block-device-efi dhcp-all-interfaces -o my-image`18:55
TheJuliaso that *should* be a whole disk image aiui18:55
TheJuliaso in glance, remove the kernel_id and ramdisk_id fields from the glance image uuid, and redeploy18:55
JayFTheJulia: I wonder if we can detect that, in Ironic.18:56
TheJuliaso, it is actually valid on non-nvmes18:56
TheJuliaas... awful... as that seems18:56
JayFit does seem awful18:56
TheJulianvme's have smarts in them that grok partition tables18:56
JayFI mean, you can do partitions on raid, partitions on lvm, partitions on partitions why the hell not :|18:56
TheJulia... yeah18:57
TheJuliaThis is definitely FAQ worthy if it works18:57
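Editor's note on JayF's "can we detect that" question: one heuristic is to look for partition table signatures at the start of a raw image. The sketch below is not Ironic code. It assumes a raw (not qcow2) image and 512-byte sectors, and the function name is hypothetical. A GPT header begins with `EFI PART` at the start of the second sector, and an MBR ends its first sector with the bytes `0x55 0xAA`.

```python
import os
import tempfile

SECTOR = 512  # assumed logical sector size


def find_partition_table(path):
    """Return 'gpt', 'mbr', or None for the raw image at `path`."""
    with open(path, "rb") as f:
        head = f.read(2 * SECTOR)
    # Check GPT first: GPT disks also carry a protective MBR.
    if head[SECTOR:SECTOR + 8] == b"EFI PART":
        return "gpt"
    if head[510:512] == b"\x55\xaa":
        return "mbr"
    return None


# Demonstration: a synthetic two-sector file with only a GPT signature.
with tempfile.NamedTemporaryFile(delete=False) as img:
    img.write(b"\x00" * SECTOR + b"EFI PART" + b"\x00" * (SECTOR - 8))
    img_path = img.name
detected = find_partition_table(img_path)
os.unlink(img_path)
```

The `mbr` result is only a hint, because some filesystem boot sectors (FAT, for example) end with the same two bytes. The GPT signature is a much stronger indicator that a supposed partition image is really a whole-disk image.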
jrosseri'm a bit confused about what you'd like me to try in glance18:58
TheJuliawell, let's take a look at `openstack image show` output for the image id you're using18:59
jrosserhttps://paste.opendev.org/show/bD0pHFt1XMVTtQvKhEy8/19:00
JayFproperties: img_type: partition19:00
JayFbut there is no kernel/ramdisk set? 19:01
TheJuliayeah, there are 2 ways19:01
TheJuliaaiui19:01
TheJuliaI think that needs to be whole-disk19:01
TheJuliachecking19:01
jrosseroh i see19:02
TheJuliayeah, set that to "whole-disk" and ironic will treat it as a whole disk image19:02
JayFSo in this case, the glance image is set to partition but has no kernel/ramdisk set19:02
TheJuliasince you have no properties/ramdisk_id and no properties/kernel_id19:03
TheJuliayup19:03
JayFhow would Ironic have resolved that in the partition-on-partition (with no nvme) case?19:03
TheJuliait would have just written it19:03
JayFIs that a valid use case?19:03
TheJuliait might not have worked19:03
JayFI'm trying to figure out if this is a case we should check for19:03
TheJuliavalid in raid cases19:03
JayFah19:03
TheJuliayeah :(19:03
jrosserso how this has come about is we have an x86 non uefi image built the same way, and that has worked on systems with ssd19:04
jrosserbut that could be accidental19:04
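Editor's note: the "2 ways" discussed above can be summarised as a small decision function. This is a sketch of the behaviour as described in this conversation, not Ironic's actual implementation, and the function name is hypothetical. An explicit `img_type` property wins; without it, the presence of `kernel_id` or `ramdisk_id` marks a partition image.

```python
def treat_as_whole_disk(properties):
    """Decide whether image properties describe a whole-disk image."""
    img_type = properties.get("img_type")
    if img_type == "whole-disk":
        return True
    if img_type == "partition":
        return False
    # No explicit type: a kernel or ramdisk reference implies a partition image.
    return not (properties.get("kernel_id") or properties.get("ramdisk_id"))
```

In jrosser's case the image carried `img_type: partition` with no kernel or ramdisk, so the explicit property forced the partition path even though `disk-image-create ... vm block-device-efi` had produced a whole-disk image.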
opendevreviewMerged openstack/ironic stable/ussuri: CI: Pin ussuri to use only basic standalone tests  https://review.opendev.org/c/openstack/ironic/+/86130519:08
TheJuliayeah, nvme's actually abstract the partitioning table19:11
TheJuliaso it can store things efficiently19:11
TheJuliaat least, as I understand it19:11
TheJuliaSSDs don't think too hard about their contents19:12
jrosseri think you have to unset img_type19:21
jrosseroh maybe not, the options parsing is complicated :)19:22
jrosserTheJulia: JayF: it has worked! thank you so much!19:41
JayF\o/19:42
TheJuliawoot!19:51
TheJuliajrosser: you're welcome to start an edit to our docs :)19:51
JayFdid the cleaning-states-issue just go kaput?19:51
JayFwas the environment rebuilt? did a conductor restart fix it?19:51
TheJulia++19:52
* JayF really wants to get those fixes landed and backported because the failure mode is so nasty19:52
TheJuliayes19:52
opendevreviewJulia Kreger proposed openstack/ironic-specs master: Add a shard key  https://review.opendev.org/c/openstack/ironic-specs/+/86180320:42
jrosserJayF: my troubles yesterday were fixed by removing my ironic venvs and redeploying them completely21:22
JayFJulia and I were bandying around an idea that a conductor restart might have fixed it21:23
JayFdid you, at any point in the troubleshooting before doing the rebuild, reproduce the error after a conductor restart?21:23
JayFif not, it's fine, just making sure we squeeze every last drop of info out :P 21:23
jrosseri was restarting one of the conductors, the one responsible for the node i am testing21:23
jrosserbut the rest were left alone (it's 3-way HA)21:24
JayFUseful information.21:24
JayFCan you, for fun, just do a pip freeze in the working venv?21:24
JayFso we can diff with the other one you posted? See if there is a code difference21:24
JayFor if you just scared a gremlin away21:24
jrosseri can do that tomorrow, as yes that would be very interesting21:25
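Editor's note: comparing two `pip freeze` outputs, as JayF proposes, can be done with a short script. This is a minimal sketch with hypothetical function names. It only looks at `name==version` lines, so editable installs and `name @ url` entries are skipped.

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze` output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blanks, comments, and non-pinned entries
        name, _, version = line.partition("==")
        packages[name.lower()] = version
    return packages


def diff_freeze(old_text, new_text):
    """Return {name: (old_version, new_version)} for every difference."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes
```

A package present in only one venv shows up with `None` on the other side, which makes missing or extra dependencies as visible as version bumps.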
jrosseri applied the two patches for exception handling as well https://github.com/bbc/ironic/commits/bbc-yoga-25.0.021:26
JayFack; good stuff21:27
