Thursday, 2022-12-08

vanou	good morning ironic	02:28
janders	ajya I have a question w/r/t https://review.opendev.org/c/openstack/sushy/+/865388 (SettingsURI boot attrs fix). From which iDRAC version does this become applicable to Dell machines? I think when we were collaborating on this change we were working against the behaviour of a future version. If my understanding is correct, I'd be keen to do some	06:50
janders	testing against this version as soon as it's out so that we can validate the fix again and iron out any issues before users hit them. Good idea?	06:50
vanou	TheJulia: Regarding your concerns on https://review.opendev.org/c/openstack/ironic/+/865074. (periodic task): I think another approach is to implement a vendor passthru method through which the user explicitly triggers fetching and storing of the firmware version. (packaging module dependency): Can I implement a version comparison function in the irmc driver code?	07:05
vanou	Hi janders	07:05
ajya	Hi janders, it should be the next version, usually released in December. So it could be soon.	07:18
janders	vanou hello! :)	09:05
janders	ajya ACK and thank you!	09:05
kubajj	Morning everyone	09:26
rpittau	good morning ironic! o/	09:33
rpittau	TheJulia: re: 866780 thanks, I'll have a look ASAP	09:34
opendevreview	Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972	10:01
opendevreview	Riccardo Pittau proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866780	10:02
opendevreview	Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972	10:08
opendevreview	Riccardo Pittau proposed openstack/ironic bugfix/20.2: Align iRMC driver with Ironic's default boot_mode https://review.opendev.org/c/openstack/ironic/+/866780	10:09
opendevreview	Kirill proposed openstack/ironic-specs master: new spec: support of vnc console. https://review.opendev.org/c/openstack/ironic-specs/+/866537	10:32
opendevreview	Kirill proposed openstack/ironic-specs master: new spec: support of vnc console. https://review.opendev.org/c/openstack/ironic-specs/+/866537	11:09
opendevreview	Dmitry Tantsur proposed openstack/sushy stable/zed: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866899	11:16
opendevreview	Dmitry Tantsur proposed openstack/sushy stable/yoga: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866901	11:17
opendevreview	Dmitry Tantsur proposed openstack/sushy stable/xena: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866903	11:17
opendevreview	Dmitry Tantsur proposed openstack/sushy stable/wallaby: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866904	11:17
vanou	Hi rpittau	12:27
opendevreview	Slawek Kaplonski proposed openstack/ironic master: [grenade] Explicitly enable Neutron ML2/OVS services in the CI job https://review.opendev.org/c/openstack/ironic/+/866993	13:53
rpittau	mmm lots of failures everywhere, anything broken at large scale?	14:04
rpittau	oh looks like a tox issue	14:04
opendevreview	Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972	14:37
opendevreview	Riccardo Pittau proposed openstack/ironic bugfix/20.2: All jobs should still run on focal https://review.opendev.org/c/openstack/ironic/+/866972	14:38
TheJulia	yeah... :(	15:51
opendevreview	Verification of a change to openstack/ironic master failed: Fix debug log message argument formatting https://review.opendev.org/c/openstack/ironic/+/866856	16:00
TheJulia	stevebaker[m]: so I took a look at the proliantutils change you -1'ed on master branch, issue is upper-constraints has not incorporated the latest sushy release, and nobody rechecked them even though they are approved. I've done so, so hopefully they should merge in the next 24 hours	16:14
opendevreview	Merged openstack/sushy stable/xena: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866903	16:42
opendevreview	Merged openstack/sushy stable/wallaby: Handle a different error code for missing TransferProtocolType https://review.opendev.org/c/openstack/sushy/+/866904	16:43
rpittau	good night! o/	17:03
jrosser	i am closer to getting a system deploying - any tips for this would be welcome https://pastebin.com/raw/2W4Xbtfh	17:04
TheJulia	jrosser: o/	18:05
jrosser	hello o/	18:05
TheJulia	reading the log	18:06
jrosser	it seems to lsblk a partition which fails	18:07
TheJulia	so the Mellanox failure is weird too	18:09
TheJulia	fwiw	18:09
jrosser	is there compatibility between my yoga controller and a newer ipa?	18:09
jrosser	i am wondering if i have a too new set of hardware for centos8 kernel	18:10
TheJulia	ewww,	18:10
TheJulia	nvme0n1p2 not a block device....	18:10
TheJulia	wow	18:11
jrosser	yes at that point i was wtf	18:11
TheJulia	... yeah	18:11
jrosser	but i did manage to put it into maintenance, log in and reproduce that	18:11
TheJulia	oh, really?!	18:11
TheJulia	so.... does the nvme device show a partition table, and what does the nvme tools say?	18:12
jrosser	you'll see there is a loooong time at the end of the log while i was poking around it	18:12
TheJulia	yeah	18:12
jrosser	i'll boot it again	18:13
jrosser	hmm right the node has been cleaned so the partitions are not there any more	18:22
jrosser	i do have the logs though from that deployment https://paste.opendev.org/show/bkRrxy1XiREu1f0KopWh/	18:25
jrosser	and now i've got the node back to the state of the log, and can reproduce it https://paste.opendev.org/show/b87JVhA7eVbdXGpsSE8L/	18:31
TheJulia	https://lists.zuul-ci.org/archives/list/zuul-announce@lists.zuul-ci.org/message/3NNATSUTSIGP5FE2MDY5X2KJ5X4NB4PT/ <-- JayF	18:35
TheJulia	jrosser: what if you reboot?	18:35
JayF	Yep; I'm aware. I helped edit that post a little bit :)	18:35
TheJulia	okay, cool	18:35
JayF	this is why our gate is busted	18:35
JayF	there's a follow on thing that broke tempest related to the tox 3/4 fix aiui	18:36
TheJulia	well it sounds like "we have just under 2 weeks otherwise we're SOL"	18:36
TheJulia	janders: so you've done a lot with NVMes, is anything jrosser is experiencing above ringing bells for you?	18:37
JayF	TheJulia: yep, pretty much, but the breakages are minor and are unlikely to impact us	18:38
JayF	TheJulia: I already did some of the prework (removing install_command)	18:39
TheJulia	and we then need to cherry-pick down ?	18:39
TheJulia	jrosser: it is super weird, to me, that you can lsblk the first partition, but not the second	18:40
TheJulia	I have seen something like this on a nvme device, but it had a super weird over-usb bridge and I kind of just dismissed it	18:40
JayF	TheJulia: maybe. We'll see. The install_command stuff wasn't impactful everywhere. i'm also OK with our answer being "pin tox to 3.x" on stable branches	18:40
TheJulia	seems reasonable to me	18:41
JayF	TheJulia: jrosser: This smells like a bad NVMe controller or bad partition table.	18:41
TheJulia	yeah	18:41
jrosser	this should be somehow regularised with cleaning?	18:41
JayF	TheJulia: jrosser: If the data on that disk is no good, I'd try a dd if=/dev/zero of=/dev/nvme0n1 count=1 bs=10M or something to kill the partition table	18:41
TheJulia	well, part of the reason why I'm wondering about what happens with a reboot	18:42
JayF	jrosser: you're 100% right that cleaning should handle it; we just have to suss out what exactly is the problem before we can make cleaning be OK with it	18:42
JayF	jrosser: TheJulia:yeah, I'd do a reboot first, then try zeroing that disk if it doesn't work	18:42
TheJulia	since the partition table caching code can be a little weird, maybe there is an edge case with NVMes	18:42
TheJulia	and by caching code, I mean in the kernel	18:42
JayF	yep that's a good thought	18:43
jrosser	ok so after a reboot	18:43
jrosser	[root@host-10-88-104-67 ~]# lsblk /dev/nvme0n1p2 --pairs --bytes --ascii --nodeps --output UUID,PARTUUID	18:43
jrosser	UUID="" PARTUUID="ea590a46-9f74-4cf0-8ecd-dd17e3a14bab"	18:43
TheJulia	okay	18:44
TheJulia	so... hmmmmm	18:44
TheJulia	this is a sign the table contents and the kernel are out of sync	18:44
jrosser	so previously it has `Image streamed onto device /dev/nvme0n1p2 in 11.578590631484985 seconds`	18:45
TheJulia	okay, that is freaky	18:45
TheJulia	is there a kpartx command after that?	18:45
jrosser	nothing about that in the log	18:46
TheJulia	I wonder if you got it back into the bad state, and what `kpartx -l /dev/nvme0n1` would yield	18:48
jrosser	i can do that	18:49
TheJulia	it is super super weird that it is able to stream the image, but then when you turn around things go kaboom	18:51
TheJulia	oh!	18:51
TheJulia	i wonder if it has a nested partition table	18:51
JayF	that's not a bad idea	18:52
jrosser	would that be if i've built the image wrong?	18:52
JayF	Potentially. You wanna find out?	18:53
JayF	What format is the image in?	18:53
JayF	I used to use `qemu-nbd` to mount up those images loopback, then you can inspect them	18:53
TheJulia	oh! 'is_whole_disk_image': False	18:53
TheJulia	so it thinks it is a partition image... but is it?!?	18:53
JayF	TheJulia: yeah, I assumed this was all partition image	18:53
JayF	we are coming to the same place from opposite directions lol	18:53
TheJulia	jrosser: do you have a diskimage-builder command you can share for the image then?	18:54
TheJulia	it might be that the image has a partition table, and it confuses the nvme device	18:54
JayF	it doesn't help that linux nvme names these days look like the output of `pwgen`	18:54
TheJulia	The config ironic gets has it thinking it is a partition image	18:54
jrosser	`disk-image-create ubuntu vm block-device-efi dhcp-all-interfaces -o my-image`	18:55
TheJulia	so that should be a whole disk image aiui	18:55
TheJulia	so in glance, remove the kernel_id and ramdisk_id fields from the glance image uuid, and redeploy	18:55
JayF	TheJulia: I wonder if we can detect that, in Ironic.	18:56
TheJulia	so, it is actually valid on non-nvmes	18:56
TheJulia	as... awful... as that seems	18:56
JayF	it does seem awful	18:56
TheJulia	NVMes have smarts in them that grok partition tables	18:56
JayF	I mean, you can do partitions on raid, partitions on lvm, partitions on partitions why the hell not :\|	18:56
TheJulia	... yeah	18:57
TheJulia	This is definitely FAQ worthy if it works	18:57
jrosser	i'm a bit confused about what you'd like me to try in glance	18:58
TheJulia	well, let's take a look at `openstack image show` output for the image id you're using	18:59
jrosser	https://paste.opendev.org/show/bD0pHFt1XMVTtQvKhEy8/	19:00
JayF	properties: img_type: partition	19:00
JayF	but there is no kernel/ramdisk set?	19:01
TheJulia	yeah, there are 2 ways	19:01
TheJulia	aiui	19:01
TheJulia	I think that needs to be whole-disk	19:01
TheJulia	checking	19:01
jrosser	oh i see	19:02
TheJulia	yeah, set that to "whole-disk" and ironic will treat it as a whole disk image	19:02
JayF	So in this case, the glance image is set to partition but has no kernel/ramdisk set	19:02
TheJulia	since you have no properties/ramdisk_id and no properties/kernel_id	19:03
TheJulia	yup	19:03
JayF	how would Ironic have resolved that in the partition-on-partition (with no nvme) case?	19:03
TheJulia	it would have just written it	19:03
JayF	Is that a valid use case?	19:03
TheJulia	it might not have worked	19:03
JayF	I'm trying to figure out if this is a case we should check for	19:03
TheJulia	valid in raid cases	19:03
JayF	ah	19:03
TheJulia	yeah :(	19:03
jrosser	so how this has come about is we have an x86 non uefi image built the same way, and that has worked on systems with ssd	19:04
jrosser	but that could be accidental	19:04
opendevreview	Merged openstack/ironic stable/ussuri: CI: Pin ussuri to use only basic standalone tests https://review.opendev.org/c/openstack/ironic/+/861305	19:08
TheJulia	yeah, nvme's actually abstract the partitioning table	19:11
TheJulia	so it can store things efficiently	19:11
TheJulia	at least, as I understand it	19:11
TheJulia	SSDs don't think too hard about their contents	19:12
jrosser	i think you have to unset img_type	19:21
jrosser	oh maybe not, the options parsing is complicated :)	19:22
jrosser	TheJulia: JayF: it has worked! thank you so much!	19:41
JayF	\o/	19:42
TheJulia	woot!	19:51
TheJulia	jrosser: you're welcome to start an edit to our docs :)	19:51
JayF	did the cleaning-states-issue just go kaput?	19:51
JayF	was the environment rebuilt? did a conductor restart fix it?	19:51
TheJulia	++	19:52
* JayF really wants to get those fixes landed and backported because the failure mode is so nasty		19:52
TheJulia	yes	19:52
opendevreview	Julia Kreger proposed openstack/ironic-specs master: Add a shard key https://review.opendev.org/c/openstack/ironic-specs/+/861803	20:42
jrosser	JayF: my troubles yesterday were fixed by removing my ironic venvs and redeploying them completely	21:22
JayF	Julia and I were bandying around an idea that a conductor restart might have fixed it	21:23
JayF	did you, at any point in the troubleshooting before doing the rebuild, reproduce the error after a conductor restart?	21:23
JayF	if not, it's fine, just making sure we squeeze every last drop of info out :P	21:23
jrosser	i was restarting one of the conductors, the one responsible for the node i am testing	21:23
jrosser	but the rest were left alone (it's 3-way HA)	21:24
JayF	Useful information.	21:24
JayF	Can you, for fun, just do a pip freeze in the working venv?	21:24
JayF	so we can diff with the other one you posted? See if there is a code difference	21:24
JayF	or if you just scared a gremlin away	21:24
jrosser	i can do that tomorrow, as yes that would be very interesting	21:25
jrosser	i applied the two patches for exception handling as well https://github.com/bbc/ironic/commits/bbc-yoga-25.0.0	21:26
JayF	ack; good stuff	21:27
