Monday, 2023-03-06

opendevreviewMerged openstack/sushy master: Update master for stable/2023.1  https://review.opendev.org/c/openstack/sushy/+/87615603:51
rpittaugood morning ironic! o/09:15
dtantsurJayF, jlvillal, the IPA example in bifrost does look questionable. There does not seem to be support for per-node kernel/ramdisk, it's a global variable.09:18
espenflHi there. Upon using cloud-init (from the configdrive) for setting up networks, SSH keys etc. I wonder if it in Bifrost now is possible to set `user-data` in JSON style instead of a string? Seems Ironic added support for this (https://opendev.org/openstack/ironic/commit/3e1e0c9d5e6e5753d0fe2529901986ea36934971 etc.).10:37
dtantsurespenfl: no, but there is https://opendev.org/openstack/bifrost/commit/bf1bb49c38c26c4b88e330d3f6f9ccdc8419a1f510:56
espenfldtantsur: That is awesome. Thanks. Also noticed from your writeup: https://owlet.today/posts/ephemeral-workloads-with-ironic/#id9 that `network_data` should work already. With that and the SSH keys to the `default` user we are ready to go. Testing it now. Thanks for the nice writeup.11:03
opendevreviewDmitry Tantsur proposed openstack/ironic bugfix/21.2: Fixes for tox 4.0  https://review.opendev.org/c/openstack/ironic/+/87640911:04
opendevreviewDmitry Tantsur proposed openstack/ironic bugfix/21.2: Configure CI for bugfix/21.2  https://review.opendev.org/c/openstack/ironic/+/87641011:04
opendevreviewRiccardo Pittau proposed openstack/bifrost master: Fix enabling epel repo for rpm distributions  https://review.opendev.org/c/openstack/bifrost/+/87592911:06
opendevreviewRiccardo Pittau proposed openstack/ironic master: Add a non-voting metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387311:15
opendevreviewRiccardo Pittau proposed openstack/ironic master: Add a non-voting metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387311:16
kaloyankHello everyone, consider the following scenario. There are a couple of machines that I want to deploy using their local disks. I'm using Ironic in conjunction with Nova, Neutron and Cinder. In general, everything goes smoothly and Ironic deploys the machines as expected.11:55
kubajjHello everyone 11:56
kaloyankI then turn on fast-track deployment for these machines, destroy their respective Nova instances. Ironic cleans up these machines and leaves them powered on. I then create the same Nova instances. However, looking at their IPMI remote console, I don't see the IPA doing anything.11:57
kaloyankAfter a while, the machine is rebooted. However, it doesn't PXE-boot iPXE and ends up in the EFI shell.11:58
dtantsurkaloyank: fast-track with nova/neutron is risky. if networking parameters change between 2 instances, what happens?11:58
dtantsurotherwise, it's hard to debug from textual message. you need to check the logs (DHCP, ironic-conductor, maybe even tcpdump)11:58
kaloyankI haven't tried changing, I'd like to share something more :)11:58
dtantsurdo you use the same ports for the instance? otherwise, the IP address may change.11:59
kaloyankI do, let me share some more information so you can have better context11:59
kaloyankI checked the ironic-conductors logs and found out the following sequence of events:12:00
opendevreviewVanou Ishii proposed openstack/ironic stable/zed: [iRMC] Handle IPMI incompatibility in iRMC S6 2.x  https://review.opendev.org/c/openstack/ironic/+/87088112:00
kaloyank1. After cleaning a machine, the IPA sends heartbeats and the ironic-conductors processes them as expected.12:01
kaloyank2. When an instance is spawned, the ironic-conductor starts downloading the image from Glance, so it can serve it from its HTTP server to the IPA12:01
kaloyank3. In the meantime, the IPA sends more heartbeats (two to be precise) but the ironic-conductor refuses to register them, as the code requires an exclusive instance lock. However, such a lock is already created for the node when the deployment process started12:02
kaloyank4. After the fast_track_timeout expires, the ironic-conductor considers the IPA dead and reboots the node12:03
dtantsurthe "requires an exclusive instance lock" is probably a red herring hiding the actual failure12:04
kaloyank5. However, the created neutron port lacks the extra DHCP options to perform the PXE chainloading of iPXE12:04
kaloyankwell, I can share the logs and the exact line of code12:04
dtantsuraha, #5 is probably a bug, I guess we ignore DHCP options on fast-track and then forget to add them if fast-track fails12:05
kaloyankI presumed the same as it seemed quite out of order, anyway logs incoming, shall I pastebin them?12:06
dtantsuryep, although I cannot promise to dive into them too deeply12:07
kaloyankhttps://github.com/openstack/ironic/blob/master/ironic/drivers/modules/agent_base.py#L583 where the shared lock is elevated to exclusive but fails (an exclusive lock already exists)12:08
dtantsuryeah, but the exclusive one should disappear soon12:10
dtantsurif it does not, you need to track what is holding the lock for so long (stuck image download, etc)12:10
kaloyankthat was my other question, the image is quite large, ~10GB, with a smaller image it's not a problem since the fast_track timeout hasn't expired12:11
kaloyankso is this behavior expected and do I need to work around it somehow?12:11
dtantsurkaloyank: if you don't see any abnormalities in the logs, you may try to play with the value of fast_track_timeout12:16
dtantsurunder #5 is quite likely a bug that needs fixing12:16
kaloyankit's already at its maximum (5mins), the image just downloads and verifies for a lot more12:16
dtantsurI'm frankly not sure why fast_track_timeout has an upper limit.. it has to be lower than ramdisk_heartbeat_timeout, but even that can be changed12:18
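A minimal ironic.conf sketch of the two options under discussion; the [deploy] group name and the 300-second cap are assumptions based on this thread, so check the configuration reference for your release:
    [deploy]
    fast_track = true
    # capped at 300 seconds here; must stay below the ramdisk heartbeat timeout
    fast_track_timeout = 300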
dtantsurbut that's a curious issue you describe. indeed, we cannot register a heartbeat while the lock is actively held..12:18
kaloyankI presume because the code updates important fields, still I tried bypassing the code in the exception handler but nothing changed in behavior12:19
dtantsurit could be interesting to release the lock while we're downloading/converting the image. The question is: what to do if we cannot re-acquire the lock afterwards.12:20
dtantsurOR allow heartbeats to be registered when the lock is held. but that requires changing database.12:21
kaloyankcan the image be downloaded without an exclusive lock?12:22
dtantsurkaloyank: a workaround may be to disable image conversion on the conductor side12:22
dtantsurkaloyank: that's my first idea. the problem is: what to do if you can no longer acquire the lock after the image is downloaded.12:22
kaloyankI use raw images only, that's why they're so big12:23
dtantsurah, so there is no conversion, it's just downloaded for 5 minutes?12:23
kaloyankyes, it's really big, I have a smaller one and it deploys without a problem12:23
dtantsurif you use swift as a glance backend, you can make ironic serve a URL to swift without local caching12:24
kaloyankI don't have swift :/12:24
kaloyankas I store the images in Cinder (Glance Cinder store) I was thinking of attaching the volume to the ironic conductor and streaming the block device. I know that no such code exists but if there are no other simpler options, I'd opt for writing the code12:25
dtantsurcc TheJulia JayF (once you wake up) for more ideas ^^12:31
opendevreviewDmitry Tantsur proposed openstack/ironic bugfix/21.2: Update .gitreview for bugfix/21.2  https://review.opendev.org/c/openstack/ironic/+/86782612:36
kaloyanklogs were rotated, I think the issue was made clear, do you still need them?12:37
kaloyankjftr, I'm running the latest Yoga release from RDO12:45
mnasiadkaHello12:58
mnasiadkaDuring an upgrade from Wallaby to Xena, while running online_data_migrations, I get "Error while running update_to_latest_versions: 'BIOSSetting' object has no attribute 'id'." - help appreciated ;-)12:59
jssfrhow do I reset a node which is in state `deploy failed` or `wait call-back` state?13:14
jssfrby reset, I mean bring it into a clean/available state via cleaning13:14
jssfrabort, provide, and manage all give state errors13:14
jssfrah, undeploy!13:16
jssfrthanks documentation!13:16
kaloyankor.. baremetal node manage <node-name>; baremetal node provide <node-name>13:23
rpittaulooking for reviews and approval for 2 fixes in bifrost CI https://review.opendev.org/c/openstack/bifrost/+/872634 and https://review.opendev.org/c/openstack/bifrost/+/875929 thanks!13:31
jssfrokay, different issue now13:55
jssfrWe are trying to deploy an image which contains an mdraid13:55
jssfrusing the IPA, the image gets written to the disk successfully, but then the IPA re-reads the mdraid and then attempts to install grub13:55
jssfrwhich fails, because:13:55
jssfrMar 06 13:03:16 ubuntu ironic-python-agent[2384]: 2023-03-06 13:03:16.645 2384 ERROR root [-] Command failed: install_bootloader, error: Error finding the disk or partition device to deploy the image onto: No EFI partition could be detected on device /dev/md127 and EFI partition UUID has not been recorded during deployment (which is often the case for whole disk images). Are you using a UEFI-compatible image?: ironic_python_agent.errors.DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No EFI partition could be detected on device /dev/md127 and EFI partition UUID has not been recorded during deployment (which is often the case for whole disk images). Are you using a UEFI-compatible image?13:55
jssfris there a sane way to get disk images which include an mdraid deployed?13:56
kaloyankWell, can you verify that the image has an EFI system partition? What does its partition table look like?13:56
jssfr(the EFI partition is outside the mdraid)13:56
jssfrkaloyank, https://paste.ubuntu.com/p/KkfYgqDmVj/13:57
jssfrOARrr13:57
jssfrwhy the heck does that need a login13:57
jssfrlet me find a different paste, I was assuming pastebinit to be sane13:57
jssfrhttps://paste.debian.net/hidden/0f5930b1/13:58
jssfrkaloyank, ^ that's the partition table of the image13:58
jssfrp3 is the RAID, p1 is the EFI system partition.13:58
jssfr(this is from the box where we created the image, hence /dev/loopX)13:59
kaloyankjssfr: I'm looking at this: https://github.com/openstack/ironic-python-agent/blob/a1670753a23a79b6536f67eae9cca154e0ed2e65/ironic_python_agent/extensions/image.py#L695 which calls this: https://github.com/openstack/ironic-python-agent/blob/fcb65cae18f4a6b4b05fb70677e2fa114e0558a9/ironic_python_agent/hardware.py#L118214:05
kaloyankdid you specify a root_device_hint ?14:06
jssfrnope, looking into that now, thanks!14:28
jssfrkaloyank, do you happen to have a documentation link for root_device_hint? I can only find documentation from pike :/14:30
kaloyankjssfr: https://docs.openstack.org/ironic/pike/install/advanced.html#specifying-the-disk-for-deployment-root-device-hints14:30
jssfrthat's the pike stuff I found14:31
kaloyankThat's what you need, it's called root_device in Pike14:31
jssfrso something like {"name": "/dev/sda"} as root_device_hint?14:32
jssfrand that's still a property, not a driver_info or something?14:32
kaloyankYes, it's a property14:33
jssfrack, thanks14:33
jssfr(the code reads as if it's still `root_device` though, not root_device_hint?)14:33
kaloyank"That's what you need, it's called root_device in Pike"14:33
kaloyankjssfr: JFTR, Pike is not supported, there might be bugs related to your case that are fixed in more recent releases, it'd be really nice if you can run a supported release14:35
jssfrwe are on victoria14:35
jssfrthat's why I was asking for newer documentation14:36
kaloyankthen why were you asking about Pike?14:36
jssfrah, misunderstanding :)14:36
jssfrwhat I meant to say was "all documentation I found was for Pike"14:36
jssfr"… which is not what I wanted, I wanted something more recen"14:36
jssfrI see how that wasn't clear, sorry14:36
kaloyankhttps://docs.openstack.org/ironic/victoria/install/advanced.html#specifying-the-disk-for-deployment-root-device-hints14:36
jssfrthanks a lot14:36
kaloyankthis is the appropriate link then14:36
iurygregorygood morning Ironic14:56
TheJuliagood morning14:57
TheJuliakaloyank: so... I guess that is doable with appropriate code, the question is going to be "how to get application credentials for volume access down to the agent". We've generally discussed it before and have a general idea, but I don't think we've implemented any such credential path.15:01
dtantsurthe initial problem is also curious. fast-track can break if the image is too large. which is definitely not the desired behavior.15:01
TheJuliaI'm not quite sure if it is still heartbeating15:02
JayFgood morning15:02
dtantsurwhy not?15:02
JayF#startmeeting ironic15:02
opendevmeetMeeting started Mon Mar  6 15:02:33 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.15:02
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:02
opendevmeetThe meeting name has been set to 'ironic'15:02
TheJuliabut, yes, it is a can of worms15:02
iurygregoryo/15:02
dtantsurtrue15:02
dtantsurhi folks o/15:02
TheJuliadtantsur: window is against the heartbeat operation still occurring15:02
matfechnero/15:02
JayFShould be a pretty quick one this morning; I'll try to fly through15:02
TheJuliao/15:03
JayF#topic Announcements/Reminder15:03
JayFAs always, tag your patches with ironic-week-prio to get them reviewed15:03
JayFAnd we are counting down, focusing on any outstanding reviews would be good; I haven't looked but will make a point to check today15:03
kaloyankTheJulia: what do you mean by application credentials exactly? How Ironic authenticates to the storage backend or something else?15:03
rpittauo/15:03
JayF#topic Review Actions from last meeting15:03
TheJuliakaloyank: lets revisit once the irc meeting is over :)15:04
kaloyanksure 15:04
JayFkaloyank: I'll be fast :D 15:04
JayFokay, action items, we had 315:04
JayFFirst one is tied to a general announcement; I will be out Weds-Mon at a conference with limited availability. If you need me urgently, email is best (anything-but-IRC).15:05
JayFAnd iurygregory will be running the meeting next week in my stead15:05
JayFI have an untaken action I'll renew15:05
JayF#action JayF To look into VMT for Ironic15:05
JayFThat's the actions from last week15:05
JayFmoving on15:05
JayF#topic CI Status 15:05
JayFIs there anything notable about CI from the last 7 days that people should know?15:05
rpittauJayF: just 2 minor fixes for bifrost CI15:06
rpittauthey're in the prio list15:06
JayFThanks for that \o/15:06
JayFI'll look this morning15:06
TheJuliamnasiadka: uhh... wow, that shouldn't break that way. Have you checked oslo.db is appropriate based upon requirements.txt and the xena upper-constraints.txt file from the openstack/requirements repo branch stable/xena ?15:06
rpittauthanks15:06
JayF#topic VirtualPDU update15:06
iurygregoryI saw some of the jobs in networking-generic-switch failing (but didn't really look to find the root cause)15:07
JayFSo, thanks to rpittau, fungi and their work, VirtualPDU has us as cores now15:07
rpittau\o/15:07
mnasiadkaTheJulia: that's a Kolla container, I can't assume it's different but let me check :)15:07
JayFadditionally, I proposed + landed governance change to put it officially under Ironic's program15:07
iurygregorytks everyone!15:07
JayFSo VirtualPDU is adopted. There is a further step; moving the repo from x/virtualpdu -> openstack/virtualpdu; but that requires gerrit downtime and will not happen until after the release sometime15:08
jssfrkaloyank, specifying root_device helped, thanks!15:08
rpittauJayF: what do we want to do for a possible release ?15:08
rpittaufor virtualpdu15:08
TheJuliamnasiadka: the issue is it has been there for 6 years, something is *very* wrong.15:08
JayFrpittau: cut a manual one in the meantime would be my suggestion?15:08
rpittaummm yeah15:08
rpittauok15:08
JayFI would assume release-management of virtualpdu won't be possible until after antelope is done15:08
JayFso if we need a virtualpdu release before that; manual is the way15:08
rpittauI'll look into that15:09
JayFThanks for that \o/15:09
JayF#topic Release Countdown15:09
JayFcycle highlights have landed15:09
JayFprobably too late to land any changes of any bulk15:09
JayFbut if you have something you want in antelope, now is the time to make noise and get it reviewed15:10
JayF#link https://review.opendev.org/c/openstack/ironic/+/876195 release notes prelude15:10
JayFI will revise this today and push to get it merged; if you have opinions on it post them soon :)15:10
mnasiadkaTheJulia: oslo.db==11.0.0, and yoga upper-constraints.txt says 11.2.0, yes, something is wrong :)15:10
JayFanything else for release before I move on?15:10
JayF#topic Open Discussion15:11
JayFAs said earlier, I will be unavailable from Weds thru Mon traveling to and participating in SCALE 20x.15:11
TheJuliamnasiadka: so that should be xena15:11
JayFIf you need me urgently for anything, I think most of you have my email. That'll be checked much more often than IRC15:11
JayFanything else for open discussion15:11
mnasiadkaTheJulia: sorry, mixed it up - yes, it's according to xena's u-c15:12
TheJuliaI'll also be at SCALE 20x later this week, just fyi.15:12
mnasiadkaTheJulia: what else can be wrong?15:12
TheJuliamnasiadka: i guess step 0 is going to be open the database, ensure the column is there15:12
JayFI'm calling it15:13
JayF#endmeeting15:13
opendevmeetMeeting ended Mon Mar  6 15:13:16 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:13
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.html15:13
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.txt15:13
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.log.html15:13
TheJuliaheh15:13
TheJuliakaloyank: o/ So I meant credentials for how ironic-python-agent connects to Cinder to access the image, that is unless you want Ironic's conductor to handle downloading and packing the image locally. 15:14
TheJuliakaloyank: I *thought* cinder had a capability to do that though, now that I think about it.15:14
mnasiadkaTheJulia: I was upgrading from 17.1.1.dev7, there is no 'id' column in 'bios_settings' (only 'node_id')15:14
mnasiadkaTheJulia: https://paste.openstack.org/show/b0LjGZNV0fwszRgiOdJL/15:14
kaloyankTheJulia: I was thinking of making the ironic-conductor process attach a Cinder volume to itself and serve its contents15:15
kaloyankimplementing this in the IPA will be harder, though more efficient in terms of the end result15:15
TheJuliamnasiadka: wow, that is a bug then15:16
TheJuliamnasiadka: can you get a full stacktrace for us? I'm really surprised we didn't find this in CI over the years15:16
TheJuliaI was thinking id would be coming from the underlying base model, but not in that table, node_id is primary key15:16
mnasiadkaTheJulia: sure, is there a guide somewhere so I don't miss anything?15:17
TheJuliamnasiadka: an upgrade guide? or on getting a stacktrace from the logs? I'm confused by the question, sorry15:17
mnasiadkaTheJulia: stacktrace :)15:17
TheJuliakaloyank: I still could have sworn cinder was supposed to have an extract capability :)15:18
TheJuliamnasiadka: no, as much context around the error as possible, typically the ~20 lines before is most of the stack trace which would occur15:18
kaloyankTheJulia: by extract you mean point it at a path and let it dump a volume to a file?16:18
TheJuliakaloyank: yeah15:19
TheJuliakaloyank: I might be totally off base, and maybe this is not universally supported, but I thought you could save a cinder volume to a glance image in some way15:19
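Cinder does have an upload-to-image call that roughly matches this recollection; a sketch with the classic cinderclient (flag support and availability may vary by release):
    cinder upload-to-image <volume-id> <image-name> --disk-format raw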
mnasiadkaTheJulia: https://storyboard.openstack.org/#!/story/2010632 - hope that's enough (in the comment)15:20
dtantsursome code reshuffling if anyone has time: https://review.opendev.org/c/openstack/ironic/+/874677 https://review.opendev.org/c/openstack/ironic/+/87591515:20
kaloyankTheJulia: You absolutely can make an image out of a volume but how does this help the fact that the ironic-conductor still has to download the image from Glance15:20
TheJuliaoh, i bet the batching doesn't grok there being no id field15:21
TheJuliakaloyank: so the underlying issue is that a cinder volume may be any storage technology, and the ironic-conductor may not be able to access it15:22
TheJuliaan mvp might be, get it packaged out to glance, download it, and then work on improving the flow for greater efficiency15:22
kaloyankTheJulia: I still don't see how that resolves the issue :/15:24
kaloyankright now, Glance attaches the volume, copies the data to a temporary file, serves the image from the file and after that deletes it15:25
kaloyankThe load of the process grows exponentially because Ironic doesn't allow baremetal nodes to share images15:25
kaloyankI'm looking towards eliminating the extra copying from the volume to a temporary file at Glance to a temporary file at the ironic-conductor host15:27
TheJuliaeww, so it doesn't store it, so it would rinse/repeat with each deployment15:27
kaloyankexactly15:28
TheJuliakaloyank: would you be able to trigger online-data-migrations without the batching functionality, or is your environment too big?15:30
kaloyankTheJulia: I don't know what you mean by online-data-migrations15:32
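For context, online data migrations are the ironic-dbsync step run during upgrades; a rough sketch of the invocation, where the batch-size flag is the "batching functionality" mentioned above (support may vary by release):
    # migrate everything in one pass
    ironic-dbsync online_data_migrations
    # or limit each pass to N rows
    ironic-dbsync online_data_migrations --max-count 1000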
mnasiadkaTheJulia: anything I can do in the meantime? I was supposed to upgrade to Yoga by tomorrow - should we leave it at Xena level and wait for a patch?15:45
dtantsurkaloyank: a correction: each glance image should be downloaded only once15:47
kaloyankdtantsur: I believe you, still I can't explain to myself why it took 252 secs (per logs) to download and verify an image when spawning 4 instances simultaneously while it took 82 secs for the same image when spawning a single instance :(15:55
dtantsurhmm15:55
dtantsurkaloyank: the image cache should log its decisions in DEBUG level15:55
kaloyankI know, that's where I got those numbers from, but the logs got rotated and I haven't setup a debug file for journald and now all the HW is occupied by another team15:57
dtantsurthat's a pity. even when all 4 images are downloaded from scratch, the downloads should happen in parallel15:59
opendevreviewDmitry Tantsur proposed openstack/ironic master: [WIP] Migrate the inspector's /continue API  https://review.opendev.org/c/openstack/ironic/+/87594416:01
kaloyankdtantsur: OK, managed to snatch one, brb16:02
espenflis there a way to inspect the configdrive post deployment?16:11
samuelkunkel[m]Normally it's a read-only blockdevice (assuming regular iso 9660). If you check via lsblk you should see a block device16:11
samuelkunkel[m]if you mount this, you can read into the config drive16:11
samuelkunkel[m]But you need access to the OS for this16:11
espenflah, okay, this is the problem: trying to diagnose the setting of network, default password and SSH keys. Is there a way to see it on the Ironic side? Or regenerate it in a way so that I can inspect it based on what I have in my inventory (Bifrost, putting it into instance_info)?16:13
kaloyankdtantsur: I confirm the image is downloaded only once16:20
opendevreviewJay Faulkner proposed openstack/ironic master: Add prelude for OpenStack 2023.1 Ironic release  https://review.opendev.org/c/openstack/ironic/+/87619516:27
JayFWe need to land ^^ that very soon, if folks wanna review the changes I made for rpittau 16:27
kaloyankdtantsur: So I think I found the culprit and this is the image SHA verification: `Computed sha512 checksum for image /var/lib/ironic/images/2c2f6c92-1595-47e5-8a75-816dce4af0ea/disk in 657.11 seconds`16:31
TheJulia+2'ed16:32
TheJuliakaloyank: yeouch16:32
TheJuliahow big is that image?16:32
kaloyankTheJulia: ~10GB16:32
TheJulia.... it shouldn't be that slow... :\16:32
kaloyankI suppose so, I see that Ironic uses the python hashlib, shouldn't it be using, say sha256sum ?16:33
* TheJulia wonders if we still need conductor side checksum operation on images16:33
TheJuliakaloyank: I think it might be attempting to match the glance record to verify what was generated/stored16:33
TheJuliaos_image_algo in the glance image properties ... I think16:34
kaloyankTheJulia: yeah, I was thinking the same 16:34
TheJuliasince we don't have iscsi deploy, doing the checksum locally might be redundant16:34
TheJuliawell16:34
TheJuliadepends on if it is a remote deployment or a local boot I guess16:35
JayFIt's also the only place we can act on it16:35
JayFIPA finding a bad checksum has no option but to fail16:35
TheJuliaif is_ramdisk_deploy, then run_checksum_check()16:35
JayFconductor finding one could potentially retry or give a better error16:35
TheJuliaJayF: it is fatal to begin with even if it is found by the conductor16:36
TheJuliatrue16:36
JayFyeah I figured we didn't do anything with it today16:36
TheJuliaAlthough I think we carry the error back from the agent16:36
kaloyankI have to admit, this function doesn't look good: https://github.com/openstack/oslo.utils/blob/master/oslo_utils/fileutils.py#L11216:38
JayFkaloyank: what approach would you take instead?16:38
JayFthat looks more or less like a standard python hash-checking function16:39
JayFthe time.sleep(0) is because we use eventlet16:39
kaloyankJayF: indeed it does, but the standard sha512sum tool on the same file completed in ~40s16:40
kaloyankI'd replace it with that, or at least increase the chunksize (it's 64K currently)16:41
TheJuliathe issue is without time.sleep(0), you might end up blocking the rest of the app for ~40s16:41
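A self-contained sketch of the pattern being discussed: hash the file in fixed-size chunks and yield to other green threads between reads. This mirrors the linked oslo.utils helper but is not its actual code; the 64K chunk size is the value mentioned above.
    import hashlib
    import time

    def compute_file_checksum(path, algorithm='sha512', chunk_size=64 * 1024):
        # With eventlet monkey-patching, time.sleep(0) hands control back to
        # the hub so a long checksum does not starve other green threads;
        # without eventlet it is effectively a no-op.
        hasher = hashlib.new(algorithm)
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                hasher.update(chunk)
                time.sleep(0)  # cooperative yield point
        return hasher.hexdigest()

    # e.g. compute_file_checksum('/var/lib/ironic/images/<uuid>/disk')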
JayF... is there a chance something else is blocking significantly so that hash is giving up too much time to other stuff16:41
TheJulia^^^^ that16:41
JayFdebugging that is going to be ridic16:41
kaloyankI gathered some stats for this host, any clues?16:43
TheJuliaI think the question might be what else was the conductor doing16:45
TheJulia... but my underlying thought is "do we still need it"16:45
dtantsurTheJulia: we need to recalculate checksums for raw images after conversion16:48
dtantsuralthough it's possible that in this case we don't check if the images are initially raw16:48
TheJuliaoh yeah, this could be unpacked to something absurd16:49
kaloyankwill spawning more workers help?16:49
TheJuliakaloyank: no, what type of image is this?16:49
dtantsurkaloyank: how much does sha512sum (the tool) take?16:49
JayF> kaloyank | JayF: indeed it does, but the standard sha512sum tool on the same file completed for ~ 40s16:50
dtantsurah, missed that. damn, yeah.16:50
kaloyankTheJulia: it's a raw image16:50
dtantsurTheJulia: re blocking the app: I'd expect eventlet to yield to other threads when waiting for a command?16:51
dtantsurotherwise we wouldn't be able to use ipmitool :)16:51
JayFthere's no chance to yield, generally no blocking on that hashing activity16:51
JayFI was thinking the opposite: there's a competing thread that needs to yield but isn't16:51
dtantsurhttps://github.com/eventlet/eventlet/blob/master/eventlet/green/subprocess.py#L9116:52
dtantsursubprocess works as a loop with sleep in eventlet16:53
JayFcould eventlet.sleep vs time.sleep be an impact there?16:53
dtantsurunlikely; time.sleep should be monkey-patched16:53
dtantsuralso, the native time.sleep(0) is basically no-op16:53
JayFhttps://github.com/openstack/ironic/blob/master/ironic/cmd/__init__.py#L26 yeah just checking, only thing we exclude is os16:55
TheJuliahmm16:56
kaloyankjftr, the time it takes for a checksum to be calculated seems to scale with the number of machines currently being deployed. For a single machine with the same image the checksum was calculated in ~80s, which correlates with the results of sha512sum 16:57
JayFkaloyank: provisioning from the same conductor group, yeah?16:58
dtantsurwell, a CPU-intensive task and green threads.....16:58
kaloyankJayF: yes, I have only one ironic-conductor16:59
JayFSo that's what's happening. We're running N greenthreads for N deployments16:59
JayFall trying to checksum the same image16:59
JayFyeah?16:59
kaloyankyes16:59
TheJuliahas anyone looked to see if hashlib is doing its work in python, or is it a compiled module from C?16:59
dtantsurbasically16:59
dtantsurI *think* hashlib is in C16:59
dtantsurwhich does not mean it releases gil16:59
dtantsurI wonder if it helps to crank up read_chunksize significantly 17:00
dtantsur.. insignificantly in my testing17:01
kaloyankif there's an image cache, why does the checksum have to be verified for each instance and not once per image?17:02
dtantsurkaloyank: the cache does not contain the checksum17:03
dtantsurwhat IS a good question is why we recalculate it at all, given that your images are already raw17:03
dtantsur(in my testing, using eventlet.green.subprocess is comparable with the oslo's function)17:04
rpittaugood night! o/17:04
kaloyankdtantsur: well, besides someone tampering with the images I can't think of any other reason17:06
dtantsurlet me prototype something here17:07
TheJuliadtantsur: why I think is the most excellent question atm17:08
kaloyankI'm off for today, thanks for the help and support, bye! 0/17:08
opendevreviewDmitry Tantsur proposed openstack/ironic master: [WIP] Do not recalculate checksum if disk_format is not changed  https://review.opendev.org/c/openstack/ironic/+/87659517:15
dtantsurTheJulia, kaloyank maybe ^^^17:15
dtantsurthis won't solve the "checksumming is slow" problem17:15
dtantsurbut at least kaloyank will be happy (if it works)17:16
dtantsurI suspect that outside of the raw->raw context, the slowness is not such a huge deal17:16
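The WIP change itself is not quoted here; as a rough illustration of the guard being described, with hypothetical names that are not the actual patch:
    def checksum_needs_recalculation(source_format, target_format):
        # Only recompute when the conductor actually converted the image
        # (e.g. qcow2 -> raw); for raw -> raw the checksum from Glance
        # still matches the bytes on disk.
        return source_format != target_format

    assert checksum_needs_recalculation('raw', 'raw') is False
    assert checksum_needs_recalculation('qcow2', 'raw') is True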
jlvillalI think this might be wrong, as x86 and aarch64 all get put into the same bucket. At least for me I was sending ipxe-x86_64.efi to my ARM64 client.  https://github.com/openstack/bifrost/blob/0e6be25ee17ea75d60eb4f32fea37db0f79af52d/playbooks/roles/bifrost-ironic-install/templates/dnsmasq.conf.j2#L91-L9317:31
jlvillal11 seems to be the client-arch code for aarch64.17:32
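For reference, the kind of per-architecture dispatch dnsmasq supports, as a sketch rather than the actual bifrost template; the tag and boot file names are placeholders, while 7/9/11 are the standard client-arch codes for EFI BC, EFI x86-64 and EFI ARM64:
    dhcp-match=set:efi-x86_64,option:client-arch,7
    dhcp-match=set:efi-x86_64,option:client-arch,9
    dhcp-match=set:efi-aarch64,option:client-arch,11
    dhcp-boot=tag:efi-x86_64,ipxe-x86_64.efi
    dhcp-boot=tag:efi-aarch64,ipxe-aarch64.efi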
TheJuliamnasiadka: so I think i figured out what the issue is18:30
TheJuliajlvillal: likely bifrost was never expected to have a mixed fleet18:30
opendevreviewVerification of a change to openstack/ironic master failed: Add prelude for OpenStack 2023.1 Ironic release  https://review.opendev.org/c/openstack/ironic/+/87619518:30
mnasiadkaTheJulia: fantastic, how dark is the alley I'm in? :)18:31
TheJuliamnasiadka: not a bad one18:33
TheJuliamnasiadka: I'm working on a patch now18:33
TheJuliaunfortunately I think this was not found before because we don't have data in the table in CI18:33
TheJuliabut it is a glaringly bad bug18:34
mnasiadkaTheJulia: that's what I deduced myself (the lack of data in CI)18:34
mnasiadkaTheJulia: in the meantime I was thinking of making a mysqldump of bios_settings table, going forward with the upgrade, and importing it back in after the upgrade - does it make any sense?18:35
TheJuliaeh... I'm not so sure about that18:35
JayFif the bug is as bad as it sounds it is, I bet we'll have a fixed stable/xena out for you in a few days18:36
JayFmaybe not released but in git for sure18:36
mnasiadkaKolla builds containers based on stable/xena and so on, we don't really follow releases - so no worries for that18:37
JayFWe are extremely on top of bug backports in Ironic, I think it's one of the things we do really well18:37
jrosserjlvillal: do you have ipxe_bootfile_name_by_arch set in the ironic [ipxe] config section?18:38
jlvillaljrosser, I do.18:38
mnasiadkaTheJulia: Ok then, I'll probably push off the upgrade to Yoga to next week - unless leaving Ironic Xena and upgrading Nova to Yoga is something that is supposed to work.18:40
jlvillalTheJulia, makes sense not expecting a mixed fleet. Thanks.18:41
* TheJulia gives cookies to tox18:47
TheJuliaI've got a unit test now which reproduces the error18:57
TheJuliathe bios setting model incremented from 1.0 to 1.1 too, so since you had data at version 1.018:58
opendevreviewMerged openstack/ironic master: Add prelude for OpenStack 2023.1 Ironic release  https://review.opendev.org/c/openstack/ironic/+/87619519:11
mnasiadkaTheJulia: nice, it's 8pm here, so if there's something I could try out in my morning - please leave me a message - thanks for the help :)19:14
TheJuliamnasiadka: absolutely19:23
TheJuliamnasiadka: goodnight :)19:23
*** shadower6 is now known as shadower19:39
opendevreviewJulia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits  https://review.opendev.org/c/openstack/ironic/+/87662619:40
TheJuliaJayF: I think ^^^ will take care of mnasiadka's issue19:41
JayFlooking19:41
dtantsurjlvillal: if you make bifrost multi-arch-aware, you'll be my hero :)19:49
dtantsurI think I already promised you a beer the other day?19:49
* dtantsur slowly blends into the background19:49
TheJuliaI feel like I'm going to need to have lots of beer on hand....19:54
jlvillaldtantsur, hehe :)19:57
opendevreviewJulia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits  https://review.opendev.org/c/openstack/ironic/+/87662620:35
TheJuliastevebaker[m]: https://review.opendev.org/q/I99a2d8ecab657c8e4c852c73e816a5a8f2856471 <-- since you asked me21:06
jlvillalSo when Ironic/Bifrost creates a configdrive-*.iso.gz file, is it not really gzipped and not really an ISO file? I notice the ones in /var/lib/ironic/httpboot/ are ASCII text. Looks like it might be base64 encoded.21:07
JayFTheJulia: stevebaker[m]: I just landed all the remaining oens21:09
stevebaker[m]\o/21:29
TheJuliajlvillal: decode them and they should be gzipped iso files21:39
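A quick way to check, assuming the file under /var/lib/ironic/httpboot is base64 text as described; file names and mount point are placeholders, and openstack/latest is the usual config drive layout:
    base64 -d configdrive-<node-uuid>.iso.gz | gunzip > /tmp/configdrive.img
    sudo mount -o loop,ro /tmp/configdrive.img /mnt
    ls /mnt/openstack/latest/   # user_data, network_data.json, meta_data.json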
opendevreviewJulia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits  https://review.opendev.org/c/openstack/ironic/+/87662621:56
TheJuliaJayF: typo fixed ^21:57
JayF+222:18
opendevreviewMerged openstack/metalsmith stable/yoga: Use a network cache in Instance  https://review.opendev.org/c/openstack/metalsmith/+/87377023:27
opendevreviewMerged openstack/metalsmith stable/xena: Use a network cache in Instance  https://review.opendev.org/c/openstack/metalsmith/+/87377123:27
opendevreviewMerged openstack/metalsmith stable/wallaby: Use a network cache in Instance  https://review.opendev.org/c/openstack/metalsmith/+/87377223:27
