opendevreview | Merged openstack/sushy master: Update master for stable/2023.1 https://review.opendev.org/c/openstack/sushy/+/876156 | 03:51 |
rpittau | good morning ironic! o/ | 09:15 |
dtantsur | JayF, jlvillal, the IPA example in bifrost does look questionable. There does not seem to be support for per-node kernel/ramdisk, it's a global variable. | 09:18 |
espenfl | Hi there. Upon using cloud-init (from the configdrive) for setting up networks, SSH keys etc. I wonder if it in Bifrost now is possible to set `user-data` in JSON style instead of a string? Seems Ironic added support for this (https://opendev.org/openstack/ironic/commit/3e1e0c9d5e6e5753d0fe2529901986ea36934971 etc.). | 10:37 |
dtantsur | espenfl: no, but there is https://opendev.org/openstack/bifrost/commit/bf1bb49c38c26c4b88e330d3f6f9ccdc8419a1f5 | 10:56 |
espenfl | dtantsur: That is awesome. Thanks. Also noticed from your writeup: https://owlet.today/posts/ephemeral-workloads-with-ironic/#id9 that `network_data` should work already. With that and the SSH keys to the `default` user we are ready to go. Testing it now. Thanks for the nice writeup. | 11:03 |
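For context, the Ironic-side support espenfl refers to lets the deploy request carry the configdrive as a JSON object instead of a pre-built ISO; roughly the shape below, as far as I recall (key names and values are illustrative and should be checked against the Ironic API reference for your release — and, as dtantsur notes, Bifrost exposure of this is a separate question):

```json
{
  "target": "active",
  "configdrive": {
    "meta_data": {"hostname": "node-0", "public_keys": {"default": "ssh-ed25519 AAAA... user@host"}},
    "network_data": {"links": [], "networks": [], "services": []},
    "user_data": {"example": "arbitrary JSON instead of a plain string"}
  }
}
```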
opendevreview | Dmitry Tantsur proposed openstack/ironic bugfix/21.2: Fixes for tox 4.0 https://review.opendev.org/c/openstack/ironic/+/876409 | 11:04 |
opendevreview | Dmitry Tantsur proposed openstack/ironic bugfix/21.2: Configure CI for bugfix/21.2 https://review.opendev.org/c/openstack/ironic/+/876410 | 11:04 |
opendevreview | Riccardo Pittau proposed openstack/bifrost master: Fix enabling epel repo for rpm distributions https://review.opendev.org/c/openstack/bifrost/+/875929 | 11:06 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add a non-voting metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 11:15 |
opendevreview | Riccardo Pittau proposed openstack/ironic master: Add a non-voting metal3 CI job https://review.opendev.org/c/openstack/ironic/+/863873 | 11:16 |
kaloyank | Hello everyone, consider the following scenario. There are a couple of machines that I want to deploy using their local disks. I'm using Ironic in conjunction with Nova, Neutron and Cinder. In general, everything goes smoothly and Ironic deploys the machines as expected. | 11:55 |
kubajj | Hello everyone | 11:56 |
kaloyank | I then turn on fast-track deployment for these machines, destroy their respective Nova instances. Ironic cleans up these machines and leaves them powered on. I then create the same Nova instances. However, looking at their IPMI remote console, I don't see the IPA doing anything. | 11:57 |
kaloyank | After a while, the machine is rebooted. However, it doesn't PXE-boot into iPXE and ends up in the EFI shell. | 11:58 |
dtantsur | kaloyank: fast-track with nova/neutron is risky. if networking parameters change between 2 instances, what happens? | 11:58 |
dtantsur | otherwise, it's hard to debug from textual message. you need to check the logs (DHCP, ironic-conductor, maybe even tcpdump) | 11:58 |
kaloyank | I haven't tried changing, I'd like to share something more :) | 11:58 |
dtantsur | do you use the same ports for the instance? otherwise, the IP address may change. | 11:59 |
kaloyank | I do, let me share some more information so you can have better context | 11:59 |
kaloyank | I checked the ironic-conductors logs and found out the following sequence of events: | 12:00 |
opendevreview | Vanou Ishii proposed openstack/ironic stable/zed: [iRMC] Handle IPMI incompatibility in iRMC S6 2.x https://review.opendev.org/c/openstack/ironic/+/870881 | 12:00 |
kaloyank | 1. After cleaning a machine, the IPA sends heartbeats and the ironic-conductors processes them as expected. | 12:01 |
kaloyank | 2. When an instance is spawned, the ironic-conductor starts downloading the image from Glance, so it can serve it from its HTTP server to the IPA | 12:01 |
kaloyank | 3. In the meantime, the IPA sends more heartbeats (two to be precise) but the ironic-conductor refuses to register them, as the code requires an exclusive instance lock. However, such a lock was already taken for the node when the deployment process started | 12:02 |
kaloyank | 4. After the fast_track_timeout expires, the ironic-conductor considers the IPA dead and reboots the node | 12:03 |
dtantsur | the "requires an exclusive instance lock" is probably a red herring hiding the actual failure | 12:04 |
kaloyank | 5. However, the created neutron port lacks the extra DHCP options to perform the PXE chainloading of iPXE | 12:04 |
kaloyank | well, I can share the logs and the exact line of code | 12:04 |
dtantsur | aha, #5 is probably a bug, I guess we ignore DHCP options on fast-track and then forget to add them if fast-track fails | 12:05 |
kaloyank | I presumed the same as it seemed quite out of order, anyway logs incoming, shall I pastebin them? | 12:06 |
dtantsur | yep, although I cannot promise to dive into them too deeply | 12:07 |
kaloyank | https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/agent_base.py#L583 where the shared lock is elevated to exclusive but fails (an exclusive lock already exists) | 12:08 |
dtantsur | yeah, but the exclusive one should disappear soon | 12:10 |
dtantsur | if it does not, you need to track what is holding the lock for so long (stuck image download, etc) | 12:10 |
kaloyank | that was my other question, the image is quite large, ~10GB, with a smaller image it's not a problem since the fast_track timeout hasn't expired | 12:11 |
kaloyank | so is this behavior expected and do I need to work around it somehow? | 12:11 |
dtantsur | kaloyank: if you don't see any abnormalities in the logs, you may try to play with the value of fast_track_timeout | 12:16 |
dtantsur | under #5 is quite likely a bug that needs fixing | 12:16 |
kaloyank | it's already at its maximum (5 mins), the image just takes a lot longer than that to download and verify | 12:16 |
dtantsur | I'm frankly not sure why fast_track_timeout has an upper limit.. it has to be lower than ramdisk_heartbeat_timeout, but even that can be changed | 12:18 |
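For anyone following along, the two knobs being discussed live in ironic.conf; a rough sketch follows (section and option names from memory — verify them and the allowed ranges against your release's sample config):

```ini
[deploy]
# Keep the agent ramdisk running between deployments instead of powering off.
fast_track = true
# Seconds for which a previous agent heartbeat is considered fresh enough to
# skip booting the ramdisk again; per the discussion above, it has to stay
# below the heartbeat timeout configured below.
fast_track_timeout = 300

[api]
# How long the conductor tolerates silence from the agent between heartbeats.
ramdisk_heartbeat_timeout = 300
```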
dtantsur | but that's a curious issue you describe. indeed, we cannot register a heartbeat while the lock is actively held.. | 12:18 |
kaloyank | I presume it's because the code updates important fields; still, I tried bypassing the code in the exception handler but nothing changed in behavior | 12:19 |
dtantsur | it could be interesting to release the lock while we're downloading/converting the image. The question is: what to do if we cannot re-acquire the lock afterwards. | 12:20 |
dtantsur | OR allow heartbeats to be registered when the lock is held. but that requires changing database. | 12:21 |
kaloyank | can the image be downloaded without an exclusive lock? | 12:22 |
dtantsur | kaloyank: a workaround may be to disable image conversion on the conductor side | 12:22 |
dtantsur | kaloyank: that's my first idea. the problem is: what to do if you can no longer acquire the lock after the image is downloaded. | 12:22 |
kaloyank | I use raw images only, that's why they're so big | 12:23 |
dtantsur | ah, so there is no conversion, it's just downloaded for 5 minutes? | 12:23 |
kaloyank | yes, it's really big, I have a smaller one and it deploys without a problem | 12:23 |
dtantsur | if you use swift as a glance backend, you can make ironic serve a URL to swift without local caching | 12:24 |
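The behaviour dtantsur describes is driven by an [agent] option, roughly like this (option name from memory, and the 'swift' value only helps when Glance is actually backed by Swift):

```ini
[agent]
# 'swift' hands the agent a temporary Swift URL so the conductor never caches
# the image locally; the other values ('http', 'local') keep the image on the
# conductor and serve it from its HTTP server.
image_download_source = swift
```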
kaloyank | I don't have swift :/ | 12:24 |
kaloyank | as I store the images in Cinder (Glance Cinder store) I was thinking of attaching the volume to the ironic conductor and streaming the block device. I know that no such code exists, but if there are no simpler options, I'd opt for writing the code | 12:25 |
dtantsur | cc TheJulia JayF (once you wake up) for more ideas ^^ | 12:31 |
opendevreview | Dmitry Tantsur proposed openstack/ironic bugfix/21.2: Update .gitreview for bugfix/21.2 https://review.opendev.org/c/openstack/ironic/+/867826 | 12:36 |
kaloyank | logs were rotated, I think the issue was made clear, do you still need them? | 12:37 |
kaloyank | jftr, I'm running the latest Yoga release from RDO | 12:45 |
mnasiadka | Hello | 12:58 |
mnasiadka | During upgrade Wallaby->Xena while running online_data_migrations I get "Error while running update_to_latest_versions: 'BIOSSetting' object has no attribute 'id'." - help appreciated ;-) | 12:59 |
jssfr | how do I reset a node which is in state `deploy failed` or `wait call-back` state? | 13:14 |
jssfr | by reset, I mean bring it into a clean/available state via cleaning | 13:14 |
jssfr | abort, provide, and manage all give state errors | 13:14 |
jssfr | ah, undeploy! | 13:16 |
jssfr | thanks documentation! | 13:16 |
kaloyank | or.. baremetal node manage <node-name>; baremetal node provide <node-name> | 13:23 |
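The two recovery paths mentioned above look roughly like this with the baremetal CLI (node names are placeholders):

```
# From "deploy failed" / "wait call-back": tear down and let automated
# cleaning return the node to "available".
baremetal node undeploy <node-name>

# Alternative route via the manageable state, as kaloyank suggests:
baremetal node manage <node-name>
baremetal node provide <node-name>
```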
rpittau | looking for reviews and approval for 2 fixes in bifrost CI https://review.opendev.org/c/openstack/bifrost/+/872634 and https://review.opendev.org/c/openstack/bifrost/+/875929 thanks! | 13:31 |
jssfr | okay, different issue now | 13:55 |
jssfr | We are trying to deploy an image which contains an mdraid | 13:55 |
jssfr | using the IPA, the image gets written to the disk successfully, but then the IPA re-reads the mdraid and then attempts to install grub | 13:55 |
jssfr | which fails, because: | 13:55 |
jssfr | Mar 06 13:03:16 ubuntu ironic-python-agent[2384]: 2023-03-06 13:03:16.645 2384 ERROR root [-] Command failed: install_bootloader, error: Error finding the disk or partition device to deploy the image onto: No EFI partition could be detected on device /dev/md127 and EFI partition UUID has not been recorded during deployment (which is often the case for whole disk images). Are you using a UEFI-compatible image?: ironic_python_agent.errors.DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No EFI partition could be detected on device /dev/md127 and EFI partition UUID has not been recorded during deployment (which is often the case for whole disk images). Are you using a UEFI-compatible image? | 13:55 |
jssfr | is there a sane way to get disk images which include an mdraid deployed? | 13:56 |
kaloyank | Well, can you verify that the image has an EFI system partition? What does its partition table look like? | 13:56 |
jssfr | (the EFI partition is outside the mdraid) | 13:56 |
jssfr | kaloyank, https://paste.ubuntu.com/p/KkfYgqDmVj/ | 13:57 |
jssfr | OARrr | 13:57 |
jssfr | why the heck does that need a login | 13:57 |
jssfr | let me find a different paste, I was assuming pastebinit to be sane | 13:57 |
jssfr | https://paste.debian.net/hidden/0f5930b1/ | 13:58 |
jssfr | kaloyank, ^ that's the partition table of the image | 13:58 |
jssfr | p3 is the RAID, p1 is the EFI system partition. | 13:58 |
jssfr | (this is from the box where we created the image, hence /dev/loopX) | 13:59 |
kaloyank | jssfr: I'm looking at this: https://github.com/openstack/ironic-python-agent/blob/a1670753a23a79b6536f67eae9cca154e0ed2e65/ironic_python_agent/extensions/image.py#L695 which calls this: https://github.com/openstack/ironic-python-agent/blob/fcb65cae18f4a6b4b05fb70677e2fa114e0558a9/ironic_python_agent/hardware.py#L1182 | 14:05 |
kaloyank | did you specify a root_device_hint ? | 14:06 |
jssfr | nope, looking into that now, thanks! | 14:28 |
jssfr | kaloyank, do you happen to have a documentation link for root_device_hint? I can only find documentation from pike :/ | 14:30 |
kaloyank | jssfr: https://docs.openstack.org/ironic/pike/install/advanced.html#specifying-the-disk-for-deployment-root-device-hints | 14:30 |
jssfr | that's the pike stuff I found | 14:31 |
kaloyank | That's what you need, it's called root_device in Pike | 14:31 |
jssfr | so something like {"name": "/dev/sda"} as root_device_hint? | 14:32 |
jssfr | and that's still a property, not a driver_info or something? | 14:32 |
kaloyank | Yes, it's a property | 14:33 |
jssfr | ack, thanks | 14:33 |
jssfr | (the code reads as if it's still `root_device` though, not root_device_hint?) | 14:33 |
kaloyank | "That's what you need, it's called root_device in Pike" | 14:33 |
kaloyank | jssfr: JFTR, Pike is not supported, there might be bugs related to your case that are fixed in more recent releases, it'd be really nice if you can run a supported release | 14:35 |
jssfr | we are on victoria | 14:35 |
jssfr | that's why I was asking for newer documentation | 14:36 |
kaloyank | then why were you asking about Pike? | 14:36 |
jssfr | ah, misunderstanding :) | 14:36 |
jssfr | what I meant to say was "all documentation I found was for Pike" | 14:36 |
jssfr | "… which is not what I wanted, I wanted something more recen" | 14:36 |
jssfr | I see how that wasn't clear, sorry | 14:36 |
kaloyank | https://docs.openstack.org/ironic/victoria/install/advanced.html#specifying-the-disk-for-deployment-root-device-hints | 14:36 |
jssfr | thanks a lot | 14:36 |
kaloyank | this is the appropriate link then | 14:36 |
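For completeness, setting the hint on a node looks roughly like this (values are examples; any supported hint such as name, serial, wwn or size can be used — see the linked docs):

```
baremetal node set <node-name> --property root_device='{"name": "/dev/sda"}'
# or, more robustly, match on something stable like the disk serial:
baremetal node set <node-name> --property root_device='{"serial": "<disk-serial>"}'
```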
iurygregory | good morning Ironic | 14:56 |
TheJulia | good morning | 14:57 |
TheJulia | kaloyank: so... I guess that is doable with appropriate code, the question is going to be "how to get application credentials for volume access down to the agent". We've generally discussed it before and have a general idea, but I don't think we've implemented any such credential path. | 15:01 |
dtantsur | the initial problem is also curious. fast-track can break if the image is too large. which is definitely not the desired behavior. | 15:01 |
TheJulia | I'm not quite sure if it is still heartbeating | 15:02 |
JayF | good morning | 15:02 |
dtantsur | why not? | 15:02 |
JayF | #startmeeting ironic | 15:02 |
opendevmeet | Meeting started Mon Mar 6 15:02:33 2023 UTC and is due to finish in 60 minutes. The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:02 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:02 |
opendevmeet | The meeting name has been set to 'ironic' | 15:02 |
TheJulia | but, yes, it is a can of worms | 15:02 |
iurygregory | o/ | 15:02 |
dtantsur | true | 15:02 |
dtantsur | hi folks o/ | 15:02 |
TheJulia | dtantsur: the window is measured against the heartbeat operation still occurring | 15:02 |
matfechner | o/ | 15:02 |
JayF | Should be a pretty quick one this morning; I'll try to fly through | 15:02 |
TheJulia | o/ | 15:03 |
JayF | #topic Announcements/Reminder | 15:03 |
JayF | As always, tag your patches with ironic-week-prio to get them reviewed | 15:03 |
JayF | And we are counting down, focusing on any outstanding reviews would be good; I haven't looked but will make a point to check today | 15:03 |
kaloyank | TheJulia: what do you mean by application credentials exactly? How Ironic authenticates to the storage backend or something else? | 15:03 |
rpittau | o/ | 15:03 |
JayF | #topic Review Actions from last meeting | 15:03 |
TheJulia | kaloyank: lets revisit once the irc meeting is over :) | 15:04 |
kaloyank | sure | 15:04 |
JayF | kaloyank: I'll be fast :D | 15:04 |
JayF | okay, action items, we had 3 | 15:04 |
JayF | First one is tied to a general announcement; I will be out Weds-Mon at a conference with limited availability. If you need me urgently, email is best (anything-but-IRC). | 15:05 |
JayF | And iurygregory will be running the meeting next week in my stead | 15:05 |
JayF | I have an untaken action I'll renew | 15:05 |
JayF | #action JayF To look into VMT for Ironic | 15:05 |
JayF | That's the actions from last week | 15:05 |
JayF | moving on | 15:05 |
JayF | #topic CI Status | 15:05 |
JayF | Is there anything notable about CI from the last 7 days that people should know? | 15:05 |
rpittau | JayF: just 2 minor fixes for bifrost CI | 15:06 |
rpittau | they're in the prio list | 15:06 |
JayF | Thanks for that \o/ | 15:06 |
JayF | I'll look this morning | 15:06 |
TheJulia | mnasiadka: uhh... wow, that shouldn't break that way. Have you checked oslo.db is appropriate based upon requirements.txt and the xena upper-constraints.txt file from the openstack/requirements repo branch stable/xena ? | 15:06 |
rpittau | thanks | 15:06 |
JayF | #topic VirtualPDU update | 15:06 |
iurygregory | I saw some of the jobs in networking-generic-switch failing (but didn't really look to find the root cause) | 15:07 |
JayF | So, thanks to rpittau, fungi and their work, VirtualPDU has us as cores now | 15:07 |
rpittau | \o/ | 15:07 |
mnasiadka | TheJulia: that's a Kolla container, I can't assume it's different but let me check :) | 15:07 |
JayF | additionally, I proposed + landed governance change to put it officially under Ironic's program | 15:07 |
iurygregory | tks everyone! | 15:07 |
JayF | So VirtualPDU is adopted. There is a further step; moving the repo from x/virtualpdu -> openstack/virtualpdu; but that requires gerrit downtime and will not happen until after the release sometime | 15:08 |
jssfr | kaloyank, specifying root_device helped, thanks! | 15:08 |
rpittau | JayF: what do we want to do for a possible release ? | 15:08 |
rpittau | for virtualpdu | 15:08 |
TheJulia | mnasiadka: the issue is it has been for 6 years, something is *very* wrong. | 15:08 |
JayF | rpittau: cutting a manual one in the meantime would be my suggestion | 15:08 |
rpittau | mmm yeah | 15:08 |
rpittau | ok | 15:08 |
JayF | I would assume release-management of virtualpdu won't be possible until after antelope is done | 15:08 |
JayF | so if we need a virtualpdu release before that; manual is the way | 15:08 |
rpittau | I'll look into that | 15:09 |
JayF | Thanks for that \o/ | 15:09 |
JayF | #topic Release Countdown | 15:09 |
JayF | cycle highlights have landed | 15:09 |
JayF | probably too late to land any changes of bulk | 15:09 |
JayF | but if you have something you want in antelope, now is the time to make noise and get it reviewed | 15:10 |
JayF | #link https://review.opendev.org/c/openstack/ironic/+/876195 release notes prelude | 15:10 |
JayF | I will revise this today and push to get it merged; if you have opinions on it post them soon :) | 15:10 |
mnasiadka | TheJulia: oslo.db==11.0.0, and yoga upper-constraints.txt says 11.2.0, yes, something is wrong :) | 15:10 |
JayF | anything else for release before I move on? | 15:10 |
JayF | #topic Open Discussion | 15:11 |
JayF | As said earlier, I will be unavailable from Weds thru Mon traveling to and participating in SCALE 20x. | 15:11 |
TheJulia | mnasiadka: so that should be xena | 15:11 |
JayF | If you need me urgently for anything, I think most of you have my email. That'll be checked much more often than IRC | 15:11 |
JayF | anything else for open discussion | 15:11 |
mnasiadka | TheJulia: sorry, mixed it up - yes, it's according to xena's u-c | 15:12 |
TheJulia | I'll also be at SCALE 20x later this week, just fyi. | 15:12 |
mnasiadka | TheJulia: what else can be wrong? | 15:12 |
TheJulia | mnasiadka: i guess step 0 is going to be open the database, ensure the column is there | 15:12 |
JayF | I'm calling it | 15:13 |
JayF | #endmeeting | 15:13 |
opendevmeet | Meeting ended Mon Mar 6 15:13:16 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:13 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.html | 15:13 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.txt | 15:13 |
opendevmeet | Log: https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.log.html | 15:13 |
TheJulia | heh | 15:13 |
TheJulia | kaloyank: o/ So I meant credentials for how ironic-python-agent connects to Cinder to access the image, that is unless you want Ironic's conductor to handle downloading and packing the image locally. | 15:14 |
TheJulia | kaloyank: I *thought* cinder had a capability to do that though, now that I think about it. | 15:14 |
mnasiadka | TheJulia: I was upgrading from 17.1.1.dev7, there is no 'id' column in 'bios_settings' (only 'node_id') | 15:14 |
mnasiadka | TheJulia: https://paste.openstack.org/show/b0LjGZNV0fwszRgiOdJL/ | 15:14 |
kaloyank | TheJulia: I was thinking of making the ironic-conductor process attach a Cinder volume to itself and serve its contents | 15:15 |
kaloyank | implementing this in the IPA will be harder, though more efficient in terms of the end result | 15:15 |
TheJulia | mnasiadka: wow, that is a bug then | 15:16 |
TheJulia | mnasiadka: can you get a full stacktrace for us? I'm really surprised we didn't find this in CI over the years | 15:16 |
TheJulia | I was thinking id would be coming from the underlying base model, but not in that table, node_id is primary key | 15:16 |
mnasiadka | TheJulia: sure, is there a guide somewhere so I don't miss anything? | 15:17 |
TheJulia | mnasiadka: an upgrade guide? or on getting a stacktrace from the logs? I'm confused by the question, sorry | 15:17 |
mnasiadka | TheJulia: stacktrace :) | 15:17 |
TheJulia | kaloyank: I still could have sworn cinder was supposed to have an extract capability :) | 15:18 |
TheJulia | mnasiadka: no, as much context around the error as possible, typically the ~20 lines before is most of the stack trace which would occur | 15:18 |
kaloyank | TheJulia: by extract you mean point it at a path and let it dump a volume to a file? | 15:18 |
TheJulia | kaloyank: yeah | 15:19 |
TheJulia | kaloyank: I might be totally off base, and maybe this is not universally supported, but I thought you could save a cinder volume to a glance image in some way | 15:19 |
mnasiadka | TheJulia: https://storyboard.openstack.org/#!/story/2010632 - hope that's enough (in the comment) | 15:20 |
dtantsur | some code reshuffling if anyone has time: https://review.opendev.org/c/openstack/ironic/+/874677 https://review.opendev.org/c/openstack/ironic/+/875915 | 15:20 |
kaloyank | TheJulia: You absolutely can make an image out of a volume but how does this help the fact that the ironic-conductor still has to download the image from Glance | 15:20 |
TheJulia | oh, i bet the batching doesn't grok there being no id field | 15:21 |
TheJulia | kaloyank: so the underlying issue is that a cinder volume may be any storage technology, and the ironic-conductor may not be able to access it | 15:22 |
TheJulia | an mvp might be, get it packaged out to glance, download it, and then work on improving the flow for greater efficiency | 15:22 |
kaloyank | TheJulia: I still don't see how that resolves the issue :/ | 15:24 |
kaloyank | right now, Glance attaches the volume, copies the data to a temporary file, serves the image from the file and after that deletes it | 15:25 |
kaloyank | The load of the process grows exponentially because Ironic doesn't allow baremetal nodes to share images | 15:25 |
kaloyank | I'm looking towards eliminating the extra copying from the volume to a temporary file at Glance to a temporary file at the ironic-conductor host | 15:27 |
TheJulia | eww, so it doesn't store it, so it would rinse/repeat with each deployment | 15:27 |
kaloyank | exactly | 15:28 |
TheJulia | kaloyank: would you be able to trigger online-data-migrations without the batching functionality, or is your environment too big? | 15:30 |
kaloyank | TheJulia: I don't know what you mean by online-data-migrations | 15:32 |
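For context, these are the migrations mnasiadka's upgrade is running; the CLI entry point is roughly the following (flag semantics from memory — check --help on your release):

```
# Run pending online data migrations; --max-count limits how many records are
# migrated per invocation (the batching being discussed), so the command can
# be repeated until it reports nothing left to migrate.
ironic-dbsync online_data_migrations --max-count 1000
```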
mnasiadka | TheJulia: anything I can do in the meantime? I was supposed to upgrade to Yoga by tomorrow - should we leave it at Xena level and wait for a patch? | 15:45 |
dtantsur | kaloyank: a correction: each glance image should be downloaded only once | 15:47 |
kaloyank | dtantsur: I believe you, still I can't explain to myself why it took 252 secs (per logs) to download and verify an image when spawning 4 instances simultaneously while it took 82 secs for the same image when spawning a single instance :( | 15:55 |
dtantsur | hmm | 15:55 |
dtantsur | kaloyank: the image cache should log its decisions in DEBUG level | 15:55 |
kaloyank | I know, that's where I got those numbers from, but the logs got rotated and I haven't set up a debug file for journald and now all the HW is occupied by another team | 15:57 |
dtantsur | that's a pity. even when all 4 images are downloaded from scratch, the downloads should happen in parallel | 15:59 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [WIP] Migrate the inspector's /continue API https://review.opendev.org/c/openstack/ironic/+/875944 | 16:01 |
kaloyank | dtantsur: OK, managed to snatch one, brb | 16:02 |
espenfl | is there a way to inspect the configdrive post deployment? | 16:11 |
samuelkunkel[m] | Normally it's a read-only blockdevice (assuming regular ISO 9660). If you check via lsblk you should see a block device | 16:11 |
samuelkunkel[m] | if you mount this, you can read into the config drive | 16:11 |
samuelkunkel[m] | But you need access to the OS for this | 16:11 |
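A minimal sketch of that approach from inside the deployed OS, assuming the usual OpenStack config-drive conventions (label "config-2", content under openstack/latest/):

```
# Find and mount the config drive, then read the rendered data.
mount -o ro /dev/disk/by-label/config-2 /mnt
ls /mnt/openstack/latest/
# typically: meta_data.json  network_data.json  user_data  ...
cat /mnt/openstack/latest/user_data
```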
espenfl | ah, okay, that is the problem: I'm trying to diagnose the setup of networking, the default password and SSH keys. Is there a way to see it on the Ironic side? Or regenerate it so that I can inspect it based on what I have in my inventory (Bifrost putting it into instance_info)? | 16:13 |
kaloyank | dtantsur: I confirm the image is downloaded only once | 16:20 |
opendevreview | Jay Faulkner proposed openstack/ironic master: Add prelude for OpenStack 2023.1 Ironic release https://review.opendev.org/c/openstack/ironic/+/876195 | 16:27 |
JayF | We need to land ^^ that very soon, if folks wanna review the changes I made for rpittau | 16:27 |
kaloyank | dtantsur: So I think I found the culprit and this is the image SHA verification: `Computed sha512 checksum for image /var/lib/ironic/images/2c2f6c92-1595-47e5-8a75-816dce4af0ea/disk in 657.11 seconds` | 16:31 |
TheJulia | +2'ed | 16:32 |
TheJulia | kaloyank: yeouch | 16:32 |
TheJulia | how big is that image? | 16:32 |
kaloyank | TheJulia: ~10GB | 16:32 |
TheJulia | .... it shouldn't be that slow... :\ | 16:32 |
kaloyank | I suppose so, I see that Ironic uses the python hashlib, shouldn't it be using, say sha256sum ? | 16:33 |
* TheJulia wonders if we still need conductor side checksum operation on images | 16:33 |
TheJulia | kaloyank: I think it might be attempting to match the glance record to verify what was generated/stored | 16:33 |
TheJulia | os_image_algo in the glance image properties ... I think | 16:34 |
kaloyank | TheJulia: yeah, I was thinking the same | 16:34 |
TheJulia | since we don't have iscsi deploy, doing the checksum locally might be redundant | 16:34 |
TheJulia | well | 16:34 |
TheJulia | depends on if it is a remote deployment or a local boot I guess | 16:35 |
JayF | It's also the only place we can act on it | 16:35 |
JayF | IPA finding a bad checksum has no option but to fail | 16:35 |
TheJulia | if is_ramdisk_deploy, then run_checksum_check() | 16:35 |
JayF | conductor finding one could potentially retry or give a better error | 16:35 |
TheJulia | JayF: it is fatal to begin with even if it is found by the conductor | 16:36 |
TheJulia | true | 16:36 |
JayF | yeah I figured we didn't do anything with it today | 16:36 |
TheJulia | Although I think we carry the error back from the agent | 16:36 |
kaloyank | I have to admit, this function doesn't look good: https://github.com/openstack/oslo.utils/blob/master/oslo_utils/fileutils.py#L112 | 16:38 |
JayF | kaloyank: what approach would you take instead? | 16:38 |
JayF | that looks more or less like a standard python hash-checking function | 16:39 |
JayF | the time.sleep(0) is because we use eventlet | 16:39 |
kaloyank | JayF: indeed it does, but the standard sha512sum tool on the same file completed in ~40s | 16:40 |
kaloyank | I'd replace it with that, or at least increase the chunksize (it's 64K currently) | 16:41 |
TheJulia | the issue is without time.sleep(0), you might end up blocking the rest of the app for ~40s | 16:41 |
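A minimal sketch of the pattern under discussion, assuming eventlet monkey-patching is in effect (the function name is illustrative, not the oslo.utils API; the real helper is the one kaloyank linked above):

```python
import hashlib
import time


def checksum_with_yield(path, algorithm="sha512", chunk_size=64 * 1024):
    """Hash a file in chunks, yielding between reads.

    The time.sleep(0) does nothing for the hash itself, but under eventlet it
    gives other greenthreads (heartbeats, API requests) a chance to run
    between chunks instead of being starved for the whole checksum.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            time.sleep(0)
    return hasher.hexdigest()
```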
JayF | ... is there a chance something else is blocking significantly so that hash is giving up too much time to other stuff | 16:41 |
TheJulia | ^^^^ that | 16:41 |
JayF | debugging that is going to be ridic | 16:41 |
kaloyank | I gather some stats for this host, any clues? | 16:43 |
TheJulia | I think the question might be what else was the conductor doing | 16:45 |
TheJulia | ... but my underlying thought is "do we still need it" | 16:45 |
dtantsur | TheJulia: we need to recalculate checksums for raw images after conversion | 16:48 |
dtantsur | although it's possible that in this case we don't check if the images are initially raw | 16:48 |
TheJulia | oh yeah, this could be unpacked to something absurd | 16:49 |
kaloyank | will spawning more workers help? | 16:49 |
TheJulia | kaloyank: no, what type of image is this? | 16:49 |
dtantsur | kaloyank: how much does sha512sum (the tool) take? | 16:49 |
JayF | > kaloyank | JayF: indeed it does, but the standard sha512sum tool on the same file completed for ~ 40s | 16:50 |
dtantsur | ah, missed that. damn, yeah. | 16:50 |
kaloyank | TheJulia: it's a raw image | 16:50 |
dtantsur | TheJulia: re blocking the app: I'd expect eventlet to yield to other threads when waiting for a command? | 16:51 |
dtantsur | otherwise we wouldn't be able to use ipmitool :) | 16:51 |
JayF | there's no chance to yield, generally no blocking on that hashing activity | 16:51 |
JayF | I was thinking the opposite: there's a competing thread that needs to yield but isn't | 16:51 |
dtantsur | https://github.com/eventlet/eventlet/blob/master/eventlet/green/subprocess.py#L91 | 16:52 |
dtantsur | subprocess works as a loop with sleep in eventlet | 16:53 |
JayF | could eventlet.sleep vs time.sleep be an impact there? | 16:53 |
dtantsur | unlikely; time.sleep should be monkey-patched | 16:53 |
dtantsur | also, the native time.sleep(0) is basically no-op | 16:53 |
JayF | https://github.com/openstack/ironic/blob/master/ironic/cmd/__init__.py#L26 yeah just checking, only thing we exclude is os | 16:55 |
TheJulia | hmm | 16:56 |
kaloyank | jftr, the time it takes for a checksum to be calculated seems to scale with the number of machines currently being deployed. For a single machine with the same image the checksum was calculated in ~80s, which correlates with the results of sha512sum | 16:57 |
JayF | kaloyank: provisioning from the same conductor group, yeah? | 16:58 |
dtantsur | well, a CPU-intensive task and green threads..... | 16:58 |
kaloyank | JayF: yes, I have only one ironic-conductor | 16:59 |
JayF | So that's what's happening. We're running N greenthreads for N deployments | 16:59 |
JayF | all trying to checksum the same image | 16:59 |
JayF | yeah? | 16:59 |
kaloyank | yes | 16:59 |
TheJulia | has anyone looked to see if hashlib is doing its work in python, or is it a compiled module from C? | 16:59 |
dtantsur | basically | 16:59 |
dtantsur | I *think* hashlib is in C | 16:59 |
dtantsur | which does not mean it releases gil | 16:59 |
dtantsur | I wonder if it helps to crank up read_chunksize significantly | 17:00 |
dtantsur | .. insignificantly in my testing | 17:01 |
kaloyank | if there's an image cache, why does the checksum have to be verified for each instance and not once per image? | 17:02 |
dtantsur | kaloyank: the cache does not contain the checksum | 17:03 |
dtantsur | what IS a good question is why we recalculate it at all, given that your images are already raw | 17:03 |
dtantsur | (in my testing, using eventlet.green.subprocess is comparable with the oslo's function) | 17:04 |
rpittau | good night! o/ | 17:04 |
kaloyank | dtantsur: well, besides someone tampering with the images I can't think of any other reason | 17:06 |
dtantsur | let me prototype something here | 17:07 |
TheJulia | dtantsur: "why" is, I think, the most excellent question atm | 17:08 |
kaloyank | I'm off for today, thanks for the help and support, bye! 0/ | 17:08 |
opendevreview | Dmitry Tantsur proposed openstack/ironic master: [WIP] Do not recalculate checksum if disk_format is not changed https://review.opendev.org/c/openstack/ironic/+/876595 | 17:15 |
dtantsur | TheJulia, kaloyank maybe ^^^ | 17:15 |
dtantsur | this won't solve the "checksumming is slow" problem | 17:15 |
dtantsur | but at least kaloyank will be happy (if it works) | 17:16 |
dtantsur | I suspect that outside of the raw->raw context, the slowness is not such a huge deal | 17:16 |
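A hypothetical guard illustrating the idea behind that WIP change (not the actual patch; compute_file_checksum is the oslo.utils helper discussed earlier, and the function and parameter names here are made up for illustration):

```python
from oslo_utils import fileutils


def checksum_after_download(image_path, source_format, force_raw=True):
    # If no conversion happened (e.g. the Glance image is already raw), the
    # checksum recorded in Glance still matches the file on disk, so there is
    # no need to spend minutes re-hashing a multi-GB image on the conductor.
    if not force_raw or source_format == "raw":
        return None  # reuse the Glance-provided checksum
    return fileutils.compute_file_checksum(image_path, algorithm="sha512")
```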
jlvillal | I think this might be wrong, as x86 and aarch64 all get put into the same bucket. At least for me it was sending ipxe-x86_64.efi to my ARM64 client. https://github.com/openstack/bifrost/blob/0e6be25ee17ea75d60eb4f32fea37db0f79af52d/playbooks/roles/bifrost-ironic-install/templates/dnsmasq.conf.j2#L91-L93 | 17:31 |
jlvillal | 11 seems to be the client-arch code for aarch64. | 17:32 |
TheJulia | mnasiadka: so I think i figured out what the issue is | 18:30 |
TheJulia | jlvillal: likely bifrost was never expected to have a mixed fleet | 18:30 |
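For reference, per-architecture matching in dnsmasq looks roughly like this (client-arch values per RFC 4578; the tag and boot-file names are examples, not what bifrost ships today):

```
# x86-64 UEFI clients (arch 7 and 9) get the x86_64 iPXE binary...
dhcp-match=set:efi-x86_64,option:client-arch,7
dhcp-match=set:efi-x86_64,option:client-arch,9
dhcp-boot=tag:efi-x86_64,ipxe-x86_64.efi
# ...while aarch64 UEFI clients (arch 11) get an arm64 build.
dhcp-match=set:efi-arm64,option:client-arch,11
dhcp-boot=tag:efi-arm64,ipxe-arm64.efi
```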
opendevreview | Verification of a change to openstack/ironic master failed: Add prelude for OpenStack 2023.1 Ironic release https://review.opendev.org/c/openstack/ironic/+/876195 | 18:30 |
mnasiadka | TheJulia: fantastic, how dark is the alley I'm in? :) | 18:31 |
TheJulia | mnasiadka: not a bad one | 18:33 |
TheJulia | mnasiadka: I'm working on a patch now | 18:33 |
TheJulia | unfortunately I think this was not found before because we don't have data in the table in CI | 18:33 |
TheJulia | but it is a glaringly bad bug | 18:34 |
mnasiadka | TheJulia: that's what I deduced myself (the lack of data in CI) | 18:34 |
mnasiadka | TheJulia: in the meantime I was thinking of making a mysqldump of bios_settings table, going forward with the upgrade, and importing it back in after the upgrade - does it make any sense? | 18:35 |
TheJulia | eh... I'm not so sure about that | 18:35 |
JayF | if the bug is as bad as it sounds it is, I bet we'll have a fixed stable/xena out for you in a few days | 18:36 |
JayF | maybe not released but in git for sure | 18:36 |
mnasiadka | Kolla builds containers based on stable/xena and so on, we don't really follow releases - so no worries for that | 18:37 |
JayF | We are extremely on top of bug backports in Ironic, I think it's one of the things we do really well | 18:37 |
jrosser | jlvillal: do you have ipxe_bootfile_name_by_arch set in the ironic [ipxe] config section? | 18:38 |
jlvillal | jrosser, I do. | 18:38 |
mnasiadka | TheJulia: Ok then, I'll probably push off the upgrade to Yoga to next week - unless leaving Ironic at Xena and upgrading Nova to Yoga is something that is supposed to work. | 18:40 |
jlvillal | TheJulia, makes sense not expecting a mixed fleet. Thanks. | 18:41 |
* TheJulia gives cookies to tox | 18:47 |
TheJulia | I've got a unit test now which reproduces the error | 18:57 |
TheJulia | the bios setting model incremented from 1.0 to 1.1 too, so since you had data at version 1.0 | 18:58 |
opendevreview | Merged openstack/ironic master: Add prelude for OpenStack 2023.1 Ironic release https://review.opendev.org/c/openstack/ironic/+/876195 | 19:11 |
mnasiadka | TheJulia: nice, it's 8pm here, so if there's something I could try out in my morning - please leave me a message - thanks for the help :) | 19:14 |
TheJulia | mnasiadka: absolutely | 19:23 |
TheJulia | mnasiadka: goodnight :) | 19:23 |
*** shadower6 is now known as shadower | 19:39 | |
opendevreview | Julia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits https://review.opendev.org/c/openstack/ironic/+/876626 | 19:40 |
TheJulia | JayF: I think ^^^ will take care of mnasiadka's issue | 19:41 |
JayF | looking | 19:41 |
dtantsur | jlvillal: if you make bifrost multi-arch-aware, you'll be my hero :) | 19:49 |
dtantsur | I think I already promised you a beer the other day? | 19:49 |
* dtantsur slowly blends into the background | 19:49 | |
TheJulia | I feel like I'm going to need to have lots of beer on hand.... | 19:54 |
jlvillal | dtantsur, hehe :) | 19:57 |
opendevreview | Julia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits https://review.opendev.org/c/openstack/ironic/+/876626 | 20:35 |
TheJulia | stevebaker[m]: https://review.opendev.org/q/I99a2d8ecab657c8e4c852c73e816a5a8f2856471 <-- since you asked me | 21:06 |
jlvillal | So when Ironic/Bifrost creates a configdrive-*.iso.gz file, is it not really gzipped and not really an ISO file? I notice the ones in /var/lib/ironic/httpboot/ are ASCII text. Looks like it might be base64 encoded. | 21:07 |
JayF | TheJulia: stevebaker[m]: I just landed all the remaining ones | 21:09 |
stevebaker[m] | \o/ | 21:29 |
TheJulia | jlvillal: decode them and they should be gzipped iso files | 21:39 |
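So inspecting one of those files is roughly the following (the path reflects the bifrost layout mentioned above and is an example):

```
# The file is base64 text wrapping a gzipped ISO 9660 image.
base64 -d /var/lib/ironic/httpboot/configdrive-<node-uuid>.iso.gz \
  | gunzip > /tmp/configdrive.iso
mount -o loop,ro /tmp/configdrive.iso /mnt
ls /mnt/openstack/latest/
```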
opendevreview | Julia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits https://review.opendev.org/c/openstack/ironic/+/876626 | 21:56 |
TheJulia | JayF: typo fixed ^ | 21:57 |
JayF | +2 | 22:18 |
opendevreview | Merged openstack/metalsmith stable/yoga: Use a network cache in Instance https://review.opendev.org/c/openstack/metalsmith/+/873770 | 23:27 |
opendevreview | Merged openstack/metalsmith stable/xena: Use a network cache in Instance https://review.opendev.org/c/openstack/metalsmith/+/873771 | 23:27 |
opendevreview | Merged openstack/metalsmith stable/wallaby: Use a network cache in Instance https://review.opendev.org/c/openstack/metalsmith/+/873772 | 23:27 |