Monday, 2023-03-06

opendevreviewMerged openstack/sushy master: Update master for stable/2023.1  https://review.opendev.org/c/openstack/sushy/+/87615603:51
rpittaugood morning ironic! o/09:15
dtantsurJayF, jlvillal, the IPA example in bifrost does look questionable. There does not seem to be support for per-node kernel/ramdisk, it's a global variable.09:18
espenflHi there. Upon using cloud-init (from the configdrive) for setting up networks, SSH keys etc. I wonder if it in Bifrost now is possible to set `user-data` in JSON style instead of a string? Seems Ironic added support for this (https://opendev.org/openstack/ironic/commit/3e1e0c9d5e6e5753d0fe2529901986ea36934971 etc.).10:37
dtantsurespenfl: no, but there is https://opendev.org/openstack/bifrost/commit/bf1bb49c38c26c4b88e330d3f6f9ccdc8419a1f510:56
espenfldtantsur: That is awesome. Thanks. Also noticed from your writeup: https://owlet.today/posts/ephemeral-workloads-with-ironic/#id9 that `network_data` should work already. With that and the SSH keys to the `default` user we are ready to go. Testing it now. Thanks for the nice writeup.11:03
opendevreviewDmitry Tantsur proposed openstack/ironic bugfix/21.2: Fixes for tox 4.0  https://review.opendev.org/c/openstack/ironic/+/87640911:04
opendevreviewDmitry Tantsur proposed openstack/ironic bugfix/21.2: Configure CI for bugfix/21.2  https://review.opendev.org/c/openstack/ironic/+/87641011:04
opendevreviewRiccardo Pittau proposed openstack/bifrost master: Fix enabling epel repo for rpm distributions  https://review.opendev.org/c/openstack/bifrost/+/87592911:06
opendevreviewRiccardo Pittau proposed openstack/ironic master: Add a non-voting metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387311:15
opendevreviewRiccardo Pittau proposed openstack/ironic master: Add a non-voting metal3 CI job  https://review.opendev.org/c/openstack/ironic/+/86387311:16
kaloyankHello everyone, consider the following scenario. There are a couple of machines that I want to deploy using their local disks. I'm using Ironic in conjunction with Nova, Neutron and Cinder. In general, everything goes smoothly and Ironic deploys the machines as expected.11:55
kubajjHello everyone 11:56
kaloyankI then turn on fast-track deployment for these machines, destroy their respective Nova instances. Ironic cleans up these machines and leaves them powered on. I then create the same Nova instances. However, looking at their IPMI remote console, I don't see the IPA doing anything.11:57
kaloyankAfter a while, the machine is rebooted. However, it doesn't PXE-boot iPXE and ends up in the EFI shell.11:58
dtantsurkaloyank: fast-track with nova/neutron is risky. if networking parameters change between 2 instances, what happens?11:58
dtantsurotherwise, it's hard to debug from textual message. you need to check the logs (DHCP, ironic-conductor, maybe even tcpdump)11:58
kaloyankI haven't tried changing, I'd like to share something more :)11:58
dtantsurdo you use the same ports for the instance? otherwise, the IP address may change.11:59
kaloyankI do, let me share some more information so you can have better context11:59
kaloyankI checked the ironic-conductors logs and found out the following sequence of events:12:00
opendevreviewVanou Ishii proposed openstack/ironic stable/zed: [iRMC] Handle IPMI incompatibility in iRMC S6 2.x  https://review.opendev.org/c/openstack/ironic/+/87088112:00
kaloyank1. After cleaning a machine, the IPA sends heartbeats and the ironic-conductors processes them as expected.12:01
kaloyank2. When an instance is spawned, the ironic-conductor starts downloading the image from Glance, so it can serve it from its HTTP server to the IPA12:01
kaloyank3. In the meantime, the IPA sends more heartbeats (two to be precise) but the ironic-conductor refuses to register them, as the code requires an exclusive instance lock. However, such a lock is already created for the node when the deployment process started12:02
kaloyank4. After the fast_track_timeout expires, the ironic-conductor considers the IPA dead and reboots the node12:03
dtantsurthe "requires an exclusive instance lock" is probably a red herring hiding the actual failure12:04
kaloyank5. However, the created neutron port lacks the extra DHCP options to perform the PXE chainloading of iPXE12:04
kaloyankwell, I can share the logs and the exact line of code12:04
dtantsuraha, #5 is probably a bug, I guess we ignore DHCP options on fast-track and then forget to add them if fast-track fails12:05
kaloyankI presumed the same as it seemed quite out of order, anyway logs incoming, shall I pastebin them?12:06
dtantsuryep, although I cannot promise to dive into them too deeply12:07
kaloyankhttps://github.com/openstack/ironic/blob/master/ironic/drivers/modules/agent_base.py#L583 where the shared lock is elevated to exclusive but fails (an exclusive lock already exists)12:08
dtantsuryeah, but the exclusive one should disappear soon12:10
dtantsurif it does not, you need to track what is holding the lock for so long (stuck image download, etc)12:10
kaloyankthat was my other question, the image is quite large, ~10GB, with a smaller image it's not a problem since the fast_track timeout hasn't expired12:11
kaloyankso is this behavior expected and do I need to work around it somehow?12:11
dtantsurkaloyank: if you don't see any abnormalities in the logs, you may try to play with the value of fast_track_timeout12:16
dtantsurunder #5 is quite likely a bug that needs fixing12:16
kaloyankit's already at its maximum (5mins), the image just downloads and verifies for a lot more12:16
dtantsurI'm frankly not sure why fast_track_timeout has an upper limit.. it has to be lower than ramdisk_heartbeat_timeout, but even that can be changed12:18
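A minimal ironic.conf sketch of the two options under discussion; the [deploy] group name and the 300-second cap are assumptions based on this thread, so check the configuration reference for your release:
    [deploy]
    fast_track = true
    # capped at 300 seconds here; must stay below the ramdisk heartbeat timeout
    fast_track_timeout = 300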
dtantsurbut that's a curious issue you describe. indeed, we cannot register a heartbeat while the lock is actively held..12:18
kaloyankI presume because the code updates important fields, still I tried bypassing the code in the exception handler but nothing changed in behavior12:19
dtantsurit could be interesting to release the lock while we're downloading/converting the image. The question is: what to do if we cannot re-acquire the lock afterwards.12:20
dtantsurOR allow heartbeats to be registered when the lock is held. but that requires changing database.12:21
kaloyankcan the image be downloaded without an exclusive lock?12:22
dtantsurkaloyank: a workaround may be to disable image conversion on the conductor side12:22
dtantsurkaloyank: that's my first idea. the problem is: what to do if you can no longer acquire the lock after the image is downloaded.12:22
kaloyankI use raw images only, that's why they're so big12:23
dtantsurah, so there is no conversion, it's just downloaded for 5 minutes?12:23
kaloyankyes, it's really big, I have a smaller one and it deploys without a problem12:23
dtantsurif you use swift as a glance backend, you can make ironic serve a URL to swift without local caching12:24
kaloyankI don't have swift :/12:24
kaloyankas I store the images in Cinder (Glance Cinder store) I was thinking of attaching the volume to the ironic conductor and streaming the block device. I know that no such code exists but if there are no other simpler options, I'd opt for writing the code12:25
dtantsurcc TheJulia JayF (once you wake up) for more ideas ^^12:31
opendevreviewDmitry Tantsur proposed openstack/ironic bugfix/21.2: Update .gitreview for bugfix/21.2  https://review.opendev.org/c/openstack/ironic/+/86782612:36
kaloyanklogs were rotated, I think the issue was made clear, do you still need them?12:37
kaloyankjftr, I'm running the latest Yoga release from RDO12:45
mnasiadkaHello12:58
mnasiadkaDuring an upgrade from Wallaby to Xena, while running online_data_migrations, I get "Error while running update_to_latest_versions: 'BIOSSetting' object has no attribute 'id'." - help appreciated ;-)12:59
jssfrhow do I reset a node which is in state `deploy failed` or `wait call-back` state?13:14
jssfrby reset, I mean bring it into a clean/available state via cleaning13:14
jssfrabort, provide, and manage all give state errors13:14
jssfrah, undeploy!13:16
jssfrthanks documentation!13:16
kaloyankor.. baremetal node manage <node-name>; baremetal node provide <node-name>13:23
rpittaulooking for reviews and approval for 2 fixes in bifrost CI https://review.opendev.org/c/openstack/bifrost/+/872634 and https://review.opendev.org/c/openstack/bifrost/+/875929 thanks!13:31
jssfrokay, different issue now13:55
jssfrWe are trying to deploy an image which contains an mdraid13:55
jssfrusing the IPA, the image gets written to the disk successfully, but then the IPA re-reads the mdraid and then attempts to install grub13:55
jssfrwhich fails, because:13:55
jssfrMar 06 13:03:16 ubuntu ironic-python-agent[2384]: 2023-03-06 13:03:16.645 2384 ERROR root [-] Command failed: install_bootloader, error: Error finding the disk or partition device to deploy the image onto: No EFI partition could be detected on device /dev/md127 and EFI partition UUID has not been recorded during deployment (which is often the case for whole disk images). Are you using a UEFI-compatible image?: ironic_python_agent.errors.DeviceNotFound: Error finding the disk or partition device to deploy the image onto: No EFI partition could be detected on device /dev/md127 and EFI partition UUID has not been recorded during deployment (which is often the case for whole disk images). Are you using a UEFI-compatible image?13:55
jssfris there a sane way to get disk images which include an mdraid deployed?13:56
kaloyankWell, can you verify that the image has an EFI system partition? What does its partition table look like?13:56
jssfr(the EFI partition is outside the mdraid)13:56
jssfrkaloyank, https://paste.ubuntu.com/p/KkfYgqDmVj/13:57
jssfrOARrr13:57
jssfrwhy the heck does that need a login13:57
jssfrlet me find a different paste, I was assuming pastebinit to be sane13:57
jssfrhttps://paste.debian.net/hidden/0f5930b1/13:58
jssfrkaloyank, ^ that's the partition table of the image13:58
jssfrp3 is the RAID, p1 is the EFI system partition.13:58
jssfr(this is from the box where we created the image, hence /dev/loopX)13:59
kaloyankjssfr: I'm looking at this: https://github.com/openstack/ironic-python-agent/blob/a1670753a23a79b6536f67eae9cca154e0ed2e65/ironic_python_agent/extensions/image.py#L695 which calls this: https://github.com/openstack/ironic-python-agent/blob/fcb65cae18f4a6b4b05fb70677e2fa114e0558a9/ironic_python_agent/hardware.py#L118214:05
kaloyankdid you specify a root_device_hint ?14:06
jssfrnope, looking into that now, thanks!14:28
jssfrkaloyank, do you happen to have a documentation link for root_device_hint? I can only find documentation from pike :/14:30
kaloyankjssfr: https://docs.openstack.org/ironic/pike/install/advanced.html#specifying-the-disk-for-deployment-root-device-hints14:30
jssfrthat's the pike stuff I found14:31
kaloyankThat's what you need, it's called root_device in Pike14:31
jssfrso something like {"name": "/dev/sda"} as root_device_hint?14:32
jssfrand that's still a property, not a driver_info or something?14:32
kaloyankYes, it's a property14:33
jssfrack, thanks14:33
jssfr(the code reads as if it's still `root_device` though, not root_device_hint?)14:33
kaloyank"That's what you need, it's called root_device in Pike"14:33
kaloyankjssfr: JFTR, Pike is not supported, there might be bugs related to your case that are fixed in more recent releases, it'd be really nice if you can run a supported release14:35
jssfrwe are on victoria14:35
jssfrthat's why I was asking for newer documentation14:36
kaloyankthen why were you asking about Pike?14:36
jssfrah, misunderstanding :)14:36
jssfrwhat I meant to say was "all documentation I found was for Pike"14:36
jssfr"… which is not what I wanted, I wanted something more recen"14:36
jssfrI see how that wasn't clear, sorry14:36
kaloyankhttps://docs.openstack.org/ironic/victoria/install/advanced.html#specifying-the-disk-for-deployment-root-device-hints14:36
jssfrthanks a lot14:36
kaloyankthis is the appropriate link then14:36
iurygregorygood morning Ironic14:56
TheJuliagood morning14:57
TheJuliakaloyank: so... I guess that is doable with appropriate code, the question is going to be "how to get application credentials for volume access down to the agent". We've generally discussed it before and have a general idea, but I don't think we've implemented any such credential path.15:01
dtantsurthe initial problem is also curious. fast-track can break if the image is too large. which is definitely not the desired behavior.15:01
TheJuliaI'm not quite sure if it is still heartbeating15:02
JayFgood morning15:02
dtantsurwhy not?15:02
JayF#startmeeting ironic15:02
opendevmeetMeeting started Mon Mar  6 15:02:33 2023 UTC and is due to finish in 60 minutes.  The chair is JayF. Information about MeetBot at http://wiki.debian.org/MeetBot.15:02
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:02
opendevmeetThe meeting name has been set to 'ironic'15:02
TheJuliabut, yes, it is a can of worms15:02
iurygregoryo/15:02
dtantsurtrue15:02
dtantsurhi folks o/15:02
TheJuliadtantsur: window is against the heartbeat operation still occurring15:02
matfechnero/15:02
JayFShould be a pretty quick one this morning; I'll try to fly through15:02
TheJuliao/15:03
JayF#topic Announcements/Reminder15:03
JayFAs always, tag your patches with ironic-week-prio to get them reviewed15:03
JayFAnd we are counting down, focusing on any outstanding reviews would be good; I haven't looked but will make a point to check today15:03
kaloyankTheJulia: what do you mean by application credentials exactly? How Ironic authenticates to the storage backend or something else?15:03
rpittauo/15:03
JayF#topic Review Actions from last meeting15:03
TheJuliakaloyank: lets revisit once the irc meeting is over :)15:04
kaloyanksure 15:04
JayFkaloyank: I'll be fast :D 15:04
JayFokay, action items, we had 315:04
JayFFirst one is tied to a general announcement; I will be out Weds-Mon at a conference with limited availability. If you need me urgently, email is best (anything-but-IRC).15:05
JayFAnd iurygregory will be running the meeting next week in my stead15:05
JayFI have an untaken action I'll renew15:05
JayF#action JayF To look into VMT for Ironic15:05
JayFThat's the actions from last week15:05
JayFmoving on15:05
JayF#topic CI Status 15:05
JayFIs there anything notable about CI from the last 7 days that people should know?15:05
rpittauJayF: just 2 minor fixes for bifrost CI15:06
rpittauthey're in the prio list15:06
JayFThanks for that \o/15:06
JayFI'll look this morning15:06
TheJuliamnasiadka: uhh... wow, that shouldn't break that way. Have you checked oslo.db is appropriate based upon requirements.txt and the xena upper-constraints.txt file from the openstack/requirements repo branch stable/xena ?15:06
rpittauthanks15:06
JayF#topic VirtualPDU update15:06
iurygregoryI saw some of the jobs in networking-generic-switch failing (but didn't really look to find the root cause)15:07
JayFSo, thanks to rpittau, fungi and their work, VirtualPDU has us as cores now15:07
rpittau\o/15:07
mnasiadkaTheJulia: that's a Kolla container, I can't assume it's different but let me check :)15:07
JayFadditionally, I proposed + landed governance change to put it officially under Ironic's program15:07
iurygregorytks everyone!15:07
JayFSo VirtualPDU is adopted. There is a further step; moving the repo from x/virtualpdu -> openstack/virtualpdu; but that requires gerrit downtime and will not happen until after the release sometime15:08
jssfrkaloyank, specifying root_device helped, thanks!15:08
rpittauJayF: what do we want to do for a possible release ?15:08
rpittaufor virtualpdu15:08
TheJuliamnasiadka: the issue is it has been there for 6 years, something is *very* wrong.15:08
JayFrpittau: cut a manual one in the meantime would be my suggestion?15:08
rpittaummm yeah15:08
rpittauok15:08
JayFI would assume release-management of virtualpdu won't be possible until after antelope is done15:08
JayFso if we need a virtualpdu release before that; manual is the way15:08
rpittauI'll look into that15:09
JayFThanks for that \o/15:09
JayF#topic Release Countdown15:09
JayFcycle highlights have landed15:09
JayFprobably too late to land any changes of any bulk15:09
JayFbut if you have something you want in antelope, now is the time to make noise and get it reviewed15:10
JayF#link https://review.opendev.org/c/openstack/ironic/+/876195 release notes prelude15:10
JayFI will revise this today and push to get it merged; if you have opinions on it post them soon :)15:10
mnasiadkaTheJulia: oslo.db==11.0.0, and yoga upper-constraints.txt says 11.2.0, yes, something is wrong :)15:10
JayFanything else for release before I move on?15:10
JayF#topic Open Discussion15:11
JayFAs said earlier, I will be unavailable from Weds thru Mon traveling to and participating in SCALE 20x.15:11
TheJuliamnasiadka: so that should be xena15:11
JayFIf you need me urgently for anything, I think most of you have my email. That'll be checked much more often than IRC15:11
JayFanything else for open discussion15:11
mnasiadkaTheJulia: sorry, mixed it up - yes, it's according to xena's u-c15:12
TheJuliaI'll also be at SCALE 20x later this week, just fyi.15:12
mnasiadkaTheJulia: what else can be wrong?15:12
TheJuliamnasiadka: i guess step 0 is going to be open the database, ensure the column is there15:12
JayFI'm calling it15:13
JayF#endmeeting15:13
opendevmeetMeeting ended Mon Mar  6 15:13:16 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)15:13
opendevmeetMinutes:        https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.html15:13
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.txt15:13
opendevmeetLog:            https://meetings.opendev.org/meetings/ironic/2023/ironic.2023-03-06-15.02.log.html15:13
TheJuliaheh15:13
TheJuliakaloyank: o/ So I meant credentials for how ironic-python-agent connects to Cinder to access the image, that is unless you want Ironic's conductor to handle downloading and packing the image locally. 15:14
TheJuliakaloyank: I *thought* cinder had a capability to do that though, now that I think about it.15:14
mnasiadkaTheJulia: I was upgrading from 17.1.1.dev7, there is no 'id' column in 'bios_settings' (only 'node_id')15:14
mnasiadkaTheJulia: https://paste.openstack.org/show/b0LjGZNV0fwszRgiOdJL/15:14
kaloyankTheJulia: I was thinking of making the ironic-conductor process attach a Cinder volume to itself and serve its contents15:15
kaloyankimplementing this in the IPA will be harder, though more efficient in terms of the end result15:15
TheJuliamnasiadka: wow, that is a bug then15:16
TheJuliamnasiadka: can you get a full stacktrace for us? I'm really surprised we didn't find this in CI over the years15:16
TheJuliaI was thinking id would be coming from the underlying base model, but not in that table, node_id is primary key15:16
mnasiadkaTheJulia: sure, is there a guide somewhere so I don't miss anything?15:17
TheJuliamnasiadka: an upgrade guide? or on getting a stacktrace from the logs? I'm confused by the question, sorry15:17
mnasiadkaTheJulia: stacktrace :)15:17
TheJuliakaloyank: I still could have sworn cinder was supposed to have an extract capability :)15:18
TheJuliamnasiadka: no, as much context around the error as possible, typically the ~20 lines before is most of the stack trace which would occur15:18
kaloyankTheJulia: by extract you mean point it at a path and let it dump a volume to a file?16:18
TheJuliakaloyank: yeah15:19
TheJuliakaloyank: I might be totally off base, and maybe this is not universally supported, but I thought you could save a cinder volume to a glance image in some way15:19
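Cinder does have an upload-to-image call that roughly matches this recollection; a sketch with the classic cinderclient (flag support and availability may vary by release):
    cinder upload-to-image <volume-id> <image-name> --disk-format raw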
mnasiadkaTheJulia: https://storyboard.openstack.org/#!/story/2010632 - hope that's enough (in the comment)15:20
dtantsursome code reshuffling if anyone has time: https://review.opendev.org/c/openstack/ironic/+/874677 https://review.opendev.org/c/openstack/ironic/+/87591515:20
kaloyankTheJulia: You absolutely can make an image out of a volume but how does this help the fact that the ironic-conductor still has to download the image from Glance15:20
TheJuliaoh, i bet the batching doesn't grok there being no id field15:21
TheJuliakaloyank: so the underlying issue is that a cinder volume may be any storage technology, and the ironic-conductor may not be able to access it15:22
TheJuliaan mvp might be, get it packaged out to glance, download it, and then work on improving the flow for greater efficiency15:22
kaloyankTheJulia: I still don't see how that resolves the issue :/15:24
kaloyankright now, Glance attaches the volume, copies the data to a temporary file, serves the image from the file and after that deletes it15:25
kaloyankThe load of the process grows exponentially because Ironic doesn't allow baremetal nodes to share images15:25
kaloyankI'm looking towards eliminating the extra copying from the volume to a temporary file at Glance to a temporary file at the ironic-conductor host15:27
TheJuliaeww, so it doesn't store it, so it would rinse/repeat with each deployment15:27
kaloyankexactly15:28
TheJuliakaloyank: would you be able to trigger online-data-migrations without the batching functionality, or is your environment too big?15:30
kaloyankTheJulia: I don't know what you mean by online-data-migrations15:32
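For context, online data migrations are the ironic-dbsync step run during upgrades; a rough sketch of the invocation, where the batch-size flag is the "batching functionality" mentioned above (support may vary by release):
    # migrate everything in one pass
    ironic-dbsync online_data_migrations
    # or limit each pass to N rows
    ironic-dbsync online_data_migrations --max-count 1000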
mnasiadkaTheJulia: anything I can do in the meantime? I was supposed to upgrade to Yoga by tomorrow - should we leave it at Xena level and wait for a patch?15:45
dtantsurkaloyank: a correction: each glance image should be downloaded only once15:47
kaloyankdtantsur: I believe you, still I can't explain to myself why it took 252 secs (per logs) to download and verify an image when spawning 4 instances simultaneously while it took 82 secs for the same image when spawning a single instance :(15:55
dtantsurhmm15:55
dtantsurkaloyank: the image cache should log its decisions in DEBUG level15:55
kaloyankI know, that's where I got those numbers from, but the logs got rotated and I haven't setup a debug file for journald and now all the HW is occupied by another team15:57
dtantsurthat's a pity. even when all 4 images are downloaded from scratch, the downloads should happen in parallel15:59
opendevreviewDmitry Tantsur proposed openstack/ironic master: [WIP] Migrate the inspector's /continue API  https://review.opendev.org/c/openstack/ironic/+/87594416:01
kaloyankdtantsur: OK, managed to snatch one, brb16:02
espenflis there a way to inspect the configdrive post deployment?16:11
samuelkunkel[m]Normally it's a read-only blockdevice (assuming regular iso 9660). If you check via lsblk you should see a block device16:11
samuelkunkel[m]if you mount this, you can read into the config drive16:11
samuelkunkel[m]But you need access to the OS for this16:11
espenflah, okay, this is the problem: trying to diagnose the setting of network, default password and SSH keys. Is there a way to see it on the Ironic side? Or regenerate it in a way so that I can inspect it based on what I have in my inventory (Bifrost, putting it into instance_info)?16:13
kaloyankdtantsur: I confirm the image is downloaded only once16:20
opendevreviewJay Faulkner proposed openstack/ironic master: Add prelude for OpenStack 2023.1 Ironic release  https://review.opendev.org/c/openstack/ironic/+/87619516:27
JayFWe need to land ^^ that very soon, if folks wanna review the changes I made for rpittau 16:27
kaloyankdtantsur: So I think I found the culprit and this is the image SHA verification: `Computed sha512 checksum for image /var/lib/ironic/images/2c2f6c92-1595-47e5-8a75-816dce4af0ea/disk in 657.11 seconds`16:31
TheJulia+2'ed16:32
TheJuliakaloyank: yeouch16:32
TheJuliahow big is that image?16:32
kaloyankTheJulia: ~10GB16:32
TheJulia.... it shouldn't be that slow... :\16:32
kaloyankI suppose so, I see that Ironic uses the python hashlib, shouldn't it be using, say sha256sum ?16:33
* TheJulia wonders if we still need conductor side checksum operation on images16:33
TheJuliakaloyank: I think it might be attempting to match the glance record to verify what was generated/stored16:33
TheJuliaos_image_algo in the glance image properties ... I think16:34
kaloyankTheJulia: yeah, I was thinking the same 16:34
TheJuliasince we don't have iscsi deploy, doing the checksum locally might be redundant16:34
TheJuliawell16:34
TheJuliadepends on if it is a remote deployment or a local boot I guess16:35
JayFIt's also the only place we can act on it16:35
JayFIPA finding a bad checksum has no option but to fail16:35
TheJuliaif is_ramdisk_deploy, then run_checksum_check()16:35
JayFconductor finding one could potentially retry or give a better error16:35
TheJuliaJayF: it is fatal to begin with even if it is found by the conductor16:36
TheJuliatrue16:36
JayFyeah I figured we didn't do anything with it today16:36
TheJuliaAlthough I think we carry the error back from the agent16:36
kaloyankI have to admit, this function doesn't look good: https://github.com/openstack/oslo.utils/blob/master/oslo_utils/fileutils.py#L11216:38
JayFkaloyank: what approach would you take instead?16:38
JayFthat looks more or less like a standard python hash-checking function16:39
JayFthe time.sleep(0) is because we use eventlet16:39
kaloyankJayF: indeed it does, but the standard sha512sum tool on the same file completed in ~40s16:40
kaloyankI'd replace it with that, or at least increase the chunksize (it's 64K currently)16:41
TheJuliathe issue is without time.sleep(0), you might end up blocking the rest of the app for ~40s16:41
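A self-contained sketch of the pattern being discussed: hash the file in fixed-size chunks and yield to other green threads between reads. This mirrors the linked oslo.utils helper but is not its actual code; the 64K chunk size is the value mentioned above.
    import hashlib
    import time

    def compute_file_checksum(path, algorithm='sha512', chunk_size=64 * 1024):
        # With eventlet monkey-patching, time.sleep(0) hands control back to
        # the hub so a long checksum does not starve other green threads;
        # without eventlet it is effectively a no-op.
        hasher = hashlib.new(algorithm)
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                hasher.update(chunk)
                time.sleep(0)  # cooperative yield point
        return hasher.hexdigest()

    # e.g. compute_file_checksum('/var/lib/ironic/images/<uuid>/disk')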
JayF... is there a chance something else is blocking significantly so that hash is giving up too much time to other stuff16:41
TheJulia^^^^ that16:41
JayFdebugging that is going to be ridic16:41
kaloyankI gathered some stats for this host, any clues?16:43
TheJuliaI think the question might be what else was the conductor doing16:45
TheJulia... but my underlying thought is "do we still need it"16:45
dtantsurTheJulia: we need to recalculate checksums for raw images after conversion16:48
dtantsuralthough it's possible that in this case we don't check if the images are initially raw16:48
TheJuliaoh yeah, this could be unpacked to something absurd16:49
kaloyankwill spawning more workers help?16:49
TheJuliakaloyank: no, what type of image is this?16:49
dtantsurkaloyank: how much does sha512sum (the tool) take?16:49
JayF> kaloyank | JayF: indeed it does, but the standard sha512sum tool on the same file completed for ~ 40s16:50
dtantsurah, missed that. damn, yeah.16:50
kaloyankTheJulia: it's a raw image16:50
dtantsurTheJulia: re blocking the app: I'd expect eventlet to yield to other threads when waiting for a command?16:51
dtantsurotherwise we wouldn't be able to use ipmitool :)16:51
JayFthere's no chance to yield, generally no blocking on that hashing activity16:51
JayFI was thinking the opposite: there's a competing thread that needs to yield but isn't16:51
dtantsurhttps://github.com/eventlet/eventlet/blob/master/eventlet/green/subprocess.py#L9116:52
dtantsursubprocess works as a loop with sleep in eventlet16:53
JayFcould eventlet.sleep vs time.sleep be an impact there?16:53
dtantsurunlikely; time.sleep should be monkey-patched16:53
dtantsuralso, the native time.sleep(0) is basically no-op16:53
JayFhttps://github.com/openstack/ironic/blob/master/ironic/cmd/__init__.py#L26 yeah just checking, only thing we exclude is os16:55
TheJuliahmm16:56
kaloyankjftr, the time it takes for a checksum to be calculated seems to scale with the number of machines currently being deployed. For a single machine with the same image the checksum was calculated in ~80s, which correlates with the results of sha512sum 16:57
JayFkaloyank: provisioning from the same conductor group, yeah?16:58
dtantsurwell, a CPU-intensive task and green threads.....16:58
kaloyankJayF: yes, I have only one ironic-conductor16:59
JayFSo that's what's happening. We're running N greenthreads for N deployments16:59
JayFall trying to checksum the same image16:59
JayFyeah?16:59
kaloyankyes16:59
TheJuliahas anyone looked to see if hashlib is doing its work in python, or is it a compiled module from C?16:59
dtantsurbasically16:59
dtantsurI *think* hashlib is in C16:59
dtantsurwhich does not mean it releases gil16:59
dtantsurI wonder if it helps to crank up read_chunksize significantly 17:00
dtantsur.. insignificantly in my testing17:01
kaloyankif there's an image cache, why does the checksum have to be verified for each instance and not once per image?17:02
dtantsurkaloyank: the cache does not contain the checksum17:03
dtantsurwhat IS a good question is why we recalculate it at all, given that your images are already raw17:03
dtantsur(in my testing, using eventlet.green.subprocess is comparable with the oslo's function)17:04
rpittaugood night! o/17:04
kaloyankdtantsur: well, besides someone tampering with the images I can't think of any other reason17:06
dtantsurlet me prototype something here17:07
TheJuliadtantsur: why I think is the most excellent question atm17:08
kaloyankI'm off for today, thanks for the help and support, bye! 0/17:08
opendevreviewDmitry Tantsur proposed openstack/ironic master: [WIP] Do not recalculate checksum if disk_format is not changed  https://review.opendev.org/c/openstack/ironic/+/87659517:15
dtantsurTheJulia, kaloyank maybe ^^^17:15
dtantsurthis won't solve the "checksumming is slow" problem17:15
dtantsurbut at least kaloyank will be happy (if it works)17:16
dtantsurI suspect that outside of the raw->raw context, the slowness is not such a huge deal17:16
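The WIP change itself is not quoted here; as a rough illustration of the guard being described, with hypothetical names that are not the actual patch:
    def checksum_needs_recalculation(source_format, target_format):
        # Only recompute when the conductor actually converted the image
        # (e.g. qcow2 -> raw); for raw -> raw the checksum from Glance
        # still matches the bytes on disk.
        return source_format != target_format

    assert checksum_needs_recalculation('raw', 'raw') is False
    assert checksum_needs_recalculation('qcow2', 'raw') is True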
jlvillalI think this might be wrong, as x86 and aarch64 all get put into the same bucket. At least for me I was sending ipxe-x86_64.efi to my ARM64 client.  https://github.com/openstack/bifrost/blob/0e6be25ee17ea75d60eb4f32fea37db0f79af52d/playbooks/roles/bifrost-ironic-install/templates/dnsmasq.conf.j2#L91-L9317:31
jlvillal11 seems to be the client-arch code for aarch64.17:32
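For reference, the kind of per-architecture dispatch dnsmasq supports, as a sketch rather than the actual bifrost template; the tag and boot file names are placeholders, while 7/9/11 are the standard client-arch codes for EFI BC, EFI x86-64 and EFI ARM64:
    dhcp-match=set:efi-x86_64,option:client-arch,7
    dhcp-match=set:efi-x86_64,option:client-arch,9
    dhcp-match=set:efi-aarch64,option:client-arch,11
    dhcp-boot=tag:efi-x86_64,ipxe-x86_64.efi
    dhcp-boot=tag:efi-aarch64,ipxe-aarch64.efi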
TheJuliamnasiadka: so I think i figured out what the issue is18:30
TheJuliajlvillal: likely bifrost was never expected to have a mixed fleet18:30
opendevreviewVerification of a change to openstack/ironic master failed: Add prelude for OpenStack 2023.1 Ironic release  https://review.opendev.org/c/openstack/ironic/+/87619518:30
mnasiadkaTheJulia: fantastic, how dark is the alley I'm in? :)18:31
TheJuliamnasiadka: not a bad one18:33
TheJuliamnasiadka: I'm working on a patch now18:33
TheJuliaunfortunately I think this was not found before because we don't have data in the table in CI18:33
TheJuliabut it is a glaringly bad bug18:34
mnasiadkaTheJulia: that's what I deduced myself (the lack of data in CI)18:34
mnasiadkaTheJulia: in the meantime I was thinking of making a mysqldump of bios_settings table, going forward with the upgrade, and importing it back in after the upgrade - does it make any sense?18:35
TheJuliaeh... I'm not so sure about that18:35
JayFif the bug is as bad as it sounds it is, I bet we'll have a fixed stable/xena out for you in a few days18:36
JayFmaybe not released but in git for sure18:36
mnasiadkaKolla builds containers based on stable/xena and so on, we don't really follow releases - so no worries for that18:37
JayFWe are extremely on top of bug backports in Ironic, I think it's one of the things we do really well18:37
jrosserjlvillal: do you have ipxe_bootfile_name_by_arch set in the ironic [ipxe] config section?18:38
jlvillaljrosser, I do.18:38
mnasiadkaTheJulia: Ok then, I'll probably push off the upgrade to Yoga to next week - unless leaving Ironic Xena and upgrading Nova to Yoga is something that is supposed to work.18:40
jlvillalTheJulia, makes sense not expecting a mixed fleet. Thanks.18:41
* TheJulia gives cookies to tox18:47
TheJuliaI've got a unit test now which reproduces the error18:57
TheJuliathe bios setting model incremented from 1.0 to 1.1 too, so since you had data at version 1.018:58
opendevreviewMerged openstack/ironic master: Add prelude for OpenStack 2023.1 Ironic release  https://review.opendev.org/c/openstack/ironic/+/87619519:11
mnasiadkaTheJulia: nice, it's 8pm here, so if there's something I could try out in my morning - please leave me a message - thanks for the help :)19:14
TheJuliamnasiadka: absolutely19:23
TheJuliamnasiadka: goodnight :)19:23
*** shadower6 is now known as shadower19:39
opendevreviewJulia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits  https://review.opendev.org/c/openstack/ironic/+/87662619:40
TheJuliaJayF: I think ^^^ will take care of mnasiadka's issue19:41
JayFlooking19:41
dtantsurjlvillal: if you make bifrost multi-arch-aware, you'll be my hero :)19:49
dtantsurI think I already promised you a beer the other day?19:49
* dtantsur slowly blends into the background19:49
TheJuliaI feel like I'm going to need to have lots of beer on hand....19:54
jlvillaldtantsur, hehe :)19:57
opendevreviewJulia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits  https://review.opendev.org/c/openstack/ironic/+/87662620:35
TheJuliastevebaker[m]: https://review.opendev.org/q/I99a2d8ecab657c8e4c852c73e816a5a8f2856471 <-- since you asked me21:06
jlvillalSo when Ironic/Bifrost creates a configdrive-*.iso.gz file, is it not really gzipped and not really an ISO file? I notice the ones in /var/lib/ironic/httpboot/ are ASCII text. Looks like it might be base64 encoded.21:07
JayFTheJulia: stevebaker[m]: I just landed all the remaining oens21:09
stevebaker[m]\o/21:29
TheJuliajlvillal: decode them and they should be gzipped iso files21:39
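A quick way to check, assuming the file under /var/lib/ironic/httpboot is base64 text as described; file names and mount point are placeholders, and openstack/latest is the usual config drive layout:
    base64 -d configdrive-<node-uuid>.iso.gz | gunzip > /tmp/configdrive.img
    sudo mount -o loop,ro /tmp/configdrive.img /mnt
    ls /mnt/openstack/latest/   # user_data, network_data.json, meta_data.json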
opendevreviewJulia Kreger proposed openstack/ironic master: Fix online upgrades for Bios/Traits  https://review.opendev.org/c/openstack/ironic/+/87662621:56
TheJuliaJayF: typo fixed ^21:57
JayF+222:18
opendevreviewMerged openstack/metalsmith stable/yoga: Use a network cache in Instance  https://review.opendev.org/c/openstack/metalsmith/+/87377023:27
opendevreviewMerged openstack/metalsmith stable/xena: Use a network cache in Instance  https://review.opendev.org/c/openstack/metalsmith/+/87377123:27
opendevreviewMerged openstack/metalsmith stable/wallaby: Use a network cache in Instance  https://review.opendev.org/c/openstack/metalsmith/+/87377223:27
