| *** mhen_ is now known as mhen | 02:30 | |
| ykarel | gmaan, ack updated | 06:10 |
|---|---|---|
| frickler | gmaan: nova-ceph-multistore seems to be consistently failing now on https://review.opendev.org/c/openstack/devstack/+/968076 , I hope we didn't introduce a real regression due to lack of test coverage. still I'm wary of simply making the job non-voting now | 08:25 |
| frickler | yes, it is passing on other changes, like https://zuul.opendev.org/t/openstack/build/1aa05ba4148a46ba9f1bd547cb95c145 . :-( sean-k-mooney ^^ maybe you can check this? | 08:36 |
| sean-k-mooney | frickler: just getting started but ill take a look shortly | 09:13 |
| sean-k-mooney | test_boot_cloned_encrypted_volume | 09:29 |
| sean-k-mooney | frickler: so we have passing runs like my patch above where that passes https://review.opendev.org/c/openstack/devstack/+/968249 | 09:31 |
| sean-k-mooney | so it's by no means universally broken, i guess the concern is https://review.opendev.org/c/openstack/devstack/+/968076/1/lib/tempest#829 | 09:32 |
| sean-k-mooney | is the behavioral change that is causing the test case to fail | 09:32 |
| sean-k-mooney | let me see if that test is running on the passing version or if its skipped | 09:32 |
| sean-k-mooney | hum this is also from cinder_tempest_plugin | 09:33 |
| sean-k-mooney | i didnt think we ran that plugin in nova-ceph-multistore | 09:33 |
| sean-k-mooney | i wonder if the test is new | 09:33 |
| sean-k-mooney | no its running and passing on my patch | 09:35 |
| sean-k-mooney | frickler: ah so this is a known issue we hit a few months ago | 09:38 |
| sean-k-mooney | https://pb.teim.app/?90a8d20013a7b570#HH3B9MsRVHJAuFjtDvUKkLduZtwnELkFNAr5hYQFBHbq | 09:38 |
| sean-k-mooney | the tldr is cinder is using a cgroup-like mechanism, plimits or something, to enforce some runtime restriction on the processutils.exec calls to qemu-img | 09:40 |
| sean-k-mooney | in some cases that fails with 'qemu-img: /opt/stack/data/cinder/conversion/tmpzj1a9mr9.luks: error while converting luks: Unable to get accurate CPU usage\n' | 09:40 |
| sean-k-mooney | its the throttle_cmd bit of convert_image(tuple(throttle_cmd['prefix']), in the trace File "/opt/stack/cinder/cinder/image/image_utils.py", line 553, in convert_image | 09:41 |
| sean-k-mooney | https://tinyurl.com/3wjwjpuu | 09:45 |
| sean-k-mooney | i dont know how long our logs are kept but it looks like 250-ish failures in the last 2 weeks | 09:45 |
| sean-k-mooney | frickler: for now i have rechecked it again. the one thing that i do find interesting is all the failures are on raxflex-* | 09:50 |
| sean-k-mooney | those are kvm hosts using the pc machine type and a pretty old bios firmware. the passing case is on xen on the older rax cloud | 09:53 |
| sean-k-mooney | i dont know if i can trust | 09:54 |
| sean-k-mooney | ansible_bios_date: 04/01/2014 | 09:55 |
| sean-k-mooney | ansible_bios_vendor: SeaBIOS | 09:55 |
| sean-k-mooney | ansible_bios_version: 1.16.2-debian-1.16.2-1 | 09:55 |
| sean-k-mooney | that would be the bookworm version https://packages.debian.org/source/oldstable/seabios | 09:56 |
| sean-k-mooney | which i guess may make sense | 09:56 |
| sean-k-mooney | i was originally wondering if this was that xen bug where sometimes we dont get all the cpus we expect but apparently not | 09:58 |
| opendevreview | Merged openstack/devstack master: Revert "Cap stable/2025.2 network, swift, volume api_extensions for tempest" https://review.opendev.org/c/openstack/devstack/+/968076 | 14:01 |
| frickler | sean-k-mooney: oh, so completely unrelated to the change after all, just bad luck, thx for checking. and logs expire after 10 days or so for the opensearch access iirc | 14:23 |
| frickler | maybe we can check with rax whether we can use a different machine type or something, I guess this could affect other customers of them, too? | 14:24 |
| frickler | also do you know whether there might be some option for qemu-img that could be added as a workaround? | 14:27 |
| sean-k-mooney | i doubt this is failing 100% of the time | 14:27 |
| sean-k-mooney | i.e. i suspect we had more than 250 jobs run on the rax flex nodes in those 10 days | 14:28 |
| sean-k-mooney | i.e. with that test | 14:28 |
| sean-k-mooney | but ya it looks like bad timing but i also think we need to eventually fix this in cinder | 14:28 |
| sean-k-mooney | frickler: with that said i think there may be a qemu-img bug at play as well | 14:29 |
| * sean-k-mooney is trying to remember all the context from a few weeks ago | 14:30 | |
| sean-k-mooney | frickler: ok so ya to work around a qemu-img bug related to address space usage with their new async mode in librbd | 14:31 |
| opendevreview | Maxim Sava proposed openstack/tempest master: Add image decompression import tests https://review.opendev.org/c/openstack/tempest/+/965889 | 14:31 |
| sean-k-mooney | we moved the ceph job to debian 12 | 14:31 |
| sean-k-mooney | frickler: then we hit this bug which is fixed in the version in debian backports | 14:32 |
| sean-k-mooney | but i never got around to enabling that | 14:32 |
| sean-k-mooney | so a fix for this may be to move this job to debian 13 now that we have that | 14:32 |
| sean-k-mooney | frickler: ill push a change for that to review and we can see if we want to proceed | 14:32 |
| sean-k-mooney | frickler: https://github.com/openstack/devstack-plugin-ceph/commit/536ebea559673d694c8f8deeafe1362e2c41c021 | 14:33 |
| sean-k-mooney | https://bugs.launchpad.net/ceph/+bug/2116852 | 14:33 |
| frickler | .oO(distro hopping ;) | 14:34 |
| sean-k-mooney | yep to avoid the buggy version of qemu-img | 14:34 |
| sean-k-mooney | this was intended to be temporary to be fair but other than this rare failure its been just as stable as ubuntu | 14:34 |
| sean-k-mooney | we went from an almost always failure to a very rare one | 14:35 |
| frickler | clarkb: fungi: not sure if you're reading along anyway, but fyi the tinyurl from sean above shows yet another case of "raxflex is different and can break things", which we might want to forward to rax, too | 14:37 |
| fungi | i was not following, catching up now sorry | 14:38 |
| frickler | the failure likely got more frequent recently as we enabled more capacity on raxflex | 14:38 |
| fungi | so maybe related to https://bugzilla.redhat.com/show_bug.cgi?id=2336437 i guess? | 14:43 |
| fungi | "...a subtle bug on machines which are very fast..." (from dan berrangé's related qemu-devel post) | 14:49 |
| fungi | seems like the way rackspace flex may be different in this case is by having faster processors (which we already know) | 14:49 |
| fungi | if it needs https://gitlab.com/qemu-project/qemu/-/commit/145f12e then that will require qemu 10.0.0 or later, looks like? | 14:58 |
| fungi | frickler: sean-k-mooney: am i understanding the concern correctly? | 14:59 |
| sean-k-mooney | sorry was looking at something else, reading back | 15:00 |
| fungi | also the error message in your opensearch query was first introduced by qemu 9.2.0, looks like | 15:01 |
| sean-k-mooney | qemu-9.2.0-3.fc42 | 15:02 |
| fungi | so in theory you'd start seeing it with qemu >=9.2,<10 if the error is related to the bug i mentioned | 15:02 |
| sean-k-mooney | ya so there are 2 related issues | 15:03 |
| sean-k-mooney | librados | 15:03 |
| fungi | well, qemu-9.2.0-3.fc42 is the fedora backport of the fix mentioned in the rh bug, not the upstream version it's fixed in | 15:03 |
| sean-k-mooney | is internally rewriting how it works to use c++ coroutines and an async executor | 15:03 |
| sean-k-mooney | that increased the virtual memory usage but not necessarily the RSS | 15:04 |
| sean-k-mooney | causing a check in nova to hard fail | 15:04 |
| sean-k-mooney | to work around that we swapped to debian 12 to have a version that does not have that behavior | 15:04 |
| sean-k-mooney | which uncovered this other bug | 15:04 |
| sean-k-mooney | debian backports in 12 has a newer qemu-img that should have a fix for https://bugzilla.redhat.com/show_bug.cgi?id=2336437 as should 13 | 15:05 |
| sean-k-mooney | on the nova side i have not got around to fixing our usage to check RSS not virtual memory | 15:05 |
| sean-k-mooney | on the cinder side | 15:05 |
| fungi | looks like debian 23 (trixie) has qemu 10 so would include dan's upstream fix | 15:06 |
| fungi | er, debian 12 | 15:06 |
| sean-k-mooney | you mean 13 (trixie) | 15:06 |
| sean-k-mooney | ya so first step i was going to push a patch to just move the ceph jobs from 12->13 | 15:06 |
| sean-k-mooney | and see if that passes a few times | 15:06 |
| fungi | oh, sorry yes. debian 12 (bookworm) has qemu 7.2 which pre-dates that check, but also bookworm-backports has qemu 10 | 15:07 |
| sean-k-mooney | if so that should side step most of the issues but i should also do a draft patch to change the nova check | 15:07 |
| fungi | so depending on whether or not you're installing qemu packages from -backports, you either have a version from before the check was added or a version which includes the fix for it | 15:07 |
| opendevreview | Fernando Ferraz proposed openstack/devstack-plugin-nfs master: [DNM] Test Glance over Cinder/NFS with NFS driver fixes https://review.opendev.org/c/openstack/devstack-plugin-nfs/+/965409 | 15:07 |
| sean-k-mooney | fungi: yes | 15:08 |
| fungi | and yeah, switching to trixie you'll have a similar qemu version to what's in bookworm-backports | 15:09 |
| sean-k-mooney | we discussed using backports a while ago but its not exactly clean. it requires some changes to devstack that are more invasive than just updating the package in files/ ... | 15:10 |
| sean-k-mooney | we are mirroring backports but debian backports require you to opt in per package to install from them | 15:11 |
| sean-k-mooney | so its not just a case of enabling the repo | 15:11 |
| sean-k-mooney | we either need to pass -t backports? or similar or install the package with <name>/backports or similar | 15:12 |
| fungi | well, you don't *have* to opt in per-package, you can do it per invocation of apt by using -t, but generally yes it requires some adjustment because backports isn't meant to be something you install everything from | 15:12 |
| sean-k-mooney | yep so i tried the targeted approach when i had a free afternoon but ran out of time because i needed to install each dep as well | 15:13 |
| fungi | technically you can make that happen by overriding the priority for that suite in apt's configuration, but i don't recommend it. though you can configure apt pinning to backports for specific packages too | 15:13 |
| sean-k-mooney | gibi got a poc working that way | 15:14 |
| sean-k-mooney | but the simple way to do this via backports would be to have an extra if here | 15:14 |
| sean-k-mooney | https://github.com/openstack/devstack/blob/master/lib/nova_plugins/functions-libvirt#L72 | 15:14 |
| sean-k-mooney | for debian if its debian 12 | 15:14 |
| fungi | yeah, you could instead drop a conffile in /etc/apt/apt.conf.d/ that basically says always install qemu packages from backports | 15:16 |
| sean-k-mooney | do you have an example of that because i could not find that but if there is one that would be a path forward | 15:17 |
| sean-k-mooney | we could do that in the fixup-stuff methods | 15:17 |
| fungi | sean-k-mooney: i'm looking for a more concise example, but https://linuxconfig.org/debian-pinning-howto is pretty close | 15:20 |
| fungi | and it's /etc/apt/preferences.d/ you'd put it under actually | 15:21 |
| opendevreview | Sean Mooney proposed openstack/devstack-plugin-ceph master: move ceph jobs to debian 13 https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/968335 | 15:22 |
| sean-k-mooney | ah | 15:23 |
| sean-k-mooney | Package: * | 15:23 |
| sean-k-mooney | Pin: release a=stable | 15:23 |
| sean-k-mooney | Pin-Priority: 900 | 15:23 |
| sean-k-mooney | so i could probably do *qemu* and *libvirt* or similar | 15:23 |
| sean-k-mooney | ah via /etc/apt/preferences | 15:24 |
| fungi | yes, i recommend using the /etc/apt/preferences.d/ directory so you don't have to worry about file-level overwrites/conflicts with your config additions | 15:25 |
| sean-k-mooney | ok i can also try that. i still think moving to debian 13 would make sense but we can do both to have something we could backport for stable branches | 15:26 |
| fungi | the default priorities for each suite are encoded into their release files, so you only have to worry about installing configuration for your overrides | 15:26 |
| sean-k-mooney | yep | 15:26 |
| fungi | and yes, if this is for testing openstack master branches then debian 13 is what the pti says we're going to test anyway | 15:27 |
| fungi | now that our package mirror network should be working for it, hopefully it's pretty trivial to get going | 15:27 |
| sean-k-mooney | ok well we can review the patch above with a view to this cycle but let me spend 20 mins creating a patch for devstack to use backports of qemu on debian 12 separately | 15:28 |
| fungi | https://superuser.com/questions/678396/apt-pinning-several-packages-in-one-section has some examples | 15:29 |
| fungi | from over a decade ago, but things haven't changed much with this mechanism | 15:29 |
| sean-k-mooney | i'd say it should be something like this https://pb.teim.app/?6bec1eccfb89cf14#4P3UUs6JMJt3jcXpNc3e1C9cEmwDeN9mxj6Eh1Gmw7de | 15:29 |
| sean-k-mooney | based on https://manpages.debian.org/testing/apt/apt_preferences.5.en.html | 15:30 |
| sean-k-mooney | which seems to be consistent with https://manpages.debian.org/testing/apt/apt_preferences.5.en.html#Matching_packages_in_the_Package_field | 15:30 |
| sean-k-mooney | packages is a space separated list with bash glob support | 15:30 |
| fungi | yeah, i'd have to refresh my memory on whether 900 is high enough priority to overcome the stable default | 15:31 |
| sean-k-mooney | i dont know if you have tried perplexity.ai as a search tool but i like that it provides citations for the sources that it used to make the recommendation so you can click in and verify it | 15:32 |
| fungi | ah, yeah priority >=1000 is only needed if you want apt to downgrade to the target version (otherwise it avoids downgrading) | 15:33 |
| sean-k-mooney | the man pages have coverage of the priorties https://manpages.debian.org/testing/apt/apt_preferences.5.en.html#How_APT_Interprets_Priorities | 15:33 |
| sean-k-mooney | so it looks like 990 | 15:33 |
| sean-k-mooney | might be more correct | 15:33 |
| sean-k-mooney | although i think anything over 500 would work for us | 15:34 |
| fungi | yes, i think you want >=990 according to the manpage | 15:34 |
| fungi | "causes a version to be installed even if it does not come from the target release, unless the installed version is more recent" | 15:35 |
| sean-k-mooney | ya and we are not passing a version | 15:35 |
| sean-k-mooney | so i think we need to raise to 990 as a result | 15:35 |
| sean-k-mooney | i mean we will see in either case, although im going to create a debian 12 vm to test this locally to avoid the ci round trip | 15:36 |
| sean-k-mooney | fungi: its kind of a novel experience to have an issue because our ci vms are too fast for once | 15:42 |
| sean-k-mooney | not that im particularly complaining, its a good problem to have | 15:43 |
| fungi | heh, indeed! | 15:44 |
| sean-k-mooney | lol for the first time ever i hit the kernel panic that cirros gets, locally booting a debian 12 image | 15:55 |
| sean-k-mooney | and of course it booted fine on a reboot | 15:55 |
| clarkb | frickler: fungi: I'm not sure complaining to a cloud that their cpus are too fast is something I want to do | 16:04 |
| clarkb | the last time someone did that HPCloud instituted draconian anti noisy neighbor measures that made the cloud useless for us | 16:04 |
| sean-k-mooney | clarkb: no one is suggesting that at least not seriously | 16:05 |
| sean-k-mooney | this is a qemu bug | 16:05 |
| sean-k-mooney | not an openstack one | 16:05 |
| frickler | clarkb: yes, that's not what I meant, I was assuming earlier from sean's comments that the old bios or machine type might be part of the trigger for this | 16:05 |
| clarkb | got it so we thought maybe it was a config thing but now we realize its a software bug in qemu and we should focus on addressing that | 16:06 |
| sean-k-mooney | clarkb: i thought it might have been failing on the xen hosts where we had the kernel issue where not all the cpu cores were available | 16:06 |
| sean-k-mooney | but it was the opposite, it was passing on xen on the old rax hardware | 16:07 |
| fungi | but in this case it's a kvm provider not xen | 16:07 |
| fungi | right | 16:07 |
| sean-k-mooney | correct it was passing on the slow cpus because slow cpus dont trigger the qemu bug (apparently) | 16:07 |
| clarkb | I guess luks is trying to determine how many cycles it needs to hash secrets based on cpu speed? I wonder if you can configure it to a value instead and skip that step? | 16:08 |
| sean-k-mooney | im not sure if that is the reason | 16:09 |
| sean-k-mooney | but i dont think we could without changing cinder/nova to invoke qemu-img differently in any case | 16:09 |
| sean-k-mooney | i.e. i dont think qemu-img has a global config file to change that behavior | 16:10 |
| clarkb | ack | 16:10 |
| opendevreview | Merged openstack/grenade stable/2025.1: Update grenade-skip-level-always job FROM branch https://review.opendev.org/c/openstack/grenade/+/963714 | 16:20 |
| opendevreview | Maxim Sava proposed openstack/tempest master: Add image decompression import tests https://review.opendev.org/c/openstack/tempest/+/965889 | 16:47 |
| sean-k-mooney | fungi: creating a debian vm took way longer than i planned because of dumb reasons but i think i have a preference file that works locally, i just need to make devstack generate it and test it end to end. | 16:53 |
| clarkb | sean-k-mooney: are we doing luks in a nested vm? or does qemu-img run this stuff unconditionally? | 16:54 |
| sean-k-mooney | clarkb: this is failing in the ceph job that tests encrypted ceph volumes | 16:54 |
| clarkb | the other thought that occurred to me is should we be using luks but it isn't clear to me if we do that intentionally | 16:54 |
| clarkb | aha ok so it is a necessary prereq of the test case | 16:55 |
| sean-k-mooney | yep its failing in the test that confirms you can clone luks encrypted volumes | 16:55 |
| sean-k-mooney | but thats also why only 1 test fails | 16:55 |
| clarkb | it is probably possible to use something other than qemu-img to manage luks volumes/devices | 16:59 |
| clarkb | but that also probably implies big changes to nova? | 16:59 |
| sean-k-mooney | probably but why would we do that for a bug that is already fixed upstream? | 16:59 |
| sean-k-mooney | i.e. why replace something that works for a temporary bug | 17:00 |
| clarkb | the primary reason would be because the bug isn't expected to get fixed in the platforms people deploy on | 17:01 |
| clarkb | (I don't know if that is the case but it is common for distros to not fix issues in their stable releases unless they are security issues or hard crashes) | 17:01 |
| sean-k-mooney | so the command that is failing is the image conversion | 17:02 |
| sean-k-mooney | sudo cinder-rootwrap /etc/cinder/rootwrap.conf qemu-img convert -O luks -f raw -o cipher-alg=aes-256,cipher-mode=xts,ivgen-alg=plain64 --object secret,id=luks_sec,format=raw,file=/opt/stack/data/cinder/conversion/luks_qj1kdpp8 -o key-secret=*** /opt/stack/data/cinder/conversion/tmpzj1a9mr9 /opt/stack/data/cinder/conversion/tmpzj1a9mr9.luks | 17:02 |
| clarkb | I guess with debian the fix may be via the backports repo | 17:02 |
| clarkb | which is probably sufficient for our needs | 17:03 |
| sean-k-mooney | so its converting the raw image into a luks encrypted datastream thats written to the backing volume | 17:03 |
| fungi | clarkb: yes, the fix is in the version in bookworm-backports and also the version in trixie proper | 17:03 |
| clarkb | fungi: ack thanks | 17:03 |
| fungi | at least if it's the bug we think it is | 17:04 |
| fungi | which seems probable | 17:04 |
| sean-k-mooney | it was not in ubuntu last summer but it may be now, we would need to check if it ever made it to noble | 17:04 |
| sean-k-mooney | s/last summer/in august/ | 17:05 |
| sean-k-mooney | fungi: https://termbin.com/t48q seems to work as the preference file content | 17:06 |
| sean-k-mooney | im thinking of doing this in https://github.com/openstack/devstack/blob/master/tools/fixup_stuff.sh | 17:07 |
| sean-k-mooney | but do you have any other suggestion? | 17:08 |
| sean-k-mooney | i could just put it in https://github.com/openstack/devstack/blob/master/lib/nova_plugins/functions-libvirt | 17:08 |
| sean-k-mooney | with the actual package installation as well | 17:08 |
| sean-k-mooney | i guess we enable the virt preview repo for fedora there | 17:08 |
| fungi | i don't really have any architectural opinions on devstack itself, since i'm not helping maintain it | 17:08 |
| sean-k-mooney | so its kind of the same | 17:08 |
| sean-k-mooney | i think ill leave it in functions-libvirt as its a little more discoverable | 17:09 |
| opendevreview | Sean Mooney proposed openstack/devstack master: use qemu/libvirt from backport repos on debian 12 https://review.opendev.org/c/openstack/devstack/+/968354 | 17:34 |
| sean-k-mooney | 1} cinder_tempest_plugin.scenario.test_volume_encrypted.TestEncryptedCinderVolumes.test_boot_cloned_encrypted_volume [138.328944s] ... ok | 18:29 |
| sean-k-mooney | thats still running and 1 success for a race condition does not prove anything | 18:29 |
| sean-k-mooney | but https://review.opendev.org/c/openstack/devstack/+/968354 seems to be working | 18:29 |
| opendevreview | Merged openstack/grenade master: Drop reference to removed services https://review.opendev.org/c/openstack/grenade/+/935474 | 18:31 |
| frickler | sean-k-mooney: nice, we could backport that to stable branches and switch to trixie for master anyway I guess | 18:54 |
| sean-k-mooney | yep ill do a dnm from nova to get a little more coverage | 18:56 |
| sean-k-mooney | maybe also cinder | 18:56 |
| sean-k-mooney | for both patches but that was what i was thinking | 18:57 |
| frickler | +1 | 18:57 |
| sean-k-mooney | frickler: do you know how often we typically bump the ceph version in these jobs? i dont think thats currently covered by our testing runtimes is it? | 18:58 |
| sean-k-mooney | it looks like we did it 8 months ago https://github.com/openstack/devstack-plugin-ceph/commit/9820be32934a4a074415e320ce757a96c973bc28 | 18:59 |
| sean-k-mooney | i ask because the "tentacle" release replaced "squid" as the most recent release last week 2025-11-18 | 19:01 |
| sean-k-mooney | so i was wondering if i should propose a mail or something to the dev list to consider moving to that this cycle? | 19:02 |
| sean-k-mooney | we install ceph from container and you can override the version in the job if you like so its not urgent either way | 19:03 |
| frickler | I think gouthamr was doing more on this, I personally would be very conservative in terms of upgrading ceph at least for production environments, I wouldn't switch to a new ceph release until at least 6 months have passed | 19:27 |
| *** haleyb is now known as haleyb|out | 22:51 | |
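
The apt pinning approach discussed between 15:16 and 15:35 can be sketched roughly as follows. This is only an illustration of rendering one apt_preferences(5) stanza; the contents of the termbin and pb.teim.app pastes are not reproduced in the log, so the package globs (`qemu*`, `libvirt*`), the `n=bookworm-backports` codename selector, and the `AptPin` helper itself are assumptions, with 990 taken from the priority the discussion settled on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AptPin:
    """One stanza of an apt_preferences(5) file (hypothetical helper)."""

    packages: tuple[str, ...]  # names or glob patterns, space separated in the file
    release: str               # selector placed after "Pin: release"
    priority: int              # value for Pin-Priority

    def render(self) -> str:
        """Return the stanza text, ending with a newline."""
        return (
            f"Package: {' '.join(self.packages)}\n"
            f"Pin: release {self.release}\n"
            f"Pin-Priority: {self.priority}\n"
        )


# Assumed values: the globs and the codename selector are illustrative,
# not copied from the pastes referenced in the log.
QEMU_BACKPORTS_PIN = AptPin(
    packages=("qemu*", "libvirt*"),
    release="n=bookworm-backports",
    priority=990,
)

if __name__ == "__main__":
    print(QEMU_BACKPORTS_PIN.render(), end="")
```

A file with this content would presumably be dropped under /etc/apt/preferences.d/, as fungi recommends, so that it does not collide with other preference files.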
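
The version window fungi describes at 15:02 (error message introduced in qemu 9.2.0, upstream fix in 10.0.0) can be expressed as a small check. This is a rough sketch under those stated bounds; the function names are hypothetical, and a hit only means "worth a closer look", since a distro package inside the window may already carry a backported fix, as the log notes for qemu-9.2.0-3.fc42.

```python
import re


def upstream_version(package_version: str) -> tuple[int, ...]:
    """Extract the leading numeric components of a distro package version.

    Drops a Debian-style epoch such as "1:" and ignores everything after
    the first character that is neither a digit nor a dot.
    """
    without_epoch = package_version.split(":", 1)[-1]
    match = re.match(r"\d+(?:\.\d+)*", without_epoch)
    if match is None:
        raise ValueError(f"no numeric version in {package_version!r}")
    return tuple(int(part) for part in match.group().split("."))


def in_suspect_range(package_version: str) -> bool:
    """True when the upstream version is at least 9.2 and below 10."""
    return (9, 2) <= upstream_version(package_version) < (10,)


if __name__ == "__main__":
    # The first entry appears in the log; the others are illustrative strings.
    for version in ("9.2.0-3.fc42", "7.2", "10.0.0"):
        print(version, in_suspect_range(version))
```

Tuple comparison keeps the check short: `(9, 2, 0)` sorts after `(9, 2)` and before `(10,)`, while any `(10, ...)` tuple sorts after `(10,)`.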