*** mfo is now known as Guest973 | 02:45 | |
*** mfo_ is now known as mfo | 02:45 | |
*** arne_wiebalck_ is now known as arne_wiebalck | 05:28 | |
bauzas | hi nova | 07:50 |
* bauzas is back today | 07:50 | |
gibi | o/ | 07:51 |
bauzas | wow, missed one day and the world became crazy | 08:56 |
bauzas | was hoping to do reviews this morning, apparently I was wrong :/ | 08:56 |
gibi | which craziness did you observe? | 08:57 |
bauzas | gibi: just a lot of things arrived in my inbox that require a bit of priority :) | 09:08 |
bauzas | don't worry, I'm French, I'm used to complaining | 09:09 |
gibi | :) | 09:09 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Reject AZ changes during aggregate add / remove host https://review.opendev.org/c/openstack/nova/+/821423 | 09:46 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Remove unnecessary if condition https://review.opendev.org/c/openstack/nova/+/844418 | 11:11 |
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 11:48 |
opendevreview | Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests https://review.opendev.org/c/openstack/nova/+/844285 | 13:41 |
opendevreview | Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests https://review.opendev.org/c/openstack/nova/+/844285 | 14:24 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Unparent PciDeviceSpec from PciAddressSpec https://review.opendev.org/c/openstack/nova/+/844491 | 14:55 |
dansmith | kashyap: slaweq is seeing a qemu segv on their fedora periodic job.. could you help us examine and open a bug for the qemu type folks to look at? | 15:20 |
kashyap | dansmith: Hiya; sure. | 15:21 |
dansmith | kashyap: thanks, we're still in meeting, but I imagine slaweq will be around here with a job link shortly | 15:21 |
kashyap | dansmith: Got a link for it? I'm on a call right now, but can look at the errors (I wonder which version of Fedora) | 15:21 |
kashyap | Sure | 15:21 |
slaweq | kashyap dansmith here's failed job https://zuul.openstack.org/build/4a7f284f32eb436da6b5ef59d46e615d/logs | 15:22 |
slaweq | I know that @gibi was looking briefly into it yesterday | 15:22 |
slaweq | but interesting thing is that today this job passed https://zuul.openstack.org/build/4c1f894e55f84447b8b0b0f14c774c89/logs | 15:23 |
kashyap | So it's a bit intermittent | 15:23 |
slaweq | kashyap it was failing every day for at least the last week, except today | 15:23 |
slaweq | this is periodic job so we can check how it will be tomorrow | 15:23 |
kashyap | slaweq: Can you link me to the exact error, pls? | 15:23 |
slaweq | kashyap https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/controller/logs/libvirt/libvirt/qemu/instance-0000002e_log.txt | 15:24 |
kashyap | slaweq: Interesting ... I wonder if there's a `coredumpctl list` output, then we can get the crashdumps right away | 15:25 |
slaweq | I don't think there is anything like that in the job's logs | 15:26 |
slaweq | all logs are here https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/controller/logs/index.html | 15:26 |
kashyap | slaweq: Thanks; looking while on a call | 15:27 |
dansmith | kashyap: if it's just something we run post-crash we can add that as a one-off | 15:27 |
slaweq | kashyap sure, I'm on call now too | 15:27 |
kashyap | dansmith: What would be good is to capture both: `coredumpctl list | grep qemu`, and then for each QEMU PID log `coredumpctl info $PID` (I know ... I'm asking too much) | 15:28 |
kashyap | dansmith: E.g. see at the bottom here for the example output of `coredumpctl info $PID` - https://www.freedesktop.org/software/systemd/man/coredumpctl.html | 15:29 |
kashyap | The reason I ask is, I've successfully found several root-cause stack traces from it in the past. | 15:29 |
kashyap | dansmith: slaweq: Ah, scratch the above, we could even just get this post-crash: `coredumpctl -o qemu.coredump dump /usr/bin/qemu-system-x86_64` | 15:31 |
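A minimal sketch of the post-crash capture kashyap describes, as it might run from a CI post task; the output file name and the no-crash handling here are assumptions, not the actual job change:

```python
# Sketch of the post-crash capture step suggested above (assumed paths/names).
# Lists recorded qemu crashes, then dumps the most recent core to a file.
import subprocess


def capture_qemu_coredump(out="qemu.coredump"):
    # List any recorded qemu crashes; a nonzero exit just means "none found".
    listing = subprocess.run(
        ["coredumpctl", "list", "qemu-system-x86_64"],
        capture_output=True, text=True)
    if listing.returncode != 0:
        return None  # nothing crashed, nothing to collect
    # Dump the most recent core for the qemu binary; -o names the output file.
    subprocess.run(
        ["coredumpctl", "-o", out, "dump", "/usr/bin/qemu-system-x86_64"],
        check=True)
    return out


if __name__ == "__main__":
    print(capture_qemu_coredump())
```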
kashyap | slaweq: A quick question: the instance simply crashes when launching it? | 15:32 |
slaweq | kashyap I think it crashed during snapshotting | 15:32 |
slaweq | it was spawned properly | 15:32 |
opendevreview | Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests https://review.opendev.org/c/openstack/nova/+/844285 | 15:33 |
dansmith | kashyap: ah, running it on a specific pid would be much harder | 15:33 |
* dansmith is catching up | 15:34 | |
dansmith | running it like you describe is something we could hack in as a post job | 15:34 |
kashyap | dansmith: Nah, we can disregard the per-PID thing | 15:34 |
kashyap | Yeah, binary is easier indeed | 15:34 |
dansmith | okay after call(s) I can help hack that in if we need, but if it's pretty repeatable it might be easier to just try to repro locally | 15:35 |
kashyap | dansmith: Yeah, that's the next thing I'm looking at. It looks like it's not Ceph-based, just plain local storage, IIRC | 15:36 |
dansmith | cool | 15:36 |
kashyap | slaweq: I'm just trying to find the precise test trigger. From looking at the 'n-cpu' log, snapshots seem to happen just fine. I'll look more after I'm done w/ this call | 15:37 |
slaweq | kashyap but IIUC nova logs, the instance is gone during the snapshotting process | 15:38 |
slaweq | please take your time, it's not urgent for us for sure | 15:38 |
dansmith | kashyap: it's test_create_backup | 15:38 |
dansmith | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/testr_results.html | 15:38 |
kashyap | (Yeah, just found it; thx) | 15:39 |
kashyap | dansmith: Thanks; so the "coredumpctl dump" command I noted above will dump the most recent core dump. I guess we have to redirect it to a file | 15:43 |
opendevreview | Merged openstack/osc-placement master: Add Python3 zed unit tests https://review.opendev.org/c/openstack/osc-placement/+/835369 | 15:43 |
kashyap | dansmith: Err, ignore the above comment; the "-o" is the file. | 15:43 |
dansmith | kashyap: ack, are you going to try to repro locally first? | 15:43 |
kashyap | dansmith: On F35, just run the Tempest test, or construct a manual libvirt-based repro? | 15:44 |
dansmith | if it doesn't repro locally that might be interesting to know as well, like whether it's related to having run a lot of tests first, or that one thing always fails in isolation | 15:44 |
dansmith | kashyap: I just meant devstack, run that one tempest test | 15:44 |
kashyap | Ah; nod. I can't today; but I can give it a go tomorrow. | 15:45 |
dansmith | when I'm done here I can work on adding that as a post task, it just might take a bunch of iterations to get it right (based on experience) | 15:45 |
dansmith | okay, I'll give it a shot at least when I'm done here | 15:45 |
kashyap | dansmith: When you say "adding that as a post task" -- I take it you mean adding the above "coredumpctl ... dump", yeah? | 15:47 |
dansmith | yep | 15:47 |
kashyap | I have déjà vu about this test_create_backup test, reading its code | 15:49 |
kashyap | slaweq: When you get a minute, can you please file an upstream LP bug to track this? So we can keep all the investigation in one place? | 15:56 |
dansmith | kashyap: against what, nova? | 16:01 |
kashyap | Yeah, I'd say so | 16:01 |
kashyap | dansmith: I just looked at the compressed libvirtd.log | 16:01 |
kashyap | And I see a familiar libvirt error: | 16:01 |
kashyap | 2022-06-01 03:35:33.685+0000: 87576: error : qemuMonitorJSONCheckErrorFull:412 : internal error: unable to execute QEMU command 'blockdev-del': Failed to find node with node-name='libvirt-5-storage' | 16:01 |
kashyap | dansmith: In the past we found the same error earlier this year, and I recall working w/ libvirt folks to get a fix. But that was in a different context: https://listman.redhat.com/archives/libvir-list/2022-February/msg00790.html | 16:02 |
kashyap | dansmith: slaweq: The root TripleO bug (it actually should've been filed against Nova) was this one, where we did the analysis: https://bugs.launchpad.net/tripleo/+bug/1959014 | 16:03 |
kashyap | If you open that last link, scroll up from the bottom for more signal | 16:03 |
dansmith | kashyap: yeah I remember that one.. so to be clear, you expect this is a different issue right? | 16:54 |
dansmith | kashyap: this is running, we'll see: https://review.opendev.org/c/openstack/devstack/+/844503 | 17:09 |
ricolin | bauzas: I think https://review.opendev.org/c/openstack/nova/+/830646 is ready for review now, could you kindly remove the -2 | 17:10 |
bauzas | ricolin: sure, lemme look | 17:11 |
ricolin | bauzas: thanks:) | 17:12 |
bauzas | ricolin: oh, yeah you created the bp and the spec, ta | 17:12 |
ricolin | bauzas: yeah, the spec merged:) | 17:12 |
ricolin | and I updated the implement patch accordingly | 17:13 |
ricolin | I think:) | 17:13 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 17:20 |
opendevreview | Artom Lifshitz proposed openstack/nova stable/ussuri: fake: Ensure need_legacy_block_device_info returns False https://review.opendev.org/c/openstack/nova/+/843950 | 17:21 |
opendevreview | Artom Lifshitz proposed openstack/nova stable/ussuri: Add a regression test for bug 1939545 https://review.opendev.org/c/openstack/nova/+/843951 | 17:21 |
opendevreview | Artom Lifshitz proposed openstack/nova stable/ussuri: compute: Ensure updates to bdms during pre_live_migration are saved https://review.opendev.org/c/openstack/nova/+/843952 | 17:21 |
ricolin | sean-k-mooney: this should be the last piece of the libvirt-viommu-device implementation, but as I'm not familiar with traits, can you review it and let me know if I did it right/wrong | 17:24 |
ricolin | https://review.opendev.org/c/openstack/nova/+/844507 | 17:24 |
sean-k-mooney | sure | 17:41 |
sean-k-mooney | just so you are aware, the unit test will fail until the trait is merged and released | 17:41 |
sean-k-mooney | but the tempest test should be able to pass because Depends-On works for devstack jobs | 17:41 |
sean-k-mooney | but not for tox jobs | 17:41 |
sean-k-mooney | so if you see the tox py38 job fail that will be why | 17:42 |
sean-k-mooney | assuming your tests are otherwise correct :) | 17:42 |
sean-k-mooney | ricolin: the patch is definitely not correct but i'll comment inline | 17:45 |
sean-k-mooney | ricolin: libvirt is never going to report an iommu model of auto or none | 17:46 |
sean-k-mooney | so you need to actually see what is reported from the domain caps api | 17:46 |
sean-k-mooney | by doing virsh domcapabilities --machine q35 --arch x86_64 | 17:47 |
sean-k-mooney | ricolin: but looking at that, this is not something that is reported in that api | 17:50 |
sean-k-mooney | so instead of looking at the domaincap api you need to report the traits based on the libvirt version number | 17:50 |
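A hedged sketch of what sean-k-mooney suggests here: report the vIOMMU traits from the libvirt version number rather than from domcapabilities. The trait names and the version threshold below are assumptions for illustration, not the actual patch:

```python
# Hypothetical sketch: gate vIOMMU trait reporting on the libvirt version,
# since domcapabilities doesn't expose the supported iommu models.
MIN_LIBVIRT_VIOMMU = (8, 0, 0)  # assumed minimum version, for illustration


def viommu_traits(libvirt_version: tuple) -> dict:
    supported = libvirt_version >= MIN_LIBVIRT_VIOMMU
    return {
        # Assumed trait names; "auto"/"none" are nova-side choices that
        # libvirt never reports, so they map to no host capability at all.
        "COMPUTE_VIOMMU_MODEL_INTEL": supported,
        "COMPUTE_VIOMMU_MODEL_VIRTIO": supported,
    }


print(viommu_traits((8, 0, 0)))  # both traits True on a new enough libvirt
print(viommu_traits((6, 0, 0)))  # both False on an older libvirt
```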
melwitt | artom, sean-k-mooney: dunno if yall have seen this related preserve_on_delete bug from a few years ago https://bugs.launchpad.net/nova/+bug/1834463 | 18:22 |
opendevreview | Merged openstack/nova stable/ussuri: [stable-only] Make sdk broken job non voting until it is fixed https://review.opendev.org/c/openstack/nova/+/844309 | 18:43 |
ricolin | sean-k-mooney: so I need to check the libvirt version before I put iommu in devices for fakelibvirt, right? | 18:52 |
artom | melwitt, hrmm, good find | 19:23 |
melwitt | artom: I looked through the code and saw that _heal_instance_info_cache preserves the existing value of preserve_on_delete. tried it out on devstack (created a server with nova creating the port, changed the value of preserve_on_delete to true in the database, saw _heal_instance_info_cache run a number of times, then detached the port) and it did not delete the port | 19:43 |
melwitt | I'm realizing the scenario in the above bug is different. they're saying they removed an interface by a manual database update, then nova added it back without (obviously) the original value of preserve_on_delete. I guess they are saying if they detach the port and then reattach it, they don't get the same value of preserve_on_delete. a bit different issue | 19:49 |
melwitt | although, they should get the same value bc if they reattach the port, nova won't consider it to be created by nova and thus should set preserve_on_delete = True | 19:56 |
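A toy sketch of the semantics melwitt is describing, with illustrative names only: the flag is decided once at boot/attach from whether nova created the port, and _heal_instance_info_cache keeps whatever value is already cached:

```python
# Toy sketch of the behavior described above; names are illustrative only.
def new_cache_entry(port_id, existing_cache, nova_created_port):
    # A port nova did not create (user-supplied or reattached) must be
    # preserved when the instance is deleted; a nova-created port is
    # nova's to clean up.
    flag = not nova_created_port
    # The heal task keeps whatever value is already cached; the flag is
    # only lost if the cache entry itself disappears.
    if port_id in existing_cache:
        flag = existing_cache[port_id]
    return {port_id: flag}


cache = new_cache_entry("p1", {}, nova_created_port=False)
assert cache["p1"] is True  # user-supplied port is preserved on delete
```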
sean-k-mooney | melwitt: they were updating the db | 20:00 |
sean-k-mooney | so really all bets are off at that point | 20:01 |
melwitt | just tried reattach and it indeed has preserve_on_delete = true. that means the bug report is very specifically the case where the interface gets removed from the info cache not via the API and then _heal_instance_info_cache runs. I don't know how that could happen during normal operation (no manual db update) | 20:01 |
sean-k-mooney | so there case does not make sense | 20:01 |
sean-k-mooney | well | 20:01 |
melwitt | yeah, I assumed they did the manual update to simplify a real world case but without any more data, I don't know how that case can happen | 20:02 |
sean-k-mooney | they booted with nova creating a nic | 20:02 |
sean-k-mooney | they somehow detached it without it getting deleted | 20:02 |
sean-k-mooney | and then reattached it | 20:02 |
sean-k-mooney | so with the undocumented behavior, when it got detached it should have gotten deleted | 20:02 |
sean-k-mooney | so there is no port to reattach | 20:02 |
melwitt | no, in their report they say they called server create with port_id passed in | 20:02 |
melwitt | so that means it begins with preserve_on_delete = true | 20:03 |
sean-k-mooney | oh then it should have preserve_on_delete true | 20:03 |
melwitt | yeah | 20:03 |
melwitt | no idea how what they say can happen "in real life" | 20:03 |
sean-k-mooney | so let's see | 20:04 |
sean-k-mooney | they simulated the network info cache getting corrupted | 20:04 |
sean-k-mooney | and then waited for the heal task to fix the info cache | 20:04 |
sean-k-mooney | and then nova thinks the port was created by it | 20:04 |
sean-k-mooney | i guess i can see that happening if we lost the info of how the port was requested | 20:05 |
sean-k-mooney | so that is implying we store that in the info cache only | 20:05 |
melwitt | yeah, nova just sets the flag at port creation time (true only if it didn't create the port). after that it's cache only | 20:05 |
sean-k-mooney | well that's broken | 20:06 |
sean-k-mooney | i guess we do that for attach | 20:06 |
sean-k-mooney | too | 20:06 |
sean-k-mooney | e.g. if we do attach network instead of attach port | 20:06 |
sean-k-mooney | we probably need to change this to store this in either the virtual interfaces table or instance_system_metadata if we want to avoid a db migration | 20:07 |
sean-k-mooney | the initial boot request would be stored in the request spec but we don't update that on network attach, at least i doubt we do | 20:08 |
melwitt | it would be nice to save it somewhere... other than instance_info_caches if that table is apparently fraught with problems | 20:09 |
sean-k-mooney | well it's meant to be a cache | 20:11 |
melwitt | request_spec seems like a good place? | 20:11 |
sean-k-mooney | as in we should be able to drop it if we needed to | 20:12 |
melwitt | fair | 20:12 |
sean-k-mooney | request_spec is in the api db | 20:12 |
melwitt | oh right :/ | 20:12 |
sean-k-mooney | so we could update the requested networks in the api but we would have to wait till after the virt driver finished attaching | 20:12 |
sean-k-mooney | is this a call or a cast | 20:12 |
sean-k-mooney | i guess its a call | 20:13 |
sean-k-mooney | since it's a 200 response | 20:13 |
sean-k-mooney | https://docs.openstack.org/api-ref/compute/?expanded=add-network-detail%2Ccreate-interface-detail#create-interface= | 20:13 |
melwitt | yeah it's a call | 20:13 |
sean-k-mooney | so we could update the request_spec network_requests list if we really wanted to | 20:13 |
sean-k-mooney | we just need to make sure to only do it if the call succeeds | 20:14 |
melwitt | but nova-compute couldn't get to it without an upcall right | 20:14 |
sean-k-mooney | not via nova-compute; in the api | 20:14 |
sean-k-mooney | when we wait for the call | 20:15 |
sean-k-mooney | i think there are better places to store it however | 20:15 |
melwitt | if nova-compute needs to rebuild the info cache from nothing, like the db row update example in the bug | 20:15 |
sean-k-mooney | ya it should be able to | 20:16 |
sean-k-mooney | we have had cases where we lost ports in the cache due to a buggy neutron backend or neutron policy issues | 20:16 |
sean-k-mooney | e.g. where neutron returned an empty port list | 20:16 |
melwitt | nova-compute can't read it from request_specs without it being an upcall. am I missing something? | 20:16 |
sean-k-mooney | the heal logic will recreate the info cache entries from the neutron data if that happens | 20:17 |
sean-k-mooney | melwitt: correct, it can't | 20:17 |
melwitt | so storing it in request spec doesn't help afaict | 20:17 |
sean-k-mooney | not really no | 20:17 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/db/main/models.py#L784= | 20:17 |
sean-k-mooney | the virtual interfaces table should store it but it has no field we can abuse to store it without a db change | 20:18 |
melwitt | just saying it sounded like a good place to store it initially but if nova-compute can't read it, it doesn't solve this issue | 20:18 |
sean-k-mooney | instance_system_metadata can store it since it's just a set of key value pairs | 20:18 |
sean-k-mooney | and thats in the cell db | 20:19 |
sean-k-mooney | so that is probably where i would stash it | 20:19 |
melwitt | yeah, that would work | 20:19 |
sean-k-mooney | so we just have the key be the <neutron port uuid>_preserve_on_delete | 20:20 |
sean-k-mooney | or store the list as a single key | 20:20 |
sean-k-mooney | that is probably better since it's indexed by the instance_id anyway | 20:20 |
sean-k-mooney | it denormalises the db technically | 20:20 |
sean-k-mooney | but a preserve_on_delete_list key that we look up with "select preserve_on_delete from instance_system_metadata where instance_id = xyz" | 20:21 |
sean-k-mooney | is much simpler to look up | 20:22 |
sean-k-mooney | but either would work | 20:22 |
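A hedged sketch of the single-key instance_system_metadata option sean-k-mooney outlines, where one lookup by instance recovers the whole list; the key name and JSON encoding are assumptions, not nova code:

```python
# Hypothetical sketch of stashing preserve_on_delete under one
# instance_system_metadata key, as proposed above.
import json

KEY = "preserve_on_delete_ports"  # assumed key name


def save_preserved_ports(system_metadata: dict, port_ids: list) -> None:
    # One key per instance; instance_system_metadata rows are already
    # indexed by instance uuid, so reading it back is a single-row lookup.
    system_metadata[KEY] = json.dumps(sorted(port_ids))


def load_preserved_ports(system_metadata: dict) -> set:
    return set(json.loads(system_metadata.get(KEY, "[]")))


meta = {}
save_preserved_ports(meta, ["a1b2", "c3d4"])
assert "a1b2" in load_preserved_ports(meta)
```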
opendevreview | Merged openstack/placement stable/ussuri: Use 'functional-without-sample-db-tests' tox env for placement nova job https://review.opendev.org/c/openstack/placement/+/840773 | 21:10 |
opendevreview | melanie witt proposed openstack/nova stable/train: DNM Testing for ceph setup gate fail https://review.opendev.org/c/openstack/nova/+/844530 | 22:40 |