*** mfo is now known as Guest973 | 02:45 | |
*** mfo_ is now known as mfo | 02:45 | |
*** arne_wiebalck_ is now known as arne_wiebalck | 05:28 | |
bauzas | hi nova | 07:50 |
* bauzas is back today | 07:50 | |
gibi | o/ | 07:51 |
bauzas | wow, missed one day and the world became crazy | 08:56 |
bauzas | was hoping to do reviews this morning, apparently I was wrong :/ | 08:56 |
gibi | which craziness did you observe? | 08:57 |
bauzas | gibi: just a lot of things arrived in my inbox that require a bit of priority :) | 09:08 |
bauzas | don't worry, I'm French, I'm used to complaining | 09:09 |
gibi | :) | 09:09 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Reject AZ changes during aggregate add / remove host https://review.opendev.org/c/openstack/nova/+/821423 | 09:46 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Remove unnecessary if condition https://review.opendev.org/c/openstack/nova/+/844418 | 11:11 |
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 11:48 |
opendevreview | Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests https://review.opendev.org/c/openstack/nova/+/844285 | 13:41 |
opendevreview | Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests https://review.opendev.org/c/openstack/nova/+/844285 | 14:24 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Unparent PciDeviceSpec from PciAddressSpec https://review.opendev.org/c/openstack/nova/+/844491 | 14:55 |
dansmith | kashyap: slaweq is seeing a qemu segv on their fedora periodic job.. could you help us examine and open a bug for the qemu type folks to look at? | 15:20 |
kashyap | dansmith: Hiya; sure. | 15:21 |
dansmith | kashyap: thanks, we're still in meeting, but I imagine slaweq will be around here with a job link shortly | 15:21 |
kashyap | dansmith: Got a link for it? I'm on a call right now, but can look at the errors (I wonder which version of Fedora) | 15:21 |
kashyap | Sure | 15:21 |
slaweq | kashyap dansmith here's failed job https://zuul.openstack.org/build/4a7f284f32eb436da6b5ef59d46e615d/logs | 15:22 |
slaweq | I know that @gibi was looking briefly into it yesterday | 15:22 |
slaweq | but interesting thing is that today this job passed https://zuul.openstack.org/build/4c1f894e55f84447b8b0b0f14c774c89/logs | 15:23 |
kashyap | So it's a bit intermittent | 15:23 |
slaweq | kashyap it was failing every day for at least the last week, except today | 15:23 |
slaweq | this is periodic job so we can check how it will be tomorrow | 15:23 |
kashyap | slaweq: Can you link me to the exact error, pls? | 15:23 |
slaweq | kashyap https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/controller/logs/libvirt/libvirt/qemu/instance-0000002e_log.txt | 15:24 |
kashyap | slaweq: Interesting ... I wonder if there's a `coredumpctl list` output, then we can get the crashdumps right away | 15:25 |
slaweq | I don't think there is anything like that in the job's logs | 15:26 |
slaweq | all logs are here https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/controller/logs/index.html | 15:26 |
kashyap | slaweq: Thanks; looking while on a call | 15:27 |
dansmith | kashyap: if it's just something we run post-crash we can add that as a one-off | 15:27 |
slaweq | kashyap sure, I'm on call now too | 15:27 |
kashyap | dansmith: What would be good is to capture both: `coredumpctl list | grep qemu`, and then for each QEMU PID log `coredumpctl info $PID` (I know ... I'm asking too much) | 15:28 |
kashyap | dansmith: E.g. see at the bottom here for the example output of `coredumpctl info $PID` - https://www.freedesktop.org/software/systemd/man/coredumpctl.html | 15:29 |
kashyap | The reason I ask is, I've successfully found several root-cause stack traces from it in the past. | 15:29 |
kashyap | dansmith: slaweq: Ah, scratch the above, we could even just get this post-crash: `coredumpctl -o qemu.coredump dump /usr/bin/qemu-system-x86_64` | 15:31 |
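A minimal sketch of the post-crash capture kashyap describes, as it might run from a CI post task; the output file name and the no-crash handling here are assumptions, not the actual job change:

```python
# Sketch of the post-crash capture step suggested above (assumed paths/names).
# Lists recorded qemu crashes, then dumps the most recent core to a file.
import subprocess


def capture_qemu_coredump(out="qemu.coredump"):
    # List any recorded qemu crashes; a nonzero exit just means "none found".
    listing = subprocess.run(
        ["coredumpctl", "list", "qemu-system-x86_64"],
        capture_output=True, text=True)
    if listing.returncode != 0:
        return None  # nothing crashed, nothing to collect
    # Dump the most recent core for the qemu binary; -o names the output file.
    subprocess.run(
        ["coredumpctl", "-o", out, "dump", "/usr/bin/qemu-system-x86_64"],
        check=True)
    return out


if __name__ == "__main__":
    print(capture_qemu_coredump())
```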
kashyap | slaweq: A quick question: the instance simply crashes when launching it? | 15:32 |
slaweq | kashyap I think it crashed during snapshotting | 15:32 |
slaweq | it was spawned properly | 15:32 |
opendevreview | Alexey Stupnikov proposed openstack/nova master: Optimize _local_delete calls by compute unit tests https://review.opendev.org/c/openstack/nova/+/844285 | 15:33 |
dansmith | kashyap: ah, running it on a specific pid would be much harder | 15:33 |
* dansmith is catching up | 15:34 | |
dansmith | running it like you describe is something we could hack in as a post job | 15:34 |
kashyap | dansmith: Nah, we can disregard the per-PID thing | 15:34 |
kashyap | Yeah, binary is easier indeed | 15:34 |
dansmith | okay after call(s) I can help hack that in if we need, but if it's pretty repeatable it might be easier to just try to repro locally | 15:35 |
kashyap | dansmith: Yeah, that's the next thing I'm looking at. It looks like it's not Ceph-based, just plain local storage, IIRC | 15:36 |
dansmith | cool | 15:36 |
kashyap | slaweq: I'm just trying to find the precise test trigger. From looking at the 'n-cpu' log, snapshots seem to happen just fine. I'll look more after I'm done w/ this call | 15:37 |
slaweq | kashyap but IIUC nova logs, the instance is gone during the snapshotting process | 15:38 |
slaweq | please take your time, it's not urgent for us for sure | 15:38 |
dansmith | kashyap: it's test_create_backup | 15:38 |
dansmith | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4a7/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-fedora/4a7f284/testr_results.html | 15:38 |
kashyap | (Yeah, just found it; thx) | 15:39 |
kashyap | dansmith: Thanks; so the "coredumpctl dump" command I noted above will dump the most recent core dump. I guess we have to redirect it to a file | 15:43 |
opendevreview | Merged openstack/osc-placement master: Add Python3 zed unit tests https://review.opendev.org/c/openstack/osc-placement/+/835369 | 15:43 |
kashyap | dansmith: Err, ignore the above comment; the "-o" is the file. | 15:43 |
dansmith | kashyap: ack, are you going to try to repro locally first? | 15:43 |
kashyap | dansmith: On F35, just run the Tempest test, or construct a manual libvirt-based repro? | 15:44 |
dansmith | if it doesn't repro locally that might be interesting to know as well, like whether it's related to having run a lot of tests first, or that one thing always fails in isolation | 15:44 |
dansmith | kashyap: I just meant devstack, run that one tempest test | 15:44 |
kashyap | Ah; nod. I can't today; but I can give it a go tomorrow. | 15:45 |
dansmith | when I'm done here I can work on adding that as a post task, it just might take a bunch of iterations to get it right (based on experience) | 15:45 |
dansmith | okay, I'll give it a shot at least when I'm done here | 15:45 |
kashyap | dansmith: When you say "adding that as a post task" -- I take it you mean adding the above "coredumpctl ... dump", yeah? | 15:47 |
dansmith | yep | 15:47 |
kashyap | I have déjà vu about this test_create_backup test, reading its code | 15:49 |
kashyap | slaweq: When you get a minute, can you please file an upstream LP bug to track this? So we can keep all the investigation in one place? | 15:56 |
dansmith | kashyap: against what, nova? | 16:01 |
kashyap | Yeah, I'd say so | 16:01 |
kashyap | dansmith: I just looked at the compressed libvirtd.log | 16:01 |
kashyap | And I see a familiar libvirt error: | 16:01 |
kashyap | 2022-06-01 03:35:33.685+0000: 87576: error : qemuMonitorJSONCheckErrorFull:412 : internal error: unable to execute QEMU command 'blockdev-del': Failed to find node with node-name='libvirt-5-storage' | 16:01 |
kashyap | dansmith: In the past we found the same error earlier this year, and I recall working w/ libvirt folks to get a fix. But that was in a different context: https://listman.redhat.com/archives/libvir-list/2022-February/msg00790.html | 16:02 |
kashyap | dansmith: slaweq: The root TripleO bug (it actually should've been filed against Nova) was this one, where we did the analysis: https://bugs.launchpad.net/tripleo/+bug/1959014 | 16:03 |
kashyap | If you open that last link, scroll up from the bottom for more signal | 16:03 |
dansmith | kashyap: yeah I remember that one.. so to be clear, you expect this is a different issue right? | 16:54 |
dansmith | kashyap: this is running, we'll see: https://review.opendev.org/c/openstack/devstack/+/844503 | 17:09 |
ricolin | bauzas: I think https://review.opendev.org/c/openstack/nova/+/830646 is ready for review now, could you kindly remove the -2 | 17:10 |
bauzas | ricolin: sure, lemme look | 17:11 |
ricolin | bauzas: thanks:) | 17:12 |
bauzas | ricolin: oh, yeah you created the bp and the spec, ta | 17:12 |
ricolin | bauzas: yeah, the spec merged:) | 17:12 |
ricolin | and I updated the implement patch accordingly | 17:13 |
ricolin | I think:) | 17:13 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 17:20 |
opendevreview | Artom Lifshitz proposed openstack/nova stable/ussuri: fake: Ensure need_legacy_block_device_info returns False https://review.opendev.org/c/openstack/nova/+/843950 | 17:21 |
opendevreview | Artom Lifshitz proposed openstack/nova stable/ussuri: Add a regression test for bug 1939545 https://review.opendev.org/c/openstack/nova/+/843951 | 17:21 |
opendevreview | Artom Lifshitz proposed openstack/nova stable/ussuri: compute: Ensure updates to bdms during pre_live_migration are saved https://review.opendev.org/c/openstack/nova/+/843952 | 17:21 |
ricolin | sean-k-mooney: this should be the last piece of the libvirt-viommu-device implementation, but as I'm not familiar with traits, can you review it and let me know if I did it right/wrong | 17:24 |
ricolin | https://review.opendev.org/c/openstack/nova/+/844507 | 17:24 |
sean-k-mooney | sure | 17:41 |
sean-k-mooney | just so you are aware, the unit test will fail until the trait is merged and released | 17:41 |
sean-k-mooney | but the tempest test should be able to pass because Depends-On works for devstack jobs | 17:41 |
sean-k-mooney | but not for tox jobs | 17:41 |
sean-k-mooney | so if you see the tox py38 job fail that will be why | 17:42 |
sean-k-mooney | assuming your tests are otherwise correct :) | 17:42 |
sean-k-mooney | ricolin: the patch is definitely not correct but i'll comment inline | 17:45 |
sean-k-mooney | ricolin: libvirt is never going to report an iommu model of auto or none | 17:46 |
sean-k-mooney | so you need to actually see what is reported from the domain caps api | 17:46 |
sean-k-mooney | by doing virsh domcapabilities --machine q35 --arch x86_64 | 17:47 |
sean-k-mooney | ricolin: but looking at that, this is not something that is reported in that api | 17:50 |
sean-k-mooney | so instead of looking at the domaincap api you need to report the traits based on the libvirt version number | 17:50 |
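A hedged sketch of what sean-k-mooney suggests here: report the vIOMMU traits from the libvirt version number rather than from domcapabilities. The trait names and the version threshold below are assumptions for illustration, not the actual patch:

```python
# Hypothetical sketch: gate vIOMMU trait reporting on the libvirt version,
# since domcapabilities doesn't expose the supported iommu models.
MIN_LIBVIRT_VIOMMU = (8, 0, 0)  # assumed minimum version, for illustration


def viommu_traits(libvirt_version: tuple) -> dict:
    supported = libvirt_version >= MIN_LIBVIRT_VIOMMU
    return {
        # Assumed trait names; "auto"/"none" are nova-side choices that
        # libvirt never reports, so they map to no host capability at all.
        "COMPUTE_VIOMMU_MODEL_INTEL": supported,
        "COMPUTE_VIOMMU_MODEL_VIRTIO": supported,
    }


print(viommu_traits((8, 0, 0)))  # both traits True on a new enough libvirt
print(viommu_traits((6, 0, 0)))  # both False on an older libvirt
```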
melwitt | artom, sean-k-mooney: dunno if yall have seen this related preserve_on_delete bug from a few years ago https://bugs.launchpad.net/nova/+bug/1834463 | 18:22 |
opendevreview | Merged openstack/nova stable/ussuri: [stable-only] Make sdk broken job non voting until it is fixed https://review.opendev.org/c/openstack/nova/+/844309 | 18:43 |
ricolin | sean-k-mooney: so I need to check the libvirt version before I put iommu in devices for fakelibvirt, right? | 18:52 |
artom | melwitt, hrmm, good find | 19:23 |
melwitt | artom: I looked through the code and saw that _heal_instance_info_cache preserves the existing value of preserve_on_delete. tried it out on devstack (created a server with nova creating the port, changed the value of preserve_on_delete to true in the database, saw _heal_instance_info_cache run a number of times, then detached the port) and it did not delete the port | 19:43 |
melwitt | I'm realizing the scenario in the above bug is different. they're saying they removed an interface by a manual database update, then nova added it back without (obviously) the original value of preserve_on_delete. I guess they are saying if they detach the port and then reattach it, they don't get the same value of preserve_on_delete. a bit different issue | 19:49 |
melwitt | although, they should get the same value bc if they reattach the port, nova won't consider it to be created by nova and thus should set preserve_on_delete = True | 19:56 |
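A toy sketch of the semantics melwitt is describing, with illustrative names only: the flag is decided once at boot/attach from whether nova created the port, and _heal_instance_info_cache keeps whatever value is already cached:

```python
# Toy sketch of the behavior described above; names are illustrative only.
def new_cache_entry(port_id, existing_cache, nova_created_port):
    # A port nova did not create (user-supplied or reattached) must be
    # preserved when the instance is deleted; a nova-created port is
    # nova's to clean up.
    flag = not nova_created_port
    # The heal task keeps whatever value is already cached; the flag is
    # only lost if the cache entry itself disappears.
    if port_id in existing_cache:
        flag = existing_cache[port_id]
    return {port_id: flag}


cache = new_cache_entry("p1", {}, nova_created_port=False)
assert cache["p1"] is True  # user-supplied port is preserved on delete
```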
sean-k-mooney | melwitt: they were updating the db | 20:00 |
sean-k-mooney | so really all bets are off at that point | 20:01 |
melwitt | just tried reattach and it indeed has preserve_on_delete = true. that means the bug report is very specifically the case where the interface gets removed from the info cache not via the API and then _heal_instance_info_cache runs. I don't know how that could happen during normal operation (no manual db update) | 20:01 |
sean-k-mooney | so there case does not make sense | 20:01 |
sean-k-mooney | well | 20:01 |
melwitt | yeah, I assumed they did the manual update to simplify a real world case but without any more data, I don't know how that case can happen | 20:02 |
sean-k-mooney | they booted with nova creating a nic | 20:02 |
sean-k-mooney | they somehow detached it without it getting deleted | 20:02 |
sean-k-mooney | and then reattached it | 20:02 |
sean-k-mooney | so with the undocumented behavior, when it got detached it should have gotten deleted | 20:02 |
sean-k-mooney | so there is no port to reattach | 20:02 |
melwitt | no, in their report they say they called server create with port_id passed in | 20:02 |
melwitt | so that means it begins with preserve_on_delete = true | 20:03 |
sean-k-mooney | oh then it should have preserve_on_delete true | 20:03 |
melwitt | yeah | 20:03 |
melwitt | no idea how what they say can happen "in real life" | 20:03 |
sean-k-mooney | so let's see | 20:04 |
sean-k-mooney | they simulated the network info cache getting corrupted | 20:04 |
sean-k-mooney | and then waited for the heal task to fix the info cache | 20:04 |
sean-k-mooney | and then nova thinks the port was created by it | 20:04 |
sean-k-mooney | i guess i can see that happening if we lost the info of how the port was requested | 20:05 |
sean-k-mooney | so that is implying we store that in the info cache only | 20:05 |
melwitt | yeah, nova just sets the flag at port creation time (true only if it didn't create the port). after that it's cache only | 20:05 |
sean-k-mooney | well that's broken | 20:06 |
sean-k-mooney | i guess we do that for attach | 20:06 |
sean-k-mooney | too | 20:06 |
sean-k-mooney | e.g. if we do attach network instead of attach port | 20:06 |
sean-k-mooney | we probably need to change this to store this in either the virtual interfaces table or instance_system_metadata if we want to avoid a db migration | 20:07 |
sean-k-mooney | the initial boot request would be stored in the request spec but we don't update that on network attach, at least i doubt we do | 20:08 |
melwitt | it would be nice to save it somewhere... other than instance_info_caches if that table is apparently fraught with problems | 20:09 |
sean-k-mooney | well it's meant to be a cache | 20:11 |
melwitt | request_spec seems like a good place? | 20:11 |
sean-k-mooney | as in we should be able to drop it if we needed to | 20:12 |
melwitt | fair | 20:12 |
sean-k-mooney | request_spec is in the api db | 20:12 |
melwitt | oh right :/ | 20:12 |
sean-k-mooney | so we could update the requested networks in the api but we would have to wait till after the virt driver finished attaching | 20:12 |
sean-k-mooney | is this a call or a cast | 20:12 |
sean-k-mooney | i guess its a call | 20:13 |
sean-k-mooney | since it's a 200 response | 20:13 |
sean-k-mooney | https://docs.openstack.org/api-ref/compute/?expanded=add-network-detail%2Ccreate-interface-detail#create-interface= | 20:13 |
melwitt | yeah it's a call | 20:13 |
sean-k-mooney | so we could update the request_spec network_requests list if we really wanted to | 20:13 |
sean-k-mooney | we just need to make sure to only do it if the call succeeds | 20:14 |
melwitt | but nova-compute couldn't get to it without an upcall right | 20:14 |
sean-k-mooney | not via nova-compute; in the api | 20:14 |
sean-k-mooney | when we wait for the call | 20:15 |
sean-k-mooney | i think there are better places to store it however | 20:15 |
melwitt | if nova-compute needs to rebuild the info cache from nothing, like the db row update example in the bug | 20:15 |
sean-k-mooney | ya it should be able to | 20:16 |
sean-k-mooney | we have had cases where we lost ports in the cache due to a buggy neutron backend or neutron policy issues | 20:16 |
sean-k-mooney | e.g. where neutron returned an empty port list | 20:16 |
melwitt | nova-compute can't read it from request_specs without it being an upcall. am I missing something? | 20:16 |
sean-k-mooney | the heal logic will recreate the info cache entries from the neutron data if that happens | 20:17 |
sean-k-mooney | melwitt: correct, it can't | 20:17 |
melwitt | so storing it in request spec doesn't help afaict | 20:17 |
sean-k-mooney | not really no | 20:17 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/db/main/models.py#L784= | 20:17 |
sean-k-mooney | the virtual interfaces table should store it but it has no field we can abuse to store it without a db change | 20:18 |
melwitt | just saying it sounded like a good place to store it initially but if nova-compute can't read it, it doesn't solve this issue | 20:18 |
sean-k-mooney | instance_system_metadata can store it since it's just a set of key value pairs | 20:18 |
sean-k-mooney | and thats in the cell db | 20:19 |
sean-k-mooney | so that is probably where i would stash it | 20:19 |
melwitt | yeah, that would work | 20:19 |
sean-k-mooney | so we just have the key be the <neutron port uuid>_preserve_on_delete | 20:20 |
sean-k-mooney | or store the list as a single key | 20:20 |
sean-k-mooney | that is probably better since it's indexed by the instance_id anyway | 20:20 |
sean-k-mooney | it denormalises the db technically | 20:20 |
sean-k-mooney | but a preserve_on_delete_list key that we look up with "select preserve_on_delete from instance_system_metadata where instance_id = xyz" | 20:21 |
sean-k-mooney | is much simpler to look up | 20:22 |
sean-k-mooney | but either would work | 20:22 |
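A hedged sketch of the single-key instance_system_metadata option sean-k-mooney outlines, where one lookup by instance recovers the whole list; the key name and JSON encoding are assumptions, not nova code:

```python
# Hypothetical sketch of stashing preserve_on_delete under one
# instance_system_metadata key, as proposed above.
import json

KEY = "preserve_on_delete_ports"  # assumed key name


def save_preserved_ports(system_metadata: dict, port_ids: list) -> None:
    # One key per instance; instance_system_metadata rows are already
    # indexed by instance uuid, so reading it back is a single-row lookup.
    system_metadata[KEY] = json.dumps(sorted(port_ids))


def load_preserved_ports(system_metadata: dict) -> set:
    return set(json.loads(system_metadata.get(KEY, "[]")))


meta = {}
save_preserved_ports(meta, ["a1b2", "c3d4"])
assert "a1b2" in load_preserved_ports(meta)
```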
opendevreview | Merged openstack/placement stable/ussuri: Use 'functional-without-sample-db-tests' tox env for placement nova job https://review.opendev.org/c/openstack/placement/+/840773 | 21:10 |
opendevreview | melanie witt proposed openstack/nova stable/train: DNM Testing for ceph setup gate fail https://review.opendev.org/c/openstack/nova/+/844530 | 22:40 |