*** dpawlik3 is now known as dpawlik | 08:34 | |
gibi | good day nova | 08:56 |
* kashyap waves | 09:09 | |
bauzas | hola folks | 09:17 |
gibi | o/ | 09:20 |
gibi | -another day in downstream land- | 09:20 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813419 | 09:34 |
gibi | stephenfin: do you recall why you needed to configure both db in a single Database fixture in https://review.opendev.org/c/openstack/nova/+/799526/5/nova/tests/fixtures/nova.py#612 ? | 10:58 |
gibi | I think our tests are using a separate fixture instance for each db (api, main) | 10:59 |
stephenfin | gibi: I'm not 100% sure but my guess is that I don't, and that was simply for expediency/laziness :) If SESSION_CONFIGURED became a mapping of DB type to "is configured" bool, we probably wouldn't need that | 11:00 |
stephenfin | *we don't | 11:00 |
gibi | I see, so we had a single global flag but two dbs to configure | 11:01 |
stephenfin | Yeah, I think so | 11:01 |
gibi | OK, if that is the only reason then I think I have a way to remove that global flag (based on melwitt's idea) with patch_factory from oslo_db | 11:02 |
gibi | it is now pretty confusing that we have two Database fixtures instantiated, one for main and one for api, but the first one configures both dbs | 11:03 |
stephenfin | yeah, tbc it could be more complicated than that but I really doubt it | 11:04 |
gibi | yeah, let's see if my idea works | 11:08 |
sean-k-mooney | stephenfin: since you're about, here's an easy one for you https://review.opendev.org/c/openstack/nova/+/811947 think we can get that over the line? | 11:12 |
stephenfin | sure, will look now | 11:13 |
sean-k-mooney | thanks :) | 11:14 |
sean-k-mooney | stephenfin: based on the PTG discussion would you mind removing your -2 on https://review.opendev.org/c/openstack/nova/+/804292 ? I'm going to rebase that and the autopep8 one shortly | 11:19 |
frickler | kashyap: I didn't make progress with reproduction without nova yet, so I created https://gitlab.com/qemu-project/qemu/-/issues/693 for now. let me know if you need additional data there | 13:05 |
kashyap | frickler: Thanks for the report. A quick one is: were you using nested setup, or was this DevStack instance on a baremetal host (<shudder>)? | 13:08 |
kashyap | A rule of thumb is to always explicitly state it if you're using a nested setup | 13:09 |
kashyap | frickler: Can you edit the report to state that "deploy DevStack in a VM?" So that an unsuspecting dev won't run it on their baremetal laptop and wreak havoc... | 13:09 |
kashyap | I'll add a quick comment there, actually | 13:10 |
kashyap | Done | 13:17 |
kashyap | frickler: I'll check about it w/ a TCG dev | 13:17 |
frickler | kashyap: yes, nested is correct, I added that to the description. though I could also reproduce on a baremetal host if you assume it would behave differently | 13:25 |
kashyap | frickler: No, no need for baremetal. VMs are best. Can you also post the QEMU command-line of the DevStack VM itself? (The level-1 VM) | 13:37 |
frickler | kashyap: no, I have no admin access to the cloud it is running on. I'm assuming it will essentially look the same, though, just with accel=kvm | 13:39 |
kashyap | Hmm, not sure if it'll be that same w/ accel=kvm. The details would change quite a bit. The "host" (guest hypervisor) setup can determine the guest behaviour here a lot. | 13:43 |
kashyap | That's one of the questions I'd expect from a TCG dev | 13:43 |
frickler | kashyap: o.k., I'll try running cirros locally without devstack in between, that would give the simplest setup in the end | 13:45 |
kashyap | frickler: Sure; yeah, that'd be the best. The shorter the route to the reproducer, the more likely we can get to the root cause | 13:47 |
kashyap | frickler: Thanks for all the testing! It's a pain, I know | 13:47 |
*** kopecmartin is now known as kopecmartin|pto | 14:00 | |
frickler | kashyap: that went easier than I expected, updated the issue | 14:14 |
bauzas | gibi: sean-k-mooney: I think we said https://bugs.launchpad.net/nova/+bug/1947753 is valid during our PTG, right? | 14:19 |
gibi | bauzas: I don't remember discussing this | 14:21 |
bauzas | gibi: sorry you're maybe right | 14:21 |
gibi | what I see is that in the bug they evacuate instances without restarting the compute node | 14:21 |
bauzas | I originally thought this was about evacuate/evacuateback/evacuate | 14:21 |
bauzas | adding a comment | 14:22 |
gibi | so far we said that you can only evacuate if you make sure that the compute is dead | 14:23 |
gibi | in the bug case the compute was halted / stuck, the heartbeat was missing so the service was considered down, nova allowed evacuation, then the compute recovered without the nova-compute service restarted | 14:23 |
gibi | nova-compute only cleans up evacuated instance during init_host but does not do it periodically | 14:24 |
gibi | so in this case the evacuated instance was not cleaned up on the source leading to duplicated instances causing corruption | 14:24 |
gibi | option a) change nova-compute to clean up evacuated instance in a periodic | 14:25 |
gibi | option b) change the evac API to only allow evacuation if the compute is forced down (meaning the admin made sure the host is fenced) | 14:26 |
gibi | option c) declare the current bug as user error as the nova-compute was not restarted as part of the compute node recovery | 14:27 |
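Option a) above hinges on the detail gibi describes: _destroy_evacuated_instances runs only from init_host, so a compute that recovers without a service restart never cleans up. A minimal sketch of the idea, with toy data structures and method names that only echo (not reproduce) nova's internals:

```python
# Hypothetical sketch of option (a): run the evacuation cleanup both at
# startup and on a timer, instead of only during init_host. The class,
# state model, and method names are illustrative assumptions.
import time


class ComputeManager:
    def __init__(self):
        # toy local view of instances on this host
        self.local_instances = {"inst-1": "evacuated", "inst-2": "active"}

    def _destroy_evacuated_instances(self):
        # Remove local copies of instances evacuated away while this host
        # was down/stuck, so the same instance cannot run on two hosts.
        evacuated = [uuid for uuid, state in self.local_instances.items()
                     if state == "evacuated"]
        for uuid in evacuated:
            del self.local_instances[uuid]
        return evacuated

    def init_host(self):
        # today: the only place the cleanup happens
        self._destroy_evacuated_instances()

    def periodic_cleanup(self, interval=60):
        # option (a): also run the cleanup periodically, so a recovered but
        # never-restarted compute still drops stale evacuated instances
        while True:
            self._destroy_evacuated_instances()
            time.sleep(interval)
```

The sketch only illustrates why the periodic variant closes the gap in the bug: the cleanup becomes independent of a service restart.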
bauzas | gibi: I wrote a large comment on the bug | 14:33 |
bauzas | I think CERN is triggering evacuations before verifying the host status | 14:33 |
sean-k-mooney | bauzas: i don't think we talked about this either | 14:34 |
bauzas | as I said, I feel healthchecks can help them getting a better decision-making about whether they need to evacuate or not | 14:34 |
sean-k-mooney | bauzas: we talked about a related issue with allocations that also impacts evacuate | 14:34 |
bauzas | sean-k-mooney: yeah, my confusion, I originally thought it was about the back-and-forth about evacuate we discussed for pain points | 14:35 |
sean-k-mooney | e.g. if for any reason we oversubscribe the allocations then we can't evacuate | 14:35 |
sean-k-mooney | i have not read it fully but it sounds like they are not properly fencing the node and ensuring the vm is not running | 14:36 |
sean-k-mooney | before evacuating, if they are having application data corruption | 14:36 |
bauzas | that's literally what I wrote. | 14:41 |
bauzas | anyway, moving to a new bug. | 14:41 |
sean-k-mooney | ack, as i said i have not finished reading the bug description or comments, so glad we agree :) | 14:42 |
kashyap | frickler: Ah-ha, noted. Good news: there's already some response from two QEMU devs, with a patch in a newer version :) | 14:51 |
frickler | kashyap: yeah I just responded, but I didn't see the patch reference. going from 32M to 1G really sounds a bit excessive, would be good to be able to tune that | 14:55 |
kashyap | (Well, I don't quite think it's "good news" ...) | 14:55 |
kashyap | frickler: Sorry, I was referring to the commit that DanPB pointed out - https://gitlab.com/qemu-project/qemu/-/commit/600e17b26 | 14:55 |
frickler | kashyap: ah, yes, that seems to be the patch that triggers this, I thought you were referring to a fix in a recent commit | 14:56 |
kashyap | frickler: Yeah, that increase is a tad too much. | 14:56 |
kashyap | frickler: Yes, poor phrasing on my part. | 14:56 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Remove SESSION_CONFIGURED global from DB fixture https://review.opendev.org/c/openstack/nova/+/815689 | 14:56 |
frickler | kashyap: otoh that also is likely to explain why tests seemed to be going faster on Bullseye than on Focal | 14:57 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Refactor Database fixture https://review.opendev.org/c/openstack/nova/+/815690 | 14:58 |
kashyap | frickler: Interesting; what tests are going faster? | 14:59 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Fix interference in db unit test https://review.opendev.org/c/openstack/nova/+/814735 | 14:59 |
gibi | stephenfin: ^^ here is the removal of the global SESSION_CONFIGURED flag from the DB fixture and some extra :D | 15:00 |
frickler | kashyap: I didn't check in particular, but the whole tempest-full job with --serial on Debian doesn't take much longer than with the default (-c 4 I think) on Focal | 15:02 |
kashyap | I see. | 15:02 |
gibi | melwitt: thanks a lot for the help explaining the global db transaction factory situation. I used your info to actually remove SESSION_CONFIGURED from our fixture along with the unit test fixes | 15:05 |
kashyap | frickler: So, it is tunable via command-line, but it's not wired up in libvirt yet, though. | 15:05 |
kashyap | frickler: See the option: -accel tcg,tb-size=$value_in_MiB | 15:05 |
kashyap | "tb-size" in the man page | 15:06 |
frickler | kashyap: as long as libvirt doesn't support it, I fear that won't help much. might be good to cap it to something like 50% of the VM memory | 15:10 |
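frickler's capping suggestion (tb-size as some fraction of guest memory, rather than QEMU 5.0's fixed 1 GiB default) can be expressed as a small helper. This is a sketch of the heuristic only; the function name, default fraction, and bounds are assumptions, not anything QEMU or libvirt implements.

```python
# Sketch of the capping idea: derive a TCG tb-size (in MiB) from guest
# memory instead of using a fixed default. Bounds are illustrative:
# the old 32 MiB QEMU default as the floor, the new 1 GiB as the ceiling.
def tb_size_mib(guest_mem_mib, fraction=0.5, floor=32, ceiling=1024):
    """Return a tb-size capped at `fraction` of guest memory, clamped
    between `floor` and `ceiling` MiB."""
    return max(floor, min(ceiling, int(guest_mem_mib * fraction)))
```

For a 2048 MiB guest this yields 1024 MiB, and for a 512 MiB guest 256 MiB, matching the "about 50% of the VM memory" idea; the resulting value would go into `-accel tcg,tb-size=<value>` once libvirt wires the knob up.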
kashyap | frickler: Right; libvirt just didn't wire it up ... we can meanwhile do a nasty hack of uploading a QEMU binary to the CI system w/ this param tweaked | 15:11 |
kashyap | frickler: Do you have the appetite to file a libvirt upstream RFE? (Then I can clone it downstream, and get it triaged) | 15:11 |
frickler | kashyap: I think I'll do a local test with a reduced default tb-size first in order to be certain that that's the cause. but not before tomorrow | 15:13 |
kashyap | Right, no rush at all | 15:13 |
gibi | bauzas: replied in https://bugs.launchpad.net/nova/+bug/1947753 I think _destroy_evacuated_instances is not called periodically | 15:14 |
kashyap | frickler: So I see that someone else has raised this upstream last year: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg05235.html (TB Cache size grows out of control with qemu 5.0) | 15:16 |
bauzas | gibi: indeed, only when restarting | 15:22 |
bauzas | did I say it the other way? | 15:22 |
kashyap | frickler: So, this worked for me: | 15:22 |
kashyap | -machine q35 | 15:22 |
kashyap | -accel tcg,tb-size=256 | 15:22 |
kashyap | (As an example) | 15:23 |
gibi | bauzas: at least I understood this sentence that way "Either way, if the service continues to run, it verifies the evacuation status periodically and deletes the host." | 15:25 |
gibi | bauzas: btw, about https://bugs.launchpad.net/nova/+bug/1947687 I cannot formulate a logstash signature; it seems that this error happens in a lot of cases where no test cases are failing, so I get a lot of false positives | 15:27 |
bauzas | gibi: okay, then my brain fucked | 15:27 |
kashyap | frickler: For reference, a minimal command-line: | 15:27 |
kashyap | $> qemu-kvm -display none -cpu Nehalem -no-user-config -machine q35 -accel tcg,tb-size=256 -nodefaults -m 2048 -serial stdio -drive file=/export/vm1.qcow2,format=qcow2,if=virtio | 15:27 |
bauzas | gibi: ack for the logstash thing, no worries | 15:28 |
frickler | kashyap: thx, added a comment to the issue, seems the libvirt path is really the most promising one | 15:34 |
kashyap | frickler: Definitely. Please file the RFE (and post me a link; Bugzilla CCs take me longer to process) when you can | 15:39 |
kashyap | Thanks for the patience :) | 15:39 |
melwitt | bauzas: hi, could you pls take a look at these train backports when you get a chance? someone posted a comment on the top patch yesterday indicating they are awaiting merge of the fixes https://review.opendev.org/q/topic:%2522bug/1927677%2522+branch:stable/train+status:open | 15:49 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813437 | 15:49 |
bauzas | melwitt: ack, doing it now | 15:49 |
melwitt | thanks! | 15:50 |
bauzas | melwitt: I already reviewed them but forgot to submit, my bad | 15:51 |
bauzas | now this is fixed. | 15:52 |
melwitt | bauzas: a-ha, thank you | 15:59 |
stephenfin | gibi: question on https://review.opendev.org/c/openstack/nova/+/815690 | 16:13 |
stephenfin | please excuse my ignorance | 16:13 |
gibi | looking | 16:14 |
gibi | stephenfin: you are right something is fishy there | 16:23 |
gibi | I have to go back and poke that test to understand what is happening | 16:23 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM:goat https://review.opendev.org/c/openstack/nova/+/815705 | 16:32 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat 2 https://review.opendev.org/c/openstack/nova/+/815706 | 16:32 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat3 https://review.opendev.org/c/openstack/nova/+/815707 | 16:32 |
gibi | gmann: I did the change you requested in https://review.opendev.org/c/openstack/tempest/+/809168/comment/35477e85_10754ba5/ but I'm wondering why we need that indirection | 16:34 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat 2 https://review.opendev.org/c/openstack/nova/+/815706 | 16:36 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat3 https://review.opendev.org/c/openstack/nova/+/815707 | 16:36 |
em_ | are there currently issues with xena nova and (debian) cloud images? Neither my ssh keys nor the admin password seems to get applied. Any open bugs (maybe libvirt or kernel related)? using 5.10 debian bullseye as host, kolla xena (ubuntu/source) as libvirt | 17:16 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Refactor Database fixture https://review.opendev.org/c/openstack/nova/+/815690 | 17:18 |
gibi | stephenfin: you had a valid point, fixed it ^^ | 17:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Fix interference in db unit test https://review.opendev.org/c/openstack/nova/+/814735 | 17:20 |
gmann | gibi: replied, basically Tempest tests the services with what is configured to be tested, instead of 'test what the cloud/service APIs return' | 17:21 |
gmann | autodetecting which service features/extensions to test can hide errors. | 17:22 |
gibi | gmann: OK, I think I got it. Does devstack need to be changed to write the extension names into the tempest config? | 17:26 |
gmann | gibi: we do that, like master tests with 'All' (enable everything) and stable branches are pinned to the extension list at the time the stable branch is released. like this - https://review.opendev.org/c/openstack/devstack/+/811485 | 17:28 |
gmann | for now, on master we do not need to do anything on the devstack side | 17:28 |
gibi | gmann: ack, thanks for the help and explanation | 17:30 |
gmann | I will review the tempest patch once gate result is finished | 17:32 |
gmann | thanks for the update | 17:32 |
opendevreview | Merged openstack/nova master: Ensure MAC addresses characters are in the same case https://review.opendev.org/c/openstack/nova/+/811947 | 18:03 |
opendevreview | Merged openstack/nova master: Fix instance's image_ref lost on failed unshelving https://review.opendev.org/c/openstack/nova/+/807551 | 18:29 |
Zer0Byte | hey | 19:14 |
Zer0Byte | question | 19:14 |
Zer0Byte | I'm using the cinder front-end option to perform QoS at the storage with the spec total_iops_sec_per_gb=3 | 19:15 |
Zer0Byte | it's working great | 19:15 |
Zer0Byte | but after extending the volume, the total_iops_sec property on the KVM template is not updated | 19:15 |
Zer0Byte | is that normal? | 19:15 |
EugenMayer | Hello. Is anybody else having trouble with (Xena) bootstrapping a debian 11 (generic cloud) or debian 10 (openstack variant) image and not being able to pre-deploy an ssh key or even a root password? Looking at the logs, it always prints that there is no suitable ssh key to deploy. Tried it with an RSA or an ED key, no luck. Any hints? | 19:37 |
EugenMayer | The boot log looks like this: https://gist.github.com/EugenMayer/452de9229e8f47dad0fadb4f8774d482 | 19:39 |
clarkb | EugenMayer: are you booting it with the proper flag to assign a nova ssh key to the instance? | 20:21 |
clarkb | Also if cloud-init can't reach the nova metadata service this might happen. You might try using a config drive if it isn't already | 20:22 |
Zer0Byte | no one with the issue of refreshing the KVM volume IOPS? | 23:52 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!