*** dpawlik3 is now known as dpawlik | 08:34 | |
gibi | good day nova | 08:56 |
* kashyap waves | 09:09 | |
bauzas | hola folks | 09:17 |
gibi | o/ | 09:20 |
gibi | -another day in downstream land- | 09:20 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813419 | 09:34 |
gibi | stephenfin: do you recall why you needed to configure both db in a single Database fixture in https://review.opendev.org/c/openstack/nova/+/799526/5/nova/tests/fixtures/nova.py#612 ? | 10:58 |
gibi | I think our tests are using a separate fixture instance for each db (api, main) | 10:59 |
stephenfin | gibi: I'm not 100% sure but my guess is that I don't, and that was simply for expediency/laziness :) If SESSION_CONFIGURED became a mapping of DB type to "is configured" bool, we probably wouldn't need that | 11:00 |
stephenfin | *we don't | 11:00 |
gibi | I see, so we had a single global flag but two dbs to configure | 11:01 |
stephenfin | Yeah, I think so | 11:01 |
gibi | OK, if that is the only reason then I think I have a way to remove that global flag (based on melwitt's idea) with patch_factory from oslo_db | 11:02 |
gibi | it is now pretty confusing that we have two Database fixtures instantiated, one for main and one for api, but the first one configures both dbs | 11:03 |
stephenfin | yeah, tbc it could be more complicated than that but I really doubt it | 11:04 |
gibi | yeah, let's see if my idea works | 11:08 |
sean-k-mooney | stephenfin: since you're about, here's an easy one for you https://review.opendev.org/c/openstack/nova/+/811947 think we can get that over the line? | 11:12 |
stephenfin | sure, will look now | 11:13 |
sean-k-mooney | thanks :) | 11:14 |
sean-k-mooney | stephenfin: based on the PTG discussion would you mind removing your -2 on https://review.opendev.org/c/openstack/nova/+/804292 ? I'm going to rebase that and the autopep8 one shortly | 11:19 |
frickler | kashyap: I didn't make progress with reproduction without nova yet, so I created https://gitlab.com/qemu-project/qemu/-/issues/693 for now. let me know if you need additional data there | 13:05 |
kashyap | frickler: Thanks for the report. A quick one is: were you using nested setup, or was this DevStack instance on a baremetal host (<shudder>)? | 13:08 |
kashyap | A rule of thumb is to always explicitly state it if you're using a nested setup | 13:09 |
kashyap | frickler: Can you edit the report to state that "deploy DevStack in a VM?" So that an unsuspecting dev won't run it on their baremetal laptop and wreak havoc... | 13:09 |
kashyap | I'll add a quick comment there, actually | 13:10 |
kashyap | Done | 13:17 |
kashyap | frickler: I'll check about it w/ a TCG dev | 13:17 |
frickler | kashyap: yes, nested is correct, I added that to the description. though I could also reproduce on a baremetal host if you assume it would behave differently | 13:25 |
kashyap | frickler: No, no need for baremetal. VMs are best. Can you also post the QEMU command-line of the DevStack VM itself? (The level-1 VM) | 13:37 |
frickler | kashyap: no, I have no admin access to the cloud it is running on. I'm assuming it will essentially look the same, though, just with accel=kvm | 13:39 |
kashyap | Hmm, not sure if it'll be that same w/ accel=kvm. The details would change quite a bit. The "host" (guest hypervisor) setup can determine the guest behaviour here a lot. | 13:43 |
kashyap | That's one of the questions I'd expect from a TCG dev | 13:43 |
frickler | kashyap: o.k., I'll try running cirros locally without devstack in between, that would give the simplest setup in the end | 13:45 |
kashyap | frickler: Sure; yeah, that'd be the best. The shorter the route to the reproducer, the more likely we can get to the root cause | 13:47 |
kashyap | frickler: Thanks for all the testing! It's a pain, I know | 13:47 |
*** kopecmartin is now known as kopecmartin|pto | 14:00 | |
frickler | kashyap: that went easier than I expected, updated the issue | 14:14 |
bauzas | gibi: sean-k-mooney: I think we said https://bugs.launchpad.net/nova/+bug/1947753 is valid during our PTG, right? | 14:19 |
gibi | bauzas: I don't remember discussing this | 14:21 |
bauzas | gibi: sorry you're maybe right | 14:21 |
gibi | what I see is that in the bug they evacuate instances without restarting the compute node | 14:21 |
bauzas | I originally thought this was about evacuate/evacuateback/evacuate | 14:21 |
bauzas | adding a comment | 14:22 |
gibi | so far we said that you can only evacuate if you make sure that the compute is dead | 14:23 |
gibi | in the bug case the compute was halted / stuck, the heartbeat was missing so the service was considered down, nova allowed evacuation, then the compute recovered without the nova-compute service restarted | 14:23 |
gibi | nova-compute only cleans up evacuated instance during init_host but does not do it periodically | 14:24 |
gibi | so in this case the evacuated instance was not cleaned up on the source leading to duplicated instances causing corruption | 14:24 |
gibi | option a) change nova-compute to clean up evacuated instance in a periodic | 14:25 |
gibi | option b) change the evac API to only allow evacuation if the compute is forced down (meaning the admin made sure the host is fenced) | 14:26 |
gibi | option c) declare the current bug as user error as the nova-compute was not restarted as part of the compute node recovery | 14:27 |
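Option a) above hinges on the detail gibi describes: _destroy_evacuated_instances runs only from init_host, so a compute that recovers without a service restart never cleans up. A minimal sketch of the idea, with toy data structures and method names that only echo (not reproduce) nova's internals:

```python
# Hypothetical sketch of option (a): run the evacuation cleanup both at
# startup and on a timer, instead of only during init_host. The class,
# state model, and method names are illustrative assumptions.
import time


class ComputeManager:
    def __init__(self):
        # toy local view of instances on this host
        self.local_instances = {"inst-1": "evacuated", "inst-2": "active"}

    def _destroy_evacuated_instances(self):
        # Remove local copies of instances evacuated away while this host
        # was down/stuck, so the same instance cannot run on two hosts.
        evacuated = [uuid for uuid, state in self.local_instances.items()
                     if state == "evacuated"]
        for uuid in evacuated:
            del self.local_instances[uuid]
        return evacuated

    def init_host(self):
        # today: the only place the cleanup happens
        self._destroy_evacuated_instances()

    def periodic_cleanup(self, interval=60):
        # option (a): also run the cleanup periodically, so a recovered but
        # never-restarted compute still drops stale evacuated instances
        while True:
            self._destroy_evacuated_instances()
            time.sleep(interval)
```

The sketch only illustrates why the periodic variant closes the gap in the bug: the cleanup becomes independent of a service restart.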
bauzas | gibi: I wrote a large comment on the bug | 14:33 |
bauzas | I think CERN is triggering evacuations before verifying the host status | 14:33 |
sean-k-mooney | bauzas: i don't think we talked about this either | 14:34 |
bauzas | as I said, I feel healthchecks can help them getting a better decision-making about whether they need to evacuate or not | 14:34 |
sean-k-mooney | bauzas: we talked about a related issue with allocations that also impacts evacuate | 14:34 |
bauzas | sean-k-mooney: yeah, my confusion, I originally thought it was about the back-and-forth about evacuate we discussed for pain points | 14:35 |
sean-k-mooney | e.g. if for any reason we oversubscribe the allocations then we can't evacuate | 14:35 |
sean-k-mooney | i have not read it fully but it sounds like they are not properly fencing the node and ensuring the vm is not running | 14:36 |
sean-k-mooney | before evacuating, if they are having application data corruption | 14:36 |
bauzas | that's literally what I wrote. | 14:41 |
bauzas | anyway, moving to a new bug. | 14:41 |
sean-k-mooney | ack, as i said i have not finished reading the bug description or comments, so glad we agree :) | 14:42 |
kashyap | frickler: Ah-ha, noted. Good news: there's already some response from two QEMU devs, with a patch in a newer version :) | 14:51 |
frickler | kashyap: yeah I just responded, but I didn't see the patch reference. going from 32M to 1G really sounds a bit excessive, would be good to be able to tune that | 14:55 |
kashyap | (Well, I don't quite think it's "good news" ...) | 14:55 |
kashyap | frickler: Sorry, I was referring to the commit that DanPB pointed out - https://gitlab.com/qemu-project/qemu/-/commit/600e17b26 | 14:55 |
frickler | kashyap: ah, yes, that seems to be the patch that triggers this, I thought you were referring to a fix in a recent commit | 14:56 |
kashyap | frickler: Yeah, that increase is a tad too much. | 14:56 |
kashyap | frickler: Yes, poor phrasing on my part. | 14:56 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Remove SESSION_CONFIGURED global from DB fixture https://review.opendev.org/c/openstack/nova/+/815689 | 14:56 |
frickler | kashyap: otoh that also is likely to explain why tests seemed to be going faster on Bullseye than on Focal | 14:57 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Refactor Database fixture https://review.opendev.org/c/openstack/nova/+/815690 | 14:58 |
kashyap | frickler: Interesting; what tests are going faster? | 14:59 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Fix interference in db unit test https://review.opendev.org/c/openstack/nova/+/814735 | 14:59 |
gibi | stephenfin: ^^ here is the removal of the global SESSION_CONFIGURED flag from the DB fixture and some extra :D | 15:00 |
frickler | kashyap: I didn't check in particular, but the whole tempest-full job with --serial on Debian doesn't take much longer than with the default (-c 4 I think) on Focal | 15:02 |
kashyap | I see. | 15:02 |
gibi | melwitt: thanks a lot for the help explaining the global db transaction factory situation. I used your info to actually remove SESSION_CONFIGURED from our fixture along with the unit test fixes | 15:05 |
kashyap | frickler: So, it is tunable via command-line, but it's not wired up in libvirt yet, though. | 15:05 |
kashyap | frickler: See the option: -accel tcg,tb-size=$value_in_MiB | 15:05 |
kashyap | "tb-size" in the man page | 15:06 |
frickler | kashyap: as long as libvirt doesn't support it, I fear that won't help much. might be good to cap it to something like 50% of the VM memory | 15:10 |
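frickler's capping suggestion (tb-size as some fraction of guest memory, rather than QEMU 5.0's fixed 1 GiB default) can be expressed as a small helper. This is a sketch of the heuristic only; the function name, default fraction, and bounds are assumptions, not anything QEMU or libvirt implements.

```python
# Sketch of the capping idea: derive a TCG tb-size (in MiB) from guest
# memory instead of using a fixed default. Bounds are illustrative:
# the old 32 MiB QEMU default as the floor, the new 1 GiB as the ceiling.
def tb_size_mib(guest_mem_mib, fraction=0.5, floor=32, ceiling=1024):
    """Return a tb-size capped at `fraction` of guest memory, clamped
    between `floor` and `ceiling` MiB."""
    return max(floor, min(ceiling, int(guest_mem_mib * fraction)))
```

For a 2048 MiB guest this yields 1024 MiB, and for a 512 MiB guest 256 MiB, matching the "about 50% of the VM memory" idea; the resulting value would go into `-accel tcg,tb-size=<value>` once libvirt wires the knob up.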
kashyap | frickler: Right; libvirt just didn't wire it up ... we can meanwhile do a nasty hack of uploading a QEMU binary to the CI system w/ this param tweaked | 15:11 |
kashyap | frickler: Do you have the appetite to file a libvirt upstream RFE? (Then I can clone it downstream, and get it triaged) | 15:11 |
frickler | kashyap: I think I'll do a local test with a reduced default tb-size first in order to be certain that that's the cause. but not before tomorrow | 15:13 |
kashyap | Right, no rush at all | 15:13 |
gibi | bauzas: replied in https://bugs.launchpad.net/nova/+bug/1947753 I think _destroy_evacuated_instances is not called periodically | 15:14 |
kashyap | frickler: So I see that someone else has raised this upstream last year: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg05235.html (TB Cache size grows out of control with qemu 5.0) | 15:16 |
bauzas | gibi: indeed, only when restarting | 15:22 |
bauzas | did I say it the other way? | 15:22 |
kashyap | frickler: So, this worked for me: | 15:22 |
kashyap | -machine q35 | 15:22 |
kashyap | -accel tcg,tb-size=256 | 15:22 |
kashyap | (As an example) | 15:23 |
gibi | bauzas: at least I understood this sentence that way "Either way, if the service continues to run, it verifies the evacuation status periodically and deletes the host." | 15:25 |
gibi | bauzas: btw, about https://bugs.launchpad.net/nova/+bug/1947687 I cannot formulate a logstash signature; it seems that this error happens in a lot of cases where no test cases are failing, so I get a lot of false positives | 15:27 |
bauzas | gibi: okay, then my brain fucked | 15:27 |
kashyap | frickler: For reference, a minimal command-line: | 15:27 |
kashyap | $> qemu-kvm -display none -cpu Nehalem -no-user-config -machine q35 -accel tcg,tb-size=256 -nodefaults -m 2048 -serial stdio -drive file=/export/vm1.qcow2,format=qcow2,if=virtio | 15:27 |
bauzas | gibi: ack for the logstash thing, no worries | 15:28 |
frickler | kashyap: thx, added a comment to the issue, seems the libvirt path is really the most promising one | 15:34 |
kashyap | frickler: Definitely. Please file the RFE (and post me a link; Bugzilla CCs take me longer to process) when you can | 15:39 |
kashyap | Thanks for the patience :) | 15:39 |
melwitt | bauzas: hi, could you pls take a look at these train backports when you get a chance? someone posted a comment on the top patch yesterday indicating they are awaiting merge of the fixes https://review.opendev.org/q/topic:%2522bug/1927677%2522+branch:stable/train+status:open | 15:49 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813437 | 15:49 |
bauzas | melwitt: ack, doing it now | 15:49 |
melwitt | thanks! | 15:50 |
bauzas | melwitt: I already reviewed them but forgot to submit, my bad | 15:51 |
bauzas | now this is fixed. | 15:52 |
melwitt | bauzas: a-ha, thank you | 15:59 |
stephenfin | gibi: question on https://review.opendev.org/c/openstack/nova/+/815690 | 16:13 |
stephenfin | please excuse my ignorance | 16:13 |
gibi | looking | 16:14 |
gibi | stephenfin: you are right something is fishy there | 16:23 |
gibi | I have to go back and poke that test to understand what is happening | 16:23 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM:goat https://review.opendev.org/c/openstack/nova/+/815705 | 16:32 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat 2 https://review.opendev.org/c/openstack/nova/+/815706 | 16:32 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat3 https://review.opendev.org/c/openstack/nova/+/815707 | 16:32 |
gibi | gmann: I did the change you requested in https://review.opendev.org/c/openstack/tempest/+/809168/comment/35477e85_10754ba5/ but I'm wondering why we need that indirection | 16:34 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat 2 https://review.opendev.org/c/openstack/nova/+/815706 | 16:36 |
opendevreview | Artom Lifshitz proposed openstack/nova master: DNM: goat3 https://review.opendev.org/c/openstack/nova/+/815707 | 16:36 |
em_ | are there currently issues with xena nova and (debian) cloud images? Neither my ssh keys nor the admin password seems to get applied. Any open bugs (maybe libvirt or kernel related)? using 5.10 debian bullseye as host, kolla xena (ubuntu/source) as libvirt | 17:16 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Refactor Database fixture https://review.opendev.org/c/openstack/nova/+/815690 | 17:18 |
gibi | stephenfin: you had a valid point, fixed it ^^ | 17:19 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Fix interference in db unit test https://review.opendev.org/c/openstack/nova/+/814735 | 17:20 |
gmann | gibi: replied, basically Tempest tests the services with what is configured to be tested, instead of 'test what the cloud/service APIs return' | 17:21 |
gmann | autodetecting which service features/extensions to test can hide errors. | 17:22 |
gibi | gmann: OK, I think I got it. Does devstack need to be changed to write the extension names into the tempest config? | 17:26 |
gmann | gibi: we do that, like master tests with 'All' (enable everything) and stable branches are pinned to the extension list at the time the stable branch is released. like this - https://review.opendev.org/c/openstack/devstack/+/811485 | 17:28 |
gmann | for now, on master we do not need to do anything on the devstack side | 17:28 |
gibi | gmann: ack, thanks for the help and explanation | 17:30 |
gmann | I will review the tempest patch once gate result is finished | 17:32 |
gmann | thanks for the update | 17:32 |
opendevreview | Merged openstack/nova master: Ensure MAC addresses characters are in the same case https://review.opendev.org/c/openstack/nova/+/811947 | 18:03 |
opendevreview | Merged openstack/nova master: Fix instance's image_ref lost on failed unshelving https://review.opendev.org/c/openstack/nova/+/807551 | 18:29 |
Zer0Byte | hey | 19:14 |
Zer0Byte | question | 19:14 |
Zer0Byte | I'm using the cinder front-end option to perform QoS at the storage with the spec total_iops_sec_per_gb=3 | 19:15 |
Zer0Byte | it's working great | 19:15 |
Zer0Byte | but after extending the volume, the total_iops_sec property on the KVM template is not updated | 19:15 |
Zer0Byte | is that normal? | 19:15 |
EugenMayer | Hello. Is anybody else having trouble with (Xena) bootstrapping a debian 11 (generic cloud) or debian 10 (openstack variant) image and not being able to pre-deploy an ssh key or even a root password? Looking at the logs, it always prints that there is no suitable ssh key to deploy. Tried it with an RSA or an ED key, no luck. Any hints? | 19:37 |
EugenMayer | The boot log looks like this: https://gist.github.com/EugenMayer/452de9229e8f47dad0fadb4f8774d482 | 19:39 |
clarkb | EugenMayer: are you booting it with the proper flag to assign a nova ssh key to the instance? | 20:21 |
clarkb | Also if cloud-init can't reach the nova metadata service this might happen. You might try using a config drive if it isn't already | 20:22 |
Zer0Byte | no one with the issue of refreshing the KVM volume IOPS? | 23:52 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!