opendevreview | melanie witt proposed openstack/nova master: Add functional regression test for bug 1853009 https://review.opendev.org/c/openstack/nova/+/695012 | 00:02 |
opendevmeet | bug 1853009 in OpenStack Compute (nova) ussuri "Ironic node rebalance race can lead to missing compute nodes in DB" [High,In progress] https://launchpad.net/bugs/1853009 - Assigned to Mark Goddard (mgoddard) | 00:02 |
opendevreview | melanie witt proposed openstack/nova master: Clear rebalanced compute nodes from resource tracker https://review.opendev.org/c/openstack/nova/+/695187 | 00:02 |
opendevreview | melanie witt proposed openstack/nova master: Invalidate provider tree when compute node disappears https://review.opendev.org/c/openstack/nova/+/695188 | 00:02 |
opendevreview | melanie witt proposed openstack/nova master: Prevent deletion of a compute node belonging to another host https://review.opendev.org/c/openstack/nova/+/694802 | 00:02 |
opendevreview | melanie witt proposed openstack/nova master: Fix inactive session error in compute node creation https://review.opendev.org/c/openstack/nova/+/695189 | 00:02 |
*** martinkennelly has quit IRC | 00:09 | |
*** martinkennelly_ has quit IRC | 00:09 | |
*** rloo has quit IRC | 00:21 | |
*** swp20 has joined #openstack-nova | 00:42 | |
*** LinPeiWen has joined #openstack-nova | 01:04 | |
*** bhagyashris_ has joined #openstack-nova | 01:12 | |
*** bhagyashris has quit IRC | 01:20 | |
*** liuyulong has joined #openstack-nova | 01:25 | |
*** spatel has joined #openstack-nova | 02:05 | |
*** swp20 is now known as wenpingsong | 02:27 | |
*** spatel has quit IRC | 02:51 | |
*** spatel has joined #openstack-nova | 02:54 | |
*** brinzhang_ has joined #openstack-nova | 03:07 | |
*** brinzhang0 has quit IRC | 03:14 | |
*** spatel has quit IRC | 03:22 | |
*** LinPeiWen has quit IRC | 04:07 | |
*** abhishekk has joined #openstack-nova | 04:48 | |
*** opendevreview has quit IRC | 05:26 | |
*** vishalmanchanda has joined #openstack-nova | 05:35 | |
*** bhagyashris_ is now known as bhagyashris | 06:08 | |
*** LinPeiWen has joined #openstack-nova | 06:21 | |
masterpe[m] | I have more resources listed in the placement.allocations table than are actually in use. I see some instance IDs there that are deleted. | 06:39 |
masterpe[m] | What shall I do with them? | 06:39 |
frickler | masterpe[m]: seems this tool was made for you https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement-audit | 06:46 |
frickler | or maybe the heal-allocations above | 06:46 |
masterpe[m] | nice thanks | 06:52 |
*** ralonsoh has joined #openstack-nova | 06:54 | |
*** rpittau|afk is now known as rpittau | 06:54 | |
*** tosky has joined #openstack-nova | 07:03 | |
*** david-lyle has joined #openstack-nova | 07:04 | |
*** lucasagomes has joined #openstack-nova | 07:07 | |
*** dklyle has quit IRC | 07:08 | |
*** andrewbonney has joined #openstack-nova | 07:20 | |
*** luksky has joined #openstack-nova | 07:29 | |
*** alex_xu has joined #openstack-nova | 07:58 | |
*** kashyap has joined #openstack-nova | 07:59 | |
*** wenpingsong has quit IRC | 08:05 | |
lyarwood | kashyap: https://review.opendev.org/c/openstack/nova/+/795533/5#message-3e7b1dc4d3f8d22b2c9a55637c7366199d00eb56 - I'll write this up in a Nova bug and likely libvirt bug later today but would you mind scanning this if you get a chance? | 08:06 |
*** derekh has joined #openstack-nova | 08:06 | |
kashyap | lyarwood: Mornin; /me clicks | 08:06 |
kashyap | lyarwood: Ah, you're debugging the informative error, "reason=failed" | 08:07 |
lyarwood | there's two parts really | 08:07 |
lyarwood | yeah the reason=failed thing | 08:07 |
lyarwood | and the way that the call to virDomainMigrateToURI3 from the source doesn't pick up the failure on the dest | 08:08 |
kashyap | Hmm, yikes | 08:09 |
lyarwood | I'm assuming that the dest tears down the connection between the two leading to the eventual `unable to connect to server` error | 08:09 |
lyarwood | instead of the source being told the migration has failed etc | 08:10 |
kashyap | lyarwood: This is the dest, right: https://zuul.opendev.org/t/openstack/build/f3b829801901417c9310ad5cc5a0e886/log/controller/logs/libvirt/libvirtd_log.txt | 08:10 |
lyarwood | yeah | 08:10 |
kashyap | Is it just me or is it loading deadly slow? I just want to pull down the entire raw file | 08:11 |
lyarwood | yeah these files are pretty large, I typically pull them down now | 08:11 |
kashyap | My browser is hung here (it probably runs into MBs). Do I click on the "View log" to get the raw link? | 08:12 |
lyarwood | it's awkward as the raw links don't use .gz at the end of the file names | 08:12 |
kashyap | Maybe I can simply `wget` the above URL - /me tries | 08:12 |
lyarwood | mv it to a .gz file and either gunzip or let vim unpack them btw | 08:12 |
kashyap | Right, but it _is_ a .gz file - a new user will only discover that after their browser crashes :D | 08:12 |
kashyap | lyarwood: Yeah; that's become "muscle memory" now | 08:12 |
lyarwood | browsers can handle .gz | 08:12 |
kashyap | Oh, sure; sometimes a very large file has just crashed FF for me | 08:14 |
kashyap | Actual link to `wget` is: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f3b/795533/5/check/nova-next/f3b8298/controller/logs/libvirt/libvirtd_log.txt | 08:15 |
kashyap | 93MB only | 08:15 |
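As an aside for anyone reproducing the log-fetching dance above: the chat suggests the raw Zuul log objects are stored gzip-compressed even though the URLs carry no .gz suffix, which is what trips browsers up. A minimal, stdlib-only Python sketch of the wget/mv/gunzip sequence lyarwood describes, assuming that compression; the URL is the one kashyap pasted and the local filenames are arbitrary:

```python
# Download the ~93MB gzip-compressed libvirtd log and unpack it
# locally instead of letting a browser choke on it.
import gzip
import shutil
import urllib.request

URL = ("https://storage.gra.cloud.ovh.net/v1/"
       "AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f3b/"
       "795533/5/check/nova-next/f3b8298/controller/logs/libvirt/"
       "libvirtd_log.txt")

# urllib does not transparently decompress, so this saves the raw
# gzip bytes, i.e. the same thing wget produces.
with urllib.request.urlopen(URL) as resp, \
        open("libvirtd_log.txt.gz", "wb") as out:
    shutil.copyfileobj(resp, out)

# Equivalent of "mv libvirtd_log.txt libvirtd_log.txt.gz && gunzip ...".
with gzip.open("libvirtd_log.txt.gz", "rb") as src, \
        open("libvirtd_log.txt", "wb") as dst:
    shutil.copyfileobj(src, dst)
```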
*** martinkennelly has joined #openstack-nova | 08:28 | |
*** martinkennelly_ has joined #openstack-nova | 08:28 | |
kashyap | lyarwood: Two quick things: is this reproducible? Or is this the first time you've noticed it? | 08:32 |
lyarwood | It's the first time I've seen this | 08:32 |
kashyap | lyarwood: Also, strangely, I don't see the migrateToURI3() failure in the source libvirtd log: | 08:32 |
kashyap | I pulled the 113MB file from here: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f3b/795533/5/check/nova-next/f3b8298/compute1/logs/libvirt/libvirtd_log.txt | 08:32 |
kashyap | I wonder if it got log-rotated | 08:33 |
* lyarwood looks | 08:33 | |
*** opendevreview has joined #openstack-nova | 08:34 | |
opendevreview | Wenping Song proposed openstack/nova master: Replaces tenant_id with project_id from List/Update Servers APIs https://review.opendev.org/c/openstack/nova/+/764292 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replace all_tenants with all_projects in List Server APIs https://review.opendev.org/c/openstack/nova/+/765311 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replaces tenant_id with project_id from Rebuild Server API https://review.opendev.org/c/openstack/nova/+/766380 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replaces tenant_id with project_id from List SG API https://review.opendev.org/c/openstack/nova/+/766726 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replaces tenant_id with project_id from Flavor Access APIs https://review.opendev.org/c/openstack/nova/+/767704 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replaces tenant_id with project_id from List/Show usage APIs https://review.opendev.org/c/openstack/nova/+/768509 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replace tenants* with projects* of policies https://review.opendev.org/c/openstack/nova/+/765315 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replace os-simple-tenant-usage with os-simple-project-usage https://review.opendev.org/c/openstack/nova/+/768852 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replace tenant_id with project_id in os-quota-sets path https://review.opendev.org/c/openstack/nova/+/768851 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replace tenant_id with project_id in Limits API https://review.opendev.org/c/openstack/nova/+/768862 | 08:34 |
opendevreview | Wenping Song proposed openstack/nova master: Replace tenant* with project* in codes https://review.opendev.org/c/openstack/nova/+/769329 | 08:34 |
lyarwood | 93717 2021-06-10 06:41:58.982+0000: 58504: debug : qemuBlockJobEventProcessConcluded:1489 : handling job 'drive-virtio-disk0' state '3' newstate '0' | 08:38 |
lyarwood | 93718 2021-06-10 06:41:58.982+0000: 58504: debug : qemuBlockJobProcessEventConcludedCopyAbort:1250 : copy job 'drive-virtio-disk0' on VM 'instance-0000001d' aborted | 08:38 |
lyarwood | looks like the block job failed | 08:38 |
lyarwood | 93774 2021-06-10 06:41:59.429+0000: 58504: debug : qemuDomainObjSetJobPhase:9291 : Setting 'migration out' phase to 'confirm3_cancelled' | 08:39 |
kashyap | lyarwood: Yeah; that's a good find | 08:41 |
kashyap | The thing is - why is the connection refused? I want to say "firewall", but I don't think that's it | 08:41 |
lyarwood | didn't we have issues with test_live_block_migration_paused before? | 08:42 |
lyarwood | kashyap: I don't think it's refused | 08:42 |
lyarwood | kashyap: the migration just ends and the python libvirt lib is just incorrectly handling the failure | 08:42 |
kashyap | lyarwood: Oh, right: | 08:42 |
kashyap | It's because the mirroring was cancelled on the source: | 08:42 |
kashyap | --- | 08:42 |
kashyap | 2021-06-10 06:41:59.429+0000: 58504: debug : qemuDomainObjSetJobPhase:9291 : Setting 'migration out' phase to 'confirm3_cancelled' | 08:42 |
kashyap | 2021-06-10 06:41:59.429+0000: 58504: debug : qemuMigrationEatCookie:1483 : cookielen=0 cookie='<null>' | 08:42 |
kashyap | 2021-06-10 06:41:59.430+0000: 58504: debug : qemuMigrationSrcNBDCopyCancel:707 : Cancelling drive mirrors for domain instance-0000001d | 08:42 |
kashyap | 2021-06-10 06:41:59.430+0000: 58504: debug : qemuMigrationSrcNBDCopyCancelled:632 : All disk mirrors are gone | 08:42 |
kashyap | --- | 08:42 |
kashyap | Yeah, having it return a graceful error is a reasonable request | 08:43 |
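For readers without the nova driver in their head, a rough sketch (not nova's actual code) of the libvirt-python call being discussed: nova drives the live block migration through migrateToURI3(), and the complaint above is that when the destination side aborts, the source-side call surfaces only as a generic "unable to connect to server" libvirtError rather than a clear migration-failed error. The URIs, instance name and flag combination below are illustrative assumptions, not taken from the job config:

```python
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000001d')

# Roughly what a live *block* migration looks like through the python
# bindings; NON_SHARED_INC is what makes the disks get drive-mirrored.
params = {libvirt.VIR_MIGRATE_PARAM_URI: 'tcp://dest-host'}
flags = (libvirt.VIR_MIGRATE_LIVE |
         libvirt.VIR_MIGRATE_PEER2PEER |
         libvirt.VIR_MIGRATE_NON_SHARED_INC)

try:
    dom.migrateToURI3('qemu+tcp://dest-host/system', params, flags)
except libvirt.libvirtError as exc:
    # What the CI run above ends up reporting:
    # "unable to connect to server at '...:49152': Connection refused"
    print(exc.get_error_message())
```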
kashyap | lyarwood: On test_live_block_migration_paused - yes! | 08:43 |
lyarwood | https://review.opendev.org/c/openstack/nova/+/766720 | 08:43 |
kashyap | lyarwood: IIRC, it was because it was being tested on shared storage | 08:44 |
lyarwood | no I think in the end we were told that there were known issues with -drive and told to upgrade to a newer version of QEMU | 08:44 |
kashyap | lyarwood: Yep; that skip matches my understanding | 08:44 |
kashyap | So is the 'nova-next' job using Bionic? | 08:45 |
lyarwood | I need to head offline for a haircut for ~90mins but I think we need to flag this to the QEMU folks again | 08:45 |
lyarwood | I think it's focal | 08:45 |
lyarwood | https://zuul.opendev.org/t/openstack/build/f3b829801901417c9310ad5cc5a0e886/log/zuul-info/inventory.yaml#106 | 08:45 |
*** swp20 has joined #openstack-nova | 08:45 | |
lyarwood | yeah it's focal | 08:45 |
kashyap | lyarwood: I just mentioned to Dave Gilbert ... need to file a bug | 08:45 |
lyarwood | kk awesome | 08:45 |
lyarwood | brb in ~90 | 08:45 |
kashyap | I also need to head out for an errand shortly | 08:45 |
bauzas | can anyone give me a short summary about the pep8 gate issue ? saw https://review.opendev.org/c/openstack/nova/+/795533 | 08:47 |
gibi | bauzas: mypy 0.9 removed some third party type defs from the package, those need to be installed separately | 08:47 |
gibi | bauzas: we got hit by the missing paramiko typedefs after 0.9 | 08:48 |
bauzas | it was a minor version upgrade of mypy ? woah | 08:48 |
bauzas | the fix still has a -1 from Zuul, not related ? | 08:49 |
gibi | bauzas: the failing nova-next and nova-grenade-multinode jobs in that patch fail with things that I've seen before on master, so I rule them unrelated | 08:49 |
bauzas | gibi: can we just remove https://github.com/openstack/nova/blob/master/tox.ini#L57 for the moment ? | 08:49 |
gibi | bauzas: landing that removal needs to pass the same tests as the patch that is in the check queue | 08:50 |
gibi | if your tests are unstable then both are equally hard to land | 08:50 |
gibi | s/your/our/ | 08:51 |
bauzas | gibi: if we removed https://github.com/openstack/nova/blob/master/tox.ini#L57 then the jobs wouldn't be running | 08:52 |
gibi | bauzas: removing something from tox.ini still triggers nova-next and nova-grenade-multinode, doesn't it? | 08:52 |
bauzas | gibi: so, we could fix the gate issue *and then* try adding this change | 08:52 |
bauzas | gibi: good question, I honestly don't know | 08:52 |
stephenfin | lyarwood: artom: The reason 'mypy --install-types' wasn't enough is that that requires an existing mypy cache (.mypy_cache), which will only be created if you run mypy. So you'd have to run mypy, wait for it to potentially fail, run '--install-types', then run mypy again | 08:53 |
bauzas | gibi: lemme try to see this | 08:53 |
stephenfin | It worked locally because I had the cache already, but failed in the gate because it's a new env | 08:53 |
bauzas | gibi: stephenfin: so we could just remove mypy first, then try to get https://review.opendev.org/c/openstack/nova/+/795533 merged, and then add mypy back | 08:53 |
gibi | bauzas: go ahead | 08:54 |
bauzas | doing it now | 08:54 |
stephenfin | why can't https://review.opendev.org/c/openstack/nova/+/795533 merge? | 08:54 |
stephenfin | the requirements change has merged | 08:54 |
stephenfin | and that's the correct fix | 08:54 |
gibi | I don't want to block bauzas from trying another angle. If nova-next and nova-grenade-multinode do not trigger a tox.ini change then we might land the tox.ini change faster than the requirement change. | 08:55 |
gibi | s/trigger a/ trigger on a/ | 08:55 |
stephenfin | oh, I'm possibly missing context. Is there another issue now? | 08:55 |
stephenfin | i.e. with nova-next and nova-grenade-multinode ? | 08:56 |
gibi | we've needed to recheck the nova requirement patch a couple of times already as it always hits something in one of those jobs | 08:56 |
bauzas | stephenfin: the problem is that we run lots of jobs | 08:56 |
stephenfin | do we have a fix for those jobs? | 08:56 |
gibi | totally unrelated problems in unstable tests | 08:56 |
gibi | I don't think so | 08:56 |
bauzas | stephenfin: so I'm trying to see whether we could just remove the issue without running all of them | 08:57 |
kashyap | lyarwood: Oh, BTW -- when you're back: the "cancelled" in the QEMU logs has a special meaning for NBD. I forgot that I documented this myself upstream :D | 08:58 |
kashyap | lyarwood: See step (4) here: https://qemu.readthedocs.io/en/latest/interop/live-block-operations.html#qmp-invocation-for-live-storage-migration-with-drive-mirror-nbd | 08:58 |
bauzas | gibi: stephenfin: I guess we have a bug ? | 08:58 |
gibi | stephenfin: we have about a 10% failure rate on master in those two jobs | 08:58 |
opendevreview | Victor Coutellier proposed openstack/nova master: Allow configuration of direct-snapshot feature https://review.opendev.org/c/openstack/nova/+/794837 | 08:59 |
kashyap | lyarwood: In short, once the mirroring from src --> dest completes, and the _READY event is emitted, source libvirtd issues QMP `block-job-cancel` to gracefully end the mirroring. | 08:59 |
gibi | so I guess we are just unlucky with the req patch | 09:00 |
stephenfin | I think so | 09:00 |
stephenfin | fwiw though I wouldn't be in favour of simply dropping the requirements fix so we can merge other stuff, if the reason the requirements patch is failing is unrelated | 09:01 |
opendevreview | Sylvain Bauza proposed openstack/nova master: Removing mypy to fix the nova CI https://review.opendev.org/c/openstack/nova/+/795744 | 09:01 |
gibi | stephenfin: not dropping the req fix | 09:01 |
gibi | stephenfin: if it lands then we are done | 09:02 |
gibi | if bauzas's patch lands first, then we revert that when yours lands | 09:02 |
bauzas | gibi: stephenfin: patch is up for just removing mypy run until we fix the requirements | 09:02 |
stephenfin | that makes no sense to me though | 09:02 |
stephenfin | the mypy run is causing the other failures | 09:02 |
stephenfin | *isn't | 09:02 |
bauzas | stephenfin: sure | 09:03 |
stephenfin | so they have an equal chance of failing randomly | 09:03 |
bauzas | I'm just proposing this one because I guess nova-next WONT be running on my patch | 09:03 |
bauzas | it's just a tox.ini change | 09:03 |
stephenfin | but what does this achieve? | 09:03 |
gibi | stephenfin: not an equal chance iff the tox.ini change does not trigger the unstable jobs | 09:03 |
gibi | stephenfin: if it triggers the same jobs, then I agree that the mypy removal patch is pointless | 09:03 |
stephenfin | even if it doesn't, so what? | 09:04 |
bauzas | gibi: stephenfin: yup, indeed, if this runs the same jobs, then never mind my one | 09:04 |
bauzas | stephenfin: I was off yesterday but from what I've seen our gate is blocked | 09:04 |
bauzas | I just want to unblock it asap | 09:05 |
stephenfin | right, but it's blocked because of the flaky tests | 09:05 |
stephenfin | not because of the mypy thing | 09:05 |
bauzas | the clean fix requires a requirement bump | 09:05 |
stephenfin | we have a fix for the mypy thing | 09:05 |
bauzas | stephenfin: yup, I saw it | 09:05 |
bauzas | stephenfin: I'm just trying to see whether we can land things easier | 09:05 |
bauzas | again, unblocking the gate seems to me the most important | 09:06 |
bauzas | we could revert stuff later | 09:06 |
stephenfin | but you still won't be able to land anything that causes nova-next or nova-grenade-multinode to run? | 09:06 |
stephenfin | so it's not unblocked | 09:06 |
bauzas | stephenfin: that's a classic chicken-and-egg issue | 09:06 |
bauzas | you have 2 unrelated gate issues | 09:06 |
stephenfin | I mean, if they're flaky then they're flaky for everything, surely? | 09:06 |
bauzas | stephenfin: I don't disagree | 09:07 |
bauzas | 20% is a high rate of flakiness | 09:07 |
stephenfin | then we achieve nothing with this | 09:07 |
gibi | for me, moving from a full red due to the pep8 (mypy) failure to a flaky red due to unstable jobs is still progress | 09:07 |
bauzas | except we go from 100% of failures to a random 20% from what I understand | 09:07 |
bauzas | this | 09:07 |
bauzas | either way, I proposed but I don't have opinions | 09:08 |
bauzas | the last call, tho, is that I need to get my kids in a min | 09:08 |
bauzas | https://zuul.opendev.org/t/openstack/status#795744 | 09:09 |
bauzas | we don't run the flaky jobs | 09:10 |
* bauzas needs to disappear for a 10 min | 09:10 | |
bauzas | 30 mins roughly sorry | 09:10 |
stephenfin | idk, if the failure rate is that high that we can't land a simple reqs patch, that would suggest we need to be working on the other jobs. I see lyarwood has already been looking. I can start now too | 09:11 |
bauzas | stephenfin: yup, we need to parallelize efforts, I don't disagree | 09:11 |
bauzas | my patch is just a hack | 09:12 |
bauzas | and we need to consider the migrate and next jobs as the top prio | 09:12 |
bauzas | I absolutely don't disagree | 09:12 |
bauzas | and I also absolutely agree we need to land the types-paramiko changes | 09:12 |
bauzas | it's just, again, a way to unblock the gate even if it's flaky with the other jobs | 09:13 |
opendevreview | Victor Coutellier proposed openstack/nova master: Allow configuration of direct-snapshot feature https://review.opendev.org/c/openstack/nova/+/794837 | 09:13 |
* bauzas drops now | 09:13 | |
*** artom has quit IRC | 09:16 | |
*** artom has joined #openstack-nova | 09:17 | |
gibi | the requirement patch failed again. I'm looking at the grenade failure... | 09:50 |
gibi | https://f141fb01d9c1d07df646-94aaf0771088c81abb9a09d47e91a608.ssl.cf1.rackcdn.com/795533/5/check/nova-grenade-multinode/24485b8/testr_results.html | 09:50 |
elodilles | gibi: sorry, meanwhile i've commented 'recheck' on it :X | 09:55 |
gibi | elodilles: no worries | 09:55 |
gibi | the failure is unrelated | 09:55 |
gibi | but based on this morning's discussion above I feel we need to stabilize our gate | 09:55 |
bauzas | I'm back | 09:55 |
bauzas | gibi: stephenfin disagrees with the quick-fix approach of removing mypy so I don't want to opine here | 09:56 |
bauzas | what i agree with tho is that the gate failures are a PITA that need another pair of eyes | 09:57 |
bauzas | do we have a bug for tracking the nova-next and other jobs issues ? | 09:57 |
* bauzas needs more context | 09:58 | |
*** admin1 has joined #openstack-nova | 09:58 | |
bauzas | we probably need some other kind of workaround change for those jobs I guess | 09:59 |
gibi | bauzas: lyarwood promised to file a bug for the libvirt.libvirtError: unable to connect to server at 'ubuntu-focal-rax-dfw-0025050041:49152': Connection refused | 09:59 |
gibi | case | 09:59 |
gibi | I'm looking at the last grenade failure | 09:59 |
* bauzas looks at those too | 09:59 | |
gibi | which is a cinder volume detach timeout | 09:59 |
*** abhishekk has quit IRC | 09:59 | |
gibi | at least seem so far | 10:00 |
bauzas | humpf, yet another focal detach saga ? | 10:00 |
admin1 | hi all .. checking if anyone knows this .. what is the percentage latency/performance difference between the underlying filesystem vs nova ephemeral disks on top of it (qcow2) .. and is there a way to speed up iops performance by changing the format to raw for the vms .. and does such a way exist in openstack ? | 10:01 |
elodilles | this volume failure seems quite frequent (however, mostly with volume stuck in 'in-use' state) | 10:01 |
elodilles | http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22failed%20to%20reach%20available%20status%20(current%5C%22 | 10:01 |
gibi | elodilles: both the in-use and detaching states happen | 10:02 |
bauzas | elodilles: gibi: which jobs are we talking about ? | 10:02 |
bauzas | I see nova-live-migrate and nova-ceph-multistore, right? | 10:02 |
gibi | bauzas: I'm looking at the last failure in https://review.opendev.org/c/openstack/nova/+/795533 which is in nova-grenade-multinode | 10:02 |
bauzas | holy snap | 10:03 |
bauzas | I was about to propose making some failing jobs non-voting until we identify a proper fix, but if that's an issue occurring on a large set of jobs, never mind this proposal | 10:03 |
bauzas | we're in a fscking bad situation :/ | 10:04 |
bauzas | elodilles: the in-use issue seems to not have happened for a while | 10:06 |
gibi | I see both | 10:09 |
gibi | in http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22failed%20to%20reach%20available%20status%5C%22 | 10:10 |
elodilles | well, i hadn't realized it but logstash only shows failures from 4th June | 10:14 |
elodilles | i mean, only that day | 10:16 |
bauzas | yep | 10:16 |
bauzas | some transient issue I think | 10:16 |
gibi | elodilles: I think that is a shortcoming of logstash | 10:16 |
bauzas | ... or this ? | 10:16 |
gibi | I see the same recently | 10:16 |
bauzas | anyway, logstash is sunsetting unfortunately | 10:17 |
elodilles | i also think it's some indexing issue, or something like that... | 10:17 |
bauzas | so we can no longer count on it :( | 10:17 |
bauzas | (even if i told about it) | 10:17 |
gibi | I cannot figure out this failure | 10:33 |
gibi | I got lost in cinder | 10:33 |
gibi | I'll wait for lyarwood to look at it but other than that I can only file a bug on cinder | 10:38 |
* lyarwood is back online | 10:53 | |
gibi | lyarwood: o/ | 10:53 |
lyarwood | gibi: Can you throw me a pointer to some logs? | 10:53 |
gibi | sure | 10:53 |
gibi | https://f141fb01d9c1d07df646-94aaf0771088c81abb9a09d47e91a608.ssl.cf1.rackcdn.com/795533/5/check/nova-grenade-multinode/24485b8/testr_results.html | 10:55 |
gibi | this is the recent grenade job failure on https://review.opendev.org/c/openstack/nova/+/795533 | 10:55 |
gibi | but I think this 'failed to reach available status (current detaching)' and 'failed to reach available status (current in-use)' are pretty common failures | 10:55 |
lyarwood | right it's still the detach logic | 10:56 |
lyarwood | https://zuul.opendev.org/t/openstack/build/24485b8c450740c5946c39dcf310a746/log/controller/logs/screen-n-cpu.txt#25665 | 10:56 |
lyarwood | I had a few hits of this last week and wanted to dump the instance console on failure | 10:57 |
lyarwood | led me down a rabbit hole that I've not had time to revisit this week | 10:57 |
lyarwood | https://review.opendev.org/c/openstack/tempest/+/794757 was my initial attempt | 10:57 |
gibi | ohh | 10:58 |
gibi | thanks for the info | 10:58 |
gibi | I went to the cinder side | 10:58 |
gibi | and lost | 10:58 |
gibi | lyarwood: what I saw is that tempest sent a volume detach https://zuul.opendev.org/t/openstack/build/24485b8c450740c5946c39dcf310a746/log/job-output.txt#58720 and that led to the volume being in detaching state | 11:00 |
gibi | and that succeeded in nova https://zuul.opendev.org/t/openstack/build/24485b8c450740c5946c39dcf310a746/log/controller/logs/screen-n-cpu.txt#23456 | 11:02 |
gibi | ohh it does not | 11:03 |
gibi | it only succeeded from the persistent domain | 11:03 |
lyarwood | right req-6f7d27cf-3d82-4e8d-ad25-7f53249a05a0 fails to detach the volume from the live domain | 11:04 |
gibi | yeah, now I see | 11:04 |
lyarwood | I traced this all through previously and couldn't see any issues with n-cpu, libvirt or even QEMU tbh | 11:04 |
lyarwood | so I wanted to see what the state of the guest OS was | 11:05 |
lyarwood | before I reported a bug to the QEMU folks | 11:05 |
lyarwood | and/or cirros | 11:05 |
lyarwood | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ec1/795533/5/check/nova-live-migration/ec1e40c/testr_results.html - the latest run has failed in nova-live-migration | 11:07 |
lyarwood | with a timeout during the initial POST, odd. | 11:07 |
lyarwood | oh I wonder if it's https://bugs.launchpad.net/nova/+bug/1929446 | 11:08 |
opendevmeet | Launchpad bug 1929446 in OpenStack Compute (nova) "check_can_live_migrate_source taking > 60 seconds in CI" [Medium,Triaged] | 11:08 |
lyarwood | sean-k-mooney: ^ did you get anywhere with that? | 11:09 |
bauzas | fwiw, the mypy removal is green from the CI https://review.opendev.org/c/openstack/nova/+/795744 | 11:09 |
lyarwood | It's still likely to fail in the actual gate with all of these failures however right? | 11:11 |
sean-k-mooney | i think its still ovsdbapp but i did not see a way to stop the polling in the lib | 11:12 |
sean-k-mooney | i can take another look, im wondering if we need to move the polling to the privsep daemon or into a real pthread | 11:13 |
lyarwood | sean-k-mooney: I still don't understand the timing here | 11:16 |
lyarwood | sean-k-mooney: is there a long running os-vif thing in the background or is it related to check_can_live_migrate_source? | 11:16 |
sean-k-mooney | the first | 11:17 |
lyarwood | kk the short term workaround is just to bump the rpc timeout I guess | 11:17 |
lyarwood | just in the LM jobs | 11:17 |
sean-k-mooney | or rather os-vif uses ovsdbapp which creates a connection to ovs and then kicks off a polling loop that monitors ovs for the addition and removal of ports | 11:18 |
sean-k-mooney | os-vif never uses that feature of ovsdbapp | 11:18 |
sean-k-mooney | it is used by the neutron l2 agent which uses ovsdbapp directly | 11:18 |
sean-k-mooney | to know when we add and remove vm ports | 11:18 |
sean-k-mooney | but ovsdbapp appears to not have an obvious way to turn it off | 11:19 |
lyarwood | ah so we do this once via os-vif and ovsdbapp keeps polling in the background forever? | 11:19 |
sean-k-mooney | yep | 11:20 |
lyarwood | ewwww | 11:20 |
sean-k-mooney | im sure you have seen it in the debug logs | 11:20 |
lyarwood | so we don't even need this?! | 11:20 |
sean-k-mooney | correct | 11:20 |
lyarwood | yeah I just assumed we needed it | 11:20 |
lyarwood | christ | 11:20 |
lyarwood | :D | 11:20 |
sean-k-mooney | so there is a better workaround | 11:20 |
sean-k-mooney | https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovs.py#L65-L75 | 11:20 |
sean-k-mooney | go back to the old cli based backend | 11:21 |
sean-k-mooney | sorry i probably have to expand on that more | 11:22 |
sean-k-mooney | [os_vif_ovs]/ovsdb_interface=vsctl | 11:23 |
lyarwood | tbh if the native approach is broken and stealing this much time from actual n-cpu requests I'd suggest we do that | 11:23 |
sean-k-mooney | set that in the nova.conf | 11:23 |
sean-k-mooney | well the native implementation is much, much faster, especially at scale | 11:23 |
sean-k-mooney | but for right now we could go back to the old one, which i was meant to delete last cycle, until i can fix the native implementation | 11:24 |
lyarwood | question is do we do this for all CI envs or just the live migration ones? | 11:26 |
sean-k-mooney | at the scale we operate at in the ci its safe to do it for all if we want to | 11:27 |
sean-k-mooney | the performance delta only becomes apparent if you have 100s of ports | 11:27 |
gibi | I wouldn't be surprised if this polling interferes with our eventlet monkey patching, as internally it also patches some eventlet things | 11:27 |
lyarwood | fun | 11:27 |
sean-k-mooney | gibi: well os-vif intentionally does not use eventlet | 11:27 |
sean-k-mooney | although its always, or almost always, loaded into an env that is monkey-patched already | 11:28 |
gibi | /usr/lib/python3/dist-packages/ovs/poller.py | 11:28 |
gibi | i mean | 11:28 |
gibi | https://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L270 | 11:28 |
gibi | it does have eventlet patching | 11:28 |
sean-k-mooney | that is in the ovs python binding but ya | 11:29 |
gibi | as far as I can see it is actually escaping the monkey patching | 11:29 |
gibi | "If select.poll is | 11:29 |
gibi | monkey patched by eventlet or gevent library, it gets the original | 11:29 |
gibi | select.poll and returns an object of it" | 11:29 |
sean-k-mooney | we would want the polling, if it cant be disabled, to be on a real pthread | 11:29 |
sean-k-mooney | https://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L59-L63 | 11:30 |
gibi | ohh this is nice https://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L59-L63 | 11:30 |
gibi | hehe, found the same thing :D | 11:30 |
sean-k-mooney | ya that sounds very familiar | 11:31 |
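A tiny, self-contained illustration of the escape hatch gibi and sean-k-mooney just spotted (a sketch, not os-vif or ovs code): eventlet's monkey patching swaps the select module for a green-friendly one, but ovs.poller deliberately fetches the original one via eventlet.patcher, so the ovsdbapp monitor "thread" (a greenlet once nova-compute is monkey patched) ends up issuing real, blocking poll() calls:

```python
import eventlet
eventlet.monkey_patch()

import select
from eventlet import patcher

patched_select = select                       # what everything else sees
original_select = patcher.original('select')  # what ovs.poller digs out

print(patched_select is original_select)   # False: the poller bypasses the patching
print(hasattr(original_select, 'poll'))    # True on Linux: the real, blocking poll()
```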
lyarwood | sean-k-mooney: can you update https://bugs.launchpad.net/nova/+bug/1929446 to point to os-vif and update the bug title? | 11:31 |
opendevmeet | Launchpad bug 1929446 in OpenStack Compute (nova) "check_can_live_migrate_source taking > 60 seconds in CI" [Medium,Triaged] | 11:31 |
sean-k-mooney | i guess, but its really in ovs or ovsdbapp. im going to read through the poller implementation and see if we can tweak our usage | 11:32 |
lyarwood | right but any changes and/or fixes will end up in os-vif right? | 11:32 |
sean-k-mooney | not necessarily, it could be in ovsdbapp, but it wont be in nova | 11:33 |
sean-k-mooney | if we can fix it in os-vif i might just do it there | 11:33 |
lyarwood | ack cool, sorry - the "no changes required in nova" part was more my point :) | 11:34 |
sean-k-mooney | yep | 11:34 |
sean-k-mooney | this is where ovsdbapp is using that poller implementation https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L105 | 11:35 |
sean-k-mooney | we create an instance of that connection object here https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovsdb/impl_idl.py#L31-L33 | 11:36 |
sean-k-mooney | ovsdbapp is trying to run the poller in a separate thread https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L91 | 11:37 |
sean-k-mooney | but that is monkey-patched | 11:37 |
sean-k-mooney | really we want that to be a pthread | 11:37 |
gibi | sean-k-mooney: one hacky thing we can do is to monkey patch the ovs.poller.Poller class from os-vif to be an empty implementation | 11:38 |
sean-k-mooney | i was considering doing something like that | 11:39 |
sean-k-mooney | i mean i could probably just use mock to replace it | 11:39 |
gibi | yeah | 11:39 |
sean-k-mooney | ok ill see if i can play with this quickly but i think we should really fix this in ovsdbapp by allowing the connection to be created without polling | 11:42 |
* lyarwood holds off landing anything in devstack for now | 11:43 | |
sean-k-mooney | https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L98-L102 | 11:43 |
sean-k-mooney | this does concern me a bit | 11:43 |
lyarwood | https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure we really need to clean this out | 11:44 |
* lyarwood gives it a go | 11:44 | |
sean-k-mooney | but yes ovs is definitely un-monkey-patching https://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L270 | 11:45 |
lyarwood | gibi: is https://bugs.launchpad.net/nova/+bug/1793364 resolved now by https://review.opendev.org/q/topic:%22bp%252Fcompact-db-migrations-wallaby%22+(status:open%20OR%20status:merged) ? | 11:55 |
opendevmeet | Launchpad bug 1793364 in Cinder "mysql db opportunistic unit tests timing out intermittently in the gate (bad thread switch?)" [High,Confirmed] | 11:55 |
lyarwood | your comment suggests it was supposed to be | 11:55 |
gibi | lyarwood: it helped but I think we saw timeouts even after the compaction | 11:55 |
gibi | let me find the info | 11:56 |
sean-k-mooney | i think we did too | 11:56 |
lyarwood | kk do we want to keep this open or move it to incomplete and update it with fresh logs etc the next time it happens? | 11:56 |
gibi | https://review.opendev.org/c/openstack/nova/+/774889/1#message-db5303e41b22ebfb7b33067d815a12ece2df510b | 11:56 |
gibi | lyarwood: ack | 11:56 |
*** whoami-rajat has quit IRC | 11:57 | |
gibi | I will check logstash for fresh occurrences | 11:57 |
masterpe[m] | If I run ./nova-manage placement heal_allocations on Train I get the error: "Compute host scpuko57 could not be found." But it is in the "openstack hypervisor list" and "openstack compute service list" | 12:00 |
gibi | I made 1793364 a duplicate of https://bugs.launchpad.net/nova/+bug/1823251 (as the newer report has more info) and added a link to the log of a recent occurrence | 12:02 |
opendevmeet | Launchpad bug 1823251 in OpenStack Compute (nova) "Spike in TestNovaMigrationsMySQL.test_walk_versions/test_innodb_tables failures since April 1 2019 on limestone-regionone" [High,Confirmed] | 12:02 |
gibi | I have https://review.opendev.org/c/openstack/nova/+/775094 for more logs (now rechecked) but honestly I looked at this problem so many times that I don't think I can solve it. | 12:03 |
*** spatel has joined #openstack-nova | 12:19 | |
opendevreview | sean mooney proposed openstack/os-vif master: [WIP] mock ovs.poller.Poller https://review.opendev.org/c/openstack/os-vif/+/795770 | 12:24 |
sean-k-mooney | gibi: that passes the os-vif functional tests, which actually create ports in ovs, and locally i had it assert that the Poller was called | 12:25 |
sean-k-mooney | but im not sure if this will actually fix the issue | 12:25 |
sean-k-mooney | ill take a look at the tempest run when its done and we can see if the repeating debug message is still present or not | 12:25 |
kashyap | lyarwood: </me back after eclipse hunting> Hey. Have you got a bug filed, or shall I file one? | 12:29 |
gibi | sean-k-mooney: cool | 12:29 |
kashyap | [OT] A couple of pictures of projections, if you missed it: https://kashyapc.fedorapeople.org/partial_solar_eclipse_2021/ | 12:29 |
lyarwood | kashyap: I don't yet so feel free to write one up if you have time | 12:29 |
* lyarwood jumps on a call | 12:29 | |
kashyap | lyarwood: Sure; I'll do it right now | 12:29 |
kashyap | But would be good to see if we can reproduce this at least twice... | 12:30 |
*** brinzhang0 has joined #openstack-nova | 12:31 | |
*** rloo has joined #openstack-nova | 12:32 | |
gibi | kashyap: I think we hit it 7 times in the last 7 days http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22libvirt.libvirtError%3A%20unable%20to%20connect%20to%20server%20at%5C%22 | 12:32 |
kashyap | gibi: Ah, that's good to know. | 12:32 |
kashyap | So the trigger is block-migrating a paused instance | 12:33 |
kashyap | (Where "block-migrating" == migrating the instance along with its storage) | 12:33 |
*** rloo has quit IRC | 12:33 | |
*** rloo has joined #openstack-nova | 12:34 | |
*** brinzhang_ has quit IRC | 12:38 | |
lyarwood | kashyap: https://bugs.launchpad.net/nova/+bug/1912310 found an older bug we could mark as a duplicate if you've already written a fresh bug up | 13:21 |
opendevmeet | Launchpad bug 1912310 in OpenStack Compute (nova) "libvirt.libvirtError: unable to connect to server at " [Medium,Confirmed] | 13:21 |
kashyap | lyarwood: Ah, good find; I'm just drafting it locally | 13:22 |
* kashyap clicks | 13:22 | |
lyarwood | https://bugs.launchpad.net/nova/+bug/1797185 is another | 13:22 |
opendevmeet | Launchpad bug 1797185 in OpenStack Compute (nova) "live migration intermittently fails in CI with "Connection refused" during guest transfer" [Low,Confirmed] | 13:22 |
kashyap | Sigh | 13:22 |
kashyap | lyarwood: What I wonder is - how can we check _why_ the destination is unreachable... | 13:23 |
kashyap | Is it something peculiar to our upstream CI; or something else | 13:23 |
lyarwood | I honestly think it's a bug in the libvirt python bindings where they miss that the migration has already failed, try to poll its progress and don't correctly handle the fact that it's no longer there | 13:26 |
lyarwood | the unreachable error is a red herring IMHO | 13:26 |
* lyarwood looks at the trace again | 13:26 | |
kashyap | lyarwood: Yeah; I asked Michal from libvirt; and he suggested Jiri Denemark | 13:27 |
kashyap | He isn't around; but I'll ask him to comment on this once he's back | 13:28 |
kashyap | lyarwood: Also, you only speak of Python bindings - why won't it be a bug in the C API itself? | 13:28 |
kashyap | Asking out of ignorance, not challenging :) | 13:28 |
bauzas | hola | 13:28 |
bauzas | lyarwood: gibi: can we know how many jobs have problems ? | 13:29 |
bauzas | gibi: ^ | 13:29 |
bauzas | if we have problems with fixing those problems, can we make those specific jobs non-voting until we fix the main issues ? | 13:29 |
lyarwood | kashyap: it could be that as well, but n-cpu is calling the python bindings and they are raising the error here, so it could easily be an issue there | 13:29 |
*** abhishekk has joined #openstack-nova | 13:29 | |
bauzas | fwiw, I saw that https://review.opendev.org/c/openstack/nova/+/791506 passed the check pipeline but we don't know yet whether the gate one will accept it | 13:30 |
kashyap | lyarwood: Ah, okay; you're just speaking of the error trace in our case. Reasonable | 13:30 |
kashyap | lyarwood: Can you copy/paste your comment from the review to start with in there? | 13:31 |
kashyap | (In the latest bug you filed, in Jan 2021) | 13:31 |
lyarwood | bauzas: There are various intermittent failures at the moment, I'm trying to clean up the gate-failures bug tag to get a better handle on what is failing and how often | 13:31 |
lyarwood | bauzas: we could move them to non-voting but this has been going on for weeks so I'd rather we try to get a better handle on this first | 13:32 |
bauzas | okay, fair enough | 13:32 |
lyarwood | I honestly think we haven't been landing enough to notice recently | 13:32 |
bauzas | lyarwood: I guess my point is that we're still having *all* the changes getting -1 because of the mypy issue | 13:32 |
bauzas | so I want the gate back asap | 13:32 |
lyarwood | yup fair | 13:33 |
bauzas | we have two possibilities, the main fix https://review.opendev.org/c/openstack/nova/+/791506 | 13:33 |
* gibi is on a call | 13:33 | |
bauzas | but I also prepared an alternative change that's simplier for the gate and doesn't hit the transient issues we know https://review.opendev.org/c/openstack/nova/+/795744 | 13:33 |
bauzas | so, I guess, I'd recommend to wait for the main change to merge, but if we get a Zuul -2 from the gate on it, I'd honestly recommend to let https://review.opendev.org/c/openstack/nova/+/795744 go | 13:34 |
lyarwood | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/1863889/comments/3 another possible hit of the ovsdbapp issue | 13:42 |
opendevmeet | Launchpad bug 1863889 in OpenStack Compute (nova) "Revert resize problem in neutron-tempest-dvr-ha-multinode-full" [Medium,Confirmed] | 13:42 |
sean-k-mooney | lyarwood: im talking to otherwiseguy and ralonsoh in #openstack-neutron about it now but ill let them know | 13:44 |
lyarwood | it's just another random timeout FWIW | 13:44 |
lyarwood | with lots of spam from ovsdbapp so no hard proof | 13:44 |
*** artom_ has joined #openstack-nova | 13:50 | |
*** pjakuszew has joined #openstack-nova | 13:54 | |
bauzas | lyarwood: the problem is that logstash isn't reliable | 13:55 |
bauzas | despite some people like me and dansmith expressing our needs | 13:55 |
bauzas | we need to focus on the few things we saw occurring often | 13:56 |
*** artom has quit IRC | 13:56 | |
bauzas | and either fix them or hide them under the carpet until we are in a better situation | 13:56 |
lyarwood | yup agreed | 14:01 |
lyarwood | and that's what I have been doing prior to now with things like https://bugs.launchpad.net/nova/+bug/1929710 for example | 14:01 |
opendevmeet | Launchpad bug 1929710 in OpenStack Compute (nova) "virDomainGetBlockJobInfo fails during swap_volume as disk '$disk' not found in domain" [Medium,New] | 14:01 |
*** pjakuszew has quit IRC | 14:02 | |
lyarwood | https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure should be in a much better state now, I've moved loads of things to incomplete where I couldn't find any evidence of recent hits etc. | 14:03 |
lyarwood | these should close automatically in the next 60 days | 14:03 |
*** pjakuszew has joined #openstack-nova | 14:04 | |
*** pjakuszew has quit IRC | 14:05 | |
*** pjakuszew has joined #openstack-nova | 14:05 | |
bauzas | lyarwood: ++ thanks for triaging | 14:08 |
*** pjakuszew is now known as cz3|work | 14:09 | |
*** cz3|work is now known as pjakuszew | 14:10 | |
*** pjakuszew has left #openstack-nova | 14:10 | |
*** cz3|work has joined #openstack-nova | 14:17 | |
opendevreview | Merged openstack/nova-specs master: Speed up server details https://review.opendev.org/c/openstack/nova-specs/+/791620 | 14:22 |
*** alistarle has joined #openstack-nova | 14:27 | |
*** mdbooth has joined #openstack-nova | 14:28 | |
*** alistarle has left #openstack-nova | 14:32 | |
*** alistarle has joined #openstack-nova | 14:32 | |
*** alistarle has left #openstack-nova | 14:34 | |
*** alistarle has joined #openstack-nova | 14:34 | |
kashyap | lyarwood: Before I forget, for now, I've copy/pasted (with attribution) your comment from the change here: https://bugs.launchpad.net/nova/+bug/1912310 | 14:36 |
opendevmeet | Launchpad bug 1912310 in OpenStack Compute (nova) "libvirt.libvirtError: unable to connect to server at " [Medium,Confirmed] | 14:36 |
*** brinzhang0 has quit IRC | 14:36 | |
* gibi reads back... | 14:36 | |
*** brinzhang0 has joined #openstack-nova | 14:37 | |
gibi | lyarwood: thanks for the cleanup. I'll add that query to the weekly agenda | 14:40 |
lyarwood | kashyap: ack thanks | 14:40 |
lyarwood | gibi: awesome cheers | 14:40 |
sean-k-mooney | gibi: lyarwood: mocking the poller wont work but i have a separate patch to move it to a real thread against ovsdbapp, and from talking to otherwiseguy, https://bugs.launchpad.net/neutron/+bug/1930926 and https://review.opendev.org/c/openstack/neutron/+/794892 might also help if we ported the same change to os-vif | 14:54 |
opendevmeet | Launchpad bug 1930926 in neutron "Failing over OVN dbs can cause original controller to permanently lose connection" [Medium,Fix released] - Assigned to Terry Wilson (otherwiseguy) | 14:54 |
*** abhishekk has quit IRC | 14:54 | |
sean-k-mooney | mocking the poller actually causes the delay | 14:54 |
sean-k-mooney | as it forces us to reconnect every 5 seconds since we dont respond to the server side echo | 14:54 |
lyarwood | `# Overwriting globals in a library is clearly a good idea` lol | 14:56 |
lyarwood | sean-k-mooney: ack kk | 14:56 |
sean-k-mooney | ya... | 14:58 |
gibi | sean-k-mooney: will look at it shortly | 15:00 |
*** lucasagomes has quit IRC | 15:08 | |
*** liuyulong has quit IRC | 15:12 | |
*** liuyulong has joined #openstack-nova | 15:12 | |
*** tkajinam has quit IRC | 15:19 | |
bauzas | mmm, /me needs to leave, unfortunately too early to see outcomes of https://zuul.opendev.org/t/openstack/status#795533 | 15:32 |
bauzas | hopefully the change will be merged, but the failing jobs are currently still running | 15:32 |
*** kashyap has quit IRC | 15:45 | |
gibi | sean-k-mooney: is this the new direction? https://review.opendev.org/c/openstack/ovsdbapp/+/795789 | 15:54 |
*** alistarle has quit IRC | 16:03 | |
opendevreview | Balazs Gibizer proposed openstack/nova master: Test the NotificationFixture https://review.opendev.org/c/openstack/nova/+/758450 | 16:10 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Move fake_notifier impl under NotificationFixture https://review.opendev.org/c/openstack/nova/+/758451 | 16:10 |
opendevreview | Balazs Gibizer proposed openstack/nova master: rpc: Mark attributes as private https://review.opendev.org/c/openstack/nova/+/792803 | 16:10 |
sean-k-mooney | gibi: its potentially part of it; there are some other optimisations that they have done in neutron that we could port to os-vif that might help | 16:11 |
stephenfin | grenade-multinode failed on the deps change again :( tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration_with_attached_volume this time | 16:11 |
gibi | stephenfin: :/ | 16:11 |
sean-k-mooney | gibi: it sounds like the issue really needs to be fixed in the ovs python bindings | 16:11 |
gibi | sean-k-mooney: so no easy fix? | 16:12 |
gibi | sean-k-mooney: should we switch back to vsctl in the gate until we make a fix for the binding? | 16:12 |
gibi | stephenfin: looking | 16:12 |
sean-k-mooney | gibi: my next attempt will be to patch get_system_poll in the ovs binding | 16:12 |
gibi | sean-k-mooney: OK, lets try that | 16:12 |
sean-k-mooney | to not return the unpatched version, from os-vif | 16:12 |
sean-k-mooney | gibi: the short term fix would be to revert to the deprecated driver in devstack | 16:13 |
sean-k-mooney | until we figure out a way to work around this properly | 16:13 |
sean-k-mooney | so we can certainly do that if we want to | 16:13 |
gibi | sean-k-mooney: lets keep the devstack option open, but I think it is OK to try patching get_system_poll first | 16:13 |
gibi | let's see if that helps, if not then go with the devstack change | 16:14 |
sean-k-mooney | basically if i have get_system_poll return _SelectSelect then i think it should not block | 16:14 |
sean-k-mooney | on reconnect | 16:14 |
sean-k-mooney | then the neutron changes to disable the echo and instead use tcp keepalive, and to limit the tables we cache, would also minimise reconnect time | 16:15 |
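A rough, untested sketch of the workaround sean-k-mooney describes just above; the names (get_system_poll, _SelectSelect) are the ones mentioned in the chat and the ovs/poller.py linked earlier, and whether this actually avoids the stalls without side effects is exactly what still needs to be tested:

```python
import ovs.poller


def _green_friendly_poll():
    # _SelectSelect is ovs.poller's select()-based fallback; under
    # eventlet monkey patching, select() cooperates with green threads
    # instead of blocking the whole process like the original poll().
    return ovs.poller._SelectSelect()


# Would have to be applied (e.g. from os-vif) before ovsdbapp creates
# its Connection and starts the monitor thread.
ovs.poller.get_system_poll = _green_friendly_poll
```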
gibi | stephenfin: that is the volume detaching issue again, where lyarwood wants to look at the guest console log for hints | 16:16 |
gibi | stephenfin: https://review.opendev.org/c/openstack/tempest/+/794757 | 16:17 |
gibi | stephenfin: should we just retry or try to land https://review.opendev.org/c/openstack/nova/+/795744 in parallel as well to see which lands first? | 16:18 |
gibi | or turn off these unstable tests? | 16:19 |
*** rpittau is now known as rpittau|afk | 16:24 | |
*** alex_xu has quit IRC | 16:27 | |
melwitt | the gate is so angry :( seeing so many of Jun 10 16:08:13.476629 ubuntu-focal-rax-dfw-0025058768 nova-compute[50365]: ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance 63ddd2de-502b-4467-b7b5-350e601e4d1b from the live domain config. Device is still attached to the guest. | 16:49 |
gibi | melwitt: yes we discussed it with lyarwood during the day | 16:51 |
melwitt | whyyyyyy /o\ | 16:51 |
* lyarwood is back | 16:51 | |
gibi | he suspects that something in the guest OS makes the detach hang | 16:51 |
*** raildo has joined #openstack-nova | 16:51 | |
gibi | hence https://review.opendev.org/c/openstack/tempest/+/794757 | 16:52 |
lyarwood | I'm going to fix that up quickly now | 16:52 |
gibi | (I looked at the failures in the tempest test but did not figure out what was wrong :/) | 16:52 |
melwitt | how does that help the guest OS? sorry I don't understand | 16:53 |
gibi | that tempest patch will make sure that the guest console log is dumped | 16:53 |
melwitt | ohhh ok | 16:54 |
lyarwood | yeah it's not going to resolve anything | 16:54 |
lyarwood | I had to rework the cleanup ordering to get wait_for_volume_attachment_remove first | 16:54 |
lyarwood | as that is able to dump the console | 16:54 |
* lyarwood was sure he pushed a fixed version of this last week but nvm | 16:55 | |
melwitt | gotcha | 16:55 |
*** derekh has quit IRC | 16:59 | |
*** raildo has quit IRC | 17:18 | |
*** raildo has joined #openstack-nova | 17:18 | |
*** david-lyle is now known as dklyle | 17:26 | |
*** supamatt has joined #openstack-nova | 17:41 | |
*** sean-k-mooney has quit IRC | 17:41 | |
ganso | melwitt, gibi: hi! could you please take a look at this backport? It already has a +2. Thanks in advance! | 18:02 |
melwitt | ganso: link? | 18:04 |
ganso | melwitt: sorry! forgot the link! https://review.opendev.org/c/openstack/nova/+/794328 | 18:06 |
melwitt | thanks | 18:06 |
*** cz3|work is now known as pjakuszew | 18:08 | |
opendevreview | Lee Yarwood proposed openstack/nova master: DNM testing tempest volume detach failure capture of console https://review.opendev.org/c/openstack/nova/+/794766 | 18:35 |
*** sean-k-mooney has joined #openstack-nova | 19:04 | |
*** vishalmanchanda has quit IRC | 19:14 | |
sean-k-mooney | gibi: looking at the ci result, https://review.opendev.org/c/openstack/ovsdbapp/+/795789 actually does seem to, if not fix, greatly improve the issue | 19:15 |
sean-k-mooney | gibi: there are no reconnections or stalls with that patch | 19:15 |
sean-k-mooney | im still going to look at additional improvements within os-vif but that looks to me like a viable path forward | 19:16 |
*** swp20 has quit IRC | 19:46 | |
*** andrewbonney has quit IRC | 20:00 | |
*** spatel has quit IRC | 20:06 | |
*** bnemec has quit IRC | 20:26 | |
*** bnemec has joined #openstack-nova | 20:29 | |
*** spatel has joined #openstack-nova | 20:35 | |
*** spatel has quit IRC | 20:52 | |
*** ralonsoh has quit IRC | 20:56 | |
*** spatel has joined #openstack-nova | 21:47 | |
*** spatel has quit IRC | 22:06 | |
opendevreview | Ade Lee proposed openstack/nova master: Add check job for FIPS https://review.opendev.org/c/openstack/nova/+/790519 | 22:16 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!