Thursday, 2021-06-10

opendevreviewmelanie witt proposed openstack/nova master: Add functional regression test for bug 1853009  https://review.opendev.org/c/openstack/nova/+/69501200:02
opendevmeetbug 1853009 in OpenStack Compute (nova) ussuri "Ironic node rebalance race can lead to missing compute nodes in DB" [High,In progress] https://launchpad.net/bugs/1853009 - Assigned to Mark Goddard (mgoddard)00:02
opendevreviewmelanie witt proposed openstack/nova master: Clear rebalanced compute nodes from resource tracker  https://review.opendev.org/c/openstack/nova/+/69518700:02
opendevreviewmelanie witt proposed openstack/nova master: Invalidate provider tree when compute node disappears  https://review.opendev.org/c/openstack/nova/+/69518800:02
opendevreviewmelanie witt proposed openstack/nova master: Prevent deletion of a compute node belonging to another host  https://review.opendev.org/c/openstack/nova/+/69480200:02
opendevreviewmelanie witt proposed openstack/nova master: Fix inactive session error in compute node creation  https://review.opendev.org/c/openstack/nova/+/69518900:02
*** martinkennelly has quit IRC00:09
*** martinkennelly_ has quit IRC00:09
*** rloo has quit IRC00:21
*** swp20 has joined #openstack-nova00:42
*** LinPeiWen has joined #openstack-nova01:04
*** bhagyashris_ has joined #openstack-nova01:12
*** bhagyashris has quit IRC01:20
*** liuyulong has joined #openstack-nova01:25
*** spatel has joined #openstack-nova02:05
*** swp20 is now known as wenpingsong02:27
*** spatel has quit IRC02:51
*** spatel has joined #openstack-nova02:54
*** brinzhang_ has joined #openstack-nova03:07
*** brinzhang0 has quit IRC03:14
*** spatel has quit IRC03:22
*** LinPeiWen has quit IRC04:07
*** abhishekk has joined #openstack-nova04:48
*** opendevreview has quit IRC05:26
*** vishalmanchanda has joined #openstack-nova05:35
*** bhagyashris_ is now known as bhagyashris06:08
*** LinPeiWen has joined #openstack-nova06:21
masterpe[m]I have more resources declared in the placement.allocations table than are actually in use. I see some instance IDs there that are deleted.06:39
masterpe[m]What shall I do with them?06:39
fricklermasterpe[m]: seems this tool was made for you https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement-audit06:46
frickleror maybe the heal-allocations above06:46
masterpe[m]nice thanks06:52
*** ralonsoh has joined #openstack-nova06:54
*** rpittau|afk is now known as rpittau06:54
*** tosky has joined #openstack-nova07:03
*** david-lyle has joined #openstack-nova07:04
*** lucasagomes has joined #openstack-nova07:07
*** dklyle has quit IRC07:08
*** andrewbonney has joined #openstack-nova07:20
*** luksky has joined #openstack-nova07:29
*** alex_xu has joined #openstack-nova07:58
*** kashyap has joined #openstack-nova07:59
*** wenpingsong has quit IRC08:05
lyarwoodkashyap: https://review.opendev.org/c/openstack/nova/+/795533/5#message-3e7b1dc4d3f8d22b2c9a55637c7366199d00eb56 - I'll  write this up in a Nova bug and likely libvirt bug later today but would you mind scanning this if you get a chance?08:06
*** derekh has joined #openstack-nova08:06
kashyaplyarwood: Mornin; /me clicks08:06
kashyaplyarwood: Ah, you're debugging the informative error, "reason=failed"08:07
lyarwoodthere's two parts really08:07
lyarwoodyeah the reason=failed thing08:07
lyarwoodand the way that the call to virDomainMigrateToURI3 from the source doesn't pick up the failure on the dest08:08
kashyapHmm, yikes08:09
lyarwoodI'm assuming that the dest tears down the connection between the two leading to the eventual `unable to connect to server` error08:09
lyarwoodinstead of the source being told the migration has failed etc08:10
kashyaplyarwood: This is the dest, right: https://zuul.opendev.org/t/openstack/build/f3b829801901417c9310ad5cc5a0e886/log/controller/logs/libvirt/libvirtd_log.txt08:10
lyarwoodyeah08:10
kashyapIs it just me or is it loading deadly slow?  I just want to pull down the entire raw file08:11
lyarwoodyeah these files are pretty large, I typically pull them down now08:11
kashyapMy browser is hung here (probably it goes into MBs).  Do I click on the "View log" to get the raw link?08:12
lyarwoodit's awkward as the raw links don't use .gz at the end of the file names08:12
kashyapMaybe I can simply `wget` the above URL - /me tries08:12
lyarwoodmv it to a .gz file and either gunzip or let vim unpack them btw08:12
kashyapRight, but it _is_ a .gz file - a new user will discover it after your browser crashes :D08:12
kashyaplyarwood: Yeah; that's become "muscle memory" now08:12
lyarwoodbrowsers can handle .gz08:12
kashyapOh, sure; sometimes, a very large file just crashed FF for me08:14
kashyapActual link to `wget` is: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f3b/795533/5/check/nova-next/f3b8298/controller/logs/libvirt/libvirtd_log.txt08:15
kashyap93MB only08:15
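(A minimal sketch of the pull-it-down-and-unpack step being discussed: the raw Swift URL is the one kashyap pasted, and, as lyarwood notes, the payload is gzip data without a .gz suffix, so the gzip magic bytes are checked before decompressing. Illustrative helper only, not part of the nova tree.)
---
# Fetch the CI libvirtd log and transparently gunzip it if needed.
import gzip
import urllib.request

URL = ("https://storage.gra.cloud.ovh.net/v1/"
       "AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f3b/"
       "795533/5/check/nova-next/f3b8298/controller/logs/libvirt/"
       "libvirtd_log.txt")

with urllib.request.urlopen(URL) as resp:
    data = resp.read()

# gzip payloads start with the magic bytes 0x1f 0x8b; only then decompress.
if data[:2] == b"\x1f\x8b":
    data = gzip.decompress(data)

with open("libvirtd_log.txt", "wb") as f:
    f.write(data)
---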
*** martinkennelly has joined #openstack-nova08:28
*** martinkennelly_ has joined #openstack-nova08:28
kashyaplyarwood: Two quick things: is this reproducible?  Or is it the first you noticed?08:32
lyarwoodIt's the first time I've seen this08:32
kashyaplyarwood: Also, strangely, I don't see the migrateToURI3() failure in the source libvirtd log:08:32
kashyapI pulled the 113MB file from here: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f3b/795533/5/check/nova-next/f3b8298/compute1/logs/libvirt/libvirtd_log.txt08:32
kashyapI wonder if it got log-rotated08:33
* lyarwood looks08:33
*** opendevreview has joined #openstack-nova08:34
opendevreviewWenping Song proposed openstack/nova master: Replaces tenant_id with project_id from List/Update Servers APIs  https://review.opendev.org/c/openstack/nova/+/76429208:34
opendevreviewWenping Song proposed openstack/nova master: Replace all_tenants with all_projects in List Server APIs  https://review.opendev.org/c/openstack/nova/+/76531108:34
opendevreviewWenping Song proposed openstack/nova master: Replaces tenant_id with project_id from Rebuild Server API  https://review.opendev.org/c/openstack/nova/+/76638008:34
opendevreviewWenping Song proposed openstack/nova master: Replaces tenant_id with project_id from List SG API  https://review.opendev.org/c/openstack/nova/+/76672608:34
opendevreviewWenping Song proposed openstack/nova master: Replaces tenant_id with project_id from Flavor Access APIs  https://review.opendev.org/c/openstack/nova/+/76770408:34
opendevreviewWenping Song proposed openstack/nova master: Replaces tenant_id with project_id from List/Show usage APIs  https://review.opendev.org/c/openstack/nova/+/76850908:34
opendevreviewWenping Song proposed openstack/nova master: Replace tenants* with projects* of policies  https://review.opendev.org/c/openstack/nova/+/76531508:34
opendevreviewWenping Song proposed openstack/nova master: Replace os-simple-tenant-usage with os-simple-project-usage  https://review.opendev.org/c/openstack/nova/+/76885208:34
opendevreviewWenping Song proposed openstack/nova master: Replace tenant_id with project_id in os-quota-sets path  https://review.opendev.org/c/openstack/nova/+/76885108:34
opendevreviewWenping Song proposed openstack/nova master: Replace tenant_id with project_id in Limits API  https://review.opendev.org/c/openstack/nova/+/76886208:34
opendevreviewWenping Song proposed openstack/nova master: Replace tenant* with project* in codes  https://review.opendev.org/c/openstack/nova/+/76932908:34
lyarwood 93717 2021-06-10 06:41:58.982+0000: 58504: debug : qemuBlockJobEventProcessConcluded:1489 : handling job 'drive-virtio-disk0' state '3' newstate '0'08:38
lyarwood 93718 2021-06-10 06:41:58.982+0000: 58504: debug : qemuBlockJobProcessEventConcludedCopyAbort:1250 : copy job 'drive-virtio-disk0' on VM 'instance-0000001d' aborted08:38
lyarwoodlooks like the block job failed08:38
lyarwood 93774 2021-06-10 06:41:59.429+0000: 58504: debug : qemuDomainObjSetJobPhase:9291 : Setting 'migration out' phase to 'confirm3_cancelled'08:39
kashyaplyarwood: Yeah; that's a good find08:41
kashyapThe thing is - why is the connection refused?  I want to think of "firewall", but I don't think that's it08:41
lyarwooddidn't we have issues with test_live_block_migration_paused before?08:42
lyarwoodkashyap: I don't think it's refused08:42
lyarwoodkashyap: the migration just ends and the python libvirt lib is just incorrectly handling the failure08:42
kashyaplyarwood: Oh, right:08:42
kashyapIt's because the mirroring was cancelled on the source:08:42
kashyap---08:42
kashyap2021-06-10 06:41:59.429+0000: 58504: debug : qemuDomainObjSetJobPhase:9291 : Setting 'migration out' phase to 'confirm3_cancelled'08:42
kashyap2021-06-10 06:41:59.429+0000: 58504: debug : qemuMigrationEatCookie:1483 : cookielen=0 cookie='<null>'08:42
kashyap2021-06-10 06:41:59.430+0000: 58504: debug : qemuMigrationSrcNBDCopyCancel:707 : Cancelling drive mirrors for domain instance-0000001d08:42
kashyap2021-06-10 06:41:59.430+0000: 58504: debug : qemuMigrationSrcNBDCopyCancelled:632 : All disk mirrors are gone08:42
kashyap---08:42
kashyapYeah, having it return a graceful error is a reasonable request08:43
kashyaplyarwood: On test_live_block_migration_paused - yes!08:43
lyarwoodhttps://review.opendev.org/c/openstack/nova/+/76672008:43
kashyaplyarwood: IIRC, it was because it was being tested on shared storage08:44
lyarwoodno I think in the end we were told that there were known issues with -drive and told to upgrade to a newer version of QEMU08:44
kashyaplyarwood: Yep; that skip matches my understanding08:44
kashyapSo is the 'nova-next' job using Bionic?08:45
lyarwoodI need to head offline for a haircut for ~90mins but I think we need to flag this to the QEMU folks again08:45
lyarwoodI think it's focal08:45
lyarwoodhttps://zuul.opendev.org/t/openstack/build/f3b829801901417c9310ad5cc5a0e886/log/zuul-info/inventory.yaml#10608:45
*** swp20 has joined #openstack-nova08:45
lyarwoodyeah it's focal08:45
kashyaplyarwood: I just mentioned to Dave Gilbert ... need to file a bug08:45
lyarwoodkk awesome08:45
lyarwoodbrb in ~9008:45
kashyapI also need to head out for an errand shortly08:45
bauzascan anyone give me a short summary about the pep8 gate issue ? saw https://review.opendev.org/c/openstack/nova/+/79553308:47
gibibauzas: mypy 0.9 removed some third-party type defs from the package; those need to be installed separately08:47
gibibauzas: we were hit by the missing paramiko typedefs after 0.908:48
bauzasit was a minor version upgrade from mypy ? woah08:48
bauzasthe fix is still -1 from Zuul, not related ?08:49
gibibauzas: the failing nova-next and nova-grenade-multinode in that patch fail with things that I saw before on master, so I consider them unrelated08:49
bauzasgibi: can we just remove https://github.com/openstack/nova/blob/master/tox.ini#L57 for the moment ?08:49
gibibauzas: landing that removal needs to pass the same tests as the patch that is in the check queue08:50
gibiif your tests are unstable then both are equally hard to land08:50
gibis/your/our/08:51
bauzasgibi: if we would remove https://github.com/openstack/nova/blob/master/tox.ini#L57 then the jobs wouldn't be running08:52
gibibauzas: removing something from tox.ini still triggers nova-next and nova-grenade-multinode, doesn't it?08:52
bauzasgibi: so, we could fix the gate issue *and then* try to add this change08:52
bauzasgibi: good question, I actually didn't know08:52
stephenfinlyarwood: artom: The reason 'mypy --install-types' wasn't enough is that that requires an existing mypy cache (.mypy_cache), which will only be created if you run mypy. So you'd have to run mypy, wait for it to potentially fail, run '--install-types', then run mypy again08:53
bauzasgibi: lemme try to see this08:53
stephenfinIt worked locally because I had the cache already, but failed in the gate because it's a new env08:53
bauzasgibi: stephenfin: so we could just remove mypy first, then try to get https://review.opendev.org/c/openstack/nova/+/795533 merged, and then add mypy back08:53
gibibauzas: go ahead08:54
bauzasdoing it now08:54
stephenfinwhy can't https://review.opendev.org/c/openstack/nova/+/795533 merge?08:54
stephenfinthe requirements change has merged08:54
stephenfinand that's the correct fix08:54
gibiI don't want to block bauzas from trying another angle. If nova-next and nova-grenade-multinode do not trigger a tox.ini change then we might land the tox.ini change faster than the requirement change.08:55
gibis/trigger a/ trigger on a/08:55
stephenfinoh, I'm possibly missing context. Is there another issue now?08:55
stephenfini.e. with nova-next and nova-grenade-multinode ?08:56
gibiwe needed to recheck the nova requirement patch a couple of times already as it always hit something in one of those jobs08:56
bauzasstephenfin: the problem is that we run lots of jobs08:56
stephenfindo we have a fix for those jobs?08:56
gibitotally unrelated problems in unstable tests08:56
gibiI don't think so08:56
bauzasstephenfin: so I'm trying to see whether we could just remove the issue without running all of them08:57
kashyaplyarwood: Oh, BTW -- when you're back: the "cancelled" in the QEMU logs has a special meaning for NBD.  I forgot that I documented this myself upstream :D08:58
kashyaplyarwood: See step (4) here: https://qemu.readthedocs.io/en/latest/interop/live-block-operations.html#qmp-invocation-for-live-storage-migration-with-drive-mirror-nbd08:58
bauzasgibi: stephenfin: I guess we have a bug ?08:58
gibistephenfin: we have about a 10% failure rate on master in those two jobs08:58
opendevreviewVictor Coutellier proposed openstack/nova master: Allow configuration of direct-snapshot feature  https://review.opendev.org/c/openstack/nova/+/79483708:59
kashyaplyarwood: In short, once the mirroring from src --> dest completes, and the _READY event is emitted, source libvirtd issues QMP `block-job-cancel` to gracefully end the mirroring.08:59
gibiso I guess we are just unlucky with the req patch09:00
stephenfinI think so09:00
stephenfinfwiw though I wouldn't be in favour of simply dropping the requirements fix so we can merge other stuff, if the reason the requirements patch is failing is unrelated09:01
opendevreviewSylvain Bauza proposed openstack/nova master: Removing mypy to fix the nova CI  https://review.opendev.org/c/openstack/nova/+/79574409:01
gibistephenfin: not dropping the req fix09:01
gibistephenfin: if it lands then we are done09:02
gibiif bauzas's patch lands first, then we revert that when yours lands09:02
bauzasgibi: stephenfin: patch is up for just removing mypy run until we fix the requirements09:02
stephenfinthat makes no sense to me though09:02
stephenfinthe mypy run is causing the other failures09:02
stephenfin*isn't09:02
bauzasstephenfin: sure09:03
stephenfinso they have an equal chance of failing randomly09:03
bauzasI'm just proposing this one because I guess nova-next WONT be running on my patch09:03
bauzasit's just a tox.ini change09:03
stephenfinbut what does this achieve?09:03
gibistephenfin: not equal chance iff the tox.ini change does not trigger the unstable jobs09:03
gibistephenfin: if it triggers the same jobs, then I agree that the mypy removal patch is pointless09:03
stephenfineven if it doesn't, so what?09:04
bauzasgibi: stephenfin: yup, indeed, if this runs the same jobs, then never mind mine09:04
bauzasstephenfin: I was off yesterday but from what I've seen our gate is blocked09:04
bauzasI just want to unblock it asap09:05
stephenfinright, but it's blocked because of the flaky tests09:05
stephenfinnot because of the mypy thing09:05
bauzasthe clean fix requires a requirement bump09:05
stephenfinwe have a fix for the mypy thing09:05
bauzasstephenfin: yup, I saw it09:05
bauzasstephenfin: I'm just trying to see whether we can land things easier09:05
bauzasagain, unblocking the gate seems to me the most important09:06
bauzaswe could revert stuff later09:06
stephenfinbut you still won't be able to land anything that causes nova-next or nova-grenade-multinode to run?09:06
stephenfinso it's not unblocked09:06
bauzasstephenfin: that's a classic chicken-and-egg issue09:06
bauzasyou have 2 unrelated gate issues09:06
stephenfinI mean, if they're flaky then they're flaky for everything, surely?09:06
bauzasstephenfin: I don't disagree09:07
bauzas20% is a high rate of flakiness09:07
stephenfinthen we achieve nothing with this09:07
gibifor me moving from a full red due to pep8 (mypy) failure to a flaky red due to unstable jobs is still progress09:07
bauzasexcept we go from 100% of failures to a random 20% from what I understand09:07
bauzasthis09:07
bauzaseither way, I proposed but I don't have opinions09:08
bauzasthe last call, tho, is that I need to get my kids in a min09:08
bauzashttps://zuul.opendev.org/t/openstack/status#79574409:09
bauzaswe don't run the flaky jobs09:10
* bauzas needs to disappear for a 10 min09:10
bauzas30 mins roughly sorry09:10
stephenfinidk, if the failure rate is that high that we can't land a simple reqs patch, that would suggest we need to be working on the other jobs. I see lyarwood has already been looking. I can start now too09:11
bauzasstephenfin: yup, we need to parallelize efforts, I don't disagree09:11
bauzasmy patch is just a hack09:12
bauzasand we need to consider the migrate and next jobs as the top prio09:12
bauzasI absolutely don't disagree09:12
bauzasand I also absolutely agree we need to land the types-paramiko changes09:12
bauzasit's just, again, a way to unblock the gate even if flaky with the other jobs09:13
opendevreviewVictor Coutellier proposed openstack/nova master: Allow configuration of direct-snapshot feature  https://review.opendev.org/c/openstack/nova/+/79483709:13
* bauzas drops now09:13
*** artom has quit IRC09:16
*** artom has joined #openstack-nova09:17
gibithe requirement patch failed again. I'm looking at the grenade failure...09:50
gibihttps://f141fb01d9c1d07df646-94aaf0771088c81abb9a09d47e91a608.ssl.cf1.rackcdn.com/795533/5/check/nova-grenade-multinode/24485b8/testr_results.html09:50
elodillesgibi: sorry, meanwhile i've commented 'recheck' on it :X09:55
gibielodilles: no worries09:55
gibithe failure is unrelated09:55
gibibut based on this morning's discussion above I feel we need to stabilize our gate09:55
bauzasI'm back09:55
bauzasgibi: stephenfin disagrees with the quick fix approach of removing mypy so I don't want to push an opinion here09:56
bauzaswhat i agree with tho is that the gate failures are a PITA that need another pair of eyes09:57
bauzasdo we have a bug for tracking the nova-next and other jobs issues ?09:57
* bauzas needs more context09:58
*** admin1 has joined #openstack-nova09:58
bauzaswe probably need some other kind of workaround change for those jobs I guess09:59
gibibauzas: lyarwood promised to file a bug for the libvirt.libvirtError: unable to connect to server at 'ubuntu-focal-rax-dfw-0025050041:49152': Connection refused09:59
gibicase09:59
gibiI'm looking at the last grenade failure09:59
* bauzas looks at those too09:59
gibiwhich is a cinder volume detach timeout09:59
*** abhishekk has quit IRC09:59
gibiat least it seems so far10:00
bauzashumpf, yet another focal detach saga ?10:00
admin1hi all .. checking if anyone knows this .. what is the percentage latency/performance difference between the underlying filesystem vs nova ephemeral disks on top of it .. ( qcow2) and if there is a way to speed up iops performance by changing the format to raw for the vms .. and if such a way exists in openstack ?10:01
elodillesthis volume failure seems quite frequent (however, mostly with volume stuck in 'in-use' state)10:01
elodilleshttp://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22failed%20to%20reach%20available%20status%20(current%5C%2210:01
gibielodilles: both in-use and detaching state happens10:02
bauzaselodilles: gibi: which jobs are we talking about ?10:02
bauzasI see nova-live-migrate and nova-ceph-multistore, right?10:02
gibibauzas: I'm looking at the last failure in https://review.opendev.org/c/openstack/nova/+/795533 which is in nova-grenade-multinode10:02
bauzasholy snap10:03
bauzasI was about to propose to make some failing jobs non-voting until we identify a proper fix, but if that's an issue occurring on a large set of jobs, never mind this proposal10:03
bauzaswe're in a fscking bad situation :/10:04
bauzaselodilles: the in-use issue seems not to have happened for a while10:06
gibiI see both10:09
gibiin http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22failed%20to%20reach%20available%20status%5C%2210:10
elodilleswell, i hadn't realized, but logstash only shows failures from 4th June10:14
elodillesi mean, only that day10:16
bauzasyep10:16
bauzassome transient issue I think10:16
gibielodilles: I think that is a shortcoming of logstash10:16
bauzas... or this ?10:16
gibiI see the same recently10:16
bauzasanyway, logstash is sunsetting unfortunately10:17
elodillesi also think it's some indexing issue, or something like that...10:17
bauzasso we can no longer count on it :(10:17
bauzas(even if i told about it)10:17
gibiI cannot figure out this failure10:33
gibiI got lost in cinder10:33
gibiI'll wait for lyarwood to look at it but other than that I can only file a bug on cinder10:38
* lyarwood is back online10:53
gibilyarwood: o/10:53
lyarwoodgibi: Can you throw me a pointer to some logs?10:53
gibisure10:53
gibihttps://f141fb01d9c1d07df646-94aaf0771088c81abb9a09d47e91a608.ssl.cf1.rackcdn.com/795533/5/check/nova-grenade-multinode/24485b8/testr_results.html10:55
gibithis is the recent grenade job failure on https://review.opendev.org/c/openstack/nova/+/79553310:55
gibibut I think this 'failed to reach available status (current detaching)' and 'failed to reach available status (current in-use)' are pretty common failures10:55
lyarwoodright it's still the detach logic10:56
lyarwoodhttps://zuul.opendev.org/t/openstack/build/24485b8c450740c5946c39dcf310a746/log/controller/logs/screen-n-cpu.txt#2566510:56
lyarwoodI had a few hits of this last week and wanted to dump the instance console on failure10:57
lyarwoodled me down a rabbit hole that I've not had time to revisit this week10:57
lyarwoodhttps://review.opendev.org/c/openstack/tempest/+/794757 was my initial attempt10:57
gibiohh10:58
gibithanks for the info10:58
gibiI went to the cinder side10:58
gibiand got lost10:58
gibilyarwood: what I saw is that tempest sent a volume detachment https://zuul.opendev.org/t/openstack/build/24485b8c450740c5946c39dcf310a746/log/job-output.txt#58720 and that led to the volume being in detaching state11:00
gibiand that succeeded in nova https://zuul.opendev.org/t/openstack/build/24485b8c450740c5946c39dcf310a746/log/controller/logs/screen-n-cpu.txt#2345611:02
gibiohh it does not11:03
gibiit only succeeded from the persistent domain11:03
lyarwoodright req-6f7d27cf-3d82-4e8d-ad25-7f53249a05a0 fails to detach the volume from the live domain11:04
gibiyeah, now I see11:04
lyarwoodI traced this all through previously and couldn't see any issues with n-cpu, libvirt or even QEMU tbh11:04
lyarwoodso I wanted to see what the state of the guest OS was11:05
lyarwoodbefore I reported a bug to the QEMU folks11:05
lyarwoodand/or cirros11:05
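(A hedged sketch of the two-stage detach being discussed: the detach from the persistent (config) definition returns straight away, while the detach from the live domain only completes once the guest OS acknowledges the hot-unplug, which is where the CI timeouts show up. The domain name and disk XML below are illustrative only.)
---
# Illustrative two-stage volume detach via libvirt-python.
import libvirt

DISK_XML = """
<disk type='file' device='disk'>
  <target dev='vdb' bus='virtio'/>
</disk>
"""

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000001d')

# Detach from the persistent definition: updates the stored XML and
# normally succeeds immediately.
dom.detachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_CONFIG)

# Detach from the live domain: completes only once the guest OS responds
# to the hot-unplug request, so a stuck guest leaves the volume attached.
dom.detachDeviceFlags(DISK_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)
---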
lyarwoodhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ec1/795533/5/check/nova-live-migration/ec1e40c/testr_results.html - the latest run has failed in nova-live-migration11:07
lyarwoodwith a timeout during the initial POST, odd.11:07
lyarwoodoh I wonder if it's https://bugs.launchpad.net/nova/+bug/192944611:08
opendevmeetLaunchpad bug 1929446 in OpenStack Compute (nova) "check_can_live_migrate_source taking > 60 seconds in CI" [Medium,Triaged]11:08
lyarwoodsean-k-mooney: ^ did you get anywhere with that?11:09
bauzasfwiw, the mypy removal is green from the CI https://review.opendev.org/c/openstack/nova/+/79574411:09
lyarwoodIt's still likely to fail in the actual gate with all of these failures however right?11:11
sean-k-mooneyi think it's still ovsdbapp but i did not see a way to stop the polling in the lib11:12
sean-k-mooneyi can take another look; im wondering if we need to move the polling to the privsep daemon or into a real pthread11:13
lyarwoodsean-k-mooney: I still don't understand the timing here11:16
lyarwoodsean-k-mooney: is there a long running os-vif thing in the background or is it related to check_can_live_migrate_source?11:16
sean-k-mooneythe first11:17
lyarwoodkk the short term workaround is just to bump the rpc timeout I guess11:17
lyarwoodjust in the LM jobs11:17
sean-k-mooneyor rather os-vif uses ovsdbapp which creates a connection to ovs and then kicks off a polling loop that monitors ovs for the addition and removal of ports11:18
sean-k-mooneyos-vif never uses that feature of ovsdbapp11:18
sean-k-mooneyit is used by the neutron l2 agent which uses ovsdbapp directly11:18
sean-k-mooneyto know when we add and remove vm ports11:18
sean-k-mooneybut ovsdbapp appears to not have an obvious way to turn it off11:19
lyarwoodah so we do this once via os-vif and ovsdbapp keeps polling in the background forever?11:19
sean-k-mooneyyep11:20
lyarwoodewwww11:20
sean-k-mooneyim sure you have seen it in the debug logs11:20
lyarwoodso we don't even need this?!11:20
sean-k-mooneycorrect11:20
lyarwoodyeah I just assumed we needed it11:20
lyarwoodchrist11:20
lyarwood:D11:20
sean-k-mooneyso there is a better workaround11:20
sean-k-mooneyhttps://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovs.py#L65-L7511:20
sean-k-mooneygo back to the old cli based backend11:21
sean-k-mooneysorry i probably have to expand on that more11:22
sean-k-mooney[os_vif_ovs]/ovsdb_interface=vsctl11:23
lyarwoodtbh if the native approach is broken and stealing this much time from actual n-cpu requests I'd suggest we do that11:23
sean-k-mooneyset that in the nova.conf11:23
sean-k-mooneywell the native implementation is much much faster, especially at scale11:23
sean-k-mooneybut for right now we could go back to the old one, which i was meant to delete last cycle, until i can fix the native implementation11:24
lyarwoodquestion is do we do this for all CI envs or  just the live migration ones?11:26
sean-k-mooneyat the scale we operate at in the ci its safe to do it for all if we want to11:27
sean-k-mooneythe performance delta only becomes apparent if you have 100s of ports11:27
gibiI wouldn't be surprised if this polling interferes with our eventlet monkey patching, as internally it also patches some eventlet things11:27
lyarwoodfun11:27
sean-k-mooneygibi: well os-vif intentionally does not use eventlet11:27
sean-k-mooneyalthough it's always or almost always loaded into an env that is monkeypatched already11:28
gibi/usr/lib/python3/dist-packages/ovs/poller.py11:28
gibii mean11:28
gibihttps://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L27011:28
gibiit does have eventlet patching11:28
sean-k-mooneythat is in the ovs python binding but ya11:29
gibias far as I see it is actually escaping monkey patching11:29
gibi"If select.poll is11:29
gibi    monkey patched by eventlet or gevent library, it gets the original11:29
gibi    select.poll and returns an object of it"11:29
sean-k-mooneywe would want the polling, if it can't be disabled, to be on a real pthread11:29
sean-k-mooneyhttps://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L59-L6311:30
gibiohh this is nice https://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L59-L6311:30
gibihehe, found the same thing :D11:30
sean-k-mooneyya that sounds very familiar11:31
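(A small, self-contained illustration of why this escape matters under eventlet, assuming eventlet is installed: a blocking call on the original, un-monkeypatched poll stalls the whole process, so every other greenthread — RPC handlers included — waits for it. The function names below are made up for the demo.)
---
# Demonstration: a real select.poll() inside a greenthread blocks the hub.
import eventlet
eventlet.monkey_patch()

from eventlet import patcher

original_select = patcher.original('select')  # the un-patched module

def background_poller():
    poll = original_select.poll()  # real poll object, invisible to eventlet
    poll.poll(5000)                # blocks the whole process for ~5 seconds

def rpc_handler():
    print('this only runs once the real poll() above has returned')

eventlet.spawn(background_poller)
eventlet.spawn(rpc_handler)
eventlet.sleep(6)
---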
lyarwoodsean-k-mooney: can you update https://bugs.launchpad.net/nova/+bug/1929446 to point to os-vif and update the bug title?11:31
opendevmeetLaunchpad bug 1929446 in OpenStack Compute (nova) "check_can_live_migrate_source taking > 60 seconds in CI" [Medium,Triaged]11:31
sean-k-mooneyi guess but its really in ovs or ovsdbapp. im going to read through the poller implementation and see if we can tweak our usage11:32
lyarwoodright but any changes and/or fixes will end up in os-vif right?11:32
sean-k-mooneynot necessarily, it could be in ovsdbapp but it won't be in nova11:33
sean-k-mooneyif we can fix it in os-vif i might just do it there11:33
lyarwoodack cool sorry the no changes required in nova part was more my point :)11:34
sean-k-mooneyyep11:34
sean-k-mooneythis is where ovsdbapp is using that poller implementation https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L10511:35
sean-k-mooneywe create an instance of that connection object here https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovsdb/impl_idl.py#L31-L3311:36
sean-k-mooneyovsdbapp is trying to run the poller in a separate thread https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L9111:37
sean-k-mooneybut that is monkeypatched11:37
sean-k-mooneyreally we want that to be a pthread11:37
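(A hedged sketch of that "real pthread" idea: even in a monkeypatched process the original threading module can be reached via eventlet.patcher.original(), so the poll loop could run on a native OS thread instead of a greenthread. The _poll_loop body is a made-up stand-in for ovsdbapp's Connection.run() loop.)
---
# Run a polling loop on a native OS thread despite eventlet monkey patching.
from eventlet import patcher

real_threading = patcher.original('threading')  # un-patched threading module

def _poll_loop():
    # Stand-in for ovsdbapp's Connection.run() / the ovs Poller.block() loop.
    ...

poller_thread = real_threading.Thread(target=_poll_loop, daemon=True)
poller_thread.start()
---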
gibisean-k-mooney: one hacky thing we can do is to monkey patch the ovs.poller.Poller class from os-vif to be an empty implementation11:38
sean-k-mooneyi was considering doing something like that11:39
sean-k-mooneyi mean i could probably just use mock to replace it11:39
gibiyeah11:39
sean-k-mooneyok ill see if i can play with this quickly but i think we should really fix this in ovsdbapp by allowing the connection to be created without polling11:42
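(For illustration, the "empty Poller" hack could look roughly like the sketch below — sean-k-mooney's WIP os-vif change takes this shape. As noted later in the day it turned out to cause 5-second reconnects rather than fix the stall, so treat this as a record of the idea, not a recommendation.)
---
# Replace ovs.poller.Poller with a do-nothing stand-in (illustrative only).
from unittest import mock

class _NoopPoller(object):
    """Accepts any Poller call and does nothing."""
    def __getattr__(self, name):
        return lambda *args, **kwargs: None

_patcher = mock.patch('ovs.poller.Poller', _NoopPoller)
_patcher.start()
---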
* lyarwood holds off landing anything in devstack for now11:43
sean-k-mooneyhttps://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L98-L10211:43
sean-k-mooneythis does concern me a bit11:43
lyarwoodhttps://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure we really need to clean this out11:44
* lyarwood gives it a go11:44
sean-k-mooneybut yes ovs is definitely un-monkeypatching https://github.com/openvswitch/ovs/blob/210c4cba9bc69412473a2fee8e9b6f023150e6e6/python/ovs/poller.py#L27011:45
lyarwoodgibi: is https://bugs.launchpad.net/nova/+bug/1793364 resolved now by https://review.opendev.org/q/topic:%22bp%252Fcompact-db-migrations-wallaby%22+(status:open%20OR%20status:merged) ?11:55
opendevmeetLaunchpad bug 1793364 in Cinder "mysql db opportunistic unit tests timing out intermittently in the gate (bad thread switch?)" [High,Confirmed]11:55
lyarwoodyour comment suggests it was supposed to be11:55
gibilyarwood: it helped but I think we saw timeouts even after the compaction11:55
gibilet me find the info11:56
sean-k-mooneyi think we did too11:56
lyarwoodkk do we want to keep this open or move it to incomplete and update it with fresh logs etc the next time it happens?11:56
gibihttps://review.opendev.org/c/openstack/nova/+/774889/1#message-db5303e41b22ebfb7b33067d815a12ece2df510b11:56
gibilyarwood: ack11:56
*** whoami-rajat has quit IRC11:57
gibiI will check logstash for fresh occurrences11:57
masterpe[m]If I run ./nova-manage placement heal_allocations on Train I get the error: "Compute host scpuko57 could not be found." But it is in the "openstack hypervisor list" and "openstack compute service list"12:00
gibiI made 1793364 a duplicate of https://bugs.launchpad.net/nova/+bug/1823251 (as the newer report has more info) and added a link to the log of a recent occurrence12:02
opendevmeetLaunchpad bug 1823251 in OpenStack Compute (nova) "Spike in TestNovaMigrationsMySQL.test_walk_versions/test_innodb_tables failures since April 1 2019 on limestone-regionone" [High,Confirmed]12:02
gibiI have https://review.opendev.org/c/openstack/nova/+/775094 for more logs (now rechecked) but honestly I looked at this problem so many times that I don't think I can solve it.12:03
*** spatel has joined #openstack-nova12:19
opendevreviewsean mooney proposed openstack/os-vif master: [WIP] mock ovs.poller.Poller  https://review.opendev.org/c/openstack/os-vif/+/79577012:24
sean-k-mooneygibi: that passes the os-vif functional tests which actually create ports in ovs and locally i had it assert that the Poller was called12:25
sean-k-mooneybut im not sure if this will actually fix the issue12:25
sean-k-mooneyill take a look at the tempest run when its done and we can see if the repeating debug message is still present or not12:25
kashyaplyarwood: </me back after eclipse hunting> Hey. Have you got a bug filed, or shall I file one?12:29
gibisean-k-mooney: cool12:29
kashyap[OT] A couple of pictures of projections, if you missed it: https://kashyapc.fedorapeople.org/partial_solar_eclipse_2021/12:29
lyarwoodkashyap: I don't yet so feel free to write one up if you have time12:29
* lyarwood jumps on a call12:29
kashyaplyarwood: Sure; I'll do it right now12:29
kashyapBut would be good to see if we can reproduce this at least twice...12:30
*** brinzhang0 has joined #openstack-nova12:31
*** rloo has joined #openstack-nova12:32
gibikashyap: I think we hit it 7 times in the last 7 days http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22libvirt.libvirtError%3A%20unable%20to%20connect%20to%20server%20at%5C%2212:32
kashyapgibi: Ah, that's good to know.12:32
kashyapSo the trigger is block-migrating a paused instance12:33
kashyap(Where "block-migrating" == migrating the instance along with its storage)12:33
*** rloo has quit IRC12:33
*** rloo has joined #openstack-nova12:34
*** brinzhang_ has quit IRC12:38
lyarwoodkashyap: https://bugs.launchpad.net/nova/+bug/1912310 found an older bug we could mark as a duplicate if you've already written a fresh bug up13:21
opendevmeetLaunchpad bug 1912310 in OpenStack Compute (nova) "libvirt.libvirtError: unable to connect to server at " [Medium,Confirmed]13:21
kashyaplyarwood: Ah, good find; I'm just drafting it locally13:22
* kashyap clicks13:22
lyarwoodhttps://bugs.launchpad.net/nova/+bug/1797185 is another13:22
opendevmeetLaunchpad bug 1797185 in OpenStack Compute (nova) "live migration intermittently fails in CI with "Connection refused" during guest transfer" [Low,Confirmed]13:22
kashyapSigh13:22
kashyaplyarwood: What I wonder is - how can we check _why_ the destination is unreachable...13:23
kashyapIs it something peculiar to our upstream CI; or something else13:23
lyarwoodI honestly think it's a bug in the libvirt python bindings where they miss that the migration has already failed, try to poll its progress and don't handle the fact that it's no longer there correctly13:26
lyarwoodthe unreachable error is a red herring IMHO13:26
* lyarwood looks at the trace again13:26
kashyaplyarwood: Yeah; I asked Michal from libvirt; and he suggested Jiri Denemark13:27
kashyapHe isn't around; but I'll ask him to comment on this once he's back13:28
kashyaplyarwood: Also, you only speak of Python bindings - why won't it be a bug in the C API itself?13:28
kashyapAsking out of ignorance, not challenging :)13:28
bauzashola13:28
bauzaslyarwood: gibi: can we know how many jobs have problems ?13:29
bauzasgibi: ^13:29
bauzasif we have problems with fixing those problems, can we make those specific jobs non-voting until we fix the main issues ?13:29
lyarwoodkashyap: it could be that as well but n-cpu is calling the python bindings and they are raising the error here so it could easily be an issue there13:29
*** abhishekk has joined #openstack-nova13:29
bauzasfwiw, I saw that https://review.opendev.org/c/openstack/nova/+/791506 passed the check pipeline but we don't know yet whether the gate one will accept it13:30
kashyaplyarwood: Ah, okay; you're just speaking of the error trace in our case.  Reasonable13:30
kashyaplyarwood: Can you copy/paste your comment from the review to start with in there?13:31
kashyap(In the latest bug you filed 2021 Jan)13:31
lyarwoodbauzas: There are various intermittent failures at the moment, I'm trying to clean up the gate-failures bug tag to get a better handle on what is failing and how often13:31
lyarwoodbauzas: we could move them to non-voting but this has been going on for weeks so I'd rather we try to get a better handle on this first13:32
bauzasokay, fair enough13:32
lyarwoodI honestly think we haven't been landing enough to notice recently13:32
bauzaslyarwood: I guess my point is that we're still having *all* the changes getting -1 because of the mypy issue13:32
bauzasso I want the gate back asap13:32
lyarwoodyup fair13:33
bauzaswe have two possibilities, the main fix https://review.opendev.org/c/openstack/nova/+/79150613:33
* gibi is on a call 13:33
bauzasbut I also prepared an alternative change that's simpler for the gate and doesn't hit the transient issues we know https://review.opendev.org/c/openstack/nova/+/79574413:33
bauzasso, I guess, I'd recommend to wait for the main change to merge, but if we get a Zuul -2 from the gate on it, I'd honestly recommend to let https://review.opendev.org/c/openstack/nova/+/795744 go13:34
lyarwoodsean-k-mooney: https://bugs.launchpad.net/nova/+bug/1863889/comments/3 another possible hit of the ovsdbapp issue13:42
opendevmeetLaunchpad bug 1863889 in OpenStack Compute (nova) "Revert resize problem in neutron-tempest-dvr-ha-multinode-full" [Medium,Confirmed]13:42
sean-k-mooneylyarwood: im talking to otherwiseguy and ralonsoh in #openstack-neutron about it now but ill let them know13:44
lyarwoodit's just another random timeout FWIW13:44
lyarwoodwith lots of spam from ovsdbapp so no hard proof13:44
*** artom_ has joined #openstack-nova13:50
*** pjakuszew has joined #openstack-nova13:54
bauzaslyarwood: the problem is that logstash isn't reliable13:55
bauzasdespite some people like me and dansmith expressing our needs13:55
bauzaswe need to focus on the few things we saw occurring often13:56
*** artom has quit IRC13:56
bauzasand either fix them or hide them under the carpet until we are in a better situation13:56
lyarwoodyup agreed14:01
lyarwoodand that's what I have been doing prior to now with things like https://bugs.launchpad.net/nova/+bug/1929710 for example14:01
opendevmeetLaunchpad bug 1929710 in OpenStack Compute (nova) "virDomainGetBlockJobInfo fails during swap_volume as disk '$disk' not found in domain" [Medium,New]14:01
*** pjakuszew has quit IRC14:02
lyarwoodhttps://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure should be in a much better state now, I've moved loads of things to incomplete where I couldn't find any evidence of recent hits etc.14:03
lyarwoodthese should close automatically in the next 60 days14:03
*** pjakuszew has joined #openstack-nova14:04
*** pjakuszew has quit IRC14:05
*** pjakuszew has joined #openstack-nova14:05
bauzaslyarwood: ++ thanks for triaging14:08
*** pjakuszew is now known as cz3|work14:09
*** cz3|work is now known as pjakuszew14:10
*** pjakuszew has left #openstack-nova14:10
*** cz3|work has joined #openstack-nova14:17
opendevreviewMerged openstack/nova-specs master: Speed up server details  https://review.opendev.org/c/openstack/nova-specs/+/79162014:22
*** alistarle has joined #openstack-nova14:27
*** mdbooth has joined #openstack-nova14:28
*** alistarle has left #openstack-nova14:32
*** alistarle has joined #openstack-nova14:32
*** alistarle has left #openstack-nova14:34
*** alistarle has joined #openstack-nova14:34
kashyaplyarwood: Before I forget, for now, I've copy/pasted (with attribution) your comment from the change here: https://bugs.launchpad.net/nova/+bug/191231014:36
opendevmeetLaunchpad bug 1912310 in OpenStack Compute (nova) "libvirt.libvirtError: unable to connect to server at " [Medium,Confirmed]14:36
*** brinzhang0 has quit IRC14:36
* gibi reads back...14:36
*** brinzhang0 has joined #openstack-nova14:37
gibilyarwood: thanks for the cleanup. I'll add that query to the weekly agenda14:40
lyarwoodkashyap: ack thanks14:40
lyarwoodgibi: awesome cheers14:40
sean-k-mooneygibi: lyarwood  mocking the poller won't work, but i have a separate patch against ovsdbapp to move it to a real thread, and talking to otherwiseguy, https://bugs.launchpad.net/neutron/+bug/1930926  and https://review.opendev.org/c/openstack/neutron/+/794892 might also help if we ported the same change to os-vif14:54
opendevmeetLaunchpad bug 1930926 in neutron "Failing over OVN dbs can cause original controller to permanently lose connection" [Medium,Fix released] - Assigned to Terry Wilson (otherwiseguy)14:54
*** abhishekk has quit IRC14:54
sean-k-mooneymocking the poller actually causes the delay14:54
sean-k-mooneyas it forces us to reconnect every 5 seconds since we don't respond to the server-side echo14:54
lyarwood`# Overwriting globals in a library is clearly a good idea` lol14:56
lyarwoodsean-k-mooney: ack kk14:56
sean-k-mooneyya...14:58
gibisean-k-mooney: will look at it shortly15:00
*** lucasagomes has quit IRC15:08
*** liuyulong has quit IRC15:12
*** liuyulong has joined #openstack-nova15:12
*** tkajinam has quit IRC15:19
bauzasmmm, /me needs to leave, unfortunately too early to see outcomes of https://zuul.opendev.org/t/openstack/status#79553315:32
bauzashopefully the change will be merged, but the failing jobs are currently still running15:32
*** kashyap has quit IRC15:45
gibisean-k-mooney: is this the new direction? https://review.opendev.org/c/openstack/ovsdbapp/+/79578915:54
*** alistarle has quit IRC16:03
opendevreviewBalazs Gibizer proposed openstack/nova master: Test the NotificationFixture  https://review.opendev.org/c/openstack/nova/+/75845016:10
opendevreviewBalazs Gibizer proposed openstack/nova master: Move fake_notifier impl under NotificationFixture  https://review.opendev.org/c/openstack/nova/+/75845116:10
opendevreviewBalazs Gibizer proposed openstack/nova master: rpc: Mark attributes as private  https://review.opendev.org/c/openstack/nova/+/79280316:10
sean-k-mooneygibi: it's potentially part of it; there are some other optimisations that they have done in neutron that we could port to os-vif that might help16:11
stephenfingrenade-multinode failed on the deps change again :( tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration_with_attached_volume this time16:11
gibistephenfin: :/16:11
sean-k-mooneygibi: it sounds like the issue really needs to be fixed in the ovs python bindings16:11
gibisean-k-mooney: so no easy fix?16:12
gibisean-k-mooney: should we switch back to vsctl in the gate until we make a fix for the binding?16:12
gibistephenfin: looking16:12
sean-k-mooneygibi: my next attempt will be to patch get_system_poll in the ovs binding16:12
gibisean-k-mooney: OK, lets try that16:12
sean-k-mooneyto not return the unpatched version from os-vif16:12
sean-k-mooneygibi: the short term fix would be to revert to the deprecated driver in devstack16:13
sean-k-mooneyuntil we figure out a way to work around this properly16:13
sean-k-mooneyso we can certainly do that if we want to16:13
gibisean-k-mooney: lets keep the devstack option open, but I think it is OK to try patching get_system_poll first16:13
gibilet's see if that helps, if not then go with the devstack change16:14
sean-k-mooneybasically if i have get_system_poll returning _SelectSelect then i think it should not block16:14
sean-k-mooneyon reconnect16:14
sean-k-mooneythen the neutron changes to disable the echo and instead use tcp keepalive, and to limit the tables we cache, would also minimise reconnect time16:15
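(A hedged sketch of what that get_system_poll patch could look like from the os-vif side: point it at ovs.poller's select.select based fallback (_SelectSelect), which does respect eventlet's monkey patching, instead of the original blocking poll. The names come from the chat; the exact internals of ovs.poller may differ.)
---
# Make ovs.poller hand out its select.select based poll emulation so the
# ovsdbapp connection loop yields to eventlet instead of blocking the hub.
import ovs.poller

def _green_friendly_poll():
    # _SelectSelect emulates poll() on top of select.select; under eventlet
    # monkey patching, select.select cooperates with the hub.
    return ovs.poller._SelectSelect()

ovs.poller.get_system_poll = _green_friendly_poll
---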
gibistephenfin: that is the volume detaching issue again, where lyarwood wants to look at the guest console log for hints16:16
gibistephenfin: https://review.opendev.org/c/openstack/tempest/+/79475716:17
gibistephenfin: should we just retry or try to land https://review.opendev.org/c/openstack/nova/+/795744 in parallel as well to see which lands first?16:18
gibior turn off these unstable tests?16:19
*** rpittau is now known as rpittau|afk16:24
*** alex_xu has quit IRC16:27
melwittthe gate is so angry :(   seeing so many of Jun 10 16:08:13.476629 ubuntu-focal-rax-dfw-0025058768 nova-compute[50365]: ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance 63ddd2de-502b-4467-b7b5-350e601e4d1b from the live domain config. Device is still attached to the guest.16:49
gibimelwitt: yes we discussed it with lyarwood during the day16:51
melwittwhyyyyyy /o\16:51
* lyarwood is back16:51
gibihe suspects that something in the guest OS makes the detach hang16:51
*** raildo has joined #openstack-nova16:51
gibihence https://review.opendev.org/c/openstack/tempest/+/79475716:52
lyarwoodI'm going to fix that up quickly now16:52
gibi(I looked at the failures in the tempest test but did not figure out what was wrong :/)16:52
melwitthow does that help the guest OS? sorry I don't understand16:53
gibithat tempest patch will make sure that the guest console log is dumped16:53
melwittohhh ok16:54
lyarwoodyeah it's not going to resolve anything16:54
lyarwoodI had to rework the cleanup ordering to get wait_for_volume_attachment_remove first16:54
lyarwoodas that is able to dump the console16:54
* lyarwood was sure he pushed a fixed version of this last week but nvm16:55
melwittgotcha16:55
*** derekh has quit IRC16:59
*** raildo has quit IRC17:18
*** raildo has joined #openstack-nova17:18
*** david-lyle is now known as dklyle17:26
*** supamatt has joined #openstack-nova17:41
*** sean-k-mooney has quit IRC17:41
gansomelwitt, gibi: hi! could you please take a look at this backport? It already has a +2. Thanks in advance!18:02
melwittganso: link?18:04
gansomelwitt: sorry! forgot the link! https://review.opendev.org/c/openstack/nova/+/79432818:06
melwittthanks18:06
*** cz3|work is now known as pjakuszew18:08
opendevreviewLee Yarwood proposed openstack/nova master: DNM testing tempest volume detach failure capture of console  https://review.opendev.org/c/openstack/nova/+/79476618:35
*** sean-k-mooney has joined #openstack-nova19:04
*** vishalmanchanda has quit IRC19:14
sean-k-mooneygibi: looking at the ci results, https://review.opendev.org/c/openstack/ovsdbapp/+/795789 actually does seem to, if not fix, greatly improve the issue19:15
sean-k-mooneygibi: there are no reconnections or stalls with that patch19:15
sean-k-mooneyim still going to look at additional improvements within os-vif but that looks to me like a viable path forward19:16
*** swp20 has quit IRC19:46
*** andrewbonney has quit IRC20:00
*** spatel has quit IRC20:06
*** bnemec has quit IRC20:26
*** bnemec has joined #openstack-nova20:29
*** spatel has joined #openstack-nova20:35
*** spatel has quit IRC20:52
*** ralonsoh has quit IRC20:56
*** spatel has joined #openstack-nova21:47
*** spatel has quit IRC22:06
opendevreviewAde Lee proposed openstack/nova master: Add check job for FIPS  https://review.opendev.org/c/openstack/nova/+/79051922:16

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!