opendevreview | melanie witt proposed openstack/nova master: WIP libvirt: Introduce support for rbd with LUKS https://review.opendev.org/c/openstack/nova/+/889912 | 01:01 |
opendevreview | Elod Illes proposed openstack/nova master: Fix tox docs target https://review.opendev.org/c/openstack/nova/+/891033 | 13:24 |
opendevreview | Jimmy McCrory proposed openstack/nova master: Fix typo in placement audit command argument https://review.opendev.org/c/openstack/nova/+/891035 | 14:00 |
opendevreview | Dan Smith proposed openstack/nova master: DNM: Test glance new location API https://review.opendev.org/c/openstack/nova/+/891036 | 14:27 |
kashyap | dansmith: Thank you for doing the recheck work for me! https://review.opendev.org/c/openstack/nova/+/887255 </embarrassed> | 14:56 |
ozzzo_work | I have a couple of VMs that are half-created: https://paste.openstack.org/show/b7wqqzgumoH8tdtvmEh1/ | 15:28 |
ozzzo_work | how can I clean this up? I looked in the db and they don't exist in "instances" | 15:28 |
dansmith | ozzzo_work: check the API database's build_requests table | 15:30 |
dansmith | I'm guessing you're seeing that because those request-to-build-an-instance entries are there and we never got beyond that | 15:31 |
dansmith | I'd check the logs to see if you're seeing any fireworks there, because that *should* be resolvable, AFAIK | 15:31 |
dansmith | has any manual DB surgery been done? | 15:31 |
ozzzo_work | oh I see, it was in the nova_cell0 database | 15:49 |
ozzzo_work | we're not using multiple cells but I guess everything must go in cell0 by default now | 15:50 |
dansmith | ozzzo_work: dead instances that fail to schedule always go into the cell0 DB | 16:12 |
dansmith | everyone is using cells, you might just have only one cell of computes :) | 16:12 |
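To make the cleanup path above concrete, here is a minimal sketch of the checks dansmith and ozzzo_work describe, assuming a typical deployment with nova_api and nova_cell0 MySQL schemas; the connection URLs, credentials and column selection are illustrative, not taken from the discussion (the real URLs show up in `nova-manage cell_v2 list_cells`).

```python
# Minimal sketch: list leftover build requests in the nova_api DB and the
# instance records parked in nova_cell0. Connection URLs are placeholders;
# use the ones from your actual deployment.
from sqlalchemy import create_engine, text

api_engine = create_engine("mysql+pymysql://nova:secret@localhost/nova_api")
cell0_engine = create_engine("mysql+pymysql://nova:secret@localhost/nova_cell0")

with api_engine.connect() as conn:
    for row in conn.execute(text(
            "SELECT instance_uuid, created_at FROM build_requests")):
        print("pending build request:", row.instance_uuid, row.created_at)

with cell0_engine.connect() as conn:
    for row in conn.execute(text(
            "SELECT uuid, vm_state, task_state FROM instances WHERE deleted = 0")):
        print("cell0 instance:", row.uuid, row.vm_state, row.task_state)
```

As dansmith notes, instances that fail to schedule are mapped into cell0, so they can usually still be deleted through the normal API rather than by editing the database directly.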
melwitt | dansmith: have you noticed "no valid host" fails in nova-ceph-multistore and "Seems service nova-compute on host np0034917240 is down. Last heartbeat was 2023-08-10 15:59:29. Elapsed time is 847.603029" messages in the n-sch log? | 17:02 |
dansmith | I saw one example of that and it looked like rabbit crashed or something, because I saw a bunch of AMQP errors in multiple places | 17:03 |
melwitt | hm, I hadn't noticed the rabbit correlation | 17:04 |
melwitt | in the couple of examples I saw so far, the n-cpu log just stops ... early, and the "no valid host" / "returned 0 hosts" messages in n-sch start after the last timestamp seen in n-cpu | 17:06 |
melwitt | last log message Aug 10 16:00:00.843521 https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-cpu.txt and then first "ComputeFilter returned 0 hosts" at Aug 10 16:13:36.604315 https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-sch.txt#3881 | 17:09 |
dansmith | melwitt: you've seen multiple cases of n-cpu just freezing or just the one? | 17:10 |
melwitt | my first thought was, is it possible n-cpu is being OOM killed or something but I don't see any OOM messages in syslog.txt (assuming that's where they would be) | 17:11 |
dansmith | yeah I saw no OOM in my case either (n-cpu or otherwise) | 17:13 |
dansmith | yep, syslog | 17:13 |
melwitt | I saw two so far, the other one being https://zuul.openstack.org/build/4884721dbf764b33995d71c990f65dfe/logs | 17:15 |
melwitt | (there's a third but it's in a DNM ceph job I'm using to test ephemeral encryption) | 17:15 |
dansmith | oh that one was mine, yeah | 17:16 |
dansmith | melwitt: so in that last one, the last thing nova logs is running a ceph command ... :) | 17:19 |
dansmith | oh but it returned | 17:19 |
melwitt | :) | 17:22 |
melwitt | it does seem to stop at the same place though, "Creating image(s)" | 17:22 |
dansmith | weird, I dunno what would have changed in the last few days related to this, and I haven't seen this before | 17:27 |
melwitt | same :( | 17:28 |
sean-k-mooney | f**** | 17:33 |
sean-k-mooney | DEBUG ovsdbapp.backend.ovs_idl.vlog [-] 4999-ms timeout {{(pid=1435862) __log_wakeup /opt/stack/data/venv/lib/python3.10/site-packages/ovs/poller.py:248}} | 17:34 |
* melwitt imagines sean flipping a table | 17:34 |
sean-k-mooney | are we seeing that in CI? | 17:34 |
sean-k-mooney | it looks like the ovsdbapp thing might have regressed | 17:35 |
sean-k-mooney | I'm going to check my devstack/nova logs | 17:35 |
sean-k-mooney | that's the issue where OVS blocks nova for 5 seconds at a time | 17:35 |
melwitt | actually I found one in there https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-cpu.txt#39523 | 17:36 |
sean-k-mooney | this could be a regression from the devstack-in-a-venv change | 17:36 |
sean-k-mooney | but my code in devstack to work around this does not seem to have put the setting in nova.conf | 17:36 |
melwitt | it's logged 15 times on that run | 17:37 |
sean-k-mooney | really..... https://github.com/openstack/devstack/commit/db3eff7dd27acdc973e8d189bda80d642be92f03 | 17:38 |
sean-k-mooney | I need to revert that | 17:39 |
sean-k-mooney | this fix only works if you use a new enough OVS | 17:39 |
sean-k-mooney | which apparently is not in Ubuntu 22.04 | 17:39 |
melwitt | the other example run has DEBUG ovsdbapp.backend.ovs_idl.vlog [-] 999-ms timeout {{(pid=98090) __log_wakeup /usr/local/lib/python3.10/dist-packages/ovs/poller.py:248}} not sure if that's bad also | 17:42 |
melwitt | it doesn't have the 4999 | 17:42 |
sean-k-mooney | yep, the exact amount isn't really important | 17:42 |
sean-k-mooney | the timeout is around 5 seconds | 17:43 |
sean-k-mooney | but you can get reconnects at shorter intervals depending on what else nova is doing | 17:43 |
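A rough way to see how much blocked time those messages add up to in a run, assuming a locally downloaded screen-n-cpu.txt and treating each "<N>-ms timeout" vlog line as one blocked interval (an approximation, not something stated above):

```python
# Rough sketch: total up the "<N>-ms timeout" messages ovsdbapp's vlog writes
# into a downloaded screen-n-cpu.txt to estimate how long nova-compute sat
# blocked on its OVSDB connection. The filename and the one-line-per-block
# assumption are illustrative.
import re

pattern = re.compile(r"ovsdbapp\.backend\.ovs_idl\.vlog \[-\] (\d+)-ms timeout")

count = 0
total_ms = 0
with open("screen-n-cpu.txt") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            count += 1
            total_ms += int(match.group(1))

print(f"{count} timeout messages, roughly {total_ms / 1000:.1f}s blocked in total")
```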
sean-k-mooney | https://review.opendev.org/c/openstack/devstack/+/891101 will revert it | 17:46 |
melwitt | wonder why I only see it with ceph. a coincidence? | 17:46 |
sean-k-mooney | probably, I'm seeing it locally | 17:46 |
sean-k-mooney | I noticed it while I was looking at something for dansmith | 17:47 |
melwitt | oh ok | 17:47 |
sean-k-mooney | so I have a 50-second span | 17:52 |
sean-k-mooney | where the compute agent does not log anything else, implying it was blocked for 50 seconds reconnecting to OVS | 17:52 |
sean-k-mooney | the problem is I have a version of OVS that apparently has this fix and a version of ovsdbapp that also has this fix | 17:52 |
sean-k-mooney | so either the fix does not work or it has regressed again | 17:53 |
sean-k-mooney | this would cause the CI jobs to fail a lot, but also cause issues for people in production, obviously | 17:53 |
melwitt | sean-k-mooney++ | 17:56 |
sean-k-mooney | so I have patched versions of os-vif, ovsdbapp and the OVS Python bindings and it's still happening, meaning it looks like maybe it regressed again | 17:58 |
melwitt | 🤯 | 18:00 |
sean-k-mooney | or this is the problem https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovsdb/impl_idl.py#L35 | 18:02 |
sean-k-mooney | this is the ovsdbapp fix | 18:02 |
sean-k-mooney | https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L148-L150 | 18:02 |
sean-k-mooney | in its subclass of idl.Idl | 18:02 |
sean-k-mooney | but we are doing "return idl.Idl(conn, helper)" directly, using the one from ovs | 18:03 |
sean-k-mooney | that should be using the NeutronOvsdbIdl class below, I think | 18:04 |
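To make that concrete, a hedged sketch of the shape of the change being suggested, not the actual os-vif patch: have the factory return ovsdbapp's Idl subclass (where the fix linked above lives) instead of constructing the plain python-ovs Idl. The connection string and schema name are the usual local defaults, assumed here for illustration.

```python
# Sketch only: contrast roughly what the linked impl_idl.py does today with
# the suggested shape, i.e. let ovsdbapp build its own idl.Idl subclass so the
# reconnect handling in ovsdbapp/backend/ovs_idl/connection.py applies.
from ovs.db import idl
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.backend.ovs_idl import idlutils


def current_idl_factory(connection_string="tcp:127.0.0.1:6640"):
    # Build the schema helper and hand back the plain python-ovs Idl,
    # bypassing ovsdbapp's subclass entirely.
    helper = idlutils.get_schema_helper(connection_string, "Open_vSwitch")
    helper.register_all()
    return idl.Idl(connection_string, helper)


def suggested_idl_factory(connection_string="tcp:127.0.0.1:6640"):
    # Suggested shape: use ovsdbapp's subclass of idl.Idl so its connection
    # fix is actually in play.
    return connection.OvsdbIdl.from_server(connection_string, "Open_vSwitch")
```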
dansmith | melwitt: oh SNAP | 18:35 |
dansmith | melwitt: I wonder if the thing with the ceph job is something I broke | 18:35 |
dansmith | melwitt: https://review.opendev.org/c/openstack/devstack/+/890439 | 18:36 |
dansmith | I adjusted the interval and the deadline so I think it *should* be okay | 18:36 |
dansmith | although that wouldn't explain nova just not logging anything after the hang | 18:36 |
melwitt | yeah.. 🤔 | 18:37 |
sean-k-mooney | ok the issue I mentioned might be fixed | 18:54 |
sean-k-mooney | I configured my keepalive interval in the kernel to 10 seconds, I think, but OVS is expecting probes every 5 seconds | 18:55 |
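For reference, a quick sketch of checking the kernel-side keepalive settings being compared against that ~5 second probe expectation; the procfs paths are the standard Linux ones, and the 5-second figure is taken from the discussion, not verified against OVS here.

```python
# Quick sketch: print the kernel TCP keepalive sysctls so they can be compared
# with the ~5s inactivity probe interval mentioned above.
from pathlib import Path

SYSCTLS = {
    "tcp_keepalive_time": "idle seconds before the first keepalive probe",
    "tcp_keepalive_intvl": "seconds between keepalive probes",
    "tcp_keepalive_probes": "unanswered probes before the connection drops",
}

OVS_PROBE_SECONDS = 5  # figure quoted in the discussion above

for name, meaning in SYSCTLS.items():
    value = int(Path(f"/proc/sys/net/ipv4/{name}").read_text())
    print(f"{name} = {value}  ({meaning})")

idle = int(Path("/proc/sys/net/ipv4/tcp_keepalive_time").read_text())
if idle > OVS_PROBE_SECONDS:
    print(f"kernel keepalive idle time ({idle}s) is longer than the "
          f"~{OVS_PROBE_SECONDS}s probe interval OVS expects")
```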
melwitt | yeah, I was just looking at it again and n-cpu hadn't reported a heartbeat in 14 minutes and never came back after the hang | 18:55 |
melwitt | bizarre | 18:57 |
melwitt | libvirt log activity also stops for the most part at the same time n-cpu stops https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/libvirt/libvirt/libvirtd_log.txt | 19:00 |
melwitt | not sure if anything near the end of that log is unusual or problematic | 19:00 |
melwitt | finally found that on those failures, at the same time that n-cpu freezes, ceph starts logging WRN (warn) health status such as "osd.0 47 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.15524.0:32 4.1e 4:7cf325a3:::rbd_header.3c6ebb6b8706:head [watch ping cookie 140317856977664] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e47)" | 21:16 |
melwitt | example https://zuul.openstack.org/build/4884721dbf764b33995d71c990f65dfe/log/controller/logs/ceph/ceph/7f57d687-6678-46f4-ae09-7e6faa476d1a/ceph-osd.0_log.txt#11752 | 21:16 |
melwitt | log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'vms' : 1 ]) | 21:17 |
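A small sketch of the kind of scan that surfaces those warnings for correlation with the n-cpu timestamps; the log filename is a placeholder for a downloaded ceph-osd log.

```python
# Small sketch: print the "slow ops" / "slow requests" warning lines from a
# downloaded ceph-osd log so their timestamps can be lined up against the
# last entry in screen-n-cpu.txt. Filename is a placeholder.
import re

pattern = re.compile(r"slow (ops|requests)", re.IGNORECASE)

with open("ceph-osd.0.log") as log:
    for line in log:
        if pattern.search(line):
            print(line.rstrip()[:160])
```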
opendevreview | melanie witt proposed openstack/nova master: DNM test ephemeral encryption + resize: qcow2, raw, rbd https://review.opendev.org/c/openstack/nova/+/862416 | 22:50 |