opendevreview | melanie witt proposed openstack/nova master: WIP libvirt: Introduce support for rbd with LUKS https://review.opendev.org/c/openstack/nova/+/889912 | 01:01 |
opendevreview | Elod Illes proposed openstack/nova master: Fix tox docs target https://review.opendev.org/c/openstack/nova/+/891033 | 13:24 |
opendevreview | Jimmy McCrory proposed openstack/nova master: Fix typo in placement audit command argument https://review.opendev.org/c/openstack/nova/+/891035 | 14:00 |
opendevreview | Dan Smith proposed openstack/nova master: DNM: Test glance new location API https://review.opendev.org/c/openstack/nova/+/891036 | 14:27 |
kashyap | dansmith: Thank you for doing the recheck work for me! https://review.opendev.org/c/openstack/nova/+/887255 </embarrassed> | 14:56 |
ozzzo_work | I have a couple of VMs that are half-created: https://paste.openstack.org/show/b7wqqzgumoH8tdtvmEh1/ | 15:28 |
ozzzo_work | how can I clean this up? I looked in the db and they don't exist in "instances" | 15:28 |
dansmith | ozzzo_work: check the API database's build_requests table | 15:30 |
dansmith | I'm guessing you're seeing that because those request-to-build-an-instance entries are there and we never got beyond that | 15:31 |
dansmith | I'd check the logs to see if you're seeing any fireworks there, because that *should* be resolvable, AFAIK | 15:31 |
dansmith | has any manual DB surgery been done? | 15:31 |
ozzzo_work | oh I see, it was in the nova_cell0 database | 15:49 |
ozzzo_work | we're not using multiple cells but I guess everything must go in cell0 by default now | 15:50 |
dansmith | ozzzo_work: dead instances that fail to schedule always go into the cell0 DB | 16:12 |
dansmith | everyone is using cells, you might just have only one cell of computes :) | 16:12 |
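To make the cleanup path above concrete, here is a minimal sketch of the checks dansmith and ozzzo_work describe, assuming a typical deployment with nova_api and nova_cell0 MySQL schemas; the connection URLs, credentials and column selection are illustrative, not taken from the discussion (the real URLs show up in `nova-manage cell_v2 list_cells`).

```python
# Minimal sketch: list leftover build requests in the nova_api DB and the
# instance records parked in nova_cell0. Connection URLs are placeholders;
# use the ones from your actual deployment.
from sqlalchemy import create_engine, text

api_engine = create_engine("mysql+pymysql://nova:secret@localhost/nova_api")
cell0_engine = create_engine("mysql+pymysql://nova:secret@localhost/nova_cell0")

with api_engine.connect() as conn:
    for row in conn.execute(text(
            "SELECT instance_uuid, created_at FROM build_requests")):
        print("pending build request:", row.instance_uuid, row.created_at)

with cell0_engine.connect() as conn:
    for row in conn.execute(text(
            "SELECT uuid, vm_state, task_state FROM instances WHERE deleted = 0")):
        print("cell0 instance:", row.uuid, row.vm_state, row.task_state)
```

As dansmith notes, instances that fail to schedule are mapped into cell0, so they can usually still be deleted through the normal API rather than by editing the database directly.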
melwitt | dansmith: have you noticed "no valid host" fails in nova-ceph-multistore and "Seems service nova-compute on host np0034917240 is down. Last heartbeat was 2023-08-10 15:59:29. Elapsed time is 847.603029" messages in the n-sch log? | 17:02 |
dansmith | I saw one example of that and it looked like rabbit crashed or something, because I saw a bunch of AMQP errors in multiple places | 17:03 |
melwitt | hm, I hadn't noticed the rabbit correlation | 17:04 |
melwitt | in the couple of examples I saw so far, the n-cpu log just stops ... early, and the "no valid host" / "returned 0 hosts" messages in n-sch start after the last timestamp seen in n-cpu | 17:06 |
melwitt | last log message Aug 10 16:00:00.843521 https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-cpu.txt and then first "ComputeFilter returned 0 hosts" at Aug 10 16:13:36.604315 https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-sch.txt#3881 | 17:09 |
dansmith | melwitt: you've seen multiple cases of n-cpu just freezing or just the one? | 17:10 |
melwitt | my first thought was, is it possible n-cpu is being OOM killed or something but I don't see any OOM messages in syslog.txt (assuming that's where they would be) | 17:11 |
dansmith | yeah I saw no OOM in my case either (n-cpu or otherwise) | 17:13 |
dansmith | yep, syslog | 17:13 |
melwitt | I saw two so far, the other one being https://zuul.openstack.org/build/4884721dbf764b33995d71c990f65dfe/logs | 17:15 |
melwitt | (there's a third but it's in a DNM ceph job I'm using to test ephemeral encryption) | 17:15 |
dansmith | oh that one was mine, yeah | 17:16 |
dansmith | melwitt: so in that last one, the last thing nova logs is running a ceph command ... :) | 17:19 |
dansmith | oh but it returned | 17:19 |
melwitt | :) | 17:22 |
melwitt | it does seem to stop at the same place though, "Creating image(s)" | 17:22 |
dansmith | weird, I dunno what would have changed in the last few days related to this, and I haven't seen this before | 17:27 |
melwitt | same :( | 17:28 |
sean-k-mooney | f**** | 17:33 |
sean-k-mooney | DEBUG ovsdbapp.backend.ovs_idl.vlog [-] 4999-ms timeout {{(pid=1435862) __log_wakeup /opt/stack/data/venv/lib/python3.10/site-packages/ovs/poller.py:248}} | 17:34 |
* melwitt imagines sean flipping a table | 17:34 |
sean-k-mooney | are we seeing that in CI? | 17:34 |
sean-k-mooney | it looks like the ovsdbapp thing might have regressed | 17:35 |
sean-k-mooney | I'm going to check my devstack/nova logs | 17:35 |
sean-k-mooney | that's the issue where OVS blocks nova for 5 seconds at a time | 17:35 |
melwitt | actually I found one in there https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/screen-n-cpu.txt#39523 | 17:36 |
sean-k-mooney | this could be a regression from the devstack-in-a-venv change | 17:36 |
sean-k-mooney | but my code in devstack to work around this does not seem to have put the setting in nova.conf | 17:36 |
melwitt | it's logged 15 times on that run | 17:37 |
sean-k-mooney | really..... https://github.com/openstack/devstack/commit/db3eff7dd27acdc973e8d189bda80d642be92f03 | 17:38 |
sean-k-mooney | I need to revert that | 17:39 |
sean-k-mooney | this fix only works if you use a new enough OVS | 17:39 |
sean-k-mooney | which apparently is not in Ubuntu 22.04 | 17:39 |
melwitt | the other example run has DEBUG ovsdbapp.backend.ovs_idl.vlog [-] 999-ms timeout {{(pid=98090) __log_wakeup /usr/local/lib/python3.10/dist-packages/ovs/poller.py:248}} not sure if that's bad also | 17:42 |
melwitt | it doesn't have the 4999 | 17:42 |
sean-k-mooney | yep, the exact amount isn't really important | 17:42 |
sean-k-mooney | the timeout is around 5 seconds | 17:43 |
sean-k-mooney | but you can get reconnects at shorter intervals depending on what else nova is doing | 17:43 |
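A rough way to see how much blocked time those messages add up to in a run, assuming a locally downloaded screen-n-cpu.txt and treating each "<N>-ms timeout" vlog line as one blocked interval (an approximation, not something stated above):

```python
# Rough sketch: total up the "<N>-ms timeout" messages ovsdbapp's vlog writes
# into a downloaded screen-n-cpu.txt to estimate how long nova-compute sat
# blocked on its OVSDB connection. The filename and the one-line-per-block
# assumption are illustrative.
import re

pattern = re.compile(r"ovsdbapp\.backend\.ovs_idl\.vlog \[-\] (\d+)-ms timeout")

count = 0
total_ms = 0
with open("screen-n-cpu.txt") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            count += 1
            total_ms += int(match.group(1))

print(f"{count} timeout messages, roughly {total_ms / 1000:.1f}s blocked in total")
```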
sean-k-mooney | https://review.opendev.org/c/openstack/devstack/+/891101 will revert it | 17:46 |
melwitt | wonder why I only see it with ceph. a coincidence? | 17:46 |
sean-k-mooney | probably, I'm seeing it locally | 17:46 |
sean-k-mooney | I noticed it while I was looking at something for dansmith | 17:47 |
melwitt | oh ok | 17:47 |
sean-k-mooney | so I have a 50-second span | 17:52 |
sean-k-mooney | where the compute agent does not log anything else, implying it was blocked for 50 seconds reconnecting to OVS | 17:52 |
sean-k-mooney | the problem is I have a version of OVS that apparently has this fix and a version of ovsdbapp that also has this fix | 17:52 |
sean-k-mooney | so either the fix does not work or it has regressed again | 17:53 |
sean-k-mooney | this would cause the CI jobs to fail a lot, but also cause issues for people in production, obviously | 17:53 |
melwitt | sean-k-mooney++ | 17:56 |
sean-k-mooney | so I have patched versions of os-vif, ovsdbapp and the OVS Python bindings and it's still happening, meaning it looks like maybe it regressed again | 17:58 |
melwitt | 🤯 | 18:00 |
sean-k-mooney | or this is the problem https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovsdb/impl_idl.py#L35 | 18:02 |
sean-k-mooney | this is the ovsdbapp fix | 18:02 |
sean-k-mooney | https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L148-L150 | 18:02 |
sean-k-mooney | in its subclass of idl.Idl | 18:02 |
sean-k-mooney | but we are doing "return idl.Idl(conn, helper)" directly, using the one from ovs | 18:03 |
sean-k-mooney | that should be using the NeutronOvsdbIdl class below, I think | 18:04 |
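To make that concrete, a hedged sketch of the shape of the change being suggested, not the actual os-vif patch: have the factory return ovsdbapp's Idl subclass (where the fix linked above lives) instead of constructing the plain python-ovs Idl. The connection string and schema name are the usual local defaults, assumed here for illustration.

```python
# Sketch only: contrast roughly what the linked impl_idl.py does today with
# the suggested shape, i.e. let ovsdbapp build its own idl.Idl subclass so the
# reconnect handling in ovsdbapp/backend/ovs_idl/connection.py applies.
from ovs.db import idl
from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.backend.ovs_idl import idlutils


def current_idl_factory(connection_string="tcp:127.0.0.1:6640"):
    # Build the schema helper and hand back the plain python-ovs Idl,
    # bypassing ovsdbapp's subclass entirely.
    helper = idlutils.get_schema_helper(connection_string, "Open_vSwitch")
    helper.register_all()
    return idl.Idl(connection_string, helper)


def suggested_idl_factory(connection_string="tcp:127.0.0.1:6640"):
    # Suggested shape: use ovsdbapp's subclass of idl.Idl so its connection
    # fix is actually in play.
    return connection.OvsdbIdl.from_server(connection_string, "Open_vSwitch")
```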
dansmith | melwitt: oh SNAP | 18:35 |
dansmith | melwitt: I wonder if the thing with the ceph job is something I broke | 18:35 |
dansmith | melwitt: https://review.opendev.org/c/openstack/devstack/+/890439 | 18:36 |
dansmith | I adjusted the interval and the deadline so I think it *should* be okay | 18:36 |
dansmith | although that wouldn't explain nova just not logging anything after the hang | 18:36 |
melwitt | yeah.. 🤔 | 18:37 |
sean-k-mooney | ok the issue I mentioned might be fixed | 18:54 |
sean-k-mooney | I configured my keepalive interval in the kernel to 10 seconds, I think, but OVS is expecting probes every 5 seconds | 18:55 |
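For reference, a quick sketch of checking the kernel-side keepalive settings being compared against that ~5 second probe expectation; the procfs paths are the standard Linux ones, and the 5-second figure is taken from the discussion, not verified against OVS here.

```python
# Quick sketch: print the kernel TCP keepalive sysctls so they can be compared
# with the ~5s inactivity probe interval mentioned above.
from pathlib import Path

SYSCTLS = {
    "tcp_keepalive_time": "idle seconds before the first keepalive probe",
    "tcp_keepalive_intvl": "seconds between keepalive probes",
    "tcp_keepalive_probes": "unanswered probes before the connection drops",
}

OVS_PROBE_SECONDS = 5  # figure quoted in the discussion above

for name, meaning in SYSCTLS.items():
    value = int(Path(f"/proc/sys/net/ipv4/{name}").read_text())
    print(f"{name} = {value}  ({meaning})")

idle = int(Path("/proc/sys/net/ipv4/tcp_keepalive_time").read_text())
if idle > OVS_PROBE_SECONDS:
    print(f"kernel keepalive idle time ({idle}s) is longer than the "
          f"~{OVS_PROBE_SECONDS}s probe interval OVS expects")
```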
melwitt | yeah, I was just looking at it again and n-cpu hadn't reported a heartbeat in 14 minutes and never came back after the hang | 18:55 |
melwitt | bizarre | 18:57 |
melwitt | libvirt log activity also stops for the most part at the same time n-cpu stops https://zuul.openstack.org/build/e7c306fad21e4057a3af40720049482a/log/controller/logs/libvirt/libvirt/libvirtd_log.txt | 19:00 |
melwitt | not sure if anything near the end of that log is unusual or problematic | 19:00 |
melwitt | finally found that on those failures, at the same time that n-cpu freezes, ceph starts logging WRN (warn) health status such as "osd.0 47 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.15524.0:32 4.1e 4:7cf325a3:::rbd_header.3c6ebb6b8706:head [watch ping cookie 140317856977664] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e47)" | 21:16 |
melwitt | example https://zuul.openstack.org/build/4884721dbf764b33995d71c990f65dfe/log/controller/logs/ceph/ceph/7f57d687-6678-46f4-ae09-7e6faa476d1a/ceph-osd.0_log.txt#11752 | 21:16 |
melwitt | log_channel(cluster) log [WRN] : 1 slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'vms' : 1 ]) | 21:17 |
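A small sketch of the kind of scan that surfaces those warnings for correlation with the n-cpu timestamps; the log filename is a placeholder for a downloaded ceph-osd log.

```python
# Small sketch: print the "slow ops" / "slow requests" warning lines from a
# downloaded ceph-osd log so their timestamps can be lined up against the
# last entry in screen-n-cpu.txt. Filename is a placeholder.
import re

pattern = re.compile(r"slow (ops|requests)", re.IGNORECASE)

with open("ceph-osd.0.log") as log:
    for line in log:
        if pattern.search(line):
            print(line.rstrip()[:160])
```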
opendevreview | melanie witt proposed openstack/nova master: DNM test ephemeral encryption + resize: qcow2, raw, rbd https://review.opendev.org/c/openstack/nova/+/862416 | 22:50 |