f0o | Heya; so the upgrade procedure hard-restarts systemd-networkd, which is a really bad thing - I just had a full network meltdown because the ToR routers' networkd was restarted and bgpd got rekt. Can't a simple networkd-reload be done instead of a hard restart? | 09:51 |
f0o | Speaking of this meltdown; is it safe to re-run the upgrade procedure? Obviously it now failed on one host since it nuked it from the face of the earth. I'd like to patch out the networkd-restart bit and give it another go | 09:52 |
f0o | or am I now FUBARd? | 09:52 |
f0o | oh, it ultimately failed and prompted me to rerun, so I guess all is well | 11:09 |
f0o | changed that 'restarted' into a 'reloaded' in ansible-role-systemd_networkd and am giving it another go | 11:10 |
f0o | the upgrade script is now past all the networkd stuff and the reload worked fine; it picked up the changes and didn't nuke the entire network stack (and subsequently bgpd) | 11:20 |
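(A minimal sketch of the kind of change described above, assuming the role drives the service through a handler built on ansible.builtin.systemd; the handler name below is illustrative, not necessarily what ansible-role-systemd_networkd actually calls it.)

```yaml
# Hedged sketch of the 'restarted' -> 'reloaded' change: a reload keeps
# existing links up, so sessions riding on them (e.g. BGP) survive, while
# still picking up changed .network/.netdev files.
- name: Reload systemd-networkd
  ansible.builtin.systemd:
    name: systemd-networkd
    state: reloaded
```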
jrosser | f0o: could you share the point in the code where the openstack-ansible upgrade restarted systemd-networkd? | 12:52 |
f0o | it was part of the setup-hosts playbook I believe - it did it pretty early on after gathering some facts in the first playbook it ran | 12:54 |
jrosser | have you defined openstack_hosts_systemd_networkd_devices or openstack_hosts_systemd_networkd_networks ? | 12:55 |
jrosser | reason i ask is that the host networking is not generally the responsibility of openstack-ansible, so it would be unexpected for the networking to be restarted | 12:55 |
f0o | right now I'm detangling a new web of issues... nova-compute decided to not start up anymore with: Timed out waiting for a reply to message ID | 12:56 |
f0o | jrosser: nope, didn't touch any of the network-setup related settings | 12:56 |
jrosser | then i am not sure that "the upgrade procedure hard-restarts systemd-networkd" | 12:57 |
f0o | well the upgrade runs a playbook which then causes systemd-networkd to be restarted on all hosts... not sure how else to phrase it | 12:57 |
jrosser | if you can provide a log then it would be helpful, if that does happen then it would be a bug | 12:58 |
f0o | I ran setup-hosts many times in 2023.2 and never had that behavior but on 2024.1 it happened - some change somewhere caused that without me having touched the config/vars | 12:58 |
f0o | will gather logs once I recover nova | 12:58 |
f0o | nova's error reads like an AMQP issue but rabbit seems fine and healthy and other nova-computes can connect to it fine... not sure what's up | 12:59 |
f0o | so what does `oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID xyz` mean on nova-compute? I see it connect and authenticate correctly against the rabbitmq servers | 13:10 |
f0o | but then nova-compute hits that error and just exits | 13:10 |
f0o | not even gracefully; rabbitmq even reports 'client unexpectedly closed TCP connection' | 13:11 |
f0o | who is supposed to reply to that message - I suspect some other nova-related service is not running correctly despite healthchecks passing | 13:12 |
f0o | https://paste.opendev.org/show/btN4MAl2yypjZqQJkPmi/ | 13:12 |
f0o | my first guess was the nova-conductor but it shows as running and has relatively recent logs | 13:16 |
f0o | fully restarted all nova-* components, now compute starts so I guess it was conductor despite showing Up/Healthy everywhere | 13:30 |
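(Side note, a hedged sketch: the reply f0o is waiting for comes from nova-conductor, which answers nova-compute's RPC calls; if a conductor is merely slow rather than not consuming at all, the oslo.messaging reply timeout can be raised through an OSA config override, assuming the usual `*_conf_overrides` mechanism. It would not have helped here, where restarting the conductor was the actual fix.)

```yaml
# Hypothetical user_variables.yml override; rpc_response_timeout is a real
# oslo.messaging option (seconds to wait for an RPC reply), the 120 value
# is just an example.
nova_nova_conf_overrides:
  DEFAULT:
    rpc_response_timeout: 120
```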
f0o | all instances are broken now and it seems that all NFS mounts were removed as well so everything fails with Could not open '/var/lib/nova/mnt/513df3acbd666026449818dcdae343d0/volume-bcce8032-b45e-4cfe-948a-b5a6ae1b832f' ... | 13:31 |
f0o | I'm starting to regret that upgrade lol | 13:31 |
f0o | something is really wrong... I had to chmod 775 all mountpoints and 777 all volumes... This feels very very wrong | 13:41 |
f0o_ | all instances hard-rebooted and back after that scary chmod... not even sure why it is required now... did the UID/GID change? | 13:54 |
f0o_ | makes no sense... | 13:54 |
f0o_ | oh yeah the UID/GID did change from nova to libvirt-qemu:kvm | 14:10 |
f0o_ | holy crap no wonder all VMs crapped out | 14:10 |
f0o_ | so I gotta change my NFS exports to squash to those UID/GIDs | 14:11 |
jrosser | f0o_: here is what we do for nfs CI tests https://github.com/openstack/openstack-ansible/blob/master/tests/roles/bootstrap-host/tasks/prepare_nfs.yml#L81 | 15:20 |
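(A hedged sketch of the export change f0o describes, written as an Ansible task in the spirit of the CI file linked above; the share path and the anonuid/anongid numbers are placeholders, check `id libvirt-qemu` and `getent group kvm` on a compute host for the real IDs in this environment.)

```yaml
# Hypothetical task: squash all NFS access to the libvirt-qemu:kvm IDs that
# nova/libvirt now use on the upgraded hosts. Run `exportfs -ra` afterwards
# to apply the changed export options.
- name: Export the instance/volume share squashed to libvirt-qemu:kvm
  ansible.builtin.lineinfile:
    path: /etc/exports
    line: "/srv/nfs/nova *(rw,sync,no_subtree_check,all_squash,anonuid=64055,anongid=108)"
```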
f0o_ | how can I regenerate the /etc/hosts for the deployment host? | 15:45 |
jrosser | there is a tag openstack_hosts-file in the openstack_hosts role for that | 15:49 |
f0o_ | openvswitch was updated as part of this, and it turns out the latest version is incompatible with the previous one - so all networking stopped... now I am moving the deployment host into a temporary docker container on one of the hosts in the same rack, so I don't need to route to reach the targets, just switch | 15:50 |
f0o_ | Learning Experience | 15:50 |
f0o_ | ty @ tag | 15:50 |
jrosser | we do test upgrades for each merged patch | 15:52 |
f0o_ | the error in question is in /var/log/openvswitch/ovs-vswitchd.log with signature: Invalid Geneve tunnel metadata on bridge br-int while processing | 15:53 |
jrosser | which operating system is this? | 15:54 |
f0o_ | after openvswitch upgraded from 2.3.1 (2023.2 shipped this) to 3.3.0 (2024.1 ships this) | 15:54 |
f0o_ | ubuntu jammy | 15:54 |
f0o_ | so right now one host is on 3.3.0 and it can't talk to the rest due to that invalid geneve tunnel metadata, so it broke traffic flow for me | 15:55 |
f0o_ | correction the version from 2023.2 was 3.2.1 - I swapped major/minor in my earlier msg | 15:57 |
f0o_ | odd that a minor version bump caused the incompatibility | 15:57 |
f0o_ | but bfd_status shows the host as down from all the 3.2.1 variants, and from the PoV of the 3.3.0 variant the opposite is the case | 15:58 |
jrosser | that is almost certainly something for the neutron team | 16:01 |
f0o_ | doing the jack-hammer approach now of rerunning the setup-openstack playbook, hoping it will just smooth things out. I've got another behavioral difference from 2023.2 regarding cinder-volume: there is now a /var/lib/cinder/mnt which nova needs to access... but ofc the UID/GIDs mismatch | 16:03 |
f0o_ | I'm not sure where this one came from and if I can just bind-mount it into /var/lib/nova/mnt | 16:04 |
jrosser | that will be the cinder nfs_mount_point_base | 16:11 |
jrosser | https://docs.openstack.org/cinder/latest/configuration/block-storage/drivers/nfs-volume-driver.html | 16:11 |
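(A hedged sketch of where that directory comes from, assuming an OSA `cinder_backends` entry for the NFS driver; the backend name and shares file are illustrative, the option names follow the driver docs linked above. nfs_mount_point_base defaults to $state_path/mnt, i.e. /var/lib/cinder/mnt.)

```yaml
# Hypothetical user_variables.yml backend definition for cinder's NFS driver.
cinder_backends:
  nfs_volumes:
    volume_backend_name: nfs_volumes
    volume_driver: cinder.volume.drivers.nfs.NfsDriver
    nfs_shares_config: /etc/cinder/nfs_shares
    nfs_mount_point_base: /var/lib/cinder/mnt
```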
f0o_ | hrm that's not anything new, why is it causing issues now tho... | 16:13 |
jrosser | i cannot say tbh - are you trying to upgrade directly some production system or a staging environment? | 16:13 |
f0o_ | on the upside, the jack-hammer method solved the networking for now - all compute nodes seem to see each other again | 16:14 |
f0o_ | this is luckily staging; although <this> shell is hosted on it | 16:15 |
f0o_ | so if I drop out, you know I messed up | 16:15 |
jrosser | fwiw we have a full staging environment, and usually have 10-20 items we have to fix/patch in general bits of openstack, and some also in openstack-ansible, at each upgrade | 16:15 |
jrosser | i am not surprised at all about your rabbitmq issues - there are race conditions in oslo.messaging on service restarts, depending on how you have it configured | 16:17 |
f0o_ | so openvswitch 3.3.0 seems to be missing the route-table patch that was released March 22nd | 16:17 |
f0o_ | but this is luckily not on os-ansible to fix :D | 16:20 |
f0o_ | how can I override which .deb is being installed? I got a patched openvswitch-switch/common .deb with a higher version than 3.3.0 but it keeps being downgraded to 3.3.0 | 17:05 |
f0o_ | I checked the upper-constraints which I know is used for all the python stuff but no luck there | 17:12 |
f0o_ | apt-pin seems to be working | 17:18 |
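(A hedged sketch of such an apt pin, expressed as an Ansible task so it survives re-runs of the playbooks; the package list, version glob and priority are illustrative and depend on how the patched .deb is versioned.)

```yaml
# Hypothetical pin file: pin the patched build at a priority that wins over
# the distro/UCA 3.3.0 packages, so apt runs driven by the role stop
# replacing it.
- name: Pin locally patched openvswitch packages
  ansible.builtin.copy:
    dest: /etc/apt/preferences.d/openvswitch-local.pref
    mode: "0644"
    content: |
      Package: openvswitch-switch openvswitch-common
      Pin: version 3.3.0+patched*
      Pin-Priority: 1001
```

Depending on how the role invokes apt, `apt-mark hold` on those packages may be a simpler alternative.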
f0o_ | finally got through the whole upgrade but now horizon is crashlooping with `AttributeError: module 'django.middleware.csrf' has no attribute 'REASON_BAD_ORIGIN'` haha | 17:46 |
f0o_ | well I'm spent for the day, will debug that tomorrow | 17:46 |
f0o_ | ok that was an easy fix; it just didn't update django | 17:48 |
f0o_ | ok not fixed; lots of these `UserWarning: Policy ... scoped but the policy requires ['project'] scope. This behavior may change in the future where using the intended scope is required` in the logs, and I can't list volumes or instances in projects, but through admin it works | 17:53 |
f0o_ | tomorrow's issues | 17:53 |
f0o_ | couldn't let it rest; restarted the horizon lxc container and now, although the error is still reported, it displays fine. I guess that error was always there then | 18:11 |