f0o | Heya; so the upgrade procedure hard-restarts systemd-networkd, which is a really bad thing - I just had a full network meltdown because the ToR routers' networkd was restarted and bgpd got rekt. Can't a simple networkd-reload be done instead of a hard restart? | 09:51 |
f0o | Speaking of this meltdown; is it safe to re-run the upgrade procedure? Obviously it now failed on one host since it nuked it from the face of the earth. I'd like to patch out the networkd-restart bit and give it another go | 09:52 |
f0o | or am I now FUBARd? | 09:52 |
f0o | oh, it ultimately failed and prompted me to rerun, so I guess all is well | 11:09 |
f0o | changed that 'restarted' into a 'reloaded' in ansible-role-systemd_networkd and am giving it another go | 11:10 |
f0o | the upgrade script is now past all the networkd stuff and the reload worked fine; it picked up the changes and didn't nuke the entire network stack (and subsequently bgpd) | 11:20 |
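(A minimal sketch of the kind of change described above, assuming the role drives the service through a handler built on ansible.builtin.systemd; the handler name below is illustrative, not necessarily what ansible-role-systemd_networkd actually calls it.)

```yaml
# Hedged sketch of the 'restarted' -> 'reloaded' change: a reload keeps
# existing links up, so sessions riding on them (e.g. BGP) survive, while
# still picking up changed .network/.netdev files.
- name: Reload systemd-networkd
  ansible.builtin.systemd:
    name: systemd-networkd
    state: reloaded
```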
jrosser | f0o: could you share the point in the code where the openstack-ansible upgrade restarted systemd-networkd? | 12:52 |
f0o | it was part of the setup-hosts playbook I believe - it did it pretty early on after gathering some facts in the first playbook it ran | 12:54 |
jrosser | have you defined openstack_hosts_systemd_networkd_devices or openstack_hosts_systemd_networkd_networks ? | 12:55 |
jrosser | reason i ask is that the host networking is not generally the responsibility of openstack-ansible, so it would be unexpected for the networking to be restarted | 12:55 |
f0o | right now I'm detangling a new web of issues... nova-compute decided to not start up anymore with: Timed out waiting for a reply to message ID | 12:56 |
f0o | jrosser: nope, didn't touch any of the network-setup related settings | 12:56 |
jrosser | then i am not sure that "the upgrade procedure hard-restarts systemd-networkd" | 12:57 |
f0o | well the upgrade runs a playbook which then causes systemd-networkd to be restarted on all hosts... not sure how else to phrase it | 12:57 |
jrosser | if you can provide a log then it would be helpful, if that does happen then it would be a bug | 12:58 |
f0o | I ran setup-hosts many times in 2023.2 and never had that behavior but on 2024.1 it happened - some change somewhere caused that without me having touched the config/vars | 12:58 |
f0o | will gather logs once I recover nova | 12:58 |
f0o | nova's error reads like an AMQP issue but rabbit seems fine and healthy and other nova-computes can connect to it fine... not sure what's up | 12:59 |
f0o | so what does `oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID xyz` mean on nova-compute? I see it connect and authenticate correctly against the rabbitmq servers | 13:10 |
f0o | but then nova-compute hits that error and just exits | 13:10 |
f0o | not even gracefully; rabbitmq even reports 'client unexpectedly closed TCP connection' | 13:11 |
f0o | who is supposed to reply to that message - I suspect some other nova-related service is not running correctly despite healthchecks passing | 13:12 |
f0o | https://paste.opendev.org/show/btN4MAl2yypjZqQJkPmi/ | 13:12 |
f0o | my first guess was the nova-conductor but it shows as running and has relatively recent logs | 13:16 |
f0o | fully restarted all nova-* components, now compute starts so I guess it was conductor despite showing Up/Healthy everywhere | 13:30 |
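(Side note, a hedged sketch: the reply f0o is waiting for comes from nova-conductor, which answers nova-compute's RPC calls; if a conductor is merely slow rather than not consuming at all, the oslo.messaging reply timeout can be raised through an OSA config override, assuming the usual `*_conf_overrides` mechanism. It would not have helped here, where restarting the conductor was the actual fix.)

```yaml
# Hypothetical user_variables.yml override; rpc_response_timeout is a real
# oslo.messaging option (seconds to wait for an RPC reply), the 120 value
# is just an example.
nova_nova_conf_overrides:
  DEFAULT:
    rpc_response_timeout: 120
```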
f0o | all instances are broken now and it seems that all NFS mounts were removed as well so everything fails with Could not open '/var/lib/nova/mnt/513df3acbd666026449818dcdae343d0/volume-bcce8032-b45e-4cfe-948a-b5a6ae1b832f' ... | 13:31 |
f0o | I'm starting to regret that upgrade lol | 13:31 |
f0o | something is really wrong... I had to chmod 775 all mountpoints and 777 all volumes... This feels very very wrong | 13:41 |
f0o_ | all instances hard-rebooted and back after that scary chmod... not even sure why it is required now... did the UID/GID change? | 13:54 |
f0o_ | makes no sense... | 13:54 |
f0o_ | oh yeah the UID/GID did change from nova to libvirt-qemu:kvm | 14:10 |
f0o_ | holy crap no wonder all VMs crapped out | 14:10 |
f0o_ | so I gotta change my NFS exports to squash to those UID/GIDs | 14:11 |
jrosser | f0o_: here is what we do for nfs CI tests https://github.com/openstack/openstack-ansible/blob/master/tests/roles/bootstrap-host/tasks/prepare_nfs.yml#L81 | 15:20 |
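(A hedged sketch of the export change f0o describes, written as an Ansible task in the spirit of the CI file linked above; the share path and the anonuid/anongid numbers are placeholders, check `id libvirt-qemu` and `getent group kvm` on a compute host for the real IDs in this environment.)

```yaml
# Hypothetical task: squash all NFS access to the libvirt-qemu:kvm IDs that
# nova/libvirt now use on the upgraded hosts. Run `exportfs -ra` afterwards
# to apply the changed export options.
- name: Export the instance/volume share squashed to libvirt-qemu:kvm
  ansible.builtin.lineinfile:
    path: /etc/exports
    line: "/srv/nfs/nova *(rw,sync,no_subtree_check,all_squash,anonuid=64055,anongid=108)"
```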
f0o_ | how can I regenerate the /etc/hosts for the deployment host? | 15:45 |
jrosser | there is a tag openstack_hosts-file in the openstack_hosts role for that | 15:49 |
f0o_ | openvswitch was updated as part of this, and it turns out the latest version is incompatible with the previous one - so all networking stopped... now I am moving the deployment host into a temporary docker container on one of the hosts in the same rack, so I don't need to route to reach the targets, just switch | 15:50 |
f0o_ | Learning Experience | 15:50 |
f0o_ | ty @ tag | 15:50 |
jrosser | we do test upgrades for each merged patch | 15:52 |
f0o_ | the error in question is in /var/log/openvswitch/ovs-vswitchd.log with signature: Invalid Geneve tunnel metadata on bridge br-int while processing | 15:53 |
jrosser | which operating system is this? | 15:54 |
f0o_ | after openvswitch upgraded from 2.3.1 (2023.2 shipped this) to 3.3.0 (2024.1 ships this) | 15:54 |
f0o_ | ubuntu jammy | 15:54 |
f0o_ | so right now one host is on 3.3.0 and it can't talk to the rest due to that invalid geneve tunnel metadata, so it broke traffic flow for me | 15:55 |
f0o_ | correction the version from 2023.2 was 3.2.1 - I swapped major/minor in my earlier msg | 15:57 |
f0o_ | odd that a minor version bump caused the incompatibility | 15:57 |
f0o_ | but bfd_status shows the host as down from all the 3.2.1 variants, and from the PoV of the 3.3.0 variant the opposite is the case | 15:58 |
jrosser | that is almost certainly something for the neutron team | 16:01 |
f0o_ | doing the jack-hammer approach now of rerunning the setup-openstack playbook, hoping it will just smooth things out. I've got another behavioral difference from 2023.2 regarding cinder-volume: there is now a /var/lib/cinder/mnt which nova needs to access... but ofc the UID/GIDs mismatch | 16:03 |
f0o_ | I'm not sure where this one came from and if I can just bind-mount it into /var/lib/nova/mnt | 16:04 |
jrosser | that will be the cinder nfs_mount_point_base | 16:11 |
jrosser | https://docs.openstack.org/cinder/latest/configuration/block-storage/drivers/nfs-volume-driver.html | 16:11 |
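(A hedged sketch of where that directory comes from, assuming an OSA `cinder_backends` entry for the NFS driver; the backend name and shares file are illustrative, the option names follow the driver docs linked above. nfs_mount_point_base defaults to $state_path/mnt, i.e. /var/lib/cinder/mnt.)

```yaml
# Hypothetical user_variables.yml backend definition for cinder's NFS driver.
cinder_backends:
  nfs_volumes:
    volume_backend_name: nfs_volumes
    volume_driver: cinder.volume.drivers.nfs.NfsDriver
    nfs_shares_config: /etc/cinder/nfs_shares
    nfs_mount_point_base: /var/lib/cinder/mnt
```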
f0o_ | hrm that's not anything new, why is it causing issues now tho... | 16:13 |
jrosser | i cannot say tbh - are you trying to upgrade directly some production system or a staging environment? | 16:13 |
f0o_ | on the upside, the jack-hammer method solved the networking for now - all compute nodes seem to see each other again | 16:14 |
f0o_ | this is luckily staging; although <this> shell is hosted on it | 16:15 |
f0o_ | so if I drop out, you know I messed up | 16:15 |
jrosser | fwiw we have a full staging environment, and usually have 10-20 items we have to fix/patch in general bits of openstack, and some also in openstack-ansible, at each upgrade | 16:15 |
jrosser | i am not surprised at all about your rabbitmq issues - there are race conditions in oslo.messaging on service restarts, depending on how you have it configured | 16:17 |
f0o_ | so openvswitch 3.3.0 seems to be missing the route-table patch that was released March 22nd | 16:17 |
f0o_ | but this is luckily not on os-ansible to fix :D | 16:20 |
f0o_ | how can I override which .deb is being installed? I got a patched openvswitch-switch/common .deb with a higher version than 3.3.0 but it keeps being downgraded to 3.3.0 | 17:05 |
f0o_ | I checked the upper-constraints which I know is used for all the python stuff but no luck there | 17:12 |
f0o_ | apt-pin seems to be working | 17:18 |
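(A hedged sketch of such an apt pin, expressed as an Ansible task so it survives re-runs of the playbooks; the package list, version glob and priority are illustrative and depend on how the patched .deb is versioned.)

```yaml
# Hypothetical pin file: pin the patched build at a priority that wins over
# the distro/UCA 3.3.0 packages, so apt runs driven by the role stop
# replacing it.
- name: Pin locally patched openvswitch packages
  ansible.builtin.copy:
    dest: /etc/apt/preferences.d/openvswitch-local.pref
    mode: "0644"
    content: |
      Package: openvswitch-switch openvswitch-common
      Pin: version 3.3.0+patched*
      Pin-Priority: 1001
```

Depending on how the role invokes apt, `apt-mark hold` on those packages may be a simpler alternative.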
f0o_ | finally got through the whole upgrade but now horizon is crashlooping with `AttributeError: module 'django.middleware.csrf' has no attribute 'REASON_BAD_ORIGIN'` haha | 17:46 |
f0o_ | well I'm spent for the day, will debug that tomorrow | 17:46 |
f0o_ | ok that was an easy fix; it just didn't update django | 17:48 |
f0o_ | ok not fixed; lots of these `UserWarning: Policy ... scoped but the policy requires ['project'] scope. This behavior may change in the future where using the intended scope is required` in the logs, and I can't list volumes or instances in projects, but through admin it works | 17:53 |
f0o_ | tomorrow's issues | 17:53 |
f0o_ | couldn't let it rest; restarted the horizon lxc container and now, although the error is still reported, it displays fine. I guess that error was always there then | 18:11 |