Saturday, 2024-08-03

09:51 <f0o> Heya; so the upgrade procedure hard-restarts systemd-networkd, which is a really bad thing. I just had a full network meltdown because the ToR routers' networkd was restarted and bgpd got rekt. Can't a simple networkd reload be done instead of a hard restart?
09:52 <f0o> Speaking of this meltdown: is it safe to re-run the upgrade procedure? Obviously it has now failed on one host, since it nuked it from the face of the earth. I'd like to patch out the networkd-restart bit and give it another go
09:52 <f0o> or am I now FUBARed?
11:09 <f0o> oh, it ultimately failed and prompted me to re-run, so I guess all is well
11:10 <f0o> changed that 'restarted' into a 'reloaded' in ansible-role-systemd_networkd and am giving it another go
11:20 <f0o> the upgrade script is now past all the networkd stuff and the reload worked fine; it picked up the changes and didn't nuke the entire network stack (and, subsequently, bgpd)
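For reference, the workaround above amounts to a one-line change in the role's handler. A minimal sketch, assuming the role is checked out under /etc/ansible/roles and that its handler uses the systemd module with state: restarted (both the path and the wording are assumptions, verify before editing):

    # Locate the handler that hard-restarts networkd (path and wording assumed)
    grep -rn "restarted" /etc/ansible/roles/systemd_networkd/handlers/
    # Swap the hard restart for a reload so existing links (and BGP sessions) survive
    sed -i 's/state: restarted/state: reloaded/' /etc/ansible/roles/systemd_networkd/handlers/main.yml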
12:52 <jrosser> f0o: could you share the point in the code where the openstack-ansible upgrade restarted systemd-networkd?
12:54 <f0o> it was part of the setup-hosts playbook, I believe - it happened pretty early on, after gathering some facts in the first playbook it ran
12:55 <jrosser> have you defined openstack_hosts_systemd_networkd_devices or openstack_hosts_systemd_networkd_networks?
12:55 <jrosser> the reason I ask is that host networking is not generally the responsibility of openstack-ansible, so it would be unexpected for the networking to be restarted
12:56 <f0o> right now I'm untangling a new web of issues... nova-compute decided to not start up anymore, with: Timed out waiting for a reply to message ID
12:56 <f0o> jrosser: nope, didn't touch any of the network-setup-related settings
12:57 <jrosser> then I am not sure that "the upgrade procedure hard-restarts systemd-networkd"
12:57 <f0o> well, the upgrade runs a playbook which then causes systemd-networkd to be restarted on all hosts... not sure how else to phrase it
12:58 <jrosser> if you can provide a log, that would be helpful; if that does happen then it would be a bug
12:58 <f0o> I ran setup-hosts many times on 2023.2 and never had that behavior, but on 2024.1 it happened - some change somewhere caused it without me touching the config/vars
12:58 <f0o> will gather logs once I recover nova
12:59 <f0o> nova's error reads like an AMQP issue, but rabbit seems fine and healthy and other nova-computes can connect to it fine... not sure what's up
13:10 <f0o> so what does `oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID xyz` mean on nova-compute? I see it connect and authenticate correctly against the rabbitmq servers
13:10 <f0o> but then nova-compute hits that error and just exits
13:11 <f0o> not even gracefully; rabbitmq even reports "client unexpectedly closed TCP connection"
13:12 <f0o> who is supposed to reply to that message? I suspect some other nova-related service is not running correctly despite healthchecks passing
13:12 <f0o> https://paste.opendev.org/show/btN4MAl2yypjZqQJkPmi/
13:16 <f0o> my first guess was nova-conductor, but it shows as running and has relatively recent logs
13:30 <f0o> fully restarted all nova-* components and now compute starts, so I guess it was the conductor despite it showing Up/Healthy everywhere
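For context, "restarted all nova-* components" in an openstack-ansible deployment is roughly the following ad-hoc run from the deployment host; the group and service names follow the standard inventory layout and may differ per deployment:

    # Restart the conductor first, then the API and compute services (sketch only)
    ansible nova_conductor -m service -a "name=nova-conductor state=restarted"
    ansible nova_api_os_compute -m service -a "name=nova-api-os-compute state=restarted"
    ansible nova_compute -m service -a "name=nova-compute state=restarted"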
13:31 <f0o> all instances are broken now, and it seems that all NFS mounts were removed as well, so everything fails with Could not open '/var/lib/nova/mnt/513df3acbd666026449818dcdae343d0/volume-bcce8032-b45e-4cfe-948a-b5a6ae1b832f' ...
13:31 <f0o> I'm starting to regret that upgrade lol
13:41 <f0o> something is really wrong... I had to chmod 775 all mountpoints and 777 all volumes... this feels very, very wrong
13:54 <f0o_> all instances hard-rebooted and are back after that scary chmod... not even sure why it is required now... did the UID/GID change?
13:54 <f0o_> makes no sense...
14:10 <f0o_> oh yeah, the UID/GID did change, from nova to libvirt-qemu:kvm
14:10 <f0o_> holy crap, no wonder all VMs crapped out
14:11 <f0o_> so I gotta change my NFS exports to squash to those UIDs/GIDs
15:20 <jrosser> f0o_: here is what we do for NFS CI tests https://github.com/openstack/openstack-ansible/blob/master/tests/roles/bootstrap-host/tasks/prepare_nfs.yml#L81
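A rough example of squashing the export to the new service ids on the NFS server; the numeric uid/gid and the export path are placeholders, so check the real values on a compute node first:

    # On a compute node: find the uid/gid that qemu now runs as
    id libvirt-qemu
    # On the NFS server: /etc/exports entry (one line; ids and path are placeholders)
    #   /srv/nova  10.0.0.0/24(rw,sync,no_subtree_check,all_squash,anonuid=64055,anongid=108)
    exportfs -ra   # re-export after editing /etc/exports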
15:45 <f0o_> how can I regenerate the /etc/hosts for the deployment host?
15:49 <jrosser> there is a tag, openstack_hosts-file, in the openstack_hosts role for that
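That tag can be combined with the regular setup-hosts playbook so that only the hosts-file tasks run; a sketch (whether the deployment host itself is targeted depends on your inventory):

    cd /opt/openstack-ansible/playbooks
    openstack-ansible setup-hosts.yml --tags openstack_hosts-file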
15:50 <f0o_> since openvswitch was being updated, it turns out that the latest version is incompatible with the previous one - so all networking stopped... now I am moving the deployment host into a temporary docker container on one of the hosts in the same rack, so I don't need to route to reach the targets, just switch
15:50 <f0o_> learning experience
15:50 <f0o_> ty @ tag
15:52 <jrosser> we do test upgrades for each merged patch
15:53 <f0o_> the error in question is in /var/log/openvswitch/ovs-vswitchd.log with signature: Invalid Geneve tunnel metadata on bridge br-int while processing
15:54 <jrosser> which operating system is this?
15:54 <f0o_> after openvswitch upgraded from 2.3.1 (2023.2 shipped this) to 3.3.0 (2024.1 ships this)
15:54 <f0o_> ubuntu jammy
15:55 <f0o_> so right now one host is on 3.3.0 and it can't talk to the rest due to that invalid geneve tunnel metadata; so it broke traffic flow for me
15:57 <f0o_> correction: the version from 2023.2 was 3.2.1 - I swapped major/minor in my earlier msg
15:57 <f0o_> odd that a minor version caused the incompatibility
15:58 <f0o_> but bfd_status shows the host as down from all the 3.2.1 variants, and the opposite is the case from the PoV of the 3.3.0 variant
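For anyone following along, the BFD state on the tunnel ports can be inspected on a compute node with ovs-vsctl; the interface name below is a placeholder:

    ovs-vsctl show                                    # list bridges and tunnel ports
    ovs-vsctl get Interface <tunnel-port-name> bfd_status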
16:01 <jrosser> that is almost certainly something for the neutron team
16:03 <f0o_> doing the jack-hammer approach now of re-running the setup-openstack playbook in the hope it will just smooth things out. I've got another behavioral difference from 2023.2 regarding cinder-volume: there is now a /var/lib/cinder/mnt which nova needs to access... but of course the UIDs/GIDs mismatch
16:04 <f0o_> I'm not sure where this one came from and whether I can just bind-mount it into /var/lib/nova/mnt
16:11 <jrosser> that will be the cinder nfs_mount_point_base
16:11 <jrosser> https://docs.openstack.org/cinder/latest/configuration/block-storage/drivers/nfs-volume-driver.html
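The option lives in the NFS backend section of cinder.conf; an illustrative (not deployment-specific) stanza:

    # Example cinder.conf NFS backend (values are illustrative)
    #   [nfs_backend]
    #   volume_driver = cinder.volume.drivers.nfs.NfsDriver
    #   nfs_shares_config = /etc/cinder/nfs_shares
    #   nfs_mount_point_base = /var/lib/cinder/mnt
    grep -r nfs_mount_point_base /etc/cinder/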
16:13 <f0o_> hrm, that's not anything new; why is it causing issues now though...
16:13 <jrosser> I cannot say tbh - are you trying to upgrade a production system directly, or a staging environment?
16:14 <f0o_> on the upside, the jack-hammer method solved the networking for now - all compute nodes seem to see each other again
16:15 <f0o_> this is luckily staging, although <this> shell is hosted on it
16:15 <f0o_> so if I drop out, you know I messed up
16:15 <jrosser> fwiw we have a full staging environment, and we usually have 10-20 items to fix/patch in general bits of openstack, and some also in openstack-ansible, at each upgrade
16:17 <jrosser> I am not surprised at all about your rabbitmq issues - there are race conditions in oslo.messaging on service restarts, depending on how you have it configured
16:17 <f0o_> so openvswitch 3.3.0 seems to be missing the route-table patch that was released March 22nd
16:20 <f0o_> but luckily this is not os-ansible's to fix :D
17:05 <f0o_> how can I override which .deb gets installed? I've got patched openvswitch-switch/common debs with a higher version than 3.3.0, but they keep being downgraded to 3.3.0
17:12 <f0o_> I checked the upper-constraints, which I know are used for all the python stuff, but no luck there
17:18 <f0o_> apt pinning seems to be working
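The pin that keeps a locally patched build from being downgraded can look roughly like this; the version string is a placeholder for whatever version the patched packages actually carry:

    # /etc/apt/preferences.d/openvswitch-local on the affected hosts
    cat > /etc/apt/preferences.d/openvswitch-local <<'EOF'
    Package: openvswitch-switch openvswitch-common
    Pin: version 3.3.0-local1*
    Pin-Priority: 1001
    EOF
    apt-get update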
17:46 <f0o_> finally got through the whole upgrade, but now horizon is crash-looping with `AttributeError: module 'django.middleware.csrf' has no attribute 'REASON_BAD_ORIGIN'` haha
17:46 <f0o_> well, I'm spent for the day; will debug that tomorrow
17:48 <f0o_> ok, that was an easy fix; it just hadn't updated django
17:53 <f0o_> ok, not fixed; lots of these `UserWarning: Policy ped but the policy requires ['project'] scope. This behavior may change in the future where using the intended scope is required` in the logs, and I can't list volumes or instances in projects, but through admin it works
17:53 <f0o_> tomorrow's issues
18:11 <f0o_> it didn't leave me any rest; restarted the horizon lxc container and now, although the error is still reported, it displays fine. I guess that error was always there, then
