Monday, 2025-03-10

EugenMayer440180frickler well I did so, see my gist05:50
EugenMayer440180I checked the logs of the nodes running the network services (which is the controller in my case, only 1) and also on the compute that was the target of the VM. Specifically I checked '/var/log/kolla/neutron/neutron-ovn-metadata-agent.log' on both sides. There is no information and there are no errors in the log files there. On the controller there is also05:53
EugenMayer440180/var/log/kolla/neutron/neutron-server.log, which also does not include anything specific.05:53
EugenMayer440180There are also a couple of logs in /var/log/kolla/openvswitch - but none of those included any errors or details. Am I looking at the wrong places?05:55
EugenMayer440180The only logs where I could find details were nova/nova-conductor.log on the controller and nova/nova-compute.log on the target compute05:56
EugenMayer440180When I look at https://openmetal.io/docs/manuals/operators-manual/day-4/troubleshooting/log-filtering#kolla-ansible-log-locations - I do not have most of the log files https://gist.github.com/EugenMayer/1fbf55c3938a27a08b223a0bbdbfe2cb06:09
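(For reference, a minimal sketch of the log sweep described above, assuming the default kolla-ansible log layout; the grep patterns are illustrative, not exhaustive:)
    # on the controller: surface warnings/errors in the neutron server log
    grep -iE 'error|warning|failed' /var/log/kolla/neutron/neutron-server.log | tail -n 50
    # on the target compute: same for the OVN metadata agent log
    grep -iE 'error|failed' /var/log/kolla/neutron/neutron-ovn-metadata-agent.log | tail -n 50
    # and see which open vSwitch logs exist at all
    ls /var/log/kolla/openvswitch/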
EugenMayer440180Currently I suspect the issue might actually be a result of the 2024.1 or 2024.2 upgrade - no VMs have been spawned since then, so nobody would have noticed06:11
fricklerEugenMayer440180: a setup with just a single controller sounds weird, but probably not the source of the issue. the docs you cite are for OVS, but you are using OVN. I still believe that it is not possible that the port binding fails without a matching log entry in neutron-server.log06:27
EugenMayer440180frickler I grepped the neutron-server.log; there is not a single `status: 500`, nor are there any log entries when I deploy. Let me just redeploy right now and then tail the file for you06:38
EugenMayer440180frickler I just went through my globals and my inventory and checked the upgrade logs of 2024.1 and 2024.2 to understand if I missed anything significant. I have had only one controller since I started with openstack. I understand this is unusual since it is a SPOF, but I would also assume this is not the source of the issue06:40
EugenMayer440180wow ... you were right all the way. Tailing the neutron-server.log while deploying, I found: https://gist.github.com/EugenMayer/de98984752dd4fba1ad33b83e65a751f06:43
EugenMayer440180so:06:45
EugenMayer4401802025-03-10 07:42:21.495 791 WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-55ff9b58-b86f-4cdf-a2bd-948dbf2ad517 req-34b4d04b-2823-448a-bb18-2671befbecdd 76fa357aefd748a491d83e5dea9aa357 c7c3f8bf330a4dd4a973f592e4aa7a0a - - default default] Refusing to bind port 7f74062f-1d79-4c43-9f03-229b950e0b16 due to no OVN chassis for host: compute306:45
EugenMayer4401802025-03-10 07:42:21.496 791 ERROR neutron.plugins.ml2.managers [req-55ff9b58-b86f-4cdf-a2bd-948dbf2ad517 req-34b4d04b-2823-448a-bb18-2671befbecdd 76fa357aefd748a491d83e5dea9aa357 c7c3f8bf330a4dd4a973f592e4aa7a0a - - default default] Failed to bind port 7f74062f-1d79-4c43-9f03-229b950e0b16 on host compute3 for vnic_type normal using segments [{'id': '1a9470ab-63df-453b-92f4-c0686db01560', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 1871, 'network_id': '415ba715-dc0b-4a5e-beb9-43f71b0666a2'}]06:45
EugenMayer440180assuming 'due to no OVN chassis for host' is the most relevant part06:45
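(A minimal way to surface these binding failures without tailing during a deploy; a sketch assuming the same kolla log path, with patterns taken from the log lines above:)
    grep -E 'Refusing to bind port|Failed to bind port|no OVN chassis' \
        /var/log/kolla/neutron/neutron-server.log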
EugenMayer440180sounds similar to https://www.reddit.com/r/openstack/comments/18j3r6l/openstack_with_ovn_integration_error_when/06:48
EugenMayer440180assuming this might be interesting in addition to the error logs above, frickler - https://gist.github.com/EugenMayer/15e6438d7276a3a290c92a2bac05983206:54
EugenMayer440180Any help would be hugely appreciated06:54
fricklerEugenMayer440180: does the issue only happen for compute3 or also for other hosts? I'm no OVN expert, but it seems weird to me to have the short chassis names and then the fqdn for hostnames, in my deployment both use the short name. but that might also be totally irrelevant07:10
EugenMayer440180frickler it happens for all computes. I deployed around 20 times now and it was scheduled to each of my computes (4), so this is not compute-specific07:11
EugenMayer440180found https://www.reddit.com/r/openstack/comments/18j3r6l/comment/kdookvr/ - do I understand this right: the computes might have issues communicating over the cluster network to "register"?07:28
EugenMayer440180not sure what the "network agents" are in this scenario - I assume 'neutron_ovn_metadata_agent'?07:30
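(With ML2/OVN the "network agents" neutron reports are derived from the OVN chassis entries rather than from standalone agent processes; a sketch of how to inspect them, assuming admin credentials are sourced:)
    openstack network agent list
    # with OVN this typically lists an "OVN Controller agent" and an
    # "OVN Metadata agent" per compute; the Host column is what neutron
    # matches against the requested binding host during port binding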
EugenMayer440180frickler over at #openstack-neutron they just had a similar hunch about the FQDN. Nova seems to request compute3, while it should use the FQDN: https://gist.github.com/EugenMayer/8293975fdbe9a21295f0e785d435ce4e07:48
EugenMayer440180frickler do you use FQDNs in your inventory "multinode" config?07:48
EugenMayer440180frickler are you using FQDN names in your inventory?08:21
fricklerEugenMayer440180: no, I don't. it is quite possible that kolla is having some issue in that regard, I remember something similar between nova and libvirt08:59
EugenMayer440180ralonsoh at #openstack-neutron suggests that nova uses the wrong name; it should use the FQDN but it uses the short hostname09:00
EugenMayer440180frickler could you do me a favor and run this on one of your computes: ovs-vsctl list open . | grep external_ids09:00
EugenMayer440180the interesting part is 'hostname' vs 'system-id'09:00
EugenMayer440180and also 'ovn-sbctl show' - how does your chassis name compare to the system-id09:01
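(A sketch of the comparison being asked for here; run the ovs-vsctl commands on a compute, and ovn-sbctl wherever it can reach the southbound DB - in a kolla deployment typically inside one of the ovn_* containers:)
    ovs-vsctl get open . external_ids:hostname
    ovs-vsctl get open . external_ids:system-id
    ovn-sbctl show
    # each chassis name in 'ovn-sbctl show' should match a system-id, and
    # its "hostname:" line is what neutron compares to nova's binding host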
fricklerI can check later, but pretty sure it is the short hostname everywhere since there is no other long name, even hostname -f just gives the same result09:12
EugenMayer440180frickler so for you, hostname and hostname -f are the FQDN?09:33
opendevreviewMatt Crees proposed openstack/kayobe master: Drop kolla-tags and kolla-limit  https://review.opendev.org/c/openstack/kayobe/+/93566909:37
fricklerEugenMayer440180: nope, no fqdn, just like "compute1" everywhere10:03
EugenMayer440180interesting. Is this the norm?10:03
fricklerwell it is the default in the tooling I have, I don't claim that it is a good default, but it likely does avoid issues like you are seeing10:07
EugenMayer440180it is new to me that having a broken hostname / FQDN setup 'fixes things' :)10:53
EugenMayer440180frickler we found out that the 'Host' column of 'openstack --insecure network agent list' is the relevant factor here11:09
EugenMayer440180if Host is 'compute2' and not 'compute2.cluster.kontextwork.net', neutron can bind. We just added a new compute, and for that we used the typical kolla-ansible bootstrap workflow. For whatever reason, compute2 now has 'compute2' under Host while everything else uses the FQDN - we can now schedule payloads on compute2, but anything else does not work, including11:11
EugenMayer440180the new compute511:11
EugenMayer440180the question is: what defines the 'Host' value the network agents register with?11:12
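(A sketch of the cross-check that exposes this mismatch, assuming admin credentials; the column selections are illustrative:)
    # the Host values neutron knows the agents by
    openstack network agent list -c "Agent Type" -c Host -c Alive
    # the Host values nova will send as the binding host
    openstack compute service list --service nova-compute -c Host
    # the two must match exactly: 'compute2' vs
    # 'compute2.cluster.kontextwork.net' fails to bind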
opendevreviewIvan Vnučko proposed openstack/kolla-ansible master: Add backend TLS encryption between RabbitMQ management and HAProxy  https://review.opendev.org/c/openstack/kolla-ansible/+/91908611:12
EugenMayer440180frickler in fact, while bootstrapping compute5, we accidentally changed the FQDN resolution on compute2. Changed means hostname === hostname -f === compute2 ... and this fixes neutron11:19
EugenMayer440180so in fact this is a bug that has been introduced in kolla when spawning the neutron agent: it registers with the FQDN while the short hostname is needed, since nova will use the hostname, not the FQDN, to instruct neutron to bind the port11:20
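(The per-node consistency check this finding implies; a sketch, assuming shell access on each compute:)
    # all three should agree on every compute; otherwise OVN registers a
    # chassis hostname that nova's binding request will not match
    hostname
    hostname -f
    ovs-vsctl get open . external_ids:hostname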
fricklerEugenMayer440180: maybe it is a bug to have hostname != hostname -f, likely kolla should detect this early and fail12:21
EugenMayer440180no, to be honest, expecting hostname == hostname -f is not something that should ever be required.12:40
EugenMayer440180This is not how an FQDN works. It is simply wrong to use hostname -f for the 'Host' column of the neutron agent, so the way the configuration has been spawned is wrong12:40
EugenMayer440180and it has been broken with the kolla releases deploying 2024.1 or 2024.2 - which have broken multiple things anyway (like official certificates for the cluster).12:41
opendevreviewMatúš Jenča proposed openstack/kolla-ansible master: Add certificates for RabbitMQ internode TLS  https://review.opendev.org/c/openstack/kolla-ansible/+/92138014:06
opendevreviewMatúš Jenča proposed openstack/kolla-ansible master: Add certificates for RabbitMQ internode TLS  https://review.opendev.org/c/openstack/kolla-ansible/+/92138014:08
opendevreviewMatúš Jenča proposed openstack/kolla-ansible master: Add support for RabbitMQ internode tls  https://review.opendev.org/c/openstack/kolla-ansible/+/92138114:08
tafkamaxHeh. I also checked mine and the Ubuntu 24.04 hosts have hostname and hostname -f both showing the short name, without the FQDN.14:12
tafkamaxOn another Debian 12 machine it does show the full FQDN14:12
tafkamaxI think if you look in your /etc/hosts file you can see that the entries are generated there.14:13
tafkamax# BEGIN ANSIBLE GENERATED HOSTS14:13
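(An illustrative /etc/hosts block of the kind referred to here; the exact entries kolla-ansible generates depend on the inventory, so the IPs and ordering below are assumptions:)
    # BEGIN ANSIBLE GENERATED HOSTS
    10.0.0.12 compute2.cluster.kontextwork.net compute2
    10.0.0.15 compute5.cluster.kontextwork.net compute5
    # END ANSIBLE GENERATED HOSTS
    # with the FQDN listed first, 'hostname -f' returns the FQDN; with only
    # the short name present, it returns the short name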
tafkamaxThe Debian 12 machine is not yet part of our cluster, but we will probably try to add it at some point.14:14
opendevreviewTakashi Kajinami proposed openstack/kolla-ansible master: DNM: Testing ...  https://review.opendev.org/c/openstack/kolla-ansible/+/94395015:07
opendevreviewMichal Arbet proposed openstack/kolla-ansible master: Fix Redis Sentinel authentication for octavia's jobboard HA  https://review.opendev.org/c/openstack/kolla-ansible/+/94279917:13
EugenMayer440180tafkamax I'm aware of how the FQDN works and how it is configured via the /etc/hosts file - this is a Linux default, most probably even POSIX.17:37
EugenMayer440180tafkamax my FQDN was not misconfigured, that's the point17:37
