EugenMayer440180 | frickler well i did so, see my gist | 05:50 |
---|---|---|
EugenMayer440180 | I checked the logs of the nodes running the network services (which is the controller in my case, only 1) and also on the compute that was the target of the VM. Specifically I checked '/var/log/kolla/neutron/neutron-ovn-metadata-agent.log' on both sides. There is no information and there are no errors in the log files there. On the controller there is also | 05:53 |
EugenMayer440180 | /var/log/kolla/neutron/neutron-server.log, which also does not include anything specific. | 05:53 |
EugenMayer440180 | There are also a couple of logs in /var/log/kolla/openvswitch - but none of those included any errors or details. Am I looking in the wrong places? | 05:55 |
EugenMayer440180 | The only logs where I could find details were nova/nova-conductor.log on the controller and nova/nova-compute.log on the target compute | 05:56 |
EugenMayer440180 | When i look at https://openmetal.io/docs/manuals/operators-manual/day-4/troubleshooting/log-filtering#kolla-ansible-log-locations - i do not have most of the log files https://gist.github.com/EugenMayer/1fbf55c3938a27a08b223a0bbdbfe2cb | 06:09 |
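A quick way to sweep the kolla log locations mentioned above for binding-related errors is a recursive grep; a minimal sketch, assuming the default kolla-ansible paths cited in the conversation:

```sh
# Sweep the neutron and openvswitch logs on controller and compute for the
# usual port-binding failure phrases. Paths are kolla-ansible defaults;
# adjust if your deployment logs elsewhere.
sudo grep -riE 'error|failed to bind|no ovn chassis' \
    /var/log/kolla/neutron/ /var/log/kolla/openvswitch/ 2>/dev/null | tail -n 50
```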
EugenMayer440180 | Currently I suspect that the issue might actually be a result of a 2024.1 or 2024.2 upgrade - no VMs have been spawned since then, so nobody would have noticed | 06:11 |
frickler | EugenMayer440180: a setup with just a single controller sounds weird, but probably not the source of the issue. the docs you cite are for OVS, but you are using OVN. I still believe that it is not possible that the port binding fails without a matching log entry in neutron-server.log | 06:27 |
EugenMayer440180 | frickler I grepped the neutron-server.log, there is not a single `status: 500`, nor are there any log entries when I deploy. Let me just redeploy right now and then tail the file for you | 06:38 |
EugenMayer440180 | frickler I just went through my globals and my inventory and checked the upgrade logs of 2024.1 and 2024.2 to understand if I missed anything significant. I have had only one controller since I started with OpenStack. I understand this is unusual since it is a SPOF, but I would also assume this is not the source of the issue | 06:40 |
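A minimal sketch of the tail-and-redeploy approach described here, assuming the default kolla log path; the grep terms come from the errors quoted below:

```sh
# On the controller: follow neutron-server.log while spawning a test VM and
# filter for port-binding problems.
sudo tail -f /var/log/kolla/neutron/neutron-server.log \
    | grep -iE 'refusing to bind|failed to bind|no ovn chassis'
```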
EugenMayer440180 | wow ... you were right all the way. Tailing the neutron-server.log while deploying, I found: https://gist.github.com/EugenMayer/de98984752dd4fba1ad33b83e65a751f | 06:43 |
EugenMayer440180 | so : | 06:45 |
EugenMayer440180 | 2025-03-10 07:42:21.495 791 WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-55ff9b58-b86f-4cdf-a2bd-948dbf2ad517 req-34b4d04b-2823-448a-bb18-2671befbecdd 76fa357aefd748a491d83e5dea9aa357 c7c3f8bf330a4dd4a973f592e4aa7a0a - - default default] Refusing to bind port 7f74062f-1d79-4c43-9f03-229b950e0b16 due to no OVN chassis for | 06:45 |
EugenMayer440180 | host: compute3 | 06:45 |
EugenMayer440180 | 2025-03-10 07:42:21.496 791 ERROR neutron.plugins.ml2.managers [req-55ff9b58-b86f-4cdf-a2bd-948dbf2ad517 req-34b4d04b-2823-448a-bb18-2671befbecdd 76fa357aefd748a491d83e5dea9aa357 | 06:45 |
EugenMayer440180 | c7c3f8bf330a4dd4a973f592e4aa7a0a - - default default] Failed to bind port 7f74062f-1d79-4c43-9f03-229b950e0b16 on host compute3 for vnic_type normal using segments [{'id': '1a9470ab-63df-453b-92f4-c0686db01560', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 1871, 'network_id': '415ba715-dc0b-4a5e-beb9-43f71b0666a2'}] | 06:45 |
EugenMayer440180 | assuming 'due to no OVN chassis for host' is the most relevant part | 06:45 |
EugenMayer440180 | sounds similar to https://www.reddit.com/r/openstack/comments/18j3r6l/openstack_with_ovn_integration_error_when/ | 06:48 |
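One way to confirm the "no OVN chassis for host" message is to list the chassis the southbound database actually knows about; a sketch, where the container name ovn_sb_db and the tcp connection on port 6642 are assumptions based on common kolla-ansible defaults:

```sh
# List registered chassis; every compute should show up with a "hostname:" line.
# A compute missing here (or registered under a different name) matches the
# "no OVN chassis for host: compute3" error above.
docker exec ovn_sb_db ovn-sbctl --db=tcp:<controller-ip>:6642 show
```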
EugenMayer440180 | this might be interesting in addition to the error logs above, frickler - https://gist.github.com/EugenMayer/15e6438d7276a3a290c92a2bac059832 | 06:54 |
EugenMayer440180 | Any help would be hugely appreciated | 06:54 |
frickler | EugenMayer440180: does the issue only happen for compute3 or also for other hosts? I'm no OVN expert, but it seems weird to me to have the short chassis names and then the fqdn for hostnames, in my deployment both use the short name. but that might also be totally irrelevant | 07:10 |
EugenMayer440180 | frickler it happens for all computes. I have deployed around 20 times now and it was scheduled to each of my computes (4), so this is not compute-specific | 07:11 |
EugenMayer440180 | found https://www.reddit.com/r/openstack/comments/18j3r6l/comment/kdookvr/ - do I understand this correctly: the computes might have issues communicating over the cluster network to "register"? | 07:28 |
EugenMayer440180 | not sure what the "network agents" are in this scenario - I assume 'neutron_ovn_metadata_agent'? | 07:30 |
EugenMayer440180 | frickler over at #openstack-neutron they just had a similar hunch about the FQDN. Nova seems to request compute3, while it should use the FQDN: https://gist.github.com/EugenMayer/8293975fdbe9a21295f0e785d435ce4e | 07:48 |
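For reference, with ML2/OVN the "network agents" neutron reports are derived from the OVN chassis rather than from separate agent processes; a sketch to see them and their Host values (the column discussed later in this log):

```sh
# Typically shows an "OVN Controller agent" and an "OVN Metadata agent" per
# compute; the Host column has to match the host name nova uses for the node.
openstack network agent list
openstack hypervisor list   # compare against the hypervisor hostnames nova reports
```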
EugenMayer440180 | frickler do you use FQDNs in your inventory "multinode" config? | 07:48 |
EugenMayer440180 | frickler are you using FQDN names in your inventory? | 08:21 |
frickler | EugenMayer440180: no, I don't. it is quite possible that kolla is having some issue in that regard, I remember something similar between nova and libvirt | 08:59 |
EugenMayer440180 | ralonsoh at #openstack-neutron suggests that nova uses the wrong name; it should use the FQDN but it uses the short host name | 09:00 |
EugenMayer440180 | frickler could you do me a favor and run this on one of your computes: ovs-vsctl list open . | grep external_ids | 09:00 |
EugenMayer440180 | the interesting part is 'hostname' vs 'system-id' | 09:00 |
EugenMayer440180 | and also 'ovn-sbctl show' - how does your chassis name compare to the system-id | 09:01 |
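A sketch of the comparison being asked for here, run on a compute node; the example output is illustrative, not taken from the deployment under discussion:

```sh
# external_ids holds what ovn-controller registers as this node's chassis.
ovs-vsctl list open . | grep external_ids
#   external_ids : {hostname=compute3.cluster.example.net, ovn-encap-ip="...", system-id="<uuid>", ...}

# The same values individually:
ovs-vsctl get open_vswitch . external_ids:hostname
ovs-vsctl get open_vswitch . external_ids:system-id
```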
frickler | I can check later, but pretty sure it is the short hostname everywhere since there is no other long name, even hostname -f just gives the same result | 09:12 |
EugenMayer440180 | frickler so for you, hostname and hostname -f both return the FQDN? | 09:33 |
opendevreview | Matt Crees proposed openstack/kayobe master: Drop kolla-tags and kolla-limit https://review.opendev.org/c/openstack/kayobe/+/935669 | 09:37 |
frickler | EugenMayer440180: nope, no fqdn, just like "compute1" everywhere | 10:03 |
EugenMayer440180 | interesting. Is this the norm? | 10:03 |
frickler | well it is the default in the tooling I have, I don't claim that it is a good default, but it likely does avoid issues like you are seeing | 10:07 |
EugenMayer440180 | it is new to me that having a broken hostname / FQDN setup 'fixes things' :) | 10:53 |
EugenMayer440180 | frickler we found out that the 'Host' column in the 'openstack --insecure network agent list' output is the relevant factor here | 11:09 |
EugenMayer440180 | if Host is 'compute2' and not 'compute2.cluster.kontextwork.net', neutron can bind. We just added a new compute and for that we used the typical kolla-ansible bootstrap workflow. For some reason compute2 now has 'compute2' under Host while everything else uses the FQDN - we can now schedule payloads on compute2, but nothing else works, including the new compute5 | 11:11 |
EugenMayer440180 | the question is, what defines the 'Host' value the network agents register with | 11:12 |
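A hedged sketch of where that Host value comes from: with ML2/OVN it is the chassis hostname that ovn-controller publishes, taken from external_ids:hostname in the Open_vSwitch table (or the node's own hostname if that key is unset). Inspecting, and as a careful experiment only overriding, it on a compute:

```sh
# What ovn-controller will report as the chassis hostname for this node.
ovs-vsctl get open_vswitch . external_ids:hostname

# Assumption: changing this should make the chassis (and the neutron agent's
# Host) re-register under the new name - try it on a non-production node first.
# ovs-vsctl set open_vswitch . external_ids:hostname=compute2
```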
opendevreview | Ivan Vnučko proposed openstack/kolla-ansible master: Add backend TLS encryption between RabbitMQ management and HAProxy https://review.opendev.org/c/openstack/kolla-ansible/+/919086 | 11:12 |
EugenMayer440180 | frickler in fact, while adding compute5 we accidentally changed the FQDN resolution on compute2. Changed means hostname === hostname -f === compute2 ... and this fixes neutron | 11:19 |
EugenMayer440180 | so in fact this is a bug that was introduced in kolla when spawning the neutron agent: it registers with the FQDN while the hostname is needed, since nova will use the hostname, not the FQDN, to instruct neutron to bind the bridge | 11:20 |
frickler | EugenMayer440180: maybe it is a bug to have hostname != hostname -f, likely kolla should detect this early and fail | 12:21 |
EugenMayer440180 | no, to be honest, expecting hostname == hostname -f is not something that should ever be required. | 12:40 |
EugenMayer440180 | That is not how FQDNs work. It is simply wrong to use hostname -f for the 'Host' column of the neutron agent, so the way the configuration was generated is wrong | 12:40 |
EugenMayer440180 | and it got broken with the kolla releases deploying 2024.1 or 2024.2 - which broke multiple things anyway (like official certificates for the cluster). | 12:41 |
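To make the mismatch visible from the API side, a sketch comparing the name nova uses for the compute with the name neutron/OVN has registered; for binding to work the two have to line up for every node:

```sh
# Host column as nova sees the compute (this is the binding host nova sends):
openstack compute service list --service nova-compute

# Host column as neutron/OVN sees it (derived from the chassis hostname):
openstack network agent list
```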
opendevreview | Matúš Jenča proposed openstack/kolla-ansible master: Add certificates for RabbitMQ internode TLS https://review.opendev.org/c/openstack/kolla-ansible/+/921380 | 14:06 |
opendevreview | Matúš Jenča proposed openstack/kolla-ansible master: Add certificates for RabbitMQ internode TLS https://review.opendev.org/c/openstack/kolla-ansible/+/921380 | 14:08 |
opendevreview | Matúš Jenča proposed openstack/kolla-ansible master: Add support for RabbitMQ internode tls https://review.opendev.org/c/openstack/kolla-ansible/+/921381 | 14:08 |
tafkamax | Heh. I also checked mine and the Ubuntu 24.04 hosts have hostname and hostname -f both returning the short name without the FQDN. | 14:12 |
tafkamax | On another Debian 12 machine it does show the full FQDN | 14:12 |
tafkamax | I think if you look in your /etc/hosts file you can see that the entries are generated there. | 14:13 |
tafkamax | # BEGIN ANSIBLE GENERATED HOSTS | 14:13 |
tafkamax | The Debian 12 machine is not yet part of our cluster, but we will probably try to add it at some point. | 14:14 |
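Since the generated /etc/hosts block came up: hostname -f resolves through /etc/hosts (with the usual nsswitch order), so the ordering of names on the node's line decides whether the FQDN or the short name comes back. A sketch with illustrative entries:

```sh
# /etc/hosts (illustrative - names and address are not from the real cluster)
# # BEGIN ANSIBLE GENERATED HOSTS
# 10.0.0.12 compute2.cluster.example.net compute2
# # END ANSIBLE GENERATED HOSTS

hostname     # short node name
hostname -f  # canonical name from the resolver - here the FQDN, because it is listed first
```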
opendevreview | Takashi Kajinami proposed openstack/kolla-ansible master: DNM: Testing ... https://review.opendev.org/c/openstack/kolla-ansible/+/943950 | 15:07 |
opendevreview | Michal Arbet proposed openstack/kolla-ansible master: Fix Redis Sentinel authentication for octavia's jobboard HA https://review.opendev.org/c/openstack/kolla-ansible/+/942799 | 17:13 |
EugenMayer440180 | tafkamax I'm aware of how FQDNs work and how they are configured via the /etc/hosts file - this is a Linux default, most probably even POSIX. | 17:37 |
EugenMayer440180 | tafkamax my FQDN was not misconfigured, that's the point | 17:37 |