*** zbr is now known as Guest164 | 05:03 | |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_neutron master: Exclude neutron from venv constraints https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/798960 | 05:47 |
*** rpittau|afk is now known as rpittau | 06:59 | |
*** sshnaidm_ is now known as sshnaidm | 08:25 | |
mindthecap | hi! Asking this here as well (copying from the openstack channel): I'm getting an error with a clean install using OSA Victoria. I can't create an instance - volume attachment fails with the cinder error "Invalid input received: Connector doesn't have required information: initiator". The error persists when I try to attach an already-created volume to the instance. | 11:48 |
mindthecap | I'm out of ideas about what/where to check. I'm using iSCSI (LVM). | 11:48 |
jrosser | mindthecap: i'm guessing you've got something like this in your config https://github.com/openstack/openstack-ansible/blob/master/etc/openstack_deploy/openstack_user_config.yml.example#L631-L642 | 12:23 |
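For reference, the linked openstack_user_config.yml example defines an LVM/iSCSI cinder backend under `storage_hosts`; a minimal sketch (the host name and IP below are placeholders, not values from this deployment) looks roughly like:

```yaml
storage_hosts:
  lvm-storage1:            # placeholder host name
    ip: 172.29.236.16      # placeholder management IP
    container_vars:
      cinder_backends:
        limit_container_types: cinder_volume
        lvm:
          volume_backend_name: LVM_iSCSI
          volume_driver: cinder.volume.drivers.lvm.LVMVolumeDriver
          volume_group: cinder-volumes
          iscsi_ip_address: "{{ cinder_storage_address }}"
```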
fridtjof[m] | mindthecap: try (re)starting iscsi related services on the compute node. I don't remember which one did the trick right now | 12:33 |
mindthecap | thanks! for some reason iscsid was disabled and stopped on compute hosts. Started them and it works. | 12:42 |
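On systemd hosts, the state mindthecap describes can be checked and fixed like this (a sketch; the unit is typically iscsid on CentOS, while some Ubuntu releases manage it via open-iscsi):

```shell
# Check whether the iSCSI initiator daemon is running and enabled
systemctl status iscsid

# Start it now and enable it so it survives reboots
systemctl enable --now iscsid
```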
mindthecap | It's weird that the service was stopped and not started even though I have replayed the OSA deployment scripts many times. | 12:43 |
*** rpittau is now known as rpittau|afk | 12:45 | |
jrosser | mindthecap: which OS are you using? | 13:03 |
spatel | any idea about this error when building a new VM? - {"message": "Build of instance 4e65ec9b-1b47-4972-98f9-2430d67eece5 aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "created": "2021-07-08T13:08:01Z"} | 13:12 |
spatel | I am not seeing any obvious errors in the neutron logs | 13:12 |
spatel | still looking for more evidence | 13:12 |
noonedeadpunk | well, I see issues for neutron OVN jobs for master | 13:13 |
noonedeadpunk | or you're not talking about OVN? | 13:13 |
spatel | no no | 13:15 |
spatel | i have real production issue in my old openstack | 13:15 |
spatel | what is the issue related OVN? | 13:15 |
noonedeadpunk | I was seeing that when the dhcp agent was stuck for some reason in OVS | 13:15 |
spatel | hmm! in CI job? | 13:16 |
spatel | send me link i will try to debug and see | 13:16 |
noonedeadpunk | no, in our prod :) so when the neutron dhcp agent was acting weird, we were not able to create VMs, with the same error and nothing in any logs | 13:16 |
noonedeadpunk | regarding OVN - https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/798960 | 13:17 |
spatel | hmm let me check DHCP logs and see | 13:17 |
noonedeadpunk | there's an issue with calico, but I think I will just bump it back to the old version for now... | 13:17 |
spatel | ok | 13:17 |
spatel | noonedeadpunk my /var/log/neutron/neutron-dhcp-agent.log is looking very clean | 13:19 |
spatel | noticed in this neutron log - 2021-07-08 09:19:42.699 26878 ERROR oslo.messaging._drivers.impl_rabbit [-] [6cd375b0-8f3a-4a70-b2fa-91d52197b74b] AMQP server on 172.28.15.248:5671 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer | 13:20 |
spatel | restarting and see | 13:20 |
spatel | noonedeadpunk do you think this is something serious - http://paste.openstack.org/show/807274/ | 14:07 |
noonedeadpunk | um, well, it could mean that either something is wrong with rabbit (but you would see that in other services too) or dhcp is just stuck for some reason | 14:12 |
noonedeadpunk | and not replying to messages | 14:12 |
spatel | the rabbitMQ cluster is looking healthy so I'm not sure what the issue is, but I can restart the rabbitMQ cluster | 14:13 |
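A quick way to sanity-check the cluster from a rabbit node is with rabbitmqctl (real subcommands; output format varies by version):

```shell
# Show cluster members and which of them are currently running
rabbitmqctl cluster_status

# Queues with unacked messages piling up can indicate stuck consumers
rabbitmqctl list_queues name messages_unacknowledged
```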
noonedeadpunk | and restart of dhcp agents didn't help? | 14:14 |
spatel | is just restarting the rabbitMQ service enough? | 14:14 |
spatel | no help with dhcp agent restart | 14:14 |
noonedeadpunk | hm... | 14:14 |
jrosser | you could try starting it with debug logging | 14:14 |
spatel | neutron agent? | 14:15 |
spatel | i meant dhcp | 14:15 |
jrosser | yeah, you'd get some idea if it was just sitting doing nothing, or somehow spinning and failing | 14:15 |
spatel | let me try that also... | 14:16 |
noonedeadpunk | also - is it regarding only one network or all networks are failing? | 14:17 |
spatel | I am able to delete VMs but not able to create them; is that also related to the rabbitMQ issue? I know that if rabbitMQ is not working then you can't delete VMs | 14:17 |
noonedeadpunk | what we also did was add another dhcp agent to the network | 14:17 |
noonedeadpunk | as it might actually be an issue with the namespace | 14:17 |
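The workaround noonedeadpunk describes can be done with the openstack CLI (agent and network IDs below are placeholders):

```shell
# See which DHCP agents exist and which ones host the network
openstack network agent list --agent-type dhcp
openstack network agent list --agent-type dhcp --network <network-id>

# Schedule an additional DHCP agent onto the network
openstack network agent add network --dhcp <agent-id> <network-id>
```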
spatel | It happens on any network.. when I create a VM it gets stuck in BUILD and then throws the error - aborted: Failed to allocate the network(s), not rescheduling | 14:18 |
noonedeadpunk | well in our case it sometimes was dependent on the network where the port for the VM resides | 14:19 |
noonedeadpunk | s/sometimes/most times | 14:19 |
spatel | hmm | 14:23 |
spatel | jrosser I have enabled debug in /etc/neutron/dhcp_agent.ini, is that the correct place? | 14:23 |
spatel | after restarting the agent I'm not seeing any useful info except this error - http://paste.openstack.org/show/807274/ | 14:24 |
jrosser | i think it's normally somewhere right at the top of /etc/neutron/neutron.conf | 14:24 |
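The debug flag lives in the [DEFAULT] section of neutron.conf, which the DHCP agent reads in addition to dhcp_agent.ini; a sketch:

```ini
# /etc/neutron/neutron.conf
[DEFAULT]
debug = True
```

The agent (here, neutron-dhcp-agent) has to be restarted for the change to take effect.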
spatel | let me try that | 14:24 |
spatel | does DHCP agent talk to RabbitMQ? | 14:25 |
jrosser | this looks similar https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1774764 | 14:26 |
spatel | hmm, but it doesn't mention any solution | 14:27 |
jrosser | but kind of unhelpful other than it does reference some patches like https://review.opendev.org/c/openstack/neutron/+/659274 and https://review.opendev.org/c/openstack/neutron/+/694561 | 14:27 |
jrosser | but i'm just totally guessing | 14:28 |
spatel | when it says timeout, does that mean it's failing to talk to the neutron server or to rabbitMQ? | 14:29 |
spatel | jrosser after restarting all nova-* services it looks like I am able to spin up VMs | 14:36 |
spatel | doesn't make any sense | 14:36 |
jrosser | nothing useful in the nova log? | 14:37 |
jrosser | anyway remember to put any debug back to False :) | 14:37 |
spatel | i noticed some errors like failed to talk to neutron | 14:37 |
jrosser | that is probably related | 14:37 |
jrosser | as booting the VM / creating the port are very much coupled | 14:38 |
spatel | I strongly believe my neutron-server is under pressure | 14:38 |
spatel | I have 800 vms running on this cloud.. | 14:39 |
spatel | do more VMs put pressure on neutron or on the compute hosts? | 14:39 |
spatel | on this cloud i have 260 compute hosts and 800 VMs.. | 14:41 |
noonedeadpunk | it would put pressure on rabbit first of all | 15:02 |
noonedeadpunk | so maybe it's timing out for a reason | 15:02 |
*** frickler is now known as frickler_pto | 15:03 | |
spatel | hmm | 15:04 |
spatel | how do you guys scale rabbitMQ? | 15:05 |
spatel | Are you guys using some kind of different HA queue policy in rabbitMQ? like not syncing the ABC queue, etc? | 15:06 |
spatel | I have noticed rabbit does a bad job when you sync all queues | 15:06 |
noonedeadpunk | it really does | 15:06 |
noonedeadpunk | ha queue is really a penalty on performance | 15:07 |
spatel | so what do you suggest? I am running everything with the defaults that come out of OSA | 15:07 |
spatel | never thought of playing with ha queue | 15:07 |
noonedeadpunk | but otherwise you might get issues when restarting rabbit | 15:08 |
noonedeadpunk | I'm not sure though, maybe this can be worked around with maintenance mode, but I never had time to play a lot with it | 15:08 |
noonedeadpunk | also it's available only from rabbit 3.8 or something like that | 15:09 |
noonedeadpunk | eventually, I know that at some scale people even start using dedicated nodes for rabbit to gain some more performance | 15:10 |
noonedeadpunk | and have like 5-7 nodes for rabbit... | 15:10 |
spatel | agreed on dedicated nodes, but still, even if you add more nodes, that adds more load to the HA syncing job | 15:11 |
noonedeadpunk | iirc there were also some new features in rabbit that allowed syncing queues between specific nodes only, but not sure here | 15:11 |
noonedeadpunk | yeah, they don't use HA queues | 15:12 |
noonedeadpunk | not sure how they recover in case of rabbit failure though... | 15:12 |
noonedeadpunk | what we saw previously without HA queues is that services got stuck on rabbit failover for some reason. Maybe it's solved now though | 15:12 |
spatel | when I last talked to someone, they said they were running this HA policy which helped them a lot - http://paste.openstack.org/show/807279/ | 15:13 |
spatel | no HA for notifications* etc, which is useless | 15:14 |
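The pasted policy itself is not recoverable from the log, but the general shape of such a policy (mirror everything except notification queues) can be sketched with rabbitmqctl; the vhost, policy name, and regex below are illustrative assumptions, not the contents of the paste:

```shell
# Mirror all queues except notifications.* to 2 nodes (illustrative values)
rabbitmqctl set_policy -p /neutron --apply-to queues HA \
  '^(?!notifications\.).*' \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'
```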
spatel | I will try to play with it in the lab and see how we can make RabbitMQ more responsive... it's painful when you want to scale.. | 15:15 |
spatel | OVN can solve a lot of rabbitMQ issues but it's a beast in itself | 15:15 |
fridtjof[m] | jrosser: re the issue mindthecap was having, I've personally experienced this on Ubuntu 18.04 at least | 15:47 |
jrosser | ah - i was wondering if it was a centos type thing where the service is installed but disabled by default | 15:48 |
*** mgoddard- is now known as mgoddard | 20:50 | |
opendevreview | Ghanshyam proposed openstack/openstack-ansible master: Moving IRC network reference to OFTC https://review.opendev.org/c/openstack/openstack-ansible/+/800127 | 23:25 |
opendevreview | Ghanshyam proposed openstack/ansible-hardening master: Moving IRC network reference to OFTC https://review.opendev.org/c/openstack/ansible-hardening/+/800128 | 23:26 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!