opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-repo_server master: Ensure insist=true is always set for lsyncd https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/828678 | 09:47 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-repo_server master: Use ssh_keypairs role to generate keys for repo sync https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/827100 | 09:48 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-os_neutron stable/wallaby: Make calico non voting https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/828657 | 10:29 |
jrosser | calico is now NV on all branches | 10:30 |
jrosser | it's all but deprecated really, we've just not said that | 10:30 |
jrosser | evrardjp_: ^ if you are interested in calico it needs some TLC | 10:30 |
evrardjp[m] | jrosser: while I like the concept around calico, I never had my hands into this. It was all handled by logan- | 11:22 |
evrardjp[m] | Thanks for the ping | 11:22 |
*** dviroel|out is now known as dviroel|ruck | 11:30 | |
jrosser | do we want a DNM patch to os_neutron to test this? https://review.opendev.org/c/openstack/openstack-ansible/+/828553 | 11:42 |
jrosser | or are we happy that the same patch on the stable branches is OK | 11:42 |
jrosser | need to start merging that as it's yet another thing holding up centos-8 removal | 11:42 |
jrosser | andrewbonney: is this what you saw with libvirt sockets? https://zuul.opendev.org/t/openstack/build/860df46e474d4333a573f6069e7d90b4/log/job-output.txt#20874 | 11:47 |
andrewbonney | Yes, looks like it. My manual solution was 'systemctl stop libvirtd && systemctl start libvirtd-tls'. The play could then be run successfully | 11:48 |
andrewbonney | If you stop libvirtd manually it gives a warning that the sockets may cause it to start again | 11:49 |
jrosser | noonedeadpunk: ^ feels like there's a proper bug here | 11:51 |
jrosser | there are some specific ordering requirements around systemd sockets and the service activated by the socket | 11:51 |
noonedeadpunk | regarding https://review.opendev.org/c/openstack/openstack-ansible/+/828553 I think it shouldn't hurt for sure. Especially on master where we fix tempest plugins by SHA | 11:52 |
jrosser | i had the same in the galera xinetd changes, and the only thing i could come up with was this https://github.com/openstack/openstack-ansible-galera_server/blob/master/tasks/galera_server_post_install.yml#L71 | 11:52 |
noonedeadpunk | regarding sockets - we haven't caught this for some reason. | 11:55 |
noonedeadpunk | So basically, when we start libvirtd-tcp.socket it starts libvirt and then libvirtd-tls.socket fails? | 11:55 |
noonedeadpunk | or was libvirt started by the package installation? | 11:55 |
noonedeadpunk | because we explicitly stop it with previous task? | 11:56 |
andrewbonney | I think libvirtd.socket is already started. If you just stop libvirtd on a system and wait for long enough it'll start itself again, which I assume is triggered by one of the sockets given the message it gives when you stop it | 11:56 |
jrosser | with galera, iirc if the service started by the socket was not loaded in systemd when you try to make the .socket service, it all goes bad | 11:58 |
noonedeadpunk | yes, indeed libvirtd is re-activated within 45 seconds tops | 12:04 |
noonedeadpunk | so it's race condition I believe | 12:04 |
andrewbonney | Yeah, and the bigger the deployment the more likely it is you'll see it, which matches our experience | 12:05 |
noonedeadpunk | and that's indeed libvirtd.socket that brings it back | 12:06 |
noonedeadpunk | well, should be a trivial fix i believe then | 12:09 |
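A minimal sketch of the manual workaround described above (not necessarily the exact content of the os_nova fix in review 828704): stop the libvirt socket units first so systemd socket activation cannot bring libvirtd back mid-play, then start the TLS listener. The unit names assume the stock libvirt systemd units.

```bash
# Assumed stock libvirt unit names; check `systemctl list-units 'libvirtd*'` on the compute.
systemctl stop libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket  # prevent socket re-activation
systemctl stop libvirtd
systemctl start libvirtd-tls.socket
```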
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_nova master: Drop libvirtd_version identification https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/828702 | 12:13 |
jrosser | ok finally the keypairs stuff is all passing https://review.opendev.org/q/topic:osa%252Fkeypairs | 12:16 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_nova master: Fix race-condition when libvirt starts unwillingly https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/828704 | 12:23 |
noonedeadpunk | I guess this should fix the thing ^ ? | 12:24 |
*** dviroel is now known as dviroel|ruck | 13:13 | |
spatel | noonedeadpunk jrosser did you see this error before - https://paste.opendev.org/show/bdmUhBJkdDRfrhKg7Lyt/ | 14:22 |
spatel | i am not able to create VMs because my conductor is not happy | 14:22 |
noonedeadpunk | Rings some bell from R | 14:22 |
noonedeadpunk | where after rabbit member restart queues were lost without HA queues | 14:23 |
noonedeadpunk | So had to either restart everything or clean up the remaining queues and basically rebuild the rabbit cluster. | 14:24 |
noonedeadpunk | But maybe that is something different | 14:24 |
spatel | I had a disaster last night and i re-built rabbitMQ | 14:24 |
spatel | now i'm not able to build any VM :( | 14:24 |
noonedeadpunk | and conductor restart doesn't help either? | 14:24 |
spatel | trying that now | 14:25 |
spatel | Yesterday this is what happened: i rebuilt rabbitMQ and then all my nova-compute nodes started throwing this error - https://paste.opendev.org/show/bFtAhaCLfed1R3ptNr1F/ | 14:25 |
spatel | i re-built rabbitMQ multiple times hoping it would fix things, but no luck :( | 14:25 |
noonedeadpunk | yeah it happens if conductor is dead | 14:26 |
spatel | so i shutdown all nova-api container and then rebuild rabbitMQ | 14:26 |
spatel | then nova-compute stopped throwing errors but now i'm not able to rebuild VMs (this is wallaby so maybe a bug or something) | 14:27 |
spatel | let me restart conductor and see if it help | 14:27 |
spatel | currently my rabbitMQ is running without HA queues, is that ok? | 14:27 |
noonedeadpunk | you tell me :D | 14:28 |
spatel | let me restart conductor and see if that helps | 14:28 |
noonedeadpunk | the thing that bugged me without HA queues is that when a gone rabbit member re-joined the cluster, it was not aware of any queues that existed, but clients were expecting to be able to get them from it | 14:30 |
noonedeadpunk | which kind of sounds like case here | 14:30 |
spatel | How do i bring back HA? | 14:31 |
spatel | how do i run common-rabbitmq tag so it will deploy queue for all my service? | 14:31 |
spatel | i forgot how to run that | 14:31 |
spatel | do you know command? | 14:31 |
noonedeadpunk | so conductor restart didn't help? | 14:32 |
spatel | no still my VM stuck in BUILD | 14:33 |
spatel | not seeing any error in rabbitMQ logs | 14:34 |
spatel | no error in conductor log | 14:34 |
noonedeadpunk | it might take like $timeout before the conductor errors out | 14:34 |
spatel | hmm | 14:35 |
noonedeadpunk | but considering that overrides removed, it was smth like `openstack-ansible setup-openstack.yml --tags common-mq` | 14:35 |
spatel | but that command will run on all compute nodes correct? it will take a hell of a lot of time with 200 computes | 14:36 |
spatel | how about i can do - rabbitmqctl -q -n rabbit set_policy -p /keystone HA '^(?!(amq\\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode": "all"}' --priority 0 | 14:36 |
spatel | for each service | 14:37 |
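A hedged sketch of running that same policy against each service vhost instead of re-running the full playbook; the vhost list below is an assumption, take the real one from `rabbitmqctl list_vhosts`.

```bash
# Apply an HA policy per vhost; the vhost names are placeholders for this deployment.
for vhost in /nova /neutron /keystone /glance /cinder; do
  rabbitmqctl -q -n rabbit set_policy -p "$vhost" HA \
    '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode": "all"}' --priority 0
done
```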
noonedeadpunk | why not | 14:37 |
spatel | let me try | 14:37 |
noonedeadpunk | you can actually start just from nova | 14:37 |
noonedeadpunk | to see if that helps | 14:37 |
spatel | i did only for nova/neutron (because that one is more important) | 14:38 |
noonedeadpunk | still nova-* will likely need to be restarted | 14:38 |
noonedeadpunk | on nova_api at least | 14:38 |
spatel | let me restart all nova-* service | 14:40 |
opendevreview | Merged openstack/openstack-ansible-os_keystone master: Remove legacy policy.json cleanup handler https://review.opendev.org/c/openstack/openstack-ansible-os_keystone/+/827443 | 14:44 |
spatel | noonedeadpunk still no luck VM stuck in BUILD | 14:44 |
spatel | why nova-conductor logs not flowing.. | 14:45 |
spatel | seems like no activity | 14:45 |
spatel | there are 15,235 mesg in Ready queue | 14:46 |
spatel | keep growing.. | 14:46 |
noonedeadpunk | and is everything fine with networking on containers? | 14:47 |
spatel | neutron not throwing any error in logs but i can restart | 14:48 |
spatel | mesg stuck in this kind of queue scheduler_fanout_5fc638e953cb4dc6b47151fef969d54c | 14:49 |
spatel | https://paste.opendev.org/show/bsQzWFYAsSDyb5EOzbLD/ | 14:51 |
noonedeadpunk | and how does scheduler log look like? | 14:52 |
spatel | clean :( | 14:53 |
spatel | no error at all | 14:53 |
noonedeadpunk | hm | 14:56 |
noonedeadpunk | what if restart nova-compute on 1 node? | 14:56 |
noonedeadpunk | will it still complain on rabbit? | 14:57 |
spatel | i am not seeing any error on nova-compute logs | 14:58 |
spatel | also restarting nova-compute taking very long time.. | 14:58 |
noonedeadpunk | as maybe they try to use old fanout queues that are not listened to anymore, since those live for some time until they die | 14:58 |
spatel | same with neutron-linuxbridge-agent.service | 14:58 |
spatel | how do i use old fanout? | 14:58 |
noonedeadpunk | is everything really fine with container networking? | 14:58 |
noonedeadpunk | like there's an ip on eth0 and eth1? | 14:59 |
spatel | yes.. i can ping everything so far | 14:59 |
spatel | rabbitMQ cluster is up | 14:59 |
spatel | agent can ping rabbitMQ IP etc | 14:59 |
spatel | does rabbit restart help? | 15:00 |
spatel | fanout queues should get deleted after restart, correct? | 15:00 |
noonedeadpunk | not instantly after restart | 15:01 |
noonedeadpunk | but like in 20 minutes or smth... | 15:01 |
spatel | something stuck in queue which is keep growing | 15:01 |
opendevreview | Merged openstack/openstack-ansible-os_keystone master: Drop ProxyPass out of VHost https://review.opendev.org/c/openstack/openstack-ansible-os_keystone/+/828519 | 15:03 |
spatel | is systemctl restart rabbitmq-server.service enough, or should i stop everything and then start again? | 15:05 |
opendevreview | Merged openstack/openstack-ansible stable/victoria: Remove enablement of neutron tempest plugin in scenario templates https://review.opendev.org/c/openstack/openstack-ansible/+/828386 | 15:10 |
spatel | noonedeadpunk getting this error in conductor now - Connection pool limit exceeded: current size 30 surpasses max configured rpc_conn_pool_size 30 | 15:12 |
noonedeadpunk | so sounds like it was processing smth | 15:17 |
spatel | noonedeadpunk how much time does systemctl restart nova-compute take in your setup? | 15:17 |
spatel | in my case it's taking 5 min and more | 15:17 |
noonedeadpunk | depends. might be 30 sec | 15:17 |
noonedeadpunk | oh, well | 15:17 |
noonedeadpunk | usually it means it wasn't spawned properly | 15:18 |
noonedeadpunk | or running for tooooo long | 15:18 |
noonedeadpunk | so things slow on shutting down there | 15:18 |
spatel | let me try to shutdown and then start | 15:18 |
noonedeadpunk | so if some connection is stuck - it will likely wait for a timeout before stopping the service | 15:19 |
spatel | stop taking longer time | 15:19 |
spatel | start is quick | 15:19 |
spatel | i am thinking to rebuild rabbitMQ again.. :( i have no option left | 15:20 |
noonedeadpunk | I kind of agree here | 15:20 |
spatel | i am doing apt-get purge rabbitmq-server on container | 15:20 |
spatel | is that good enough? | 15:20 |
noonedeadpunk | why not just re-run the playbook? :) | 15:21 |
spatel | that won't do anything, right? | 15:21 |
spatel | like wiping mnesia etc | 15:21 |
noonedeadpunk | with -e rabbitmq_upgrade=true? | 15:21 |
noonedeadpunk | it kind of will | 15:22 |
spatel | do you think it will wipe down | 15:22 |
spatel | ok | 15:22 |
noonedeadpunk | but it won't wipe users/vhosts/permissions | 15:22 |
noonedeadpunk | but clean up all queues | 15:22 |
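For reference, a sketch of the re-run being suggested here, using the standard OSA playbook from the deploy host (the playbook directory is an assumption and may differ per deployment):

```bash
# Re-deploys RabbitMQ in place; keeps users/vhosts/permissions but clears queues.
cd /opt/openstack-ansible/playbooks
openstack-ansible rabbitmq-install.yml -e rabbitmq_upgrade=true
```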
spatel | ok | 15:22 |
spatel | should i shut down the neutron-server and nova containers first? | 15:22 |
noonedeadpunk | nah | 15:22 |
spatel | ok i thought they will keep trying and make a mess in the queues | 15:23 |
noonedeadpunk | it shouldn't be a mess if rabbit is working | 15:23 |
spatel | ok running -e rabbitmq_upgrade=true | 15:23 |
spatel | how about compute agent and network agent? | 15:24 |
spatel | do they reconnect by themselves? | 15:24 |
noonedeadpunk | I'd leave everything as is | 15:24 |
noonedeadpunk | and yes, they should | 15:24 |
spatel | ok | 15:24 |
noonedeadpunk | not guaranteed though | 15:24 |
spatel | last night when i rebuilt RabbitMQ they didn't, so i had to restart all the agents | 15:24 |
spatel | but that took a hell of a long time | 15:25 |
spatel | I had zero issues rebuilding rabbit in queens but in wallaby this is a mess... feels like a bug | 15:26 |
spatel | done | 15:27 |
spatel | all rabbit nodes are up | 15:27 |
spatel | should i reboot anything or leave it | 15:28 |
spatel | some compute nodes are showing this error so i assume they're not going to re-try - https://paste.opendev.org/show/b3XS6aQwO4q4AuDIRN8X/ | 15:31 |
opendevreview | Merged openstack/openstack-ansible-repo_server master: Ensure insist=true is always set for lsyncd https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/828678 | 15:31 |
opendevreview | Merged openstack/openstack-ansible stable/wallaby: Remove enablement of neutron tempest plugin in scenario templates https://review.opendev.org/c/openstack/openstack-ansible/+/828551 | 15:38 |
opendevreview | Merged openstack/openstack-ansible master: Simplify mount options for new kernels. https://review.opendev.org/c/openstack/openstack-ansible/+/827464 | 15:38 |
spatel | noonedeadpunk i am getting this error in conductor - https://paste.opendev.org/show/bRtDZIKivNjbplTFbVuY/ | 15:41 |
spatel | assuming its ok | 15:42 |
spatel | still not able to build VM :( | 15:43 |
opendevreview | Merged openstack/openstack-ansible master: Remove workaround for OpenSUSE when setting AIO hostname https://review.opendev.org/c/openstack/openstack-ansible/+/827465 | 15:48 |
opendevreview | Merged openstack/openstack-ansible master: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828463 | 15:48 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/xena: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828663 | 15:49 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/wallaby: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828664 | 15:50 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/victoria: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828665 | 15:50 |
spatel | noonedeadpunk do you think this is mysql issue - https://paste.opendev.org/show/bxSMimrEKVIXI5LePhAF/ | 15:58 |
noonedeadpunk | nah, it's fine | 16:01 |
spatel | hmm i thought may be conductor not able to make connection to mysql | 16:01 |
spatel | do you think conductor doesn't have enough thread etc.. | 16:02 |
spatel | does this look ok to you? - https://paste.opendev.org/show/bxSMimrEKVIXI5LePhAF/ | 16:02 |
spatel | rabbitMQ looks healthy, no messages in the queues, all queues are at zero | 16:03 |
noonedeadpunk | if the conductor dropped the connection, such a message would arise | 16:04 |
noonedeadpunk | and compute service list shows everybody is healthy and online? | 16:04 |
spatel | majority are up | 16:05 |
spatel | even if a couple are down i should be able to spin up a vm | 16:06 |
spatel | my VM creation process is stuck in BUILD which means messages are not passing between components | 16:06 |
noonedeadpunk | and what in scheduler/conductor logs when you try to spawn VM? | 16:07 |
noonedeadpunk | do they even process request? | 16:07 |
spatel | no activity | 16:07 |
spatel | let me go deeper and see | 16:08 |
spatel | i found one log - 'nova.exception.RescheduledException: Build of instance 95c40e2a-2eed-4c75-a131-da55eaac7060 was re-scheduled: Binding failed for port 05272a41-a60c-43ea-9a7a-30a67ed92ddb, please check neutron logs for more information.\n'] | 16:08 |
spatel | let me check neutron logs | 16:08 |
*** priteau_ is now known as priteau | 16:11 | |
spatel | noonedeadpunk now my vm failed with this error - {"code": 500, "created": "2022-02-10T16:16:36Z", "message": "Build of instance 7fbac40d-2829-47f4-9291-31c07c43d04c aborted: Failed to allocate the network(s), not rescheduling."} | 16:18 |
noonedeadpunk | well, it's smth then:) | 16:18 |
spatel | but neutron-server logs not saying anything - https://paste.opendev.org/show/blOElBzra5OVjid9uLHH/ | 16:19 |
spatel | no error in neutron-server logs related that UUID | 16:19 |
noonedeadpunk | but it's likely one of the neutron agents that's failing | 16:20 |
spatel | i have 200 nodes so very odd | 16:22 |
noonedeadpunk | nova scheduler doesn't really check neutron agent aliveness on the compute where it places the VM | 16:23 |
nurdie | Hey OSA! Does anyone know of an easy method of migrating openstack instances from one production deployment to another? Currently have a cluster deployed on CentOS7 and going to be building a new cluster on Ubuntu that will replace the old | 16:23 |
noonedeadpunk | nurdie: https://github.com/os-migrate/os-migrate | 16:23 |
noonedeadpunk | or you can just add ubuntu nodes to centos cluster and migrate things | 16:24 |
noonedeadpunk | (or re-setup them step-by-step) | 16:24 |
nurdie | We're going to rebuild from ground up. Controllers + Computes | 16:24 |
nurdie | I remember I chatted with someone that said it's possible to mix operating systems but it can become problematic, so I'd rather just start anew | 16:25 |
noonedeadpunk | well, how I was doing common re-setup - shut down 1 controller, re-setup to ubuntu, then re-setup computes one by one with offline vm migration between them | 16:25 |
spatel | let me create seperate group and try to create vm in that group | 16:25 |
noonedeadpunk | and then re-setup the rest of the controllers | 16:26 |
noonedeadpunk | basically, algorithm would be same as for OS upgrade | 16:26 |
noonedeadpunk | but it's up to you ofc :) | 16:26 |
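A rough sketch of the per-compute step in that approach, assuming cold (offline) migration; the host names are placeholders, and `--host` for cold migration needs compute API microversion >= 2.56.

```bash
# Drain one CentOS compute onto an already-reinstalled Ubuntu compute.
openstack compute service set --disable compute-centos-01 nova-compute
for vm in $(openstack server list --host compute-centos-01 --all-projects -f value -c ID); do
  openstack server migrate --os-compute-api-version 2.56 --host compute-ubuntu-01 "$vm"
  # each instance then needs a resize-confirm once it reaches VERIFY_RESIZE
done
```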
nurdie | interesting o.0 | 16:27 |
noonedeadpunk | os-migrate can work with basic set of resources. Nothing like octavia or magnum is supported | 16:27 |
spatel | noonedeadpunk very odd, i can see vm is up but in paused state in virsh | 16:27 |
nurdie | I'll be honest, my hesitation comes from how wonky C7 has been. I don't blame OSA for that. It was beautiful until RHEL bent the whole world over a table | 16:27 |
noonedeadpunk | btw I guess spatel can tell you story of how to migrate from rhel7 to ubuntu :D | 16:28 |
spatel | hehe | 16:28 |
spatel | do you think the neutron agents are reporting the wrong status.. | 16:29 |
spatel | they are showing as up but they are not | 16:29 |
noonedeadpunk | are they showing as healthy? | 16:29 |
nurdie | It's a very basic deployment. We don't even have designate yet. I'm going to checkout os-migrate though. I'm also going to force uniformity of compute node resources instead of the frankenstein cluster we have of several different server models, CPU counts, RAM counts. Gonna use old servers to setup a CERT/staging/testing OSA environment | 16:29 |
spatel | yes they are showing healthy | 16:29 |
nurdie | noonedeadpunk: thanks! | 16:30 |
noonedeadpunk | huh | 16:30 |
noonedeadpunk | yeah, then unlikely they will lie | 16:30 |
noonedeadpunk | and what's in neutron-ovs-agent logs on compute? | 16:30 |
noonedeadpunk | nurdie: running cluster of same computes is possible only on launch :D | 16:32 |
spatel | i have linux-agent | 16:34 |
nurdie | You're not wrong hahahhaha. Alas, the new cluster should be big enough to support us until we can deploy computes that are exactly double or something more mathematically pretty | 16:34 |
noonedeadpunk | spatel: you run LXB? O_O | 16:34 |
spatel | yes LXB | 16:35 |
noonedeadpunk | didn't know that | 16:35 |
* noonedeadpunk loves lxb | 16:35 | |
jamesdenton | It Works™ | 16:39 |
spatel | noonedeadpunk the vm is spinning up but neutron is not attaching the nic | 16:40 |
spatel | getting this error in neutron agent - https://paste.opendev.org/show/bJ8O9MnPWXHzwKKlo8sV/ | 16:40 |
spatel | jamesdenton any idea | 16:41 |
spatel | thats all in my log | 16:41 |
noonedeadpunk | see no error | 16:41 |
spatel | that is why it's very odd | 16:41 |
jamesdenton | anything in nova compute? | 16:42 |
spatel | vm stuck in paused | 16:42 |
spatel | nova created VM and no error in logs but let me check again | 16:42 |
noonedeadpunk | so basically, compute asks for port through API, it's created, but neutron api reports back failure | 16:42 |
spatel | no error in nova | 16:42 |
noonedeadpunk | oh, btw, is the port shown in the neutron api? | 16:43 |
spatel | let me check | 16:43 |
spatel | yes - https://paste.opendev.org/show/bfxx05SXqv9GegQe6XoA/ | 16:44 |
spatel | i can see the ports but they are in BUILD state | 16:44 |
spatel | now the vm is in error state | 16:44 |
spatel | what could be wrong? | 16:46 |
spatel | i restarted neutron multiple times | 16:47 |
jamesdenton | is rabbitmq cleared out now? | 16:47 |
spatel | no error in rabbitmq | 16:47 |
spatel | what could be wrong to attach nic to VM | 16:47 |
noonedeadpunk | so basically neutron agent doesn't report back to api | 16:48 |
spatel | in agent logs look like it got correct IP etc.. so it should attach | 16:48 |
*** priteau is now known as priteau_ | 16:48 | |
spatel | noonedeadpunk ohh is that possible | 16:48 |
noonedeadpunk | which it should do through rabbit | 16:48 |
*** priteau_ is now known as priteau | 16:48 | |
spatel | damn it.. again rabbit.. | 16:49 |
spatel | how do i check that | 16:49 |
noonedeadpunk | and this specific agent shown as alive? | 16:49 |
spatel | yes its alive | 16:49 |
spatel | let me check again | 16:49 |
noonedeadpunk | and you now have ha-queues for neutron? | 16:49 |
noonedeadpunk | shouldn't matter though | 16:49 |
noonedeadpunk | hm | 16:49 |
noonedeadpunk | any stuck messages in neutron fanout? | 16:50 |
spatel | | 01647432-727b-43ea-a21a-bba2841bc9bc | Linux bridge agent | ostack-phx-comp-gen-1-33.v1v0x.net | None | :-) | UP | neutron-linuxbridge-agent | | 16:50 |
spatel | agent is UP | 16:50 |
spatel | yes i do have HA queue for neutron | 16:50 |
spatel | if the agent is showing as up, that means rabbitMQ is also clear, correct? | 16:51 |
noonedeadpunk | well, it depends... | 16:52 |
noonedeadpunk | I think the trick here is that any neutron-server can mark the agent as up | 16:52 |
noonedeadpunk | but only a specific one should receive the reply from the agent? | 16:53 |
noonedeadpunk | not sure here tbh | 16:53 |
noonedeadpunk | I don't recall how nova acts when it comes to interaction with neutron... | 16:53 |
spatel | hmm this is crazy | 16:54 |
noonedeadpunk | does it just ask the api if the port is ready, or is it waiting for a reply on the same connection... | 16:54 |
spatel | jamesdenton knows better | 16:54 |
spatel | restarted neutron-server | 16:57 |
spatel | still no luck | 16:57 |
spatel | any other agent i should be restarting | 16:59 |
noonedeadpunk | considering that you already restarted lxb-agent on compute? | 16:59 |
spatel | many times | 16:59 |
spatel | both nova-compute and LXB | 17:00 |
noonedeadpunk | nah, dunno | 17:00 |
noonedeadpunk | it should work) | 17:00 |
noonedeadpunk | maybe smth on net node ofc... | 17:00 |
noonedeadpunk | like dhcp agent... | 17:00 |
noonedeadpunk | or metadata... | 17:00 |
noonedeadpunk | damn | 17:00 |
noonedeadpunk | I can recall that happening | 17:00 |
spatel | really | 17:01 |
noonedeadpunk | it was like a dead dnsmasq or smth | 17:01 |
spatel | in that case i should lose the IP | 17:01 |
noonedeadpunk | or some issue with l3 where net comes from | 17:01 |
spatel | i have 500 vms running on this cloud | 17:01 |
spatel | I have all vlan networking no l3-agent or gateway | 17:01 |
spatel | https://bugs.launchpad.net/neutron/+bug/1719011 | 17:02 |
spatel | this is old bug but may be | 17:02 |
noonedeadpunk | so basically because of a weird dns-agent state just the port creation was failing | 17:02 |
noonedeadpunk | I believe it was dns-agent indeed | 17:02 |
spatel | dns-agent? | 17:03 |
noonedeadpunk | *dhcp | 17:03 |
spatel | you are asking to restart - neutron-dhcp-agent.service ? | 17:03 |
spatel | not seeing in error in logs | 17:04 |
spatel | let me restart | 17:04 |
spatel | any good way i can delete nova queue in openstack and re-create (i don't want to rebuild rabbit again) | 17:05 |
noonedeadpunk | I don't think you need? | 17:05 |
spatel | let me restart dhcp and then see | 17:06 |
noonedeadpunk | there's no problem with nova imo | 17:06 |
spatel | restarting neutron-metadata-agent.service | 17:08 |
noonedeadpunk | metadata is other thing? | 17:08 |
spatel | lets see | 17:08 |
spatel | why does this stuff stop VM creation and interface attach | 17:08 |
noonedeadpunk | because interface is not even built properly? | 17:09 |
noonedeadpunk | or well, it misses last bit I guess | 17:09 |
spatel | mmmm | 17:09 |
spatel | lets see | 17:09 |
spatel | restart taking very long time for systemctl restart neutron-metadata-agent.service | 17:10 |
noonedeadpunk | which is to make dhcp aware of the ip-mac assignment | 17:10 |
spatel | hmm maybe you are right here... | 17:10 |
spatel | something is holding back | 17:10 |
spatel | hope you are correct :) | 17:10 |
spatel | waiting for my reboot to finish | 17:10 |
spatel | i meant restart | 17:11 |
noonedeadpunk | I can recall this happening for us, with nothing in logs | 17:11 |
noonedeadpunk | so we as well were restarting everything we saw :D | 17:11 |
noonedeadpunk | (another point for OVN) | 17:11 |
spatel | +1 yes OVN is much simple | 17:12 |
noonedeadpunk | until it breaks :D | 17:12 |
spatel | restarting meta on last node | 17:13 |
spatel | its taking long time so must be stuck threads | 17:14 |
noonedeadpunk | I hope you restarted dhcp agent as well :) | 17:14 |
spatel | yes | 17:14 |
spatel | systemctl restart neutron-dhcp-agent.service and neutron-metadata-agent.service | 17:14 |
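For reference, the same restart can be pushed from the deployment host with ad-hoc Ansible instead of logging into each agent host; the inventory group names below are assumptions, check your OSA inventory.

```bash
# Restart the DHCP and metadata agents everywhere they run (group names assumed).
ansible neutron_dhcp_agent -m service -a 'name=neutron-dhcp-agent state=restarted'
ansible neutron_metadata_agent -m service -a 'name=neutron-metadata-agent state=restarted'
```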
spatel | damn this is taking a very long time.. still hanging | 17:16 |
*** dviroel|ruck is now known as dviroel|ruck|afk | 17:16 | |
spatel | no kidding..... it works | 17:17 |
spatel | Damn noonedeadpunk | 17:18 |
noonedeadpunk | yeah, I know. Sorry for not recalling earlier | 17:19 |
spatel | OMG!! you saved my life.. | 17:19 |
noonedeadpunk | the nastiest thing is it wasn't logging anything | 17:19 |
spatel | This should be a bug... trust me... | 17:19 |
spatel | for the last 24 hours i've been hitting the wall | 17:19 |
noonedeadpunk | You should prepare talk about that for summit :D | 17:20 |
noonedeadpunk | They conveniently prolonged CFP | 17:20 |
spatel | I am not there yet :( | 17:21 |
spatel | still need to learn lots | 17:21 |
noonedeadpunk | one of your blog posts could totally fit a presentation | 17:22 |
noonedeadpunk | as you did tons of research in networking | 17:22 |
noonedeadpunk | and have numbers | 17:22 |
spatel | I am going to document this incident | 17:25 |
spatel | how i build rabbitMQ how i did HA and how i did everything.. | 17:25 |
spatel | also i am thinking to include the IRC transcript of our discussion | 17:26 |
spatel | jamesdenton do i need neutron-linuxbridge-agent.service in SRIOV nodes? | 17:35 |
spatel | i believe SRIOV need only - neutron-sriov-nic-agent.service | 17:36 |
jamesdenton | i don't think you do | 17:42 |
jamesdenton | scrollback - glad to see you got it working | 17:43 |
spatel | jamesdenton This is very unknown issue.. i don't think its easy to figure out :( Big thanks to noonedeadpunk to stick around with me | 17:55 |
spatel | he didn't give up :) | 17:55 |
spatel | blog is on the way for postmortem | 17:57 |
fridtjof[m] | ugh wow, just got bitten by last year's let's encrypt CA thing. it sneakily broke a major upgrade to wallaby :D nothing could clone stuff from opendev until i upgraded ca-certificates... | 18:08 |
jrosser | i wonder if we should have a task to update ca-certificates specifically to 'latest' rather than just 'present' | 18:39 |
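A sketch of that idea as an ad-hoc run (a real change would be a task in the relevant role, which is not shown here):

```bash
# Force ca-certificates to the latest packaged version on all hosts.
ansible all -m package -a 'name=ca-certificates state=latest'
```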
opendevreview | Merged openstack/openstack-ansible-rabbitmq_server master: Allow different install methods for rabbit/erlang https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/826445 | 19:01 |
opendevreview | Merged openstack/openstack-ansible-rabbitmq_server master: Update used RabbitMQ and Erlang https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/826446 | 19:07 |
*** dviroel|ruck|afk is now known as dviroel|ruck | 19:19 | |
mgariepy | anyone here have an idea how bad it is when a memcached server is down ? | 19:53 |
mgariepy | like, which services are really badly impacted by this | 19:54 |
spatel | I think only horizon use it | 19:54 |
spatel | keystone | 19:54 |
spatel | for token cache i believe | 19:54 |
mgariepy | nova has it in the config https://github.com/openstack/openstack-ansible-os_nova/blob/master/templates/nova.conf.j2#L68 | 19:55 |
spatel | maybe just to see if the token is in the cache or not | 19:55 |
spatel | https://skuicloud.wordpress.com/2014/11/15/openstack-memcached-for-keystone/ | 19:57 |
spatel | i believe only keystone uses memcache | 19:58 |
mgariepy | i'm moving racks between 2 DC. | 19:59 |
mgariepy | i have 1 controller per rack.. | 20:00 |
mgariepy | but when i stop a memcache server i do have a lot of issues with nova | 20:00 |
spatel | really..? nova mostly uses it to check cached tokens i believe, to speed up the process and put less load on keystone | 20:02 |
mgariepy | or it's the nova/neutron interaction that is failing. | 20:02 |
mgariepy | it's somewhat of a pain really. | 20:03 |
spatel | i went through great pain so i can understand.. but trust me i don't think memcache can cause any issue to nova | 20:03 |
spatel | i never touch memcache in my deployment (you even reminded me today that we have memcache) | 20:03 |
mgariepy | when it's up it works. but if you shut one down then there are a lot of delays that start to appear | 20:04 |
spatel | hmm i never tested shutting down memcached | 20:09 |
spatel | are you currently down or just observing the behavior | 20:09 |
mgariepy | i do have 2 up out of 3. | 20:10 |
mgariepy | Networking client is experiencing an unauthorized exception. (HTTP 400) | 20:13 |
mgariepy | when doing openstack server list --all | 20:14 |
spatel | hmm related to keystone / memcache issue | 20:15 |
mgariepy | i did remove the memcache server that is down from keystone. restarted the processes and flushed the memcached on all nodes. | 20:25 |
spatel | any improvement? | 20:26 |
mgariepy | nope | 20:26 |
mgariepy | one in 6-7 calls ends with the error. | 20:27 |
mgariepy | hmm.. openstack network list takes 1 sec, except once in a while.. it takes 4 sec. | 20:29 |
spatel | that is normal, correct? | 20:33 |
mgariepy | 1sec is ok but 4 doesn't seems ok ;) | 20:33 |
mgariepy | it should be consistent i guess | 20:33 |
spatel | in my case it take 2s | 20:34 |
mgariepy | i'll dig more tomorrow. | 20:55 |
spatel | Please share your findings :) | 20:56 |
mgariepy | i will probably poke you guys also tomorrow ! haha | 20:56 |
mgariepy | Moving racks between 2 DC live seems to be a good idea at first .. | 20:57 |
mgariepy | lol | 20:57 |
spatel | hehe you are tough | 20:58 |
mgariepy | well.. the DC are 1 block away. with dual 100Gb links between them .. :) | 20:58 |
mgariepy | it's almost like a local relocation ahha | 20:59 |
spatel | why do you want to split between two DC | 20:59 |
spatel | assuming redundancy | 20:59 |
mgariepy | nop | 20:59 |
mgariepy | we need to move out of dc1.. | 20:59 |
mgariepy | haha | 20:59 |
spatel | :) | 21:00 |
spatel | how many nodes do you have? | 21:00 |
jrosser | the services all have keystone stuff in their configs | 21:06 |
mgariepy | ~110 computes, | 21:06 |
jrosser | and by extension they will be affected by loss of memcached | 21:06 |
mgariepy | so i should just update the memcache config on all service ? | 21:07 |
jrosser | it’s something like, when a service needs to auth against keystone it either gets a short circuit via a cached thing in memcached, or keystone has to do expensive crypto thing | 21:08 |
jrosser | so all your services can be affected, and memcached also is not a cluster | 21:09 |
jrosser | there is a slightly opaque layer in oslo for having a pool of memcached servers | 21:10 |
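A rough way to see which services would be hit, assuming the usual config paths inside the service containers: the memcached server list shows up in the keystonemiddleware/oslo.cache settings of each service.

```bash
# Paths are assumptions; look for the downed memcached node in each service config.
grep -n 'memcache' /etc/keystone/keystone.conf /etc/nova/nova.conf /etc/neutron/neutron.conf
```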
mgariepy | hmm ok i'll update all the services tomorrow then. but i did test that part a bit when i did the ubuntu 16>18 upgrades on rocky | 21:10 |
mgariepy | but it might be a bit different on V. | 21:11 |
jrosser | I think we struggled during OS upgrade when the performance collapsed | 21:11 |
mgariepy | i've seen keystone taking like 30s instead of 2s when a memcache was down | 21:12 |
mgariepy | but never seen errors like i do see now. | 21:12 |
damiandabrowski[m] | sorry for interrupting you guys but I'm confused by our haproxy config :D is there any point in setting `balance source` anywhere when by default we use `balance leastconn` + `stick store-request src`? | 21:13 |
jrosser | i am not sure tbh | 21:21 |
damiandabrowski[m] | thanks, i'll try to have a deeper look at this. But as far as I can see, we define `balance source` only for: adjutant_api, ceph_rgw, cloudkitty_api, glance_api, horizon, nova_console, sahara_api, swift_proxy, zun_console | 21:29 |
*** dviroel|ruck is now known as dviroel|out | 21:49 | |
*** prometheanfire is now known as Guest2 | 23:49 |