opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-repo_server master: Ensure insist=true is always set for lsyncd https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/828678 | 09:47 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-repo_server master: Use ssh_keypairs role to generate keys for repo sync https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/827100 | 09:48 |
opendevreview | Jonathan Rosser proposed openstack/openstack-ansible-os_neutron stable/wallaby: Make calico non voting https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/828657 | 10:29 |
jrosser | calico is now NV on all branches | 10:30 |
jrosser | it's all but deprecated really, we've just not said that | 10:30 |
jrosser | evrardjp_: ^ if you are interested in calico it needs some TLC | 10:30 |
evrardjp[m] | jrosser: while I like the concept around calico, I never had my hands into this. It was all handled by logan- | 11:22 |
evrardjp[m] | Thanks for the ping | 11:22 |
*** dviroel|out is now known as dviroel|ruck | 11:30 | |
jrosser | do we want a DNM patch to os_neutron to test this? https://review.opendev.org/c/openstack/openstack-ansible/+/828553 | 11:42 |
jrosser | or are we happy that the same patch on the stable branches is OK | 11:42 |
jrosser | need to start merging that as it's yet another thing holding up centos-8 removal | 11:42 |
jrosser | andrewbonney: is this what you saw with libvirt sockets? https://zuul.opendev.org/t/openstack/build/860df46e474d4333a573f6069e7d90b4/log/job-output.txt#20874 | 11:47 |
andrewbonney | Yes, looks like it. My manual solution was 'systemctl stop libvirtd && systemctl start libvirtd-tls'. The play could then be run successfully | 11:48 |
andrewbonney | If you stop libvirtd manually it gives a warning that the sockets may cause it to start again | 11:49 |
jrosser | noonedeadpunk: ^ feels like there's a proper bug here | 11:51 |
jrosser | there are some specific ordering requirements around systemd sockets and the service activated by the socket | 11:51 |
noonedeadpunk | regarding https://review.opendev.org/c/openstack/openstack-ansible/+/828553 I think it shouldn't hurt for sure. Especially on master where we fix tempest plugins by SHA | 11:52 |
jrosser | i had the same in the galera xinetd changes, and the only thing i could come up with was this https://github.com/openstack/openstack-ansible-galera_server/blob/master/tasks/galera_server_post_install.yml#L71 | 11:52 |
noonedeadpunk | regarding sockets - we haven't caught this for some reason. | 11:55 |
noonedeadpunk | So basically, when we start libvirtd-tcp.socket it starts libvirt and then libvirtd-tls.socket fails? | 11:55 |
noonedeadpunk | or was libvirt started by the package installation? | 11:55 |
noonedeadpunk | because we explicitly stop it with previous task? | 11:56 |
andrewbonney | I think libvirtd.socket is already started. If you just stop libvirtd on a system and wait for long enough it'll start itself again, which I assume is triggered by one of the sockets given the message it gives when you stop it | 11:56 |
jrosser | with galera, iirc if the service started by the socket was not loaded in systemd when you try to make the .socket service, it all goes bad | 11:58 |
noonedeadpunk | yes, indeed libvirtd is re-activated within 45 seconds tops | 12:04 |
noonedeadpunk | so it's race condition I believe | 12:04 |
andrewbonney | Yeah, and the bigger the deployment the more likely it is you'll see it, which matches our experience | 12:05 |
noonedeadpunk | and that's indeed libvirtd.socket that brings it back | 12:06 |
noonedeadpunk | well, should be a trivial fix i believe then | 12:09 |
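A minimal sketch of the manual workaround described above (not necessarily the exact content of the os_nova fix in review 828704): stop the libvirt socket units first so systemd socket activation cannot bring libvirtd back mid-play, then start the TLS listener. The unit names assume the stock libvirt systemd units.

```bash
# Assumed stock libvirt unit names; check `systemctl list-units 'libvirtd*'` on the compute.
systemctl stop libvirtd.socket libvirtd-ro.socket libvirtd-admin.socket  # prevent socket re-activation
systemctl stop libvirtd
systemctl start libvirtd-tls.socket
```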
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_nova master: Drop libvirtd_version identification https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/828702 | 12:13 |
jrosser | ok finally the keypairs stuff is all passing https://review.opendev.org/q/topic:osa%252Fkeypairs | 12:16 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible-os_nova master: Fix race-condition when libvirt starts unwillingly https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/828704 | 12:23 |
noonedeadpunk | I guess this should fix the thing ^ ? | 12:24 |
*** dviroel is now known as dviroel|ruck | 13:13 | |
spatel | noonedeadpunk jrosser did you see this error before - https://paste.opendev.org/show/bdmUhBJkdDRfrhKg7Lyt/ | 14:22 |
spatel | i am not able to create VMs because my conductor is not happy | 14:22 |
noonedeadpunk | Rings some bell from R | 14:22 |
noonedeadpunk | where after rabbit member restart queues were lost without HA queues | 14:23 |
noonedeadpunk | So had to either restart everything or clean up the remaining queues and basically rebuild the rabbit cluster. | 14:24 |
noonedeadpunk | But maybe that is something different | 14:24 |
spatel | I had a disaster last night and i re-built rabbitMQ | 14:24 |
spatel | now i'm not able to build any VM :( | 14:24 |
noonedeadpunk | and conductor restart doesn't help either? | 14:24 |
spatel | trying that now | 14:25 |
spatel | Yesterday this is what happened: i rebuilt rabbitMQ and then all my nova-compute nodes started throwing this error - https://paste.opendev.org/show/bFtAhaCLfed1R3ptNr1F/ | 14:25 |
spatel | i re-built rabbitMQ multiple times hoping it would fix things, but no luck :( | 14:25 |
noonedeadpunk | yeah it happens if conductor is dead | 14:26 |
spatel | so i shutdown all nova-api container and then rebuild rabbitMQ | 14:26 |
spatel | then nova-compute stopped throwing errors but now i'm not able to rebuild VMs (this is wallaby so maybe a bug or something) | 14:27 |
spatel | let me restart conductor and see if it help | 14:27 |
spatel | currently my rabbitMQ is running without HA queues, is that ok? | 14:27 |
noonedeadpunk | you tell me :D | 14:28 |
spatel | let me restart conductor and see if that helps | 14:28 |
noonedeadpunk | the thing that bugged me without HA queues is that when a gone rabbit member re-joined the cluster, it was not aware of any queues that existed, but clients were expecting to be able to get them from it | 14:30 |
noonedeadpunk | which kind of sounds like case here | 14:30 |
spatel | How do i bring back HA? | 14:31 |
spatel | how do i run common-rabbitmq tag so it will deploy queue for all my service? | 14:31 |
spatel | i forgot how to run that | 14:31 |
spatel | do you know command? | 14:31 |
noonedeadpunk | so conductor restart didn't help? | 14:32 |
spatel | no still my VM stuck in BUILD | 14:33 |
spatel | not seeing any error in rabbitMQ logs | 14:34 |
spatel | no error in conductor log | 14:34 |
noonedeadpunk | it might take like $timeout before the conductor errors out | 14:34 |
spatel | hmm | 14:35 |
noonedeadpunk | but considering that overrides removed, it was smth like `openstack-ansible setup-openstack.yml --tags common-mq` | 14:35 |
spatel | but that command will run on all compute nodes correct? it will take a hell of a lot of time with 200 computes | 14:36 |
spatel | how about i can do - rabbitmqctl -q -n rabbit set_policy -p /keystone HA '^(?!(amq\\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode": "all"}' --priority 0 | 14:36 |
spatel | for each service | 14:37 |
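A hedged sketch of running that same policy against each service vhost instead of re-running the full playbook; the vhost list below is an assumption, take the real one from `rabbitmqctl list_vhosts`.

```bash
# Apply an HA policy per vhost; the vhost names are placeholders for this deployment.
for vhost in /nova /neutron /keystone /glance /cinder; do
  rabbitmqctl -q -n rabbit set_policy -p "$vhost" HA \
    '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' '{"ha-mode": "all"}' --priority 0
done
```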
noonedeadpunk | why not | 14:37 |
spatel | let me try | 14:37 |
noonedeadpunk | you can actually start just from nova | 14:37 |
noonedeadpunk | to see if that helps | 14:37 |
spatel | i did only for nova/neutron (because that one is more important) | 14:38 |
noonedeadpunk | still nova-* will likely need to be restarted | 14:38 |
noonedeadpunk | on nova_api at least | 14:38 |
spatel | let me restart all nova-* service | 14:40 |
opendevreview | Merged openstack/openstack-ansible-os_keystone master: Remove legacy policy.json cleanup handler https://review.opendev.org/c/openstack/openstack-ansible-os_keystone/+/827443 | 14:44 |
spatel | noonedeadpunk still no luck VM stuck in BUILD | 14:44 |
spatel | why nova-conductor logs not flowing.. | 14:45 |
spatel | seems like no activity | 14:45 |
spatel | there are 15,235 mesg in Ready queue | 14:46 |
spatel | keep growing.. | 14:46 |
noonedeadpunk | and is everything fine with networking on containers? | 14:47 |
spatel | neutron not throwing any error in logs but i can restart | 14:48 |
spatel | mesg stuck in this kind of queue scheduler_fanout_5fc638e953cb4dc6b47151fef969d54c | 14:49 |
spatel | https://paste.opendev.org/show/bsQzWFYAsSDyb5EOzbLD/ | 14:51 |
noonedeadpunk | and how does scheduler log look like? | 14:52 |
spatel | clean :( | 14:53 |
spatel | no error at all | 14:53 |
noonedeadpunk | hm | 14:56 |
noonedeadpunk | what if restart nova-compute on 1 node? | 14:56 |
noonedeadpunk | will it still complain on rabbit? | 14:57 |
spatel | i am not seeing any error on nova-compute logs | 14:58 |
spatel | also restarting nova-compute taking very long time.. | 14:58 |
noonedeadpunk | as maybe they try to use old fanout queues that are not listened to anymore, since those live for some time until they die | 14:58 |
spatel | same with neutron-linuxbridge-agent.service | 14:58 |
spatel | how do i use old fanout? | 14:58 |
noonedeadpunk | is everything really fine with container networking? | 14:58 |
noonedeadpunk | like there's an ip on eth0 and eth1? | 14:59 |
spatel | yes.. i can ping everything so far | 14:59 |
spatel | rabbitMQ cluster is up | 14:59 |
spatel | agent can ping rabbitMQ IP etc | 14:59 |
spatel | does rabbit restart help? | 15:00 |
spatel | fanout queues should get deleted after restart, correct? | 15:00 |
noonedeadpunk | not instantly after restart | 15:01 |
noonedeadpunk | but like in 20 minutes or smth... | 15:01 |
spatel | something stuck in queue which is keep growing | 15:01 |
opendevreview | Merged openstack/openstack-ansible-os_keystone master: Drop ProxyPass out of VHost https://review.opendev.org/c/openstack/openstack-ansible-os_keystone/+/828519 | 15:03 |
spatel | is systemctl restart rabbitmq-server.service enough, or should i stop everything and then start again? | 15:05 |
opendevreview | Merged openstack/openstack-ansible stable/victoria: Remove enablement of neutron tempest plugin in scenario templates https://review.opendev.org/c/openstack/openstack-ansible/+/828386 | 15:10 |
spatel | noonedeadpunk getting this error in conductor now - Connection pool limit exceeded: current size 30 surpasses max configured rpc_conn_pool_size 30 | 15:12 |
noonedeadpunk | so sounds like it was processing smth | 15:17 |
spatel | noonedeadpunk how much time does systemctl restart nova-compute take in your setup? | 15:17 |
spatel | in my case it's taking 5 min and more | 15:17 |
noonedeadpunk | depends. might be 30 sec | 15:17 |
noonedeadpunk | oh, well | 15:17 |
noonedeadpunk | usually it means it wasn't spawned properly | 15:18 |
noonedeadpunk | or running for tooooo long | 15:18 |
noonedeadpunk | so things slow on shutting down there | 15:18 |
spatel | let me try to shutdown and then start | 15:18 |
noonedeadpunk | so if some connection is stuck - it will likely wait for a timeout before stopping the service | 15:19 |
spatel | stop taking longer time | 15:19 |
spatel | start is quick | 15:19 |
spatel | i am thinking to rebuild rabbitMQ again.. :( i have no option left | 15:20 |
noonedeadpunk | I kind of agree here | 15:20 |
spatel | i am doing apt-get purge rabbitmq-server on container | 15:20 |
spatel | is that good enough? | 15:20 |
noonedeadpunk | why not just re-run the playbook? :) | 15:21 |
spatel | that won't do anything, right? | 15:21 |
spatel | like wiping mnesia etc | 15:21 |
noonedeadpunk | with -e rabbitmq_upgrade=true? | 15:21 |
noonedeadpunk | it kind of will | 15:22 |
spatel | do you think it will wipe down | 15:22 |
spatel | ok | 15:22 |
noonedeadpunk | but it won't wipe users/vhosts/permissions | 15:22 |
noonedeadpunk | but clean up all queues | 15:22 |
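For reference, a sketch of the re-run being suggested here, using the standard OSA playbook from the deploy host (the playbook directory is an assumption and may differ per deployment):

```bash
# Re-deploys RabbitMQ in place; keeps users/vhosts/permissions but clears queues.
cd /opt/openstack-ansible/playbooks
openstack-ansible rabbitmq-install.yml -e rabbitmq_upgrade=true
```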
spatel | ok | 15:22 |
spatel | should i shut down the neutron-server and nova containers first? | 15:22 |
noonedeadpunk | nah | 15:22 |
spatel | ok i thought they will keep trying and make a mess in the queues | 15:23 |
noonedeadpunk | it shouldn't be a mess if rabbit is working | 15:23 |
spatel | ok running -e rabbitmq_upgrade=true | 15:23 |
spatel | how about compute agent and network agent? | 15:24 |
spatel | do they reconnect by themselves? | 15:24 |
noonedeadpunk | I'd leave everything as is | 15:24 |
noonedeadpunk | and yes, they should | 15:24 |
spatel | ok | 15:24 |
noonedeadpunk | not guaranteed though | 15:24 |
spatel | last night when i rebuilt RabbitMQ they didn't, so i had to restart all the agents | 15:24 |
spatel | but that took a hell of a long time | 15:25 |
spatel | I had zero issues rebuilding rabbit in queens but in wallaby this is a mess... feels like a bug | 15:26 |
spatel | done | 15:27 |
spatel | all rabbit nodes are up | 15:27 |
spatel | should i reboot anything or leave it | 15:28 |
spatel | some compute nodes are showing this error so i assume they're not going to re-try - https://paste.opendev.org/show/b3XS6aQwO4q4AuDIRN8X/ | 15:31 |
opendevreview | Merged openstack/openstack-ansible-repo_server master: Ensure insist=true is always set for lsyncd https://review.opendev.org/c/openstack/openstack-ansible-repo_server/+/828678 | 15:31 |
opendevreview | Merged openstack/openstack-ansible stable/wallaby: Remove enablement of neutron tempest plugin in scenario templates https://review.opendev.org/c/openstack/openstack-ansible/+/828551 | 15:38 |
opendevreview | Merged openstack/openstack-ansible master: Simplify mount options for new kernels. https://review.opendev.org/c/openstack/openstack-ansible/+/827464 | 15:38 |
spatel | noonedeadpunk i am getting this error in conductor - https://paste.opendev.org/show/bRtDZIKivNjbplTFbVuY/ | 15:41 |
spatel | assuming its ok | 15:42 |
spatel | still not able to build VM :( | 15:43 |
opendevreview | Merged openstack/openstack-ansible master: Remove workaround for OpenSUSE when setting AIO hostname https://review.opendev.org/c/openstack/openstack-ansible/+/827465 | 15:48 |
opendevreview | Merged openstack/openstack-ansible master: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828463 | 15:48 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/xena: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828663 | 15:49 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/wallaby: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828664 | 15:50 |
opendevreview | Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/victoria: Rename RBD cinder backend https://review.opendev.org/c/openstack/openstack-ansible/+/828665 | 15:50 |
spatel | noonedeadpunk do you think this is mysql issue - https://paste.opendev.org/show/bxSMimrEKVIXI5LePhAF/ | 15:58 |
noonedeadpunk | nah, it's fine | 16:01 |
spatel | hmm i thought may be conductor not able to make connection to mysql | 16:01 |
spatel | do you think conductor doesn't have enough thread etc.. | 16:02 |
spatel | does this look ok to you? - https://paste.opendev.org/show/bxSMimrEKVIXI5LePhAF/ | 16:02 |
spatel | rabbitMQ looks healthy, no messages in the queues, all queues are at zero | 16:03 |
noonedeadpunk | if the conductor dropped the connection, such a message would arise | 16:04 |
noonedeadpunk | and compute service list shows everybody is healthy and online? | 16:04 |
spatel | majority are up | 16:05 |
spatel | even if a couple are down i should be able to spin up a vm | 16:06 |
spatel | my VM creation process is stuck in BUILD which means messages are not passing between components | 16:06 |
noonedeadpunk | and what in scheduler/conductor logs when you try to spawn VM? | 16:07 |
noonedeadpunk | do they even process request? | 16:07 |
spatel | no activity | 16:07 |
spatel | let me go deeper and see | 16:08 |
spatel | i found one log - 'nova.exception.RescheduledException: Build of instance 95c40e2a-2eed-4c75-a131-da55eaac7060 was re-scheduled: Binding failed for port 05272a41-a60c-43ea-9a7a-30a67ed92ddb, please check neutron logs for more information.\n'] | 16:08 |
spatel | let me check neutron logs | 16:08 |
*** priteau_ is now known as priteau | 16:11 | |
spatel | noonedeadpunk now my vm failed with this error - {"code": 500, "created": "2022-02-10T16:16:36Z", "message": "Build of instance 7fbac40d-2829-47f4-9291-31c07c43d04c aborted: Failed to allocate the network(s), not rescheduling."} | 16:18 |
noonedeadpunk | well, it's smth then:) | 16:18 |
spatel | but neutron-server logs not saying anything - https://paste.opendev.org/show/blOElBzra5OVjid9uLHH/ | 16:19 |
spatel | no error in neutron-server logs related that UUID | 16:19 |
noonedeadpunk | but it's likely one of the neutron agents that's failing | 16:20 |
spatel | i have 200 nodes so very odd | 16:22 |
noonedeadpunk | nova scheduler doesn't really check neutron agent aliveness on the compute where it places the VM | 16:23 |
nurdie | Hey OSA! Does anyone know of an easy method of migrating openstack instances from one production deployment to another? Currently have a cluster deployed on CentOS7 and going to be building a new cluster on Ubuntu that will replace the old | 16:23 |
noonedeadpunk | nurdie: https://github.com/os-migrate/os-migrate | 16:23 |
noonedeadpunk | or you can just add ubuntu nodes to centos cluster and migrate things | 16:24 |
noonedeadpunk | (or re-setup them step-by-step) | 16:24 |
nurdie | We're going to rebuild from ground up. Controllers + Computes | 16:24 |
nurdie | I remember I chatted with someone that said it's possible to mix operating systems but it can become problematic, so I'd rather just start anew | 16:25 |
noonedeadpunk | well, how I was doing common re-setup - shut down 1 controller, re-setup to ubuntu, then re-setup computes one by one with offline vm migration between them | 16:25 |
spatel | let me create seperate group and try to create vm in that group | 16:25 |
noonedeadpunk | and then re-setup the rest of the controllers | 16:26 |
noonedeadpunk | basically, algorithm would be same as for OS upgrade | 16:26 |
noonedeadpunk | but it's up to you ofc :) | 16:26 |
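A rough sketch of the per-compute step in that approach, assuming cold (offline) migration; the host names are placeholders, and `--host` for cold migration needs compute API microversion >= 2.56.

```bash
# Drain one CentOS compute onto an already-reinstalled Ubuntu compute.
openstack compute service set --disable compute-centos-01 nova-compute
for vm in $(openstack server list --host compute-centos-01 --all-projects -f value -c ID); do
  openstack server migrate --os-compute-api-version 2.56 --host compute-ubuntu-01 "$vm"
  # each instance then needs a resize-confirm once it reaches VERIFY_RESIZE
done
```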
nurdie | interesting o.0 | 16:27 |
noonedeadpunk | os-migrate can work with basic set of resources. Nothing like octavia or magnum is supported | 16:27 |
spatel | noonedeadpunk very odd, i can see vm is up but in paused state in virsh | 16:27 |
nurdie | I'll be honest, my hesitation comes from how wonky C7 has been. I don't blame OSA for that. It was beautiful until RHEL bent the whole world over a table | 16:27 |
noonedeadpunk | btw I guess spatel can tell you story of how to migrate from rhel7 to ubuntu :D | 16:28 |
spatel | hehe | 16:28 |
spatel | do you think the neutron agents are reporting the wrong status.. | 16:29 |
spatel | they are showing as up but they are not | 16:29 |
noonedeadpunk | are they showing as healthy? | 16:29 |
nurdie | It's a very basic deployment. We don't even have designate yet. I'm going to checkout os-migrate though. I'm also going to force uniformity of compute node resources instead of the frankenstein cluster we have of several different server models, CPU counts, RAM counts. Gonna use old servers to setup a CERT/staging/testing OSA environment | 16:29 |
spatel | yes they are showing healthy | 16:29 |
nurdie | noonedeadpunk: thanks! | 16:30 |
noonedeadpunk | huh | 16:30 |
noonedeadpunk | yeah, then unlikely they will lie | 16:30 |
noonedeadpunk | and what's in neutron-ovs-agent logs on compute? | 16:30 |
noonedeadpunk | nurdie: running cluster of same computes is possible only on launch :D | 16:32 |
spatel | i have linux-agent | 16:34 |
nurdie | You're not wrong hahahhaha. Alas, the new cluster should be big enough to support us until we can deploy computes that are exactly double or something more mathematically pretty | 16:34 |
noonedeadpunk | spatel: you run LXB? O_O | 16:34 |
spatel | yes LXB | 16:35 |
noonedeadpunk | didn't know that | 16:35 |
* noonedeadpunk loves lxb | 16:35 | |
jamesdenton | It Works™ | 16:39 |
spatel | noonedeadpunk the vm is spinning up but neutron is not attaching the nic | 16:40 |
spatel | getting this error in neutron agent - https://paste.opendev.org/show/bJ8O9MnPWXHzwKKlo8sV/ | 16:40 |
spatel | jamesdenton any idea | 16:41 |
spatel | thats all in my log | 16:41 |
noonedeadpunk | see no error | 16:41 |
spatel | that is why it's very odd | 16:41 |
jamesdenton | anything in nova compute? | 16:42 |
spatel | vm stuck in paused | 16:42 |
spatel | nova created VM and no error in logs but let me check again | 16:42 |
noonedeadpunk | so basically, compute asks for port through API, it's created, but neutron api reports back failure | 16:42 |
spatel | no error in nova | 16:42 |
noonedeadpunk | oh, btw, is the port shown in the neutron api? | 16:43 |
spatel | let me check | 16:43 |
spatel | yes - https://paste.opendev.org/show/bfxx05SXqv9GegQe6XoA/ | 16:44 |
spatel | i can see the ports but they are in BUILD state | 16:44 |
spatel | now the vm is in error state | 16:44 |
spatel | what could be wrong? | 16:46 |
spatel | i restarted neutron multiple times | 16:47 |
jamesdenton | is rabbitmq cleared out now? | 16:47 |
spatel | no error in rabbitmq | 16:47 |
spatel | what could be wrong to attach nic to VM | 16:47 |
noonedeadpunk | so basically neutron agent doesn't report back to api | 16:48 |
spatel | in agent logs look like it got correct IP etc.. so it should attach | 16:48 |
*** priteau is now known as priteau_ | 16:48 | |
spatel | noonedeadpunk ohh is that possible | 16:48 |
noonedeadpunk | which it should do through rabbit | 16:48 |
*** priteau_ is now known as priteau | 16:48 | |
spatel | damn it.. again rabbit.. | 16:49 |
spatel | how do i check that | 16:49 |
noonedeadpunk | and this specific agent shown as alive? | 16:49 |
spatel | yes its alive | 16:49 |
spatel | let me check again | 16:49 |
noonedeadpunk | and you now have ha-queues for neutron? | 16:49 |
noonedeadpunk | shouldn't matter though | 16:49 |
noonedeadpunk | hm | 16:49 |
noonedeadpunk | any stuck messages in neutron fanout? | 16:50 |
spatel | | 01647432-727b-43ea-a21a-bba2841bc9bc | Linux bridge agent | ostack-phx-comp-gen-1-33.v1v0x.net | None | :-) | UP | neutron-linuxbridge-agent | | 16:50 |
spatel | agent is UP | 16:50 |
spatel | yes i do have HA queue for neutron | 16:50 |
spatel | if the agent is showing as up, that means rabbitMQ is also clear, correct? | 16:51 |
noonedeadpunk | well, it depends... | 16:52 |
noonedeadpunk | I think the trick here is that any neutron-server can mark the agent as up | 16:52 |
noonedeadpunk | but only a specific one should receive the reply from the agent? | 16:53 |
noonedeadpunk | not sure here tbh | 16:53 |
noonedeadpunk | I don't recall how nova acts when it comes to interaction with neutron... | 16:53 |
spatel | hmm this is crazy | 16:54 |
noonedeadpunk | does it just ask the api if the port is ready, or is it waiting for a reply on the same connection... | 16:54 |
spatel | jamesdenton knows better | 16:54 |
spatel | restarted neutron-server | 16:57 |
spatel | still no luck | 16:57 |
spatel | any other agent i should be restarting | 16:59 |
noonedeadpunk | considering that you already restarted lxb-agent on compute? | 16:59 |
spatel | many times | 16:59 |
spatel | both nova-compute and LXB | 17:00 |
noonedeadpunk | nah, dunno | 17:00 |
noonedeadpunk | it should work) | 17:00 |
noonedeadpunk | maybe smth on net node ofc... | 17:00 |
noonedeadpunk | like dhcp agent... | 17:00 |
noonedeadpunk | or metadata... | 17:00 |
noonedeadpunk | damn | 17:00 |
noonedeadpunk | I can recall that happening | 17:00 |
spatel | really | 17:01 |
noonedeadpunk | it was like a dead dnsmasq or smth | 17:01 |
spatel | in that case i should lose the IP | 17:01 |
noonedeadpunk | or some issue with l3 where net comes from | 17:01 |
spatel | i have 500 vms running on this cloud | 17:01 |
spatel | I have all vlan networking no l3-agent or gateway | 17:01 |
spatel | https://bugs.launchpad.net/neutron/+bug/1719011 | 17:02 |
spatel | this is old bug but may be | 17:02 |
noonedeadpunk | so basically because of a weird dns-agent state just the port creation was failing | 17:02 |
noonedeadpunk | I believe it was dns-agent indeed | 17:02 |
spatel | dns-agent? | 17:03 |
noonedeadpunk | *dhcp | 17:03 |
spatel | you are asking to restart - neutron-dhcp-agent.service ? | 17:03 |
spatel | not seeing in error in logs | 17:04 |
spatel | let me restart | 17:04 |
spatel | any good way i can delete nova queue in openstack and re-create (i don't want to rebuild rabbit again) | 17:05 |
noonedeadpunk | I don't think you need? | 17:05 |
spatel | let me restart dhcp and then see | 17:06 |
noonedeadpunk | there's no problem with nova imo | 17:06 |
spatel | restarting neutron-metadata-agent.service | 17:08 |
noonedeadpunk | metadata is other thing? | 17:08 |
spatel | lets see | 17:08 |
spatel | why does this stuff stop VM creation and interface attach | 17:08 |
noonedeadpunk | because interface is not even built properly? | 17:09 |
noonedeadpunk | or well, it misses last bit I guess | 17:09 |
spatel | mmmm | 17:09 |
spatel | lets see | 17:09 |
spatel | restart taking very long time for systemctl restart neutron-metadata-agent.service | 17:10 |
noonedeadpunk | which is to make dhcp aware of the ip-mac assignment | 17:10 |
spatel | hmm maybe you are right here... | 17:10 |
spatel | something is holding back | 17:10 |
spatel | hope you are correct :) | 17:10 |
spatel | waiting for my reboot to finish | 17:10 |
spatel | i meant restart | 17:11 |
noonedeadpunk | I can recall this happening for us, with nothing in logs | 17:11 |
noonedeadpunk | so we as well were restarting everything we saw :D | 17:11 |
noonedeadpunk | (another point for OVN) | 17:11 |
spatel | +1 yes OVN is much simple | 17:12 |
noonedeadpunk | until it breaks :D | 17:12 |
spatel | restarting meta on last node | 17:13 |
spatel | its taking long time so must be stuck threads | 17:14 |
noonedeadpunk | I hope you restarted dhcp agent as well :) | 17:14 |
spatel | yes | 17:14 |
spatel | systemctl restart neutron-dhcp-agent.service and neutron-metadata-agent.service | 17:14 |
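For reference, the same restart can be pushed from the deployment host with ad-hoc Ansible instead of logging into each agent host; the inventory group names below are assumptions, check your OSA inventory.

```bash
# Restart the DHCP and metadata agents everywhere they run (group names assumed).
ansible neutron_dhcp_agent -m service -a 'name=neutron-dhcp-agent state=restarted'
ansible neutron_metadata_agent -m service -a 'name=neutron-metadata-agent state=restarted'
```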
spatel | damn this is taking a very long time.. still hanging | 17:16 |
*** dviroel|ruck is now known as dviroel|ruck|afk | 17:16 | |
spatel | no kidding..... it works | 17:17 |
spatel | Damn noonedeadpunk | 17:18 |
noonedeadpunk | yeah, I know. Sorry for not recalling earlier | 17:19 |
spatel | OMG!! you saved my life.. | 17:19 |
noonedeadpunk | the nastiest thing is it wasn't logging anything | 17:19 |
spatel | This should be a bug... trust me... | 17:19 |
spatel | for the last 24 hours i've been hitting the wall | 17:19 |
noonedeadpunk | You should prepare talk about that for summit :D | 17:20 |
noonedeadpunk | They conveniently prolonged CFP | 17:20 |
spatel | I am not there yet :( | 17:21 |
spatel | still need to learn lots | 17:21 |
noonedeadpunk | one of your blog posts could totally fit a presentation | 17:22 |
noonedeadpunk | as you did tons of research in networking | 17:22 |
noonedeadpunk | and have numbers | 17:22 |
spatel | I am going to document this incident | 17:25 |
spatel | how i build rabbitMQ how i did HA and how i did everything.. | 17:25 |
spatel | also i am thinking to include the IRC transcript of our discussion | 17:26 |
spatel | jamesdenton do i need neutron-linuxbridge-agent.service in SRIOV nodes? | 17:35 |
spatel | i believe SRIOV need only - neutron-sriov-nic-agent.service | 17:36 |
jamesdenton | i don't think you do | 17:42 |
jamesdenton | scrollback - glad to see you got it working | 17:43 |
spatel | jamesdenton This is very unknown issue.. i don't think its easy to figure out :( Big thanks to noonedeadpunk to stick around with me | 17:55 |
spatel | he didn't give up :) | 17:55 |
spatel | blog is on the way for postmortem | 17:57 |
fridtjof[m] | ugh wow, just got bitten by last year's let's encrypt CA thing. it sneakily broke a major upgrade to wallaby :D nothing could clone stuff from opendev until i upgraded ca-certificates... | 18:08 |
jrosser | i wonder if we should have a task to update ca-certificates specifically to 'latest' rather than just 'present' | 18:39 |
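A sketch of that idea as an ad-hoc run (a real change would be a task in the relevant role, which is not shown here):

```bash
# Force ca-certificates to the latest packaged version on all hosts.
ansible all -m package -a 'name=ca-certificates state=latest'
```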
opendevreview | Merged openstack/openstack-ansible-rabbitmq_server master: Allow different install methods for rabbit/erlang https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/826445 | 19:01 |
opendevreview | Merged openstack/openstack-ansible-rabbitmq_server master: Update used RabbitMQ and Erlang https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/826446 | 19:07 |
*** dviroel|ruck|afk is now known as dviroel|ruck | 19:19 | |
mgariepy | anyone here have an idea how bad it is when a memcached server is down ? | 19:53 |
mgariepy | like, which services are really badly impacted by this | 19:54 |
spatel | I think only horizon use it | 19:54 |
spatel | keystone | 19:54 |
spatel | for token cache i believe | 19:54 |
mgariepy | nova has it in the config https://github.com/openstack/openstack-ansible-os_nova/blob/master/templates/nova.conf.j2#L68 | 19:55 |
spatel | maybe just to see if the token is in the cache or not | 19:55 |
spatel | https://skuicloud.wordpress.com/2014/11/15/openstack-memcached-for-keystone/ | 19:57 |
spatel | i believe only keystone uses memcache | 19:58 |
mgariepy | i'm moving racks between 2 DC. | 19:59 |
mgariepy | i have 1 controller per rack.. | 20:00 |
mgariepy | but when i stop a memcache server i do have a lot of issues with nova | 20:00 |
spatel | really..? nova mostly uses it to check cached tokens i believe, to speed up the process and put less load on keystone | 20:02 |
mgariepy | or it's the nova/neutron interaction that is failing. | 20:02 |
mgariepy | it's somewhat of a pain really. | 20:03 |
spatel | i went through great pain so i can understand.. but trust me i don't think memcache can cause any issue to nova | 20:03 |
spatel | i never touch memcache in my deployment (you even reminded me today that we have memcache) | 20:03 |
mgariepy | when it's up it works. but if you shut one down then there are a lot of delays that start to appear | 20:04 |
spatel | hmm i never tested shutting down memcached | 20:09 |
spatel | are you currently down or just observing the behavior | 20:09 |
mgariepy | i do have 2 up out of 3. | 20:10 |
mgariepy | Networking client is experiencing an unauthorized exception. (HTTP 400) | 20:13 |
mgariepy | when doing openstack server list --all | 20:14 |
spatel | hmm related to keystone / memcache issue | 20:15 |
mgariepy | i did remove the memcache server that is down from keystone. restarted the processes and flushed the memcached on all nodes. | 20:25 |
spatel | any improvement? | 20:26 |
mgariepy | nope | 20:26 |
mgariepy | one in 6-7 calls ends with the error. | 20:27 |
mgariepy | hmm.. openstack network list takes 1 sec, except once in a while.. it takes 4 sec. | 20:29 |
spatel | that is normal, correct? | 20:33 |
mgariepy | 1sec is ok but 4 doesn't seems ok ;) | 20:33 |
mgariepy | it should be consistent i guess | 20:33 |
spatel | in my case it take 2s | 20:34 |
mgariepy | i'll dig more tomorrow. | 20:55 |
spatel | Please share your findings :) | 20:56 |
mgariepy | i will probably poke you guys also tomorrow ! haha | 20:56 |
mgariepy | Moving racks between 2 DC live seems to be a good idea at first .. | 20:57 |
mgariepy | lol | 20:57 |
spatel | hehe you are tough | 20:58 |
mgariepy | well.. the DC are 1 block away. with dual 100Gb links between them .. :) | 20:58 |
mgariepy | it's almost like a local relocation ahha | 20:59 |
spatel | why do you want to split between two DC | 20:59 |
spatel | assuming redundancy | 20:59 |
mgariepy | nop | 20:59 |
mgariepy | we need to move out of dc1.. | 20:59 |
mgariepy | haha | 20:59 |
spatel | :) | 21:00 |
spatel | how many nodes do you have? | 21:00 |
jrosser | the services all have keystone stuff in their configs | 21:06 |
mgariepy | ~110 computes, | 21:06 |
jrosser | and by extension they will be affected by loss of memcached | 21:06 |
mgariepy | so i should just update the memcache config on all service ? | 21:07 |
jrosser | it’s something like, when a service needs to auth against keystone it either gets a short circuit via a cached thing in memcached, or keystone has to do expensive crypto thing | 21:08 |
jrosser | so all your services can be affected, and memcached also is not a cluster | 21:09 |
jrosser | there is a slightly opaque layer in oslo for having a pool of memcached servers | 21:10 |
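A rough way to see which services would be hit, assuming the usual config paths inside the service containers: the memcached server list shows up in the keystonemiddleware/oslo.cache settings of each service.

```bash
# Paths are assumptions; look for the downed memcached node in each service config.
grep -n 'memcache' /etc/keystone/keystone.conf /etc/nova/nova.conf /etc/neutron/neutron.conf
```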
mgariepy | hmm ok i'll update all the services tomorrow then. but i did test that part a bit when i did the ubuntu 16>18 upgrades on rocky | 21:10 |
mgariepy | but it might be a bit different on V. | 21:11 |
jrosser | I think we struggled during OS upgrade when the performance collapsed | 21:11 |
mgariepy | i've seen keystone taking like 30s instead of 2s when a memcache was down | 21:12 |
mgariepy | but never seen errors like i do see now. | 21:12 |
damiandabrowski[m] | sorry for interrupting you guys but I'm confused by our haproxy config :D is there any point in setting `balance source` anywhere when by default we use `balance leastconn` + `stick store-request src`? | 21:13 |
jrosser | i am not sure tbh | 21:21 |
damiandabrowski[m] | thanks, i'll try to have a deeper look at this. But as far as I can see, we define `balance source` only for: adjutant_api, ceph_rgw, cloudkitty_api, glance_api, horizon, nova_console, sahara_api, swift_proxy, zun_console | 21:29 |
*** dviroel|ruck is now known as dviroel|out | 21:49 | |
*** prometheanfire is now known as Guest2 | 23:49 |