ozzzo | eandersson: Looking at this doc: https://docs.openstack.org/designate/latest/admin/ha.html | 00:05 |
ozzzo | It appears that only designate-producer uses locking. We installed redis after rebuilding some clusters on train, but the locking doesn't work, and we get "Duplicate record" errors if we delete a VM and re-create it immediately with the same IP | 00:06 |
ozzzo | so I want to figure out how the locking works, and then figure out why it isn't working for us | 00:06 |
ozzzo | it looks like what is happening is, one controller starts deleting the old record, and then before it finishes deleting, another controller tries to create the new one | 00:07 |
johnsom | ozzzo did you confirm the previous steps we provided? Like checking the train version you are running? | 00:07 |
ozzzo | yes, we have the latest train code | 00:08 |
johnsom | And the sink question? | 00:08 |
ozzzo | I lost access to what we discussed. Is openstack-dns archived anywhere? | 00:09 |
johnsom | Yeah, one minute | 00:09 |
ozzzo | my work doesn't allow irc so I have to use the garbage web client | 00:09 |
johnsom | Ah, bummer | 00:10 |
johnsom | ozzzo https://meetings.opendev.org/irclogs/%23openstack-dns/%23openstack-dns.2021-10-13.log.html#t2021-10-13T18:04:56 | 00:10 |
ozzzo | yes I looked at our code and we have that patch | 00:11 |
ozzzo | someone told me we're running the latest; how can I verify that we're on 9.0.2? | 00:11 |
johnsom | Look in your logs for this line: INFO designate.service [-] Starting central service (version: 12.1.0) | 00:12 |
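On a kolla-style deployment the check above might look something like this; the log path is an assumption and will differ per installer:

```console
# Log path is a guess for a kolla-based deployment; adjust as needed.
grep "Starting central service" /var/log/kolla/designate/designate-central.log
# Expect a line of the form:
#   INFO designate.service [-] Starting central service (version: 9.0.2)
```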
ozzzo | yes we're on 9.0.2 | 00:14 |
johnsom | If you grep for "coordination" in the central logs what do you see? | 00:15 |
ozzzo | I see old messages that say "no coordination backend configured" | 00:16 |
ozzzo | nothing recent | 00:16 |
johnsom | Are you running with debug logging on? | 00:17 |
ozzzo | no. how can I enable that for just Designate? | 00:19 |
johnsom | In your designate configuration file, in the [DEFAULT] section, set "debug = True" | 00:19 |
ozzzo | in /etc/kolla/designate-central/designate.conf? and then restart the designate_central container? | 00:20 |
johnsom | I don't know anything about kolla and how it deploys designate, but that seems like a logical place, yeah. | 00:21 |
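For illustration, the settings being discussed would sit in designate.conf roughly like this; the [coordination] section follows the HA guide linked at the top of the discussion, and the redis host/port are placeholders:

```ini
[DEFAULT]
# Verbose logging while troubleshooting.
debug = True

[coordination]
# Distributed lock backend shared by the designate services
# (placeholder host/port; see the HA guide for the exact URL syntax).
backend_url = redis://192.0.2.10:6379
```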
johnsom | eandersson is the expert on the locking, but I can give some pointers on what was implemented. | 00:24 |
johnsom | This decorator: https://github.com/openstack/designate/blob/f101fab29540ba11481abbf9c7558f976d14b26b/designate/central/service.py#L1260 | 00:24 |
johnsom | Does the locking using the configured coordination driver. | 00:24 |
johnsom | As you can see, each call to create a recordset or record has the wrapper. It will use the lock manager to lock the zone for each create, etc. That limits concurrent access to changes on the zone, thus ensuring the serial gets updated correctly. | 00:26 |
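To make that concrete, here is a minimal illustrative sketch of a zone-lock decorator built on tooz (the coordination library behind this). It is not Designate's actual implementation; names such as lock_zone and the "zone-<id>" lock name are assumptions.

```python
import functools

import tooz.coordination


def lock_zone(coordinator):
    """Illustrative only: serialize operations on a single zone.

    Assumes the wrapped method takes the zone id right after the
    request context, which is a simplification.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, context, zone_id, *args, **kwargs):
            # One named lock per zone; every worker sharing the same
            # coordination backend (e.g. redis) contends on this name.
            lock = coordinator.get_lock(b'zone-' + str(zone_id).encode())
            with lock:
                return func(self, context, zone_id, *args, **kwargs)
        return wrapper
    return decorator


# Typical tooz setup; backend URL and member id are placeholders.
coordinator = tooz.coordination.get_coordinator(
    'redis://192.0.2.10:6379', b'central-worker-1')
coordinator.start(start_heart=True)
```

With a shared backend, two central workers handling a delete and a create for the same zone take the lock in turn instead of racing; without a backend configured, each worker typically ends up locking only within its own process, which matches the "no coordination backend configured" message mentioned above.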
ozzzo | ok reading, ty | 00:29 |
ozzzo | it looks like this code prevents updates to a zone that is being deleted. our case is slightly different; the forward and reverse records are in the process of being deleted, and we are creating new forward and reverse records for the same name and IP. Is there a mechanism to make the creation wait until the deletion finishes? | 00:50 |
eandersson | I don't use reverse records, so could be a new / different bug | 05:11 |
eandersson | The code should prevent any modification to a zone while another record is being updated / deleted / created | 07:39 |
eandersson | If something is able to happen that shouldn't, we might be missing a lock | 07:40 |
eandersson | I can't remember the path that neutron makes, maybe frickler remembers | 07:42 |
eandersson | but maybe we need a lock here? | 07:42 |
eandersson | https://github.com/openstack/designate/blob/f101fab29540ba11481abbf9c7558f976d14b26b/designate/central/service.py#L2174 | 07:42 |
eandersson | ozzzo would you be able to provide us with more logs? also, ideally your config (without passwords) | 07:47 |
eandersson | At least the full exception (ideally traced from all logs) | 07:47 |
frickler | neutron does forward records first and then reverse, all via designate API calls, so nothing special there I'd hope | 07:51 |
frickler | but yes, we need some instructions to reproduce, otherwise I wouldn't know how to tackle it | 07:52 |
opendevreview | Vadym Markov proposed openstack/designate-dashboard master: Rename "Floating IP Description" field https://review.opendev.org/c/openstack/designate-dashboard/+/814121 | 09:05 |
ozzzo | eandersson: I can't; in fact I've been advised that I can't contact the community at all from my Verisign workstation until I get upper management approval | 11:52 |
ozzzo | the "security" department is totally out of control; they have us locked down to the point where we can barely work | 11:52 |
*** njohnston_ is now known as njohnston | 14:01 | |
ozzzo | after enabling debug logging in designate-central, I see 3 "coordination" messages when I restart the container. Besides that, "coordination" doesn't appear in the log. When I duplicate the issue I see stack traces in designate-sink.log but nothing about coordination there either | 15:03 |
ozzzo | I'm trying to delete a record that was created by Designate and I get error "Managed records may not be deleted" | 17:08 |
ozzzo | how can I work around that? | 17:08 |
johnsom | Usually it's not a good idea to do that, but if you really need to, https://docs.openstack.org/python-designateclient/latest/cli/index.html#recordset-delete the --edit-managed option should override that. | 17:53 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline starting in 5 minutes, at 18:00 UTC, for scheduled project rename maintenance, which should last no more than an hour (but will likely be much shorter): http://lists.opendev.org/pipermail/service-announce/2021-October/000024.html | 17:58 | |
eandersson | ozzzo: --edit-managed | 17:59 |
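As a usage sketch (the zone and recordset names below are placeholders):

```console
# --edit-managed overrides the "Managed records may not be deleted"
# protection, so use it sparingly and only on records you know are stale.
openstack recordset delete example.com. www.example.com. --edit-managed
```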
ozzzo | righton, ty! | 17:59 |
ozzzo | I broke designate and then deleted some VMs | 17:59 |
eandersson | I'll try to reproduce it this weekend. | 17:59 |
eandersson | Are you using designate-sink (aka notifications)? | 18:00 |
ozzzo | yes | 18:01 |
eandersson | https://github.com/openstack/kolla-ansible/blob/stable/train/ansible/roles/designate/templates/designate.conf.j2#L64 | 18:01 |
eandersson | Looks like it is the default in kolla-ansible | 18:01 |
eandersson | And is this happening when you are just creating and destroying VMs? | 18:01 |
ozzzo | designate-sink.log is where I see the stacktraces when it fails | 18:01 |
eandersson | I see | 18:01 |
eandersson | We have this error too, but it isn't very frequent | 18:02 |
ozzzo | yes. To duplicate it I had to create a VM with a static IP in Terraform, taint it, and then run terraform apply | 18:02 |
ozzzo | so it's deleting the record and then immediately creating a new record with the same name and IP | 18:02 |
eandersson | I wrote my own notification handler | 18:02 |
eandersson | but we only do nova, not neutron | 18:02 |
eandersson | so we don't hit it very often | 18:03 |
eandersson | like one per 10k VMs or something like that | 18:03 |
ozzzo | I don't think we've seen it in nova | 18:03 |
eandersson | Is it always the neutron PTR record? | 18:03 |
eandersson | btw sorry when I say nova | 18:04 |
eandersson | I mean nova here https://github.com/openstack/kolla-ansible/blob/stable/train/ansible/roles/designate/templates/designate.conf.j2#L64 | 18:04 |
eandersson | handler:nova_fixed vs. handler:neutron_floatingip | 18:04 |
eandersson | Does the exception always come from nova.py or neutron.py? | 18:05 |
eandersson | https://github.com/openstack/designate/blob/stable/train/designate/notification_handler/neutron.py | 18:05 |
eandersson | or is it mixed? | 18:06 |
eandersson | The way kolla-ansible is set up is very aggressive. We run into this issue at ~1 per 10k VMs, but we only create a single A record. kolla-ansible is set up to create 6 records per VM | 18:07 |
ozzzo | I don't see nova or neutron in the logs. I see pymysql and designate | 18:08 |
eandersson | I see | 18:08 |
ozzzo | it tries to create an entry in the mysql database and fails with "Duplicate entry" | 18:08 |
eandersson | Yea - it's weird, because we hit that too, but it goes away once I add memcached as a coordinator | 18:09 |
ozzzo | looks like it happens on both forward and reverse | 18:09 |
ozzzo | hopefully I will get permission to share logs sometime soon | 18:09 |
eandersson | johnsom a big todo for testing is to mock nova/neutron notifications and add that as part of the functional testing | 18:10 |
eandersson | When I designed the coordination stuff I would just craft custom json payloads and send a million of them to rabbitmq | 18:10 |
johnsom | eandersson Interesting. Yeah, we aren't targeting the sink or notifications in our initial plan, but it will come. | 18:11 |
eandersson | Because of the way the sink talks to central, it is easier to test central through it | 18:12 |
eandersson | for race conditions | 18:12 |
eandersson | (since you don't need to make rest calls, just send 100 messages to rabbitmq and let it consume them instantly) | 18:13 |
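A rough sketch of that approach with oslo.messaging follows; the transport URL, event type, and payload fields are illustrative guesses rather than the exact schema designate-sink expects:

```python
from oslo_config import cfg
import oslo_messaging

cfg.CONF([])  # initialize an empty config so oslo.messaging can read it

# Placeholder RabbitMQ URL; point it at the broker the sink listens on.
transport = oslo_messaging.get_notification_transport(
    cfg.CONF, url='rabbit://guest:guest@192.0.2.20:5672/')
notifier = oslo_messaging.Notifier(
    transport, driver='messagingv2',
    publisher_id='compute.fake-host', topics=['notifications'])

# Flood the notifications topic so central has to process many
# near-simultaneous events for the same name/IP and any race shows up.
for i in range(100):
    notifier.info(
        {},  # empty request context
        'compute.instance.create.end',
        {'instance_id': 'fake-%d' % i,
         'hostname': 'test-vm',
         'fixed_ips': [{'address': '198.51.100.10'}]})
```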
johnsom | Might be worth writing up in a bug | 18:14 |
eandersson | Hopefully I will have some time later today or this weekend | 18:15 |
eandersson | I also want to try to reproduce this in general. | 18:15 |
eandersson | The only thing that is different from when I tested is that kolla-ansible creates 6 records per VM. | 18:15 |
eandersson | With a coordinator I was able to create 1k fake VMs without an issue | 18:16 |
eandersson | but I was only creating one record per VM | 18:16 |
eandersson | I also used memcached and not redis, but I have a hard time believing that redis wouldn't work :D | 18:19 |
ozzzo | I can duplicate it consistently. It fails every time when I use Terraform to rebuild a VM with a static IP and existing Designate DNS records | 18:23 |
ozzzo | on the 2nd rebuild it works because the DNS records don't exist, then fails on the 3rd, etc | 18:23 |
eandersson | Yea - I mean it really just seems like coordination isn't working for you at all, but not sure how to prove that. | 19:07 |
johnsom | eandersson When you have a minute, can you look at https://review.opendev.org/c/openstack/designate-tempest-plugin/+/800280 It has a +1 and +2 and would continue our path towards updating our tempest plugin. | 19:25 |
eandersson | lgtm johnsom | 20:39 |
johnsom | Thank you | 20:40 |
opendevreview | Merged openstack/designate-tempest-plugin master: Update service client access in tempest tests https://review.opendev.org/c/openstack/designate-tempest-plugin/+/800280 | 21:27 |