opendevreview | Merged openstack/designate-tempest-plugin master: Re-enable the tempest tests and add Antelope https://review.opendev.org/c/openstack/designate-tempest-plugin/+/879168 | 01:25 |
---|---|---|
opendevreview | Erik Olof Gunnar Andersson proposed openstack/designate master: Move to a batch model for incrementing serial https://review.opendev.org/c/openstack/designate/+/871255 | 03:48 |
opendevreview | Merged openstack/designate master: Fix sharing a zone with the zone owner https://review.opendev.org/c/openstack/designate/+/879208 | 04:57 |
opendevreview | Michael Johnson proposed openstack/designate stable/2023.1: Fix sharing a zone with the zone owner https://review.opendev.org/c/openstack/designate/+/879474 | 15:18 |
ozzzo_work | I'm seeing orphaned DNS records created when I delete VMs in my kolla Train cluster. It seems to happen about 1/50 deletions. The deletion is happening here: https://github.com/openstack/designate/blob/60edc59ff765b406e4b936deb4d200a2d9b411ce/designate/notification_handler/base.py#L113 | 15:46 |
ozzzo_work | I added some extra logging to see what is happening: https://paste.openstack.org/show/bM7b0Cd4YDiJDlhqa1HE/ | 15:46 |
ozzzo_work | I loop through the recordset and see a single record. Then the if statement at line 117 (or 16 in the paste) fails, and we see the "not found" error. Then I loop through it again and the record is still there | 15:48 |
ozzzo_work | Why is the if failing when the record exists in the recordset? | 15:48 |
johnsom | ozzzo_work This check: if record_to_delete not in recordset.records | 15:59 |
johnsom | Is checking the whole object, not just the ID, so I expect there is something different in the object comparison. Maybe it should be looking for the IDs and not the whole object. | 16:00 |
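A minimal diagnostic sketch of how one might confirm that theory by logging a field-by-field diff between the candidate record and each record in the recordset. `to_dict()` is assumed to exist on the Designate record objects here; the snippet is illustrative and not part of the actual handler.

```python
# Diagnostic sketch only: log which fields differ between the candidate
# record and each record in the recordset, to see why the "not in" check
# fails. to_dict() on the record objects is assumed.
import logging

LOG = logging.getLogger(__name__)


def log_record_diff(record_to_delete, recordset):
    wanted = record_to_delete.to_dict()
    for record in recordset.records:
        existing = record.to_dict()
        differing = {
            key: (existing.get(key), wanted.get(key))
            for key in set(existing) | set(wanted)
            if existing.get(key) != wanted.get(key)
        }
        LOG.debug('Record %s vs candidate %s, differing fields: %s',
                  record.id, record_to_delete.id, differing)
```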
ozzzo_work | johnsom: what would that look like? Would it be: if record_to_delete['id'] not in recordset.records: | 16:06 |
ozzzo_work | or would it be better to leave out the if and just delete it in a try:? | 16:07 |
ozzzo_work | something like: try: recordset.records.remove(record_to_delete) except: (log error) ? | 16:09 |
johnsom | That code is already in a try block, so you probably don't need another one. I would assume the overhead of that remove is low enough that just trying it should be fine. | 16:17 |
johnsom | I would have to dig in that object code to see if it would be a major performance hit with a large number of records in the recordset. I'm not super familiar with that code yet. | 16:18 |
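For reference, a minimal sketch of the two alternatives discussed above: comparing by ID instead of whole-object equality, or skipping the membership check and simply attempting the removal. The names mirror the handler code, but this is illustrative only, not the actual Designate patch.

```python
import logging

LOG = logging.getLogger(__name__)

# Option 1: compare by ID instead of whole-object equality.
existing_ids = {record.id for record in recordset.records}
if record_to_delete.id not in existing_ids:
    LOG.warning('Record %s not found in recordset %s',
                record_to_delete.id, recordset.id)

# Option 2: skip the membership check and just attempt the removal,
# relying on the try block that already wraps this code path.
# (ValueError assumes list-like remove() semantics on recordset.records.)
try:
    recordset.records.remove(record_to_delete)
except ValueError:
    LOG.warning('Record %s was already gone from recordset %s',
                record_to_delete.id, recordset.id)
```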
ozzzo_work | ok I'll try it, ty! | 16:27 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/designate master: Secondary zone loops AXFR transfer during zone creation https://review.opendev.org/c/openstack/designate/+/864131 | 20:00 |
eandersson | What version is this ozzzo_work? | 20:05 |
eandersson | There are typically two issues with the notifications code | 20:07 |
eandersson | 1) Race conditions with some older versions of Designate can cause records to be missed. This is usually just when a VM is created and destroyed quickly, but could also happen if you have multiple IPs per record (e.g. many VMs with the same name) | 20:07 |
eandersson | 2) Missing notifications from Nova / Neutron. This usually happens when the compute node is having issues (e.g. hardware problems, or the service being restarted mid-process) | 20:07 |
johnsom | The paste output led me to think it was an object comparison issue. Like a status was off or something | 20:08 |
eandersson | Yea - that is possible | 20:08 |
eandersson | I have made a ton of improvements around these paths | 20:08 |
eandersson | Although it looks like ozzzo should have all of those already | 20:10 |
eandersson | Make sure you are using a coordinator, because not having one is a common issue | 20:11 |
eandersson | I honestly suspect that my batching PR will solve this too. It's one of the reasons I went the batching route: to solve issues around designate-sink | 20:12 |
eandersson | The API layer protects against a lot of these types of issues | 20:13 |
johnsom | Working on reviewing that patch now. | 20:13 |
eandersson | Awesome | 20:13 |
eandersson | Those last few patches were attempts to move the sink code closer to the API code, but it wasn't always enough | 20:13 |
eandersson | It's the same race conditions we see with PTR code btw | 20:14 |
eandersson | (that occasionally causes our PTR functional tests to fail) | 20:15 |
eandersson | Also, make sure you are using a coordinator that supports locking. | 20:15 |
eandersson | Which is all drivers, so you're covered as long as you have one of these configured :D https://docs.openstack.org/tooz/latest/user/compatibility.html#id4 | 20:16 |
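A hedged designate.conf example of pointing the coordination layer at one of the tooz drivers from that table; the Redis URL below is a placeholder and should be adjusted for your deployment.

```ini
# Hedged example: enable a tooz coordinator that supports locking.
# backend_url is a placeholder for your own Redis/etcd/ZooKeeper endpoint.
[coordination]
backend_url = redis://controller1:6379
```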
ozzzo | eandersson: We're running Train. I think we must have patched it, because the stable/train branch has different delete code that was last changed in 2020: https://github.com/openstack/designate/blob/train-eol/designate/notification_handler/base.py | 23:31 |
ozzzo | What we're running looks like this: https://github.com/openstack/designate/blob/60edc59ff765b406e4b936deb4d200a2d9b411ce/designate/notification_handler/base.py | 23:31 |
ozzzo | This version includes the 2021 "Fix race condition in the sink when deleting records" patch | 23:31 |
ozzzo | we are running redis; we set that up last year when we were having a similar issue that happened frequently. That helped a lot but we still see the occasional orphaned record. | 23:34 |
ozzzo_work | I tried this but it still fails to delete: https://paste.openstack.org/show/bBCWzZXkronbcrwQF3Ar/ | 23:39 |
ozzzo_work | This is happening shortly after the record is created. Could I be hitting some kind of race condition where the record is locked because it hasn't updated on all 3 controllers? | 23:41 |
ozzzo_work | My script creates the VM, pings it, checks forward and reverse DNS at the NS, then checks at all 3 controllers, and after all that is working (usually 10-20 seconds) then it deletes the VM | 23:42 |
ozzzo_work | 10 VMs every 10 minutes, per-cluster | 23:43 |
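A rough, hypothetical sketch of the verification loop described above, using dnspython; the actual script isn't shown in the channel, so the nameserver list, names, and timing are placeholders.

```python
# Hypothetical sketch of the forward/reverse check described above
# (dnspython assumed; nameserver IPs are placeholders).
import time

import dns.exception
import dns.resolver
import dns.reversename

NAMESERVERS = ['10.0.0.11', '10.0.0.12', '10.0.0.13']  # placeholder controllers


def wait_for_dns(fqdn, ip, timeout=60):
    """Wait until forward and reverse records resolve on every nameserver."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(_resolves(ns, fqdn, ip) for ns in NAMESERVERS):
            return True
        time.sleep(2)
    return False


def _resolves(nameserver, fqdn, ip):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    try:
        forward = {rr.address for rr in resolver.resolve(fqdn, 'A')}
        reverse = resolver.resolve(dns.reversename.from_address(ip), 'PTR')
        return ip in forward and any(fqdn in str(rr) for rr in reverse)
    except dns.exception.DNSException:
        return False
```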