opendevreview | Merged openstack/designate-tempest-plugin master: Re-enable the tempest tests and add Antelope https://review.opendev.org/c/openstack/designate-tempest-plugin/+/879168 | 01:25 |
---|---|---|
opendevreview | Erik Olof Gunnar Andersson proposed openstack/designate master: Move to a batch model for incrementing serial https://review.opendev.org/c/openstack/designate/+/871255 | 03:48 |
opendevreview | Merged openstack/designate master: Fix sharing a zone with the zone owner https://review.opendev.org/c/openstack/designate/+/879208 | 04:57 |
opendevreview | Michael Johnson proposed openstack/designate stable/2023.1: Fix sharing a zone with the zone owner https://review.opendev.org/c/openstack/designate/+/879474 | 15:18 |
ozzzo_work | I'm seeing orphaned DNS records created when I delete VMs in my kolla Train cluster. It seems to happen about 1/50 deletions. The deletion is happening here: https://github.com/openstack/designate/blob/60edc59ff765b406e4b936deb4d200a2d9b411ce/designate/notification_handler/base.py#L113 | 15:46 |
ozzzo_work | I added some extra logging to see what is happening: https://paste.openstack.org/show/bM7b0Cd4YDiJDlhqa1HE/ | 15:46 |
ozzzo_work | I loop through the recordset and see a single record. Then the if statement at line 117 (or 16 in the paste) fails, and we see the "not found" error. Then I loop through it again and the record is still there | 15:48 |
ozzzo_work | Why is the if failing when the record exists in the recordset? | 15:48 |
johnsom | ozzzo_work This check: if record_to_delete not in recordset.records | 15:59 |
johnsom | Is checking the whole object, not just the ID, so I expect there is something different in the object comparison. Maybe it should be looking for the IDs and not the whole object. | 16:00 |
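A minimal diagnostic sketch of how one might confirm that theory by logging a field-by-field diff between the candidate record and each record in the recordset. `to_dict()` is assumed to exist on the Designate record objects here; the snippet is illustrative and not part of the actual handler.

```python
# Diagnostic sketch only: log which fields differ between the candidate
# record and each record in the recordset, to see why the "not in" check
# fails. to_dict() on the record objects is assumed.
import logging

LOG = logging.getLogger(__name__)


def log_record_diff(record_to_delete, recordset):
    wanted = record_to_delete.to_dict()
    for record in recordset.records:
        existing = record.to_dict()
        differing = {
            key: (existing.get(key), wanted.get(key))
            for key in set(existing) | set(wanted)
            if existing.get(key) != wanted.get(key)
        }
        LOG.debug('Record %s vs candidate %s, differing fields: %s',
                  record.id, record_to_delete.id, differing)
```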
ozzzo_work | johnsom: what would that look like? Would it be: if record_to_delete['id'] not in recordset.records: | 16:06 |
ozzzo_work | or would it be better to leave out the if and just delete it in a try:? | 16:07 |
ozzzo_work | something like: try: recordset.records.remove(record_to_delete) except: (log error) ? | 16:09 |
johnsom | That code is already in a try block, so you probably don't need another one. I would assume the overhead of that remove is low enough that just trying it should be fine. | 16:17 |
johnsom | I would have to dig in that object code to see if it would be a major performance hit with a large number of records in the recordset. I'm not super familiar with that code yet. | 16:18 |
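For reference, a minimal sketch of the two alternatives discussed above: comparing by ID instead of whole-object equality, or skipping the membership check and simply attempting the removal. The names mirror the handler code, but this is illustrative only, not the actual Designate patch.

```python
import logging

LOG = logging.getLogger(__name__)

# Option 1: compare by ID instead of whole-object equality.
existing_ids = {record.id for record in recordset.records}
if record_to_delete.id not in existing_ids:
    LOG.warning('Record %s not found in recordset %s',
                record_to_delete.id, recordset.id)

# Option 2: skip the membership check and just attempt the removal,
# relying on the try block that already wraps this code path.
# (ValueError assumes list-like remove() semantics on recordset.records.)
try:
    recordset.records.remove(record_to_delete)
except ValueError:
    LOG.warning('Record %s was already gone from recordset %s',
                record_to_delete.id, recordset.id)
```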
ozzzo_work | ok I'll try it, ty! | 16:27 |
opendevreview | Erik Olof Gunnar Andersson proposed openstack/designate master: Secondary zone loops AXFR transfer during zone creation https://review.opendev.org/c/openstack/designate/+/864131 | 20:00 |
eandersson | What version is this ozzzo_work? | 20:05 |
eandersson | There are typically two issues with the notifications code | 20:07 |
eandersson | 1) Race conditions with some older versions of Designate can cause records to be missed. This is usually just when a VM is created and destroyed quickly, but could also happen if you have multiple IPs per record (e.g. many VMs with the same name) | 20:07 |
eandersson | 2) Missing notifications from Nova / Neutron. This usually happens when the compute node is having issues (e.g. hardware problems, or the service being restarted mid-process) | 20:07 |
johnsom | The paste output led me to think it was an object comparison issue. Like a status was off or something | 20:08 |
eandersson | Yea - that is possible | 20:08 |
eandersson | I have made a ton of improvements around these paths | 20:08 |
eandersson | Although it looks like ozzzo should have all of those already | 20:10 |
eandersson | Make sure you are using a coordinator, because not having one is a common issue | 20:11 |
eandersson | I honestly suspect that my batching PR will solve this too. It's one of the reasons I went the batching route: to solve issues around designate-sink | 20:12 |
eandersson | The API layer protects against a lot of these types of issues | 20:13 |
johnsom | Working on reviewing that patch now. | 20:13 |
eandersson | Awesome | 20:13 |
eandersson | Those last few patches were attempts to move the sink code closer to the API code, but it wasn't always enough | 20:13 |
eandersson | It's the same race conditions we see with PTR code btw | 20:14 |
eandersson | (that occasionally causes our PTR functional tests to fail) | 20:15 |
eandersson | Also, make sure you are using a coordinator that supports locking. | 20:15 |
eandersson | Which is all drivers, so you're covered as long as you have one of these configured :D https://docs.openstack.org/tooz/latest/user/compatibility.html#id4 | 20:16 |
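A hedged designate.conf example of pointing the coordination layer at one of the tooz drivers from that table; the Redis URL below is a placeholder and should be adjusted for your deployment.

```ini
# Hedged example: enable a tooz coordinator that supports locking.
# backend_url is a placeholder for your own Redis/etcd/ZooKeeper endpoint.
[coordination]
backend_url = redis://controller1:6379
```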
ozzzo | eandersson: We're running Train. I think we must have patched it, because the stable/train branch has different delete code that was last changed in 2020: https://github.com/openstack/designate/blob/train-eol/designate/notification_handler/base.py | 23:31 |
ozzzo | What we're running looks like this: https://github.com/openstack/designate/blob/60edc59ff765b406e4b936deb4d200a2d9b411ce/designate/notification_handler/base.py | 23:31 |
ozzzo | This version includes the 2021 "Fix race condition in the sink when deleting records" patch | 23:31 |
ozzzo | we are running redis; we set that up last year when we were having a similar issue that happened frequently. That helped a lot but we still see the occasional orphaned record. | 23:34 |
ozzzo_work | I tried this but it still fails to delete: https://paste.openstack.org/show/bBCWzZXkronbcrwQF3Ar/ | 23:39 |
ozzzo_work | This is happening shortly after the record is created. Could I be hitting some kind of race condition where the record is locked because it hasn't updated on all 3 controllers? | 23:41 |
ozzzo_work | My script creates the VM, pings it, checks forward and reverse DNS at the NS, then checks at all 3 controllers, and after all that is working (usually 10-20 seconds) then it deletes the VM | 23:42 |
ozzzo_work | 10 VMs every 10 minutes, per-cluster | 23:43 |
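A rough, hypothetical sketch of the verification loop described above, using dnspython; the actual script isn't shown in the channel, so the nameserver list, names, and timing are placeholders.

```python
# Hypothetical sketch of the forward/reverse check described above
# (dnspython assumed; nameserver IPs are placeholders).
import time

import dns.exception
import dns.resolver
import dns.reversename

NAMESERVERS = ['10.0.0.11', '10.0.0.12', '10.0.0.13']  # placeholder controllers


def wait_for_dns(fqdn, ip, timeout=60):
    """Wait until forward and reverse records resolve on every nameserver."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(_resolves(ns, fqdn, ip) for ns in NAMESERVERS):
            return True
        time.sleep(2)
    return False


def _resolves(nameserver, fqdn, ip):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    try:
        forward = {rr.address for rr in resolver.resolve(fqdn, 'A')}
        reverse = resolver.resolve(dns.reversename.from_address(ip), 'PTR')
        return ip in forward and any(fqdn in str(rr) for rr in reverse)
    except dns.exception.DNSException:
        return False
```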