ozzzo | eandersson: Looking at this doc: https://docs.openstack.org/designate/latest/admin/ha.html | 00:05 |
ozzzo | It appears that only designate-producer uses locking. We installed redis after rebuilding some clusters on train, but the locking doesn't work, and we get "Duplicate record" errors if we delete a VM and re-create it immediately with the same IP | 00:06 |
ozzzo | so I want to figure out how the locking works, and then figure out why it isn't working for us | 00:06 |
ozzzo | it looks like what is happening is, one controller starts deleting the old record, and then before it finishes deleting, another controller tries to create the new one | 00:07 |
johnsom | ozzzo did you confirm the previous steps we provided? Like checking the train version you are running? | 00:07 |
ozzzo | yes, we have the latest train code | 00:08 |
johnsom | And the sink question? | 00:08 |
ozzzo | I lost access to what we discussed. Is openstack-dns archived anywhere? | 00:09 |
johnsom | Yeah, one minute | 00:09 |
ozzzo | my work doesn't allow irc so I have to use the garbage web client | 00:09 |
johnsom | Ah, bummer | 00:10 |
johnsom | ozzzo https://meetings.opendev.org/irclogs/%23openstack-dns/%23openstack-dns.2021-10-13.log.html#t2021-10-13T18:04:56 | 00:10 |
ozzzo | yes I looked at our code and we have that patch | 00:11 |
ozzzo | someone told me we're running the latest; how can I verify that we're on 9.0.2? | 00:11 |
johnsom | Look in your logs for this line: INFO designate.service [-] Starting central service (version: 12.1.0) | 00:12 |
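On a kolla-style deployment the check above might look something like this; the log path is an assumption and will differ per installer:

```console
# Log path is a guess for a kolla-based deployment; adjust as needed.
grep "Starting central service" /var/log/kolla/designate/designate-central.log
# Expect a line of the form:
#   INFO designate.service [-] Starting central service (version: 9.0.2)
```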
ozzzo | yes we're on 9.0.2 | 00:14 |
johnsom | If you grep for "coordination" in the central logs what do you see? | 00:15 |
ozzzo | I see old messages that say "no coordination backend configured" | 00:16 |
ozzzo | nothing recent | 00:16 |
johnsom | Are you running with debug logging on? | 00:17 |
ozzzo | no. how can I enable that for just Designate? | 00:19 |
johnsom | In your designate configuration file, in the [DEFAULT] section, set "debug = True" | 00:19 |
ozzzo | in /etc/kolla/designate-central/designate.conf? and then restart the designate_central container? | 00:20 |
johnsom | I don't know anything about kolla and how it deploys designate, but that seems like a logical place, yeah. | 00:21 |
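For illustration, the settings being discussed would sit in designate.conf roughly like this; the [coordination] section follows the HA guide linked at the top of the discussion, and the redis host/port are placeholders:

```ini
[DEFAULT]
# Verbose logging while troubleshooting.
debug = True

[coordination]
# Distributed lock backend shared by the designate services
# (placeholder host/port; see the HA guide for the exact URL syntax).
backend_url = redis://192.0.2.10:6379
```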
johnsom | eandersson is the expert on the locking, but I can give some pointers on what was implemented. | 00:24 |
johnsom | This decorator: https://github.com/openstack/designate/blob/f101fab29540ba11481abbf9c7558f976d14b26b/designate/central/service.py#L1260 | 00:24 |
johnsom | Does the locking using the configured coordination driver. | 00:24 |
johnsom | As you can see, each call to create a recordset or record has the wrapper. It will use the lock manager to lock the zone for each create, etc. That limits concurrent access to changes on the zone, thus ensuring the serial gets updated correctly. | 00:26 |
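To make that concrete, here is a minimal illustrative sketch of a zone-lock decorator built on tooz (the coordination library behind this). It is not Designate's actual implementation; names such as lock_zone and the "zone-<id>" lock name are assumptions.

```python
import functools

import tooz.coordination


def lock_zone(coordinator):
    """Illustrative only: serialize operations on a single zone.

    Assumes the wrapped method takes the zone id right after the
    request context, which is a simplification.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, context, zone_id, *args, **kwargs):
            # One named lock per zone; every worker sharing the same
            # coordination backend (e.g. redis) contends on this name.
            lock = coordinator.get_lock(b'zone-' + str(zone_id).encode())
            with lock:
                return func(self, context, zone_id, *args, **kwargs)
        return wrapper
    return decorator


# Typical tooz setup; backend URL and member id are placeholders.
coordinator = tooz.coordination.get_coordinator(
    'redis://192.0.2.10:6379', b'central-worker-1')
coordinator.start(start_heart=True)
```

With a shared backend, two central workers handling a delete and a create for the same zone take the lock in turn instead of racing; without a backend configured, each worker typically ends up locking only within its own process, which matches the "no coordination backend configured" message mentioned above.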
ozzzo | ok reading, ty | 00:29 |
ozzzo | it looks like this code prevents updates to a zone that is being deleted. our case is slightly different; the forward and reverse records are in the process of being deleted, and we are creating new forward and reverse records for the same name and IP. Is there a mechanism to make the creation wait until the deletion finishes? | 00:50 |
eandersson | I don't use reverse records, so could be a new / different bug | 05:11 |
eandersson | The code should prevent any modification to a zone while another record is being updated / deleted / created | 07:39 |
eandersson | If something is able to happen that shouldn't, we might be missing a lock | 07:40 |
eandersson | I can't remember the path that neutron makes, maybe frickler remembers | 07:42 |
eandersson | but maybe we need a lock here? | 07:42 |
eandersson | https://github.com/openstack/designate/blob/f101fab29540ba11481abbf9c7558f976d14b26b/designate/central/service.py#L2174 | 07:42 |
eandersson | ozzzo would you be able to provide us with more logs? also, ideally your config (without passwords) | 07:47 |
eandersson | At least the full exception (ideally traced from all logs) | 07:47 |
frickler | neutron does forward records first and then reverse, all via designate API calls, so nothing special there I'd hope | 07:51 |
frickler | but yes, we need some instructions to reproduce, otherwise I wouldn't know how to tackle it | 07:52 |
opendevreview | Vadym Markov proposed openstack/designate-dashboard master: Rename "Floating IP Description" field https://review.opendev.org/c/openstack/designate-dashboard/+/814121 | 09:05 |
ozzzo | eandersson: I can't; in fact I've been advised that I can't contact the community at all from my Verisign workstation until I get upper management approval | 11:52 |
ozzzo | the "security" department is totally out of control; they have us locked down to the point where we can barely work | 11:52 |
*** njohnston_ is now known as njohnston | 14:01 | |
ozzzo | after enabling debug logging in designate-central, I see 3 "coordination" messages when I restart the container. Besides that, "coordination" doesn't appear in the log. When I duplicate the issue I see stack traces in designate-sink.log but nothing about coordination there either | 15:03 |
ozzzo | I'm trying to delete a record that was created by Designate and I get error "Managed records may not be deleted" | 17:08 |
ozzzo | how can I work around that? | 17:08 |
johnsom | Usually it's not a good idea to do that, but if you really need to, https://docs.openstack.org/python-designateclient/latest/cli/index.html#recordset-delete the --edit-managed option should override that. | 17:53 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline starting in 5 minutes, at 18:00 UTC, for scheduled project rename maintenance, which should last no more than an hour (but will likely be much shorter): http://lists.opendev.org/pipermail/service-announce/2021-October/000024.html | 17:58 | |
eandersson | ozzzo: --edit-managed | 17:59 |
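As a usage sketch (the zone and recordset names below are placeholders):

```console
# --edit-managed overrides the "Managed records may not be deleted"
# protection, so use it sparingly and only on records you know are stale.
openstack recordset delete example.com. www.example.com. --edit-managed
```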
ozzzo | righton, ty! | 17:59 |
ozzzo | I broke designate and then deleted some VMs | 17:59 |
eandersson | I'll try to reproduce it this weekend. | 17:59 |
eandersson | Are you using designate-sink (aka notifications)? | 18:00 |
ozzzo | yes | 18:01 |
eandersson | https://github.com/openstack/kolla-ansible/blob/stable/train/ansible/roles/designate/templates/designate.conf.j2#L64 | 18:01 |
eandersson | Looks like it is the default in kolla-ansible | 18:01 |
eandersson | And is this happening when you are just creating and destroying VMs? | 18:01 |
ozzzo | designate-sink.log is where I see the stacktraces when it fails | 18:01 |
eandersson | I see | 18:01 |
eandersson | We have this error too, but it isn't very frequent | 18:02 |
ozzzo | yes. To duplicate it I had to create a VM with a static IP in Terraform, taint it, and then run terraform apply | 18:02 |
ozzzo | so it's deleting the record and then immediately creating a new record with the same name and IP | 18:02 |
eandersson | I wrote my own notification handler | 18:02 |
eandersson | but we only do nova, not neutron | 18:02 |
eandersson | so we don't hit it very often | 18:03 |
eandersson | like one per 10k VMs or something like that | 18:03 |
ozzzo | I don't think we've seen it in nova | 18:03 |
eandersson | Is it always the neutron PTR record? | 18:03 |
eandersson | btw sorry when I say nova | 18:04 |
eandersson | I mean nova here https://github.com/openstack/kolla-ansible/blob/stable/train/ansible/roles/designate/templates/designate.conf.j2#L64 | 18:04 |
eandersson | handler:nova_fixed vs. handler:neutron_floatingip | 18:04 |
eandersson | Does the exception always come from nova.py or neutron.py? | 18:05 |
eandersson | https://github.com/openstack/designate/blob/stable/train/designate/notification_handler/neutron.py | 18:05 |
eandersson | or is it mixed? | 18:06 |
eandersson | The way kolla-ansible is set up is very aggressive. We run into this issue at ~1 per 10k VMs, but we only create a single A record. kolla-ansible is set up to create 6 records per VM | 18:07 |
ozzzo | I don't see nova or neutron in the logs. I see pymysql and designate | 18:08 |
eandersson | I see | 18:08 |
ozzzo | it tries to create an entry in the mysql database and fails with "Duplicate entry" | 18:08 |
eandersson | Yea - it's weird, because we hit that too, but it goes away once I add memcached as a coordinator | 18:09 |
ozzzo | looks like it happens on both forward and reverse | 18:09 |
ozzzo | hopefully I will get permission to share logs sometime soon | 18:09 |
eandersson | johnsom a big todo for testing is to mock nova/neutron notifications and add that as part of the functional testing | 18:10 |
eandersson | When I designed the coordination stuff I would just craft custom json payloads and send a million of them to rabbitmq | 18:10 |
johnsom | eandersson Interesting. Yeah, we aren't targeting the sink or notifications in our initial plan, but it will come. | 18:11 |
eandersson | Because of the way the sink talks to central, it is easier to test central through it | 18:12 |
eandersson | for race conditions | 18:12 |
eandersson | (since you don't need to make rest calls, just send 100 messages to rabbitmq and let it consume them instantly) | 18:13 |
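A rough sketch of that approach with oslo.messaging follows; the transport URL, event type, and payload fields are illustrative guesses rather than the exact schema designate-sink expects:

```python
from oslo_config import cfg
import oslo_messaging

cfg.CONF([])  # initialize an empty config so oslo.messaging can read it

# Placeholder RabbitMQ URL; point it at the broker the sink listens on.
transport = oslo_messaging.get_notification_transport(
    cfg.CONF, url='rabbit://guest:guest@192.0.2.20:5672/')
notifier = oslo_messaging.Notifier(
    transport, driver='messagingv2',
    publisher_id='compute.fake-host', topics=['notifications'])

# Flood the notifications topic so central has to process many
# near-simultaneous events for the same name/IP and any race shows up.
for i in range(100):
    notifier.info(
        {},  # empty request context
        'compute.instance.create.end',
        {'instance_id': 'fake-%d' % i,
         'hostname': 'test-vm',
         'fixed_ips': [{'address': '198.51.100.10'}]})
```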
johnsom | Might be worth writing up in a bug | 18:14 |
eandersson | Hopefully I will have some time later today or this weekend | 18:15 |
eandersson | I also want to try to reproduce this in general. | 18:15 |
eandersson | The only thing that is different from when I tested is that kolla-ansible creates 6 records per VM. | 18:15 |
eandersson | With a coordinator I was able to create 1k fake VMs without an issue | 18:16 |
eandersson | but I was only creating one record per VM | 18:16 |
eandersson | I also used memcached and not redis, but I have a hard time believing that redis wouldn't work :D | 18:19 |
ozzzo | I can duplicate it consistently. It fails every time when I use Terraform to rebuild a VM with a static IP and existing Designate DNS records | 18:23 |
ozzzo | on the 2nd rebuild it works because the DNS records don't exist, then fails on the 3rd, etc | 18:23 |
eandersson | Yea - I mean it really just seems like coordination isn't working for you at all, but not sure how to prove that. | 19:07 |
johnsom | eandersson When you have a minute, can you look at https://review.opendev.org/c/openstack/designate-tempest-plugin/+/800280 It has a +1 and +2 and would continue our path towards updating our tempest plugin. | 19:25 |
eandersson | lgtm johnsom | 20:39 |
johnsom | Thank you | 20:40 |
opendevreview | Merged openstack/designate-tempest-plugin master: Update service client access in tempest tests https://review.opendev.org/c/openstack/designate-tempest-plugin/+/800280 | 21:27 |