michchap | eandersson, frickler: I think the intro doc is ready to go in if you have a moment. https://review.opendev.org/c/openstack/designate/+/763779 | 02:07 |
---|---|---|
*** ianychoi has joined #openstack-dns | 02:09 | |
*** ircuser-1 has quit IRC | 06:04 | |
*** ianychoi has quit IRC | 08:33 | |
*** ianychoi has joined #openstack-dns | 08:48 | |
*** mgagne has quit IRC | 12:52 | |
*** mgagne has joined #openstack-dns | 12:53 | |
-openstackstatus- NOTICE: Recent POST_FAILURE results from Zuul for builds started prior to 15:47 UTC were due to network connectivity issues reaching one of our log storage providers, and can be safely rechecked | 15:50 | |
nicolasbock | Hi! I am working on a somewhat puzzling issue with parallel creation of recordsets. I have a 3 node HA Designate with a BIND9 backend on Stein. I create 10 recordsets in parallel and often end up with fewer records in BIND9. I tracked down the issue to how the zone serial is incremented. Since the code is using `time.time()` in seconds the serial is not always incremented and BIND9 is missing some updates because it | 21:30 |
nicolasbock | believes that the serial hasn't changed. | 21:30 |
nicolasbock | I have a fix as well. But for Stein. | 21:30 |
nicolasbock | I cannot reproduce this issue though starting with Ussuri | 21:30 |
nicolasbock | The increment serial code has not changed though which makes this quite puzzling | 21:31 |
nicolasbock | Does anyone here have an idea where I should look to see what might have "fixed" the issue? | 21:31 |
nicolasbock | Is there some better serialization for API calls? | 21:31 |
johnsom | Could it be a change in behavior from the py2->3 change? | 21:31 |
nicolasbock | The Stein env is Python3 | 21:31 |
nicolasbock | The BIND9 version is different as well | 21:32 |
nicolasbock | Stein is Bionic and Ussuri is Focal | 21:32 |
nicolasbock | I don't see any errors in the BIND9 logs though | 21:32 |
nicolasbock | It's queuing the NOTIFYs properly | 21:32 |
johnsom | Well, a notify should just trigger a serial check and if there are multiple updates to the zone, some with the same serial number, I could see BIND thinking it has the latest data. | 21:35 |
nicolasbock | I was just looking at the BIND9 logs again. I have to correlate those with what central is saying about the zone serials | 21:37 |
nicolasbock | Maybe it's the more recent BIND9 in Focal that "fixes" this | 21:38 |
johnsom | I guess I would check the precision of the results from time.time() on both bionic and focal, just to rule out a change in the epoch precision. | 21:38 |
johnsom | Yeah, I was just going to look at the bind release notes | 21:38 |
nicolasbock | Ah good point about the precision | 21:38 |
nicolasbock | In Stein it's seconds | 21:38 |
nicolasbock | Maybe that changed in Ussuri | 21:38 |
nicolasbock | (or Focal) | 21:39 |
johnsom | https://www.irccloud.com/pastebin/Xuhi8Syd/ | 21:39 |
johnsom | That is from my focal VM | 21:39 |
johnsom | I don't have a bionic around to check | 21:39 |
nicolasbock | Oh | 21:39 |
nicolasbock | Let me check | 21:40 |
nicolasbock | ``` | 21:41 |
nicolasbock | ubuntu@bionic-lp:~$ python3 | 21:41 |
nicolasbock | Python 3.6.9 (default, Oct 8 2020, 12:12:24) | 21:41 |
nicolasbock | [GCC 8.4.0] on linux | 21:41 |
nicolasbock | Type "help", "copyright", "credits" or "license" for more information. | 21:41 |
nicolasbock | >>> time.tim() | 21:41 |
nicolasbock | Traceback (most recent call last): | 21:41 |
nicolasbock | File "<stdin>", line 1, in <module> | 21:41 |
nicolasbock | NameError: name 'time' is not defined | 21:41 |
nicolasbock | >>> import time | 21:41 |
nicolasbock | >>> time.time() | 21:41 |
nicolasbock | 1613079631.5469396 | 21:41 |
nicolasbock | ``` | 21:41 |
nicolasbock | Looks the same | 21:41 |
nicolasbock | Wow, am I confused :/ | 21:41 |
nicolasbock | Maybe time for a coffee break ;) | 21:41 |
nicolasbock | Thanks for the help johnsom ! | 21:42 |
johnsom | Sure, I will poke around and see if I see something | 21:42 |
nicolasbock | Thanks! | 21:44 |
nicolasbock | I'll write this up a bit more clearly and open a bug | 21:44 |
nicolasbock | That's easier for tracking purposes | 21:44 |
johnsom | +1 | 21:44 |
johnsom | Ah, yeah, so designate is using oslo timeutils and stripping the sub-second data by converting it to an int. | 21:47 |
nicolasbock | Ah, that makes sense | 21:48 |
johnsom | https://github.com/openstack/oslo.utils/blob/master/oslo_utils/timeutils.py#L153 | 21:49 |
nicolasbock | I'll try to replace the BIND9 package on the backend and downgrade it to the one Bionic is using | 21:49 |
nicolasbock | Maybe that will show us something | 21:49 |
johnsom | Given that, I can certainly see how the bug could happen. Just not sure yet why it wouldn't happen on Ussuri | 21:50 |
nicolasbock | Exactly | 21:50 |
nicolasbock | Looking at https://opendev.org/openstack/designate/src/branch/master/designate/utils.py#L143 this is certain to break under heavy parallel creation of recordsets | 21:51 |
johnsom | Yeah, that is exactly the code I traced to oslo utils/timeutils | 21:52 |
nicolasbock | Ok, I'll do some surgery on my deployment then to downgrade BIND9 | 21:52 |
johnsom | you could also do a tcpdump on the Ussuri setup and see if BIND is pulling the zone for every NOTIFY. | 21:53 |
nicolasbock | True | 21:53 |
johnsom | It could also be some DB optimization where by the time bind does the zone transfer, the new records are there | 21:54 |
nicolasbock | I have to take a break. The kids need snacks and I have to think about preparing dinner :) | 21:54 |
nicolasbock | I'll check in tomorrow again | 21:54 |
johnsom | o/ Ping me whenever you have a bug number and I will add notes there | 21:55 |
nicolasbock | +1 | 21:55 |
*** gmann is now known as gmann_afk | 22:10 | |
johnsom | The central coordinated locking might also be playing a factor here: https://review.opendev.org/c/openstack/designate/+/717955 | 22:13 |
johnsom | That was new in Ussuri | 22:14 |
johnsom | That might be delaying the updates enough that a second has elapsed, at least often enough that it masks the issue with the serial number generator | 22:17 |
nicolasbock | It's backported to Train (https://review.opendev.org/c/openstack/designate/+/736055) but apparently not Stein | 22:17 |
*** PrinzElvis has quit IRC | 22:41 | |
*** PrinzElvis has joined #openstack-dns | 22:41 | |
*** gmann_afk is now known as gmann | 23:11 |
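
The serial collision discussed above can be illustrated with a small sketch. This is not Designate's code: `serial_from_clock` and `next_serial` are hypothetical names, and the "bump when not newer" rule is only one possible fix, assumed here for illustration. The timestamp is the one from the bionic session in the log, with ten updates spaced 10 ms apart.

```python
import time


def serial_from_clock(now=None):
    """Hypothetical stand-in for a second-resolution serial: the
    sub-second part of the timestamp is dropped by int()."""
    return int(time.time() if now is None else now)


def next_serial(current, now=None):
    """Hypothetical fix sketch: use the clock-based serial only if it is
    newer than the current one, otherwise increment by one."""
    candidate = serial_from_clock(now)
    return candidate if candidate > current else current + 1


# Timestamp taken from the session in the log; ten updates 10 ms apart.
FIXED_NOW = 1613079631.5469396

# Clock-only serials: every update within the same second gets the same value.
clock_serials = [serial_from_clock(FIXED_NOW + i * 0.01) for i in range(10)]

# Bumped serials: each update is guaranteed a strictly larger value.
serial = 0
bumped = []
for i in range(10):
    serial = next_serial(serial, FIXED_NOW + i * 0.01)
    bumped.append(serial)

print(len(set(clock_serials)), "distinct clock-only serial(s)")
print(len(set(bumped)), "distinct bumped serial(s)")
```

With clock-only serials, a secondary that compares serials after a NOTIFY has no reason to transfer the zone again for the later updates. The bumped variant only helps if the read-and-increment step is serialized across workers, which is presumably where the coordinated locking mentioned at 22:13 comes in; it also lets the serial run ahead of the wall clock under sustained load.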