*** bpokorny_ has joined #openstack-dns | 00:02 | |
*** bpokorny has quit IRC | 00:05 | |
*** rudrajit has joined #openstack-dns | 00:09 | |
*** james_li has quit IRC | 00:11 | |
*** rudrajit_ has quit IRC | 00:12 | |
*** rcarrillocruz has left #openstack-dns | 00:12 | |
*** ducttape_ has joined #openstack-dns | 00:18 | |
*** pglass has quit IRC | 00:24 | |
*** kei_yama has quit IRC | 00:24 | |
*** kei_yama has joined #openstack-dns | 00:25 | |
*** rudrajit_ has joined #openstack-dns | 00:39 | |
*** rudrajit has quit IRC | 00:43 | |
*** chlong has joined #openstack-dns | 00:50 | |
*** rudrajit has joined #openstack-dns | 00:51 | |
*** rudrajit_ has quit IRC | 00:54 | |
*** rudrajit has quit IRC | 00:55 | |
*** ducttape_ has quit IRC | 01:06 | |
*** ccneill has quit IRC | 01:25 | |
*** penick has quit IRC | 01:28 | |
*** mlavalle has quit IRC | 01:38 | |
*** fawadkhaliq has joined #openstack-dns | 01:38 | |
*** fawadkhaliq has quit IRC | 01:38 | |
*** RuiChen has quit IRC | 01:42 | |
*** RuiChen has joined #openstack-dns | 01:42 | |
*** ducttape_ has joined #openstack-dns | 01:43 | |
*** stanzgy has joined #openstack-dns | 01:45 | |
*** jasonsb has joined #openstack-dns | 01:58 | |
*** ducttape_ has quit IRC | 02:01 | |
*** bpokorny_ has quit IRC | 02:16 | |
*** ducttape_ has joined #openstack-dns | 02:31 | |
*** rudrajit has joined #openstack-dns | 02:50 | |
*** jasonsb has quit IRC | 02:55 | |
*** ducttape_ has quit IRC | 03:00 | |
*** ducttape_ has joined #openstack-dns | 03:01 | |
*** kei_yama has quit IRC | 03:11 | |
*** ducttape_ has quit IRC | 03:11 | |
*** ducttape_ has joined #openstack-dns | 03:18 | |
openstackgerrit | James Li proposed openstack/designate: Make supported record type configurable https://review.openstack.org/272902 | 03:31 |
*** richm has quit IRC | 03:33 | |
*** jasonsb has joined #openstack-dns | 03:49 | |
*** ducttape_ has quit IRC | 04:02 | |
*** kei_yama has joined #openstack-dns | 04:02 | |
*** ducttape_ has joined #openstack-dns | 04:22 | |
*** ducttape_ has quit IRC | 04:36 | |
*** ducttape_ has joined #openstack-dns | 04:36 | |
*** ducttape_ has quit IRC | 04:56 | |
*** chlong has quit IRC | 05:12 | |
*** chlong has joined #openstack-dns | 05:33 | |
*** kei_yama has quit IRC | 05:36 | |
*** kei_yama has joined #openstack-dns | 05:37 | |
*** km has quit IRC | 05:59 | |
*** km has joined #openstack-dns | 06:00 | |
*** jet-rongl has joined #openstack-dns | 06:02 | |
*** RuiChen has quit IRC | 06:22 | |
*** RuiChen has joined #openstack-dns | 06:23 | |
*** chlong has quit IRC | 06:28 | |
*** chlong has joined #openstack-dns | 06:39 | |
*** cgross has joined #openstack-dns | 07:17 | |
*** chlong has quit IRC | 07:32 | |
*** rudrajit has quit IRC | 07:36 | |
*** venkat_ has joined #openstack-dns | 07:43 | |
*** boris-42 has joined #openstack-dns | 07:58 | |
*** RuiChen has quit IRC | 08:04 | |
*** RuiChen has joined #openstack-dns | 08:05 | |
*** nyechiel has joined #openstack-dns | 08:23 | |
*** venkat_ has quit IRC | 08:37 | |
*** nyechiel has quit IRC | 08:59 | |
*** RA_ has quit IRC | 09:04 | |
*** km has quit IRC | 09:32 | |
*** en_austin has joined #openstack-dns | 10:05 | |
en_austin | federico3: ping? | 10:05 |
*** jordanP has joined #openstack-dns | 10:11 | |
*** nyechiel has joined #openstack-dns | 10:36 | |
*** RA_ has joined #openstack-dns | 10:44 | |
federico3 | /query en_austin | 10:50 |
en_austin | yep :) i'm here | 10:51 |
en_austin | so, I'm still experiencing Designate failures with the pool manager hanging... do you have any news or maybe new ideas about this? :( | 10:52 |
en_austin | the last one was just a couple of hours ago, but with different symptoms | 10:52 |
en_austin | the zone got stuck in ERROR state and was unable to sync serials with the real backend data, but records were still propagating to the backends | 10:53 |
en_austin | until a couple of restarts of mdns, zone & pool manager | 10:53 |
en_austin | then it returned to ACTIVE state | 10:53 |
en_austin | what do you think, could this problem be solved by reducing the number of backends from 2 to 1? :) | 10:53 |
federico3 | I haven't replicated the exact behavior yet, and it's rather puzzling that nobody else has been reporting the same issue. This happened only while the caching was enabled, is that correct? What is your caching configuration: do you have a memcached process running? | 10:59 |
federico3 | en_austin: btw, I assume that the latest incident started with the usual exception, right? | 11:04 |
en_austin | - while the caching was enabled? -- I don't know, I tried to disable caching, but no luck with it. | 11:13 |
en_austin | - What is your caching config? -- yep, a memcached process. | 11:13 |
en_austin | -- started with the usual exception? no, no MessagingTimeouts detected this time. | 11:14 |
federico3 | "no luck with it" you mean you were unable to disable caching or you had the same issue while caching was disabled? | 11:15 |
federico3 | oh, that can be a different problem then - please upload the log from this last incident where MessagingTimeout did not happen | 11:16 |
en_austin | "no luck" means that zones were still unable to sync w/o cache and I've re-enabled it again :) | 11:20 |
en_austin | btw, maybe, I should try to use mysql cache instead of memcached?... | 11:20 |
federico3 | did you disable the cache during the incident or before? | 11:28 |
en_austin | that was couple of weeks ago, now I'm running with memcached | 11:29 |
*** RA_ has quit IRC | 11:31 | |
*** sonuk has quit IRC | 11:33 | |
*** sonuk has joined #openstack-dns | 11:34 | |
*** stanzgy has quit IRC | 11:36 | |
en_austin | Well, I've found another exception, but it seems to be an issue with our network infrastructure. | 11:37 |
en_austin | mdns.log.5:2016-01-29 12:29:22.499 10028 ERROR designate.service [req-4e43d316-3a28-46c7-b0be-bfb42f0bae58 - - - - -] Unhandled exception while processing request from x.x.x.x:33594 | 11:37 |
en_austin | but there were other error reports before this one (mostly 'Calling pool manager for %domain%: UPDATE: ERROR: (serial) on nameserver x.x.x.x') | 11:38 |
*** kei_yama has quit IRC | 11:44 | |
federico3 | en_austin: please update the bug report with today's issue - and can you please specify if caching was enabled/disabled and what memcached configuration you had in each incident? Be aware that if you configure memcached but do not specify any server there will be an in-memory local cache in use | 11:48 |
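For context, the pool manager cache federico3 is asking about is selected in designate.conf; a minimal sketch is below. The section and option names are assumptions from memory and should be checked against the designate.conf.sample shipped with your release rather than taken as authoritative.

```ini
# illustrative only - verify option names against designate.conf.sample
[service:pool_manager]
cache_driver = memcache              # or 'noop' to disable pool-manager caching

[pool_manager_cache:memcache]
memcached_servers = 10.0.0.5:11211   # if no servers are listed, an in-process
                                     # local cache is used instead (see above)
```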
en_austin | I've attached a config file to the issue - can you verify my memcached setup? | 11:50 |
federico3 | uhm, I'm not seeing it | 11:58 |
en_austin | http://paste.openstack.org/show/dxFyS4Ba8Z9E9HR30uWR/ | 11:58 |
en_austin | btw I've attached a new log | 11:59 |
en_austin | > the cache serial is 1454056936 and the actual serial is 1454056936. | 12:00 |
en_austin | looks weird... | 12:00 |
en_austin | federico3: well, it failed again right now. | 12:04 |
en_austin | 2 zones, one failed in ERROR state, second one in PENDING. | 12:04 |
en_austin | What data can I collect for you? :) | 12:04 |
federico3 | was there the timeout exception? | 12:04 |
federico3 | in general, this really look like you are having network issues | 12:05 |
en_austin | nope. Will verify record propagation now... | 12:05 |
en_austin | for the failed zone: serial is 1454068962, actual is 1454068734 | 12:06 |
en_austin | will try to create a new record in the "error"-state zone | 12:06 |
federico3 | keep in mind that it's normal for zones to go into ERROR state if there's a transient loss of connectivity, and they should recover by themselves later on | 12:07 |
en_austin | well, the record was _not_ propagated. Designate serial 1454069199, real serial still 1454068734 | 12:08 |
en_austin | will dump the query log now... | 12:08 |
*** nyechiel has quit IRC | 12:09 | |
en_austin | mdns.log:2016-01-29 15:09:31.108 12371 WARNING designate.mdns.notify [req-b63274bc-6e81-414d-8d59-23c17736c728 noauth-user noauth-project - - -] Got lower serial for 'xxxxxxxx.' to 'yyyyyyyy:53'. Expected:'1454069365'. Got:'1454068734'.Retries left='9' | 12:10 |
en_austin | Too many 'Sending SOA to YYYYY:53' and 'Got lower serial' messages in the logs here... | 12:10 |
en_austin | Well, the zone dropped to ERROR again (it was PENDING before). | 12:11 |
Kiall | the wording of those logs also sucks.. That's sending BIND a notify packet to say "Hey, you have changes to pull". | 12:11 |
federico3 | when you restart the components please make sure you restart only one component at a time and wait before restarting another one | 12:12 |
en_austin | federico3: I haven't restarted any components yet, since I'm trying to capture logs in the "failed state" | 12:12 |
federico3 | the fact that sometimes you did not have that timeout exception but you are still seeing a lot of zones in error state makes me think that you are having network issues or your nameservers are not updating for other reasons | 12:13 |
en_austin | ...the zone returned to PENDING state, seems it is trying to recover itself | 12:14 |
en_austin | "a lot" = 1. I only have two zones, and ALWAYS one of them is in ERROR and the other in PENDING. | 12:15 |
Kiall | en_austin: can you add "backdoor_port = 5432" to your config before you restart again, in the [DEFAULT] section | 12:15 |
federico3 | en_austin: as I suspected, we are mixing together different issues: 1) sometimes a zone can go in ERROR and then recover due to transient issues and that's ok | 12:15 |
en_austin | After restart they are both returning to ACTIVE state | 12:15 |
en_austin | Kiall: yep, I can, what's this for? :) | 12:15 |
Kiall | I'm interested to know if the whole process is locked up at the time, or if it just stops doing things | 12:15 |
en_austin | I'm interested to know this too :) sure, I can do it. | 12:15 |
Kiall | it lets you telnet into the running process on that port, and get a python shell | 12:15 |
*** RA_ has joined #openstack-dns | 12:15 | |
en_austin | awesome | 12:15 |
en_austin | will you give me instructions on how to gather the info you need? | 12:16 |
Kiall | Working on that now ;) | 12:16 |
en_austin | great | 12:16 |
Kiall | (Trying to identify how I grab a reference to the running service from the shell ;)) | 12:16 |
en_austin | Well, zone is still in PENDING state and new records are not propagating to backend. | 12:17 |
federico3 | en_austin: 2) when we had that TimeoutError and PM was unable to restore the zones for hours -> this is a different issue | 12:17 |
en_austin | The backend SOA is not increasing now (it is still ...8734, while Designate thinks it is 1454069787 as of now) | 12:17 |
en_austin | federico3: seems so, but the zone does not recover from ERROR state, and that's the problem. | 12:18 |
en_austin | Right now it is in PENDING/ERROR states, new records are not propagating, and the backend SOA is not increasing. | 12:18 |
en_austin | Well, I'll try to create a new record again, dump its request logs and then try to restart the service again (since it's a production one) :-( | 12:18 |
*** EricGonczer_ has joined #openstack-dns | 12:21 | |
Kiall | en_austin: also, can you `kill -USR1` your pool manager process when it's locked up, and grab all the stderr it prints out | 12:23 |
Kiall | It'll print LOTS of info about the running process | 12:23 |
en_austin | just restarted it via Supervisord a couple of seconds ago :( | 12:24 |
en_austin | next time I'll do it, for sure | 12:24 |
Kiall | You'll get something like this: http://paste.openstack.org/show/485398/ | 12:24 |
Kiall | What I'm mostly interested in seeing is the # of green threads running once it's locked up | 12:25 |
en_austin | no output from 'kill -USR1' at all... | 12:29 |
*** EricGonc_ has joined #openstack-dns | 12:30 | |
Kiall | It will be printed to stderr, which would be put somewhere by supervisord | 12:32 |
Kiall | maybe the logfile? | 12:32 |
*** EricGonczer_ has quit IRC | 12:33 | |
en_austin | yea, you're right, it is right in the logfile. | 12:33 |
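A minimal sketch of the step being discussed; the pgrep pattern and the assumption that supervisord redirects stderr into the service's log file are illustrative, not verified against this deployment.

```shell
# sending SIGUSR1 makes the service dump its config, greenthread tracebacks
# and other runtime state to stderr, which supervisord captures in the log file
kill -USR1 $(pgrep -f designate-pool-manager)
```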
Kiall | oh, I .. might have just reproduced it - and then killed the process. | 12:33 |
Kiall | (killed it before I realized what I was doing.. let me see what I did) | 12:34 |
en_austin | awh, one of my two zones just became 'active' and synced its SOA to the backend... *happy* | 12:34 |
en_austin | the second one synced serials too, but is still in PENDING state. | 12:35 |
Kiall | Q - How much python do you know? Working through how we would check the running process through the "backdoor" | 12:37 |
en_austin | it's been my primary language since 2014 (besides native Russian and upper-intermediate English, hehe) :) | 12:38 |
en_austin | so, I'm listening :) | 12:39 |
Kiall | en_austin: Cool - how much sense is this making then? http://paste.openstack.org/show/485404/ | 12:40 |
Kiall | fo(<class>) == Find all created objects of type <class> | 12:41 |
Kiall | (that's probably the only backdoor specific thing there) | 12:41 |
en_austin | > service = fo(Service)[0] | 12:43 |
en_austin | where does "fo" come from? | 12:43 |
Kiall | It's magicked into place by the backdoor code :) | 12:43 |
Kiall | https://github.com/openstack/oslo.service/blob/master/oslo_service/eventlet_backdoor.py#L101 | 12:43 |
Kiall | Oh, if you use 1 config file for all of the designate services.. comment the backdoor port out now or set it to a range - e.g. 2000:3000 | 12:44 |
Kiall | (otherwise a restart of any other service will try to grab the same port and fail) | 12:45 |
en_austin | i'm using only /opt/designate/etc/designate/designate.conf | 12:45 |
en_austin | > (otherwise a restart of any other service will try to grab the same port and fail) | 12:45 |
en_austin | damn! that's why I added it and failed to restart Designate :) | 12:45 |
Kiall | lol | 12:45 |
Kiall | try a range: backdoor_port = 2000:3000 | 12:45 |
Kiall | it'll pick the lowest free one, and log which one it chose | 12:45 |
en_austin | > and log | 12:46 |
en_austin | :( | 12:46 |
en_austin | whatever, netstat will help me, I think | 12:46 |
Kiall | grep logs for "Eventlet backdoor listening on" | 12:47 |
Kiall | e.g. Eventlet backdoor listening on 5432 for process 18668 | 12:47 |
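Pulling the above together, the backdoor setup looks roughly like this; the config path is the one en_austin mentioned above, and the port range and telnet host are illustrative.

```ini
# /opt/designate/etc/designate/designate.conf - shared by all designate services
[DEFAULT]
backdoor_port = 2000:3000   # a range, so each service binds its own free port
```

```shell
# after restarting, find the port the pool manager actually bound, then telnet in
grep "Eventlet backdoor listening on" /path/to/pool-manager.log
telnet 127.0.0.1 <port reported in the log line>
```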
en_austin | Yea, I've got it. | 12:49 |
Kiall | Anyway, if you can try those out.. USR1 / Those backdoor commands the next time it happens, it might give us some insight into what's actually failing. | 12:50 |
en_austin | >>> service._get_failed_domains(context, 'UPDATE') | 12:50 |
en_austin | <designate.objects.domain.DomainList object at 0x37fb090> | 12:50 |
en_austin | But both of zones in ACTIVE state now. | 12:50 |
Kiall | And it still shows in get_failed_domains? | 12:51 |
en_austin | Yep. | 12:51 |
en_austin | Is it normal? | 12:51 |
Kiall | what does `service._get_failed_domains(context, 'UPDATE')[0].to_dict()` show | 12:52 |
Kiall | ehh | 12:52 |
Kiall | yea actually | 12:52 |
Kiall | service._get_failed_domains(context, 'UPDATE').to_primitive() might also work and show all of them, rather than just 1 | 12:53 |
en_austin | IndexError: list index out of range | 12:53 |
en_austin | :) | 12:53 |
Kiall | It's gone now :) | 12:53 |
en_austin | but there is still a "DomainList" object present as the output of "_get_failed_domains". But if that's ok..) | 12:53 |
Kiall | Yea, it's an empty list - I misread :) | 12:54 |
Kiall | try service._get_failed_domains(context, 'UPDATE').to_primitive() | 12:54 |
en_austin | {'designate_object.changes': [u'total_count'], | 12:55 |
en_austin | 'designate_object.data': {'objects': [], 'total_count': 0}, | 12:55 |
en_austin | 'designate_object.name': 'DomainList', | 12:55 |
en_austin | 'designate_object.original_values': {}} | 12:55 |
en_austin | Seems it's all ok now (and that's true, both zones are in ACTIVE state) | 12:55 |
Kiall | Okay, so.. Anyway, next time it fails, try to gather all of the ^ info before restarting! | 12:57 |
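For readers following along, the inspection session above amounts to something like this once telnetted into the backdoor port. `fo()` is injected by the backdoor itself; the import paths and the admin-context construction are assumptions based on Kiall's paste rather than verified code.

```python
# inside the backdoor's interactive Python shell
from designate import context as designate_context
from designate.pool_manager.service import Service

ctx = designate_context.DesignateContext.get_admin_context()  # assumed helper
service = fo(Service)[0]             # fo(<class>) lists live objects of that class
failed = service._get_failed_domains(ctx, 'UPDATE')
print(failed.to_primitive())         # an empty DomainList once all zones are ACTIVE
```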
Kiall | (Also, I rarely get an excuse to use the backdoor, always a bonus ;)) | 12:58 |
en_austin | undocumented feature :D | 13:01 |
en_austin | Well, I've got your idea - the next time I experience an issue with zones, I will capture the logs (as usual), dump the kill -USR1 data and provide the data from telnet :) | 13:01 |
Kiall | Yea, it exists in most every openstack process :) | 13:03 |
Kiall | (Also - Make sure that isn't public, lol) | 13:03 |
Kiall | it's a backdoor with total control over your server (well, as much as the user running designate!) | 13:04 |
Kiall | Also, watch out for the USR1 dump, it's got a copy of your config - so, black out passwords | 13:05 |
Kiall | etc | 13:05 |
*** ducttape_ has joined #openstack-dns | 13:10 | |
en_austin | yea, I've got it. Thank you Kiall for support :) | 13:12 |
en_austin | btw, what do you think - could this issue be related to having >1 backend installed on different machines (e.g. some network issues)? | 13:12 |
en_austin | I'm going to merge Designate and BIND onto one box tomorrow (and reduce the backend count to 1) - could that help, or is it overkill for this? | 13:13 |
Kiall | I would be a little surprised if this was it TBH | 13:21 |
*** ducttape_ has quit IRC | 13:31 | |
*** rsyed_away is now known as rsyed | 13:49 | |
*** RA_ has quit IRC | 14:03 | |
*** ducttape_ has joined #openstack-dns | 14:06 | |
*** richm has joined #openstack-dns | 14:18 | |
*** pglass has joined #openstack-dns | 14:40 | |
*** johnbelamaric has joined #openstack-dns | 14:47 | |
*** ducttape_ has quit IRC | 15:01 | |
en_austin | Kiall: ping? | 15:13 |
Kiall | Kinda here - happening again? | 15:13 |
en_austin | Yea. | 15:14 |
en_austin | 1454080412 on designate and 1454078691 on backend. | 15:14 |
Kiall | Did any of those things turn up anything useful? | 15:14 |
en_austin | Will check it now :) | 15:15 |
en_austin | >>> print b._execute_rndc(['status']) | 15:17 |
en_austin | None | 15:17 |
*** pglass has quit IRC | 15:20 | |
*** james_li has joined #openstack-dns | 15:21 | |
en_austin | I've attached a "kill -USR1" output to Launchpad issue. | 15:22 |
en_austin | https://bugs.launchpad.net/designate/+bug/1534490 | 15:22 |
openstack | en_austin: Error: malone bug 1534490 not found | 15:22 |
Kiall | en_austin: re execute rndc, I think that might be expected.. it didn't explode is the key thing | 15:22 |
en_austin | same, returns None | 15:23 |
en_austin | _get_failed_domains(context, 'UPDATE').to_primitive() returns the same as in the normal state (but the zones are now not in ERROR, they are in PENDING state) | 15:24 |
james_li | Kiall: timsim: could you please review https://review.openstack.org/#/c/268362/ and https://review.openstack.org/#/c/272669 ? Joe would like to see them land soon. | 15:24 |
en_austin | Seems that records are being propagated, but veeery slowly... | 15:25 |
en_austin | (since one of the "failed" records is resolvable now) | 15:25 |
*** james_li has quit IRC | 15:26 | |
en_austin | it's approx an 8 min delay right now (Designate's zone serial is 1454081213, the backend's is 1454080681) | 15:27 |
*** james_li has joined #openstack-dns | 15:27 | |
*** ducttape_ has joined #openstack-dns | 15:27 | |
en_austin | but, both of the backends return None for _execute_rndc(['status']). | 15:27 |
en_austin | Is that normal? | 15:27 |
Kiall | james_li: i thought I commented on the wildcard NS one.. oops. | 15:28 |
james_li | Kiall: oh let me check | 15:28 |
Kiall | I didn't | 15:28 |
james_li | heh | 15:29 |
Kiall | I believe the RE_NS_HOSTNAME and RE_ZONENAME are functionally identical | 15:29 |
Kiall | https://www.debuggex.com/r/t8rXZR3g-jHEhNc9 and https://www.debuggex.com/r/yCPxlCYFDi32eem7 | 15:29 |
Kiall | So, it may be best to just use the existing one | 15:30 |
Kiall | The other comment was going to be.. is NS the only one it doesn't like? | 15:30 |
Kiall | en_austin: I think it is normal | 15:31 |
Kiall | yea, it is.. we don't return anything from that method, it's nothing or an exception | 15:32 |
en_austin | got it | 15:32 |
Kiall | you didn't get an exception, so... it's communicating just fine with bind | 15:32 |
Kiall | (at least via RNDC .. Notifications to update are different) | 15:32 |
Kiall | What about service._get_failed_domains(context, 'UPDATE').to_primitive() ? | 15:33 |
*** sonuk has quit IRC | 15:33 | |
Kiall | Is it working? slow? raising? | 15:33 |
Kiall | It doesn't matter if there are results, just that it works.. | 15:33 |
en_austin | It works. | 15:33 |
en_austin | But, it is important - there are no 'ERROR' state zones. | 15:33 |
Kiall | Okay, any MessagingTimeout errors? | 15:33 |
Kiall | I see 78 greenthreads running from the paste, which isn't huge | 15:34 |
en_austin | Yea, some MessagingTimeouts here. | 15:34 |
Kiall | can you USR1 mdns / central too? | 15:35 |
en_austin | I've restarted them to get DNS propagation back to a live state :(( | 15:35 |
en_austin | next time will do it, sure | 15:35 |
Kiall | oo - 62 of the 78 greenthreads are sitting in | 15:35 |
Kiall | `with lockutils.lock('update-status-%s' % domain.id):` | 15:35 |
en_austin | what does that mean? | 15:37 |
en_austin | btw I haven't restarted Central - I can dump a USR1 output from it. | 15:37 |
federico3 | are they all trying to acquire the same lock? | 15:38 |
en_austin | but the SOAs are close to being synced (though so slowly...) | 15:38 |
Kiall | federico3: It's hard to tell from the traces, but possible. | 15:39 |
james_li | Kiall: if RE_NS_HOSTNAME and RE_ZONENAME function the same then yeah we can just use the existing one, let me double check; and yes to your second question, wildcard *NS* record is the only one bind does not like. I recall silkey checked that AWS supports all wildcard records but NS. | 15:39 |
Kiall | james_li: Ok, only applying it to NS is perfect then :) | 15:40 |
james_li | Kiall: thanks I'll do the needful. | 15:41 |
Kiall | en_austin: can you add the latest pool manager logs, ideally all the way from one restart to the next | 15:43 |
Kiall | I'd like to try to check if we're stalling on locks for HOURS | 15:44 |
en_austin | I can archive all the logs I have from the pre-latest restart (when I enabled the telnet port) and give them to you privately :) | 15:52 |
en_austin | btw, new records are propagating now, but with a delay of ±5 min | 15:52 |
en_austin | SOAs still out of sync | 15:52 |
*** mlavalle has joined #openstack-dns | 15:54 | |
Kiall | en_austin: perfect | 15:57 |
*** jordanP has quit IRC | 15:57 | |
en_austin | well, after a full restart they are synced in ±30 sec | 15:58 |
en_austin | something is extremely slowing down the sync process... | 15:58 |
*** james_li has quit IRC | 16:00 | |
*** james_li has joined #openstack-dns | 16:01 | |
en_austin | Kiall: can that be influenced by using memcached instead of mysql for caching? | 16:02 |
Kiall | don't use mysql for caching ;) I still am not entirely sure why we have that! | 16:02 |
en_austin | :D | 16:02 |
Kiall | ping me when you've managed to grab those logs, hopefully I can see something between them and the USR1 dump | 16:03 |
en_austin | well, strange | 16:04 |
en_austin | 2016-01-29 19:03:59.890 12511 INFO designate.mdns.notify [req-65c4bc3d-5af1-4d52-ba56-b600f7b8642f noauth-user noauth-project - - -] Sending 'SOA' for 'aqa.int.zone.' to '10.28.0.18:53'. | 16:04 |
en_austin | ...and nothing changes on backend side | 16:04 |
Kiall | also, BIND logs for the period would be useful too, as each one of those ^ should line up | 16:04 |
en_austin | (awh, false alarm - it updated after a while) | 16:04 |
en_austin | sorry, I'm jumpy about every lag now :D | 16:05 |
en_austin | Kiall: i can dump it right now, but how can I share it w/you? | 16:05 |
*** james_li has quit IRC | 16:05 | |
Kiall | Yea, I'm nearly certain we're seeing something related to the 62 locks we're trying to get! That would slow things down | 16:05 |
Kiall | kiall - at - hpe.com ? | 16:05 |
en_austin | copy, will send it now | 16:05 |
Kiall | cool | 16:05 |
en_austin | you're working at Hewlett Packard? :) | 16:07 |
Kiall | Yea | 16:07 |
en_austin | :) | 16:07 |
en_austin | Kiall: check your email :) | 16:19 |
Kiall | ah, lol could have just PM'd me that URL :) | 16:20 |
en_austin | actually, yes :)) | 16:20 |
en_austin | meanwhile - SOAs again out of sync for 500+ secs | 16:21 |
Kiall | Bah, how busy is this install? | 16:21 |
en_austin | not much, approx 100-200 read/delete records per hour | 16:22 |
en_austin | totally serving approx 10k records | 16:22 |
Kiall | All in 1 zone, or? | 16:22 |
en_austin | 2 zones | 16:22 |
en_austin | 7.6k and 3.2k records | 16:23 |
en_austin | (too much logs, yea?) :) | 16:23 |
Kiall | So, what time did you do the 2x restart? | 16:24 |
en_austin | Kiall: maybe I can dump some useful info now? Designate has stopped notifying the BINDs again. | 16:24 |
en_austin | The last one was at 29-Jan-2016 19:11:44.612 (it's 19:24 now) | 16:24 |
en_austin | Kiall: awh... I don't remember (but you can grep for "Eventlet" - that restart was when I enabled the PM's backdoor port) | 16:25 |
en_austin | the mdns logs are now full of "sending SOA to x.x.x.x" (and that's all) | 16:26 |
Kiall | 2016-01-29 15:48:07.648 - 2016-01-29 18:31:05.863 from the looks of it | 16:26 |
en_austin | awh, now again one zone became ACTIVE and the second is very close to it | 16:27 |
en_austin | but there's an approx 15m delay between the real sync and the tons of "Sending SOA" messages in the mdns logs... | 16:27 |
Kiall | So, as a first thing, I think we're probably being WAY too aggressive with the default timers etc for the periodic recovery tasks, and the timeouts for mdns to respond, given we have a 7.6k record zone | 16:27 |
*** james_li has joined #openstack-dns | 16:30 | |
*** ccneill has joined #openstack-dns | 16:31 | |
Kiall | my suspicion is that mDNS is getting all locked up for an extended period of time while serving out that 7.6k record zone (likely many minutes, it's python!), which will cause the pool manager's attempts to query the zone to sit in the queue and likely timeout etc while the heavy lifting is happening | 16:32 |
en_austin | can we confirm or refute this assertion? | 16:35 |
en_austin | (384s delay between last backends sync and Designate's data... and it's growing) | 16:36 |
en_austin | and after ~200 more seconds there will be MANY MANY logs with "Transfer started..." "Transferred serial XXXX" from BIND's side | 16:36 |
*** EricGonc_ has quit IRC | 16:37 | |
en_austin | Kiall: can I verify something in Designate's state while it's not syncing data? | 16:38 |
en_austin | now | 16:38 |
en_austin | since we're already outside business hours - I can stay in the "failed" state a little bit longer | 16:39 |
*** EricGonczer_ has joined #openstack-dns | 16:41 | |
Kiall | en_austin: ok, start by doing an AXFR of the zone from mDNS with a large timeout | 16:44 |
en_austin | How? | 16:45 |
Kiall | dig +timeout=3600 AXFR zone.com. @mDNS:5354 | 16:45 |
Kiall | time dig +timeout=3600 AXFR zone.com. @mDNS:5354 | 16:45 |
Kiall | It's going to be slow | 16:45 |
*** EricGonczer_ has quit IRC | 16:46 | |
timsim | Might also check that it's responding on TCP at all. | 16:47 |
en_austin | Not so slow - real 0m3.631s | 16:47 |
en_austin | ;; XFR size: 7146 records (messages 4, bytes 230969) | 16:47 |
Kiall | Oh. That's.. What's this running on? beefy bare metal? | 16:48 |
en_austin | Virtuozzo centos7 container | 16:48 |
*** EricGonczer_ has joined #openstack-dns | 16:48 | |
en_austin | 3.6s is slow? | 16:49 |
en_austin | I thought that was fast :) | 16:49 |
Kiall | It's slow compared to if BIND was serving that out, but not slow considering python. | 16:49 |
Kiall | bind would be a fraction of a second. | 16:49 |
openstackgerrit | Merged openstack/designate: Add validation for MX, TXT, and SSHFP records https://review.openstack.org/272669 | 16:49 |
en_austin | real 0m0.182s | 16:50 |
Kiall | yep :) | 16:50 |
en_austin | that's BIND's response for same AXFR | 16:50 |
en_austin | JFYI dig requires the port to be specified via the "-p" param, not via ":5354" :) | 16:50 |
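So the working form of the earlier AXFR check is, for example:

```shell
time dig +timeout=3600 AXFR aqa.int.zone. -p 5354 @10.28.0.16
```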
en_austin | well, again - Designate's serial 1454086243, real 1454085967 | 16:51 |
en_austin | 276s and up w/o sync | 16:51 |
Kiall | dooh, in the last 2 days I've written {{ ip_address }}:{{ port }} about 456 times. ;) | 16:51 |
*** EricGonczer_ has quit IRC | 16:51 | |
en_austin | Kiall: seems strange :( | 16:51 |
en_austin | [root@ns2 ~]# dig +timeout=3600 AXFR aqa.int.zone. @10.28.0.16:5354 | 16:51 |
en_austin | dig: couldn't get address for '10.28.0.16:5354': not found | 16:51 |
*** EricGonczer_ has joined #openstack-dns | 16:51 | |
Kiall | Yea, my bad ;) | 16:52 |
en_austin | :) | 16:52 |
Kiall | I've been writing non-dig stuff 100's of times over the last few days ;) | 16:52 |
en_austin | so, any thoughts about the slow SOA sync? | 16:52 |
*** bpokorny has joined #openstack-dns | 16:56 | |
en_austin | 2016-01-29 20:05:07.525 12511 INFO designate.mdns.notify [req-573ce645-e5f1-4c3a-a7fd-3c685bec02cc noauth-user noauth-project - - -] Sending 'SOA' for 'aqa.int.zone.' to '10.28.0.17:53'. | 17:05 |
en_austin | and nothing in response | 17:05 |
en_austin | after some unknown delay: | 17:05 |
en_austin | 2016-01-29 20:05:27.617 12511 INFO designate.mdns.notify [req-c9964d60-8f73-48a2-b26c-b2bacc2cb44c noauth-user noauth-project - - -] Sending 'NOTIFY' for 'aqa.int.zone.' to '10.28.0.17:53'. | 17:05 |
en_austin | 2016-01-29 20:05:27.621 12511 DEBUG designate.service [-] Handling UDP Request from: 10.28.0.17:8437 _dns_handle_udp /opt/designate/designate/service.py:304 | 17:05 |
en_austin | and zone became synced (ACTIVE state) | 17:06 |
en_austin | 62/75 greenlets were in "with lockutils" again while serial was out of sync... | 17:09 |
Kiall | en_austin: sorry, was AFK | 17:18 |
Kiall | How often do both zones change? | 17:18 |
*** ducttape_ has quit IRC | 17:18 | |
en_austin | from a couple of changes per 10 seconds to one or two changes per hour | 17:18 |
Kiall | If it's pretty often, I'd be very tempted to remove the lock and see if things still wedge? I don't believe the lock is strictly necessary | 17:19 |
en_austin | the main scenario is - "we are deploying some stack with Heat, this stack contains X containers and each of them requests a unique hostname" | 17:19 |
Kiall | timsim / mugsie etc - can I get some eyes on https://github.com/openstack/designate/blob/master/designate/pool_manager/service.py#L514 | 17:20 |
Kiall | Removing that lock, harmless or suicidal? I think harmless, other than doing some excess work | 17:21 |
timsim | Since it's just in one process, probably not that harmful. | 17:21 |
timsim | One of the things that happens in there is that (I believe in the _get_consensus_serial path) it will call out to mdns.get_serial_number for all nameservers that don't have the right value in the cache. | 17:22 |
Kiall | https://github.com/openstack/designate/commit/565e5026eb39a06b86636661176e21cd9013b90a is where it was added | 17:22 |
en_austin | Yea, I'm running a single PM process. | 17:22 |
Kiall | https://bugs.launchpad.net/designate/+bug/1408202 | 17:23 |
openstack | Launchpad bug 1408202 in Designate "Statuses in Pool Manager Cache Are Persistent" [Low,Fix released] - Assigned to Ron Rickard (rjrjr) | 17:23 |
en_austin | I can test it on my instance, if you'll help me with the patch. Just remove the "with lockutils.lock..." line, or something else? | 17:23 |
Kiall | That feels like a mysql as a cache issue to me | 17:23 |
timsim | Ohhh | 17:24 |
Kiall | en_austin: remove that line, and then unindent everything after it - can you pastebin your version of that file? (I'm unsure exactly what version you're on, whether $distro made changes, etc) | 17:24 |
Kiall | timsim: good reason for it yet? | 17:24 |
timsim | So he's got a ton of update_statuses coming in for that zone, and it's locking up? | 17:24 |
Kiall | 62 of the 75 active greenthreads are in / waiting on that lock | 17:24 |
timsim | lol | 17:24 |
mugsie | wow | 17:24 |
en_austin | Kiall: I'm on stable/liberty, commit 94d0f00b09044fcef36604a9c81accb0907c9705 | 17:25 |
timsim | Yeah, if you've got multiple pool manager processes, that's pretty much useless. | 17:25 |
Kiall | <Kiall> oo - 62 of the 78 greenthreads are sitting in <Kiall> `with lockutils.lock('update-status-%s' % domain.id):` | 17:25 |
Kiall | ^ from the last lockup | 17:25 |
Kiall | timsim: I think it was just a "lets at least TRY not to do duplicate work" | 17:25 |
timsim | lol | 17:25 |
en_austin | timsim: I'm running a single process for each of services | 17:25 |
mugsie | I can;t see anything really bad. | 17:25 |
mugsie | wank it out and see what 'splodes | 17:26 |
timsim | That's a fart in a hurricane of trying not to do duplicate work in that code. | 17:26 |
mugsie | yank* | 17:26 |
mugsie | -_- | 17:26 |
timsim | LOL mugsie | 17:26 |
mugsie | /clear everyone | 17:26 |
en_austin | OK, so I will remove that line and unindent all under it | 17:26 |
timsim | freudian slip | 17:26 |
Kiall | en_austin: K - removed here, http://paste.openstack.org/show/485460/ | 17:27 |
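The experiment boils down to a change along these lines in designate/pool_manager/service.py; `_do_update_status` here is a hypothetical stand-in for whatever actually runs under the lock, shown only to illustrate the unindent.

```python
# before: every update_status call for a zone serialises on a per-zone lock,
# and 62 of the 78 greenthreads were seen queued up waiting on it
with lockutils.lock('update-status-%s' % domain.id):
    self._do_update_status(context, domain)   # stand-in for the real body

# after (Kiall's paste): drop the lock and unindent the body one level,
# accepting some duplicate work in exchange for not blocking
self._do_update_status(context, domain)       # stand-in for the real body
```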
mugsie | timsim: https://www.youtube.com/watch?v=rwaaFwTluLU | 17:27 |
en_austin | Kiall: installed your file, restarting now. :) | 17:29 |
Kiall | *fingers crossed* | 17:30 |
en_austin | well... both zones switched to ACTIVE in 2 minutes | 17:32 |
en_austin | a new record was created in ±20 seconds, zone ACTIVE again | 17:33 |
en_austin | I'm still not sure, but it seems it's running much better :) | 17:33 |
*** rudrajit has joined #openstack-dns | 17:35 | |
en_austin | 20:34:52 record created, 20:35:06 it became resolvable, 20:35:44 zone switched to ACTIVE state | 17:35 |
en_austin | great! | 17:35 |
*** jasonsb has quit IRC | 17:36 | |
Kiall | en_austin: cool, give it a few hours before calling success tho ;) | 17:36 |
en_austin | I still wouldn't call that "a success", but a 30 sec delay is much more appropriate than 30 minutes :) | 17:36 |
Kiall | timsim: you guys mentioned seeing some issues with PM lockups, right? Any clue if you have 100's or 1000's of greenthreads hanging on that lock? | 17:36 |
timsim | It's all MDNS lockups for us :( | 17:37 |
en_austin | Kiall: the PM lockups are the issue I reported to federico3 a couple of weeks ago | 17:37 |
Kiall | Oh, well - `kill -USR1 mdns-pid` and see what it's doing ;) | 17:37 |
Kiall | en_austin: yea, I thought you and tim might be having a similar / the same issue | 17:37 |
*** jasonsb has joined #openstack-dns | 17:37 | |
Kiall | timsim: also, check the backlog for "backdoor" details - there'll be enough in the scrollback for you to figure that out, if you've never used it before. | 17:38 |
en_austin | Well, I'll report that temporary bugfix to Launchpad so it doesn't get lost, if it finally works out :) | 17:38 |
Kiall | en_austin: yea, if you can report back after seeing it work for longer than usual, that would be great. | 17:38 |
en_austin | sure I will | 17:39 |
en_austin | btw, I've noticed that the cache is cleared after each request (or something like that) | 17:39 |
en_austin | too many "cache miss" messages | 17:39 |
en_austin | 2016-01-29 20:39:59.108 17900 DEBUG designate.pool_manager.service [req-188291ac-235e-4a6a-86fe-01014d90a4de noauth-user noauth-project - - -] Clearing cache for domain aqa.int.zone. with action UPDATE. _clear_cache /opt/designate/designate/pool_manager/service.py:618 | 17:40 |
en_austin | aha. | 17:40 |
Kiall | let me eyeball the code again to see if that's harmful (other than more work..) | 17:41 |
*** ducttape_ has joined #openstack-dns | 17:41 | |
*** ducttape_ has quit IRC | 17:42 | |
en_austin | (yea, I was right, no "cache hit" messages at all in the PM logs since this patch) | 17:42 |
*** rsyed is now known as rsyed_away | 17:43 | |
en_austin | but I think that fast record propagation is more important than this cache... | 17:43 |
*** ducttape_ has joined #openstack-dns | 17:43 | |
en_austin | (I hope to) | 17:43 |
Kiall | cache miss will be harmless, other than you're doing more work | 17:44 |
Kiall | but.. there should probably be at least some hits | 17:44 |
Kiall | Without the lock, you're going to see more of those, as that codepath can be reached concurrently - previously it could only be running once per zone at any given moment | 17:44 |
*** james_li has quit IRC | 17:45 | |
*** rudrajit has quit IRC | 17:45 | |
Kiall | also - noauth? lol | 17:45 |
Kiall | :) | 17:45 |
en_austin | yea, noauth, I'm running it in secure env with firewalled access to api | 17:46 |
en_austin | :) | 17:46 |
*** jasonsb has quit IRC | 17:46 | |
en_austin | since we're running both Heat and Designate in our own DevCloud | 17:46 |
en_austin | I don't think the concurrent zone status requests are something I should be afraid of :) | 17:48 |
en_austin | it's slightly more important to propagate records asap | 17:48 |
en_austin | well... 20 min, seems running well | 17:50 |
Kiall | *fingers crossed* | 17:50 |
en_austin | unfortunately it's Friday evening, so there won't be much domain provisioning activity over the weekend | 17:52 |
en_austin | we can collect more relevant data from Monday, but if I notice something strange - I'll report it here and/or to Launchpad, for sure. | 17:52 |
en_austin | ....maybe the MessagingTimeouts were caused by that lock too? | 17:53 |
Kiall | Possibly, if they were stuck waiting, they could have exceeded the timeouts | 18:02 |
en_austin | Kiall: I've just seen the following in my logs: PM called mdns for serial 1454090601, and a couple of seconds later it called for serial 1454090600. | 18:04 |
en_austin | That's not a dangerous thing? | 18:04 |
en_austin | i.e. X+1 was called earlier than X (maybe that's what the lock was for?) | 18:05 |
Kiall | No, the lock wouldn't have helped there.. I suspect that's just fine, and is just stuff working its way through. federico3 has a patch up that buffers notifies and prevents us ever doing the 00 one | 18:06 |
en_austin | so, I shouldn't care about X and X+1 serials. Okay, I've got it :) | 18:07 |
Kiall | Okay, outta here for the day | 18:09 |
en_austin | oook | 18:10 |
en_austin | it's time to drive home, I think | 18:10 |
en_austin | friday evening... :) | 18:10 |
en_austin | well, thanks a lot Kiall and all others who helped me | 18:10 |
en_austin | have a nice day :) | 18:11 |
*** en_austin has quit IRC | 18:13 | |
*** james_li has joined #openstack-dns | 18:15 | |
*** EricGonczer_ has quit IRC | 18:17 | |
*** EricGonczer_ has joined #openstack-dns | 18:19 | |
*** rudrajit has joined #openstack-dns | 18:22 | |
*** james_li has quit IRC | 18:25 | |
*** penick has joined #openstack-dns | 18:27 | |
*** ccneill has quit IRC | 18:27 | |
*** sonuk has joined #openstack-dns | 18:29 | |
*** sonuk has quit IRC | 18:36 | |
*** pcaruana has joined #openstack-dns | 18:57 | |
*** markd_ has joined #openstack-dns | 18:59 | |
*** bpokorny_ has joined #openstack-dns | 19:01 | |
*** markd_ has quit IRC | 19:04 | |
*** bpokorny has quit IRC | 19:04 | |
*** ccneill has joined #openstack-dns | 19:10 | |
*** pglass has joined #openstack-dns | 19:10 | |
*** rudrajit_ has joined #openstack-dns | 19:15 | |
*** rudrajit has quit IRC | 19:18 | |
*** EricGonc_ has joined #openstack-dns | 19:23 | |
*** EricGonc_ has quit IRC | 19:23 | |
*** EricGonc_ has joined #openstack-dns | 19:24 | |
*** pcaruana has quit IRC | 19:24 | |
*** james_li has joined #openstack-dns | 19:24 | |
*** EricGonczer_ has quit IRC | 19:25 | |
*** EricGonczer_ has joined #openstack-dns | 19:25 | |
*** EricGonc_ has quit IRC | 19:25 | |
*** ducttape_ has quit IRC | 19:31 | |
*** ducttape_ has joined #openstack-dns | 19:34 | |
*** pcaruana has joined #openstack-dns | 19:37 | |
*** rudrajit_ has quit IRC | 19:38 | |
*** GonZoPT has quit IRC | 19:50 | |
*** EricGonczer_ has quit IRC | 19:52 | |
*** EricGonczer_ has joined #openstack-dns | 19:52 | |
openstackgerrit | Merged openstack/designate: Ensure mocks are stopped after use https://review.openstack.org/273723 | 19:54 |
*** jasonsb has joined #openstack-dns | 19:59 | |
*** rsyed_away is now known as rsyed | 19:59 | |
*** bpokorny_ has quit IRC | 20:04 | |
*** bpokorny has joined #openstack-dns | 20:04 | |
*** pcaruana has quit IRC | 20:06 | |
*** james_li_ has joined #openstack-dns | 20:07 | |
*** pcaruana has joined #openstack-dns | 20:09 | |
*** james_li has quit IRC | 20:10 | |
*** pcaruana has quit IRC | 20:19 | |
*** penick has quit IRC | 20:24 | |
*** richm has quit IRC | 20:30 | |
*** ducttape_ has quit IRC | 20:31 | |
openstackgerrit | Paul Glass proposed openstack/designate: Update functional test TLDCLient https://review.openstack.org/274223 | 20:35 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/designate: Updated from global requirements https://review.openstack.org/270938 | 20:44 |
*** richm has joined #openstack-dns | 20:46 | |
*** james_li has joined #openstack-dns | 20:48 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/python-designateclient: Updated from global requirements https://review.openstack.org/271689 | 20:49 |
*** ducttape_ has joined #openstack-dns | 20:49 | |
*** james_li_ has quit IRC | 20:51 | |
*** bpokorny_ has joined #openstack-dns | 20:58 | |
*** jasonsb has quit IRC | 20:59 | |
*** bpokorny has quit IRC | 21:02 | |
*** rudrajit_ has joined #openstack-dns | 21:24 | |
*** sonuk has joined #openstack-dns | 21:29 | |
openstackgerrit | James Li proposed openstack/designate: Make supported record type configurable https://review.openstack.org/272902 | 21:56 |
*** ducttape_ has quit IRC | 22:01 | |
openstackgerrit | James Li proposed openstack/designate: Fix wildcard NS record https://review.openstack.org/268362 | 22:24 |
*** RA_ has joined #openstack-dns | 22:28 | |
*** ducttape_ has joined #openstack-dns | 22:34 | |
*** sonuk has quit IRC | 22:36 | |
*** RA_ has quit IRC | 22:36 | |
*** rsyed is now known as rsyed_away | 22:56 | |
*** johnbelamaric has quit IRC | 22:57 | |
*** bpokorny_ has quit IRC | 23:00 | |
*** bpokorny has joined #openstack-dns | 23:01 | |
*** penick has joined #openstack-dns | 23:03 | |
*** EricGonczer_ has quit IRC | 23:07 | |
*** EricGonczer_ has joined #openstack-dns | 23:09 | |
*** pglass has quit IRC | 23:13 | |
*** EricGonczer_ has quit IRC | 23:23 | |
*** EricGonczer_ has joined #openstack-dns | 23:23 | |
openstackgerrit | Tim Simmons proposed openstack/designate: Actually poll for zone deletes https://review.openstack.org/274291 | 23:28 |
*** EricGonczer_ has quit IRC | 23:34 | |
*** EricGonczer_ has joined #openstack-dns | 23:37 | |
*** EricGonczer_ has quit IRC | 23:37 | |
openstackgerrit | Tim Simmons proposed openstack/designate: Add an on-demand single-target sync method https://review.openstack.org/274292 | 23:42 |
*** james_li has quit IRC | 23:52 | |
*** ducttape_ has quit IRC | 23:52 |