Hi!
I'm in the process of making DNSControl faster for large configurations and I'm ready to show off the current results.
Currently DNSControl does all of its work one domain at a time. Many people have suggested that DNSControl would preview/push faster if it did its work concurrently. Go makes it easy to do this kind of thing.
Doing all domains concurrently would be faster, but that would raise some difficult-to-solve problems. How would we deal with outputting the results? We wouldn't want each domain's output to be intermingled. We'd have to buffer the output for each domain then output each buffer at the end. That's a big change. How would we deal with "-i" (interactive) mode, where we prompt "Run? (y/N):" before each action? That seems impossible if many domains are being processed at once. How would we deal with the fact that not all providers handle rate limiting? Would some providers just error out when they receive many concurrent requests?
A few weeks ago I realized something that would mitigate most of those problems: What if DNSControl gathered information for all domains concurrently, but executed the actual corrections the old way, one at a time?
DNSControl spends much of its time gathering information: collecting the DNS records for a zone, getting a list of all the zones/domains at a provider, determining the parent delegations, etc. All of this takes a long time. There isn't much computation being done; most of that time is spent waiting for replies to API calls. Why not do all of that concurrently? None of that requires printing or interactive questions, nor does it greatly increase the number of API calls made per second (that last part is not obvious, but most providers count "get all records" as only one API call).
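To make the idea concrete, here is a minimal sketch of "gather concurrently, then act one domain at a time". It is not the code in the branch; zoneData, fetchRecords, and gatherAll are hypothetical names, and fetchRecords stands in for a real provider API call.

```go
package main

import (
	"fmt"
	"sync"
)

// zoneData is a hypothetical stand-in for whatever the gathering phase collects.
type zoneData struct {
	Domain  string
	Records []string
	Err     error
}

// fetchRecords is a hypothetical placeholder for a slow provider API call.
func fetchRecords(domain string) ([]string, error) {
	return []string{"A " + domain}, nil
}

// gatherAll fetches data for every domain concurrently.
// Each goroutine writes only to its own slice index, so no lock is needed.
func gatherAll(domains []string) []zoneData {
	results := make([]zoneData, len(domains))
	var wg sync.WaitGroup
	for i, d := range domains {
		wg.Add(1)
		go func(i int, d string) {
			defer wg.Done()
			recs, err := fetchRecords(d)
			results[i] = zoneData{Domain: d, Records: recs, Err: err}
		}(i, d)
	}
	wg.Wait()
	return results
}

func main() {
	domains := []string{"example.com", "example.org", "example.net"}

	// Phase 1: gather concurrently. Nothing is printed or prompted here.
	gathered := gatherAll(domains)

	// Phase 2: report (and, in a real tool, correct) one domain at a time,
	// in the original order, so output is never intermingled.
	for _, z := range gathered {
		if z.Err != nil {
			fmt.Printf("%s: gather failed: %v\n", z.Domain, z.Err)
			continue
		}
		fmt.Printf("%s: %d record(s) gathered\n", z.Domain, len(z.Records))
	}
}
```

Because errors are stored alongside each result instead of being handled inside the goroutines, the sequential phase can decide what to print and when.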
There are 3 technical hurdles: First, DNSControl intermingles "gathering" with "doing". This project would need to split those into 2 phases. Second, DNSControl has to deal with errors and failures during the gathering stage. Luckily that turned out to be easier to fix once I looked at the code.
Third, most providers are not "thread safe": they don't work correctly when the same provider runs twice concurrently. I've found this to be a problem only when a provider tries to be smart and cache certain data. So far I've fixed that for AzureDNS, Cloudflare, and AWS Route53.
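As an illustration of the kind of fix involved (a sketch only, not the actual change made to any of those providers), a cache can be made safe for concurrent use by guarding it with a mutex. The provider type, its fields, and lookupZoneID are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical provider that caches zone IDs.
type provider struct {
	mu      sync.Mutex        // guards zoneIDs and lookups
	zoneIDs map[string]string // the cache that caused trouble when unguarded
	lookups int               // number of simulated API calls made
}

// lookupZoneID is a hypothetical placeholder for a real API call.
// It must be called with p.mu held.
func (p *provider) lookupZoneID(domain string) string {
	p.lookups++
	return "id-" + domain
}

// zoneID returns the cached ID for domain, fetching it on first use.
// The lock is held for the whole call, so concurrent callers never
// read or write the map at the same time.
func (p *provider) zoneID(domain string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.zoneIDs == nil {
		p.zoneIDs = make(map[string]string)
	}
	if id, ok := p.zoneIDs[domain]; ok {
		return id
	}
	id := p.lookupZoneID(domain)
	p.zoneIDs[domain] = id
	return id
}

func main() {
	p := &provider{}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.zoneID("example.com")
		}()
	}
	wg.Wait()
	fmt.Println(p.zoneID("example.com"))
}
```

Without the mutex, the concurrent map access in zoneID is exactly the sort of bug the Go race detector reports.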
Which brings me to the point of this email: I have implemented a demo of what a "concurrent DNSControl" would look like. The "tlim_parallel" branch has code that adds "ppreview" and "ppush" commands (parallel preview and push). Eventually ppreview/ppush would replace preview/push but for now it's nice to have both so I can compare them.
You can try the code yourself:
$ git clone https://github.com/StackExchange/dnscontrol.git
$ cd dnscontrol
$ git checkout tlim_parallel
$ go install
$ (cd to where your dnsconfig.js is)
$ dnscontrol ppreview --slow # New code, one provider at a time
$ dnscontrol ppreview # New code, gather all data concurrently
Go has a way to detect non-thread-safe code (i.e. providers that don't work when used concurrently). The code runs much slower (1/10th the speed) but outputs great diagnostics if there is a problem. Just add the "-race" flag (one hyphen, not two) to the "go install" command:
$ go install -race
Here's an (unscientific) benchmark:
2m50.988s preview
2m29.477s ppreview --slow
1m5.072s ppreview
~3 minutes down to ~1 minute! A huge improvement!
"ppreview --slow" is an improvement over "preview" because the new code caches the list of zones at each provider. It turns out a large reason the current code is so slow is that DNSControl queries the list of zones many, many, many times. Oops! For my configuration, about a 35% performance improvement comes from just caching that.
I could use help!
* Test providers with "go install -race" and find/fix any problems reported. (The race detector isn't perfect; you might have to run many, many times to find a problem.)
* Report bugs
* Help design a way to have legacy (non-thread-safe) providers not run concurrently
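On that last item, one possible starting point, purely as a sketch to seed discussion and not a settled design: give each provider a flag saying whether it is safe to run concurrently, and serialize calls to any provider that isn't. The providerInfo type, the ConcurrentSafe field, and the gate type are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// providerInfo is a hypothetical description of a provider.
type providerInfo struct {
	Name           string
	ConcurrentSafe bool // hypothetical flag: true if the provider tolerates concurrent use
}

// gate serializes work for providers that are not concurrency-safe.
// It keeps one mutex per legacy provider name.
type gate struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// run executes work immediately for safe providers. For legacy providers
// it takes that provider's lock first, so only one call runs at a time.
func (g *gate) run(p providerInfo, work func()) {
	if p.ConcurrentSafe {
		work()
		return
	}
	g.mu.Lock()
	if g.locks == nil {
		g.locks = make(map[string]*sync.Mutex)
	}
	l, ok := g.locks[p.Name]
	if !ok {
		l = &sync.Mutex{}
		g.locks[p.Name] = l
	}
	g.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	work()
}

func main() {
	legacy := providerInfo{Name: "LEGACY", ConcurrentSafe: false}
	var g gate
	var wg sync.WaitGroup
	counter := 0 // deliberately unguarded: safe only because the gate serializes access
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.run(legacy, func() { counter++ })
		}()
	}
	wg.Wait()
	fmt.Println(counter)
}
```

Legacy providers would still benefit from running alongside other providers; they just wouldn't run alongside themselves.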
The current code doesn't implement the "--report filename.json" feature yet. That'll come soon. You'll also see a lot of debugging print statements.
Oh, one more interesting tidbit: One of the reasons I had been avoiding this project is that the run() function in previewPush.go is super complex and difficult to change. I've rewritten it to be much more maintainable, which should make adding features easier in the future.
Please test it out and tell me what you think!
Enjoy!
Tom
-- Tom Limoncelli (he/him)
SRE TPM, Stack Overflow, Inc.