Making dnscontrol faster for large configurations through Go's concurrency.

Tom Limoncelli

Mar 11, 2024, 7:50:10 AM
to DNSControl-discuss
Hi!

I'm in the process of making dnscontrol faster for large configurations and I'm ready to show off the current results.

Currently DNSControl does all of its work one domain at a time.  Many people have suggested that DNSControl would preview/push faster if it did its work concurrently. Go makes it easy to do this kind of thing.  

Doing all domains concurrently would be faster, but that would raise some difficult-to-solve problems.  How would we deal with outputting the results?  We wouldn't want each domain's output to be intermingled. We'd have to buffer the output for each domain then output each buffer at the end.  That's a big change.  How would we deal with "-i" (interactive) mode, where we prompt "Run? (y/N):" before each action?  That seems impossible if many domains are being processed at once.  How would we deal with the fact that not all providers handle rate limiting?   Would some providers just error out when they receive many concurrent requests?

A few weeks ago I realized something that would mitigate most of those problems: What if DNSControl gathered information for all domains concurrently, but executed the actual corrections the old way, one at a time?

Much of dnscontrol's time is spent gathering information: collecting the DNS records for a zone, getting a list of all the zones/domains at a provider, determining the parent delegations, etc.  All of this takes a long time.  There isn't much computation being done; most of that time is spent waiting for replies to API calls.  Why not do all of that concurrently?  None of it requires printing or interactive questions, nor does it greatly increase the number of API calls per second (that last part is not obvious, but most providers count "get all records" as only one API call).
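
To make the two-phase idea concrete, here is a minimal Go sketch. It is not the code on the branch: zoneData, fetchZone, and gatherAll are hypothetical names, and the sleep stands in for a slow provider API call. Phase 1 fetches every domain in its own goroutine; phase 2 walks the results in order, one domain at a time, which is where printing and prompting would happen.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// zoneData is a hypothetical stand-in for whatever a provider returns
// for one domain (records, delegations, and so on).
type zoneData struct {
	domain  string
	records []string
	err     error
}

// fetchZone simulates a slow API call. In real code this would be the
// provider's "get all records" request.
func fetchZone(domain string) zoneData {
	time.Sleep(50 * time.Millisecond) // pretend to wait on the network
	return zoneData{domain: domain, records: []string{"A " + domain}}
}

// gatherAll fetches every domain concurrently and returns the results
// in the same order as the input, so later output stays deterministic.
func gatherAll(domains []string) []zoneData {
	results := make([]zoneData, len(domains))
	var wg sync.WaitGroup
	for i, d := range domains {
		wg.Add(1)
		go func(i int, d string) {
			defer wg.Done()
			results[i] = fetchZone(d) // each goroutine writes only its own slot
		}(i, d)
	}
	wg.Wait()
	return results
}

func main() {
	domains := []string{"example.com", "example.net", "example.org"}

	// Phase 1: gather concurrently. No printing, no prompts.
	gathered := gatherAll(domains)

	// Phase 2: report (and, in real code, correct) one domain at a time.
	for _, z := range gathered {
		if z.err != nil {
			fmt.Printf("%s: gather failed: %v\n", z.domain, z.err)
			continue
		}
		fmt.Printf("%s: %d record(s)\n", z.domain, len(z.records))
	}
}
```

Because errors are stored alongside each result rather than printed from inside a goroutine, a failure during gathering can still be reported in order during the serial phase.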

There are 3 technical hurdles.  First, DNSControl intermingles "gathering" with "doing"; this project would need to split those into 2 phases.  Second, DNSControl has to deal with errors and failures during the gathering stage.  Luckily, that turned out to be easier to fix than I expected once I looked at the code.

Third, most providers are not "thread safe": they don't work correctly when the same provider runs twice concurrently.  I've found this to be a problem only when a provider tries to be smart and caches certain data.  So far I've fixed that for AzureDNS, Cloudflare, and AWS Route53.
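
As an illustration of the caching problem, here is one common way to make a provider-side cache safe for concurrent use: guard the map with a mutex. This is only a sketch; zoneIDCache and its methods are hypothetical and are not taken from any actual provider.

```go
package main

import (
	"fmt"
	"sync"
)

// zoneIDCache is a hypothetical provider-side cache mapping zone names
// to provider-specific IDs. An unguarded map here is exactly the kind of
// thing that breaks when two goroutines use the same provider.
type zoneIDCache struct {
	mu  sync.Mutex
	ids map[string]string
}

func newZoneIDCache() *zoneIDCache {
	return &zoneIDCache{ids: make(map[string]string)}
}

// get returns the cached ID for name, calling lookup only on a miss.
// The lock is held across lookup, which serializes misses; that is
// simple and safe, at the cost of some concurrency.
func (c *zoneIDCache) get(name string, lookup func(string) string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.ids[name]; ok {
		return id
	}
	id := lookup(name)
	c.ids[name] = id
	return id
}

func main() {
	cache := newZoneIDCache()
	lookups := 0 // only modified inside lookup, which runs under the cache lock

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cache.get("example.com", func(name string) string {
				lookups++
				return "id-for-" + name
			})
		}()
	}
	wg.Wait()
	fmt.Println("lookups:", lookups) // prints "lookups: 1"
}
```

Without the mutex, the same program would read and write the map from many goroutines at once, which is the kind of bug the race detector reports.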

Which brings me to the point of this email: I have implemented a demo of what a "concurrent DNSControl" would look like.  The "tlim_parallel" branch has code that adds "ppreview" and "ppush" commands (parallel preview and push).   Eventually ppreview/ppush would replace preview/push but for now it's nice to have both so I can compare them.

You can try the code yourself:

$ git clone https://github.com/StackExchange/dnscontrol.git
$ cd dnscontrol
$ git checkout tlim_parallel
$ go install
$ (cd to where your dnsconfig.js is)
$ dnscontrol ppreview --slow       # New code, one provider at a time
$ dnscontrol ppreview              # New code, gather all data concurrently

Go has a built-in race detector that finds non-thread-safe code (i.e. providers that don't work when used concurrently).  The resulting binary runs much slower (about 1/10th the speed) but outputs great diagnostics if there is a problem.  Just add the "-race" flag (one hyphen, not two) to the "go install" command:

$ go install -race


Here's an (unscientific) benchmark:

2m50.988s     preview
2m29.477s     ppreview --slow
1m5.072s      ppreview

~3 minutes down to ~1 minute!  A huge improvement!

"ppreview --slow" is an improvement over "preview" because the new code caches the list of zones at each provider.  It turns out a large reason the current code is so slow is that DNSControl queries the list of zones many, many, many times.  Oops!  For my configuration, caching that alone yields about a 35% performance improvement.
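
For anyone curious what "cache the list of zones" can look like in Go, here is a minimal sketch using sync.Once. The provider type, ListZones, and listZonesFromAPI are hypothetical names for illustration; the branch may well do this differently.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical stand-in for one DNS provider client.
type provider struct {
	once  sync.Once
	zones []string
	err   error
	calls int // how many times the API was actually queried
}

// listZonesFromAPI simulates the expensive "list all zones" API call.
func (p *provider) listZonesFromAPI() ([]string, error) {
	p.calls++
	return []string{"example.com", "example.net"}, nil
}

// ListZones queries the API on the first call and returns the cached
// result afterwards. sync.Once also makes this safe to call from
// several goroutines. Note that an error from the first call is cached
// too; a real implementation might want to retry instead.
func (p *provider) ListZones() ([]string, error) {
	p.once.Do(func() {
		p.zones, p.err = p.listZonesFromAPI()
	})
	return p.zones, p.err
}

func main() {
	p := &provider{}
	for i := 0; i < 5; i++ {
		zones, err := p.ListZones()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("zones:", len(zones))
	}
	fmt.Println("API calls:", p.calls) // prints "API calls: 1"
}
```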

I could use help!

* Test providers with "go install -race" and find/fix any problems reported. (The race detector isn't perfect; you might have to run it many, many times to find a problem.)
* Report bugs
* Help design a way to have legacy (non-thread-safe) providers not run concurrently

The current code doesn't implement the "--report filename.json" feature yet.  That'll come soon.  You'll also see a lot of debugging print statements.

Oh, one more interesting tidbit:  One of the reasons I had been avoiding this project is that the run() function in previewPush.go is super complex and difficult to change.  I've rewritten it to be much more maintainable and easier to change.  This will mean adding features in the future will be easier.

Please test it out and tell me what you think!

Enjoy!
Tom

--

Tom Limoncelli (he/him)
SRE TPM, Stack Overflow, Inc.

Tom Limoncelli

Mar 23, 2024, 6:03:11 PM
to DNSControl-discuss
Hey folks!

The code is now stable and ready for mass testing!   I'd particularly appreciate hearing back from sites with many domains (what kind of performance improvement do you gain?).  If I don't receive any major bug reports, this could merge in a week.

Major changes since my last email:
  • There's now a mechanism for providers to opt out of running concurrently.  Right now the only providers verified to work concurrently are: AZURE_DNS, CLOUDFLAREAPI, CSCGLOBAL, GCLOUD, ROUTE53.
  • Changed "--slow" to "--cmode=none"
  • Added "--cmode=all" which runs all providers concurrently (unsafe)
  • "--report filename.json"  now works
  • The "--full" output is more useful for spotting problems
The change is on the tlim_parallel branch in the repo.  The PR is here: https://github.com/StackExchange/dnscontrol/pull/2873
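
To illustrate how an opt-out plus a concurrency mode could fit together, here is a rough Go sketch. It is not the code in the PR: zoneJob, concurSafe, partition, and gather are hypothetical names, and "default" is just a placeholder string for the behavior when no --cmode is given. Jobs whose provider is marked safe are gathered in goroutines; the rest run one at a time.

```go
package main

import (
	"fmt"
	"sync"
)

// zoneJob is a hypothetical unit of gathering work for one domain.
type zoneJob struct {
	domain     string
	concurSafe bool // hypothetical flag: provider is verified to run concurrently
}

// partition splits jobs by a concurrency mode modeled on the --cmode
// values in this email: "none", "all", or anything else for the default
// (concurrent only when the provider is marked safe).
func partition(jobs []zoneJob, cmode string) (concurrent, serial []zoneJob) {
	for _, j := range jobs {
		switch {
		case cmode == "all":
			concurrent = append(concurrent, j)
		case cmode == "none":
			serial = append(serial, j)
		case j.concurSafe:
			concurrent = append(concurrent, j)
		default:
			serial = append(serial, j)
		}
	}
	return concurrent, serial
}

// gather runs fetch for every job: safe jobs in goroutines, the rest
// serially on the calling goroutine. fetch must itself be safe to call
// from multiple goroutines.
func gather(jobs []zoneJob, cmode string, fetch func(domain string)) {
	concurrent, serial := partition(jobs, cmode)
	var wg sync.WaitGroup
	for _, j := range concurrent {
		wg.Add(1)
		go func(d string) {
			defer wg.Done()
			fetch(d)
		}(j.domain)
	}
	for _, j := range serial {
		fetch(j.domain)
	}
	wg.Wait()
}

func main() {
	jobs := []zoneJob{
		{domain: "example.com", concurSafe: true},
		{domain: "example.net", concurSafe: true},
		{domain: "example.org", concurSafe: false},
	}
	var mu sync.Mutex
	fetched := 0
	gather(jobs, "default", func(domain string) {
		mu.Lock()
		defer mu.Unlock()
		fetched++
	})
	fmt.Println("fetched:", fetched) // prints "fetched: 3"
}
```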

Try the new code yourself:

$ git clone https://github.com/StackExchange/dnscontrol.git
$ cd dnscontrol
$ git checkout tlim_parallel
$ go install
$ (cd to where your dnsconfig.js is)
$ dnscontrol preview                 # Old code
$ dnscontrol ppreview --cmode=none   # New code, one provider at a time
$ dnscontrol ppreview                # New code, gather data concurrently when safe
$ dnscontrol ppreview --cmode=all    # New code, gather data concurrently no matter what! (unsafe!)

  (Don't forget the "--full" flag that is useful when debugging.)


The PR description explains how to verify whether your provider supports concurrency: https://github.com/StackExchange/dnscontrol/pull/2873

Please please please test this change, folks!  "ppreview" is totally safe to test out.  Please report any bugs, problems, or suggestions.  Let me know the % performance improvement you see.

Thanks!
Tom

