I'm a little biased as well, but I have direct experience in that area.
I have been using HdrHistogram in the context of OpenStack to measure how well an OpenStack cloud behaves under load. For those not familiar, OpenStack is an open source cloud OS that manages compute nodes and the applications that run in their virtual machines, such as a company's web servers (similar to Amazon Web Services but built on open source software).
I was in charge of measuring how far an OpenStack cloud deployment can scale, and developed a scale application that would load the cloud with a configurable number of web servers and HTTP traffic generators, then pass traffic and measure things like how many requests/sec and how much throughput could be sustained, and with what latency. On our largest cloud, we had hundreds of HTTP servers and hundreds of HTTP traffic generators running on different physical compute nodes, simulating hundreds of thousands of users. The challenge was to summarize the latency of these thousands of emulated users that were distributed all over the place.
The traffic generator I used was Gil's wrk2 (an open source fork of wrk on GitHub), which uses the C version of HdrHistogram (written by Mike Barker).
That version simply displayed percentile latency information on the console.
What we did was fork that version to use the latest HdrHistogram and output the encoded histogram (in compressed, base64 form) on stdout, then have our agent on each compute node pick up that encoded histogram and forward it to our aggregator (in our case we used Redis as the transport).
The aggregator/main orchestration app is Python-based, which is why I ended up porting Gil's HdrHistogram to Python (it took me much more time than I wanted, but it was worth the effort). The aggregator receives all these encoded histograms, decodes them, and aggregates them on the fly.
In terms of scale we have tested up to several hundred traffic generators.
In simple runs we would receive one histogram per traffic generator at the end of the scale run (e.g. after 10 minutes).
In monitoring runs, the traffic generators would send histograms every 5 seconds, resulting in an avalanche of histograms arriving at the aggregator every 5 seconds.
Because the HdrHistogram decoder has been optimized for speed, we could very easily handle a few hundred decodes per 5-second period.
I have no doubt this can scale to thousands of distributed histogram collection points if the transport used is adequate.
Building on this, we are considering extending the scale app to support distributed storage access (still in the context of cloud). As far as I know, there are very few (if any) tools that do distributed performance measurement with histograms. I think this is extremely useful and provides invaluable information on the behavior of any distributed system under load.
So to answer that question, this works pretty well ;-)