Alex and Luke, thank you for your kind words!
Regarding your questions, Luke. #3: The authors of async-profiler promise that its overhead is quite low. Some experimentation and benchmarking would be needed, of course, but I think it is possible to keep the profiler always on. It would take some extra work, though: async-profiler doesn't flush intermediate data, so an "always on" profiler would probably have to be restarted at fixed intervals, say every minute.
Re: #2. That would actually be quite simple. The format of the folded stacktraces before they hit the flamegraph is the following:
<stacktrace> <number-of-occurrences>
For example:
clojure.core$main.invoke;my_app$_main;my_app$init-config; 1
clojure.core$main.invoke;my_app$_main;my_app$do_hard_work;my_app$burn_cpu; 254
You can merge such files from multiple machines by concatenating them and summing the counts wherever the stacks match.
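To make the merging step concrete, here is a minimal sketch in Python. The language and the function names (`merge_folded`, `format_folded`) are my own choices for illustration and not part of any profiler tooling. It only assumes the line format shown above: the count is the last whitespace-separated token and everything before it is the stack.

```python
from collections import Counter
from typing import Iterable, List


def merge_folded(lines: Iterable[str]) -> Counter:
    """Sum the sample counts of identical stacks across folded-stack lines."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right: the last token is the count,
        # the rest (which may itself contain spaces) is the stack.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"not a folded-stack line: {line!r}")
        stack, count = parts
        totals[stack] += int(count)
    return totals


def format_folded(totals: Counter) -> List[str]:
    """Render merged totals back into folded-stack lines, sorted by stack."""
    return [f"{stack} {count}" for stack, count in sorted(totals.items())]


if __name__ == "__main__":
    # Hypothetical dumps from two machines, in the format shown above.
    machine_a = [
        "clojure.core$main.invoke;my_app$_main;my_app$init-config; 1",
        "clojure.core$main.invoke;my_app$_main;my_app$do_hard_work;my_app$burn_cpu; 254",
    ]
    machine_b = [
        "clojure.core$main.invoke;my_app$_main;my_app$do_hard_work;my_app$burn_cpu; 100",
    ]
    for out in format_folded(merge_folded(machine_a + machine_b)):
        print(out)
```

With the two sample inputs, the `burn_cpu` stack ends up with a count of 354 and the `init-config` stack keeps its count of 1. The merged lines are in the same format as the inputs, so they can be fed to the flamegraph step as if they came from a single machine.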
Best regards,
Alex