Hi All,
I was looking at this code again yesterday and noticed that I'd missed a couple of potential optimizations. Specifically, when inserting an element which already exists in the set/map, or deleting an element which does not exist in the set/map, both my implementation and the original implementation (in FSharp.Core) return the tree without altering it. However, this fails to take into account that the Insert and Delete methods are recursive: even though the original tree is returned, it is rebalanced repeatedly while the recursion unwinds. I added a few lines of code to detect this case and avoid the unnecessary rebalancing, and pushed the improvements to GitHub (
https://github.com/jack-pappas/fs-core-optimized/tree/master/Replacements).
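To make the rebalancing issue concrete, here is a minimal sketch of the idea (written in OCaml, which is close to the F# involved; this is NOT the actual FSharp.Core or fs-core-optimized code, and the rebalancing step is stubbed out to just recompute heights). The point is the physical-equality check: when an insert finds the element already present it returns the subtree unchanged, and each caller detects this and skips rebalancing as the recursion unwinds.

```ocaml
(* Minimal AVL-set sketch (illustrative only, not the FSharp.Core code). *)
type tree =
  | Empty
  | Node of tree * int * tree * int   (* left, value, right, height *)

let height = function Empty -> 0 | Node (_, _, _, h) -> h

let mk l v r = Node (l, v, r, 1 + max (height l) (height r))

(* A real implementation would restore the AVL invariant here; for this
   sketch it only recomputes the height, which is enough to show where
   work is avoided. *)
let rebalance l v r = mk l v r

let rec insert x t =
  match t with
  | Empty -> Node (Empty, x, Empty, 1)
  | Node (l, v, r, _) ->
      if x = v then t                  (* already present: return t as-is *)
      else if x < v then
        let l' = insert x l in
        if l' == l then t              (* subtree unchanged: skip rebalance *)
        else rebalance l' v r
      else
        let r' = insert x r in
        if r' == r then t              (* subtree unchanged: skip rebalance *)
        else rebalance l v r'
```

Without the `l' == l` / `r' == r` checks, inserting an existing element would still call `rebalance` once per level on the way back up, even though nothing changed.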
These improvements don't seem to make much difference in the general case: in the random set creation test, the improvement was barely noticeable. However, the Intersection and Union operations make heavy use of Insert and Delete, and re-running my simple tests with random sets (in the PerfComparison project) shows a marked improvement: the Union operation now runs twice as fast as the one in FSharp.Core, and Intersect is about 50% faster. I re-ran the "compiling the F# source distribution" test and found the results to be mostly the same, except that a couple of the larger projects (the compiler itself, IIRC) again ran slightly faster compared to the previous results (another 1-2%). I imagine that applications which make heavy use of Set and Map will see a larger improvement, though.
Responding to Anton's points:
2. Sorting by Self (Avg.) is not meant to find methods which, if optimized, would have the greatest impact on the total runtime. It is meant to find the methods which are the easiest to optimize; whether those methods are _actually_ optimizable, and whether optimizing them would noticeably improve application performance, is left up to you to decide. It is generally much harder to make any kind of *significant* improvement in a method which already runs in 0.001 sec (the law of diminishing returns). In general, the best way to optimize such methods is simply to eliminate them, but that often entails more significant architectural changes, if it is possible at all.
3. Adaptive implementations can be useful (e.g., in Dictionary<'Key, 'Value>), but they're only going to matter if the impact on application performance is significant. For example, your stack-based approach may be 5x faster than Stack<'T> for small inputs, but unless it provides noticeable performance gains in real-world applications, there's no reason to take on the additional complexity in the code. Dictionary<'Key, 'Value> can also determine the number of elements it holds in O(1) time; the F# Set and Map types don't keep track of the number of elements they contain (though they easily could, if necessary), so deciding which method to use -- that is, whether the input is "small" or "large" -- requires a complete traversal of the structure just to get the element count, and that overhead has to be added to the execution time of the adaptive algorithm. I'm not saying we *shouldn't* go the adaptive route, just that there are some additional considerations in this case; as always, the only way to know is to write it and benchmark it :)
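The trade-off can be sketched like this (again in OCaml as a stand-in for the F# types; the names and the cutoff value are illustrative, not from any real implementation). Because the structure caches no count, the adaptive dispatch itself pays an O(n) traversal before doing any real work:

```ocaml
(* Illustrative sketch: adaptive dispatch over a size-oblivious tree. *)
type 'a tree = Empty | Node of 'a tree * 'a * 'a tree

(* O(n): the whole structure must be walked just to learn its size,
   mirroring F# Set/Map, which do not cache their element count. *)
let rec count = function
  | Empty -> 0
  | Node (l, _, r) -> count l + 1 + count r

(* In practice this cutoff would come from benchmark inflection points. *)
let small_cutoff = 16

let adaptive_op small_impl large_impl t =
  (* The O(n) count is paid on every call, so it must be charged
     against whatever the adaptive choice saves. *)
  if count t <= small_cutoff then small_impl t else large_impl t
```

Caching the count in each node would make the dispatch O(1), at the cost of an extra word per node and slightly more work on every update.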
4. This is a good idea. There are a couple of other things to consider when creating the test data distributions:
- Set density: different data structures or different implementations of the same data structure may have performance characteristics which vary based on the _density_ or distribution of the input set. For example, the Patricia tries used in IntSet and IntMap are perfectly balanced (i.e., they have the best performance) when the element/key set is contiguous; however, they're not self-balancing so the right (or wrong) data set will cause them to become badly unbalanced and result in terrible performance.
- Make sure to include overlapping or non-existent data values -- as I've shown here, the way these are handled can significantly impact the performance of higher-level operations.
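The distribution concerns above could be captured by a few simple generators. This is a hypothetical sketch (OCaml; all names and constants are illustrative): a contiguous key range, a sparse pseudo-random range, and a pair of inputs with controlled overlap so the insert-existing / delete-missing paths actually get exercised.

```ocaml
(* Hypothetical test-data generators for the distributions discussed above. *)

(* Dense, contiguous keys: the best case for a Patricia trie. *)
let contiguous n = List.init n (fun i -> i)

(* Widely-spread keys via a Knuth-style odd multiplier; a Patricia trie's
   shape depends on key bits, so this stresses it differently. *)
let sparse n = List.init n (fun i -> (i * 2654435761) land 0x3FFFFFFF)

(* Two inputs sharing [overlap] elements, so that union/intersection hit
   both the "element already present" and "element missing" code paths. *)
let overlapping n overlap =
  let a = List.init n (fun i -> i) in
  let b = List.init n (fun i -> i + n - overlap) in
  (a, b)
```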
5. Another good idea. It'd be easy enough to hack up a custom version of FSharp.Core which uses NLog (or log4net, or whatever) to record the operations and input sizes. It would also be a good idea to run the benchmarks on multiple machines and multiple OSes to see whether there are any critical differences between 32- and 64-bit machines, or between Windows/Mac/FreeBSD/Linux.
6. As for the measurements, it's important to be very precise about the test setup, since there are so many obscure factors that can come into play. For this kind of code (performance- and memory-intensive), CPU cache size and main-memory speed/latency are going to make a real difference, so it'd be worth recording that information when you implement (5). This matters even if you use the same machine and setup for benchmarking various optimizations, because these factors influence where the inflection points occur in your performance curves, and those are usually where you would put the cutoff for an adaptive algorithm (i.e., the input size that marks the boundary between what you consider a "small" and a "large" input).
Some additional benchmarking notes:
- ALWAYS make sure to include a "warm-up" step in your benchmarking tool which hits all of the code paths you intend to test. Generic types/methods (heavily used in F#) are instantiated (specialized) by the JIT compiler once for all reference types combined and once for each value type; for example, if you have a generic method Foo<'T> and you call it as Foo<obj>, Foo<string>, Foo<int>, Foo<int64>, and Foo<DateTime>, the JIT compiler will emit the method four (4) times: once shared by obj/string, and once each for int, int64, and DateTime. Make sure you run the warm-up procedures accordingly, otherwise the JIT overhead could skew the results. Running the warm-up procedure also eliminates the overhead of finding and loading any referenced assemblies used within the code (since assemblies are loaded on demand).
- Run each benchmark a few times to make sure your results are repeatable. Unknown factors can influence the tests in random, non-repeatable ways. For example, when I ran the "compile the F# sources" test for the new improvements (above), I ran it three times in a row and the results were slightly different each time; the first run showed a small *loss* in performance in a few projects, but after re-running the tests it appears those results were flukes.
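The overall shape of a harness that follows both notes might look like this (an OCaml sketch; the .NET-specific JIT-specialization detail above doesn't translate directly, so here the warm-up simply runs the workload once before timing, and every run is reported so a fluke is visible rather than silently skewing a single number):

```ocaml
(* Minimal benchmarking-harness sketch: warm-up, then several timed runs. *)
let benchmark ~runs f =
  ignore (f ());                        (* warm-up: hit the code path once *)
  List.init runs (fun _ ->
    let t0 = Sys.time () in
    ignore (f ());
    Sys.time () -. t0)

(* Report the best and the median run rather than a single measurement. *)
let summarize times =
  let sorted = List.sort compare times in
  (List.nth sorted 0,                          (* best run *)
   List.nth sorted (List.length sorted / 2))   (* median run *)
```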
Is there anyone in the community interested in helping Anton set up and run these benchmarks? I'm happy to do the tuning and tweaking to squeeze more performance out of the code, but I can't do that without a good benchmarking setup and I don't have the time to write it myself.
One last thought -- I hope this thread has helped spark some discussion within Microsoft with regards to accepting upstream patches to F#. If it has, I don't know if there's been any opposition to it (hopefully not). In any case, if Microsoft decides not to accept patches to the compiler itself, I hope they'll at least consider opening other parts of F# (like FSharp.Core) to the community. I certainly don't mind pitching in to help make F# even better than it already is -- and I know others want to help too -- but if the work isn't going to benefit the wider F# world there's not much point in it.
Regards,
Jack