|Optimized Set and Map implementations||Jack Pappas||4/17/13 9:48 AM|
A few months back I was implementing a data structure based on AVL trees (a balanced DIET, if you care). Out of curiosity, I benchmarked my AVL tree implementation against the Set type in FSharp.Core (also implemented with an AVL tree), and was surprised to find mine was significantly faster in a number of test cases.
Fast-forward to this past weekend: with the recent conversation here and on GitHub about convincing Microsoft to accept community contributions into the "real" (upstream) F# codebase, I remembered the benchmark results and thought it would be interesting to create optimized implementations of the Set and Map types.
If you want to skip right to the code, get the 'set.fs' and 'map.fs' files from my GitHub:
Copy them into the /src/fsharp/FSharp.Core folder in your clone of the F# repository and recompile/install the binaries. (I also copied the compiled FSharp.Core into the lib/bootstrap/4.0 folder so it would be used the next time I compile the sources.)
You can run the PerfComparison project included in the repository if you want to try some basic benchmarks. The results from running the benchmark on my desktop are below; "Baseline" is the standard F# Set, "Result"/"Result 1" are my optimized Set implementation, and "Result 2" is the IntSet implementation from my ExtCore library (also available on GitHub and NuGet):
Create Random Set<int> (n=1000000)
Baseline: 5872.456200 (ms)
Result 1: 5033.890000 (ms)
Result 2: 4718.062200 (ms)
Create Random Set<int64> (n=1000000)
Baseline: 6194.790800 (ms)
Result : 5664.715700 (ms)
Union Random Set<int> (n=1000, N=10000)
Baseline: 11125.811500 (ms)
Result : 7066.950600 (ms)
Union Random Set<int64> (n=1000, N=10000)
Baseline: 15656.635400 (ms)
Result : 10730.533600 (ms)
Intersect Random Set<int> (n=1000, N=10000)
Baseline: 652.551900 (ms)
Result : 326.114700 (ms)
Intersect Random Set<int64> (n=1000, N=10000)
Baseline: 101.318400 (ms)
Result : 57.178600 (ms)
I haven't benchmarked Map yet, or any of the other Set functions, so if anyone wants to do that, please feel free.
After compiling FSharp.Core with my Set and Map code, I re-compiled everything to see if there was a noticeable difference in speed. The "before" and "after" times were fairly close, but it looks like the optimizations consistently shaved off about 3-5% of the compilation time. I saved the timing results so if you want to see the details, let me know and I'll post them (or you could perform the same test on your own machine).
Hopefully, this will get some discussion going ;)
|Re: Optimized Set and Map implementations||Adam Granicz||4/17/13 11:20 AM|
That's pretty awesome Jack, thanks!
Adam Granicz, IntelliFactory
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||4/17/13 11:29 AM|
Good results! It sounds like some serious improvements on some of the benchmarks.
Incidentally, I was "rewriting" maps too over the weekend, hoping to speed up the F# compiler. The maps I have were basically ported from OCaml.Core, with some inlining to propagate the comparator function. I was hoping for better results on Int/Int64 maps, but they were only about 10% faster, so I thought the gain wasn't going to be significant. I haven't gotten to testing inside the compiler yet, but will try to do so. It would also be fun to run your benchmarks.
I think the maps in the F# library use an indirection layer to support binary serialization. Because of that, we probably can't change them much; we'll have to preserve the same behavior, whatever it was. We can, though, swap out all the maps/sets in the F# compiler to use a different implementation.
I actually think saving 5% of compilation time is pretty good, considering that maps are an easy-to-understand component to swap in and out.
I would also want to experiment with some weird adaptive map designs, specifically in the compiler.
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||4/17/13 11:30 AM|
Sorry about the OCaml.Core typo - I meant Jane Street Core of course (https://github.com/janestreet/core/blob/master/lib/core_map.ml). Thanks. --A
|Re: Optimized Set and Map implementations||Jack Pappas||4/17/13 1:22 PM|
Adam -- Thanks!
At one point, I tried a few different tricks to inline the comparer function, but none of them were as fast as just passing it around. I have one other idea that might accomplish that, though it's quite complicated and I still need to think up some way to handle the serialization before it's a viable alternative.
My Set/Map implementations use a similar style to the existing Set/Map implementations -- they both define an internal type which is the "real" implementation of the AVL tree and a public type which wraps the tree with a nice API (and also handles serialization/deserialization).
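The two-layer structure described above can be sketched roughly like this (a sketch only; the type and member names are illustrative, not the actual FSharp.Core definitions):

```fsharp
// Internal "real" AVL tree: a bare discriminated union holding the
// balancing logic, with the comparer passed explicitly to every function.
type internal AvlTree<'T> =
    | Empty
    | Node of AvlTree<'T> * AvlTree<'T> * 'T * int   // left, right, value, height

module internal AvlTree =
    let rec contains (comparer : System.Collections.Generic.IComparer<'T>) value tree =
        match tree with
        | Empty -> false
        | Node (l, r, x, _) ->
            let c = comparer.Compare (value, x)
            if c = 0 then true
            elif c < 0 then contains comparer value l
            else contains comparer value r

// Public wrapper: resolves the comparer once per instantiation, exposes
// the friendly API, and is where serialization support would live.
[<Sealed>]
type OptSet<'T when 'T : comparison> internal (tree : AvlTree<'T>) =
    static let comparer = LanguagePrimitives.FastGenericComparer<'T>
    member __.Contains value = AvlTree.contains comparer value tree
```

The key point of the pattern is that the comparer is resolved once in the wrapper rather than looked up at every call site.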
The "standard" Set and Map implementations work similarly to those in the OCaml Core and Jane Street Core libraries. There are a few key differences in my implementations which provide the speedup:
Anyway, I think it's quite feasible we could squeeze a good bit more performance out of the compiler by just going through and fixing all of the low-hanging fruit (like this). Did you see the filtered profiling data I posted on the GitHub issue you opened? It should provide a decently accurate work-list of things that can be fixed to speed things up. As I mentioned in the issue comments -- I've made some progress on a new lexer, which I'll post the details of here sometime tomorrow.
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||4/17/13 2:40 PM|
Once again - congrats on achieving some speedup.
Now, please bear with me, I am an extremely suspicious guy and I like to question everything.
1. Concerning the data table you posted based on my F# fsc bootstrapping profile: can you explain your method? I suspect things might be flawed in the first place with my profile, since it might not account for TCO correctly; second, I didn't quite get how you computed a correction to re-order the methods. Could you explain that, or better still, is there a guide somewhere on effectively profiling F#/.NET applications? :) My limited experience with this has been quite painful: the methods I tried to change didn't make a dent in overall timing.
2. Concerning maps/sets: we have to be quite careful not to jump to conclusions, since simple benchmarks often lie (unless we are benchmarking a specific application; I like the bootstrapping FSC benchmark). That is, typical benchmarks are not constructed in a way that represents how the library is going to be used in the actual application.
Specifically about your points:
(a) Removing SetOne - interesting idea, I'd like to try myself to verify this helps though. Does the null representation trick not work at all or as well if the leaf node is kept?
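For readers following along, the two representations under discussion look roughly like this (a sketch; the stock FSharp.Core cases are actually named SetEmpty/SetOne/SetNode):

```fsharp
// Variant 1: a dedicated case for singleton subtrees, as in stock FSharp.Core.
// SetOne avoids allocating a node with two Empty children for every leaf.
type SetTree1<'T> =
    | SetEmpty
    | SetOne of 'T
    | SetNode of SetTree1<'T> * 'T * SetTree1<'T> * int

// Variant 2: drop the singleton case, and represent the empty tree as null
// so that Empty costs no allocation and matching on it is a null check.
[<CompilationRepresentation(CompilationRepresentationFlags.UseNullAsTrueValue)>]
type SetTree2<'T> =
    | Empty
    | Node of SetTree2<'T> * 'T * SetTree2<'T> * int
```

Whether the null-representation trick pays off with or without a retained singleton case is exactly the measurement question raised here.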
(b) I do not fully understand this, but I cannot see how it would ever hurt, so I would support the change.
(c) IMO this is definitely a bad idea. Please have a look at the IL; F# does an excellent and consistent job with curried functions. If you develop an intuition for it, it is easy to predict what happens in the IL. In particular, module-level curried let-bound functions are always compiled to N-argument methods in IL. Also, tail-recursive code becomes loops inside methods. I think that, if anything, tupling hurts in both OCaml and F#.
(d) Very skeptical here. I would be very curious to see your exact benchmarks. How can this possibly help? Let us take a simple example like iter. Accessing Stack<'T> involves invoking methods and doing multiple checks; using the machine stack bypasses those checks. I would assume in 90% of the cases it's faster. This is why, e.g., OCaml uses a non-tail-call List.rev. Here's a little dumb iter benchmark; the simple code is 5x better than your Stack<'T>-based iter code on my machine: https://gist.github.com/toyvo/81dc8871afa2e6af81b5
However, nice trick there with `FSharpFunc<_,_,_>.Adapt f` - this helps a lot, good idea.
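A minimal illustration of the Adapt trick (hypothetical iter functions over an association list, just to show the call shape):

```fsharp
// Curried application: calling a first-class 'K -> 'V -> unit value can cost
// two closure invocations per element.
let iterSlow (f : 'K -> 'V -> unit) (pairs : ('K * 'V) list) =
    pairs |> List.iter (fun (k, v) -> f k v)

// Adapt converts the curried f into an OptimizedClosures.FSharpFunc<'K,'V,unit>
// once, up front, so each element costs a single Invoke call.
let iterFast (f : 'K -> 'V -> unit) (pairs : ('K * 'V) list) =
    let f = OptimizedClosures.FSharpFunc<_,_,_>.Adapt f
    pairs |> List.iter (fun (k, v) -> f.Invoke (k, v))
```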
(e) Right. That sounds like the way to go. Specifically in the compiler, however, there are two variations on this theme: first there are "Tagged Collections" where you *can* pass a specific comparer - so there it can sit in the instance; second, there are maps and sets with integer (int32, int64) keys - for these comparer can be entirely inlined to bypass the virtual call. I have some code for that but it didn't help as much as I thought it would.
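The int-key specialization mentioned here can be sketched as follows: with the key type fixed, the comparisons compile down to primitive integer comparisons, with no IComparer virtual dispatch on the search path (illustrative type and function names):

```fsharp
// A tree specialized to int keys; no comparer parameter anywhere.
type IntTree =
    | Empty
    | Node of IntTree * IntTree * int * int   // left, right, key, height

// The compare is a primitive int comparison, so there is no virtual
// call per visited node.
let rec contains (key : int) tree =
    match tree with
    | Empty -> false
    | Node (l, r, k, _) ->
        if key = k then true
        elif key < k then contains key l
        else contains key r
```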
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||4/17/13 2:49 PM|
Here's a little loop unrolling that helps a tiny bit more:

let iter f tree =
    let action = FSharpFunc<_,_,_>.Adapt f
    let inline f (kvp : KeyValuePair<_,_>) =
        action.Invoke (kvp.Key, kvp.Value)
    let rec loop tree =
        match tree with
        | Empty -> ()
        | Node (Empty, Empty, kvp, _) -> f kvp
        | Node (Empty, r, kvp, _) -> f kvp; loop r
        | Node (l, Empty, kvp, _) -> loop l; f kvp
        | Node (l, r, kvp, _) -> loop l; f kvp; loop r
    loop tree
|Re: Optimized Set and Map implementations||Michael Newton||4/18/13 2:08 AM|
This is the kind of area where I'd really like to know whether MS are going to pull changes back in for future Visual Studio versions. I understand completely that MS are driving the language design side of things, but there is a lot of work that could be done on the tooling side if we thought it wasn't going to get thrown away with the next VS release.
Michael
|Re: Optimized Set and Map implementations||Dave Thomas||4/18/13 2:11 AM|
That's why open source tooling will be safer. Once F# 3.1 comes out, we should be able to roll the changes into the F# binding and the tools stemming from the compiler. Performance optimisations might be harder to get back into the Microsoft code base, though.
|Re: Optimized Set and Map implementations||Michael Newton||4/18/13 2:29 AM|
Which is great for my personal projects, but it isn't going to help with wider adoption/work where we all use VS.
|Re: Optimized Set and Map implementations||Dave Thomas||4/18/13 3:14 AM|
I think the community needs to build some of these tools, yes it's a lot of work but the user base is not large enough for Microsoft to commit a large budget especially as C# already has these sorts of features. It's sad but true, we really need to start helping ourselves rather than waiting on Microsoft.
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||4/18/13 4:53 AM|
I agree with Dave. Let us cultivate the programmer virtue of hubris (http://c2.com/cgi/wiki?LazinessImpatienceHubris). We can do it! I claim that in a year, "we all use VS" will turn into 50% at most. People will move to Emacs, alternate IDEs like Xamarin Studio, and hopefully the in-browser one I am working on - CloudSharper :)
In addition, working on collections seems to be a good angle because:
(1) If you get better collections, you can release the resulting collections under a different namespace and have people benefit from them in other projects.
(2) If you get better collections and *convincingly* demonstrate the advantage, I think it would be very tempting for MS to pull the change into their compiler code. This is one of those places where it is relatively easy to verify correctness.
|Re: Optimized Set and Map implementations||Jack Pappas||4/18/13 4:04 PM|
It would be nice, though, if some of the other F# developers could corroborate the speedup by compiling the code and trying it out in their applications. (If anyone wants to try the optimized code but can’t/won’t compile it themselves, let me know and I’ll post it as a NuGet package.)
I’m not one to throw some stranger’s “improvements” into a production application without thoroughly vetting it, so even if we do get some corroboration on the optimizations I’d want to see a complete set of benchmarks run against both the standard FSharp.Core and my version – if only to see if there are any cases where the improvements are, in fact, regressions.
Addressing your specific points about the optimizations I made to Map and Set:
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||4/18/13 6:21 PM|
Thanks for the detailed response. You convinced me on #5, we'll measure #1 (presence of Leaf node) to find out what works best, and #3 is a stylistic issue I have no interest in pursuing further - if you prefer tupled style that's OK :)
Now, developing on the ideas for improving the implementation:
1. The profile I collected is a 30G binary file, I still have to figure out how to share it. I might have posted truncated data, so you don't get the actual sort order you want.
2. Sorting by avg(self) brings up methods that do not necessarily have a big impact on total runtime. Consider: 1,000,000 invocations of M1 with avg(self) = 0.001 sec gives total(self) = 1,000 sec; but then 1 invocation of M2 with avg(self) = total(self) = 1 sec. You are going to see M2-like methods at the top of the table, when we are really after M1-like methods. Even if we optimize the heck out of M2, it will only save us 1 sec, 0.1% of total runtime in the example. The best bet seems to be sorting by total self time, provided the profiler correctly estimates self time in the presence of tail calls; I still have to check this in the Mono profiler.
3. There are plenty of examples besides memory use where winning on a benchmark is a moot point. Since we are discussing this anyway, one great example is using stack vs heap (such as System.Collections.Generic.Stack<'T>) for control flow in recursive procedures. On *large* inputs, using the heap may be faster, or even the only way forward (because of stack overflow). On *small* inputs, stack is way faster. This is why on my benchmark my `iter` beats yours - the benchmark involves a *small* enough Map. This is also why Dictionary uses adaptive, different representations for small and large tables.
4. Since you can have V1 outperform V2 on small maps, and V2 outperform V1 on large maps, the real question is how to construct the *definitive* benchmark that we will use to pick the better implementation. My intuition tells me that the statistical distribution of input sizes to procedures is shaped by a power law (lots of small inputs, and occasionally very big inputs) - but I have not found time to measure yet. I am looking forward to doing this little study on the F# compiler, to find out an approximation of the statistical distribution of input sizes for various functions. Then the distribution can be used to cheaply generate "informed" benchmarks.
5. Another idea, which might be our best bet, is to instrument the compiler to record and save to disk all inputs to Map.* functions observed during bootstrapping. Input size is important but not the only factor. For example, I could trace the skeletons of all int- and int64-keyed maps operated on during the bootstrapping; this gives size and a little more precision for things like balancing. We could then construct the benchmark by replaying the trace using the implementation under test. I am not sure if this is practical, as it depends on the size of the recorded data, but if so, it could be considerably faster than going through the complete F# bootstrapping process. Such a benchmark would be the most convincing, after actually timing the bootstrapping process itself, of course. The benchmark would also measure the code under different memory conditions.
6. Good question about my measurements - I have not been doing measurements consistently, but from now on I'll try to stick to .NET 4.0, AnyCPU, Windows, x64 - I think this is the configuration I'm most interested in for performance.
|Re: Optimized Set and Map implementations||Jack Pappas||4/23/13 1:22 PM|
I was looking at this code again yesterday and noticed that I'd missed a couple of potential optimizations. Specifically, when inserting an element which already exists in the set/map, or deleting an element which does not exist in the set/map, both my implementation and the original implementation (in FSharp.Core) return the tree without altering it. However, this fails to take into account that the Insert and Delete methods are recursive: even though the original tree is returned, it's rebalanced repeatedly while unwinding the recursion. I added a few lines of code to detect this and avoid the unnecessary rebalancing, and pushed the improvements to GitHub (https://github.com/jack-pappas/fs-core-optimized/tree/master/Replacements).
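A sketch of the kind of check described here (illustrative code, not the actual patch): when the recursive call returns the child physically unchanged, return the current node as-is instead of rebuilding and rebalancing it on the way back up.

```fsharp
type Tree<'T> =
    | Empty
    | Node of Tree<'T> * Tree<'T> * 'T * int   // left, right, value, height

// Stand-in for the real AVL rebalancing step; here it just rebuilds the node.
let mkBalanced l r x =
    let height t = match t with Empty -> 0 | Node (_, _, _, h) -> h
    Node (l, r, x, 1 + max (height l) (height r))

let rec insert (comparer : System.Collections.Generic.IComparer<'T>) value tree =
    match tree with
    | Empty -> Node (Empty, Empty, value, 1)
    | Node (l, r, x, _) ->
        let c = comparer.Compare (value, x)
        if c = 0 then tree   // already present: return the tree unchanged
        elif c < 0 then
            let l' = insert comparer value l
            // If the child came back physically unchanged, this subtree is
            // unchanged too, so skip the rebuild/rebalance while unwinding.
            if obj.ReferenceEquals (l', l) then tree
            else mkBalanced l' r x
        else
            let r' = insert comparer value r
            if obj.ReferenceEquals (r', r) then tree
            else mkBalanced l r' x
```

Delete gets the symmetric treatment: if the element isn't found, the subtree comes back physically unchanged, and no rebalancing happens on the way out.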
These improvements don't seem to make much difference in the general case: in the random set creation test, the improvement was barely noticeable. However, the Intersection and Union operations make heavy use of Insert and Delete, and re-running my simple tests with random sets (in the PerfComparison project) shows a marked improvement: the Union operation now runs twice as fast as the one in FSharp.Core, and Intersect is about 50% faster. I re-ran the "compiling the F# source distribution" test and found the results to be mostly the same, except that a couple of the larger projects (the compiler itself, IIRC) again ran slightly faster compared to the previous results (another 1-2%). I imagine that any applications which make heavy use of Set and Map will see more improvement, though.
Responding to Anton's points:
2. Sorting by Self (Avg.) is not meant to find methods which, if optimized, would have the greatest impact on the total runtime. It is meant to find the methods which are the easiest to optimize; whether those methods are _actually_ optimizable, or whether doing so would noticeably improve application performance, is left up to you to decide. It is generally much harder to make any kind of *significant* improvement to a method which runs in 0.001 sec (i.e., the law of diminishing returns). In general, the best way to optimize such methods is just to eliminate them, but that often entails more significant architectural changes, if it is possible at all.
3. Adaptive implementations can be useful (e.g., in Dictionary<_,_>), but they're only going to matter if the impact on application performance is significant. For example, your stack-based approach may be 5x faster than Stack<'T> for small inputs, but unless it provides noticeable performance gains in real-world applications, there's no reason to take on the additional complexity in the code. Dictionary<_,_> also determines the number of elements it holds in O(1) time; the F# Set and Map types don't keep track of the number of elements they contain (though they easily could, if necessary), so deciding which method to use -- that is, whether the input is "small" or "large" -- requires a complete traversal of the structure to count the elements, and you have to add that overhead to the execution time of the adaptive algorithm. I'm not saying we *shouldn't* go the adaptive route, just that there are some additional considerations to be made in this case; as always, the only way to know is to write it and benchmark it :)
4. This is a good idea. There are a couple of other things to consider when creating the test data distributions:
- Set density: different data structures or different implementations of the same data structure may have performance characteristics which vary based on the _density_ or distribution of the input set. For example, the Patricia tries used in IntSet and IntMap are perfectly balanced (i.e., they have the best performance) when the element/key set is contiguous; however, they're not self-balancing so the right (or wrong) data set will cause them to become badly unbalanced and result in terrible performance.
- Make sure to include overlapping or non-existent data values -- as I've shown here, the way these are handled can significantly impact the performance of higher-level operations.
5. Another good idea. It'd be easy enough to hack up a custom version of FSharp.Core which uses NLog (or Log4Net, or whatever) to record the operations and input sizes. It would also be a good idea to test the benchmarks on multiple machines and multiple OS's to see if there are any critical differences between 32- and 64-bit machines, or between Windows/Mac/FreeBSD/Linux.
6. As for the measurements, it's important to be very precise about the test setup, since so many obscure factors can come into play. For this kind of performance- and memory-intensive code, CPU cache size and main-memory speed/latency are going to make a real difference, so it'd be worth recording that information when you implement (5). This matters even if you use the same machine and setup for benchmarking various optimizations, because these factors influence where the inflection points occur in your performance curves, and those are usually where you would put the cutoff for an adaptive algorithm (i.e., the input size which is the boundary between what you consider a "small" or "large" input).
Some additional benchmarking notes:
- ALWAYS make sure to include a "warm-up" step in your benchmarking tool which hits all of the code paths you intend to test. Generic types/methods (heavily used in F#) are instantiated (specialized) by the JIT compiler once for all reference types and once for each value type; for example, if you have a generic method Foo<'T> and you call it as Foo<obj>, Foo<string>, Foo<int>, Foo<int64>, and Foo<DateTime>, the JIT compiler will emit the method four (4) times: once shared for obj/string, and once each for int, int64, and DateTime. Make sure you run the warm-up procedures accordingly; otherwise the JIT overhead could skew the results. Running the warm-up procedure also eliminates the overhead of finding and loading any referenced assemblies used within the code (since assemblies are loaded on demand).
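A warm-up step in a benchmark driver can be as simple as this sketch (function names are illustrative):

```fsharp
open System.Diagnostics

// Run the workload once, untimed, so that JIT specialization and
// on-demand assembly loading happen outside the measured region.
let benchmark name (warmup : unit -> unit) (run : unit -> unit) =
    warmup ()
    let sw = Stopwatch.StartNew ()
    run ()
    sw.Stop ()
    printfn "%s: %f (ms)" name sw.Elapsed.TotalMilliseconds

// Set<int> and Set<int64> are separate JIT specializations, so each
// value-type instantiation needs its own warm-up call.
benchmark "Create Set<int>"
    (fun () -> Set.ofList [ 1; 2; 3 ] |> ignore)
    (fun () -> Set.ofList [ 1 .. 1000000 ] |> ignore)
```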
- Run each benchmark a few times to make sure your results are accurate. Unknown factors can influence the tests in random, non-repeatable ways. For example, when I ran the "compile the F# sources" test for the new improvements (above), I ran it three times in a row and the results were slightly different each time; the first run showed a small *loss* in performance in a few projects, but after re-running the tests it appears those results were flukes.
Is there anyone in the community interested in helping Anton set up and run these benchmarks? I'm happy to do the tuning and tweaking to squeeze more performance out of the code, but I can't do that without a good benchmarking setup and I don't have the time to write it myself.
One last thought -- I hope this thread has helped spark some discussion within Microsoft with regards to accepting upstream patches to F#. If it has, I don't know if there's been any opposition to it (hopefully not). In any case, if Microsoft decides not to accept patches to the compiler itself, I hope they'll at least consider opening other parts of F# (like FSharp.Core) to the community. I certainly don't mind pitching in to help make F# even better than it already is -- and I know others want to help too -- but if the work isn't going to benefit the wider F# world there's not much point in it.
|Re: Optimized Set and Map implementations||Marius Fersigan||4/24/13 1:21 PM|
You can share large data (files) using a (private) torrent. Here's a link about it:
|Re: Optimized Set and Map implementations||Anton Tayanovskyy||5/9/13 5:34 AM|
Jack - apologies for dragging this on - but I have not given up yet on constructing the profile. It is hard to find time for this.
Also, a note for the future: I recently came across purely functional hash tables; apparently e.g. Racket has them, and they have amortized O(1) add and O(1) lookup. They cannot replace maps outright, as they are less general (they require a hash function), but many places in the compiler use maps with an integral domain, and pure hash tables could work well there.
|Re: Optimized Set and Map implementations||Dave Thomas||5/9/13 5:35 AM|
I would really like to see the compiler also optimised for completion on small projects; the ramp-up time is quite high. I think it's phase 2 of the compilation that takes most of the time.