Optimized Set and Map implementations


Jack Pappas

Apr 17, 2013, 12:48:26 PM
to fsharp-o...@googlegroups.com
Hi all,

A few months back I was implementing a data structure based on AVL trees (a balanced DIET, i.e. discrete interval encoding tree, if you care). Out of curiosity, I benchmarked my AVL tree implementation against the Set type in FSharp.Core (also implemented with an AVL tree), and was surprised to find that mine was significantly faster in a number of test cases.

Fast-forward to this past weekend: with the recent conversation here and on GitHub about convincing Microsoft to accept community contributions into the "real" (upstream) F# codebase, I remembered the benchmark results and thought it would be interesting to create optimized implementations of the Set and Map types.

If you want to skip right to the code, get the 'set.fs' and 'map.fs' files from my GitHub:

    https://github.com/jack-pappas/fs-core-optimized/tree/master/Replacements

Copy them into the /src/fsharp/FSharp.Core folder in your clone of the F# repository and recompile/install the binaries. (I also copied the compiled FSharp.Core into the lib/bootstrap/4.0 folder so it would be used the next time I compile the sources.)

You can run the PerfComparison project included in the repository if you want to try some basic benchmarks. The results from running the benchmark on my desktop are below; "Baseline" is the standard F# Set, "Result" / "Result 1" are my optimized Set implementation, and "Result 2" is the IntSet implementation from my ExtCore library (also available on GitHub and NuGet):

Create Random Set<int> (n=1000000)
Baseline: 5872.456200 (ms)
Result 1: 5033.890000 (ms)
Result 2: 4718.062200 (ms)

Create Random Set<int64> (n=1000000)
Baseline: 6194.790800 (ms)
Result  : 5664.715700 (ms)

Union Random Set<int> (n=1000, N=10000)
Baseline: 11125.811500 (ms)
Result  : 7066.950600 (ms)

Union Random Set<int64> (n=1000, N=10000)
Baseline: 15656.635400 (ms)
Result  : 10730.533600 (ms)

Intersect Random Set<int> (n=1000, N=10000)
Baseline: 652.551900 (ms)
Result  : 326.114700 (ms)

Intersect Random Set<int64> (n=1000, N=10000)
Baseline: 101.318400 (ms)
Result  : 57.178600 (ms)

I haven't benchmarked Map yet, or any of the other Set functions, so if anyone wants to do that, please feel free.

After compiling FSharp.Core with my Set and Map code, I re-compiled everything to see if there was a noticeable difference in speed. The "before" and "after" times were fairly close, but it looks like the optimizations consistently shaved off about 3-5% of the compilation time. I saved the timing results so if you want to see the details, let me know and I'll post them (or you could perform the same test on your own machine).

Hopefully, this will get some discussion going ;)

Cheers,
Jack

Adam Granicz

Apr 17, 2013, 2:20:21 PM
to fsharp-o...@googlegroups.com
That's pretty awesome Jack, thanks!


--
Adam Granicz, IntelliFactory
www.intellifactory.com

Anton Tayanovskyy

Apr 17, 2013, 2:29:25 PM
to fsharp-o...@googlegroups.com
Good results -- sounds like some serious improvements on some of the benchmarks.

Incidentally, I was "rewriting" maps too over the weekend, hoping to speed up the F# compiler. The maps I have were basically ported from OCaml.Core with some inlining to propagate the comparator function. I was hoping for better results on Int/Int64 maps, but they were only about 10% faster, so I didn't think the gain would be significant. I haven't gotten to testing inside the compiler yet, but will try to do so. It would also be fun to run your benchmarks.

I think the maps in the F# library use an indirection layer to support binary serialization. Because of that, we probably can't change them much - we'll have to preserve the same behavior, whatever it was. We can, though, swap out all maps/sets in the F# compiler to use a different implementation.

I actually think saving 5% of compilation time is pretty good, considering that maps are an easy-to-understand component to swap in and out.

I would also want to experiment with some weird adaptive map designs, specifically in the compiler. 

Thanks,

--A
Kind Regards,
Anton Tayanovskyy

WebSharper™ - type-safe JavaScript in F#
http://intellifactory.com

Anton Tayanovskyy

Apr 17, 2013, 2:30:42 PM
to fsharp-o...@googlegroups.com
Sorry about the OCaml.Core typo - I meant Jane Street Core of course (https://github.com/janestreet/core/blob/master/lib/core_map.ml). Thanks. --A

Jack Pappas

Apr 17, 2013, 4:22:52 PM
to fsharp-o...@googlegroups.com
Adam -- Thanks!

Anton --

At one point, I tried a few different tricks to inline the comparer function, but none of them were as fast as just passing it around. I have one other idea that might accomplish that, though it's quite complicated and I still need to think up some way to handle the serialization before it's a viable alternative.

My Set/Map implementations use a similar style to the existing Set/Map implementations -- they both define an internal type which is the "real" implementation of the AVL tree and a public type which wraps the tree with a nice API (and also handles serialization/deserialization).

The "standard" Set and Map implementations work similarly to those in the OCaml Core and Jane Street Core libraries. There are a few key differences in my implementations which provide the speedup:
  • I removed the "SetOne" case from SetTree/MapTree. (This is the 'Leaf' case in the Jane St. Tree0.t type.) After that, I added the [<CompilationRepresentation(...)>] attribute to the SetTree/MapTree types. This is key because it means that in the compiled IL, the Empty case is simply represented as 'null' -- which greatly speeds up the pattern-matching. The downside is that my trees use slightly more memory since each leaf holds two empty machine words. In practice there doesn't seem to be much of a difference: when I re-compiled the F# sources using my optimized FSharp.Core assembly with the bootstrap compiler, the difference in memory usage was ~1% in the worst case and less than that in most cases.
  • I swapped the order of the fields within the Node case so the left and right children are the first (leftmost) fields. This is just a basic data-alignment / packing optimization, but it means that for any sub-machine-word types my implementation should use less memory than the standard implementation (e.g., for Set<char>, or Set<int> on a 64-bit machine). Newer versions of the CLR may automatically make this optimization, but I know that at least the older ones (pre-4.0) don't. As a bonus, this reduced memory usage means more Node instances can fit into the CPU cache -- so proper data alignment provides some performance benefit too.
  • I changed most of the internal tree functions to tupled form. I didn't look at the IL for the original implementations, but I think it's generally a good idea to define functions in tupled form (as opposed to curried form) unless you're actually going to be partially applying them. The F# compiler can often figure out what you're doing and use an optimized closure or whatever, but for library code like this (which we want to be as fast as possible, within reason) I'd rather not take chances like that.
  • The higher-order functions (map, fold, etc.) have optimized implementations in my code. I used a mutable stack to implement these instead of simple recursion because for simple fold/map/iter functions the overhead of pushing/popping stack frames while traversing the tree dominates the overall execution time; the mutable stack eliminates that overhead so the performance is much improved.
  • The original implementations store the comparer function in an instance field of Set/Map. As the comparer is based only on the element/key type -- there's no way to pass in a custom function -- I changed it to use a static field. This allows it to be loaded just once per each generic instantiation (after which it's cached), and it also saves a tiny bit of memory since it eliminates one machine word from each instance of Map.
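
To make points 1, 2, and 5 concrete, here's a minimal sketch of how they fit together. The names (e.g. 'OptimizedSet') are illustrative only -- this is not the actual code from set.fs:

open System.Collections.Generic

// With this attribute the Empty case compiles to 'null' in IL, so
// matching on Empty is just a null check and a branch.
[<CompilationRepresentation(CompilationRepresentationFlags.UseNullAsTrueValue)>]
type internal SetTree<'T> =
    | Empty
    // Children first (point 2): the two pointer-sized fields lead, so a
    // sub-machine-word 'T needs no padding inserted before them.
    | Node of SetTree<'T> * SetTree<'T> * 'T * int

[<Sealed>]
type OptimizedSet<'T when 'T : comparison> internal (tree : SetTree<'T>) =
    // Point 5: one comparer per generic instantiation, held in a static
    // field instead of an instance field on every set.
    static let comparer : IComparer<'T> = LanguagePrimitives.FastGenericComparer<'T>
    static member internal Comparer = comparer
    member internal this.Tree = tree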

Anyway, I think it's quite feasible we could squeeze a good bit more performance out of the compiler by just going through and fixing all of the low-hanging fruit (like this). Did you see the filtered profiling data I posted on the GitHub issue you opened? It should provide a decently accurate work-list of things that can be fixed to speed things up. As I mentioned in the issue comments -- I've made some progress on a new lexer, which I'll post the details of here sometime tomorrow.

Cheers,
Jack

Anton Tayanovskyy

Apr 17, 2013, 5:40:47 PM
to fsharp-o...@googlegroups.com
Hi Jack,

Once again - congrats on achieving some speedup.

Now, please bear with me, I am an extremely suspicious guy and I like to question everything.

1. Concerning the data table you posted based on my F# fsc bootstrapping profile, can you explain your method? I suspect things might be flawed in the first place with my profile, since it might not account for TCO correctly; second, I didn't quite get how you computed a correction to re-order the methods. If you would explain that, or better still - maybe there's a guide somewhere on effectively profiling F#/.NET applications? :) My limited experience with this has been quite painful - the methods I tried to change didn't make a dent in overall timing.

2. Concerning maps/sets - we have to be quite careful not to jump to conclusions; simple benchmarks often lie (unless we are benchmarking a specific application - I like the bootstrapping FSC benchmark). That is, typical benchmarks are not constructed in a way that represents how the library is going to be used in the actual application.

Specifically about your points:

(a) Removing SetOne - interesting idea; I'd like to try it myself to verify this helps, though. Does the null-representation trick not work at all, or just not as well, if the leaf node is kept?

(b) I do not fully understand this, but I cannot see how it would ever hurt, so I would support the change.

(c) IMO this is definitely a bad idea. Please have a look at the IL; F# does an excellent and consistent job with curried functions. If you develop an intuition for it, it is easy to predict what happens in IL. In particular, module-level curried 'let' functions are always compiled to N-argument methods in IL. Also, tail-recursive code becomes loops inside methods. I think that, if anything, tupling hurts in both OCaml and F#.
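
For example (a quick sketch of the point; easy to confirm with ildasm):

module Curried =
    // Compiled to a single static two-argument method add(int, int) in IL.
    let add (x : int) (y : int) = x + y

    // A saturated call site compiles to a direct call to that method.
    let five = add 2 3

    // Only a genuine partial application allocates an FSharpFunc closure.
    let addTen = add 10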

(d) Very skeptical here. I would be very curious to see your exact benchmarks. How can this possibly help? Let us take a simple example like iter. Accessing Stack<'T> involves invoking methods and doing multiple checks; using the machine stack bypasses those checks, so I would assume in 90% of the cases it's faster. This is why, e.g., OCaml uses a non-tail-call List.rev. Here's a little dumb iter benchmark; the simple code is 5x better than your Iter code with Stack<'T> on my machine: https://gist.github.com/toyvo/81dc8871afa2e6af81b5
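
In case the gist link rots, the two styles being compared look roughly like this (over a stand-in tree type, not the real SetTree):

open System.Collections.Generic

type Tree<'T> =
    | Empty
    | Node of Tree<'T> * Tree<'T> * 'T * int

// Machine-stack version: plain recursion, one call frame per node.
let rec iterRec (f : 'T -> unit) tree =
    match tree with
    | Empty -> ()
    | Node (l, r, v, _) -> iterRec f l; f v; iterRec f r

// Heap version: an explicit Stack<_> stands in for the call stack, at
// the cost of Push/Pop calls and their checks on every step.
let iterStack (f : 'T -> unit) tree =
    let stack = Stack<_> ()
    let mutable current = tree
    let isNode t = match t with Empty -> false | Node _ -> true
    while stack.Count > 0 || isNode current do
        match current with
        | Node (l, _, _, _) ->
            stack.Push current          // remember this node, descend left
            current <- l
        | Empty ->
            match stack.Pop () with
            | Node (_, r, v, _) ->
                f v                     // visit in order, then go right
                current <- r
            | Empty -> ()               // unreachable: only Nodes are pushed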

However, nice trick there with `FSharpFunc<_,_,_>.Adapt f` - this helps a lot, good idea.

(e) Right. That sounds like the way to go. Specifically in the compiler, however, there are two variations on this theme: first, there are "Tagged Collections" where you *can* pass a specific comparer - so there it can sit in the instance; second, there are maps and sets with integer (int32, int64) keys - for these, the comparer can be entirely inlined to bypass the virtual call. I have some code for that but it didn't help as much as I thought it would.


Thanks,


Anton

Anton Tayanovskyy

Apr 17, 2013, 5:49:49 PM
to fsharp-o...@googlegroups.com
Here's a little loop unrolling that helps a tiny bit more:

// Assumes a map tree shaped as Node (left, right, kvp, height); the
// FSharpFunc<_,_,_>.Adapt call avoids per-element tupling overhead.
let iter f tree =
    let action = FSharpFunc<_,_,_>.Adapt f
    let inline f (kvp : KeyValuePair<_,_>) =
        action.Invoke (kvp.Key, kvp.Value)
    let rec loop tree =
        match tree with
        | Empty -> ()
        | Node (Empty, Empty, kvp, _) -> f kvp
        | Node (Empty, r, kvp, _) -> f kvp; loop r
        | Node (l, Empty, kvp, _) -> loop l; f kvp
        | Node (l, r, kvp, _) -> loop l; f kvp; loop r
    loop tree

Thanks,

--A

Michael Newton

Apr 18, 2013, 5:08:08 AM
to fsharp-o...@googlegroups.com
This is the kind of area where I'd really like to know whether MS are going to pull changes back in for future Visual Studio versions. I understand completely that MS are driving the language design side of things, but there is a lot of work remaining on the tooling that could be done if we thought it wasn't going to get thrown away with the next VS release.

Michael



Dave Thomas

Apr 18, 2013, 5:11:04 AM
to fsharp-o...@googlegroups.com
That's why open source tooling will be safer. Once F# 3.1 comes out, we should be able to roll the changes into the F# binding and the tools stemming from the compiler. Performance optimisations might be harder to get back into the Microsoft code base, though.

D.

Michael Newton

Apr 18, 2013, 5:29:44 AM
to fsharp-o...@googlegroups.com
Which is great for my personal projects, but it isn't going to help with wider adoption/work where we all use VS.

Dave Thomas

Apr 18, 2013, 6:14:55 AM
to fsharp-o...@googlegroups.com
I think the community needs to build some of these tools. Yes, it's a lot of work, but the user base is not large enough for Microsoft to commit a large budget, especially as C# already has these sorts of features. It's sad but true: we really need to start helping ourselves rather than waiting on Microsoft.

Anton Tayanovskyy

Apr 18, 2013, 7:53:19 AM
to fsharp-o...@googlegroups.com
I agree with Dave. Let us cultivate the programmer virtue of hubris (http://c2.com/cgi/wiki?LazinessImpatienceHubris). We can do it! I claim that in a year, "we all use VS" will be true for 50% of us at most. People will move to Emacs, alternate IDEs like Xamarin Studio, and hopefully the in-browser one I am working on - CloudSharper :)

In addition, working on collections seems to be a good angle because:

(1) If you get better collections, you can release them under a different namespace and let people benefit from them in other projects.

(2) If you get better collections and *convincingly* demonstrate the advantage, I think it would be very tempting for MS to pull the change into their compiler code. This is one of those places where it is relatively easy to verify correctness.

Thanks,

--A

Jack Pappas

Apr 18, 2013, 7:04:31 PM
to fsharp-o...@googlegroups.com

Hi Anton,

I don't mind you asking questions, and I'm glad you're suspicious (as you should be) because it helps make sure nothing's been overlooked or assumed.

Responding to your general points:

  1. I don’t know if there’s a specific guide for profiling F# apps (though I’m sure there are plenty of them for profiling .NET apps in general, courtesy of C#). The steps I took to create the data I posted from your raw profile data are applicable to most languages or profiling tools though, so they can serve as a rough “how-to” for profiling F# apps:
    • First, I loaded the raw file into notepad, stripped off the headers/footers from the data, then saved it.

    • I opened the file in Excel as a text file. IIRC, I had to tweak the auto-detected delimiters or column widths a bit so the data would be parsed correctly.

    • The data is now loaded into Excel, with four columns in the worksheet: "Total (ms)", "Self (ms)", "Calls", and "Method Name".
      • "Total" time for a method is the execution time of the method, summed across all of the times that method was called. You can visualize this value as the actual amount of time (wall clock) during the program's execution where the execution stack contained one or more stack frames pointing to the method. Therefore, this time includes the execution time of the method PLUS the execution time of whatever calls it made to other methods.
      • "Self" time for a method is the execution time of the method, summed across all of the times that method was called, MINUS the execution time of any other methods it calls. Keeping with the definition of "total" time, "self" time is the actual amount of time (wall clock) during the program's execution where the method is the topmost frame on the execution stack.
      • "Calls" is the number of times the method was called during execution of the application.
      • "Method Name" is the fully-qualified method name. I did notice a slight oddity here in the way the Mono profiler prints the names of generic types/methods -- sometimes it prints the name of the type parameter and other times it prints the name of the type *argument*. This messes with the results a bit, but not enough to worry about at this point.

    • I inserted a column to the right of "Calls" and added a header "Avg. Self (ms)" to represent the average "self" time per call. I filled in the values by computing ("Self (ms)" / "Calls") for each method.
      • I used "self" time instead of "total" time here to isolate specific long-running methods (i.e., bottlenecks); later, when these specific offenders are fixed, it would make more sense to use "total" time to identify larger chunks of application code which are responsible for slowing the application down. In other words, using "self" time first allows you to pick out any low-hanging optimization fruit -- once those are taken care of, any other fixes are going to require larger architectural changes to the overall application.

    • I sorted the data by "Avg. Self (ms)" descending, making sure to expand the selection when prompted.

    • Finally, I applied a couple of filters:
      • On the "Self (ms)" column, I filtered out any methods which have less than 1 sec (1000 ms) of self-execution time -- they're unlikely to be performance bottlenecks (for this specific application, that is).
      • On the "Calls" column, I filtered out any methods which were only called once; again, I did this to remove unlikely optimization candidates. In retrospect, this wasn't a good idea -- it only filtered out an additional 4 methods after applying the "Self (ms)" filter and those 4 methods are in the top 5 slowest methods. (In my defense, I added this filter before the other one when I originally made the spreadsheet.)

    • At this point, the data should provide a good idea of specific methods which are candidates for optimization. In fact, they're even roughly ordered by how much improvement they'll provide if any optimization gains can be made.

    • Now, I just used some filtering and Excel trickery to split the data into a few different worksheets according to the namespace of the containing types of the methods. If you look at the Gist where I posted the data, it has both the combined and segmented (by namespace) data so you can see which general areas of the code contain the performance issues. Note that I posted the data as a tab-delimited text file so it should be compatible with pretty much any spreadsheet software available.

  2. I agree that synthetic benchmarks can be misleading. On the other hand, if you perform a 1-to-1 comparison between two implementations (as in this case) and one of the implementations is faster or approximately equal for every individual operation, I think there are few cases where the individually slower implementation performs better when used in a real-world application. The only counterexample I can think of is when the faster implementation is faster as a result of increased memory usage (e.g., memoization), then runs out of memory and starts paging memory to disk.

    It would be nice, though, if some of the other F# developers could corroborate the speedup by compiling the code and trying it out in their applications. (If anyone wants to try the optimized code but can’t/won’t compile it themselves, let me know and I’ll post it as a NuGet package.)

    I’m not one to throw some stranger’s “improvements” into a production application without thoroughly vetting it, so even if we do get some corroboration on the optimizations I’d want to see a complete set of benchmarks run against both the standard FSharp.Core and my version – if only to see if there are any cases where the improvements are, in fact, regressions.

Addressing your specific points about the optimizations I made to Map and Set:

  1. The null trick works just fine if you keep the Leaf case. If you remove the Leaf case as I did, all you need to do for the pattern matching (at the CPU level) is a null-check and branch; on most CPUs, this is going to be a single instruction with an implicit zero operand (‘jz’ or ‘jnz’ in x86 assembly). If you leave the Leaf case, the compiled IL will either check the hidden ‘Tag’ field within the object (to determine which case the data represents) or use the ‘isinst’ OpCode (the ‘as’ keyword in C#) since that’s slightly faster when you only have two cases (Empty case still only requires a null check); whatever the F# compiler emits, it’s still going to be slower than a null-check and branch.

  2. Yep, this is pretty straightforward but tends to be overlooked by many developers (regardless of language/platform). This arises because many CPUs require addresses to be aligned to a multiple of the data size when performing memory operations. In other words, if you want to read a System.Int64 from memory, it needs to be stored at an address that is a multiple of 8 (because sizeof<int64> = 8); consequently, the compiler (the JIT compiler, in our case) sometimes has to insert padding bytes in the in-memory representation of the type so the fields will be correctly aligned.

    For example, the original F# implementation defines the SetNode case as:

        'T * SetTree<'T> * SetTree<'T> * int

    The SetTree<'T> fields are pointers (machine words), which means that whenever the element type ('T) is smaller than a machine word, the compiler has to insert padding bytes after it so the next field will be aligned correctly. E.g., if you have a Set<char> on a 64-bit machine, the compiler will have to stick in 6 bytes of padding after the 'T field so the following SetTree<'T> field will be aligned correctly. Thus, the in-memory size of the data is (2 + (6) + 8 + 8 + 4 = 28) bytes. With my reordering of the fields, the same case takes up (8 + 8 + 2 + (2) + 4 = 24) bytes, and no padding is necessary whenever (sizeof<'T> % 4 = 0).


    Wikipedia has more information if you want to dig deeper: https://en.wikipedia.org/wiki/Data_structure_alignment

  3. I have to disagree with you here. I've done a lot of work with F# and IL, and I *know* there are certain specific cases where the F# compiler won't perform the expected optimization -- sometimes because the compiler didn't recognize that it could, and other times because some nuance of the .NET semantics makes the optimization invalid for a specific case. Sure, doing a lot of work with F# and IL can help you develop an intuition about how the compiler will handle the optimizations, but in this case I think it makes much more sense to directly use the tupled form -- not because the method will be compiled in some special way, but because if all of the arguments are going to be applied each time the function is called, the tupled syntax is a simple way to enforce that, so there's no chance of an accidental partial application anywhere (which would then have to use a closure to call the compiled method).

    Additionally, 'let rec' functions within other functions aren't *always* compiled into loops. I've seen plenty of cases where the F# compiler just performs the lambda-lifting and compiles the 'let rec' function as a separate method within the same type.


  4. I'll be the first to admit that some of my changes could actually introduce performance regressions; we won't know either way until a full set of benchmarks exists. I also haven't claimed that my code is *maximally* performant -- I posted it knowing full well there are more potential optimizations to be made; as you mentioned, the use of Stack<'T> introduces some overhead from calling the methods of the Stack<'T> class. The main reason I changed the code to use Stack<'T> wasn't to be faster (though it often is...) but so the method executes in constant stack space. Naïve recursion isn't a problem in and of itself, but I think it's a good idea to avoid it in library functions if we can -- even if it means returning a little bit of the optimization gains we've made elsewhere. It just helps keep people out of trouble if they do anything with nested recursion, or if they've implemented some poorly-designed recursion elsewhere in their program.

    Alternatively, we could either write a simple stack ourselves with an array and a counter (thereby effectively 'inlining' Stack<'T> into the method) or just use continuations (though this involves some allocation, more so than with Stack<'T>). I usually prefer the second option (continuations) and have used that technique successfully on a number of occasions; a sketch follows.
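
For reference, the continuation version of a fold looks something like this -- a sketch over a stand-in tree type, not the code I shipped:

type Tree<'T> =
    | Empty
    | Node of Tree<'T> * Tree<'T> * 'T * int

// Continuation-passing fold: the pending work lives in heap-allocated
// closures, so the machine stack stays constant provided the JIT
// honors the tail calls below.
let fold (folder : 'State -> 'T -> 'State) (state : 'State) (tree : Tree<'T>) =
    let rec loop tree state cont =
        match tree with
        | Empty -> cont state
        | Node (l, r, v, _) ->
            // Left subtree first, then this node's value, then the right.
            loop l state (fun acc -> loop r (folder acc v) cont)
    loop tree state id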

    I am curious to know why your naïve code is *much* faster there, though (I haven't tested it myself yet). You said the naive code was 5x faster, but what is the actual (wall clock) amount of time saved? Did you benchmark that on your Arch Linux box with Mono or on Windows? One reason (maybe even the main reason) exceptions are much slower in F# than in OCaml is that .NET internally hooks into Windows' structured exception handling (SEH) mechanism, which in turn requires additional information to be stored in each stack frame. (In other words, it's slower because it has to push/pop larger stack frames.) Another explanation could be that the JIT compiler is able to perform a special optimization for that method -- older versions of the CLR used an optimized calling convention (something like __fastcall) when the arguments to a method were 16 bytes or smaller.


  5. For the tagged collections, it's still possible to use a custom comparer and store it in a static field as long as it's still a per-type comparer (i.e., you don't need to pass in a closure or something). If that's the case, you could just add an extra type parameter (e.g., 'Cmp) and constrain it to ('Cmp when 'Cmp :> IComparer<'Tag> and 'Cmp : (new : unit -> 'Cmp)). Then you'd just create an instance of the comparer and bind it with 'static let', just like I did in my Set and Map implementations; a sketch of this follows. For Sets and Maps with integer keys, they should be using the IntSet and IntMap types based on Patricia tries (they're faster than AVL trees -- the "Result 2" in my original post in this thread uses my IntSet implementation). I have well-optimized IntSet and IntMap implementations in my ExtCore library, and I've even included special versions (TagSet and TagMap) which allow you to "tag" the integer keys with unit-of-measure types to enforce some type-safety on the values.
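
Here's the shape of that 'Cmp trick as a sketch (hypothetical types, not taken from ExtCore):

open System.Collections.Generic

// The comparer is itself a type parameter, so each (element, comparer)
// instantiation gets its own cached static instance -- no per-instance
// field and no closure to pass around.
[<Sealed>]
type TaggedSet<'T, 'Cmp when 'Cmp :> IComparer<'T>
                         and 'Cmp : (new : unit -> 'Cmp)> () =
    static let comparer : IComparer<'T> = new 'Cmp ()
    static member internal Comparer = comparer

// An example comparer type to plug in for 'Cmp.
type OrdinalStringComparer () =
    interface IComparer<string> with
        member this.Compare (x, y) = System.String.CompareOrdinal (x, y)

// Usage: TaggedSet<string, OrdinalStringComparer>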

Cheers,
Jack

Anton Tayanovskyy

Apr 18, 2013, 9:21:18 PM
to fsharp-o...@googlegroups.com
Hi Jack,

Thanks for the detailed response. You convinced me on #5; we'll measure #1 (presence of the Leaf node) to find out what works best; and #3 is a stylistic issue I have no interest in pursuing further - if you prefer tupled style, that's OK :)

Now, developing on the ideas for improving the implementation:

1. The profile I collected is a 30G binary file; I still have to figure out how to share it. I might have posted truncated data, so you don't get the actual sort order you want.

2. Sorting by avg(self) brings up methods that do not necessarily have a big impact on total runtime. Consider: 1,000,000 invocations of M1 with avg(self)=0.001 sec, total(self)=1,000 sec. But then 1 invocation of M2 with avg(self)=total(self)=1 sec. You are going to see M2-like methods at the top of the table, when we are really after M1-like methods. Even if we optimize the heck out of M2, it will only save us 1 sec, 0.1% of total runtime in the example. The best bet seems to be sorting by self, given that the profiler correctly estimates self time with tail calls - I still have to check this in the Mono profiler.

3. There are plenty of examples besides memory use where winning on a benchmark is a moot point. Since we are discussing this anyway, one great example is using stack vs heap (such as System.Collections.Generic.Stack<'T>) for control flow in recursive procedures. On *large* inputs, using the heap may be faster, or even the only way forward (because of stack overflow). On *small* inputs, stack is way faster. This is why on my benchmark my `iter` beats yours - the benchmark involves a *small* enough Map. This is also why Dictionary uses adaptive, different representations for small and large tables.

4. Since you can have V1 outperform V2 on small maps, and V2 outperform V1 on large maps, the real question is how to construct the *definitive* benchmark that we will use to pick the better implementation. My intuition tells me that the statistical distribution of input sizes to procedures is shaped by a power law (lots of small inputs, and occasionally very big inputs) - but I have not found time to measure yet. I am looking forward to doing this little study on the F# compiler, to find out an approximation of the statistical distribution of input sizes for various functions. Then the distribution can be used to cheaply generate "informed" benchmarks.

5. Another idea, which might be our best bet, is to instrument the compiler to record and save to disk all inputs to Map.* functions observed during bootstrapping. Input size is important but not the only factor. For example, I could trace skeletons of all int- and int64-keyed maps operated on during the bootstrapping - this gives size and a little more precision for things like balancing. We could then construct the benchmark by replaying the trace using the implementation under test. I am not sure if this is practical - it depends on the size of the recorded data - but if so, it could be considerably faster than going through the complete F# bootstrapping process. Such a benchmark would be the most convincing - after actually timing the bootstrapping process itself, of course. The benchmark would measure the code under different memory conditions.

6. Good question about my measurements - I have not been doing measurements consistently, but from now on I'll try to stick to .NET 4.0, AnyCPU, Windows, x64 - I think this is the configuration I'm most interested in for performance.


 
Thanks,

--A

Jack Pappas

Apr 23, 2013, 4:22:04 PM
to fsharp-o...@googlegroups.com
Hi All,

I was looking at this code again yesterday and noticed that I'd missed a couple of potential optimizations. Specifically, when inserting an element which already exists in the set/map or deleting an element which does not exist in the set/map, both my implementation and the original implementation (in FSharp.Core) return the tree without altering it; however, this fails to take into account that the Insert and Delete methods are recursive, so even though the original tree is returned, it's rebalanced repeatedly while unwinding the recursion. I added a few lines of code to detect this and avoid the unnecessary rebalancing and pushed the improvements to GitHub (https://github.com/jack-pappas/fs-core-optimized/tree/master/Replacements).
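
The shape of the fix, as a sketch (stand-in tree type; 'rebalance' is reduced to a stub here -- the real one also performs the AVL rotations):

open System.Collections.Generic

type Tree<'T> =
    | Empty
    | Node of Tree<'T> * Tree<'T> * 'T * int

let height t = match t with Empty -> 0 | Node (_, _, _, h) -> h

// Stub: only recomputes the height; the real rebalance also rotates.
let rebalance l r v = Node (l, r, v, 1 + max (height l) (height r))

// If the element is already present, the *same* subtree object comes
// back, and a reference-equality check lets each caller skip the
// rebalancing work while the recursion unwinds.
let rec add (comparer : IComparer<'T>) value tree =
    match tree with
    | Empty -> Node (Empty, Empty, value, 1)
    | Node (l, r, v, _) ->
        let c = comparer.Compare (value, v)
        if c = 0 then tree                            // already present
        elif c < 0 then
            let l' = add comparer value l
            if obj.ReferenceEquals (l', l) then tree  // unchanged below
            else rebalance l' r v
        else
            let r' = add comparer value r
            if obj.ReferenceEquals (r', r) then tree
            else rebalance l r' v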

These improvements don't seem to make much difference in the general case: in the random set creation test, the improvement was barely noticeable. However, the Intersection and Union operations make heavy use of Insert and Delete, and re-running my simple tests with random sets (in the PerfComparison project) shows a marked improvement: the Union operation now runs twice as fast as the one in FSharp.Core, and Intersect is about 50% faster. I re-ran the "compiling the F# source distribution" test and found the results to be mostly the same, except that a couple of the larger projects (the compiler itself, IIRC) again ran slightly faster compared to the previous results (another 1-2%). I imagine that any applications which make heavy use of Set and Map will see more improvement, though.

Responding to Anton's points:

2. Sorting by Self (Avg.) is not meant to find methods which, if optimized, would have the greatest impact on the total runtime. It is meant to find the methods which are the easiest to optimize; whether those methods are _actually_ optimizable, or whether doing so would noticeably improve application performance -- that is left up to you to decide. It is generally much harder to make any kind of *significant* improvement in a method which runs in 0.001 sec (i.e., the law of diminishing returns). In general, the best way to optimize these methods is just to eliminate them, but that often entails more significant architectural changes, if it is even possible at all.

3. Adaptive implementations can be useful (e.g., in Dictionary<'T>), but they're only going to matter if the impact on application performance is significant. For example, your stack-based approach may be 5x faster than Stack<'T> for small inputs, but unless it provides noticeable performance gains in real-world applications, there's no reason to take on the additional complexity in the code. Dictionary<'T> also has O(1) speed in determining the number of elements it holds; the F# Set and Map types don't keep track of the number of elements they contain (though they easily could, if necessary), so determining which method to use -- that is, whether the input is "small" or "large" -- involves a complete traversal of the structure to get the number of elements; you have to add that overhead to the execution time of the adaptive algorithms. I'm not saying we *shouldn't* go the adaptive route, just that there are some additional considerations to be made in this case; as always, the only way to know is to write it and benchmark it :)

4. This is a good idea. There are a couple of other things to consider when creating the test data distributions:

   - Set density: different data structures or different implementations of the same data structure may have performance characteristics which vary based on the _density_ or distribution of the input set. For example, the Patricia tries used in IntSet and IntMap are perfectly balanced (i.e., they have the best performance) when the element/key set is contiguous; however, they're not self-balancing so the right (or wrong) data set will cause them to become badly unbalanced and result in terrible performance.
  
   - Make sure to include overlapping or non-existent data values -- as I've shown here, the way these are handled can significantly impact the performance of higher-level operations.

5. Another good idea. It'd be easy enough to hack up a custom version of FSharp.Core which uses NLog (or Log4Net, or whatever) to record the operations and input sizes. It would also be a good idea to test the benchmarks on multiple machines and multiple OS's to see if there are any critical differences between 32- and 64-bit machines, or between Windows/Mac/FreeBSD/Linux.
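
As a starting point, the instrumentation shim could be as simple as this sketch (hypothetical module; the real thing would live inside FSharp.Core and use a proper logging library rather than a bare StreamWriter):

open System.IO

// Shadowed Map operations that append (operation, input size) records
// to a trace file; replaying the trace later can drive a benchmark.
module TracedMap =
    let private trace =
        let w = new StreamWriter ("map-trace.log", true)  // append mode
        w.AutoFlush <- true
        w

    let add key value (table : Map<_, _>) =
        trace.WriteLine (sprintf "add\t%d" table.Count)
        Map.add key value table

    let remove key (table : Map<_, _>) =
        trace.WriteLine (sprintf "remove\t%d" table.Count)
        Map.remove key table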

6. As for the measurements, it's important to be very precise about the test setup, since there are so many obscure factors that can come into play. For this kind of code (performance- and memory-intensive), CPU cache size and main memory speed/latency are going to make a real difference, so it'd be worth recording that information when you implement (5); this matters even if you use the same machine and setup for benchmarking various optimizations, because these factors influence where the inflection points occur in your performance curves, and these are usually where you would put the cutoff for an adaptive algorithm (i.e., the input size which is the boundary between what you consider a "small" or "large" input).

   Some additional benchmarking notes (a small harness sketch follows these):
  
   - ALWAYS make sure to include a "warm-up" step in your benchmarking tool which hits all of the code paths you intend to test. Generic types/methods (heavily used in F#) are instantiated (specialized) by the JIT compiler once for reference types and for each value type; for example, if you have a generic method Foo<'T> and you call it as Foo<obj>, Foo<string>, Foo<int>, Foo<int64>, Foo<DateTime>, the JIT compiler will emit the method four (4) times: for obj/string, int, int64, and DateTime. Make sure you run the warm-up procedures accordingly, otherwise the JIT overhead could skew the results. Running the warm-up procedure also eliminates the overhead of finding and loading any referenced assemblies used within the code (since assemblies are loaded on-demand).
  
   - Run each benchmark a few times to make sure your results are accurate. Unknown factors can influence the tests in random, non-repeatable ways. For example, when I ran the "compile the F# sources" test for the new improvements (above), I ran it three times in a row and the results were slightly different each time; the first run showed a small *loss* in performance in a few projects, but after re-running the tests it appears those results were flukes.
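
   A minimal harness that bakes in both habits (a hypothetical helper, not part of PerfComparison):

open System
open System.Diagnostics

// Runs the workload once untimed to force JIT compilation (including
// the generic instantiations it touches) and assembly loading, then
// times several independent runs.
let benchmark (name : string) (runs : int) (work : unit -> unit) =
    work ()                            // warm-up pass
    for i in 1 .. runs do
        GC.Collect ()                  // reduce cross-run GC interference
        GC.WaitForPendingFinalizers ()
        let sw = Stopwatch.StartNew ()
        work ()
        sw.Stop ()
        printfn "%s, run %d: %.3f ms" name i sw.Elapsed.TotalMilliseconds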

Is there anyone in the community interested in helping Anton set up and run these benchmarks? I'm happy to do the tuning and tweaking to squeeze more performance out of the code, but I can't do that without a good benchmarking setup and I don't have the time to write it myself.

One last thought -- I hope this thread has helped spark some discussion within Microsoft with regards to accepting upstream patches to F#. If it has, I don't know if there's been any opposition to it (hopefully not). In any case, if Microsoft decides not to accept patches to the compiler itself, I hope they'll at least consider opening other parts of F# (like FSharp.Core) to the community. I certainly don't mind pitching in to help make F# even better than it already is -- and I know others want to help too -- but if the work isn't going to benefit the wider F# world there's not much point in it.

Regards,
Jack

Marius Fersigan

Apr 24, 2013, 4:21:22 PM
to fsharp-o...@googlegroups.com
Anton, 

You can share large data (files) using a (private) torrent. Here's a link about it:


Marius Fersigan
..................................
0742775542





Anton Tayanovskyy

May 9, 2013, 8:34:13 AM
to fsharp-o...@googlegroups.com
Thanks, Marius!

Jack - apologies for dragging this on - but I have not given up yet on constructing the profile. It is hard to find time for this.

Also, a note for the future: I recently came across purely functional hash tables; Racket, for example, has them. They have amortized O(1) add and O(1) lookup. They cannot replace maps, as they are less general - they require a hash function. But many places in the compiler use maps with an integral domain; pure hash tables could work well there.

Thanks,

--A

Dave Thomas

May 9, 2013, 8:35:53 AM
to fsharp-o...@googlegroups.com
I would really like to see the compiler also optimised for completion on small projects; the ramp-up time is quite high. I think it's phase 2 of the compilation that takes most of the time.

D.
