[erlang-questions] list vs binary performarnce, destructuring and consing

147 views
Skip to first unread message

Erik Pearson

unread,
Oct 22, 2012, 6:18:48 PM10/22/12
to erlang-q...@erlang.org
Hi,
I've read from advice given many years ago that processing binaries byte by byte (e.g. a recursive parser), performance is better using a list to accumulate the bytes, rather than using binary concatenation. So [B|Accum] rather than <<Accum/binary, B>>. There seems to be a consensus  however, on the efficiency of Binaries compared to List strings.

My own quick test, which was just to copy a list or binary element by element, showed much better performance for the list version. The test was basically to pass an arbitrary string or binary, and copy it some number of thousands of times, and output the complete copies per second.

I tried list based accumulation for a binary, using binary destructuring in the function head, and that sped things up, but it was still slower than the equivalent list string copy.

Are there any tips for binaries? Of is this not a good use case for binaries. 

test_bin_copy(Bin) ->
    test_bin_copy(Bin, <<>>).
test_bin_copy(<<>>, Accum) ->
    Accum;
test_bin_copy(<<Char, Rest/binary>>, Accum) ->
    test_bin_copy(Rest, <<Accum/binary, Char>>).

test_string_copy(Bin) ->
    test_string_copy(Bin, []).
test_string_copy([], Accum) ->
    lists:reverse(Accum);
test_string_copy([Char|Rest], Accum) ->
    test_string_copy(Rest, [Char|Accum]).

For what its worth this is part of a json module. The current practice in json libraries seems to  favor binaries, so I assumed there were inherent performance advantages. I can imagine, e.g., that an empty binary would be stored as a modest sized buffer that would be appended in place until there was a need to expand it or copy (e.g. if an older version of it was being appended), and that operations on it would be fast compared to arbitrary consing (which is however highly optimized.)

I think some of the favoritism for binaries in json libs is because it makes it easy to differentiate json strings (as erlang binaries) from json arrays (as erlang lists), but my implementation is using tagged tuples to contain each json value, so this is not a concern. Of course there are the memory concerns, but in my tests any memory concerns with list char size vs binary bytes is erased by the performance gains.

I'm sure I've put my foot in my mouth at least once, but, anyway, advice appreciated.

Thanks,
Erik.

Martynas Pumputis

unread,
Oct 23, 2012, 4:07:37 AM10/23/12
to erlang-q...@erlang.org
Hi,

Could you show the exact steps of your simulation? Binary version should
be faster, because some extra memory allocation is avoided per each
iteration and large binaries aren't being copied.

Take a look at:
http://www.erlang.org/doc/efficiency_guide/binaryhandling.html

Martynas
> _______________________________________________
> erlang-questions mailing list
> erlang-q...@erlang.org
> http://erlang.org/mailman/listinfo/erlang-questions
>

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Motiejus Jakštys

unread,
Oct 23, 2012, 4:50:25 AM10/23/12
to Martynas Pumputis, erlang-q...@erlang.org
On Tue, Oct 23, 2012 at 9:07 AM, Martynas Pumputis <mart...@gmail.com> wrote:
> Hi,
>
> Could you show the exact steps of your simulation? Binary version should be
> faster, because some extra memory allocation is avoided per each iteration
> and large binaries aren't being copied.
>
> Take a look at:
> http://www.erlang.org/doc/efficiency_guide/binaryhandling.html
>

I tried that last night and was surprised, so will share.

Benchmark script:
https://gist.github.com/3937666

Session:
1> t:bench().
Creating str of size 1048576.. done.
Converting to binary.. done.
Executing STR copy 30 times... done.
Executing BIN copy 30 times... done.
BIN: Mean: 139461.1, stdev: 7829.9
STR: Mean: 79089.4, stdev: 53349.0

No matter how hard I try, I cannot outperform lists.

R15B02, 32-bit Linux.

--
Motiejus Jakštys

Motiejus Jakštys

unread,
Oct 23, 2012, 4:57:48 AM10/23/12
to Martynas Pumputis, erlang-q...@erlang.org
On Tue, Oct 23, 2012 at 9:50 AM, Motiejus Jakštys <desir...@gmail.com> wrote:
> On Tue, Oct 23, 2012 at 9:07 AM, Martynas Pumputis <mart...@gmail.com> wrote:
> Session:
> 1> t:bench().
> Creating str of size 1048576.. done.
> Converting to binary.. done.
> Executing STR copy 30 times... done.
> Executing BIN copy 30 times... done.

Even this benchmark is not fair, because while STR is doing its work,
Erlang VM is allocating more and more memory from the OS. And then
binary has a sufficiently large heap in the VM to work on.

To be fair, we should run both string copy and binary copy before
doing the benchmarks, just to "warm up" the VM. Tried it, but doesn't
seem to help.

Loïc Hoguin

unread,
Oct 23, 2012, 5:00:37 AM10/23/12
to Motiejus Jakštys, erlang-q...@erlang.org
On 10/23/2012 10:50 AM, Motiejus Jakštys wrote:
> On Tue, Oct 23, 2012 at 9:07 AM, Martynas Pumputis <mart...@gmail.com> wrote:
>> Hi,
>>
>> Could you show the exact steps of your simulation? Binary version should be
>> faster, because some extra memory allocation is avoided per each iteration
>> and large binaries aren't being copied.
>>
>> Take a look at:
>> http://www.erlang.org/doc/efficiency_guide/binaryhandling.html
>>
>
> I tried that last night and was surprised, so will share.
>
> Benchmark script:
> https://gist.github.com/3937666
>
> Session:
> 1> t:bench().
> Creating str of size 1048576.. done.
> Converting to binary.. done.
> Executing STR copy 30 times... done.
> Executing BIN copy 30 times... done.
> BIN: Mean: 139461.1, stdev: 7829.9
> STR: Mean: 79089.4, stdev: 53349.0
>
> No matter how hard I try, I cannot outperform lists.
>
> R15B02, 32-bit Linux.

Please try with the module t compiled with the native option.

Also the answer to "why binary" is generally either "because unicode" or
"because memory", not "lol speed". But it would be good if "lol speed"
could be improved eventually.

--
Loïc Hoguin
Erlang Cowboy
Nine Nines
http://ninenines.eu

Motiejus Jakštys

unread,
Oct 23, 2012, 5:16:12 AM10/23/12
to Loïc Hoguin, erlang-q...@erlang.org
On Tue, Oct 23, 2012 at 10:00 AM, Loïc Hoguin <es...@ninenines.eu> wrote:
> On 10/23/2012 10:50 AM, Motiejus Jakštys wrote:
>>
>> On Tue, Oct 23, 2012 at 9:07 AM, Martynas Pumputis <mart...@gmail.com>
>> wrote:
>>>
> Please try with the module t compiled with the native option.
Hi,
does not change a lot.

> Also the answer to "why binary" is generally either "because unicode" or
> "because memory", not "lol speed". But it would be good if "lol speed" could
> be improved eventually.

Thanks for the explanation.

FYI, with "native" and some more tricks:
BIN: Min: 124959.0, Max: 152380.0, Mean: 80449.2, stdev: 53616.5
STR: Min: 40768.0, Max: 340067.0, Mean: 137522.3, stdev: 7299.9

Was interesting, though. Interestingly standard deviation of binary is
quite high.

--
Motiejus Jakštys

Jesper Louis Andersen

unread,
Oct 23, 2012, 8:00:34 AM10/23/12
to Motiejus Jakštys, erlang-q...@erlang.org

On Oct 23, 2012, at 11:16 AM, Motiejus Jakštys <desir...@gmail.com> wrote:
>
> FYI, with "native" and some more tricks:
> BIN: Min: 124959.0, Max: 152380.0, Mean: 80449.2, stdev: 53616.5
> STR: Min: 40768.0, Max: 340067.0, Mean: 137522.3, stdev: 7299.9
>

There is no reason to believe binaries to be faster than lists for this benchmark. The list construction can take its cells from the process heap and since it is garbage collected, doing so is an (amortized) O(1) pointer move. The reverse takes O(N). So we are looking at N*O(1) + O(N) = O(N). The binary construction has to reallocate the binary once in a while which imposes a copy on the binary to make it larger. In the worst case, and when it is badly implemented, this is an O(N^2) operation. If it is better implemented, we can shave it down to O(N lg N) or thereabout, depending on the slack amount we want to allocate. But it does not seem as cheap as the other solution, barring constants.

If you want to do better at the stats, report the median or better, the 25th, 50th, 75th, 90th, 95th and 99th percentile. I don't think the mean can be used for anything since we don't know if the thing we are looking at happens to be a normal distribution. Or you could plot the kernel density. Just taking

tiefling=; dc -e '340067 40768 + 2 / p'
190417

does suggest that something is fishy since this is way away from the mean.

Jesper Louis Andersen
Erlang Solutions Ltd., Copenhagen

Erik Pearson

unread,
Oct 23, 2012, 11:11:03 AM10/23/12
to Martynas Pumputis, erlang-q...@erlang.org
I used a simple thing like:

test_iter(Mod, Fun, Args, Iters) ->
    test_iter(Mod, Fun, Args, now(), Iters, Iters).

test_iter(_Mod, _Fun, _Args, Start, Iters, 0) ->
    Iters/(timer:now_diff(now(), Start)/1000000);

test_iter(Mod, Fun, Args, Start, Iters, CountDown) ->
    erlang:apply(Mod, Fun, Args),
    test_iter(Mod, Fun, Args, Start, Iters, CountDown-1).

And was just looking at total iterations per sec. I would just repeat this several times until I found a relatively stable reading. Sure, there was variation, but I'm looking to simulate the pattern I use in this library, which is iterating through and copying many small bits of text (json keys and values.)

Since there was such a large difference in overall performance (string being more than twice as fast), I didn't feel the need to be more precise before posing the question.

E.g.

20> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
232018.5614849188
21> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
224870.69934787496
22> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
23> 
23> json:test_iter(json, test_string_copy, ["hi, my name is erik"], 100000).
650732.3993154295
24> json:test_iter(json, test_string_copy, ["hi, my name is erik"], 100000).
608076.4716970806
25> json:test_iter(json, test_string_copy, ["hi, my name is erik"], 100000).
567359.7912115968

Many of the follow up observations and questions have been stimulating, so I'm now interested as well in a more detailed analysis.

However, in the end what I'm looking at is the differences in performance between list and binary string processing under what I believe is idiomatic Erlang for such problems.

Thanks,

Erik.

Dmitry Kolesnikov

unread,
Oct 23, 2012, 11:34:23 AM10/23/12
to erlang-questions Questions
Hello,

I believe you should trade-off between memory utilization and performance.
Yes, lists are faster then binary but requires more memory.

I have figure out that binary parsing might be efficient if you scan it by changing index and capture tokens instead of single byte.
, e.g.:

parse(In, Pos, Len, Line, Fun, Acc0) ->
case In of
% end of field
   <<_:Pos/binary, Tkn:Len/binary, ?FIELD_BY, _/binary>> ->
         parse(In, Pos + Len + 1, 0, [Tkn | Line], Fun, Acc0);
...
_ ->
    % no match increase token
      parse(In, Pos, Len + 1, Line, Fun, Acc0)
end;

- Dmitry

Erik Pearson

unread,
Oct 23, 2012, 11:56:31 AM10/23/12
to Loïc Hoguin, erlang-q...@erlang.org
I tried the tests with native compilation (erlc +native json.erl) and found about a 40-50% speedup for binary, and 10-20% for lists (median-mean). Just 10 samples each for this back of the envelope test. So lists went from being over twice as fast to about 1 1/2 times as fast.

Erik Pearson

unread,
Oct 23, 2012, 1:14:44 PM10/23/12
to erlang-questions Questions
Okay, I think I see what you are getting at. Less copying, more slicing. 
I've taken this approach, and designed another mini-test, this time emulating the parsing task more closely. The first two are the equivalent of the previous binary and list string copying functions. The third takes the approach of counting the binary length before finally returning a slice of it. My approach is a bit simpler than yours Dmitry. I realize that yours would keep on counting from the beginning of the binary and not return a Rest of the binary, but since I tested with a simple string "field:value" I don't think there would be a difference. 

Anyway, for this test the counting method was actually slower. Maybe a real world test where the parser is dealing with longer strings and multiple tokens

test_bin_parse(Bin, StopperChar) ->
    test_bin_parse(Bin, StopperChar, <<>>).
test_bin_parse(<<Char, Rest/binary>>, Char, Accum) ->
    {Accum, Rest};
test_bin_parse(<<Char, Rest/binary>>, Stopper, Accum) ->
    test_bin_parse(Rest, Stopper, <<Accum/binary, Char>>).

test_string_parse(String, StopperChar) ->
    test_string_parse(String, StopperChar, []).
test_string_parse([Char|_Rest], Char, Accum) ->
    {lists:reverse(Accum), Rest};
test_string_parse([Char|Rest], Stopper, Accum) ->
    test_string_parse(Rest, Stopper, [Char|Accum]).

test_bin_parse2(Bin, StopperChar) ->
    test_bin_parse2(Bin, StopperChar, 0).
test_bin_parse2(Bin, Stopper, Len) ->
    case Bin of 
<<_Field:Len/binary>> ->
   undefined;
<<Field:Len/binary, Stopper, Rest/binary>> ->
   {Field, Rest};
_Else ->
   test_bin_parse2(Bin, Stopper, Len+1)
    end.

BTW I couldn't get the binary length to pattern match in a function head, so had to do it in a case instead.

Erik.

Erik Søe Sørensen

unread,
Oct 23, 2012, 1:41:26 PM10/23/12
to Jesper Louis Andersen, erlang-q...@erlang.org

Den 23/10/2012 14.00 skrev "Jesper Louis Andersen" <jesper.lou...@erlang-solutions.com>:
>
> There is no reason to believe binaries to be faster than lists for this benchmark. The list construction can take its cells from the process heap and since it is garbage collected, doing so is an (amortized) O(1) pointer move. The reverse takes O(N). So we are looking at N*O(1) + O(N) = O(N). The binary construction has to reallocate the binary once in a while which imposes a copy on the binary to make it larger. In the worst case, and when it is badly implemented, this is an O(N^2) operation. If it is better implemented, we can shave it down to O(N lg N) or thereabout, depending on the slack amount we want to allocate. But it does not seem as cheap as the other solution, barring constants.

The normal slack amount to use is some factor times the current length - eg. a doubling. This gives O(n) amortized cost (remember that while you do need to copy O(lg n) times, you need to copy far less than n elements each time - it sums up to <2n copyings).

So asymptotically it should be at level with lists (I'm fairly certain that BEAM is smart about this) - but the constant factor is still wanting...

Erik Pearson

unread,
Oct 23, 2012, 2:59:40 PM10/23/12
to erlang-questions Questions
Following up, I also find that if I replace the binary accumulator with a list accumulator, but keeping the binary destructuring + matching in the function head, the performance is about 30-40% faster, but still far short of the list performance (not compiled native.) E.g.:

test_bin_list_parse(Bin, StopperChar) ->
    test_bin_list_parse(Bin, StopperChar, []).
test_bin_list_parse(<<>>, _, _) ->
    undefined;
test_bin_list_parse(<<Char, Rest/binary>>, Char, Accum) ->
    {lists:reverse(Accum), Rest};
test_bin_list_parse(<<Char, Rest/binary>>, Stopper, Accum) ->
    test_bin_list_parse(Rest, Stopper, [Char|Accum]).

I'm doing all of this testing by hand running the functions through the "test_iter" function with 10,000 iterations, repeating 10 times, and plugging the 10  iter/sec values into a spreadsheet, then comparing at the mean, median, and relative stdev. Yeah, I know, poor man's testing.

This whole exercise kind of validates this particular project, which is to develop a json api that is opaque, so that the internal implementation of json structures within erlang and the algorithms for encoding, decoding, and traversing can be disentangled from client code.

Erik.

Dmitry Kolesnikov

unread,
Oct 23, 2012, 4:05:52 PM10/23/12
to Erik Pearson, erlang-questions Questions
Hello Erik,

You brought a good point here...
The method used by me is less efficient then your test_string_parse. This explains why slicing was slower.

However, slicing method is very efficient for CSV parsing. I'll try to re-implement CSV parse again using chapter 4.3 recommendation I'll the performance difference. 


- Dmitry

Erik Pearson

unread,
Oct 23, 2012, 5:19:55 PM10/23/12
to erlang-questions Questions
Okay, given that self recursion which uses the Rest part of the pattern is optimized to use the match state rather than generating a new sub binary, and that this is specifically documented to be an optimization ...

I think that your original construction might be faster if the pattern matching were moved to the function head. As it is, the pattern matchers need to be constructed for each function call, no?

But this:

test_binpat(<<Field:Len/binary, Rest/binary>>, Len) ->
    {Len, Field, Rest}.

does not compile, complaining that "variable 'Len' is unbound", while this

Len = 11.
<<Field:Len/binary, Rest/binary>> = <<"hello world of erlang">>, {Len, Field, Rest}.

does. 

So it looks like the binary matcher can bind a variable to the length part, just not in a function head?

Erik.

Jeff Schultz

unread,
Oct 23, 2012, 8:05:43 PM10/23/12
to Erik Pearson, erlang-q...@erlang.org
On Tue, Oct 23, 2012 at 08:11:03AM -0700, Erik Pearson wrote:
> I used a simple thing like:

> 20> json:test_iter(json, test_bin_copy, [<<"hi, my name is erik">>], 1000).
> 232018.5614849188

> 23> json:test_iter(json, test_string_copy, ["hi, my name is erik"], 100000).
> 650732.3993154295

If you really did exactly this, you were copying a *one* element list,
not the 19 element list represented by "hi, my name is erik"!


Jeff Schultz

Jesper Louis Andersen

unread,
Oct 24, 2012, 6:06:32 AM10/24/12
to Erik Søe Sørensen, erlang-q...@erlang.org

On Oct 23, 2012, at 7:41 PM, Erik Søe Sørensen <eri...@gmail.com> wrote:

> So asymptotically it should be at level with lists (I'm fairly certain that BEAM is smart about this) - but the constant factor is still wanting...

Yes, indeed. Your analysis seems right. It ought to be smart about this and if you go amortized, it looks like an O(N).

Erik Pearson

unread,
Oct 24, 2012, 11:58:10 AM10/24/12
to erlang-q...@erlang.org
Interesting comments. What exactly is getting copied? The accumulator binary? In this case the allocated size of the binary (doesn't it default to 256bytes or the twice initial contents, whichever is larger?) is much greater than the size of the bytes stuffed into it for accumulation. (See comments at the very end -- the Erlang docs state that the initial iteration will cause the binary to be copied.)

In any case, I would hope that such operations are linear to the size of the sequence. On the other hand, I think the costs of the operations on each is the salient point, no? In this case binaries appear to be more expensive both to iterate over (through destructuring) and accumulate.

I didn't necessarily think that these types of operations on binaries were faster, but there has been some fuss over binaries in the text processing side of things -- web servers, json parsers, etc. The reason I ran this little test was that while creating a json parser I wanted to see for myself.

I suppose a lot of the binary excitement has to do with just "utf8 in, utf8 out" efficiency. 

Still, I think it would be surprising to some to find binaries slower for normal iterative processing. Culturally, Erlang list strings have a bad rep. Binaries to the rescue!

I wasn't able to find documentation comparing binary string to list string operations. However, there is advice as to the most efficient way to iterate over and accumulate binary strings.

From "Programming Efficiently with Binaries and Bit Strings":
- When you are building new bit strings make sure you append new data to the end of an old bit string
- When you iterate over a bit string use a direct style matching similar to what you would do for lists

And from the section on optimizations:

The basis of this optimization is that it the emulator can create bit strings with extra uninitialized space, so if a bit string is built by continuously appending to a binary the data does not need to be copied if there is enough uninitialized data at the end of the bit string.

One point I don't quite understand here:


is is this function:

my_list_to_binary(List) ->
    my_list_to_binary(List, <<>>).
my_list_to_binary([H|T], Acc) ->
    my_list_to_binary(T, <<Acc/binary,H>>);
my_list_to_binary([], Acc) ->
    Acc.
the first iteration in my_list_binary/2 that Acc will be copied, but it will not be copied after this. I don't know why any copying occurs at all, since it is evident that in this case the initial binary <<>> is used solely for the append operation in the first clause. From what I understand, it is multiple references that cause copying to occur upon append, since the append operation may result the movement of the binary in memory, and the other references would otherwise become invalid. I only see one reference threading through this function.

I would think it would be allocated in my_list_to_binary/1, passed by reference, have an initial size of 256, be appended to without any copying, and only maybe be copied if the accumulation exceeded 256 bytes (the binary may be moved when it is extended.) Is the implication is that the initial blank binary <<>> is a different type of binary, say fixed-length or static, and that the first append operation triggers the copy of that binary into a newly allocated extendable binary (with initial size twice the original or 256 whichever is bigger)?

I really have to stop obsessing over this and move on, its a wonder where the mind goes when one has a cold...

Erik.

Björn Gustavsson

unread,
Oct 25, 2012, 4:14:52 AM10/25/12
to Erik Pearson, erlang-q...@erlang.org
On Wed, Oct 24, 2012 at 5:58 PM, Erik Pearson <er...@defunweb.com> wrote:
[...]

> And from the section on optimizations:
>
> The basis of this optimization is that it the emulator can create bit
> strings with extra uninitialized space, so if a bit string is built by
> continuously appending to a binary the data does not need to be copied if
> there is enough uninitialized data at the end of the bit string.
>
> One point I don't quite understand here:
>
> http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#match_context
>
> is is this function:
>
> my_list_to_binary(List) ->
> my_list_to_binary(List, <<>>).
>
> my_list_to_binary([H|T], Acc) ->
> my_list_to_binary(T, <<Acc/binary,H>>);
> my_list_to_binary([], Acc) ->
> Acc.
>
> the first iteration in my_list_binary/2 that Acc will be copied, but it will
> not be copied after this. I don't know why any copying occurs at all, since
> it is evident that in this case the initial binary <<>> is used solely for
> the append operation in the first clause. From what I understand, it is
> multiple references that cause copying to occur upon append, since the
> append operation may result the movement of the binary in memory, and the
> other references would otherwise become invalid. I only see one reference
> threading through this function.

Appending to a binary is optimized by the run-time system (which is
stated in the Efficiency Guide) with no help from the compiler. Therefore,
the run-time system has no way of knowing that the binary created in
my_list_to_binary/1 will be appended to, so it will *not* mark the empty
binary as appendable and reserve extra space in it for appending.

>
> I would think it would be allocated in my_list_to_binary/1, passed by
> reference, have an initial size of 256, be appended to without any copying,
> and only maybe be copied if the accumulation exceeded 256 bytes (the binary
> may be moved when it is extended.) Is the implication is that the initial
> blank binary <<>> is a different type of binary, say fixed-length or static,
> and that the first append operation triggers the copy of that binary into a
> newly allocated extendable binary (with initial size twice the original or
> 256 whichever is bigger)?

Yes, correct. The initial empty binary is a different type of binary. It is
a heap binary stored in the constant pool for the module, so the
"creation" of it is extremely cheap.

/Björn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB

Erik Pearson

unread,
Oct 25, 2012, 1:36:36 PM10/25/12
to erlang-questions Questions
Okay, this raises, for me at least, a couple of questions:

- The efficiency guide mentions two storage areas for binaries, process heap and an area outside of the process heap.
- small binaries (up to 64 bytes) are stored on the process heap, larger ones in the shared "binary" storage.

- there is no mention in the eff. guide of storage of binaries in the constant pool, but if they are stored there as well, then wouldn't there just be a pointer reference to it and nothing in the heap?

- Some operations, such as copying, might create a small binary on heap
- Appending will always create a large binary, since the minimum size for an appendable binary is 256.

- Are there runtime efficiency strategies for re-using large binaries, or are they always allocated when needed? Or is allocation fast enough to offset any possible advantage of pooling?

- And finally, the eff. guide suggests creation of a binary by copying bytes from one to another via append, yet it is also stated that sub-binaries and match contexts are cheaper than copying. Therefore I would expect that the most efficient method of creating a new binary would be to create a sub-binary by something like part, split, or pattern matching + destructuring, as Dimitry pointed out earlier in this thread (even though my tests didn't find it to be faster.)

It may be that real world testing under load would reveal different patterns -- e.g. as excessive memory usage strains the system.

Thanks,
Erik.
Reply all
Reply to author
Forward
0 new messages