[erlang-questions] native

49 views
Skip to first unread message

Tony Rogvall

unread,
Sep 9, 2012, 4:44:04 PM9/9/12
to Erlang Questions
Hi list.

I started playing with a simple implementation of MD5 (basically straight from the wiki page) 
When the implementation produce the correct output, I did some timing for fun and also
tried to native compile it. 
To my astonishment I discovered that the code was slower when native compiled.
I am using R15B01 on mac.  Can this be true? And in this case why? 

Note that I do not care to optimize the implementation it self! I am more interested why
the native compiler produce slower code on this example.

Thanks

Tony


%%%%%%%
-module(md5).

-compile(export_all).

k(I) when I >= 0, I < 64 -> %% may construct table!
    trunc(abs(math:sin(I+1)) * (1 bsl 32)).

r(I) when I >= 0, I < 64 ->
    element(I+1,
   {7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
    5,  9, 14, 20, 5,  9, 14, 20, 5,  9, 14, 20, 5,  9, 14, 20,
    4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
    6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21}).

hex(Bin) ->
    [ element(H+1,{$0,$1,$2,$3,$4,$5,$6,$7,$8,$9,$a,$b,$c,$d,$e,$f}) ||
<<H:4>> <= Bin ].

md5_test(Input) ->
    {hex(md5(Input)), hex(erlang:md5(Input))}.

md5(List) when is_list(List) ->
    md5(list_to_binary(List));
md5(Bits) when is_bitstring(Bits) ->
    Length = bit_size(Bits),
    Pad    = (512 - (Length+1+64) rem 512) rem 512,
    Bits1  = <<Bits/bitstring,1:1,0:Pad,Length:64/little>>,
    md5(Bits1,16#67452301,16#efcdab89,16#98badcfe,16#10325476).

md5(<<>>, H0,H1,H2,H3) ->
    <<H0:32/little,H1:32/little,H2:32/little,H3:32/little>>;
md5(Bits,H0,H1,H2,H3) ->
    <<Chunk:64/binary, Rest/binary>> = Bits,
    W = list_to_tuple([ X || <<X:32/little>> <= Chunk ]),
    {A,B,C,D} = md5_(0, W, H0, H1, H2, H3),
    md5(Rest, H0+A,H1+B,H2+C,H3+D).

md5_(I, W, A, B, C, D) when I < 16 ->
    F = (B band C) bor ((bnot B) band D),
    G = I,
    R = rotate(A+F+k(I)+element(G+1,W), r(I)),
    md5_(I+1,W,D,B+R,B,C);
md5_(I, W, A, B, C, D) when I < 32 ->
    F = (D band B) bor ((bnot D) band C),
    G = (5*I + 1) rem 16,
    R = rotate(A+F+k(I)+element(G+1,W), r(I)),
    md5_(I+1,W,D,B+R,B,C);
md5_(I, W, A, B, C, D) when I < 48 ->
    F = (B bxor C) bxor D,
    G = (3*I+5) rem 16,
    R = rotate(A+F+k(I)+element(G+1,W), r(I)),
    md5_(I+1,W,D,B+R,B,C);
md5_(I, W, A, B, C, D) when I < 64 ->
    F = C bxor (B bor (bnot D)),
    G = (7*I) rem 16,
    R = rotate(A+F+k(I)+element(G+1,W), r(I)),
    md5_(I+1,W,D,B+R,B,C);
md5_(64, _W, A, B, C, D) ->
    {A,B,C,D}.

rotate(X0, C) ->
    X = X0 band 16#ffffffff,
    ((X bsl C) bor (X bsr (32-C))) band 16#ffffffff.


%%%%%%%
"Installing applications can lead to corruption over time. Applications gradually write over each other's libraries, partial upgrades occur, user and system errors happen, and minute changes may be unnoticeable and difficult to fix"



Thomas Lindgren

unread,
Sep 10, 2012, 8:41:29 AM9/10/12
to Tony Rogvall, Erlang Questions

>I started playing with a simple implementation of MD5 (basically straight from the wiki page) 
>When the implementation produce the correct output, I did some timing for fun and also
>tried to native compile it. 
>To my astonishment I discovered that the code was slower when native compiled.
>I am using R15B01 on mac.  Can this be true? And in this case why? 
>
>Note that I do not care to optimize the implementation it self! I am more interested why
>the native compiler produce slower code on this example.


On my Mac, the best native result 7% is faster than the best beam result. But the timing varies by 20% from run to run, so I think the proper claim is they are about the same. (I also think Hipe ought to comfortably beat Beam on this sort of code.)

Unfortunately, 'pp_native' gave broken/confusing output so it's hard to trace what happens. The arithmetic looks like it's being inlined, which ought to provide a boost. There are calls to r/1 and rotate/2 in the inner loop (md5_/6) which possibly should have been inlined.

Best,
Thomas
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Yiannis Tsiouris

unread,
Sep 10, 2012, 9:13:16 AM9/10/12
to erlang-q...@erlang.org
On 09/10/2012 03:41 PM, Thomas Lindgren wrote:
>> I started playing with a simple implementation of MD5 (basically straight from the wiki page)
>> When the implementation produce the correct output, I did some timing for fun and also
>> tried to native compile it.
>> To my astonishment I discovered that the code was slower when native compiled.
>> I am using R15B01 on mac. Can this be true? And in this case why?
>>
>> Note that I do not care to optimize the implementation it self! I am more interested why
>> the native compiler produce slower code on this example.
>
>
> On my Mac, the best native result 7% is faster than the best beam result. But the timing varies by 20% from run to run, so I think the proper claim is they are about the same. (I also think Hipe ought to comfortably beat Beam on this sort of code.)
>
> Unfortunately, 'pp_native' gave broken/confusing output so it's hard to trace what happens. The arithmetic looks like it's being inlined, which ought to provide a boost. There are calls to r/1 and rotate/2 in the inner loop (md5_/6) which possibly should have been inlined.

Hi,

Could you post the testcase that you 're timing?


Best,
Yiannis

--
Yiannis Tsiouris
Ph.D. student,
Software Engineering Laboratory,
National Technical University of Athens
WWW: http://www.softlab.ntua.gr/~gtsiour

Kostis Sagonas

unread,
Sep 10, 2012, 10:06:47 AM9/10/12
to erlang-q...@erlang.org
On 09/10/2012 02:41 PM, Thomas Lindgren wrote:
>
>> I started playing with a simple implementation of MD5 (basically straight from the wiki page)
>> When the implementation produce the correct output, I did some timing for fun and also
>> tried to native compile it.
>> To my astonishment I discovered that the code was slower when native compiled.
>> I am using R15B01 on mac. Can this be true? And in this case why?
>>
>> Note that I do not care to optimize the implementation it self! I am more interested why
>> the native compiler produce slower code on this example.

Aside: I contacted Tony privately and he is creating a 1MB binary and
this is what I am using below:

Bin = << <<X>> || X <- lists:seq(1, 1024*1024) >>.


> On my Mac, the best native result 7% is faster than the best beam result. But the timing varies by 20% from run to run, so I think the proper claim is they are about the same. (I also think Hipe ought to comfortably beat Beam on this sort of code.)

I do not have a Mac, but on my i7-desktop, using vanilla R15B01
configured with --enable-native-libs (I do not think this matters here)
I get:

Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:0]
[hipe] [kernel-poll:false]

Eshell V5.9.1 (abort with ^G)
1> Bin = << <<X>> || X <- lists:seq(1, 1024*1024) >>.
<<1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,
22,23,24,25,26,27,28,29,...>>
2> f(T), c(md5), {T, _} = timer:tc(md5, md5, [Bin]), T.
2050951
3> f(T), c(md5), {T, _} = timer:tc(md5, md5, [Bin]), T.
2061353
4> f(T), c(md5), {T, _} = timer:tc(md5, md5, [Bin]), T.
2086817
5> f(T), c(md5, [native]), {T, _} = timer:tc(md5, md5, [Bin]), T.
1889809
6> f(T), c(md5, [native]), {T, _} = timer:tc(md5, md5, [Bin]), T.
1865581
7> f(T), c(md5, [native]), {T, _} = timer:tc(md5, md5, [Bin]), T.
1927650

So indeed there is some fluctuation in the results, but native code
execution does not appear to be slower in this test.

(Many other runs on the same platform basically confirm the results
above...)


I do not know why Thomas thinks that HiPE should be able to confortably
beat BEAM on this sort of code (care to elaborate?). From a brief
glance it seems to me that the code spends a lot of time in BIFs written
in C (most notably list_to_tuple/1 and element/2, but also trunc/1,
abs/1 and math:sin/1. All of these are outside the reach of the native
code compiler.


> Unfortunately, 'pp_native' gave broken/confusing output so it's hard to trace what happens.

Well, this happens because the HiPE compiler compiles in parallel by
default (!). If you want to see the native code in the non-scrambled
version use:

c(md5, [native, {hipe, [no_concurrent_comp, pp_native]}]).

Kostis

Jesper Louis Andersen

unread,
Sep 10, 2012, 10:35:35 AM9/10/12
to Kostis Sagonas, erlang-q...@erlang.org
On Sep 10, 2012, at 4:06 PM, Kostis Sagonas <kos...@cs.ntua.gr> wrote:

>
> I do not have a Mac, but on my i7-desktop, using vanilla R15B01 configured with --enable-native-libs (I do not think this matters here) I get:
>

Perhaps you are really not measuring the CPU at full speed because it only changed frequency when there is something to do. A guess is that the CPU speed should be locked at a certain rate for this test to be fair.

Or at least that is a shot in the dark, you may utilize.

Kostis Sagonas

unread,
Sep 10, 2012, 11:02:52 AM9/10/12
to erlang-q...@erlang.org
On 09/10/2012 04:35 PM, Jesper Louis Andersen wrote:
> On Sep 10, 2012, at 4:06 PM, Kostis Sagonas<kos...@cs.ntua.gr> wrote:
>
>>
>> I do not have a Mac, but on my i7-desktop, using vanilla R15B01 configured with --enable-native-libs (I do not think this matters here) I get:
>>
>
> Perhaps you are really not measuring the CPU at full speed because it only changed frequency when there is something to do. A guess is that the CPU speed should be locked at a certain rate for this test to be fair.
>
> Or at least that is a shot in the dark, you may utilize.

Does not appear to be the case:

Eshell V5.9.1 (abort with ^G)
1> Bin = << <<X>> || X <- lists:seq(1, 1024*1024) >>.
<<1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,
22,23,24,25,26,27,28,29,...>>
2> c(md5).
{ok,md5}
3> [begin {T, _} = timer:tc(md5, md5, [Bin]), T end || _ <-
lists:seq(1,10)].
[2069001,2036776,2043727,2042717,2041177,2053276,2037912,
2050634,2045248,2044197]
4> c(md5, [native]).
{ok,md5}
5> [begin {T, _} = timer:tc(md5, md5, [Bin]), T end || _ <-
lists:seq(1,10)].
[1920904,1884088,1883944,1888162,1886673,1889737,1878162,
1883588,1892836,1879735]

Not much variance in the results, but clearly in at least the last three
runs of the bench the CPU should be at a higher frequency...

Kostis

Rich Neswold

unread,
Sep 10, 2012, 10:53:43 AM9/10/12
to Kostis Sagonas, erlang-q...@erlang.org
On Mon, Sep 10, 2012 at 10:02 AM, Kostis Sagonas <kos...@cs.ntua.gr> wrote:
> Does not appear to be the case:
>
> Eshell V5.9.1 (abort with ^G)
> 1> Bin = << <<X>> || X <- lists:seq(1, 1024*1024) >>.
> <<1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,
> 22,23,24,25,26,27,28,29,...>>
> 2> c(md5).
> {ok,md5}
> 3> [begin {T, _} = timer:tc(md5, md5, [Bin]), T end || _ <-
> lists:seq(1,10)].
> [2069001,2036776,2043727,2042717,2041177,2053276,2037912,
> 2050634,2045248,2044197]
> 4> c(md5, [native]).
> {ok,md5}
> 5> [begin {T, _} = timer:tc(md5, md5, [Bin]), T end || _ <-
> lists:seq(1,10)].
> [1920904,1884088,1883944,1888162,1886673,1889737,1878162,
> 1883588,1892836,1879735]
>
> Not much variance in the results, but clearly in at least the last three
> runs of the bench the CPU should be at a higher frequency...

What about:

c(md5, [native, inline]).

--
Rich

Thomas Lindgren

unread,
Sep 10, 2012, 1:03:25 PM9/10/12
to Kostis Sagonas, erlang-q...@erlang.org





>________________________________
> From: Kostis Sagonas <kos...@cs.ntua.gr>
>
>
>I do not know why Thomas thinks that HiPE should be able to confortably beat BEAM on this sort of code (care to elaborate?).  From a brief glance it seems to me that the code spends a lot of time in BIFs written in C (most notably list_to_tuple/1 and element/2, but also trunc/1, abs/1 and math:sin/1. All of these are outside the reach of the native code compiler.


The regular MD5 algorithm is basically a loop doing lots of integer arithmetic and bit operations as well as accessing a few arrays. This ought on the face of it to be quite amenable to native code compilation. I'm amazed that the implementation Tony used managed to get trunc/abs/sin into the inner loop, but that might as you say well explain the problem.

>
>> Unfortunately, 'pp_native' gave broken/confusing output so it's hard to trace what happens.
>
>Well, this happens because the HiPE compiler compiles in parallel by default (!).  If you want to see the native code in the non-scrambled version use:
>
>    c(md5, [native, {hipe, [no_concurrent_comp, pp_native]}]).


Well well, you learn something new every day.

Best,
Thomas

Thomas Lindgren

unread,
Sep 10, 2012, 4:44:23 PM9/10/12
to Thomas Lindgren, Kostis Sagonas, erlang-q...@erlang.org




----- Original Message -----
> From: Thomas Lindgren <thomasl...@yahoo.com>
>
>> From: Kostis Sagonas <kos...@cs.ntua.gr>
>>
>>
>> I do not know why Thomas thinks that HiPE should be able to confortably beat
> BEAM on this sort of code (care to elaborate?).  From a brief glance it seems to
> me that the code spends a lot of time in BIFs written in C (most notably
> list_to_tuple/1 and element/2, but also trunc/1, abs/1 and math:sin/1. All of
> these are outside the reach of the native code compiler.
>
>
> The regular MD5 algorithm is basically a loop doing lots of integer arithmetic
> and bit operations as well as accessing a few arrays. This ought on the face of
> it to be quite amenable to native code compilation. I'm amazed that the
> implementation Tony used managed to get trunc/abs/sin into the inner loop, but
> that might as you say well explain the problem.


After some thinking, it might also be interesting to see the effect of more BIFs getting optimized in Hipe. The result in this case would of course still be unlikely to be competitive with an efficient MD5, but doing so might gain some performance. So, list_to_tuple/1 could expand to an RTL loop, trunc/1 and abs/1 could be expanded to simple code (esp. with a float argument), etc. (Element/2 already gets optimizer attention but it seems not to help here.) Give the compiler more to work with.

Tony Rogvall

unread,
Sep 11, 2012, 2:57:20 AM9/11/12
to Thomas Lindgren, Erlang Questions
On 10 sep 2012, at 14:41, Thomas Lindgren <thomasl...@yahoo.com> wrote:


I started playing with a simple implementation of MD5 (basically straight from the wiki page) 
When the implementation produce the correct output, I did some timing for fun and also
tried to native compile it. 
To my astonishment I discovered that the code was slower when native compiled.
I am using R15B01 on mac.  Can this be true? And in this case why? 

Note that I do not care to optimize the implementation it self! I am more interested why
the native compiler produce slower code on this example.


On my Mac, the best native result 7% is faster than the best beam result. But the timing varies by 20% from run to run, so I think the proper claim is they are about the same. (I also think Hipe ought to comfortably beat Beam on this sort of code.)

Good to know. On my computer all runs I have tried so far has been slower in native compile.
I could not believe my eyes.
So, everything is good then ? ;-)

/Tony

Unfortunately, 'pp_native' gave broken/confusing output so it's hard to trace what happens. The arithmetic looks like it's being inlined, which ought to provide a boost. There are calls to r/1 and rotate/2 in the inner loop (md5_/6) which possibly should have been inlined.

Best,
Thomas

Tony Rogvall

unread,
Sep 11, 2012, 3:06:19 AM9/11/12
to Thomas Lindgren, erlang-q...@erlang.org
On 10 sep 2012, at 19:03, Thomas Lindgren <thomasl...@yahoo.com> wrote:






________________________________
From: Kostis Sagonas <kos...@cs.ntua.gr>


I do not know why Thomas thinks that HiPE should be able to confortably beat BEAM on this sort of code (care to elaborate?).  From a brief glance it seems to me that the code spends a lot of time in BIFs written in C (most notably list_to_tuple/1 and element/2, but also trunc/1, abs/1 and math:sin/1. All of these are outside the reach of the native code compiler.


The regular MD5 algorithm is basically a loop doing lots of integer arithmetic and bit operations as well as accessing a few arrays. This ought on the face of it to be quite amenable to native code compilation. I'm amazed that the implementation Tony used managed to get trunc/abs/sin into the inner loop, but that might as you say well explain the problem.

My goal with this implementation was not speed. I really wanted something simple to look at. erlang:md5 already exists, and
there is no point competing with that. I just compiled it native for fun. So do not be amazed! :-)
But as shown by kostis and others it looks like it is my computer that behaves badly, I will have a look.

Thanks

/Tony





Unfortunately, 'pp_native' gave broken/confusing output so it's hard to trace what happens.

Well, this happens because the HiPE compiler compiles in parallel by default (!).  If you want to see the native code in the non-scrambled version use:

    c(md5, [native, {hipe, [no_concurrent_comp, pp_native]}]).


Well well, you learn something new every day.

Best,
Thomas

Thomas Lindgren

unread,
Sep 11, 2012, 4:29:34 AM9/11/12
to erlang-q...@erlang.org


>________________________________
> From: Tony Rogvall <to...@rogvall.se>
>
>My goal with this implementation was not speed. I really wanted something simple to look at. erlang:md5 already exists, and
>there is no point competing with that. I just compiled it native for fun. So do not be amazed! :-)
>But as shown by kostis and others it looks like it is my computer that behaves badly, I will have a look.


Yeah, so perhaps the right word is not "amazed" but "fooled" :-) Best of luck with your experiments.

Kostis Sagonas

unread,
Sep 11, 2012, 10:22:11 AM9/11/12
to Tony Rogvall, erlang-q...@erlang.org
On 09/11/2012 09:06 AM, Tony Rogvall wrote:
>
> On 10 sep 2012, at 19:03, Thomas Lindgren <thomasl...@yahoo.com
> <mailto:thomasl...@yahoo.com>> wrote:
>
>>> ________________________________
>>> From: Kostis Sagonas <kos...@cs.ntua.gr <mailto:kos...@cs.ntua.gr>>
>>>
>>>
>>> I do not know why Thomas thinks that HiPE should be able to
>>> confortably beat BEAM on this sort of code (care to elaborate?). From
>>> a brief glance it seems to me that the code spends a lot of time in
>>> BIFs written in C (most notably list_to_tuple/1 and element/2, but
>>> also trunc/1, abs/1 and math:sin/1. All of these are outside the
>>> reach of the native code compiler.
>>
>>
>> The regular MD5 algorithm is basically a loop doing lots of integer
>> arithmetic and bit operations as well as accessing a few arrays. This
>> ought on the face of it to be quite amenable to native code
>> compilation. I'm amazed that the implementation Tony used managed to
>> get trunc/abs/sin into the inner loop, but that might as you say well
>> explain the problem.
>
> My goal with this implementation was not speed. I really wanted
> something simple to look at. erlang:md5 already exists, and
> there is no point competing with that. I just compiled it native for
> fun. So do not be amazed! :-)
> But as shown by kostis and others it looks like it is my computer that
> behaves badly, I will have a look.

I investigated this further. Can it be that your computer runs in
32-bit mode? If so, I can explain what is happening.

Kostis

Tony Rogvall

unread,
Sep 11, 2012, 10:19:47 AM9/11/12
to Kostis Sagonas, erlang-q...@erlang.org
Yes.

/Tony

Kostis
Reply all
Reply to author
Forward
0 new messages