[erlang-questions] NIF vs Erlang Binary

Andy W. Song

unread,

Jul 22, 2011, 5:45:52 AM7/22/11

to erlang-questions

Hi,

I'm trying to write an Erlang WebSocket client and server against the latest spec. It requires message masking, which simply xor the whole message using a 4 bytes key. Here is the reference.

I did some unit test on my code and felt that it's slow (it can process about 24M byte/s) on a virtual machine. HiPE can double the performance but still not quite enough. So I wrote an NIF to handle this. The speed is about 10~15x faster. Not only that, I feel that the C code is easier to write. Maybe it's due to my inexperienced with Erlang yet.

Here is the Erlang and NIF code, FYI.

Thanks

Andy

Jesper Louis Andersen

unread,

Jul 22, 2011, 7:03:57 AM7/22/11

to Andy W. Song, erlang-questions

On Fri, Jul 22, 2011 at 11:45, Andy W. Song <wso...@gmail.com> wrote:

> I did some unit test on my code and felt that it's slow (it can process
> about 24M byte/s) on a virtual machine. HiPE can double the performance but
> still not quite enough. So I wrote an NIF to handle this. The speed is about
> 10~15x faster. Not only that, I feel that the C code is easier to write.

Blindly unrolling the Key a bit gives a factor of 3 speedup:

mask(Key, Data, Accu) ->
K = binary:copy(<<Key:32>>, 512 div 32),
<<LongKey:512>> = K,
mask(Key, LongKey, Data, Accu).

mask(Key, LongKey, Data,Accu) ->
case Data of
<<A:512, Rest/binary>> ->
C = binary:encode_unsigned(A bxor LongKey),
mask(Key, LongKey, Rest, <<Accu/binary, C/binary>>);
<<A:32,Rest/binary>> ->
C = binary:encode_unsigned(A bxor Key),
mask(Key,LongKey,Rest,<<Accu/binary,C/binary>>);
<<A:24>> ->
<<B:24, _:8>> = binary:encode_unsigned(Key),
C = binary:encode_unsigned(A bxor B),
<<Accu/binary,C/binary>>;
<<A:16>> ->
<<B:16, _:16>> = binary:encode_unsigned(Key),
C = binary:encode_unsigned(A bxor B),
<<Accu/binary,C/binary>>;
<<A:8>> ->
<<B:8, _:24>> = binary:encode_unsigned(Key),
C = binary:encode_unsigned(A bxor B),
<<Accu/binary,C/binary>>;
<<>> ->
Accu
end.

Why the call to binary:encode_unsigned? Lets alter that pattern:

case Data of
<<A:512, Rest/binary>> ->
C = A bxor LongKey,
mask(Key, LongKey, Rest, <<Accu/binary, C:512>>);

Now it is 5 times faster, same result. The NIF-advantage is now a
factor of 2-3. That is in the ballpark I would expect it to be. You
are doing many more reallocations with the above solution. Then the C
NIF version. What happens if we tune it some more? Lets do runs of
8192 bits at a time...

9 times faster compared to the original here! I expect our speed will
converge to that of C if we turn it up even more and get the amount of
allocation/realloc/concatenation down.

--
J.
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Mihai Balea

unread,

Jul 22, 2011, 9:58:30 AM7/22/11

to Jesper Louis Andersen, Andy W. Song, erlang-questions

On Jul 22, 2011, at 7:03 AM, Jesper Louis Andersen wrote:

> On Fri, Jul 22, 2011 at 11:45, Andy W. Song <wso...@gmail.com> wrote:
>
>> I did some unit test on my code and felt that it's slow (it can process
>> about 24M byte/s) on a virtual machine. HiPE can double the performance but
>> still not quite enough. So I wrote an NIF to handle this. The speed is about
>> 10~15x faster. Not only that, I feel that the C code is easier to write.
>
> Blindly unrolling the Key a bit gives a factor of 3 speedup:

> <snip>

> Now it is 5 times faster, same result. The NIF-advantage is now a
> factor of 2-3. That is in the ballpark I would expect it to be. You
> are doing many more reallocations with the above solution. Then the C
> NIF version. What happens if we tune it some more? Lets do runs of
> 8192 bits at a time...
>
> 9 times faster compared to the original here! I expect our speed will
> converge to that of C if we turn it up even more and get the amount of
> allocation/realloc/concatenation down.

Just for fun, I did a test with a version of mask based on binary comprehensions. Here's the code:

mask1(Key, Data) ->
S = size(Data) div 4 * 4,
<<D1:S/binary, T1/binary>> = Data,
D2 = << <<(X bxor Key):32>> || <<X:32>> <= D1 >>,
T2 = handle_tail(T1, <<Key:32>>),
<<D2/binary, T2/binary>>.

handle_tail(<<A:24>>, <<K:24, _:8>>) -> <<(A bxor K):24>>;
handle_tail(<<A:16>>, <<K:16, _:16>>) -> <<(A bxor K):16>>;
handle_tail(<<A:8>>, <<K:8, _:24>>) -> <<(A bxor K):8>>;
handle_tail(<<>>, _) -> <<>>.

The speed gain is disappointingly small, it shaves roughly 40% of Andy's original times.
I suspect that's because the comprehension is just syntactic sugar on top of recursive loops.