> I did some unit test on my code and felt that it's slow (it can process
> about 24M byte/s) on a virtual machine. HiPE can double the performance but
> still not quite enough. So I wrote an NIF to handle this. The speed is about
> 10~15x faster. Not only that, I feel that the C code is easier to write.
Blindly unrolling the Key a bit gives a factor of 3 speedup:
mask(Key, Data, Accu) ->
K = binary:copy(<<Key:32>>, 512 div 32),
<<LongKey:512>> = K,
mask(Key, LongKey, Data, Accu).
mask(Key, LongKey, Data,Accu) ->
case Data of
<<A:512, Rest/binary>> ->
C = binary:encode_unsigned(A bxor LongKey),
mask(Key, LongKey, Rest, <<Accu/binary, C/binary>>);
<<A:32,Rest/binary>> ->
C = binary:encode_unsigned(A bxor Key),
mask(Key,LongKey,Rest,<<Accu/binary,C/binary>>);
<<A:24>> ->
<<B:24, _:8>> = binary:encode_unsigned(Key),
C = binary:encode_unsigned(A bxor B),
<<Accu/binary,C/binary>>;
<<A:16>> ->
<<B:16, _:16>> = binary:encode_unsigned(Key),
C = binary:encode_unsigned(A bxor B),
<<Accu/binary,C/binary>>;
<<A:8>> ->
<<B:8, _:24>> = binary:encode_unsigned(Key),
C = binary:encode_unsigned(A bxor B),
<<Accu/binary,C/binary>>;
<<>> ->
Accu
end.
Why the call to binary:encode_unsigned? Lets alter that pattern:
case Data of
<<A:512, Rest/binary>> ->
C = A bxor LongKey,
mask(Key, LongKey, Rest, <<Accu/binary, C:512>>);
Now it is 5 times faster, same result. The NIF-advantage is now a
factor of 2-3. That is in the ballpark I would expect it to be. You
are doing many more reallocations with the above solution. Then the C
NIF version. What happens if we tune it some more? Lets do runs of
8192 bits at a time...
9 times faster compared to the original here! I expect our speed will
converge to that of C if we turn it up even more and get the amount of
allocation/realloc/concatenation down.
--
J.
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
> On Fri, Jul 22, 2011 at 11:45, Andy W. Song <wso...@gmail.com> wrote:
>
>> I did some unit test on my code and felt that it's slow (it can process
>> about 24M byte/s) on a virtual machine. HiPE can double the performance but
>> still not quite enough. So I wrote an NIF to handle this. The speed is about
>> 10~15x faster. Not only that, I feel that the C code is easier to write.
>
> Blindly unrolling the Key a bit gives a factor of 3 speedup:
> <snip>
> Now it is 5 times faster, same result. The NIF-advantage is now a
> factor of 2-3. That is in the ballpark I would expect it to be. You
> are doing many more reallocations with the above solution. Then the C
> NIF version. What happens if we tune it some more? Lets do runs of
> 8192 bits at a time...
>
> 9 times faster compared to the original here! I expect our speed will
> converge to that of C if we turn it up even more and get the amount of
> allocation/realloc/concatenation down.
Just for fun, I did a test with a version of mask based on binary comprehensions. Here's the code:
mask1(Key, Data) ->
S = size(Data) div 4 * 4,
<<D1:S/binary, T1/binary>> = Data,
D2 = << <<(X bxor Key):32>> || <<X:32>> <= D1 >>,
T2 = handle_tail(T1, <<Key:32>>),
<<D2/binary, T2/binary>>.
handle_tail(<<A:24>>, <<K:24, _:8>>) -> <<(A bxor K):24>>;
handle_tail(<<A:16>>, <<K:16, _:16>>) -> <<(A bxor K):16>>;
handle_tail(<<A:8>>, <<K:8, _:24>>) -> <<(A bxor K):8>>;
handle_tail(<<>>, _) -> <<>>.
The speed gain is disappointingly small, it shaves roughly 40% of Andy's original times.
I suspect that's because the comprehension is just syntactic sugar on top of recursive loops.
Anyway, just thought I'd share the results :)
Mihai