Quantcast

NIF vs Erlang Binary

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

NIF vs Erlang Binary

Andy W. Song
Hi, 

I'm trying to write an Erlang WebSocket client and server against the latest spec. It requires message masking, which simply xor the whole message using a 4 bytes key.  Here is the reference. 

I did some unit test on my code and felt that it's slow (it can process about  24M byte/s) on a virtual machine. HiPE can double the performance but still not quite enough. So I wrote an NIF to handle this. The speed is about 10~15x faster. Not only that, I feel that the C code is easier to write. Maybe it's due to my inexperienced with Erlang yet.

Here is the Erlang and NIF code, FYI.

Thanks
Andy

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: NIF vs Erlang Binary

Jesper Louis Andersen-2
On Fri, Jul 22, 2011 at 11:45, Andy W. Song <[hidden email]> wrote:

> I did some unit test on my code and felt that it's slow (it can process
> about  24M byte/s) on a virtual machine. HiPE can double the performance but
> still not quite enough. So I wrote an NIF to handle this. The speed is about
> 10~15x faster. Not only that, I feel that the C code is easier to write.

Blindly unrolling the Key a bit gives a factor of 3 speedup:

mask(Key, Data, Accu) ->
    K = binary:copy(<<Key:32>>, 512 div 32),
    <<LongKey:512>> = K,
    mask(Key, LongKey, Data, Accu).

mask(Key, LongKey, Data,Accu) ->
    case Data of
        <<A:512, Rest/binary>> ->
            C = binary:encode_unsigned(A bxor LongKey),
            mask(Key, LongKey, Rest, <<Accu/binary, C/binary>>);
        <<A:32,Rest/binary>> ->
            C = binary:encode_unsigned(A bxor Key),
            mask(Key,LongKey,Rest,<<Accu/binary,C/binary>>);
        <<A:24>> ->
            <<B:24, _:8>> = binary:encode_unsigned(Key),
            C = binary:encode_unsigned(A bxor B),
            <<Accu/binary,C/binary>>;
        <<A:16>> ->
            <<B:16, _:16>> = binary:encode_unsigned(Key),
            C = binary:encode_unsigned(A bxor B),
            <<Accu/binary,C/binary>>;
        <<A:8>> ->
            <<B:8, _:24>> = binary:encode_unsigned(Key),
            C = binary:encode_unsigned(A bxor B),
            <<Accu/binary,C/binary>>;
        <<>> ->
            Accu
    end.

Why the call to binary:encode_unsigned? Lets alter that pattern:

    case Data of
        <<A:512, Rest/binary>> ->
            C = A bxor LongKey,
            mask(Key, LongKey, Rest, <<Accu/binary, C:512>>);

Now it is 5 times faster, same result. The NIF-advantage is now a
factor of 2-3. That is in the ballpark I would expect it to be. You
are doing many more reallocations with the above solution. Then the C
NIF version. What happens if we tune it some more? Lets do runs of
8192 bits at a time...

9 times faster compared to the original here! I expect our speed will
converge to that of C if we turn it up even more and get the amount of
allocation/realloc/concatenation down.

--
J.
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: NIF vs Erlang Binary

Mihai Balea

On Jul 22, 2011, at 7:03 AM, Jesper Louis Andersen wrote:

> On Fri, Jul 22, 2011 at 11:45, Andy W. Song <[hidden email]> wrote:
>
>> I did some unit test on my code and felt that it's slow (it can process
>> about  24M byte/s) on a virtual machine. HiPE can double the performance but
>> still not quite enough. So I wrote an NIF to handle this. The speed is about
>> 10~15x faster. Not only that, I feel that the C code is easier to write.
>
> Blindly unrolling the Key a bit gives a factor of 3 speedup:
> <snip>
> Now it is 5 times faster, same result. The NIF-advantage is now a
> factor of 2-3. That is in the ballpark I would expect it to be. You
> are doing many more reallocations with the above solution. Then the C
> NIF version. What happens if we tune it some more? Lets do runs of
> 8192 bits at a time...
>
> 9 times faster compared to the original here! I expect our speed will
> converge to that of C if we turn it up even more and get the amount of
> allocation/realloc/concatenation down.

Just for fun, I did a test with a version of mask based on binary comprehensions. Here's the code:

mask1(Key, Data) ->
    S = size(Data) div 4 * 4,
    <<D1:S/binary, T1/binary>> = Data,
    D2 = << <<(X bxor Key):32>> || <<X:32>> <= D1 >>,
    T2 = handle_tail(T1, <<Key:32>>),
    <<D2/binary, T2/binary>>.

handle_tail(<<A:24>>, <<K:24, _:8>>) -> <<(A bxor K):24>>;
handle_tail(<<A:16>>, <<K:16, _:16>>) -> <<(A bxor K):16>>;
handle_tail(<<A:8>>, <<K:8, _:24>>) -> <<(A bxor K):8>>;
handle_tail(<<>>, _) -> <<>>.

The speed gain is disappointingly small, it shaves roughly 40% of Andy's original times.
I suspect that's because the comprehension is just syntactic sugar on top of recursive loops.

Anyway, just thought I'd share the results :)

Mihai
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Loading...