|
Hi,
I'm trying to write an Erlang WebSocket client and server against the latest spec. It requires message masking, which simply xor the whole message using a 4 bytes key. Here is the reference.
I did some unit test on my code and felt that it's slow (it can process about 24M byte/s) on a virtual machine. HiPE can double the performance but still not quite enough. So I wrote an NIF to handle this. The speed is about 10~15x faster. Not only that, I feel that the C code is easier to write. Maybe it's due to my inexperienced with Erlang yet.
Thanks Andy
_______________________________________________ erlang-questions mailing list [hidden email] http://erlang.org/mailman/listinfo/erlang-questions |
|
On Fri, Jul 22, 2011 at 11:45, Andy W. Song <[hidden email]> wrote:
> I did some unit test on my code and felt that it's slow (it can process > about 24M byte/s) on a virtual machine. HiPE can double the performance but > still not quite enough. So I wrote an NIF to handle this. The speed is about > 10~15x faster. Not only that, I feel that the C code is easier to write. Blindly unrolling the Key a bit gives a factor of 3 speedup: mask(Key, Data, Accu) -> K = binary:copy(<<Key:32>>, 512 div 32), <<LongKey:512>> = K, mask(Key, LongKey, Data, Accu). mask(Key, LongKey, Data,Accu) -> case Data of <<A:512, Rest/binary>> -> C = binary:encode_unsigned(A bxor LongKey), mask(Key, LongKey, Rest, <<Accu/binary, C/binary>>); <<A:32,Rest/binary>> -> C = binary:encode_unsigned(A bxor Key), mask(Key,LongKey,Rest,<<Accu/binary,C/binary>>); <<A:24>> -> <<B:24, _:8>> = binary:encode_unsigned(Key), C = binary:encode_unsigned(A bxor B), <<Accu/binary,C/binary>>; <<A:16>> -> <<B:16, _:16>> = binary:encode_unsigned(Key), C = binary:encode_unsigned(A bxor B), <<Accu/binary,C/binary>>; <<A:8>> -> <<B:8, _:24>> = binary:encode_unsigned(Key), C = binary:encode_unsigned(A bxor B), <<Accu/binary,C/binary>>; <<>> -> Accu end. Why the call to binary:encode_unsigned? Lets alter that pattern: case Data of <<A:512, Rest/binary>> -> C = A bxor LongKey, mask(Key, LongKey, Rest, <<Accu/binary, C:512>>); Now it is 5 times faster, same result. The NIF-advantage is now a factor of 2-3. That is in the ballpark I would expect it to be. You are doing many more reallocations with the above solution. Then the C NIF version. What happens if we tune it some more? Lets do runs of 8192 bits at a time... 9 times faster compared to the original here! I expect our speed will converge to that of C if we turn it up even more and get the amount of allocation/realloc/concatenation down. -- J. _______________________________________________ erlang-questions mailing list [hidden email] http://erlang.org/mailman/listinfo/erlang-questions |
|
On Jul 22, 2011, at 7:03 AM, Jesper Louis Andersen wrote: > On Fri, Jul 22, 2011 at 11:45, Andy W. Song <[hidden email]> wrote: > >> I did some unit test on my code and felt that it's slow (it can process >> about 24M byte/s) on a virtual machine. HiPE can double the performance but >> still not quite enough. So I wrote an NIF to handle this. The speed is about >> 10~15x faster. Not only that, I feel that the C code is easier to write. > > Blindly unrolling the Key a bit gives a factor of 3 speedup: > <snip> > Now it is 5 times faster, same result. The NIF-advantage is now a > factor of 2-3. That is in the ballpark I would expect it to be. You > are doing many more reallocations with the above solution. Then the C > NIF version. What happens if we tune it some more? Lets do runs of > 8192 bits at a time... > > 9 times faster compared to the original here! I expect our speed will > converge to that of C if we turn it up even more and get the amount of > allocation/realloc/concatenation down. Just for fun, I did a test with a version of mask based on binary comprehensions. Here's the code: mask1(Key, Data) -> S = size(Data) div 4 * 4, <<D1:S/binary, T1/binary>> = Data, D2 = << <<(X bxor Key):32>> || <<X:32>> <= D1 >>, T2 = handle_tail(T1, <<Key:32>>), <<D2/binary, T2/binary>>. handle_tail(<<A:24>>, <<K:24, _:8>>) -> <<(A bxor K):24>>; handle_tail(<<A:16>>, <<K:16, _:16>>) -> <<(A bxor K):16>>; handle_tail(<<A:8>>, <<K:8, _:24>>) -> <<(A bxor K):8>>; handle_tail(<<>>, _) -> <<>>. The speed gain is disappointingly small, it shaves roughly 40% of Andy's original times. I suspect that's because the comprehension is just syntactic sugar on top of recursive loops. Anyway, just thought I'd share the results :) Mihai _______________________________________________ erlang-questions mailing list [hidden email] http://erlang.org/mailman/listinfo/erlang-questions |
| Powered by Nabble | Edit this page |
