Hipe and Binary - Bitstring

Hipe and Binary - Bitstring

obi458

I work with binaries like <<1:1000000>> and use the following functions:

%% Return [F(N)] for every set bit N in a length-prefixed bitmap,
%% with bit positions counted from the most significant bit.
one(F, <<Size:64/unsigned, Bitmap:Size/bitstring, _/bitstring>>) ->
  one(F, Bitmap, 0, []).

one(F, <<0:1, R/bitstring>>, N, Acc) ->
  one(F, R, N + 1, Acc);
one(F, <<1:1, R/bitstring>>, N, Acc) ->
  one(F, R, N + 1, [F(N) | Acc]);
one(_, <<>>, _, Acc) -> Acc.

%% Bitwise OR of two length-prefixed bitmaps of the same size.
union(<<Size:64/unsigned, L:Size/unsigned, P/bitstring>>,
      <<Size:64/unsigned, R:Size/unsigned, _/bitstring>>) ->
  <<Size:64/unsigned, (L bor R):Size/unsigned, P/bitstring>>.
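
A minimal sketch of how these functions can be called (assuming they live in a module named bitmap; the module name is only for illustration):

  Size = 1000000,
  A = <<Size:64/unsigned, 1:Size/unsigned>>,   %% only the lowest bit set
  B = <<Size:64/unsigned, 2:Size/unsigned>>,   %% only the second-lowest bit set
  U = bitmap:union(A, B),                      %% <<Size:64, 3:Size>>
  bitmap:one(fun(N) -> N end, U).              %% -> [999999,999998], positions counted from the MSB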

I call these functions 1,000,000 times in total; without native compilation, 1,000 calls take about 20 minutes.

If I compile with -compile([native, {hipe, o2}]), the same 1,000 calls take about 3 seconds, so it is about 400x faster!!

OS: OSX

What is the secret?

-- 
Regards
Oliver Bollmann


Re: Hipe and Binary - Bitstring

John Högberg-2
Hi Oliver,

I've tried to reproduce this discrepancy on my machine, but I can only see a modest difference on the latest OTP 21 (the results are in microseconds):

Erlang/OTP 21 [erts-10.3.1] [source] [64-bit] [smp:24:24] [ds:24:24:10] [async-threads:1] [hipe]

Eshell V10.3.1  (abort with ^G)
1> c(t, []).           
{ok,t}
2> t:bench(one).       
15957304
3> t:bench(union).
559470
4> c(t, [native]).     
{ok,t}
5> t:bench(one).  
3611371
6> t:bench(union).
501871

I've attached the source code I used for this test; am I missing something?

Regards,
John Högberg


Re: Hipe and Binary - Bitstring

Kostis Sagonas-2
On 3/27/19 1:04 PM, John Högberg wrote:

> Hi Oliver,
>
> I've tried to reproduce this discrepancy on my machine, but I can only
> see a modest difference on the latest OTP 21.
> I've attached the source code I used for this test, am I missing something?


We are obviously missing the benchmark program that Oliver used to get
his numbers.  But the "400x faster" figure cannot possibly be right.

Personally, I cannot see how one could turn function union/2, which is a
one-liner with two bitstring matches and one construction, into a
benchmark.  So, it's not surprising that the performance difference
there is very small.

On the other hand, I would not call the performance difference between
BEAM and HiPE that you observed "modest".  Four times faster execution
is IMO something that deserves a better adjective.

Kostis



Re: Hipe and Binary - Bitstring

obi458

Hi Kostis,

Attached is the missing module!

The fact is: the factor is stable, 3 seconds with native versus 20 minutes without,
and the result of the function is the same!

Between the 1,000 steps I do not get any trace messages with native!
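
For reference, a sketch of one way such GC events can be collected (Pid stands for the process under test):

  erlang:trace(Pid, true, [garbage_collection, timestamp]),
  receive
      {trace_ts, Pid, gc_minor_start, Info, Timestamp} ->
          {Timestamp, Info}
  end.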

If I trace the GC, the last lines at the end with native are:

2019-03-28 00:08:47.834 gc_minor_start [{wordsize,0},{old_heap_block_size,0},{heap_block_size,410034027},{mbuf_size,105},{recent_size,285921217},{stack_size,30},{old_heap_size,0},{heap_size,410033946},{bin_vheap_size,4539999},{bin_vheap_block_size,7427891},{bin_old_vheap_size,0},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:08:51.778 gc_minor_end [{wordsize,124112863},{old_heap_block_size,590448998},{heap_block_size,79467343},{mbuf_size,0},{recent_size,120},{stack_size,30},{old_heap_size,285921068},{heap_size,120},{bin_vheap_size,2},{bin_vheap_block_size,3713945},{bin_old_vheap_size,3736072},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:08:53.716 gc_minor_start [{wordsize,0},{old_heap_block_size,590448998},{heap_block_size,79467343},{mbuf_size,0},{recent_size,120},{stack_size,60},{old_heap_size,285921068},{heap_size,79467276},{bin_vheap_size,651374},{bin_vheap_block_size,3713945},{bin_old_vheap_size,3736072},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:08:53.716 gc_minor_end [{wordsize,0},{old_heap_block_size,590448998},{heap_block_size,79467343},{mbuf_size,0},{recent_size,120},{stack_size,60},{old_heap_size,285921068},{heap_size,79467276},{bin_vheap_size,651374},{bin_vheap_block_size,3713945},{bin_old_vheap_size,3736072},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:08:53.716 gc_major_start [{wordsize,0},{old_heap_block_size,590448998},{heap_block_size,79467343},{mbuf_size,0},{recent_size,120},{stack_size,60},{old_heap_size,285921068},{heap_size,79467276},{bin_vheap_size,651374},{bin_vheap_block_size,3713945},{bin_old_vheap_size,3736072},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:08:57.587 gc_major_end [{wordsize,79467130},{old_heap_block_size,0},{heap_block_size,410034027},{mbuf_size,0},{recent_size,285921226},{stack_size,60},{old_heap_size,0},{heap_size,285921226},{bin_vheap_size,3736074},{bin_vheap_block_size,6009163},{bin_old_vheap_size,0},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:09:03.228 gc_minor_start [{wordsize,0},{old_heap_block_size,0},{heap_block_size,410034027},{mbuf_size,15972166},{recent_size,285921226},{stack_size,19},{old_heap_size,0},{heap_size,402266882},{bin_vheap_size,4689732},{bin_vheap_block_size,6009163},{bin_old_vheap_size,0},{bin_old_vheap_block_size,1090210}]
2019-03-28 00:09:07.676 gc_minor_end [{wordsize,116345800},{old_heap_block_size,590448998},{heap_block_size,79467343},{mbuf_size,0},{recent_size,15972177},{stack_size,19},{old_heap_size,285921071},{heap_size,15972177},{bin_vheap_size,0},{bin_vheap_block_size,3004581},{bin_old_vheap_size,3736072},{bin_old_vheap_block_size,1090210}]


Without native I get, between the 1,000 steps:

...
2019-03-28 00:54:45.408 gc_minor_start [{wordsize,10970},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516968},{heap_size,11001},{bin_vheap_size,31252},{bin_vheap_block_size,46422},{bin_old_vheap_size,5156641},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:46.123 gc_minor_end [{wordsize,0},{old_heap_block_size,410034027},{heap_block_size,28690},{mbuf_size,0},{recent_size,10980},{stack_size,38},{old_heap_size,287516989},{heap_size,10980},{bin_vheap_size,0},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:46.135 gc_minor_start [{wordsize,15627},{old_heap_block_size,410034027},{heap_block_size,28690},{mbuf_size,0},{recent_size,10980},{stack_size,38},{old_heap_size,287516989},{heap_size,22186},{bin_vheap_size,62502},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:46.838 gc_minor_end [{wordsize,22165},{old_heap_block_size,410034027},{heap_block_size,28690},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516989},{heap_size,21},{bin_vheap_size,31252},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:46.838 gc_minor_start [{wordsize,0},{old_heap_block_size,410034027},{heap_block_size,28690},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516989},{heap_size,161},{bin_vheap_size,46878},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:47.549 gc_minor_end [{wordsize,137},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516992},{heap_size,21},{bin_vheap_size,15626},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:47.946 gc_minor_start [{wordsize,15627},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516992},{heap_size,135},{bin_vheap_size,62502},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:48.313 gc_minor_end [{wordsize,114},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516992},{heap_size,21},{bin_vheap_size,31252},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:48.313 gc_minor_start [{wordsize,0},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516992},{heap_size,155},{bin_vheap_size,46878},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:49.030 gc_minor_end [{wordsize,131},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516995},{heap_size,21},{bin_vheap_size,15626},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:49.788 gc_minor_start [{wordsize,15627},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516995},{heap_size,135},{bin_vheap_size,62502},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:49.789 gc_minor_end [{wordsize,114},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516995},{heap_size,21},{bin_vheap_size,31252},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:49.789 gc_minor_start [{wordsize,0},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516995},{heap_size,161},{bin_vheap_size,46878},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:50.508 gc_minor_end [{wordsize,137},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516998},{heap_size,21},{bin_vheap_size,15626},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:50.520 gc_minor_start [{wordsize,15627},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516998},{heap_size,135},{bin_vheap_size,62502},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:51.271 gc_minor_end [{wordsize,114},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516998},{heap_size,21},{bin_vheap_size,31252},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:51.271 gc_minor_start [{wordsize,0},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287516998},{heap_size,155},{bin_vheap_size,46878},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:51.981 gc_minor_end [{wordsize,131},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287517001},{heap_size,21},{bin_vheap_size,15626},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:51.993 gc_minor_start [{wordsize,15627},{old_heap_block_size,410034027},{heap_block_size,233},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287517001},{heap_size,135},{bin_vheap_size,62503},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:52.738 gc_minor_end [{wordsize,114},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287517001},{heap_size,21},{bin_vheap_size,31252},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
2019-03-28 00:54:52.739 gc_minor_start [{wordsize,0},{old_heap_block_size,410034027},{heap_block_size,17731},{mbuf_size,0},{recent_size,21},{stack_size,38},{old_heap_size,287517001},{heap_size,155},{bin_vheap_size,46878},{bin_vheap_block_size,46422},{bin_old_vheap_size,5187893},{bin_old_vheap_block_size,5708181}]
...

Oliver

Attachment: bitmap_hipe_test.erl (4K)

Re: Hipe and Binary - Bitstring

obi458
In reply to this post by John Högberg-2

Hi John,

GC tracer (native), 1,000,000 steps:

#{gc_major_end => 36,gc_major_start => 36,gc_max_heap_size => 0,gc_minor_end => 116,gc_minor_start => 116}

GC tracer (NOT native), 1,000 steps:
#{gc_major_end => 35,gc_major_start => 35,gc_max_heap_size => 0,gc_minor_end => 1202,gc_minor_start => 1202}

Oliver


On 27.03.19 17:22, John Högberg wrote:
Never mind, I'm blind; I just noticed "UseProf" now. I may need more coffee :o)

It's possible that the native code generates less garbage on the heap, causing fewer GCs, which will be a lot faster if your process has a lot of live data, since that data won't have to be copied over and over. Try comparing how many garbage collections the process has gone through with process_info(Pid, garbage_collection); maybe it will provide some clue.
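
A minimal sketch of that check (Pid is the process being measured):

  %% GcInfo also lists min_heap_size, min_bin_vheap_size and fullsweep_after.
  {garbage_collection, GcInfo} = erlang:process_info(Pid, garbage_collection),
  proplists:get_value(minor_gcs, GcInfo).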

/John

On Wed, 2019-03-27 at 16:31 +0100, John Högberg wrote:
Hi Oliver,

Have you tried comparing performance without eprof?

eprof uses tracing to figure out which functions take a long time to run, which adds considerable overhead to small functions that are repeated extremely often. HiPE doesn't support tracing at all, so that overhead simply disappears when the module is native-compiled.
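
A sketch of the two kinds of measurement being compared here (bitmap:run/0 stands in for the actual workload and is hypothetical):

  %% Plain timing, no tracing involved:
  {Micros, _} = timer:tc(fun bitmap:run/0),
  %% Profiled run; eprof traces every call, which adds the overhead described above:
  eprof:start(),
  eprof:profile(fun bitmap:run/0),
  eprof:analyze(),
  eprof:stop().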

Regards,
John Högberg

On Wed, 2019-03-27 at 16:18 +0100, Oliver Bollmann wrote:

Hi John,

Indeed, standalone the factor is only about 3.7 :-(

Attached is the module I used. The code is part of: https://gitlab.com/Project-FiFo/DalmatinerDB/bitmap

I wonder where the boost comes from?

Facts: OS: OSX 10.14.3 (64 GB RAM),
          Erlang 20.3.18,
          the "boost" module uses a lot of the process dictionary (about 10 GB, almost all of it binaries!)

Any hints?

Oliver


Re: Hipe and Binary - Bitstring

obi458

Hi John,

Problem solved!

The secret is:

  process_flag(min_heap_size, 1024*1024*10),
  process_flag(min_bin_vheap_size, 1024*1024*10*10),

With this I get, without native, for 1,000,000 steps:

#{gc_major_end => 8,gc_major_start => 8,gc_max_heap_size => 0,gc_minor_end => 85,gc_minor_start => 85}

Performance is 100 times faster; the missing factor of 4 comes from HiPE itself!
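
The same limits can also be set at spawn time instead of via process_flag/2, a sketch (worker/0 is a hypothetical entry point; both sizes are in words):

  spawn_opt(fun worker/0,
            [{min_heap_size, 10 * 1024 * 1024},
             {min_bin_vheap_size, 100 * 1024 * 1024}]).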

Very nice!

Oliver

Re: Hipe and Binary - Bitstring

Frank Muller
Can someone shed some light on the difference between min_heap_size & min_bin_vheap_size,
and on how to tweak them per process to tune the VM's performance?

Thanks


Re: Hipe and Binary - Bitstring

John Högberg-2
In reply to this post by Kostis Sagonas-2
On Wed, 2019-03-27 at 21:30 +0100, Kostis Sagonas wrote:
> On the other hand, I would not call the performance difference between
> BEAM and HiPE that you observed "modest".  Four times faster execution
> is IMO something that deserves a better adjective.
>
> Kostis

Yes, it's a very impressive improvement. "Modest" was in relation to
that 400x number and I should've been clearer about that; "reasonable
difference" would have been better wording.

On Thu, 2019-03-28 at 08:34 +0100, Oliver Bollmann wrote:

> Hi John,
> Problem solved!
> The secret is:
>   process_flag(min_heap_size, 1024*1024*10),
>   process_flag(min_bin_vheap_size, 1024*1024*10*10),
>
> Performance is 100 times faster; the missing factor of 4 comes from
> HiPE itself!
>
> Very nice!
>
> Oliver

I'm glad it worked out!

However, you're still going to copy those ~2GB of live data when a full
GC finally happens, and I think you should consider reducing that
figure. Do you really need all that data in one process?

On Thu, 2019-03-28 at 08:55 +0100, Frank Muller wrote:
> Can someone shed some light on the difference between min_heap_size
> & min_bin_vheap_size, and on how to tweak them per process to tune
> the VM's performance?
>
>
> Thanks

On the process heap, off-heap binaries are essentially just a small
chunk with a pointer and size, so if we decided to GC based on the
process heap alone we would keep an unreachable 1GB binary alive for
just as long as a 1KB one (all else equal), which is a bit suboptimal.

We therefore track the combined size of all the process's off-heap data
and GC when it exceeds the "virtual binary heap size," even if the
process heap is nowhere near full. This "virtual binary heap" grows and
shrinks much like the ordinary process heap, and the min_bin_vheap_size
option is analogous to min_heap_size.

In general you shouldn't need to play around with these settings, but
if you have a process that you know will grow really fast then there
may be something to gain by bumping its minimum heap size. I don't
recommend doing this without careful consideration though.

http://erlang.org/doc/efficiency_guide/processes.html#initial-heap-size
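
A small sketch for inspecting these values (self() can be replaced with any process of interest):

  %% VM-wide defaults (configurable with the +hms and +hmbs emulator flags):
  {min_heap_size, DefaultHeap} = erlang:system_info(min_heap_size),
  {min_bin_vheap_size, DefaultBinVHeap} = erlang:system_info(min_bin_vheap_size),
  %% The values currently in effect for one particular process:
  {garbage_collection, GcInfo} = erlang:process_info(self(), garbage_collection),
  proplists:get_value(min_bin_vheap_size, GcInfo).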

/John


Re: Hipe and Binary - Bitstring

obi458
Hi,

> I'm glad it worked out!

> However, you're still going to copy those ~2GB of live data when a full
> GC finally happens, and I think you should consider reducing that
> figure. Do you really need all that data in one process?


The problem I solved with this process is resolving nested groups, using digraph:in_neighbours.

I have 1M groups, each of which has at least 100,000 members. The nesting level is at least 100. Loops are allowed!

Question: which group has which members =/= group.

I started with ETS, but ETS copies the value on each access and I got a lot of memory peaks; not good.
I tried lists, maps and so on.
I ended up with the process dictionary: perfect, since when the process dies the memory is gone, and with binaries only there is no copy of the data on get.

Now I use a bitmap, a 1M x 1M grid in which each bit is a nested group, and use union/intersection to resolve the nested groups.
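
A sketch of what such an intersection can look like, following the pattern of union/2 earlier in the thread (not necessarily the exact implementation in the bitmap library):

  intersection(<<Size:64/unsigned, L:Size/unsigned, P/bitstring>>,
               <<Size:64/unsigned, R:Size/unsigned, _/bitstring>>) ->
      <<Size:64/unsigned, (L band R):Size/unsigned, P/bitstring>>.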

The process now runs for about 10 minutes, saves the result in Mnesia (about 5 GB) and dies.

BTW, persistent_term looks good for splitting the grid across more than one process, since the grid is a one-time grid, but for the next step I would need 10M terms with about 1 TB of binaries :-)
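
A minimal sketch of the persistent_term idea (the key shape {grid, RowId} is hypothetical):

  %% Store once; any process can read it afterwards, and persistent_term:get/1
  %% does not copy the term to the reading process's heap:
  persistent_term:put({grid, RowId}, RowBitmap),
  persistent_term:get({grid, RowId}).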


Oliver





